Cister predicts regulatory regions in DNA sequences by searching for clusters of cis-elements. You can just give her a DNA sequence, select which types of cis-elements you want to search for, and go!
These instructions are for using Cister on the Web. There is also a downloadable version that can be run on the command line; see the download page.
The results are displayed as a plot like this.
The colored lines indicate probabilities that regulatory factors bind to cis-elements at these positions. The black curve indicates the overall probability of being within a cluster of cis-elements bound by their factors. Each color corresponds to a different kind of binding site, as described in the key. Lines in the upper half of the plot indicate cis-elements on the direct strand, and lines in the lower half refer to the complementary strand. In this example, which is the output for the whole genome of the SV40 virus, Cister correctly identifies the promoter at the start of the genome, and makes no false positive predictions.
The program also produces a table of high-scoring cis-elements, like this:
Cister understands GenBank format, and will display annotated protein coding regions (CDS) in the output. Alternatively, FASTA format (with comment lines starting with ">") or raw sequence can be entered; digits, spaces and newlines are ignored. Maximum sequence length: 100 kb (download Cister if you want to analyze longer sequences).
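As a rough illustration of the FASTA and raw-sequence handling described above, here is a minimal Python sketch. The function name `clean_sequence` is hypothetical and this is not Cister's own code; it only mirrors the stated rules (comment lines starting with ">" are dropped; digits, spaces and newlines are ignored). GenBank parsing is not covered.

```python
def clean_sequence(text: str) -> str:
    """Return the bare nucleotide string from FASTA or raw input.

    Assumption: any line starting with '>' is a FASTA comment line.
    Digits and whitespace are discarded; letter case is left unchanged.
    """
    kept = []
    for line in text.splitlines():
        if line.startswith(">"):
            continue  # skip FASTA comment lines
        kept.append("".join(ch for ch in line
                            if not ch.isdigit() and not ch.isspace()))
    return "".join(kept)


if __name__ == "__main__":
    example = ">seq1\n1 acgt acgt\n9 ttga\n"
    print(clean_sequence(example))  # prints acgtacgtttga
```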
A sequence identifier may be entered: for example, a GenBank accession number (e.g. NC_001669), an 'accession.version' number (e.g. NC_001669.1), or a GI number (e.g. 9628421). Please note: you may want to check that your identifier refers to a promoter sequence. For example, GenBank accessions from Affymetrix chips may refer to mRNA sequences, which don't include the promoter region.
You may limit the search to a subsequence by entering its start and end coordinates. (The first nucleotide in the sequence has coordinate 1.) The default values of the From and To fields are the start and end of the sequence, respectively.
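The coordinate convention can be sketched in Python as follows. The helper `subsequence` is hypothetical, and it assumes that the end coordinate is inclusive, which the text above does not state explicitly.

```python
from typing import Optional


def subsequence(seq: str, start: int = 1, end: Optional[int] = None) -> str:
    """Extract seq[start..end] using 1-based coordinates.

    Assumption: `end` is inclusive. The defaults are the start and end
    of the whole sequence, matching the From and To fields.
    """
    if end is None:
        end = len(seq)
    if not 1 <= start <= end <= len(seq):
        raise ValueError("coordinates must satisfy 1 <= start <= end <= length")
    # Python slices are 0-based and end-exclusive, hence start - 1.
    return seq[start - 1:end]


if __name__ == "__main__":
    print(subsequence("ACGTACGT", 2, 4))  # prints CGT
```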
Cis-elements can be entered as TRANSFAC-style matrices, which look like this:
    NA  AML-1a
    XX
    DE  runt-factor AML-1
    XX
    BF  T02256; AML1a; Species: human, Homo sapiens.
    XX
    P0      A      C      G      T
    01      5      1      2     49      T
    02      2      2     52      1      G
    03      4     14      1     38      T
    04      0      0     57      0      G
    05      1      0     55      1      G
    06      1      4      0     52      T
You can cut-and-paste these directly from the TRANSFAC website. All lines except the name line (beginning with 'NA') and the position-specific nucleotide frequency lines (beginning with digits) are ignored. The name line is required, and should be above the base frequency lines.
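The parsing rule just described (keep the 'NA' line and the lines beginning with digits, ignore everything else) could look roughly like this in Python. The function `parse_transfac` is a hypothetical sketch, not Cister's parser, and it assumes the four counts after the position number are in A, C, G, T order as in the example matrix.

```python
def parse_transfac(text):
    """Return a list of (name, rows) pairs; each row is [A, C, G, T] counts.

    Only 'NA' lines and lines whose first field is all digits are used;
    every other line (XX, DE, BF, P0, ...) is ignored.
    """
    elements = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "NA":
            # Name line: start a new cis-element.
            elements.append((" ".join(fields[1:]), []))
        elif fields[0].isdigit():
            if not elements:
                raise ValueError("frequency line found before any NA line")
            # Fields 1-4 are the counts; a trailing consensus letter is dropped.
            elements[-1][1].append([float(x) for x in fields[1:5]])
    return elements


if __name__ == "__main__":
    sample = "NA AML-1a\nXX\nP0 A C G T\n01 5 1 2 49 T\n02 2 2 52 1 G\n"
    for name, rows in parse_transfac(sample):
        print(name, rows)
```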
Alternatively, each cis-element can have a title line starting with ">" and then the name of the element, followed by 4 numbers per line describing nucleotide frequencies at each position in the cis-element. For example:
    >element1
    14 4 2 0
    0 0 12 8
    8 8 1 3
    20 0 0 0
    3 3 13 1
    10 0 10 0
    3 3 6 8
    >element2
    13 1 1 5
    ...
These numbers might come from a multiple alignment of experimentally determined cis-elements. The first column indicates the number of adenines observed in each position, the second column the number of cytosines, the third column the number of guanines, and the fourth column the number of thymines. Gaps in the cis-element may be indicated by entering a single "b" on a line. Cister will use background nucleotide frequencies at these positions. This option allows users to specify multipartite cis-elements. In addition, if a transcription factor is known to occlude several bases adjacent to its sequence-specific binding site from binding other factors, this steric hindrance can be modelled by specifying a number of "b" (blank / background) positions above and below the sequence-specific portion of the cis-element definition.
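A minimal Python sketch of reading this simpler format, including "b" lines, is shown below. The name `parse_simple` and the choice to represent a background position as `None` are assumptions made for illustration; Cister's internal representation is not described here.

```python
def parse_simple(text):
    """Return a list of (name, rows) pairs from the '>' title-line format.

    Each row is [A, C, G, T] counts, or None for a "b" (background) position.
    """
    elements = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Title line: start a new cis-element.
            elements.append((line[1:].strip(), []))
        elif not elements:
            raise ValueError("data line found before any title line")
        elif line == "b":
            elements[-1][1].append(None)  # background position
        else:
            counts = [float(x) for x in line.split()]
            if len(counts) != 4:
                raise ValueError("expected 4 numbers per line: " + line)
            elements[-1][1].append(counts)
    return elements


if __name__ == "__main__":
    sample = ">element1\n14 4 2 0\nb\n0 0 12 8\n"
    print(parse_simple(sample))
```

In this sketch, a bipartite element is simply two runs of count rows separated by one or more `None` entries.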
The 2 formats can be mixed. Optionally, there may be an extra line following the name line (in either format) specifying weights for the cis-element on each strand (see the download page for more details).
Cister detects cis-element clusters by using a statistical model (a hidden Markov model) of what it expects these clusters to look like. Basically, the more closely this model matches real clusters, the better Cister will do. The parameters allow the user to vary some aspects of the model, and it is quite possible that different model parameters are suitable for different types of motif cluster.
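To give a feel for how a hidden Markov model can yield a per-position probability of being inside a cluster, here is a toy Python sketch using the forward-backward algorithm. It is not Cister's model: it has only two states (0 = background, 1 = cluster), and every emission, transition and start probability in it is a made-up value chosen for illustration. The published paper describes the real model.

```python
def posterior_cluster_probability(seq, emit, trans, start):
    """Return P(state = 1 | whole sequence) for each position of seq.

    emit[s][base], trans[s][t] and start[s] define a toy 2-state HMM
    (0 = background, 1 = cluster). All values are hypothetical.
    """
    n = len(seq)
    states = range(len(start))
    # Forward pass, rescaled at every step to avoid numerical underflow.
    fwd, scales = [], []
    for i, base in enumerate(seq):
        if i == 0:
            row = [start[s] * emit[s][base] for s in states]
        else:
            row = [emit[s][base] * sum(fwd[-1][r] * trans[r][s] for r in states)
                   for s in states]
        scale = sum(row)
        fwd.append([x / scale for x in row])
        scales.append(scale)
    # Backward pass, divided by the same scale factors.
    bwd = [[1.0 for _ in states] for _ in range(n)]
    for i in range(n - 2, -1, -1):
        nxt = seq[i + 1]
        bwd[i] = [sum(trans[s][t] * emit[t][nxt] * bwd[i + 1][t] for t in states)
                  / scales[i + 1] for s in states]
    # With this scaling, fwd * bwd is the posterior state probability.
    return [fwd[i][1] * bwd[i][1] for i in range(n)]


if __name__ == "__main__":
    emit = [
        {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},  # background
        {"A": 0.05, "C": 0.45, "G": 0.45, "T": 0.05},  # toy "cluster" state
    ]
    trans = [[0.9, 0.1], [0.1, 0.9]]
    start = [0.5, 0.5]
    for p in posterior_cluster_probability("ATATATATGCGCGCGC", emit, trans, start):
        print(round(p, 3))
```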
These 3 parameters should be chosen to resemble what you expect to find in a real functional cis-element cluster. Since the distributions are all geometric, the median is about 70% of the mean.
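The 70% figure follows from the fact that, for a geometric distribution with a large mean m, the median is close to m × ln 2 ≈ 0.693 m. The short Python check below assumes a geometric distribution on the values 1, 2, 3, ... (whether Cister counts from 0 or 1 is not stated here, and the difference is negligible for large means); `geometric_median` is a hypothetical helper.

```python
import math


def geometric_median(mean: float) -> int:
    """Median of a geometric distribution on 1, 2, 3, ... with the given mean.

    With success probability p = 1 / mean, the median is the smallest k
    such that (1 - p) ** k <= 0.5.
    """
    if mean <= 1:
        raise ValueError("mean must exceed 1")
    p = 1.0 / mean
    return math.ceil(math.log(0.5) / math.log(1.0 - p))


if __name__ == "__main__":
    for mean in (35, 100, 500):
        median = geometric_median(mean)
        print(mean, median, round(median / mean, 3))
```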
The background states are programmed to represent the local abundances of the 4 bases in the query sequence. Examining local abundances accounts for the biological reality of heterogeneous base composition, and prevents, for example, many spurious GC-rich motifs being detected in a part of the sequence that happens to be generally GC-rich.
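One simple way to estimate local base abundances is a sliding window around each position, sketched below in Python. The window size, the pseudocount and the function name `local_base_frequencies` are all assumptions for illustration; the method Cister actually uses to compute local abundances is not specified in this document.

```python
from collections import Counter


def local_base_frequencies(seq, center, half_window=50, pseudocount=1.0):
    """Estimate A/C/G/T frequencies in a window around 0-based index `center`.

    The window is clipped at the sequence ends. A pseudocount keeps every
    frequency above zero even when a base is absent from the window.
    """
    lo = max(0, center - half_window)
    hi = min(len(seq), center + half_window + 1)
    counts = Counter(seq[lo:hi].upper())
    total = sum(counts[b] for b in "ACGT") + 4 * pseudocount
    return {b: (counts[b] + pseudocount) / total for b in "ACGT"}


if __name__ == "__main__":
    seq = "GCGCGCGCGC" + "ATATATATAT"
    print(local_base_frequencies(seq, 2, half_window=2))   # GC-rich neighborhood
    print(local_base_frequencies(seq, 17, half_window=2))  # AT-rich neighborhood
```

With a background estimated this way, a GC-rich motif scores less impressively against a GC-rich neighborhood than it would against a uniform background, which is the effect described above.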
Cister uses the technique of posterior decoding, with this hidden Markov model:
Frith, M. C., Hansen, U. and Weng, Z.
Detection of cis-element clusters in higher eukaryotic DNA
Bioinformatics 17(10), 878-889 (2001).
If you use cis-element matrices from TRANSFAC, please cite:
Wingender, E., Chen, X., Hehl, R., Karas, H., Liebich, I., Matys, V., Meinhardt, T., Pruss, M., Reuter, I. and Schacherer, F.
TRANSFAC: an integrated system for gene expression regulation
Nucleic Acids Res. 28, 316-319 (2000).