We present four cis-element search programs (Clover, Rover, Motifish and Possum) for identifying functional sites in DNA sequences. Given a set of DNA sequences that share a common function, these programs compare them to a library of sequence motifs (e.g. transcription factor binding patterns) and identify which, if any, of the motifs are statistically overrepresented in the sequence set.
Sequences may be entered in FASTA, raw, or GenBank format. Any non-alphabetic characters in the sequence will be ignored, and any alphabetic characters except A, C, G and T (uppercase or lowercase) will be converted to 'n' and excluded from matching motifs. If GenBank format is used, your program of choice will read and display any 'CDS' (protein-coding region) annotations. Limits: at most 200 sequences, with a total length of up to 1000 kb.
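The character handling described above can be sketched in a few lines of Python. This is only an illustration, not the programs' actual code: the function name clean_sequence is hypothetical, and keeping the original case of A, C, G and T is an assumption.

```python
def clean_sequence(raw: str) -> str:
    """Normalize a raw DNA string using the rules described in the text."""
    cleaned = []
    for ch in raw:
        if not ch.isalpha():
            continue             # digits, spaces and punctuation are ignored
        if ch in "ACGTacgt":
            cleaned.append(ch)   # assumption: original case is kept
        else:
            cleaned.append("n")  # any other letter becomes 'n'
    return "".join(cleaned)


print(clean_sequence("1 acgtRY-NN ACGT"))
```

Positions that end up as 'n' would then be skipped when matching motifs.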
For example GenBank accession numbers (e.g. NC_001669), 'accession.version' numbers (e.g. NC_001669.1), or GI numbers (e.g. 9628421).
We provide the JASPAR collection of transcription factor binding site patterns. JASPAR is described in the following publication; please give suitable credit:
Another source of motifs is TRANSFAC: the commercial nature of this database prevents us from providing it directly.
User-specified cis-elements can be entered as TRANSFAC-style matrices, which look like this:
NA  AML-1a
XX
DE  runt-factor AML-1
XX
BF  T02256; AML1a; Species: human, Homo sapiens.
XX
P0      A      C      G      T
01      5      1      2     49      T
02      2      2     52      1      G
03      4     14      1     38      T
04      0      0     57      0      G
05      1      0     55      1      G
06      1      4      0     52      T

You can copy-and-paste these directly from the TRANSFAC website. All lines except the name line (beginning with 'NA') and the nucleotide frequency lines (beginning with digits) are ignored and not required. The name line is required, and should be above the base frequency lines. Alternatively, each cis-element can have a title line starting with ">" and then the name of the element, followed by 4 numbers per line describing nucleotide frequencies at each position in the cis-element. For example:
>element1
0 4 2 14
12 0 0 8
8 0 1 11
20 0 0 0
>element2
13 1 1 5
...

These numbers might come from a multiple alignment of experimentally determined cis-elements. The first column indicates the number of adenines observed in each position, the second column the number of cytosines, the third column the number of guanines, and the fourth column the number of thymines.
The two formats can be mixed.
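To make the two input formats concrete, here is a minimal Python sketch of a reader that accepts both in one input. It is not the parser used by the programs; the function name parse_matrices and the choice to silently skip unrecognized lines are assumptions made for illustration.

```python
def parse_matrices(text: str) -> dict:
    """Return {element name: list of [A, C, G, T] rows} from mixed-format input."""
    matrices = {}
    rows = None        # rows of the matrix currently being read
    transfac = False   # True while inside a matrix opened by an 'NA' line
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "NA" and len(fields) > 1:
            rows, transfac = [], True
            matrices[" ".join(fields[1:])] = rows
        elif fields[0].startswith(">"):
            rows, transfac = [], False
            matrices[" ".join(fields)[1:].strip()] = rows
        elif rows is not None:
            if transfac:
                if not fields[0].isdigit():
                    continue           # XX, DE, BF, P0 ... lines are ignored
                values = fields[1:5]   # skip the leading position number
            else:
                values = fields[:4]
            if len(values) != 4:
                continue
            try:
                rows.append([float(v) for v in values])
            except ValueError:
                continue               # not a frequency line
    return matrices


SAMPLE = """NA AML-1a
XX
P0 A C G T
01 5 1 2 49 T
02 2 2 52 1 G
>element1
0 4 2 14
12 0 0 8
"""

for name, matrix in parse_matrices(SAMPLE).items():
    print(name, len(matrix), "positions")
```

The sketch tracks which kind of title line opened the current matrix, because a TRANSFAC frequency line carries a leading position number while the ">" format does not.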
For Possum, gaps in the cis-element may be indicated by entering a single "b" or "n" on a line. Possum will use local nucleotide abundance at these positions. This option allows users to specify multipartite cis-elements.
Background sequences are required for Rover and Motifish, and recommended for Clover. Which background sets to use depends on which sequences you are studying: they should ideally come from the same taxonomic group as the target sequences, and have similar repetitive element and GC content. We like to cover our bases by using multiple background sets, e.g. for human target sequences, we might use a human chromosome, a set of human CpG islands, and a set of human gene upstream regions as backgrounds.
For Rover and Motifish, ideally every background sequence should be of the same length as each query sequence. Multiple background sets will be combined.
Clover requires background sequences to be much longer than query sequences (at least one sequence in each background set must be longer than the longest query sequence), and it processes each background set separately.
score = log[ prob(sequence|motif) / prob(sequence|random) ]
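As an illustration of the formula above, the following Python sketch scores one candidate site against a motif given as per-position probabilities. The function name, the use of the natural logarithm, and the uniform background in the example are assumptions; the source does not state the log base or how prob(sequence|random) is modeled.

```python
import math

BASES = "ACGT"


def log_likelihood_score(site, motif_probs, background):
    """Compute log[ prob(site|motif) / prob(site|random) ] for one site.

    motif_probs: one [A, C, G, T] probability row per motif position.
    background:  [A, C, G, T] probabilities for the 'random' model.
    All probabilities must be non-zero (e.g. after adding pseudocounts).
    """
    if len(site) != len(motif_probs):
        raise ValueError("site and motif must have the same length")
    score = 0.0
    for base, column in zip(site.upper(), motif_probs):
        i = BASES.index(base)
        score += math.log(column[i] / background[i])  # natural log (assumption)
    return score


motif = [[0.7, 0.1, 0.1, 0.1],   # position 1 prefers A
         [0.1, 0.1, 0.7, 0.1]]   # position 2 prefers G
uniform = [0.25, 0.25, 0.25, 0.25]
print(log_likelihood_score("AG", motif, uniform))  # positive: motif-like
print(log_likelihood_score("CC", motif, uniform))  # negative: background-like
```

A positive score means the site is better explained by the motif than by the random model.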
For Clover, this threshold does not affect the overall raw score or P-value calculation; it only restricts the sequence output, where only motifs with log-likelihood scores higher than this value are reported.
For Motifish, this threshold is used for the initial scanning of all input cis-elements in the background sequences. The final threshold is then determined for each cis-element such that 10% of background sequences contain at least one instance. This parameter will therefore indirectly affect the counts of motif-containing query sequences and the P-value.
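One way to picture the per-element threshold described for Motifish is the Python sketch below. It is only a guess at the idea, not Motifish's implementation: the function name final_threshold and the input (the best motif score found in each background sequence) are hypothetical.

```python
def final_threshold(best_scores, fraction=0.10):
    """Pick a cutoff so that about `fraction` of background sequences have a hit.

    best_scores: the highest motif score found in each background sequence
                 (hypothetical input; one value per sequence).
    """
    if not best_scores:
        raise ValueError("need at least one background sequence")
    ranked = sorted(best_scores, reverse=True)
    k = max(1, round(len(ranked) * fraction))  # number of sequences allowed a hit
    return ranked[k - 1]                       # score of the k-th best sequence


# Toy data: 20 background sequences with best scores 1.0 .. 20.0
scores = [float(i) for i in range(1, 21)]
cutoff = final_threshold(scores)
hits = sum(s >= cutoff for s in scores)
print(cutoff, hits / len(scores))
```

With tied scores the fraction of sequences at or above the cutoff can exceed 10%; how the real program resolves ties is not stated in the text.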
Motifish, Rover and Clover print details for statistically significant motifs (all
Rover and Motifish are contingency-table-based methods: they count either the occurrences of a cis-element above a certain motif score threshold or the number of sequences containing such hits. If the overall P-value calculated by comparing the counts in query sequences and background sequences is lower than this threshold, the cis-element is considered over- or under-represented in the query sequences.
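To show what a contingency-table comparison of this kind can look like, the Python sketch below applies a one-sided Fisher's exact test to a 2x2 table of sequences with and without hits. The choice of Fisher's exact test is an assumption made for illustration; the text does not say which test Rover or Motifish actually use, and the function name is hypothetical.

```python
from math import comb


def overrepresentation_p(query_hits, query_total, bg_hits, bg_total):
    """One-sided Fisher's exact test P-value for over-representation.

    Probability of seeing at least `query_hits` hit-containing sequences in
    the query set if hits were spread at random over query and background.
    """
    hits = query_hits + bg_hits
    total = query_total + bg_total
    denom = comb(total, query_total)
    tail = 0
    for k in range(query_hits, min(hits, query_total) + 1):
        # hypergeometric term: k hits in the query set, the rest in the background
        tail += comb(hits, k) * comb(total - hits, query_total - k)
    return tail / denom


# 8 of 10 query sequences contain the motif, versus 10 of 100 background sequences
p = overrepresentation_p(8, 10, 10, 100)
print(p, "over-represented" if p < 0.05 else "not significant")
```

Under-representation could be tested in the same way by summing the lower tail instead.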
Clover calculates an overall "raw score" indicating how strongly the motif is present in
the whole sequence set. Raw scores by themselves are hard to interpret, so Clover provides
options (which we recommend you use) to determine the statistical significance (P-values) of the raw
scores. The P-value threshold, if applicable, always overrides the overall raw score cutoff.
Four ways of determining statistical significance are available. The first involves
providing Clover with one or more files of background DNA sequences. Each background file
should contain sequences in FASTA format, with total length much greater than the target
sequence set. For each background set, Clover will repeatedly extract random fragments
matched by length to the target sequences, and calculate raw scores for these fragments.
The proportion of times that the raw score of a fragment set exceeds or equals the raw
score of the target set, e.g. 0.02, is called a P-value.
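The sampling procedure described above can be sketched in Python as follows. This is a simplified illustration rather than Clover's code: the function names are hypothetical, the raw score is passed in as a function because its definition is not given here, and each fragment's source sequence is chosen uniformly among those that are long enough, which is an assumption.

```python
import random


def sample_fragments(background, lengths, rng):
    """Cut one random fragment per target length from the background sequences."""
    fragments = []
    for length in lengths:
        candidates = [seq for seq in background if len(seq) >= length]
        if not candidates:
            raise ValueError("no background sequence is long enough")
        seq = rng.choice(candidates)
        start = rng.randrange(len(seq) - length + 1)
        fragments.append(seq[start:start + length])
    return fragments


def empirical_p_value(targets, background, raw_score, trials=1000, seed=0):
    """Fraction of random fragment sets scoring at least as high as the targets."""
    rng = random.Random(seed)
    observed = raw_score(targets)
    lengths = [len(t) for t in targets]
    at_least = sum(
        raw_score(sample_fragments(background, lengths, rng)) >= observed
        for _ in range(trials)
    )
    return at_least / trials


def toy_score(seqs):
    """Stand-in score (NOT Clover's raw score): number of 'TATA' occurrences."""
    return sum(s.count("TATA") for s in seqs)


background = ["ACGT" * 500, "GGCCAATT" * 300]
targets = ["GGTATAAAGG", "CCTATATACC"]
print(empirical_p_value(targets, background, toy_score, trials=200))
```

More trials give a finer-grained P-value: with 200 trials the smallest non-zero value that can be reported is 0.005.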
Used by Possum only. The local abundances of A, C, G, and T at each point in the sequence will be estimated by looking this far in either direction. Local nucleotide abundances often vary quite significantly along a sequence.
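A windowed estimate of this kind might look like the Python sketch below. It is not Possum's implementation: the function name local_abundance, the inclusion of the position itself in the window, and the fallback to 1/4 when the window holds no A, C, G or T are all assumptions.

```python
def local_abundance(seq, position, reach):
    """Estimate A, C, G, T frequencies within `reach` bases of `position`."""
    start = max(0, position - reach)
    window = seq[start:position + reach + 1].upper()
    counts = {base: window.count(base) for base in "ACGT"}  # 'n' is not counted
    total = sum(counts.values())
    if total == 0:
        return {base: 0.25 for base in "ACGT"}  # assumption: fall back to 1/4
    return {base: counts[base] / total for base in "ACGT"}


seq = "AAAACCCCGGGGTTTT"
print(local_abundance(seq, 0, 3))   # window covers only the leading A's
print(local_abundance(seq, 8, 2))   # window straddles the C/G boundary
```

Near the ends of a sequence the window is truncated, so the estimate there rests on fewer bases.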
Assuming equal abundance of 1/4 can be useful for analyzing very short sequences. This option overrides the "nucleotide abundance range" option.
This value will be added to all the counts in the cis-element matrices. Pseudocounts are often used in estimating true abundances from a limited number of observations. The default value of 0.375 was obtained by a fitting procedure to all transcription factor binding site matrices in the TRANSFAC database. If your matrices contain probabilities rather than counts, you should probably set this parameter to 0.
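The effect of the pseudocount can be illustrated with a short Python sketch. Adding the value to every count follows the description above; normalizing each position to probabilities afterwards, and the function name counts_to_probabilities, are assumptions made for the example.

```python
def counts_to_probabilities(counts, pseudocount=0.375):
    """Add a pseudocount to every cell, then normalize each position to sum to 1."""
    probabilities = []
    for row in counts:
        adjusted = [c + pseudocount for c in row]
        total = sum(adjusted)
        probabilities.append([a / total for a in adjusted])
    return probabilities


# Position 04 of the AML-1a example: 57 G's and no other base observed
print(counts_to_probabilities([[0, 0, 57, 0]]))
# With the pseudocount set to 0 the unobserved bases keep probability 0
print(counts_to_probabilities([[0, 0, 57, 0]], pseudocount=0))
```

Without a pseudocount, a base never observed at a position gets probability 0, so any site carrying that base would receive a log-likelihood score of minus infinity.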
To save computing time, you can upload a file containing previously saved text output for visualization purposes, after you enter the corresponding sequence information. Please ensure the integrity of your input text file. You should either redirect the command-line program output to a text file, or save the text output at the end of the MotifViz result page to a text file.
Please visit the following pages for directions on downloading command-line programs:
Suggestions to: Yutao Fu
Last modified: Saturday, 09-Feb-2004 14:00:00 EST