MotifViz: Motif Search and Visualization


1. Pick a program: Rover (motif count), Motifish (sequence count), Clover, Possum, or Visualization only

2. Query sequences:
Enter DNA sequences or GenBank identifiers
AND/OR upload from a file
3. Select motifs:
Use JASPAR matrices
AND/OR enter other matrices (Get from TRANSFAC)
AND/OR upload from a file
4. Background:
Enter background sequences or GenBank identifiers
AND/OR upload from a file
5. Set Clover options:

Additional background sequences:
Human chromosome 20 (44.1% C+G) - finished sequence
Human 2000 bp upstream of genes (49.8% C+G) - UCSC 08-Jul-2003
Human CpG islands (68.8% C+G, median length = 557 bp) - UCSC 14-Apr-2003

Mouse chromosome 19 (42.8% C+G) - NCBI Build 30
Mouse 2000 bp upstream of genes (47.8% C+G) - UCSC 25-Apr-2003

Drosophila chromosome 2 arm R (43.5% C+G) - BDGP Release 3

Overall raw score threshold:
Individual motif score threshold:
P-value threshold:
Pseudocount:
Shuffles: (number of) times on Sequence / Sequence dinucleotide / Motif
Mask lower case letters in sequences
Display detailed mapping (number of bases per line)


 



Instructions


Program of choice

We present four cis-element search programs (Rover, Motifish, Clover and Possum) for identifying functional sites in DNA sequences. Given a set of DNA sequences that share a common function, these programs can compare them to a library of sequence motifs (e.g. transcription factor binding patterns), and identify which, if any, of the motifs are statistically overrepresented in the sequence set.

References to these programs can be found in:
1. Frith, M. C., Fu, Y., Yu, L., Chen, J. F., Hansen, U. & Weng, Z. (2004) Detection of Functional DNA Motifs via Statistical Overrepresentation. Nucleic Acids Research (in press)
2. Haverty, P. M., Hansen, U. & Weng, Z. (2004) Computational Inference of Transcriptional Regulatory Networks from Expression Profiling and Transcription Factor Binding Site Identification. Nucleic Acids Research 32(1):179-188

Sequence Format

Sequences may be entered in FASTA, raw, or GenBank format. Any non-alphabetic characters in the sequence will be ignored, and any alphabetic characters other than A, C, G and T (uppercase or lowercase) will be converted to 'n' and excluded from motif matching. If GenBank format is used, your program of choice will read and display any 'CDS' (protein-coding region) annotations. Limits: at most 200 sequences, with a total length of up to 1000 kb.
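
The cleaning rules above can be sketched in a few lines of Python. This is only an illustration, not the code MotifViz runs: the function names are made up for this example, and reading "1000 kb" as 1,000,000 bases is an assumption.

```python
def normalize_sequence(raw: str) -> str:
    """Drop non-alphabetic characters; turn letters other than A/C/G/T into 'n'."""
    cleaned = []
    for ch in raw:
        if not ch.isalpha():
            continue  # digits, whitespace and punctuation are ignored
        cleaned.append(ch if ch in "ACGTacgt" else "n")
    return "".join(cleaned)


def within_limits(sequences, max_count=200, max_total_bases=1_000_000):
    """Check the stated limits (assumes 1000 kb means 1,000,000 bases)."""
    total = sum(len(s) for s in sequences)
    return len(sequences) <= max_count and total <= max_total_bases


if __name__ == "__main__":
    print(normalize_sequence("ACG T12\nryNa-c"))
    print(within_limits(["ACGT"] * 200))
```

Case is preserved here because the form offers a separate option to mask lower case letters.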

GenBank Identifiers

These may be GenBank accession numbers (e.g. NC_001669), 'accession.version' numbers (e.g. NC_001669.1), or GI numbers (e.g. 9628421).
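
As a rough illustration of how the three identifier styles differ, the following Python sketch classifies a string by shape. The regular expressions are simplified assumptions that fit the examples above; they are not NCBI's full accession rules, and `classify_identifier` is a hypothetical helper name.

```python
import re

# Simplified patterns (assumptions for illustration only).
_PATTERNS = [
    ("gi", re.compile(r"\d+")),
    ("accession.version", re.compile(r"[A-Za-z]+_?[A-Za-z]*\d+\.\d+")),
    ("accession", re.compile(r"[A-Za-z]+_?[A-Za-z]*\d+")),
]


def classify_identifier(text):
    """Return 'gi', 'accession.version', 'accession', or None if nothing fits."""
    token = text.strip()
    for kind, pattern in _PATTERNS:
        if pattern.fullmatch(token):
            return kind
    return None


if __name__ == "__main__":
    for example in ("NC_001669", "NC_001669.1", "9628421"):
        print(example, "->", classify_identifier(example))
```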

Format for User-defined Cis-elements

We provide the JASPAR collection of transcription factor binding site patterns. JASPAR is described in the following publication; please give suitable credit:

Sandelin, A., Alkema, W., Engstrom, P., Wasserman, W.W. and Lenhard, B. (2004). JASPAR: an open-access database for eukaryotic transcription factor binding profiles. Nucleic Acids Res, 32 Database issue, D91-94

Another source of motifs is TRANSFAC: the commercial nature of this database prevents us from providing it directly.

User-specified cis-elements can be entered as TRANSFAC-style matrices, which look like this:

NA   AML-1a
XX
DE   runt-factor AML-1
XX
BF   T02256; AML1a; Species: human, Homo sapiens.
XX
P0      A      C      G      T
01      5      1      2     49      T
02      2      2     52      1      G
03      4     14      1     38      T
04      0      0     57      0      G
05      1      0     55      1      G
06      1      4      0     52      T

You can copy and paste these directly from the TRANSFAC website. All lines except the name line (beginning with 'NA') and the nucleotide frequency lines (beginning with digits) are ignored and not required. The name line is required, and should be above the base frequency lines.

Alternatively, each cis-element can have a title line starting with ">" followed by the name of the element, and then 4 numbers per line describing the nucleotide frequencies at each position in the cis-element. For example:

>element1
0  4 2 14
12 0 0 8
8  0 1 11
20 0 0 0
>element2
13 1 1 5
...

These numbers might come from a multiple alignment of experimentally determined cis-elements. The first column indicates the number of adenines observed in each position, the second column the number of cytosines, the third column the number of guanines, and the fourth column the number of thymines.

The two formats can be mixed.
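
To make the two formats concrete, here is a minimal Python sketch of a reader that accepts both in one input. It is not the parser the programs use. It assumes that rows under an 'NA' line carry a leading position index (as in the TRANSFAC example) while rows under a ">" line do not, and it does not handle the Possum gap lines described in the next paragraph. The name `parse_matrices` is hypothetical.

```python
def parse_matrices(text):
    """Return {name: [[A, C, G, T], ...]} from TRANSFAC-style and '>'-style entries."""
    motifs = {}
    rows = None          # rows of the motif currently being read
    skip_index = False   # True inside an 'NA' entry: first column is a position index
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            rows = motifs[line[1:].strip()] = []
            skip_index = False
        elif line.startswith("NA"):
            rows = motifs[line[2:].strip()] = []
            skip_index = True
        elif line[:1].isdigit() and rows is not None:
            fields = line.split()
            start = 1 if skip_index else 0
            rows.append([float(x) for x in fields[start:start + 4]])
        # every other line (XX, DE, BF, P0, blank) is ignored
    return motifs


if __name__ == "__main__":
    sample = """NA   AML-1a
XX
P0      A      C      G      T
01      5      1      2     49      T
02      2      2     52      1      G
>element1
0  4 2 14
12 0 0 8
"""
    for name, matrix in parse_matrices(sample).items():
        print(name, matrix)
```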

For Possum, gaps in the cis-element may be indicated by entering a single "b" or "n" on a line. Possum will use local nucleotide abundance at these positions. This option allows users to specify multipartite cis-elements.

Background Sequences

Background sequences are required for Rover and Motifish, and recommended for Clover. Which background sets to use depends on which sequences you are studying: they should ideally come from the same taxonomic group as the target sequences, and have similar repetitive element and GC content. We like to cover our bases by using multiple background sets, e.g. for human target sequences, we might use a human chromosome, a set of human CpG islands, and a set of human gene upstream regions as backgrounds.

For Rover and Motifish, ideally every background sequence should be of the same length as each query sequence. Multiple background sets will be combined.

Clover requires background sequences to be much longer than query sequences (at least one sequence in each background set must be longer than the longest query sequence), and it processes each background set separately.

Motif Score Threshold

This is the threshold for the score of an individual motif instance at a specific sequence position. The standard log likelihood ratio method is used:
score = log[ prob(sequence|motif) / prob(sequence|random) ]

For Clover, this threshold does not affect the overall raw score or P-value calculation; it only determines which motif instances are reported in the sequence output, namely those with log-likelihood scores higher than this value.
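
The score formula can be illustrated with a short Python sketch. It is not taken from any of the programs: the natural logarithm, the uniform 1/4 background, forward-strand-only scanning, and the function names are all assumptions made for the example. The motif is given as per-position probabilities, which must be non-zero for the logarithm to be defined.

```python
import math

UNIFORM = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}


def log_likelihood_ratio(site, motif_probs, background=UNIFORM):
    """log[ prob(site | motif) / prob(site | random) ] for one candidate site."""
    if len(site) != len(motif_probs):
        raise ValueError("site and motif must have the same length")
    score = 0.0
    for base, column in zip(site.upper(), motif_probs):
        score += math.log(column[base] / background[base])
    return score


def scan(sequence, motif_probs, threshold):
    """Return (start, score) for forward-strand sites scoring above the threshold."""
    width = len(motif_probs)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width].upper()
        if set(window) <= set("ACGT"):  # windows containing 'n' are not matched
            score = log_likelihood_ratio(window, motif_probs)
            if score > threshold:
                hits.append((start, score))
    return hits


if __name__ == "__main__":
    toy_motif = [
        {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
        {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    ]
    print(scan("ccAGnAG", toy_motif, threshold=1.0))
```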

For Motifish, this threshold is used for the initial scan of all input cis-elements in the background sequences. The final threshold is then determined for each cis-element such that 10% of background sequences contain at least one instance. This parameter therefore indirectly affects the counts of motif-containing query sequences and the P-value.
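
One way to picture the per-motif threshold is as a cutoff on the best score found in each background sequence. The Python sketch below is a guess at the idea rather than Motifish's actual procedure: it assumes a sequence "contains an instance" when its best score is at or above the threshold, and the helper name is hypothetical.

```python
def threshold_for_hit_fraction(best_scores, fraction=0.10):
    """Pick a cutoff so that about `fraction` of background sequences reach it.

    best_scores: the highest motif score found in each background sequence.
    """
    if not best_scores:
        raise ValueError("need at least one background sequence")
    ranked = sorted(best_scores, reverse=True)
    k = max(1, round(fraction * len(ranked)))  # number of sequences allowed to hit
    return ranked[k - 1]


if __name__ == "__main__":
    background_best = [float(x) for x in range(1, 21)]  # 20 toy background scores
    cutoff = threshold_for_hit_fraction(background_best)
    hitting = sum(1 for s in background_best if s >= cutoff)
    print(cutoff, hitting / len(background_best))
```

With tied scores the realized fraction can exceed the requested one, so this is only an approximation of the 10% rule.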

Statistical Significance

Motifish, Rover and Clover print details for statistically significant motifs (all P-values <= the chosen threshold).

Rover and Motifish are contingency-table-based methods: they count either the occurrences of a cis-element above a certain motif score threshold (Rover) or the number of sequences containing such hits (Motifish). If the overall P-value calculated by comparing the counts in the query sequences with those in the background sequences is lower than the P-value threshold, the cis-element is considered over- or under-represented in the query sequences.
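
The text does not say which statistical test is applied to the contingency table. Purely as an illustration of the sequence-count case, the Python sketch below uses a one-sided Fisher's exact test for overrepresentation; this choice of test and the function name are assumptions, not a description of Rover or Motifish.

```python
from math import comb


def overrepresentation_p_value(query_hits, query_total, bg_hits, bg_total):
    """One-sided Fisher's exact P-value for a 2x2 table of sequence counts.

    Rows: query vs. background. Columns: contains the motif vs. does not.
    Returns the probability of seeing at least `query_hits` motif-containing
    query sequences if hits were spread at random over all sequences.
    """
    total = query_total + bg_total
    hits = query_hits + bg_hits
    upper = min(hits, query_total)
    tail = sum(
        comb(hits, k) * comb(total - hits, query_total - k)
        for k in range(query_hits, upper + 1)
    )
    return tail / comb(total, query_total)


if __name__ == "__main__":
    # Toy counts: 12 of 20 query sequences vs. 10 of 100 background sequences.
    print(overrepresentation_p_value(12, 20, 10, 100))
```

A matching test for under-representation would sum the lower tail instead.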

Clover calculates an overall "raw score" indicating how strongly the motif is present in the whole sequence set. Raw scores by themselves are hard to interpret, so Clover provides options (which we recommend you use) to determine the statistical significance (P-values) of the raw scores. When a P-value threshold applies, it always overrides the overall raw score cutoff. Four ways of determining statistical significance are available.

The first involves providing Clover with one or more files of background DNA sequences. Each background file should contain sequences in FASTA format, with total length much greater than the target sequence set. For each background set, Clover will repeatedly extract random fragments matched by length to the target sequences, and calculate raw scores for these fragments. The proportion of times that the raw score of a fragment set exceeds or equals the raw score of the target set, e.g. 0.02, is called a P-value. The P-value indicates the probability that the motif's presence in the target set can be explained just by chance. For each motif, a separate P-value is calculated for each background file.

The second way of determining statistical significance is to repeatedly shuffle the letters within each target sequence, and use these shuffled sequence sets as controls. P-values are calculated as above.

The third way is to create random sequences with the same dinucleotide compositions as each target sequence.

The fourth way is to shuffle the motif matrices, and obtain control raw scores by comparing the shuffled motifs to the target sequences. When shuffling a motif, the counts of A, C, G and T within each position are not shuffled, but the positions are shuffled among one another.

The shuffling methods have a known weakness: they tend to report motifs that lie in Alus and other common repetitive elements as significant.
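
The empirical P-value and two of the shuffling controls can be sketched in Python as follows. This is an illustration under stated assumptions, not Clover's code: the raw score itself is not defined in this document, so `raw_score` is a caller-supplied stand-in function, and all names here are hypothetical.

```python
import random


def empirical_p_value(target_score, control_scores):
    """Proportion of control scores that exceed or equal the target score."""
    at_least = sum(1 for s in control_scores if s >= target_score)
    return at_least / len(control_scores)


def shuffle_letters(sequence, rng):
    """Shuffle the letters within one sequence (second method)."""
    letters = list(sequence)
    rng.shuffle(letters)
    return "".join(letters)


def shuffle_motif_positions(matrix, rng):
    """Shuffle whole positions (rows of A, C, G, T counts) among one another
    (fourth method); counts within a position stay together."""
    positions = [row[:] for row in matrix]
    rng.shuffle(positions)
    return positions


def sequence_shuffle_p_value(sequences, raw_score, shuffles=1000, seed=0):
    """P-value of raw_score(sequences) against letter-shuffled controls."""
    rng = random.Random(seed)
    target = raw_score(sequences)
    controls = [
        raw_score([shuffle_letters(s, rng) for s in sequences])
        for _ in range(shuffles)
    ]
    return empirical_p_value(target, controls)


if __name__ == "__main__":
    # Stand-in score for demonstration only: occurrences of "TGTGGT".
    toy_score = lambda seqs: sum(s.count("TGTGGT") for s in seqs)
    targets = ["ACTGTGGTAC", "TTTGTGGTCA", "GGTGTGGTTT"]
    print(sequence_shuffle_p_value(targets, toy_score, shuffles=200))
```

Dinucleotide-preserving randomization (the third method) needs a more involved procedure and is left out of this sketch.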

Nucleotide Abundance Range

Used by Possum only. The local abundances of A, C, G, and T at each point in the sequence will be estimated by looking this far in either direction. Local nucleotide abundances often vary quite significantly along a sequence.
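
A windowed estimate of this kind might look like the Python sketch below. Possum's exact behavior is not described here, so several details are assumptions: the window includes the position itself, it is truncated at the sequence ends, non-ACGT letters are skipped, and a window with no usable letters falls back to 1/4 each. The function name and the `reach` parameter are hypothetical.

```python
def local_abundances(sequence, position, reach):
    """Estimate A/C/G/T frequencies within `reach` bases either side of `position`."""
    start = max(0, position - reach)
    window = sequence[start:position + reach + 1].upper()
    counts = {base: window.count(base) for base in "ACGT"}
    total = sum(counts.values())
    if total == 0:
        return {base: 0.25 for base in "ACGT"}  # nothing to estimate from
    return {base: counts[base] / total for base in "ACGT"}


if __name__ == "__main__":
    seq = "AAAAAAAAGCGCGCGC"
    print(local_abundances(seq, 2, 3))   # AT-rich neighbourhood
    print(local_abundances(seq, 13, 3))  # GC-rich neighbourhood
```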

Assuming equal abundance of 1/4 can be useful for analyzing very short sequences. It nullifies the "nucleotide abundance range" option.

Pseudocount

This value will be added to all the counts in the cis-element matrices. Pseudocounts are often used in estimating true abundances from a limited number of observations. The default value of 0.375 was obtained by a fitting procedure to all transcription factor binding site matrices in the TRANSFAC database. If your matrices contain probabilities rather than counts, you should probably set this parameter to 0.
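
The effect of the pseudocount is easiest to see on a position with zero counts. The Python sketch below adds the pseudocount to every count and normalizes each position to probabilities; the normalization step and the function name are assumptions for illustration, since the text only states that the value is added to the counts.

```python
def counts_to_probabilities(matrix, pseudocount=0.375):
    """Add the pseudocount to every count, then normalize each position to sum to 1."""
    probabilities = []
    for row in matrix:  # one row per motif position: counts of A, C, G, T
        adjusted = [count + pseudocount for count in row]
        total = sum(adjusted)
        probabilities.append([value / total for value in adjusted])
    return probabilities


if __name__ == "__main__":
    # Position 04 of the AML-1a example: only G was observed.
    print(counts_to_probabilities([[0, 0, 57, 0]]))
    # A matrix that already holds probabilities, left unchanged with pseudocount 0.
    print(counts_to_probabilities([[0.1, 0.2, 0.3, 0.4]], pseudocount=0))
```

Without the pseudocount, the unobserved bases at that position would get probability zero, and a log likelihood ratio score for any site containing them would be undefined.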

Visualization only

To save computing time, you can upload a file containing previously saved text output for visualization, after you have entered the corresponding sequence information. Please make sure the text file is intact and unmodified. You can either redirect the output of a command-line program to a text file, or save the text output at the end of the MotifViz results page to a text file.

Please visit the following pages for directions on downloading command-line programs:

  1. Clover -- http://zlab.bu.edu/clover/
  2. Rover -- http://zlab.bu.edu/rover/
  3. Motifish -- Motifisher.pl (requires Clover to run)
  4. Possum -- Linux executable (more to come!)


Suggestions to: Yutao Fu
Last modified: Saturday, 09-Feb-2004 14:00:00 EST