MotifViz: Motif Search and Visualization


1. Pick a program: Rover (motif count), Motifish (sequence count), Clover, Possum, or Visualization only

2. Query sequences:
Enter DNA sequences or GenBank identifiers:
AND/OR upload from a file:

3. Select motifs:
Use JASPAR matrices
AND/OR enter other matrices: (Get from TRANSFAC)
AND/OR upload from a file:

4. Background:
Enter background sequences or GenBank identifiers:
AND/OR upload from a file:

We present four cis-element search programs for identifying functional sites in DNA sequences. Given a set of DNA sequences that share a common function, these programs can compare them to a library of sequence motifs (e.g. transcription factor binding patterns) and identify which of the motifs, if any, are statistically overrepresented in the sequence set:

Possum: a simple PWM (position-specific weight matrix) scan; fast, but without statistical evaluation.
Motifish: uses a PWM scan and the Fisher exact test to compare the number of sequences in which a motif occurs in the query sequence set vs. the background set.
Rover: uses a PWM scan and binomial estimation to compare the number of motif occurrences in the query sequences vs. the background.
Clover: uses a thermodynamic model, with permutation or background input, for statistical evaluation of Cis-eLement OVERrepresentation; more sophisticated but slower than the other three programs.

References to these programs can be found in:
1. Frith, M. C., Fu, Y., Yu, L., Chen, J. F., Hansen, U. & Weng, Z. (2004) Detection of Functional DNA Motifs via Statistical Overrepresentation. Nucleic Acids Research (In Print)
2. Haverty, P. M., Hansen, U. & Weng, Z. (2004) Computational Inference of Transcriptional Regulatory Networks from Expression Profiling and Transcription Factor Binding Site Identification. Nucleic Acids Research, 32(1), 179-188.

Sequence Format

Sequences may be entered in Fasta, raw, or GenBank format. Any non-alphabetic characters in the sequence will be ignored, and any alphabetic characters except A, C, G and T (uppercase or lowercase) will be converted to 'n' and excluded from matching motifs. If GenBank format is used, your program of choice will read and display any 'CDS' (protein-coding region) annotations. Limits: at most 200 sequences, of total length up to 1000 kb.
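The character rules above can be sketched in a few lines of Python. This is only an illustration of the stated behaviour for a raw sequence string, not the code the programs actually run; the function name clean_sequence is hypothetical, and preserving the original letter case is an assumption. FASTA title lines and GenBank annotations would need to be removed before calling it.

```python
def clean_sequence(raw):
    """Normalize a raw DNA string: drop non-letters, mask non-ACGT letters as 'n'."""
    cleaned = []
    for ch in raw:
        if not ch.isalpha():
            continue  # digits, whitespace and punctuation are ignored
        if ch.lower() in "acgt":
            cleaned.append(ch)  # assumption: case is kept as entered
        else:
            cleaned.append("n")  # any other letter is masked
    return "".join(cleaned)


print(clean_sequence("1 acgtRYAC GT-x"))  # prints acgtnnACGTn
```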

GenBank Identifiers

Identifiers may be GenBank accession numbers (e.g. NC_001669), 'accession.version' numbers (e.g. NC_001669.1), or GI numbers (e.g. 9628421).

Format for User-defined Cis-elements

We provide the JASPAR collection of transcription factor binding site patterns. JASPAR is described in the following publication; please give suitable credit:

Sandelin, A., Alkema, W., Engstrom, P., Wasserman, W.W. and Lenhard, B. (2004). JASPAR: an open-access database for eukaryotic transcription factor binding profiles. Nucleic Acids Res, 32 Database issue, D91-94

Another source of motifs is TRANSFAC: the commercial nature of this database prevents us from providing it directly.

User-specified cis-elements can be entered as TRANSFAC-style matrices, which look like this:

NA   AML-1a
XX
DE   runt-factor AML-1
XX
BF   T02256; AML1a; Species: human, Homo sapiens.
XX
P0      A      C      G      T
01      5      1      2     49      T
02      2      2     52      1      G
03      4     14      1     38      T
04      0      0     57      0      G
05      1      0     55      1      G
06      1      4      0     52      T

You can copy-and-paste these directly from the TRANSFAC website. All lines except the name line (beginning with 'NA') and the nucleotide frequency lines (beginning with digits) are ignored and not required. The name line is required, and should be above the base frequency lines. Alternatively, each cis-element can have a title line starting with ">" and then the name of the element, followed by 4 numbers per line describing nucleotide frequencies at each position in the cis-element. For example:

>element1
0  4 2 14
12 0 0 8
8  0 1 11
20 0 0 0
>element2
13 1 1 5
...

These numbers might come from a multiple alignment of experimentally determined cis-elements. The first column indicates the number of adenines observed in each position, the second column the number of cytosines, the third column the number of guanines, and the fourth column the number of thymines.

The two formats can be mixed.
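As a rough illustration of how mixed input in these two formats could be read, here is a minimal Python sketch. It is not the parser used by the programs: the function name parse_matrices and the returned dictionary layout are assumptions, and it does not handle the Possum gap lines ("b" or "n").

```python
def parse_matrices(text):
    """Return {name: [[A, C, G, T], ...]} from TRANSFAC-style and '>'-style input."""
    matrices = {}
    name = None
    skip_index = False  # True while reading a TRANSFAC-style matrix
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "NA" and len(fields) > 1:
            name, skip_index = " ".join(fields[1:]), True
            matrices[name] = []
        elif fields[0].startswith(">"):
            name, skip_index = line.strip()[1:].strip(), False
            matrices[name] = []
        elif name is not None and fields[0][0].isdigit():
            # TRANSFAC rows start with a position index; '>' rows do not.
            values = fields[1:5] if skip_index else fields[0:4]
            if len(values) != 4:
                raise ValueError("expected 4 counts in line: " + line)
            matrices[name].append([float(v) for v in values])
        # all other lines (XX, DE, BF, P0, ...) are ignored
    return matrices


SAMPLE = """
NA   AML-1a
XX
P0      A      C      G      T
01      5      1      2     49      T
02      2      2     52      1      G
>element1
0  4 2 14
12 0 0 8
"""

for motif_name, rows in parse_matrices(SAMPLE).items():
    print(motif_name, rows)
```

Tracking which kind of title line was seen last is what lets the sketch tell a TRANSFAC position index apart from an adenine count.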

For Possum, gaps in the cis-element may be indicated by entering a single "b" or "n" on a line. Possum will use local nucleotide abundance at these positions. This option allows users to specify multipartite cis-elements.

Background Sequences

Background sequences are required for Rover and Motifish, and recommended for Clover. Which background sets to use depends on which sequences you are studying: they should ideally come from the same taxonomic group as the target sequences, and have similar repetitive element and GC content. We like to cover our bases by using multiple background sets, e.g. for human target sequences, we might use a human chromosome, a set of human CpG islands, and a set of human gene upstream regions as backgrounds.

For Rover and Motifish, ideally every background sequence should be of the same length as each query sequence. Multiple background sets will be combined.

Clover requires background sequences to be much longer than query sequences (at least one sequence in each background set must be longer than the longest query sequence), and it processes each background set separately.

Motif Score Threshold

This is the threshold applied to the score of a motif instance at each sequence position. The standard log-likelihood ratio method is used:

score = log[ prob(sequence|motif) / prob(sequence|random) ]
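To make the formula concrete, here is a minimal Python sketch of scoring one candidate site. It assumes the motif is given as per-position probabilities for A, C, G and T, that the "random" model is a fixed background distribution (equal abundance by default), and that natural logarithms are used; the base of the logarithm and the function name llr_score are assumptions, not taken from the programs.

```python
import math

BASES = "ACGT"


def llr_score(site, motif_probs, background=(0.25, 0.25, 0.25, 0.25)):
    """Log-likelihood ratio of `site` under the motif versus the background model."""
    if len(site) != len(motif_probs):
        raise ValueError("site and motif must have the same length")
    score = 0.0
    for base, row in zip(site.upper(), motif_probs):
        if base not in BASES:
            return float("-inf")  # masked bases ('n') never match a motif
        i = BASES.index(base)
        if row[i] == 0:
            return float("-inf")  # impossible under the motif model
        score += math.log(row[i] / background[i])
    return score


# Hypothetical two-position motif that prefers "AG".
motif = [[0.7, 0.1, 0.1, 0.1],
         [0.1, 0.1, 0.7, 0.1]]
print(llr_score("AG", motif))  # positive: fits the motif better than background
print(llr_score("CT", motif))  # negative: fits the background better
```

A site is reported as a motif instance when its score exceeds the chosen threshold.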

For Clover, this threshold does not affect the overall raw score or the P-value calculation; it only determines which motif instances are reported in the sequence output, namely those with log-likelihood scores higher than this value.

For Motifish, this threshold is used for the initial scan of all input cis-elements against the background sequences. The final threshold is then determined for each cis-element such that 10% of background sequences contain at least one instance. This parameter therefore indirectly affects the counts of motif-containing query sequences and the P-value.
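One way to picture the per-element threshold for Motifish is as a quantile of the best score found in each background sequence. The Python sketch below is a guess at that idea, not Motifish's actual procedure: the function name, the use of the best score per sequence, and the handling of ties and rounding are all assumptions.

```python
import math


def background_calibrated_threshold(best_scores, fraction=0.10):
    """Return a cutoff reached by roughly `fraction` of background sequences.

    best_scores holds the highest motif score found in each background sequence.
    """
    if not best_scores:
        raise ValueError("need at least one background sequence")
    ranked = sorted(best_scores, reverse=True)
    k = max(1, math.floor(fraction * len(ranked)))  # number of sequences allowed to hit
    return ranked[k - 1]


# Hypothetical best scores for 20 background sequences.
scores = [float(s) for s in range(1, 21)]
cutoff = background_calibrated_threshold(scores)
hits = sum(1 for s in scores if s >= cutoff)
print(cutoff, hits / len(scores))
```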

Statistical Significance

Motifish, Rover and Clover print details for statistically significant motifs (those whose P-values are all <= the chosen threshold).

Rover and Motifish are contingency-table-based methods: they count either the occurrences of a cis-element above a certain motif score threshold or the number of sequences containing such hits. If the overall P-value calculated by comparing the counts in query sequences and background sequences is lower than this threshold, the cis-element is considered over- or under-represented in the query sequences.
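For the sequence-count case, a Fisher exact test on a 2x2 table (query vs. background, with vs. without a hit) can be sketched in plain Python as follows. This shows only the one-sided test for overrepresentation; whether Motifish uses a one-sided or two-sided test is not stated here, so treat that choice and the function name as assumptions.

```python
from math import comb


def fisher_overrepresentation(query_hits, query_total, bg_hits, bg_total):
    """One-sided Fisher exact P-value: query sequences contain the motif
    at least this often, given the combined hit count."""
    total = query_total + bg_total
    hits = query_hits + bg_hits
    upper = min(hits, query_total)
    # Hypergeometric tail: sum over tables at least as extreme as the observed one.
    tail = sum(
        comb(hits, k) * comb(total - hits, query_total - k)
        for k in range(query_hits, upper + 1)
    )
    return tail / comb(total, query_total)


# Hypothetical counts: motif found in 3 of 3 query and 0 of 3 background sequences.
print(fisher_overrepresentation(3, 3, 0, 3))  # prints 0.05
```

The same table with the tail summed in the other direction would give the corresponding test for under-representation.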

Clover calculates an overall "raw score" indicating how strongly the motif is present in the whole sequence set. Raw scores by themselves are hard to interpret, so Clover provides options (which we recommend you use) to determine the statistical significance (P-values) of the raw scores. The P-value threshold, if applicable, always overrides the overall raw score cutoff. Four ways of determining statistical significance are available.

The first involves providing Clover with one or more files of background DNA sequences. Each background file should contain sequences in FASTA format, with total length much greater than the target sequence set. For each background set, Clover will repeatedly extract random fragments matched by length to the target sequences, and calculate raw scores for these fragments. The proportion of times that the raw score of a fragment set equals or exceeds the raw score of the target set, e.g. 0.02, is called a P-value. The P-value indicates the probability that the motif's presence in the target set can be explained just by chance. For each motif, a separate P-value is calculated for each background file.

The second way is to repeatedly shuffle the letters within each target sequence, and use these shuffled sequence sets as controls. P-values are calculated as above.

The third way is to create random sequences with the same dinucleotide composition as each target sequence.

The fourth way is to shuffle the motif matrices, and obtain control raw scores by comparing the shuffled motifs to the target sequences. When shuffling a motif, the counts of A, C, G and T within each position are not shuffled, but the positions are shuffled among one another.

The shuffling methods have a drawback: they tend to report motifs that lie in Alus and other common repetitive elements as significant.
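Two pieces of this procedure are simple enough to sketch in Python: turning a list of control raw scores into a P-value, and shuffling the positions of a motif matrix. The sketch does not compute Clover's raw score itself, and the function names are hypothetical.

```python
import random


def empirical_p_value(target_score, control_scores):
    """Proportion of control raw scores that equal or exceed the target raw score."""
    if not control_scores:
        raise ValueError("need at least one control score")
    hits = sum(1 for s in control_scores if s >= target_score)
    return hits / len(control_scores)


def shuffle_motif_positions(matrix, rng):
    """Permute the positions (rows) of a count matrix; the A, C, G, T counts
    within each position are left untouched."""
    shuffled = [row[:] for row in matrix]
    rng.shuffle(shuffled)
    return shuffled


# Hypothetical control scores: 2 of 100 controls reach the target score of 5.0.
controls = [0.0] * 98 + [5.0, 9.0]
print(empirical_p_value(5.0, controls))  # prints 0.02

motif = [[5, 1, 2, 49], [2, 2, 52, 1], [4, 14, 1, 38]]
print(shuffle_motif_positions(motif, random.Random(0)))
```

With this estimate the smallest non-zero P-value is one divided by the number of controls, so more repetitions give finer resolution.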

Nucleotide Abundance Range

Used by Possum only. The local abundances of A, C, G, and T at each point in the sequence will be estimated by looking this far in either direction. Local nucleotide abundances often vary quite significantly along a sequence.

Assuming equal abundance of 1/4 can be useful for analyzing very short sequences. Selecting it overrides the "nucleotide abundance range" option.
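The windowed estimate can be illustrated with a short Python sketch. It is not Possum's code: the function name, the inclusion of the central position in the window, and the fallback to equal abundance when the window holds no A, C, G or T are assumptions.

```python
def local_abundance(seq, position, radius):
    """Estimate A/C/G/T frequencies in a window reaching `radius` bases
    to either side of `position` (0-based)."""
    start = max(0, position - radius)
    window = seq[start:position + radius + 1].upper()
    counts = {base: window.count(base) for base in "ACGT"}
    total = sum(counts.values())
    if total == 0:
        # assumption: fall back to equal abundance when nothing can be counted
        return {base: 0.25 for base in "ACGT"}
    return {base: counts[base] / total for base in "ACGT"}


# The window around position 3 with radius 1 is "AAC".
print(local_abundance("AAAACCCC", 3, 1))
```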

Pseudocount

This value will be added to all the counts in the cis-element matrices. Pseudocounts are often used in estimating true abundances from a limited number of observations. The default value of 0.375 was obtained by a fitting procedure to all transcription factor binding site matrices in the TRANSFAC database. If your matrices contain probabilities rather than counts, you should probably set this parameter to 0.
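As a small worked example, adding the pseudocount to every count and renormalizing each position could look like this in Python. The function name is hypothetical and the normalization step is an assumption about how the adjusted counts are turned into probabilities.

```python
def counts_to_probabilities(matrix, pseudocount=0.375):
    """Add `pseudocount` to every count and normalize each position to sum to 1."""
    result = []
    for row in matrix:
        total = sum(row) + 4 * pseudocount
        result.append([(count + pseudocount) / total for count in row])
    return result


# Position 04 of the AML-1a example above: 57 G's and nothing else.
print(counts_to_probabilities([[0, 0, 57, 0]]))
# With a pseudocount of 0, a matrix of probabilities is left unchanged.
print(counts_to_probabilities([[0.25, 0.25, 0.25, 0.25]], pseudocount=0))
```

The pseudocount keeps unobserved bases from getting probability zero, which would otherwise make the log-likelihood ratio undefined at that position.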

Visualization only

To save computing time, you can upload a file containing previously saved text output for visualization purposes, after you input the proper sequence information. Please ensure the integrity of your input text file. You should either redirect command-line program output to a text file, or save the text output at the end of the MotifViz result page into a text file.

Please visit the following pages for directions on downloading command-line programs:

Clover -- http://zlab.bu.edu/clover/
Rover -- http://zlab.bu.edu/rover/
Motifish -- Motifisher.pl (requires Clover to run)
Possum -- Linux executable (more to come!)


Suggestions to: Yutao Fu
Last modified: Saturday, 09-Feb-2004 14:00:00 EST

NONE
ALL AP2 family
ALL bHLH family
ALL bHLH-ZIP family
ALL bZIP family
ALL CAAT-BOX family
ALL ETS family
ALL FORKHEAD family
ALL HMG family
ALL HOMEO family
ALL HOMEO-ZIP family
ALL IPT/TIG domain family
ALL MADS family
ALL NUCLEAR RECEPTOR family
ALL P53 family
ALL PAIRED family
ALL PAIRED-HOMEO family
ALL REL family
ALL RUNT family
ALL T-BOX family
ALL TATA-BOX family
ALL TEA family
ALL TRP-CLUSTER family
ALL Unknown family
ALL ZN-FINGER, C2H2 family
ALL ZN-FINGER, DOF family
ALL ZN-FINGER, GATA family
Agamous, MADS family
AGL3, MADS family
Ahr-ARNT, bHLH family
AML-1, RUNT family
Androgen, NUCLEAR RECEPTOR family
AP2alpha, AP2 family
ARNT, bHLH family
Athb-1, HOMEO-ZIP family
Brachyury, T-BOX family
Broad-complex_1, ZN-FINGER, C2H2 family
Broad-complex_2, ZN-FINGER, C2H2 family
Broad-complex_3, ZN-FINGER, C2H2 family
Broad-complex_4, ZN-FINGER, C2H2 family
Bsap, PAIRED family
bZIP910, bZIP family
bZIP911, bZIP family
c-ETS, ETS family
c-FOS, bZIP family
c-MYB_1, TRP-CLUSTER family
c-REL, REL family
cEBP, bZIP family
CF2-II, ZN-FINGER, C2H2 family
CFI-USP, NUCLEAR RECEPTOR family
Chop-cEBP, bZIP family
COUP-TF, NUCLEAR RECEPTOR family
CREB, bZIP family
deltaEF1, ZN-FINGER, C2H2 family
Dof2, ZN-FINGER, DOF family
Dof3, ZN-FINGER, DOF family
Dorsal_1, REL family
Dorsal_2, REL family
E2F, Unknown family
E4BP4, bZIP family
E74A, ETS family
Elk-1, ETS family
EN-1, HOMEO family
Evi-1, ZN-FINGER, C2H2 family
FREAC-2, FORKHEAD family
FREAC-3, FORKHEAD family
FREAC-4, FORKHEAD family
FREAC-7, FORKHEAD family
GAMYB, TRP-CLUSTER family
GATA-1, ZN-FINGER, GATA family
GATA-2, ZN-FINGER, GATA family
GATA-3, ZN-FINGER, GATA family
Gfi, ZN-FINGER, C2H2 family
Gklf, ZN-FINGER, C2H2 family
Hen-1, bHLH family
HFH-1, FORKHEAD family
HFH-2, FORKHEAD family
HFH-3, FORKHEAD family
HLF, bZIP family
HMG-1, HMG family
HMG-IY, HMG family
HNF-1, HOMEO family
HNF-3beta, FORKHEAD family
Hunchback, ZN-FINGER, C2H2 family
Irf-1, TRP-CLUSTER family
Irf-2, TRP-CLUSTER family
Max, bHLH-ZIP family
MEF2, MADS family
MNB1A, ZN-FINGER, DOF family
MYB.ph3, TRP-CLUSTER family
Myc-Max, bHLH-ZIP family
Myf, bHLH family
MZF_1-4, ZN-FINGER, C2H2 family
MZF_5-13, ZN-FINGER, C2H2 family
n-MYC, bHLH-ZIP family
NF-kappaB, REL family
NF-Y, CAAT-BOX family
Nkx, HOMEO family
NRF-2, ETS family
p50, REL family
p53, P53 family
p65, REL family
Pax-2, PAIRED family
Pax-4, PAIRED-HOMEO family
Pax6, PAIRED family
PBF, ZN-FINGER, DOF family
Pbx, HOMEO family
PPARgamma, NUCLEAR RECEPTOR family
PPARgamma-RXRal, NUCLEAR RECEPTOR family
RORalfa-1, NUCLEAR RECEPTOR family
RORalfa-2, NUCLEAR RECEPTOR family
RREB-1, ZN-FINGER, C2H2 family
RXR-VDR, NUCLEAR RECEPTOR family
S8, HOMEO family
SAP-1, ETS family
Snail, ZN-FINGER, C2H2 family
Sox-5, HMG family
SOX-9, HMG family
SOX17, HMG family
SP1, ZN-FINGER, C2H2 family
SPI-1, ETS family
SPI-B, ETS family
SQUA, MADS family
SRF, MADS family
SRY, HMG family
Staf, ZN-FINGER, C2H2 family
SU_h, IPT/TIG domain family
Tal1beta-E47S, bHLH family
TBP, TATA-BOX family
TCF11-MafG, bZIP family
TEF-1, TEA family
Thing1-E47, bHLH family
Ubx, HOMEO family
USF, bHLH-ZIP family
Yin-Yang, ZN-FINGER, C2H2 family
ALL JASPAR

3. Upload previous output file:
Display detailed mapping. (Number of bases per line:)