Comet

What is it?

COMET stands for Cluster Of Motifs E-value Tool. It finds statistically significant clusters of motifs in a DNA sequence. Each motif is represented by a 4 x L matrix, where L is the length of the motif, which records the frequencies of the nucleotides A, C, G, and T at each position in the motif. The Web version of Comet allows you to select motifs from a small library that we have constructed, or you can enter your own matrices, for example by copying and pasting them directly from the TRANSFAC database.

Input Form for using Comet on the Web.
Download Comet to use on your own computer.
Our paper on Comet.

Why search for clusters of motifs?

The most obvious application of Comet is to make predictions about the regulation of gene transcription. In higher eukaryotes, including humans, transcriptional regulation is encoded in clusters of cis-elements that constitute enhancers, silencers, and promoters. Cis-elements possess characteristic sequence patterns or 'motifs', but these motifs are usually too weak to allow accurate detection of individual cis-elements, although you can try to do so with our program Possum. Searching for motif clusters allows multiple weak signals to combine into a stronger signal, which has a greater chance of being detected. There is evidence that alternative splicing, mRNA 3' end processing, and mRNA localization are also regulated by clusters of individually weak signals.

How does it work?

To be widely applicable, a method should tell us in a general sense whether or not a region of the sequence contains an unusually strong concentration of motifs. Comet assigns a positive score to each motif using the standard method of log likelihood ratios, and subtracts a 'gap penalty' linearly proportional to the distances between motifs. Thus each motif cluster receives a score, which is higher if the individual motifs are stronger, but lower if they are further apart. The use of this gap penalty is not as ad hoc as it may seem at first. The scoring scheme corresponds to the log likelihood ratio of the data under a cluster model versus a background model. In the cluster model, cis-elements occur uniformly along the sequence with some intensity, whereas in the background model the sequence consists of random nucleotides. The gap penalty corresponds in a one-to-one fashion with the intensity parameter of the cluster model. That's probably more than you wanted to know.
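The scoring idea described above can be sketched in a few lines of Python. This is only an illustration, not Comet's actual code: the function names motif_score and cluster_score are hypothetical, natural logarithms are an arbitrary choice, and the sketch scores one given arrangement of motifs rather than searching for the best one.

```python
import math

def motif_score(site, probs, background):
    """Log likelihood ratio of `site` under a motif model versus the background.

    probs: one dict per motif position, mapping base -> probability
           (assumed nonzero, e.g. after adding pseudocounts).
    background: dict mapping base -> background probability.
    """
    return sum(math.log(p[base] / background[base])
               for base, p in zip(site, probs))

def cluster_score(hits, gap_penalty):
    """Sum of motif scores minus a penalty proportional to the gaps between motifs.

    hits: (start, end, score) tuples sorted by position, non-overlapping,
          with `end` exclusive. The tuple layout is an assumption of this sketch.
    """
    total = sum(score for _, _, score in hits)
    gaps = sum(nxt[0] - cur[1] for cur, nxt in zip(hits, hits[1:]))
    return total - gap_penalty * gaps

# Two motifs scoring 5.0 and 4.0, separated by 14 bp, with an arbitrary penalty.
example = cluster_score([(10, 16, 5.0), (30, 36, 4.0)], gap_penalty=0.1)
```

In the example the two motifs contribute 9.0 in total and the 14 bp gap costs 1.4, so stronger motifs raise the cluster score and wider spacing lowers it.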

How is Comet different from Cister?

Our other program Cister also aims to detect clusters of cis-elements, and these two programs are quite similar. Here are the main differences:

Comet requires fewer parameters to be set.
Cister integrates all possible arrangements of cis-elements in the cluster, whereas Comet just considers the most probable arrangement. This is probably an advantage of Cister.
Comet's output is a list of predicted cis-element clusters, whereas Cister's output consists of posterior probability curves indicating the likelihood that each basepair is within a cis-element cluster. Although Cister's curves look nice, we suspect that Comet's output is more straightforward to interpret.
The most important advantage of Comet is that it indicates the statistical significance of its predictions using an E-value. The E-value is the number of times we would expect to see a motif cluster of this strength in the sequence purely by chance, so the lower the E-value, the better.

Citation:

Frith MC, Spouge JL, Hansen U, Weng Z
Statistical significance of clusters of motifs represented by position specific scoring matrices in nucleotide sequences.
Nucleic Acids Res 2002 Jul 15;30(14):3214-24

New - Datasets used in the Comet paper.

If you use cis-element matrices from TRANSFAC, please cite:

Wingender E, Chen X, Hehl R, Karas H, Liebich I, Matys V, Meinhardt T, Pruss M, Reuter I, Schacherer F
TRANSFAC: an integrated system for gene expression regulation
Nucleic Acids Res 2000 28, 316-319

Input

This section describes some of the options on the Comet Input Form.

Sequence Input Format

A sequence may be entered in FASTA format, with a title line beginning with ">", followed by lines containing the sequence. The title line is not required, and any digits or whitespace characters in the sequence are ignored. Alternatively, you can use GenBank format, in which case any annotated coding regions will be displayed in the output. Maximum sequence length: 100 kb (download Comet if you want to analyze longer sequences).
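For readers preparing input with a script, here is a minimal sketch of the FASTA rules just described: an optional title line, with digits and whitespace skipped. The function name read_sequence is hypothetical, and converting to upper case is an assumption of this sketch rather than documented Comet behaviour.

```python
def read_sequence(text):
    """Return (title, sequence) from FASTA-like text; title is None if absent."""
    title = None
    chunks = []
    for line in text.splitlines():
        if line.startswith(">"):
            # Keep the first title line; any later ">" lines are skipped.
            if title is None:
                title = line[1:].strip()
            continue
        # Drop digits and whitespace, as in numbered sequence listings.
        chunks.append("".join(ch for ch in line
                              if not (ch.isdigit() or ch.isspace())))
    # Upper-casing is an assumption made here for convenience.
    return title, "".join(chunks).upper()

title, seq = read_sequence(">seq1\n 1 acgt acgt\n11 ttga\n")
```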

GenBank Identifiers

Instead of a sequence, you may enter a GenBank identifier: an accession number (e.g. NC_001669), an 'accession.version' number (e.g. NC_001669.1), or a GI number (e.g. 9628421). Please note: you may want to check that your identifier refers to a promoter sequence. For example, GenBank accessions from Affymetrix chips may refer to mRNA sequences, which don't include the promoter region.

Set Subsequence

You may limit the search to a subsequence by entering its start and end coordinates. (The first nucleotide in the sequence has coordinate 1.) The default values of the From and To fields are the start and end of the sequence, respectively.
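If you extract subsequences yourself before submitting them, note that 1-based coordinates differ from the 0-based indexing of most programming languages. A small sketch, assuming the To coordinate is inclusive (which the default of "end of the sequence" suggests); the function name subsequence is hypothetical:

```python
def subsequence(seq, start=1, end=None):
    """Return seq[start..end] using 1-based, inclusive coordinates."""
    if end is None:
        end = len(seq)  # default: through the last nucleotide
    if not 1 <= start <= end <= len(seq):
        raise ValueError("coordinates must satisfy 1 <= start <= end <= length")
    # Python slices are 0-based and end-exclusive, hence start - 1.
    return seq[start - 1:end]

part = subsequence("ACGTACGT", 2, 4)
```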

Format for User-defined Cis-elements

Cis-elements can be entered as TRANSFAC-style matrices, which look like this:

NA   AML-1a
XX
DE   runt-factor AML-1
XX
BF   T02256; AML1a; Species: human, Homo sapiens.
XX
P0      A      C      G      T
01      5      1      2     49      T
02      2      2     52      1      G
03      4     14      1     38      T
04      0      0     57      0      G
05      1      0     55      1      G
06      1      4      0     52      T

You can cut-and-paste these directly from the TRANSFAC website. All lines except the name line (beginning with 'NA') and the nucleotide frequency lines (beginning with digits) are ignored and not required. The name line is required, and should be above the base frequency lines.
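The rules above (a name line beginning with 'NA', frequency lines beginning with digits, everything else ignored) can be sketched as a small parser. This is an illustration under those stated rules only, not Comet's own parser; the function name parse_transfac is hypothetical, and it handles only the TRANSFAC-style format.

```python
def parse_transfac(text):
    """Return a dict mapping element name -> list of [A, C, G, T] counts."""
    matrices = {}
    name = None
    for line in text.splitlines():
        if line.startswith("NA"):
            # Name line: start a new matrix.
            name = line[2:].strip()
            matrices[name] = []
        elif line[:1].isdigit() and name is not None:
            # Frequency line: position number, then counts for A, C, G, T.
            # Any trailing consensus letter is ignored.
            fields = line.split()
            matrices[name].append([float(x) for x in fields[1:5]])
        # All other lines (XX, DE, BF, P0, ...) are ignored.
    return matrices

sample = """NA   AML-1a
XX
P0      A      C      G      T
01      5      1      2     49      T
02      2      2     52      1      G
"""
matrices = parse_transfac(sample)
```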

Alternatively, each cis-element can have a title line starting with ">" and then the name of the element, followed by 4 numbers per line describing nucleotide frequencies at each position in the cis-element. For example:

>element1
14 4 2 0
0 0 12 8
8 8 1 3
20 0 0 0
3 3 13 1
10 0 10 0
3 3 6 8
>element2
13 1 1 5
...

These numbers might come from a multiple alignment of experimentally determined cis-elements. The first column indicates the number of adenines observed in each position, the second column the number of cytosines, the third column the number of guanines, and the fourth column the number of thymines. Gaps in the cis-element may be indicated by entering a single "b" on a line. Comet will use background nucleotide frequencies at these positions. This option allows users to specify multipartite cis-elements. In addition, if a transcription factor is known to occlude several bases adjacent to its sequence-specific binding site from binding other factors, this steric hindrance can be modelled by specifying a number of "b" (blank / background) positions above and below the sequence-specific portion of the cis-element definition.
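A minimal sketch of reading this second format, including the "b" lines, follows. The function name parse_simple is hypothetical, and representing a background position as None is a choice made for this sketch, not something Comet prescribes.

```python
def parse_simple(text):
    """Return a dict mapping element name -> list of positions.

    Each position is a list of [A, C, G, T] counts, or None for a
    background ("b") position.
    """
    matrices = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            matrices[name] = []
        elif name is not None:
            if line == "b":
                matrices[name].append(None)  # gap: use background frequencies
            else:
                counts = [float(x) for x in line.split()]
                if len(counts) != 4:
                    raise ValueError("expected 4 numbers per line: " + line)
                matrices[name].append(counts)
    return matrices

# A two-part element with one background position between its halves.
elements = parse_simple(">element1\n14 4 2 0\nb\n0 0 12 8\n")
```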

The two formats can be mixed. Optionally, there may be an extra line following the name line (in either format) specifying weights for the cis-element on each strand (see the download page for more details).

Gap Parameter

Comet makes use of a gap penalty, explained above in the section "How does it work?". The gap penalty is mathematically related to the average distance between cis-elements in a cluster according to our model of cis-element clusters. Since the gap penalty is not a very intuitive quantity, Comet allows you to specify the average distance between cis-elements instead. The default value of 35 bp gives reasonable results.

Window Size

To discriminate motifs from background sequence, Comet uses an estimate of the background abundances of A, C, G, and T. Unfortunately, these abundances fluctuate considerably along real DNA sequences. Therefore, Comet counts the local nucleotide abundances in windows of width 2w+1. The default value w = 75 bp works well for human DNA.
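To make the windowing concrete, the sketch below counts nucleotides in a window of width 2w+1 centred on a given position. The function name local_background is hypothetical, and two details are assumptions of this sketch rather than documented behaviour: the window is simply truncated at the ends of the sequence, and characters other than A, C, G, and T are not counted.

```python
from collections import Counter

def local_background(seq, center, w=75):
    """Frequencies of A, C, G, T in the window of width 2w+1 around `center`.

    `center` is a 0-based index into `seq`.
    """
    lo = max(0, center - w)                 # truncate at the left end
    hi = min(len(seq), center + w + 1)      # truncate at the right end
    counts = Counter(seq[lo:hi])
    total = sum(counts[base] for base in "ACGT")
    if total == 0:
        raise ValueError("window contains no A, C, G, or T")
    return {base: counts[base] / total for base in "ACGT"}

# With w = 1 the window around index 3 of AAAACCCC is "AAC".
freqs = local_background("AAAACCCC", center=3, w=1)
```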

E-value Threshold

Only clusters with E-values lower than this threshold will be reported.

Pseudocount

This value will be added to all the counts in the cis-element matrices. Pseudocounts are often used in estimating true abundances from a limited number of observations. The default is to use Laplace's Rule of Succession: a pseudocount of 1.
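As an illustration of what a pseudocount does, the sketch below adds it to every count and then normalizes each position to probabilities. The normalization step and the function name counts_to_probabilities are assumptions of this sketch; the text above only states that the value is added to the counts.

```python
def counts_to_probabilities(counts, pseudocount=1.0):
    """Add `pseudocount` to every count, then normalize each position.

    counts: list of [A, C, G, T] counts, one list per motif position.
    """
    probs = []
    for row in counts:
        adjusted = [c + pseudocount for c in row]
        total = sum(adjusted)
        probs.append([a / total for a in adjusted])
    return probs

# Position 04 of the AML-1a matrix shown earlier: 57 G's and nothing else.
position = counts_to_probabilities([[0, 0, 57, 0]])[0]
```

Without the pseudocount, A, C, and T would get probability zero at this position; with a pseudocount of 1 they each get 1/61, so a site with a mismatch here is penalized but not ruled out.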