Get Cluster-Buster
    This webpage offers instructions for downloading Cluster-Buster onto your computer and running it from the UNIX command line. With the downloaded program you can analyze multiple sequences at once, analyze sequences longer than 100 kb, and include Cluster-Buster in your own analysis pipelines. Note that the downloadable program lacks some features of the web server: for example, it does not understand GenBank format for sequences or TRANSFAC format for matrices, and it does not produce graphics for the output. You might also like to try using Cluster-Trainer in conjunction with Cluster-Buster.
 
  Changes
    2010-02-19: Fixed the source code so it compiles on modern picky systems.
2006-05-09: Added scores for each motif type's contribution to each cluster. Added newest version of JASPAR to the website.
2005-02-16: Fixed the source code for fussy compilers.
2003-11-18: Added ability to specify the gap parameter in the motif file.
2003-10-26: Added more extensive documentation, and provided more choices for the output format.
 
 
  Installation
   
  1. Download Cluster-Buster by clicking on one of these links and saving the file on your computer:
    Cluster-Buster executable for Linux (Redhat 9)
    Cluster-Buster executable for Mac OS X (universal binary, not tested on Intel)
    Cluster-Buster executable for Windows/Cygwin (thanks: David Walsh)
  2. Set execute permission for the file by typing chmod +x cbust-linux (or whatever you saved it as).
  3. Cluster-Buster is now ready to run.
 
 
  Motif Matrices
    We provide the JASPAR motif collection in a format suitable for Cluster-Buster. Please also see the original JASPAR homepage, with more information including citation details. Another matrix collection is TRANSFAC.  
 
  Usage
   

Example usage: cbust -g20 -l mymotifs myseqs.fa
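
For use inside a pipeline, the example command can be assembled programmatically. The following is a minimal Python sketch, not part of Cluster-Buster itself: the helper name build_cbust_command is hypothetical, and it assumes the executable is called cbust and is on your PATH. It uses only the -g and -l options shown in the example above.

```python
def build_cbust_command(motif_file, seq_file, gap=None, mask_lowercase=False,
                        executable="cbust"):
    """Assemble an argument list such as: cbust -g20 -l mymotifs myseqs.fa

    Hypothetical helper: the executable name and location are assumptions.
    """
    cmd = [executable]
    if gap is not None:
        # The value is attached directly to the flag, as in the example (-g20).
        cmd.append(f"-g{gap}")
    if mask_lowercase:
        cmd.append("-l")
    # The motif file comes first, then the FASTA sequence file.
    cmd += [motif_file, seq_file]
    return cmd


command = build_cbust_command("mymotifs", "myseqs.fa", gap=20, mask_lowercase=True)
print(" ".join(command))
```

The resulting list can be handed to subprocess.run(command, capture_output=True, text=True) to collect the program's output as text.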

Cluster-Buster requires two inputs: a file of DNA sequences in the standard FASTA format (here is an example), and a file of motifs. Any non-alphabetic characters in the sequences are ignored, and any alphabetic characters except A, C, G, T (uppercase or lowercase) are converted to 'n' and forbidden from matching motifs.
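
The treatment of sequence characters described above can be illustrated with a short Python sketch. This is not the program's own code; the function name normalize_sequence is hypothetical, and preserving the case of A, C, G, and T is an assumption based on the fact that lowercase letters carry meaning for the -l option.

```python
def normalize_sequence(raw):
    """Mimic the described handling of sequence characters (illustrative only)."""
    cleaned = []
    for ch in raw:
        if not ch.isalpha():
            # Non-alphabetic characters (digits, spaces, dashes, newlines) are ignored.
            continue
        if ch in "ACGTacgt":
            # Assumption: case is kept, since lowercase may mark repeats.
            cleaned.append(ch)
        else:
            # Any other letter becomes 'n', which cannot match a motif.
            cleaned.append("n")
    return "".join(cleaned)


print(normalize_sequence("ACgt-12 NRyx\n"))
```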

The motif file should contain matrices in the following format:

>element1
0  4 2 14
12 0 0 8
8  0 1 11
20 0 0 0
>element2
13 1 1 5
...
The rows of each matrix correspond to successive positions of the motif, from 5' to 3', and the columns indicate the frequencies of A, C, G, and T, respectively, in each position. These frequencies are usually obtained from alignments of protein-binding sites.

It is possible to assign 'weights' to the motifs indicating how important they are for defining clusters. To do so, place a line like this within the motif definition (below the line beginning '>'):

# WEIGHT 2.7
This motif will carry 2.7-fold more weight than a motif with weight 1 (the default value). It is also possible to specify the gap parameter in the motif file, by including a line like this:
# GAP 22.4
Specifying the gap parameter with the -g option overrides any value given in the motif file. The program Cluster-Trainer can be used to estimate good values for the weights and the gap parameter.
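
To check a motif file before running the program, something like the following Python sketch may help. It is an illustrative reader for the format described above, not Cluster-Buster's own parser: the function name parse_motif_file and the returned data layout are hypothetical, and treating "# GAP" as a single file-wide value that may appear on any line is an assumption.

```python
def parse_motif_file(text):
    """Read matrices, optional weights, and an optional gap value (illustrative)."""
    motifs = []
    gap = None       # stays None if no '# GAP' line is present
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new motif begins; weight 1 is the documented default.
            current = {"name": line[1:].strip(), "weight": 1.0, "rows": []}
            motifs.append(current)
        elif line.startswith("#"):
            fields = line[1:].split()
            if len(fields) == 2 and fields[0] == "GAP":
                gap = float(fields[1])
            elif len(fields) == 2 and fields[0] == "WEIGHT" and current is not None:
                current["weight"] = float(fields[1])
        elif current is not None:
            # One row per motif position: frequencies of A, C, G, T.
            row = [float(x) for x in line.split()]
            if len(row) != 4:
                raise ValueError(f"expected 4 columns, got: {line!r}")
            current["rows"].append(row)
    return motifs, gap


example = """>element1
# WEIGHT 2.7
0  4 2 14
12 0 0 8
8  0 1 11
20 0 0 0
# GAP 22.4
"""
motifs, gap = parse_motif_file(example)
print(motifs[0]["name"], motifs[0]["weight"], len(motifs[0]["rows"]), gap)
```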

Cluster-Buster compares each matrix to every location in the DNA sequence, and calculates scores reflecting the goodness of the match. It then identifies motif clusters as sequence regions with unusually strong concentrations of high-scoring matches. Each motif cluster receives a score which depends on the scores of its motifs, their weights, the tightness of their clustering, and the manner in which they overlap one another.

The location, score, and sequence of each motif cluster are printed. For high-scoring motif matches within clusters, the motif type, location, strand, score, and sequence are printed.

The program's behavior may be modified with the following options. The default values are designed to give sensible results.

-h
Help: print documentation
-c
Cluster score threshold. Print details of all motif clusters with score >= this value.
-m
Motif score threshold. Print details of all motif matches that have score >= this value and occur within printed clusters. This option has no effect on finding motif clusters: it just affects which motifs get printed.
-g
Gap parameter. The expected distance in bp between neighboring motifs in a cluster. Low values make the program more sensitive to very tight clusters even if the motif scores are not so high, and high values make it more sensitive to clusters of high-scoring motifs even if the clustering is not very tight.
-r
Range in bp for counting local nucleotide abundances. Abundances of A, C, G, and T vary significantly along natural DNA sequences. The program estimates the local nucleotide abundances at each position in the sequence by counting them up to this distance in both directions. These abundances affect motif and hence cluster scores: e.g. a GC-rich motif will receive a higher score if found in an AT-rich region, where it is more surprising, than if it is found in a GC-rich region.
-l
Mask lowercase letters in the sequences (i.e. forbid motifs from matching them). Lowercase letters are often used to indicate repetitive regions. Repetitive regions, especially tandem repeats, can produce extremely high-scoring motif clusters, which may or may not be spurious.
-p
Pseudocount. This value gets added to all entries in the motif matrices. Pseudocounts are a standard way of estimating underlying frequencies from a limited number of observations. If your matrices contain probabilities rather than counts, you should probably set this parameter to zero.
-f
Output format.
0: Print the clusters in the first sequence sorted by score, then the clusters in the second sequence sorted by score, etc.
1: Concise version of 0, omitting details of individual motif matches.
2: Sort all clusters by score, regardless of which sequence they come from.
3: Concise version of 2, omitting details of individual motif matches.
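
The ideas behind the -r and -p options can be sketched in Python as follows. This is only an illustration of the concepts, not the program's actual calculation: the function names are hypothetical, and the details (whether the window includes the position itself, how sequence ends are handled, the uniform fallback, and the normalization of matrix rows) are all assumptions.

```python
def local_abundances(seq, position, r):
    """Estimate A, C, G, T abundances within r bp on both sides of a position."""
    start = max(0, position - r)              # assumption: window is clipped at the ends
    end = min(len(seq), position + r + 1)     # assumption: the position itself is included
    window = seq[start:end].upper()
    counts = {base: window.count(base) for base in "ACGT"}
    total = sum(counts.values())
    if total == 0:
        # Assumption: fall back to uniform abundances if no A, C, G, T is in range.
        return {base: 0.25 for base in "ACGT"}
    return {base: counts[base] / total for base in "ACGT"}


def add_pseudocount(row, pseudocount):
    """Add a pseudocount to every entry of a matrix row, then normalize to frequencies."""
    adjusted = [count + pseudocount for count in row]
    total = sum(adjusted)
    return [value / total for value in adjusted]


print(local_abundances("AAAACCCC", 3, 1))
print(add_pseudocount([0, 4, 2, 14], 1))
```

With a pseudocount of 1, the zero count for A in the example row becomes a small nonzero frequency, so a single mismatch at that position is penalized rather than treated as impossible.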

 
 
  Output
    Cluster-Buster prints information for each motif cluster that it finds, for example:
CLUSTER 16
>gi|312167|emb|X70518.1|RNPROL R. norvegicus prolactin gene (5' region)  (2014 bases)
Location: 811 to 1045
Score: 6.06
CCAGGTCATCTGTCagtccaaattcagaaacagtaaagccaaaactaaaggtCACAAGCTGCTTcagatgaatgaatccc
caaattaaagaaagtcatcagcaacttcattattattcaccataatgacatcatttaggaaatctctaaaacatgagtgg
aactttggagtgcattaaaaaatgcatttTTGTCACTATGTCCTagagtgctttggGGTCAGAAGAGGCAGGCAG
V$SF1_Q6                 811 -  818   -   1.44       tgacctgg
V$E12_Q6                 814 -  824   -   3.36       gacagatgacc
Myf                      863 -  874   -   1.68       aagcagcttgtg
ERE                     1000 - 1014   -   0.729      aggacatagtgacaa
V$TANTIGEN_B            1027 - 1045   +   2.3        ggtcagaagaggcaggcag
V$CACBINDINGPROTEIN_Q6  1035 - 1043   +   1.52       gaggcaggc
The sequence in which the cluster was found, and the coordinates within the sequence, are indicated. Then the cluster's score is printed (the higher the better), and the sequence covered by the cluster, with motifs in uppercase. Finally, the following information is printed for each motif in the cluster: name, location, strand, score, and sequence.
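
When using the output in a pipeline, the per-motif lines can be read with a short script. The Python sketch below is based only on the layout of the example above; the function name parse_motif_line is hypothetical, and it assumes that motif names contain no whitespace and that the columns always appear in the order shown.

```python
import re

# name, start, '-', end, strand, score, sequence (as in the example output)
MOTIF_LINE = re.compile(
    r"^(\S+)\s+(\d+)\s+-\s+(\d+)\s+([+-])\s+(\S+)\s+([A-Za-z]+)$"
)


def parse_motif_line(line):
    """Return a dict for a motif line, or None if the line has another layout."""
    match = MOTIF_LINE.match(line.strip())
    if match is None:
        return None
    name, start, end, strand, score, sequence = match.groups()
    return {
        "name": name,
        "start": int(start),
        "end": int(end),
        "strand": strand,
        "score": float(score),
        "sequence": sequence,
    }


print(parse_motif_line("V$E12_Q6                 814 -  824   -   3.36       gacagatgacc"))
print(parse_motif_line("Location: 811 to 1045"))
```

Lines that do not match the motif layout, such as the "Location:" and "Score:" lines, return None and can be handled separately.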
 
 
  Source Code
    Follow these steps to download and compile the source code:
  1. Click and save: cbust-src.tar.gz
  2. Uncompress: gunzip cbust-src.tar.gz
  3. Un-archive: tar -xvf cbust-src.tar
  4. Change directory: cd cbust-src
  5. Compile (cross your fingers): make
Unfortunately the source code does not compile successfully on all systems. Any suggestions for making the code more portable would be greatly appreciated!
 
 
 
Comments and questions to Martin Frith