Get Cluster-Buster
This page explains how to download Cluster-Buster onto your computer and run it from the UNIX command line. With the downloaded program you can analyze multiple sequences at once, analyze sequences longer than 100 kb, and include Cluster-Buster in your own analysis pipelines. Note that the downloadable program lacks some features of the web server: for example, it does not understand GenBank format for sequences or TRANSFAC format for matrices, and it does not produce graphical output. You might also like to try using Cluster-Trainer in conjunction with Cluster-Buster.
Changes
2010-02-19: Fixed the source code so it compiles on modern picky systems.
2006-05-09: Added scores for each motif type's contribution to each cluster. Added newest version of JASPAR to the website.
2005-02-16: Fixed the source code for fussy compilers.
2003-11-18: Added ability to specify the gap parameter in the motif file.
2003-10-26: Added more extensive documentation, and provided more choices for the output format.
Installation
Motif Matrices
We provide the JASPAR motif collection in a format suitable for Cluster-Buster. Please also see the original JASPAR homepage, which has more information, including citation details. Another matrix collection is TRANSFAC.
Usage
Example usage:

    cbust -g20 -l mymotifs myseqs.fa

Cluster-Buster requires two inputs: a file of DNA sequences in the standard FASTA format, and a file of motifs. Any non-alphabetic characters in the sequences are ignored, and any alphabetic characters except A, C, G, T (uppercase or lowercase) are converted to 'n' and forbidden from matching motifs. The motif file should contain matrices in the following format:

    >element1
    0 4 2 14
    12 0 0 8
    8 0 1 11
    20 0 0 0
    >element2
    13 1 1 5
    ...

The rows of each matrix correspond to successive positions of the motif, from 5' to 3', and the columns indicate the frequencies of A, C, G, and T, respectively, in each position. These frequencies are usually obtained from alignments of protein-binding sites.

It is possible to assign 'weights' to the motifs indicating how important they are for defining clusters. To do so, place a line like this within the motif definition (below the line beginning '>'):

    # WEIGHT 2.7

This motif will carry 2.7-fold more weight than a motif with weight 1 (the default value). It is also possible to specify the gap parameter in the motif file, by including a line like this:

    # GAP 22.4

Specifying the gap parameter with the -g option overrides any value given in the motif file. The program Cluster-Trainer can be used to estimate good values for the weights and the gap parameter.

Cluster-Buster compares each matrix to every location in the DNA sequence, and calculates scores reflecting the goodness of the match. It then identifies motif clusters as sequence regions with unusually strong concentrations of high-scoring matches. Each motif cluster receives a score which depends on the scores of its motifs, their weights, the tightness of their clustering, and the manner in which they overlap one another. The location, score, and sequence of each motif cluster are printed. For high-scoring motif matches within clusters, the motif type, location, strand, score and sequence are printed.
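The input conventions described above can be illustrated with a short Python sketch. This is not Cluster-Buster's own code: the function names clean_sequence and parse_motifs are hypothetical, and the sketch assumes that a '# GAP' line may appear anywhere in the motif file and that other lines beginning with '#' can be skipped, neither of which is stated on this page.

```python
def clean_sequence(raw):
    """Apply the sequence rules described above: drop non-alphabetic
    characters and turn any letter other than A, C, G, T into 'n'."""
    cleaned = []
    for ch in raw:
        if not ch.isalpha():
            continue  # non-alphabetic characters are ignored
        cleaned.append(ch if ch in "ACGTacgt" else "n")
    return "".join(cleaned)


def parse_motifs(text):
    """Read motif-file text; return (motifs, gap).

    Each motif is a dict with 'name', 'weight' (default 1.0) and
    'rows' (one [A, C, G, T] list per motif position, 5' to 3').
    gap is None when the file has no '# GAP' line.
    """
    motifs = []
    gap = None
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = {"name": line[1:].strip(), "weight": 1.0, "rows": []}
            motifs.append(current)
        elif line.startswith("#"):
            fields = line[1:].split()
            if len(fields) == 2 and fields[0] == "WEIGHT" and current is not None:
                current["weight"] = float(fields[1])
            elif len(fields) == 2 and fields[0] == "GAP":
                gap = float(fields[1])
            # Assumption: any other '#' line is skipped.
        elif current is not None:
            counts = [float(x) for x in line.split()]
            if len(counts) != 4:
                raise ValueError("expected 4 columns (A C G T): " + line)
            current["rows"].append(counts)
    return motifs, gap
```

A pipeline could run parse_motifs on a motif file before calling cbust, to catch rows that do not have exactly four columns.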
The program's behavior may be modified with the following options. The default values are designed to give sensible results.
Output
Cluster-Buster prints information for each motif cluster that it finds, for example:
    CLUSTER 16
    >gi|312167|emb|X70518.1|RNPROL R. norvegicus prolactin gene (5' region) (2014 bases)
    Location: 811 to 1045
    Score: 6.06
    CCAGGTCATCTGTCagtccaaattcagaaacagtaaagccaaaactaaaggtCACAAGCTGCTTcagatgaatgaatccc
    caaattaaagaaagtcatcagcaacttcattattattcaccataatgacatcatttaggaaatctctaaaacatgagtgg
    aactttggagtgcattaaaaaatgcatttTTGTCACTATGTCCTagagtgctttggGGTCAGAAGAGGCAGGCAG
    V$SF1_Q6 811 - 818 - 1.44 tgacctgg
    V$E12_Q6 814 - 824 - 3.36 gacagatgacc
    Myf 863 - 874 - 1.68 aagcagcttgtg
    ERE 1000 - 1014 - 0.729 aggacatagtgacaa
    V$TANTIGEN_B 1027 - 1045 + 2.3 ggtcagaagaggcaggcag
    V$CACBINDINGPROTEIN_Q6 1035 - 1043 + 1.52 gaggcaggc

The sequence in which the cluster was found, and the coordinates within the sequence, are indicated. Then the cluster's score is printed (the higher the better), and the sequence covered by the cluster, with motifs in uppercase. Finally, the following information is printed for each motif in the cluster: name, location, strand, score, and sequence.
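For use in a pipeline, the per-motif lines can be read with a few lines of Python. The sketch below is only an illustration: parse_motif_line and site_sequence are hypothetical names, and the regular expression assumes the field layout seen in the example above (name, start, '-', end, strand, score, sequence), which may differ for other output formats. In that example, the sequence printed for a motif on the '-' strand is the reverse complement of the corresponding stretch of the cluster sequence, and site_sequence reproduces this.

```python
import re

# Assumed layout, taken from the example: name start - end strand score sequence
MOTIF_LINE = re.compile(
    r"^(\S+)\s+(\d+)\s+-\s+(\d+)\s+([+-])\s+(\S+)\s+([A-Za-z]+)$"
)

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def parse_motif_line(line):
    """Return a dict for one motif line, or None if the line does not match."""
    m = MOTIF_LINE.match(line.strip())
    if m is None:
        return None
    name, start, end, strand, score, seq = m.groups()
    return {
        "name": name,
        "start": int(start),
        "end": int(end),
        "strand": strand,
        "score": float(score),
        "sequence": seq,
    }


def site_sequence(cluster_seq, cluster_start, start, end, strand):
    """Extract a motif site from the cluster sequence using 1-based,
    inclusive coordinates; reverse-complement it for the '-' strand."""
    site = cluster_seq[start - cluster_start : end - cluster_start + 1]
    if strand == "-":
        site = site.translate(_COMPLEMENT)[::-1]
    return site.lower()
```

For instance, the cluster above starts at position 811 with CCAGGTCATCTGTC, and site_sequence on positions 811 to 818 of the '-' strand gives tgacctgg, the sequence printed for V$SF1_Q6.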
Source Code
Follow these steps to download and compile the source code:
Comments and questions to Martin Frith