Supporting Data for Cluster-Buster | |||

Cluster-Buster searches for regions of the sequence that resemble a statistical model of a motif cluster more than a model of "background DNA". We use a hidden Markov model of motif clusters, and an independent nucleotides model of background DNA. We aim to find subsequences whose log likelihood ratios, ln [Prob (subsequence | motif cluster model)/Prob (subsequence | background model)], are maximal (i.e. they do not overlap subsequences with higher log likelihood ratios). The so-called Forward algorithm can calculate this log likelihood ratio for all prefixes 1..j, in time proportional to the sequence length. To evaluate the log likelihood ratio for all subsequences, it would be necessary to repeat the Forward algorithm once for each possible start point, requiring time proportional to the sequence lengh squared, which is not feasible for sequences longer than a few kb. Cluster-Buster uses a linear-time heuristic which attempts to return the same motif cluster predictions as the full quadratic-time algorithm. As a test we applied Cluster-Buster and an implementation of the quadratic-time algorithm to a set of 27 short sequences. The two programs returned the exact same 19 clusters. So Cluster-Buster appears to be extremely successful at emulating the exact algorithm. See our publication or the source code for more details of our method. Here are the files related to this test: Set of 27 sequences (a subset of the muscle regulatory region collection of Wasserman and Fickett) Matrix set (also obtained from the above website) Cluster-Buster output Source code for our implementation of the quadratic-time algorithm Source code and executable for Cluster-Buster itself can be downloaded here. |
|||

Comments and questions to Martin Frith |