Cluster-Trainer Introduction
Cluster-Trainer is an auxiliary program for use with Cluster-Buster. It estimates optimal 'motif weights' to use with Cluster-Buster. These weights can also be interpreted as abundances of the motifs (occurrences per kb). Given a set of DNA sequences and a set of motif definitions, Cluster-Trainer will estimate how abundant each motif is in the sequences, and the average distance between neighboring motifs.
Installation
Usage
Cluster-Trainer requires two inputs: a file of DNA sequences in the standard FASTA format, and a file of motifs. Any non-alphabetic characters in the sequences are ignored, and any alphabetic characters except A, C, G, T (uppercase or lowercase) are converted to 'n' and forbidden from matching motifs. The motif file should contain matrices in the following format:

    >element1
    0 4 2 14
    12 0 0 8
    8 0 1 11
    20 0 0 0
    >element2
    13 1 1 5
    ...

The rows of each matrix correspond to successive positions of the motif, from 5' to 3', and the columns indicate the frequencies of A, C, G, and T, respectively, in each position. These frequencies are usually obtained from alignments of protein-binding sites.

Cluster-Trainer attempts to find the motif weights that cause a 'score' to achieve a maximum value. (This score is a log likelihood ratio, for the positive hypothesis that the sequences contain the motifs at the given abundances versus the null hypothesis of random, independent nucleotides.) The program uses a technique called 'Expectation-Maximization' (E-M), which starts with a random set of weights and iteratively changes them so as to improve the score. Since this technique only guarantees finding local maxima, the program performs several trials from different starting points. If most trials give roughly the same answer, and those that don't give lower-scoring answers, then we have found if not the true global optimum then at least a reproducibly good set of weights.

Cluster-Trainer prints the best set of weights that it finds (motif abundances per kb), the corresponding score, and the corresponding average distance between neighboring motifs. These parameters can be fed directly to Cluster-Buster.

The program's behavior can be modified with the following options. The defaults are designed to give sensible results in most cases.
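The input conventions described above can be sketched in a few lines of Python. This is only an illustration of the stated rules, not Cluster-Trainer's own code: the function names clean_sequence and parse_motifs are hypothetical, and the sketch assumes one matrix row per line with whitespace-separated values.

```python
def clean_sequence(raw):
    """Apply the stated sequence rules: drop non-alphabetic characters,
    and turn any letter other than A, C, G, T (either case) into 'n'."""
    out = []
    for ch in raw:
        if not ch.isalpha():
            continue  # digits, spaces, gaps, newlines are ignored
        out.append(ch if ch in "ACGTacgt" else "n")
    return "".join(out)


def parse_motifs(text):
    """Parse '>name' headers followed by rows of four numbers (A, C, G, T).
    Assumption: one motif position per line, values separated by whitespace."""
    motifs = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            motifs[name] = []
        else:
            if name is None:
                raise ValueError("matrix row found before any '>name' header")
            row = [float(x) for x in line.split()]
            if len(row) != 4:
                raise ValueError("expected 4 columns (A, C, G, T): " + line)
            motifs[name].append(row)
    return motifs


# The first matrix from the format description above.
example = """>element1
0 4 2 14
12 0 0 8
8 0 1 11
20 0 0 0
"""
motifs = parse_motifs(example)
cleaned = clean_sequence("acgT-12 RyN\n")  # ambiguity codes become 'n'
```

Reading the matrix into rows of [A, C, G, T] makes it easy to check a motif file for malformed rows before running the program.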
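The reason for running several trials from different starting points can be illustrated with a toy example. The sketch below is not Cluster-Trainer's E-M: it uses a simple hill climber as a stand-in local optimizer, and toy_score is a made-up one-dimensional function with a lower peak near 1 and a higher peak near 4. All names and parameter values here are assumptions chosen for illustration.

```python
import math
import random


def toy_score(w):
    # Hypothetical stand-in for the real score: two local maxima,
    # a lower one near w = 1 and a higher one near w = 4.
    return math.exp(-(w - 1.0) ** 2) + 2.0 * math.exp(-(w - 4.0) ** 2)


def local_climb(score, w, step=0.5, tol=1e-6):
    # Hill climbing standing in for E-M: move to a neighbor while that
    # improves the score; halve the step when neither neighbor improves.
    best = score(w)
    while step > tol:
        for cand in (w - step, w + step):
            s = score(cand)
            if s > best:
                w, best = cand, s
                break
        else:
            step /= 2.0
    return w, best


def multi_start(score, trials=10, low=0.0, high=5.0, seed=0):
    # Run the local optimizer from several random starting points and
    # return all (weight, score) results, best first.
    rng = random.Random(seed)
    results = [local_climb(score, rng.uniform(low, high)) for _ in range(trials)]
    results.sort(key=lambda r: r[1], reverse=True)
    return results


results = multi_start(toy_score)
best_w, best_score = results[0]
```

Trials that start near the lower peak stop there with a lower score, while the others agree on the higher peak. Comparing the sorted results is the same kind of check described above: most trials giving roughly the same answer, and the rest giving lower-scoring answers.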
Problems & Fixes
Cluster-Trainer may assign excessively high weights to motifs that resemble 'low-complexity sequence' (e.g. GC-rich or AT-rich motifs). If this problem occurs, you could try masking low-complexity regions in the sequences using programs such as nseg or dust, before applying Cluster-Trainer. Alternatively, omit the problematic motifs from training. If all the motifs receive excessively low weights, you could try increasing the -r parameter, or fixing the average distance between neighboring motifs to a plausible value such as 35 with the -f option.
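One rough way to spot motifs that might be GC-rich or AT-rich before training is to compute the G+C fraction of each count matrix. The sketch below is a heuristic of our own, not a feature of Cluster-Trainer: the function names and the 0.2 and 0.8 cutoffs are arbitrary assumptions, and the 'balanced' motif is invented for contrast.

```python
def gc_fraction(matrix):
    # matrix: list of rows, each row holding counts for [A, C, G, T].
    total = sum(sum(row) for row in matrix)
    if total == 0:
        raise ValueError("matrix has no counts")
    gc = sum(row[1] + row[2] for row in matrix)
    return gc / total


def flag_skewed(motifs, low=0.2, high=0.8):
    # Return names of motifs whose G+C fraction falls outside [low, high].
    # The cutoffs are illustrative assumptions, not recommended values.
    return [name for name, m in motifs.items()
            if not low <= gc_fraction(m) <= high]


motifs = {
    # First matrix from the motif format description.
    "element1": [[0, 4, 2, 14], [12, 0, 0, 8], [8, 0, 1, 11], [20, 0, 0, 0]],
    # Hypothetical motif with uniform counts, added for contrast.
    "balanced": [[5, 5, 5, 5], [5, 5, 5, 5]],
}
flagged = flag_skewed(motifs)  # element1 has G+C fraction 7/80 = 0.0875
```

A flagged motif is only a candidate for closer inspection; whether to mask the sequences or omit the motif remains a judgment call.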
Comments and questions to Martin Frith