Gapless Local Alignment of Multiple sequences

Introduction

GLAM is a program for discovering functional motifs shared by a set of nucleotide sequences. Examples of functional motifs include transcription factor binding sites, mRNA splicing control elements, signals for mRNA 3'-cleavage and polyadenylation, and anything else you can dream of. GLAM attempts to find these motifs by obtaining the best possible gapless multiple alignment of segments of the sequences. The 'best' alignment is the one that maximizes an alignment score, defined in our publication. At most one segment from each sequence is included in the alignment, and some sequences may be excluded if doing so would improve alignment quality. Currently we do not offer a web server, because GLAM is too compute-intensive.

Update: GLAM has children! A-GLAM and GLAM2. A-GLAM has several enhancements for finding gapless nucleotide motifs more effectively. GLAM2 can find gapped motifs and/or protein motifs. These programs largely supersede GLAM.

Publication

Martin C Frith, Ulla Hansen, John L Spouge, Zhiping Weng (2004). Finding functional sequence elements by multiple local alignment. Nucleic Acids Research 32(1):189-200.
Here are the data sets studied in the paper.

Changes

19th Feb 2010

Fixed the source code so it compiles on modern picky systems.

16th Feb 2005

Fixed the source code for fussy compilers. (Thanks: Dan Haft)

1st Apr 2004

Fixed bug that caused occasional crashes. (Thanks: Szymon Kielbasa)

7th Feb 2004

Fixed crashes caused by temperature underflow. (Thanks: Yutao Fu)

15th Dec 2003

Added options -g and -k to find suboptimal alignments.
Print stars under conserved columns.
Added option -d to control frequency of width-adjusting moves.
Print strand information.

5th Nov 2003

Fixed bug with -l option (lowercase masking).

31st Jul 2003

Added options -l to filter lowercase letters and -e to turn off the E-value calculation.
Changed the option to use the modified Lam schedule from -l to -m.
Changed the default value of -n from 20000 to 10000, so the program appears to run twice as fast.
Changed the default temperature from 1 to 0.9, since that works better in a variety of tests.
Switched from Spouge's to Altschul & Gish's edge correction formula, which makes a minor difference to the E-value calculation. Renamed mu_star to H for consistency with our publication.
Minor changes in the output format, so your parser will be broken.

5th Feb 2003

Fixed printing of flanking sequences on reverse strands. (Thanks: Kavitha Venkatesan)

Installation

(You are advised to read our paragraph on repetitive elements before using GLAM.)

Download GLAM by clicking one of these links, and saving the file on your computer:
GLAM executable for Linux (Red Hat 7.2/7.3)
GLAM executable for Sun (Solaris 8)
GLAM executable for SGI/IRIX
Set execute permission for the file by typing 'chmod +x glam-linux' (or whatever you saved it as).
GLAM is now ready to run.

Alternatively, download the GLAM source code and compile it yourself:
Uncompress: gunzip glam-src.tar.gz
Un-archive: tar -xvf glam-src.tar
Change directory: cd glam-src
Compile (cross your fingers): make

Unfortunately the source code doesn't compile successfully on all systems. We'd love to hear your suggestions for making it more portable.

Usage

GLAM takes as input a file of nucleotide sequences in FASTA format (here is an example). Any non-alphabetic characters in the sequences are ignored, and any alphabetic characters other than A, C, G, T (uppercase or lowercase) are converted to 'n' and excluded from alignment. If the file is called 'myseqs.fa', run GLAM on it by typing: 'glam-linux myseqs.fa' (replace 'glam-linux' with whatever you saved it as). The output will look something like this:

GLAM: Gapless Local Alignment of Multiple sequences
Compiled on Dec 11 2003

Run 1... 10345 iterations
Run 2... 15190 iterations
Run 3... 24749 iterations
Run 4... 20294 iterations
Run 5... 20488 iterations
Run 6... 16930 iterations
Run 7... 21583 iterations
Run 8... 23260 iterations
Run 9... 13733 iterations
Run 10... 14805 iterations
Calculating score distribution...
Calculating random walk parameters...

Best alignment found:
Score: 44.7738 bits  Width: 32  Sequences: 5  Runs: 6  E-value: 2.29
FirstSeq     164 GGACTAAGTTACTTAAACTGTTCAGGAGATAC 195  +  (8.36)
2ndSeq       244 GGGCATGGTGACCTTTCGCACTCTGGGCATGC 275  +  (9.91)
3rdSeq       244 GGTCAAGGTCACCGACAGCAGTAAGGGCTGAC 275  +  (11.6)
4thSeq       244 GGGCAAAGTGACTGGACATAGGAGTGGGACAC 275  +  (10.4)
LastSeq       92 GGGCAAAGCAACATAGCGGGGTAGGGTCCTCC  61  -  (7.88)
                 ** ****** ** *  *  * ** ** *   *

Other alignments:
Score: 43.4623 bits  Width: 38  Sequences: 5  Runs: 1  E-value: 5.68
Score: 43.2357 bits  Width: 19  Sequences: 5  Runs: 3  E-value: 6.65

glam myseqs.fa
5 sequences in file
Residue abundances: a=0.248002 c=0.251998 g=0.251998 t=0.248002
Pseudocounts: a=0.372002 c=0.377998 g=0.377998 t=0.372002
Max possible alignment width: 500
K: 0.178944
H: 1.25899

GLAM works by starting from a completely random alignment of the sequences, and making small refinements to it over many iterations, in an attempt to find the best possible alignment. Since this procedure does not guarantee finding the optimum alignment, GLAM repeats it 10 times from different starting points (10 runs). The idea is that if several of the runs converge to the same best alignment, we have increased confidence that it is indeed the optimum alignment. In this example 6 out of 10 runs gave the same best alignment.
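
The idea of judging convergence by agreement between runs can be illustrated with a toy example. The sketch below is not GLAM's algorithm: it uses a hypothetical one-dimensional score landscape and a simple stochastic hill climb (names such as toy_run and summarize_runs are invented for illustration), and only shows how one might count the runs that reach the same best result.

```python
import random
from collections import Counter

def toy_run(rng, landscape, steps=200):
    """Stochastic hill climb over a list of scores; returns the final index."""
    pos = rng.randrange(len(landscape))  # random starting point
    for _ in range(steps):
        cand = pos + rng.choice((-1, 1))
        # Move only to a valid neighbor that does not lower the score.
        if 0 <= cand < len(landscape) and landscape[cand] >= landscape[pos]:
            pos = cand
    return pos

def summarize_runs(results, landscape):
    """Return (best result, number of runs that ended on it)."""
    counts = Counter(results)
    best = max(counts, key=lambda r: landscape[r])
    return best, counts[best]

# Hypothetical scores with three local optima (indices 2, 6 and 10).
landscape = [0, 2, 5, 2, 1, 3, 9, 3, 0, 4, 6, 4]
rng = random.Random(1)
results = [toy_run(rng, landscape) for _ in range(10)]
best, n_agree = summarize_runs(results, landscape)
print(f"best position {best} (score {landscape[best]}) found by {n_agree} of 10 runs")
```

Runs that start in different places can get stuck on different local optima, so the number of runs agreeing on the best result is a rough indicator of how reliably it can be found.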

The score is GLAM's measure of how strong/well-conserved/striking the alignment is: the higher the better. The E-value indicates how often we would expect an alignment of this score or greater to exist among unrelated sequences just by chance. We hope to find E-values lower than 1, but in this example it is 2.29, so the alignment does not appear to be statistically significant. The stars indicate conserved columns that contribute positively to the score. The numbers in brackets are the marginal scores (in bits) of each segment in the alignment: i.e. the score gained by including this segment in the alignment rather than excluding it. We might feel more confident that segments with higher marginal scores are true motif instances. The marginal scores won't in general sum to the total alignment score.
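
If you want to check these numbers automatically, the summary line can be picked apart with a regular expression. The following is a minimal sketch based only on the sample output shown above; the function name parse_summary is made up here, and treating the E-value field as optional (for runs made with -e) is an assumption about the output format, which may also differ between GLAM versions.

```python
import re

# Pattern modeled on the sample line:
# "Score: 44.7738 bits  Width: 32  Sequences: 5  Runs: 6  E-value: 2.29"
SUMMARY_RE = re.compile(
    r"Score:\s*(?P<score>\S+)\s+bits\s+Width:\s*(?P<width>\d+)\s+"
    r"Sequences:\s*(?P<sequences>\d+)\s+Runs:\s*(?P<runs>\d+)"
    r"(?:\s+E-value:\s*(?P<evalue>\S+))?"
)

def parse_summary(line):
    """Return a dict of the summary fields, or None if the line does not match."""
    m = SUMMARY_RE.search(line)
    if m is None:
        return None
    d = m.groupdict()
    return {
        "score": float(d["score"]),
        "width": int(d["width"]),
        "sequences": int(d["sequences"]),
        "runs": int(d["runs"]),
        # Assumption: the E-value field is simply absent when -e is used.
        "evalue": float(d["evalue"]) if d["evalue"] is not None else None,
    }

line = "Score: 44.7738 bits  Width: 32  Sequences: 5  Runs: 6  E-value: 2.29"
summary = parse_summary(line)
significant = summary["evalue"] is not None and summary["evalue"] < 1
print(summary, "significant:", significant)
```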

Visualization

The program glam_logo.pl draws a sequence logo representation of the best alignment in a GLAM output file (in encapsulated PostScript format). After making the program executable (by typing chmod +x glam_logo.pl), run it with a command like this: glam_logo.pl glam_out_file mypic.eps

Repetitive Elements

One of the chief problems when using GLAM is the presence of ubiquitous repetitive or "low complexity" elements, such as Alus or A-rich tracts. These elements can often be aligned with extremely high statistical significance, perhaps overshadowing more interesting motifs. We suggest two ways to deal with this problem. One is to mask these elements prior to alignment, using programs such as RepeatMasker, which specializes in interspersed repeats, and nseg and dust, which specialize in low complexity elements and simple sequence repeats. The other is to apply GLAM repeatedly using the -g option to uncover further alignments beyond the strongest one.

RepeatMasker - AFA Smit & P Green, unpublished.
nseg - JC Wootton & S Federhen, Methods Enzymol. 1996;266:554-71.
dust - R Tatusov & D Lipman, unpublished.

These programs have parameters that vary the stringency of masking. It will be necessary to experiment with these parameters to get a balance between masking repetitive elements adequately but not masking too many potential motifs.
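
When experimenting with masking stringency, it can help to measure how much of each sequence ended up masked. The sketch below is a hypothetical helper, not part of GLAM or of the masking programs: it assumes masked residues appear either as lowercase letters (the kind that the -l option excludes) or as 'N', and reports the masked fraction per sequence.

```python
def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA-format lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            name, chunks = (header[0] if header else ""), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)

def masked_fraction(seq):
    """Fraction of letters that are lowercase or N (assumed to mean 'masked')."""
    letters = [ch for ch in seq if ch.isalpha()]
    if not letters:
        return 0.0
    masked = sum(1 for ch in letters if ch.islower() or ch == "N")
    return masked / len(letters)

# Hypothetical masked input; in practice, pass an open file instead of a list.
example = [">s1 some description", "ACGTacgt", "NNNN", ">s2", "ACGT"]
for name, seq in read_fasta(example):
    print(f"{name}\t{masked_fraction(seq):.2f}")
```

A sequence that is almost entirely masked leaves GLAM very little to align, which may be a sign that the masking parameters are too stringent.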

Options

There are many options for modifying GLAM's behavior. We describe them in approximate order of importance: many of them are rather specialized and you don't need to worry about them.

-h	Help: print documentation.
-n	This important parameter controls the tradeoff between speed and accuracy. If you don't play around with any other parameters, play around with this one. Each alignment run will continue until n (default = 10000) iterations have passed without improving on the best alignment found so far. We like to set n sufficiently high that at least 3 out of 10 runs converge to the same alignment. Low values of n are adequate when the problem size is small, i.e. when the sequences are short and more importantly there are few of them, but high values of n are needed for large problems. In addition, smaller values of n are sufficient when there is a strong alignment to be found, but larger values are necessary when there isn't, e.g. for finding the optimal alignment of random sequences. You'll have to choose n on a case-by-case basis, but to give some examples we have used n=1000 to align 5 x 500bp sequences, and n=20000 to align 20 x 1000bp sequences. For larger problems it may be impossible to converge reproducibly to the same exact alignment in reasonable time, but in these cases you can check that similar motifs are reproducibly obtained.
-r	The number of alignment runs (default = 10).
-1	(The digit, not the letter): just examine the direct strand (default = both strands).
-a	Minimum alignment width (default = 1).
-b	Maximum alignment width (default = 10000).
-z	Require every sequence to participate in the alignment.
-g	Supply a previous GLAM output file, and exclude the best alignment found previously from being recovered again. The previous GLAM output will be appended to the current output: if a file with multiple such outputs is supplied with -g, all best alignments found previously will be excluded.
-k	Prevent all residues participating in previous alignments from participating in this one. The default behavior is that any pair of residues aligned previously may not be aligned this time.
-l	(The letter, not the digit): exclude lowercase letters from being aligned. Lowercase letters are often used to indicate repetitive sequence.
-v	Verbose: if multiple runs return more than one alignment, print all alignments in full.
-f	Print this number of flanking residues, in lowercase, either side of the alignment (default = 0).
-q	Pretend that the background residue abundances equal 1/4, instead of estimating them from the input sequences. This option might be a good idea for aligning very short sequences that are mostly covered by the motif.
-d	Frequency of width-adjusting versus sequence-adjusting moves (default = 1). When the number of sequences is large compared to the sequence length, GLAM has difficulty widening the alignment, and may return excessively narrow alignments. Increasing the frequency of width-adjusting moves compensates for this problem to some extent. In theory this problem can always be overcome by making the -n parameter sufficiently large.
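
The stopping rule described for -n can be sketched in a few lines. This is only an illustration of the rule "stop after n iterations without improving on the best result so far": the propose and score functions are hypothetical stand-ins, and GLAM's actual alignment moves are not modeled.

```python
import random

def run_until_stale(propose, score, start, n, rng):
    """Iterate until n consecutive iterations pass without a new best score."""
    current = best = start
    best_score = score(start)
    stale = 0        # iterations since the best score last improved
    iterations = 0   # total iterations performed
    while stale < n:
        iterations += 1
        current = propose(current, rng)
        s = score(current)
        if s > best_score:
            best, best_score, stale = current, s, 0
        else:
            stale += 1
    return best, best_score, iterations

# Toy problem: a random walk on the integers, scored by closeness to 5.
rng = random.Random(1)
best, best_score, iterations = run_until_stale(
    propose=lambda x, r: x + r.choice((-1, 1)),
    score=lambda x: -abs(x - 5),
    start=0,
    n=100,
    rng=rng,
)
print(best, best_score, iterations)
```

Under this rule the total iteration count equals the iteration of the last improvement plus n. If the sample output in the Usage section was produced with the default n = 10000, then "Run 1... 10345 iterations" would mean that run 1 last improved its best alignment at iteration 345.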

Temperature options: GLAM's alignment strategy has a concept of 'temperature'. At low temperatures it strongly favors refinements that improve alignment quality, and at high temperatures it only weakly favors such refinements. If the temperature is too low it will get stuck in a local optimum, and if the temperature is too high it will never find good alignments. The default strategy (constant t=0.9) seems to work well in a variety of cases.
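
To get a feel for what temperature does, consider the Metropolis acceptance rule, a standard choice in simulated annealing. Whether GLAM uses exactly this rule is an assumption made here for illustration (see the publication for the actual procedure); the function name accept_probability is likewise invented.

```python
import math

def accept_probability(delta_score, temperature):
    """Metropolis-style probability of accepting a change in score (in nats).

    Improvements (delta_score >= 0) are always accepted; a change that lowers
    the score is accepted with probability exp(delta_score / temperature).
    """
    if delta_score >= 0:
        return 1.0
    return math.exp(delta_score / temperature)

# The same score-lowering change (-2 nats) at three temperatures.
for t in (0.1, 0.9, 10.0):
    print(f"t={t}: accept with probability {accept_probability(-2.0, t):.3g}")
```

At low temperature a change that lowers the score is almost never accepted, so the search can get stuck in a local optimum; at high temperature it is accepted almost as readily as an improvement, so the search wanders without settling on good alignments.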

-t	Initial temperature (default = 0.9).
-c	Cooling factor: multiply temperature by this amount each iteration (default = 1, i.e. constant temperature).
-m	Use the "modified Lam schedule" instead of the default geometric schedule. This is a strategy where the algorithm aims to achieve a target "accept rate", i.e. rate of altering the alignment versus leaving it unchanged per iteration. In the early phase of the algorithm, the target accept rate decays geometrically from 100% to 44%. In the middle phase it remains constant at 44%. In the final phase it decays geometrically to 0%. Whenever the actual accept rate is higher than the target, the temperature is multiplied by c; whenever it is lower than the target, the temperature is divided by c. If -m is selected, the -n option indicates the total number of iterations per run, not the number of iterations without improvement.
-w	Print energy (negative alignment score in nats), accept rate, temperature, width, and number of sequences in the alignment after each iteration.
-p	Pseudocount weight (default = 1.5).
-u	Uniform pseudocounts: set each pseudocount equal to p/4. Default = p * (background residue abundances).
-s	Seed for the random number generator (default = 1).
-e	Turn off the E-value calculation, in case it's too slow for you.
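
The target accept rate used with -m, and the temperature adjustment that follows it, can be sketched as below. The three-phase shape and the multiply/divide rule come from the -m description above. The phase boundaries (15% and 65% of the run) and the constants 560 and 440 are taken from the commonly cited form of the modified Lam schedule and are assumptions here, since this document does not state the values GLAM uses. The function names are invented.

```python
def target_accept_rate(i, total, early=0.15, late=0.65):
    """Target accept rate at iteration i of a run of `total` iterations."""
    x = i / total
    if x < early:
        # Early phase: geometric decay from 100% towards 44%.
        return 0.44 + 0.56 * 560.0 ** (-x / early)
    if x < late:
        # Middle phase: constant 44%.
        return 0.44
    # Final phase: geometric decay from 44% towards 0%.
    return 0.44 * 440.0 ** (-(x - late) / (1.0 - late))

def adjust_temperature(temperature, accept_rate, target, c):
    """Cool (multiply by c) if accepting too often, heat (divide by c) if too rarely."""
    if accept_rate > target:
        return temperature * c
    if accept_rate < target:
        return temperature / c
    return temperature

total = 10000
for i in (0, 1000, 5000, 8000, 10000):
    print(i, round(target_accept_rate(i, total), 4))
```

With a cooling factor c below 1, accepting more often than the target lowers the temperature and accepting less often raises it, so the observed accept rate is steered towards the target throughout the run.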

Example:

glam-linux -n5000 -l -v -f5 myseqs.fa
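
The -p and -u options can be illustrated with a short calculation. The sketch below (the function name pseudocounts is hypothetical) applies the two rules stated above: the pseudocount weight times the background residue abundances by default, or p/4 for each nucleotide with -u. Using the residue abundances from the sample output in the Usage section and the default weight of 1.5, it gives values that agree with the "Pseudocounts" line there to about six decimal places.

```python
def pseudocounts(abundances, weight=1.5, uniform=False):
    """Pseudocount for each residue, given its background abundance."""
    if uniform:
        # -u: split the weight equally (p/4 for the four nucleotides).
        return {base: weight / len(abundances) for base in abundances}
    # Default: weight times the background abundance of each residue.
    return {base: weight * freq for base, freq in abundances.items()}

# Residue abundances copied from the sample output.
abundances = {"a": 0.248002, "c": 0.251998, "g": 0.251998, "t": 0.248002}
print(pseudocounts(abundances))
print(pseudocounts(abundances, uniform=True))
```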

Known Limitations

The E-values become increasingly conservative as the number of sequences increases.
If the number of sequences is many-fold larger than the sequence length, GLAM has difficulty widening the alignment.
The E-value calculation aborts when given more than about 730 input sequences.

Other Motif Finding Programs

The Zlab Gene Regulation Hub lists many other motif discovery programs. We believe GLAM possesses a unique combination of advantages: 1) Automatic determination of the alignment width. 2) Calculation of the statistical significance of alignments. 3) Ability to find suboptimal alignments. 4) Robustness and rigor: by default GLAM carries out many alignment runs from different starting points. 5) Flexibility: you can search 1 or 2 DNA strands, place limits on the alignment width, and vary details of the refinement scheme.


Suggestions to: Martin Frith