Possum

Detect cis-elements in DNA sequences

Possum Instructions

Possum predicts cis-elements in DNA sequences using the standard method of position-specific scoring matrices (PSSMs). It measures how closely every sequence fragment resembles each chosen cis-element matrix by calculating a log-likelihood ratio score (base e), and returns the high-scoring fragments.
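The scoring step described above can be sketched as follows. This is a minimal illustration, not Possum's actual code: the function names build_llr_matrix and scan are hypothetical, the background is assumed to be uniform (1/4 per nucleotide), only the forward strand is scanned, and the count matrix is the AML-1a example shown later on this page.

```python
import math

# AML-1a counts (A, C, G, T per position), copied from the TRANSFAC example on this page.
AML1A_COUNTS = [
    [5, 1, 2, 49],
    [2, 2, 52, 1],
    [4, 14, 1, 38],
    [0, 0, 57, 0],
    [1, 0, 55, 1],
    [1, 4, 0, 52],
]

def build_llr_matrix(counts, pseudocount=0.375, background=(0.25, 0.25, 0.25, 0.25)):
    """Turn count rows into natural-log likelihood ratios (hypothetical helper)."""
    matrix = []
    for row in counts:
        total = sum(row) + 4 * pseudocount
        matrix.append([
            math.log(((count + pseudocount) / total) / bg)
            for count, bg in zip(row, background)
        ])
    return matrix

def scan(sequence, llr_matrix, threshold):
    """Return (1-based start, fragment, score) for fragments scoring above threshold."""
    index = {"a": 0, "c": 1, "g": 2, "t": 3}
    seq = sequence.lower()
    width = len(llr_matrix)
    hits = []
    for start in range(len(seq) - width + 1):
        fragment = seq[start:start + width]
        # Fragments containing 'n' or other non-ACGT letters are skipped.
        if any(base not in index for base in fragment):
            continue
        score = sum(llr_matrix[i][index[base]] for i, base in enumerate(fragment))
        if score > threshold:
            hits.append((start + 1, fragment, score))
    return hits

hits = scan("ccTGTGGTaa", build_llr_matrix(AML1A_COUNTS), threshold=5.0)
for start, fragment, score in hits:
    print(start, fragment, round(score, 2))
```

Each position contributes the log of (matrix probability / background probability) for the observed nucleotide, so a fragment's score is positive when it looks more like the cis-element than like background sequence.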

Sequence Format

Sequences may be entered in FASTA, raw, or GenBank format. Any non-alphabetic characters in the sequence will be ignored, and any alphabetic characters other than A, C, G and T (uppercase or lowercase) will be converted to 'n' and excluded from motif matching. If GenBank format is used, Possum will read and display any 'CDS' (protein-coding region) annotations. Limits: at most 20 sequences, with a total length of up to 100 kb.
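The character-handling rule can be illustrated with a short sketch. It assumes raw sequence text (FASTA title lines or GenBank annotations would need to be stripped first), and the function name clean_sequence is hypothetical rather than part of Possum.

```python
def clean_sequence(raw):
    """Drop non-alphabetic characters; map letters other than A/C/G/T to 'n'."""
    cleaned = []
    for ch in raw:
        if not ch.isalpha():
            continue  # digits, whitespace and punctuation are ignored
        cleaned.append(ch if ch in "ACGTacgt" else "n")
    return "".join(cleaned)

print(clean_sequence("ACG TNN-12rya\n"))
```

Here the space, hyphen, digits and newline disappear, while N, r and y each become 'n'.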

GenBank Identifiers

GenBank identifiers may be entered as accession numbers (e.g. NC_001669), 'accession.version' numbers (e.g. NC_001669.1), or GI numbers (e.g. 9628421).

Set Subsequence

Limit the search to a subsequence by entering its start and end coordinates. (The first nucleotide in the sequence has coordinate 1.) This option will be ignored if more than one sequence is entered.
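As a small illustration of the 1-based coordinates, the sketch below extracts a subsequence. It assumes the end coordinate is inclusive, which this page does not state explicitly; the function name subsequence is hypothetical.

```python
def subsequence(seq, start, end):
    """Return seq[start..end] using 1-based coordinates (end assumed inclusive)."""
    if not (1 <= start <= end <= len(seq)):
        raise ValueError("coordinates must satisfy 1 <= start <= end <= sequence length")
    return seq[start - 1:end]

print(subsequence("ACGTACGT", 3, 5))
```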

Format for User-defined Cis-elements

Cis-elements can be entered as TRANSFAC-style matrices, which look like this:

NA   AML-1a
XX
DE   runt-factor AML-1
XX
BF   T02256; AML1a; Species: human, Homo sapiens.
XX
P0      A      C      G      T
01      5      1      2     49      T
02      2      2     52      1      G
03      4     14      1     38      T
04      0      0     57      0      G
05      1      0     55      1      G
06      1      4      0     52      T

You can copy and paste these directly from the TRANSFAC website. All lines except the name line (beginning with 'NA') and the nucleotide frequency lines (beginning with digits) are ignored and not required. The name line is required, and must appear above the nucleotide frequency lines.

Alternatively, each cis-element can have a title line consisting of ">" followed by the name of the element, followed by lines of 4 numbers each that describe the nucleotide frequencies at each position in the cis-element. For example:

>element1
0  4 2 14
12 0 0 8
8  0 1 11
20 0 0 0
>element2
13 1 1 5
...

These numbers might come from a multiple alignment of experimentally determined cis-elements. The first column indicates the number of adenines observed in each position, the second column the number of cytosines, the third column the number of guanines, and the fourth column the number of thymines. Gaps in the cis-element may be indicated by entering a single "b" or "n" on a line. Possum will use background nucleotide frequencies at these positions. This option allows users to specify multipartite cis-elements.

The two formats can be mixed.
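A parser for mixed input might look like the sketch below. It is an assumption-laden illustration, not Possum's parser: the function name parse_matrices is hypothetical, and the most recent title line ('NA' or ">") is used to decide whether a numeric line carries a leading position number. Gap lines ("b" or "n") are stored as None.

```python
def parse_matrices(text):
    """Parse TRANSFAC-style and '>'-style matrices into {name: rows} (hypothetical helper)."""
    matrices = {}
    rows = None          # rows of the matrix currently being read
    transfac = False     # True when the current matrix started with an 'NA' line
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            rows = matrices[line[1:].strip()] = []
            transfac = False
        elif line.startswith("NA"):
            rows = matrices[line[2:].strip()] = []
            transfac = True
        elif rows is None:
            continue  # nothing to attach this line to yet
        elif line.lower() in ("b", "n"):
            rows.append(None)  # gap position: background frequencies apply
        elif line[0].isdigit():
            fields = line.split()
            if transfac:
                fields = fields[1:]  # drop the leading position number
            if len(fields) < 4:
                raise ValueError("expected 4 numbers per position: " + line)
            # Any trailing consensus letter (5th field) is ignored.
            rows.append([float(value) for value in fields[:4]])
        # All other lines (XX, DE, BF, P0, ...) are ignored.
    return matrices
```

For example, feeding it the AML-1a block followed by the ">element1" block would yield two entries, with 6 and 4 rows respectively.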

Score Threshold

Sequence fragments with log-likelihood ratio scores higher than this value will be reported.

Residue Abundance Range

The local abundances of A, C, G, and T at each point in the sequence will be estimated by looking this far in either direction. Local residue abundances often vary considerably along a sequence.
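One way to picture this estimate is a sliding window centred on each position. The sketch below is an assumption about how such an estimate could be computed, not a description of Possum's internals: the window includes the position itself, is truncated at the sequence ends, counts only A, C, G and T, and falls back to 1/4 each if the window holds no unambiguous bases. The function name local_abundances is hypothetical.

```python
def local_abundances(seq, position, radius):
    """Estimate (A, C, G, T) fractions within `radius` of a 0-based position."""
    lo = max(0, position - radius)
    hi = min(len(seq), position + radius + 1)
    window = seq[lo:hi].lower()
    counts = [window.count(base) for base in "acgt"]
    total = sum(counts)
    if total == 0:
        return [0.25, 0.25, 0.25, 0.25]  # no usable bases nearby
    return [count / total for count in counts]

print(local_abundances("AAAACCCC", 4, 2))
```

In this example the window covers "AACCC", so the estimate is 0.4 for A and 0.6 for C.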

Assume Residue Abundances = 1/4

This option can be useful for analyzing very short sequences. It nullifies the "residue abundance range" option.

Pseudocount

This value will be added to all the counts in the cis-element matrices. Pseudocounts are often used in estimating true abundances from a limited number of observations. The default value of 0.375 was obtained by a fitting procedure to all transcription factor binding site matrices in the TRANSFAC database. If your matrices contain probabilities rather than counts, you should probably set this parameter to 0.
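The effect of the pseudocount on a single matrix position can be sketched as follows. The normalisation shown (add the pseudocount to each of the four counts, then divide by the new total) is a common convention and is assumed here; the function name position_probabilities is hypothetical.

```python
def position_probabilities(counts, pseudocount=0.375):
    """Convert one row of (A, C, G, T) counts to probabilities after adding a pseudocount."""
    total = sum(counts) + pseudocount * len(counts)
    return [(count + pseudocount) / total for count in counts]

# Position 04 of the AML-1a example: G was observed 57 times, the others never.
with_pseudocount = position_probabilities([0, 0, 57, 0])
without_pseudocount = position_probabilities([0, 0, 57, 0], pseudocount=0)
print(with_pseudocount)
print(without_pseudocount)
```

Without a pseudocount the unobserved nucleotides get probability exactly 0, whose logarithm is undefined, so a single mismatch at that position would rule out a match entirely. With a pseudocount of 0, a row that already holds probabilities summing to 1 passes through unchanged, which is why 0 is suggested for probability matrices.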
