Runtime Structure/User Manual


Motif interface

This interface was designed so that diverse sequence motif scoring systems could be applied to real sequences in a similar manner, and so that scoring systems with essentially different methods of evaluation could easily be substituted for one another without changing the user code. The interface is a dictionary that must contain the following (string) keys: "name", "string", "len", and "apply".

The "name" and "string" keys access motif documentation. The "name" key is intended as a simple label, and the "string" key is intended to point to some printable description of the motif (for instance, a Position-Specific Scoring Matrix might output a tab-delimited table of the scores assigned to each nucleotide identity at each position). The "len" key points to the length of sequence that can be evaluated by the motif scoring system. It specifies the required length of the input string taken by the function to which "apply" points. This function takes a string of the specified length as its only argument and returns a number. That number is the score for the motif's match to the provided sequence.
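As a minimal sketch of the interface described above, the following Python builds a toy motif dictionary and scans a sequence with it. The motif (gc_count_motif) and the helper (scan) are hypothetical names used only for illustration; they are not part of the module.

```python
def gc_count_motif(length):
    """Hypothetical toy motif: the score is the number of G or C characters."""
    def apply(seq):
        # The interface requires input of exactly motif["len"] characters.
        if len(seq) != length:
            raise ValueError("expected a string of length %d" % length)
        return sum(1 for ch in seq if ch in "GC")

    return {
        "name": "gc_count",
        "string": "counts G/C in a window of %d characters" % length,
        "len": length,
        "apply": apply,
    }


def scan(motif, sequence):
    """Score every window of motif['len'] characters in sequence."""
    width = motif["len"]
    return [motif["apply"](sequence[i:i + width])
            for i in range(len(sequence) - width + 1)]


motif = gc_count_motif(3)
scores = scan(motif, "ATGCGT")
print(motif["name"], scores)
```

Because scan touches only the "len" and "apply" keys, any other dictionary that follows the interface could be passed in its place without changing the calling code.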




SEQUENCE

SEQUENCE is the most basic definition of a sequence motif. It defines a motif as a string of characters. Matches to the motif are matches to that string.

The SEQUENCE constructor takes a string as its argument and returns a motif interface which considers the motif to be a perfect match to that sequence.
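A constructor of this kind might look roughly like the sketch below. The name sequence_motif_sketch is hypothetical, and the score values (1 for a perfect match, 0 otherwise) are an assumption; the manual only states that "apply" returns a number.

```python
def sequence_motif_sketch(pattern):
    """Hypothetical stand-in for a SEQUENCE-style constructor."""
    def apply(seq):
        if len(seq) != len(pattern):
            raise ValueError("expected a string of length %d" % len(pattern))
        # Assumption: 1 for a perfect match, 0 for anything else.
        return 1 if seq == pattern else 0

    return {
        "name": pattern,
        "string": pattern,
        "len": len(pattern),
        "apply": apply,
    }


motif = sequence_motif_sketch("CTGTTTCA")
hit = motif["apply"]("CTGTTTCA")
miss = motif["apply"]("CTGTTTCG")
print(hit, miss)
```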




SEQ_LIST

SEQ_LIST defines a motif in which each position can take any one of a set of identities.

The SEQ_LIST constructor takes a list of strings as its argument. Each string in the list represents a position in the motif, and each character in the string represents an acceptable match for a character in that position. The special character '*' anywhere in the position's string represents that ANY character is acceptable in that position. The constructor returns a motif interface reflecting these properties. An example:

['C','T','G','T','T','T','C','A','*','*','*','*','GA']
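The matching rule for the example list above can be sketched as follows. The function name seq_list_motif_sketch and the 1/0 score values are assumptions made for illustration; only the per-position matching rule and the '*' wildcard come from the description.

```python
def seq_list_motif_sketch(positions, name="seq_list"):
    """Hypothetical stand-in for a SEQ_LIST-style constructor."""
    def apply(seq):
        if len(seq) != len(positions):
            raise ValueError("expected a string of length %d" % len(positions))
        # A position matches if its string contains '*' or the observed character.
        matched = all("*" in allowed or ch in allowed
                      for ch, allowed in zip(seq, positions))
        # Assumption: 1 for a match at every position, 0 otherwise.
        return 1 if matched else 0

    return {
        "name": name,
        "string": " ".join(positions),
        "len": len(positions),
        "apply": apply,
    }


positions = ['C', 'T', 'G', 'T', 'T', 'T', 'C', 'A', '*', '*', '*', '*', 'GA']
motif = seq_list_motif_sketch(positions)
ends_in_g = motif["apply"]("CTGTTTCAACGTG")   # last position allows G
ends_in_a = motif["apply"]("CTGTTTCATTTTA")   # last position allows A
ends_in_c = motif["apply"]("CTGTTTCAACGTC")   # C is not allowed in the last position
print(ends_in_g, ends_in_a, ends_in_c)
```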




PSSM

The PSSM constructor will take in a text string which is a table of space- or tab-separated columns depicting nucleotide frequencies and construct a PSSM from it. The format for the text string will be like the following example:

background    .25  .25  .25  .25
table-name    A    T    G    C
0             50   97   200  31
5             12   4    350  7
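One way a table like the example above could be turned into a scoring function is sketched here. Several details are assumptions rather than documented behaviour: that the background values are listed in the same column order as the header letters, that each remaining row is one motif position holding raw counts, that the first field of each row is a label that can be ignored, and that scores are base-2 log-odds with a pseudocount of 1. The name parse_pssm_sketch is hypothetical.

```python
import math

EXAMPLE = """
background   .25  .25 .25 .25
table-name    A   T   G   C
0             50 97   200 31
5             12  4   350 7
"""


def parse_pssm_sketch(text, pseudocount=1.0):
    """Hypothetical parser: counts table -> motif dictionary of log-odds scores."""
    rows = [line.split() for line in text.strip().splitlines() if line.strip()]
    background_row, header_row, count_rows = rows[0], rows[1], rows[2:]
    letters = header_row[1:]
    # Assumption: background values follow the header's column order.
    background = dict(zip(letters, (float(x) for x in background_row[1:])))

    columns = []
    for row in count_rows:
        # Assumption: row[0] is a label; the rest are counts per letter.
        counts = [float(x) + pseudocount for x in row[1:]]
        total = sum(counts)
        columns.append({
            letter: math.log2((count / total) / background[letter])
            for letter, count in zip(letters, counts)
        })

    def apply(seq):
        if len(seq) != len(columns):
            raise ValueError("expected a string of length %d" % len(columns))
        return sum(column[ch] for column, ch in zip(columns, seq))

    return {
        "name": header_row[0],
        "string": text.strip(),
        "len": len(columns),
        "apply": apply,
    }


pssm = parse_pssm_sketch(EXAMPLE)
print(pssm["name"], pssm["len"], pssm["apply"]("GG"), pssm["apply"]("CC"))
```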

Requirements of the input string:




TWO_PSSM

TWO_PSSM combines two PSSMs and adds a third term, also derived from a log-odds score, for the distance between the two PSSMs. The starting position of the sequence read is defined by the user input. The shorter of the two PSSMs is slid through the permissible range of positions in the input sequence; at each position, its distance score and motif score are summed. The maximum of these sums over the permitted positions is then added to the score from the other, static PSSM, and the total is returned.

The constructor takes two PSSMs, which can be generated with the PSSM constructor, in the order in which they are expected to appear (5' end first). The remaining argument is a string describing the distribution of distances between them. Its format is similar to that of the PSSM input string, but the column headers are integers representing distances between the two PSSMs rather than letters. The name in that string will be the name given to the motif. An example is shown below:

name   3    4     5     6     7     8     10   11
blah   47   289   5935  4443  2323  1119  334  23

There is no background line at the top because the background is assumed to be an even distribution over all permissible positions. The string 'blah' can be anything but whitespace; it is just a placeholder.

TWO_PSSMs have an additional key: two_pssm['positions'](seq) returns the starting positions of pssmA and pssmB as a two-item list of integers. The integers are the indexes at which the two PSSMs start in the input seq, which is a string.
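The sliding-and-summing scheme can be illustrated with a deliberately simplified sketch. It is not the module's implementation: it assumes the first motif is always the static one and starts at index 0, that the second motif is the one that slides, that a distance is the number of characters between the end of the first motif and the start of the second, and that the input length is the first motif plus the largest distance plus the second motif. The names two_motif_sketch and exact, and the distance scores in gap_scores, are hypothetical.

```python
def exact(pattern):
    """Hypothetical helper motif: 2.0 for an exact match, 0.0 otherwise."""
    return {
        "name": pattern,
        "string": pattern,
        "len": len(pattern),
        "apply": lambda seq: 2.0 if seq == pattern else 0.0,
    }


def two_motif_sketch(motif_a, motif_b, gap_scores, name="two_motif"):
    """Combine a static motif, a sliding motif, and per-distance scores."""
    len_a, len_b = motif_a["len"], motif_b["len"]
    total_len = len_a + max(gap_scores) + len_b  # assumed input length

    def best(seq):
        if len(seq) != total_len:
            raise ValueError("expected a string of length %d" % total_len)
        static_score = motif_a["apply"](seq[:len_a])
        candidates = []
        for gap, gap_score in gap_scores.items():
            start_b = len_a + gap
            motif_score = motif_b["apply"](seq[start_b:start_b + len_b])
            # Distance score and motif score are summed at each position.
            candidates.append((gap_score + motif_score, start_b))
        sliding_score, start_b = max(candidates)
        return static_score + sliding_score, [0, start_b]

    return {
        "name": name,
        "string": "%s + distance + %s" % (motif_a["name"], motif_b["name"]),
        "len": total_len,
        "apply": lambda seq: best(seq)[0],
        "positions": lambda seq: best(seq)[1],
    }


gap_scores = {1: 0.5, 2: 1.0, 3: -1.0}  # made-up distance scores
combined = two_motif_sketch(exact("AA"), exact("G"), gap_scores)
score = combined["apply"]("AATTGC")
starts = combined["positions"]("AATTGC")
print(score, starts)
```

In this toy input the second motif matches best two characters after the first, so the reported start positions are 0 and 4.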




DISTANCE_PSSM

DISTANCE_PSSM is the distance portion of TWO_PSSM on its own. The constructor takes a string describing the length distribution. Its format is similar to that of the PSSM input string, but the column headers are integers representing distances rather than letters. The name in that string will be the name given to the motif. An example is shown below:

name   3    4     5     6     7     8     10   11
blah   47   289   5935  4443  2323  1119  334  23

There is no background line at the top because the background is assumed to be an even distribution over all permissible positions. The string 'blah' can be anything but whitespace; it is just a placeholder.

The DISTANCE_PSSM does not fully implement the motif interface because a single length value is meaningless here. Instead of 'len', there are two keys, 'min len' and 'max len', corresponding to the minimum and maximum key values.
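A sketch of how such a distance table could be scored against the even background follows. The function name parse_distance_sketch is hypothetical, as are two assumptions: that scores are base-2 log-odds of the observed frequency over a uniform background across the listed distances, and that "apply" receives a string whose length is the distance being scored.

```python
import math

EXAMPLE = """
name   3   4    5   6     7   8    10  11
blah  47  289 5935 4443 2323 1119 334  23
"""


def parse_distance_sketch(text):
    """Hypothetical parser: distance counts -> dictionary with log-odds scores."""
    header, counts_row = [line.split() for line in text.strip().splitlines()]
    distances = [int(x) for x in header[1:]]
    counts = [float(x) for x in counts_row[1:]]
    total = sum(counts)
    # Even background over the listed (permissible) distances.
    uniform = 1.0 / len(distances)
    scores = {
        distance: math.log2((count / total) / uniform)
        for distance, count in zip(distances, counts)
    }

    def apply(seq):
        # Assumption: the distance is the length of the string passed in.
        # Distances that are not listed raise KeyError.
        return scores[len(seq)]

    return {
        "name": header[0],
        "string": text.strip(),
        "min len": min(distances),
        "max len": max(distances),
        "apply": apply,
    }


distance = parse_distance_sketch(EXAMPLE)
common = distance["apply"]("N" * 5)
rare = distance["apply"]("N" * 3)
print(distance["min len"], distance["max len"], common, rare)
```

With the example counts, a distance of 5 is far more frequent than the even background and scores above zero, while a distance of 3 is rarer and scores below zero.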




touScoreModule Attributes

The touScoreModule provides an implementation of the 21U-RNA-associated upstream motif. The motif is implemented as two PSSMs and a DISTANCE_PSSM, bound to the variable names largeMotif, distance, and smallMotif. Several functions are also provided for applying the 21U-RNA motif to real sequences: