Runtime Structure

Criteria/Features

Criteria (features) are aspects of a hairpin that can be examined to gain insight into whether it is a miRNA precursor, such as "number of unpaired nucleotides" or "miRNA 5p nucleotide identity". After a criteria file is evaluated, the features are represented as entries in a dictionary, fdict, in the global environment. Each feature has a unique name (its key in the dictionary) and is an instance of either string_feature or number_feature, two classes implemented in mirscanModule.py.

Each instance of either feature class requires that the user add two attributes:

Feature.fx: a function for evaluating the feature in question; it must take self and the args dictionary as its only arguments and return a string.
Feature.kl: a list of all return values of Feature.fx that are accepted for that feature.

There are also four class-defined attributes that satisfy the requirements of a feature interface. They are:

Feature.type: a string labelling the instance as a 'string'-type or 'number'-type feature
Feature.kv: a function for binning the string returned by fx. For string features, the argument value x is returned unchanged; for number features, the input value is placed into the appropriate bin among those defined by Feature.kl, and that bin's key is returned.
Feature.ex: a function that takes the set of arguments needed to evaluate a hairpin candidate and returns the appropriate key.
Feature.pseudo: a function for adding pseudocounts; its argument is a dictionary whose keys are the items of Feature.kl and whose values are the numbers of hairpins for which Feature.ex returned the given key.

The user may give a Feature instance additional attributes to attach any other values that are necessary for the feature's evaluation.
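The attribute layout described above can be sketched roughly as follows. This is a minimal illustration, not the code in mirscanModule.py: the StringFeature class is a hypothetical stand-in for string_feature, the feature name 'mirna_5p_nt' and the args keys 'seq' and 'start' are invented for the example, and the way ex calls fx is an assumption.

```python
class StringFeature:
    """Hypothetical stand-in for string_feature in mirscanModule.py."""
    type = 'string'

    def kv(self, x):
        # For string features, the value returned by fx is itself the key.
        return x

    def ex(self, args):
        # Assumed wiring: fx is stored as a plain function on the instance,
        # so self is passed explicitly along with the args dictionary.
        return self.kv(self.fx(self, args))


def five_prime_nt(self, args):
    # Return the nucleotide at the miRNA 5p position as a string.
    # 'seq' and 'start' are illustrative keys, not a documented schema.
    return args['seq'][args['start']]


fdict = {}
feature = StringFeature()
feature.fx = five_prime_nt                # user-supplied evaluation function
feature.kl = ['A', 'C', 'G', 'T', 'N']    # user-supplied list of accepted keys
fdict['mirna_5p_nt'] = feature

args = {'seq': 'GGCTAGCTTATCAGACTGATG', 'start': 4}
key = fdict['mirna_5p_nt'].ex(args)
print(key)
```

Keeping fx and kl as per-instance attributes means a criteria file only has to supply the feature-specific parts, while the shared interface stays on the class.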

The global environment created by a criteria file also contains the variable mirscan, which is bound to a function that evaluates miRNA hairpin candidates in terms of the features in fdict. If the user wants any pre-processed information about the hairpin candidate, such as a secondary structure or an alignment, to be passed to the Feature.fx functions, those data must be included as values in the dictionary args, which is passed to all of the Feature.fx functions in lieu of specialized sets of arguments.

Training

Abstractly, training comprises an evaluation of a series of foreground hairpins and a series of background hairpins. To this end, mirscanTrainer.py creates three data structures: 1) a set of features to be evaluated, 2) a set of foreground hairpins (the training set; samples of real miRNA hairpins), and 3) a set of background candidate hairpins. Background hairpins will be scored later, and this set is represented as a list of instances of the Candidate class, which is implemented in mirscanModule.py. The arguments taken by the constructor are:

name: a non-empty string that is the name for the candidate; should not contain any whitespace.
orgToSeq: a dictionary whose keys are organism/genome names (see the specifications for .fax or .train files) and whose values are hairpin sequences (all-uppercase strings of the characters 'A', 'T', 'C', 'G', and 'N').

The foreground miRNA hairpins also have the start positions of the miRNAs defined in a parallel list. Each item in this list is a dictionary whose keys are the same as orgToSeq and whose values are integers corresponding to the miRNA 5p nucleotide positions in the corresponding hairpin (indexed from 0). The index of each Candidate in the candidates list equals the index of the corresponding start position dictionary in the starts list.
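The parallel candidates and starts lists might look something like the sketch below. The Candidate class here is a hypothetical stand-in for the one in mirscanModule.py; its validation checks simply mirror the constraints listed above and may not exist in the real class. The hairpin names, organism names, and sequences are made up.

```python
class Candidate:
    """Hypothetical stand-in for the Candidate class in mirscanModule.py."""

    def __init__(self, name, orgToSeq):
        # Illustrative checks mirroring the documented constraints.
        if not name or any(ch.isspace() for ch in name):
            raise ValueError('name must be non-empty and contain no whitespace')
        for org, seq in orgToSeq.items():
            if set(seq) - set('ATCGN'):
                raise ValueError('unexpected characters in sequence for ' + org)
        self.name = name
        self.orgToSeq = orgToSeq


# Foreground hairpins and their miRNA start positions, index-aligned.
candidates = [
    Candidate('hairpin_1', {'org_a': 'GGCTAGCTTATCAGACTGATG',
                            'org_b': 'GGCTAGCTTATCAGACTGTTG'}),
    Candidate('hairpin_2', {'org_a': 'TTGAGGTAGTAGGTTGTATAG',
                            'org_b': 'TTGAGGTAGTAGGTTGTGTAG'}),
]
starts = [
    {'org_a': 4, 'org_b': 4},
    {'org_a': 3, 'org_b': 3},
]

# Look up the 5p nucleotide of each miRNA using the 0-based start positions.
five_prime = {}
for cand, start in zip(candidates, starts):
    # Each start dictionary shares its keys with the candidate's orgToSeq.
    assert set(cand.orgToSeq) == set(start)
    five_prime[cand.name] = {org: cand.orgToSeq[org][pos]
                             for org, pos in start.items()}
print(five_prime)
```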

For each dataset (the foreground dataset and the background dataset), the number of candidates returning each possible value of each feature is tracked in a two-tiered dictionary structure whose first (outer) keys are the names of features, whose second (inner) keys are the items of that feature's Feature.kl list, and whose values are the corresponding hairpin counts.
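As a rough sketch of this two-tiered count structure, assuming invented feature names and keys (the helper functions empty_counts and tally are hypothetical and are not part of mirscanTrainer.py):

```python
def empty_counts(feature_keys):
    # Outer keys: feature names; inner keys: that feature's kl items.
    return {name: {key: 0 for key in keys}
            for name, keys in feature_keys.items()}


def tally(counts, observed):
    # observed maps each feature name to the key returned for one hairpin.
    for name, key in observed.items():
        counts[name][key] += 1


# Illustrative feature names and kl lists; the real ones come from fdict.
feature_keys = {
    'mirna_5p_nt': ['A', 'C', 'G', 'T', 'N'],
    'unpaired_nts': ['0', '4', '8'],
}

foreground_counts = empty_counts(feature_keys)
tally(foreground_counts, {'mirna_5p_nt': 'T', 'unpaired_nts': '4'})
tally(foreground_counts, {'mirna_5p_nt': 'T', 'unpaired_nts': '0'})
print(foreground_counts['mirna_5p_nt'])
```

A second dictionary of the same shape would hold the background counts.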

Scoring

The runtime structure during scoring is very similar to that during training. The mirscan function that is implemented in the criteria file scores a set of candidate miRNA hairpins using a scoring matrix, which is represented as a two-tiered dictionary structure whose first (outer) keys are the names of features, whose second (inner) keys are the items of that feature's Feature.kl list, and whose values are the floating-point scores that are added when the feature's function returns that value.
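The lookup-and-add step can be illustrated with a small sketch. This is not the mirscan function itself: score_candidate is a hypothetical helper, and the feature names, keys, and score values are invented for the example.

```python
# Scoring matrix: feature name -> returned key -> floating-point score.
score_matrix = {
    'mirna_5p_nt': {'A': 0.5, 'C': -1.0, 'G': -0.5, 'T': 1.25, 'N': 0.0},
    'unpaired_nts': {'0': 0.75, '4': 0.25, '8': -1.5},
}


def score_candidate(matrix, observed):
    # observed maps each feature name to the key returned for this hairpin;
    # the total is the sum of the matching matrix entries.
    return sum(matrix[name][key] for name, key in observed.items())


total = score_candidate(score_matrix,
                        {'mirna_5p_nt': 'T', 'unpaired_nts': '4'})
print(total)
```

Because the matrix uses the same outer and inner keys as the training counts, the same evaluated keys serve for both tallying and scoring.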