Formats

.fax candidate files

These files superficially resemble FASTA-format files, but the differences are important. Each entry begins with a header line that starts with a '>' character, as in FASTA files. The text that follows the '>' is the name of the entry, and must be unique within the given file. The lines that follow contain the sequences to be evaluated as candidates. This format supports multiple-sequence candidates, so an entry may have several sequence lines, but each line may contain only a single sequence. A key indicating the organism/genome assembly of origin precedes each sequence, and the organism key and sequence are separated by a tab.

Example:

>mir-2a-2
dm2	ATCTAAGCCTCATCAAGTGGTTGTGATATGGATACCCAACGCATATCACAGCCAGCTTTGATGAGCTAGGAT
dp3	ATCTAAGCCTCATCAAGTGGTTGTGATATGGATACCCAACGCATATCACAGCCAGCTTTGATGAGCTAGGAT
>mir-2b-1
dm2	CTTCAACTGTCTTCAAAGTGGCAGTGACATGTTGTCAACAATATTCATATCACAGCCAGCTTTGAGGAGCGTTGCGG
dp3	CTGCGACGCTCTTTAAAGTGGCGGTGACGTGTTGGTAATAATATTCATATCACAGCCAGCTTTGAGGAGCGTTGCGG
>mir-3
dm2	GATCCTGGGATGCATCTTGTGCAGTTATGTTTCAATCTCACATCACTGGGCAAAGTGTGTCTCAAGATC
dp3	GATCCTGGGATGCATTTTGTGCAGTTATGTCTACGTGATCATCCTCATCACTGGGCAAAGTGTGTCTCAGGAT

Requirements:

-	All keys are unique within a file and may not contain spaces.
-	The set of organism/genome assembly keys is consistent across all entries in a file, and is also consistent across all files that are to be used for training or scoring in a single round of evaluation/elimination.
-	Lines beginning with a '#' will be ignored.
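
The rules above can be sketched as a small reader. This is a minimal illustration, not the parser used by the MiRscan scripts: the function name parse_fax is hypothetical, and skipping blank lines is an assumption that the format description does not state.

```python
def parse_fax(text):
    """Return {entry_name: {organism_key: sequence}} for .fax-format text."""
    entries = {}
    current = None
    for line in text.splitlines():
        # Assumption: blank lines are skipped; '#' lines are ignored per the format.
        if not line.strip() or line.startswith("#"):
            continue
        if line.startswith(">"):
            current = line[1:].strip()
            if current in entries:
                raise ValueError("duplicate entry name: %s" % current)
            entries[current] = {}
            continue
        if current is None:
            raise ValueError("sequence line found before any '>' line")
        # The organism key and the sequence are separated by a tab.
        key, sep, seq = line.partition("\t")
        if not sep or " " in key:
            raise ValueError("expected '<key><TAB><sequence>': %r" % line)
        entries[current][key] = seq.strip()
    # Every entry should use the same set of organism keys.
    key_sets = set(frozenset(seqs) for seqs in entries.values())
    if len(key_sets) > 1:
        raise ValueError("organism keys differ between entries")
    return entries
```

Calling parse_fax on the contents of a file returns one dictionary per entry, keyed by organism key, so a two-species file like the example above yields entries with the keys dm2 and dp3.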

.train foreground files

These files define a foreground and a background set of miRNA hairpins. The foreground set is defined within the file, while the background set is defined by references within this file to a set of .fax-format files. Each line of the file starts with a one- or two-letter code:

b lines indicate a background file. The 'b' and the filename are separated by a space or tab. 'b' lines are required if the file is to be used for training, but not if it is only being used to store candidates.
cn lines indicate the name of a miRNA/candidate. The 'cn' is separated from the name by a space or tab. The name may not contain spaces. The 'cn' line for a miRNA/candidate must precede its 'cm' and 'ch' lines.
cm lines contain the organism key and sequence identity of a mature miRNA. The 'cm', organism key and sequence are separated by spaces or tabs. The sequence may be in upper- or lower-case nucleotide letters: 'A', 'T', 'C', 'G', 'U', and 'N' ('N' indicates an unknown identity). 'U' and 'T' will be treated equivalently. The sequence of the mature miRNA/candidate must be found in the candidate's 'ch' entry (hairpin sequence) with the same organism key.
ch lines contain the organism key and sequence identity of a miRNA hairpin precursor. The 'ch', organism key and sequence are separated by spaces or tabs. The sequence may be in upper- or lower-case nucleotide letters: 'A', 'T', 'C', 'G', 'U', and 'N' ('N' indicates an unknown identity). 'U' and 'T' will be treated equivalently. The hairpin sequence must contain the mature miRNA/candidate sequence found in the candidate's 'cm' entry with the same organism key.

Example:

b backgroundFile1.fax
b backgroundFile2.fax
b backgroundFile3.fax
cn      mir-2a-2
cm      dm2     uaucacagccagcuuugaugagc
ch      dm2     ATCTAAGCCTCATCAAGTGGTTGTGATATGGATACCCAACGCATATCACAGCCAGCTTTGATGAGCTAGGAT
cm      dp3     uaucacagccagcuuugaugagc
ch      dp3     AUCUAAGCCUCAUCAAGUGGUUGUGAUAUGGAUACCCAACGCAUAUCACAGCCAGCUUUGAUGAGCUAGGAU
cn      mir-2b-1
cm      dm2     uaucacagccagcuuugaggagc
ch      dm2     CTTCAACTGTCTTCAAAGTGGCAGTGACATGTTGTCAACAATATTCATATCACAGCCAGCTTTGAGGAGCGTTGCGG
cm      dp3     uaucacagccagcuuugaggagc
ch      dp3     CUGCGACGCUCUUUAAAGUGGCGGUGACGUGUUGGUAAUAAUAUUCAUAUCACAGCCAGCUUUGAGGAGCGUUGCGG
cn      mir-3
cm      dm2     ucacugggcaaagugugucuca
ch      dm2     GATCCTGGGATGCATCTTGTGCAGTTATGTTTCAATCTCACATCACTGGGCAAAGTGTGTCTCAAGATC
cm      dp3     ucacugggcaaagugugucuca
ch      dp3     GAUCCUGGGAUGCAUUUUGUGCAGUUAUGUCUACGUGAUCAUCCUCAUCACUGGGCAAAGUGUGUCUCAGGAU

Requirements:

-	All names are unique within a file and may not contain spaces.
-	The set of organism/genome assembly keys is consistent across all entries in a file, and is also consistent across all of the listed background files.
-	Each miRNA/candidate entry begins with a 'cn' line and is followed by a series of 'ch' and 'cm' lines.
-	Each organism key for each miRNA/candidate must have both a 'cm' and a 'ch' line.
-	Note that if this format is being used to store candidates and not for training, the 'cm' and 'b' lines may be left out.
-	Lines beginning with a '#' will be ignored.
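
A reader for this format, together with the check that each mature sequence occurs in the hairpin with the same organism key, might look like the sketch below. The names normalize, parse_train and check_candidates are hypothetical and this is not the code used by mirscanTrainer.py. Sequences are upper-cased and 'U' is mapped to 'T' so that the two letters compare as equivalent.

```python
def normalize(seq):
    """Upper-case a sequence and map 'U' to 'T' so RNA and DNA spellings match."""
    return seq.upper().replace("U", "T")

def parse_train(text):
    """Return (background_files, candidates) for .train-format text.

    candidates maps name -> {"cm": {org_key: seq}, "ch": {org_key: seq}}.
    """
    backgrounds = []
    candidates = {}
    current = None
    for line in text.splitlines():
        # Assumption: blank lines are skipped; '#' lines are ignored per the format.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()
        tag = fields[0]
        if tag == "b" and len(fields) >= 2:
            # Keep everything after the 'b' as the file name.
            backgrounds.append(line.split(None, 1)[1].strip())
        elif tag == "cn" and len(fields) == 2:
            current = fields[1]
            if current in candidates:
                raise ValueError("duplicate candidate name: %s" % current)
            candidates[current] = {"cm": {}, "ch": {}}
        elif tag in ("cm", "ch") and len(fields) == 3:
            if current is None:
                raise ValueError("'%s' line found before any 'cn' line" % tag)
            candidates[current][tag][fields[1]] = normalize(fields[2])
        else:
            raise ValueError("unrecognized line: %r" % line)
    return backgrounds, candidates

def check_candidates(candidates):
    """Raise ValueError unless each 'cm' sequence is found in its matching 'ch'."""
    for name, entry in candidates.items():
        for org, mature in entry["cm"].items():
            hairpin = entry["ch"].get(org)
            if hairpin is None or mature not in hairpin:
                raise ValueError("%s/%s: mature sequence not in hairpin" % (name, org))
```

Because 'cm' and 'b' lines are optional when the file only stores candidates, check_candidates looks only at the 'cm' entries that are present and does not require one for every hairpin.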

.matrix scoring matrix files

These files store the scores associated with each possible return value of each feature that a particular criteria file evaluates. In the file, every non-indented line holds the name of a feature/criterion. Each of the tab-indented lines that follow holds an allowed value for that feature, then another tab, then the score associated with that value for that feature. These are the required fields, but the .matrix files generated by mirscanTrainer.py also contain additional tab-delimited columns: for each value, the number of instances in the foreground and background sets, respectively, followed by the frequencies of the value in the foreground and background sets, respectively.

Example:

# number = 20, fcount = 24
# training file: TestData/dme-dps.train
# Fri Oct 26 00:37:06 2007
# org keys:     dm2     dp3
nuc9_s1
        A       0.593   6       3       0.25    0.15
        T       -0.384  7       8       0.292   0.4
        C       0.371   5       3       0.208   0.15
        G       -0.214  6       6       0.25    0.3
        N       -0.214  0       0       0.0     0.0
nuc9_s2
        A       1.008   6       2       0.25    0.1
        T       -0.536  7       9       0.292   0.45
        C       0.371   5       3       0.208   0.15
        G       -0.214  6       6       0.25    0.3
        N       -0.214  0       0       0.0     0.0
bp_matrix_C_n-8
        paired  0.952   15      6       0.625   0.3
        unpaired        -0.826  9       14      0.375   0.7
bp_matrix_C_n-7
        paired  0.952   15      6       0.625   0.3
        unpaired        -0.826  9       14      0.375   0.7
bp_matrix_C_n-6
        paired  0.344   14      9       0.583   0.45
        unpaired        -0.367  10      11      0.417   0.55
loop_dis_G
        0       -2.659  0       8       0.0     0.4
        1       -1.589  0       0       0.0     0.0
        2       -0.944  0       1       0.0     0.05
        3       -0.43   0       0       0.0     0.0
        4       -0.43   0       0       0.0     0.0
        5       -0.835  0       1       0.0     0.05
        6       -0.411  0       0       0.0     0.0
        7       -0.207  0       0       0.0     0.0
        8       -0.43   0       0       0.0     0.0
        9       -0.996  0       1       0.0     0.05
        10      -0.986  0       1       0.0     0.05
        11      -0.372  0       0       0.0     0.0
        12      -0.004  1       1       0.042   0.05
        13      0.582   1       0       0.042   0.0
        14      0.837   1       0       0.042   0.0
        15      1.023   3       1       0.125   0.05
        16      1.674   3       0       0.125   0.0
        17      1.785   5       0       0.208   0.0
        18      0.624   0       1       0.0     0.05
        19      1.466   4       0       0.167   0.0
        20      1.266   1       0       0.042   0.0
        21      1.229   2       0       0.083   0.0
        22      0.815   1       0       0.042   0.0
        23      0.081   0       0       0.0     0.0
        24      -0.135  1       1       0.042   0.05
        25      0.03    0       0       0.0     0.0
        26      0.257   1       0       0.042   0.0
        27      -0.587  0       1       0.0     0.05
        28      -0.388  0       0       0.0     0.0
        29      -0.185  0       0       0.0     0.0
        30      -0.163  0       0       0.0     0.0
        31      -0.163  0       0       0.0     0.0
        32      -0.163  0       0       0.0     0.0
        33      -0.229  0       0       0.0     0.0
        34      -0.807  0       0       0.0     0.0
        35      -1.614  0       3       0.0     0.15

Requirements:

-	All permitted values associated with a feature must be present.
-	Feature names must match fdict keys in the corresponding criteria file. There must be a 1:1 correspondence between features in the .matrix file and in the criteria file.
-	Columns must be tab-separated, and value lines must be indented with tabs.
-	Value lines must immediately follow their feature name lines, and each feature name may only be listed once per file.
-	Lines beginning with a '#' will be ignored.
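
As an illustration of the layout, the following sketch reads the two required columns (value and score) and ignores any additional columns. The names parse_matrix and sum_scores are hypothetical and this is not the MiRscan implementation. Values are kept as strings, so a numeric value such as 12 is stored under the key "12".

```python
def parse_matrix(text):
    """Return {feature_name: {value: score}} for .matrix-format text."""
    matrix = {}
    feature = None
    for line in text.splitlines():
        # Assumption: blank lines are skipped; '#' lines are ignored per the format.
        if not line.strip() or line.startswith("#"):
            continue
        if line.startswith("\t"):
            # Tab-indented value line: value, score, then optional extra columns.
            if feature is None:
                raise ValueError("value line found before any feature name")
            fields = line.strip().split("\t")
            if len(fields) < 2:
                raise ValueError("value line needs a value and a score: %r" % line)
            matrix[feature][fields[0]] = float(fields[1])
        else:
            # Non-indented line: the name of a feature.
            feature = line.strip()
            if feature in matrix:
                raise ValueError("feature listed more than once: %s" % feature)
            matrix[feature] = {}
    return matrix

def sum_scores(matrix, observed):
    """Add up the scores for a {feature_name: value} mapping of observed values."""
    return sum(matrix[name][value] for name, value in observed.items())
```

A lookup such as parse_matrix(text)["nuc9_s1"]["A"] then returns the score for that feature value, and sum_scores adds the scores for one candidate's observed values.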

.scr scoring result files

This file stores the results of miRNA/candidate scoring. Each line of the file corresponds to a particular candidate, whose name appears at the beginning of the line. Following the name, each criterion/feature name is listed, followed by the score given for that feature. In addition to the features described in the criteria file, the following values are given:

totscore: the sum of all the feature scores.
loc_[org]: the predicted position for the 5p nucleotide of the mature miRNA in the hairpin with organism key [org], indexed starting from 1. One of these values is given for each organism key.

Example:

mir-2a-2 totscore 54.832 loc_dp3 10 loc_dm2 10 loop_dis_G 0.582 nuc9_s1 -0.384 nuc9_s2 -0.536 bp_matrix_C_n-8 0.952 bp_matrix_C_n-7 0.952 bp_matrix_C_n-6 0.344 
mir-2b-1 totscore 46.112 loc_dp3 50 loc_dm2 50 loop_dis_G 1.466 nuc9_s1 0.593 nuc9_s2 1.008 bp_matrix_C_n-8 -0.826 bp_matrix_C_n-7 0.952 bp_matrix_C_n-6 0.344 
mir-3 totscore 42.026 loc_dp3 48 loc_dm2 43 loop_dis_G 1.023 nuc9_s1 0.371 nuc9_s2 0.371 bp_matrix_C_n-8 -0.826 bp_matrix_C_n-7 -0.826 bp_matrix_C_n-6 -0.367

Requirements:

-	Names cannot contain spaces.
-	One line per candidate.
-	Score sum and location of the predicted mature miRNA in each hairpin of the candidate must be specified as described above.
-	Scores must be given in floating point decimal form.
-	Lines beginning with a '#' will be ignored.
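
Each line can be read as a name followed by alternating key/value pairs, as in the sketch below. The function name parse_scr is hypothetical. The sketch treats keys beginning with loc_ as integer positions and every other key, including totscore, as a floating-point score.

```python
def parse_scr(text):
    """Return {candidate_name: {"loc": {org_key: int}, "scores": {key: float}}}."""
    results = {}
    for line in text.splitlines():
        # Assumption: blank lines are skipped; '#' lines are ignored per the format.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()
        name, rest = fields[0], fields[1:]
        if len(rest) % 2 != 0:
            raise ValueError("expected key/value pairs after the name: %r" % line)
        locations = {}
        scores = {}
        for key, value in zip(rest[0::2], rest[1::2]):
            if key.startswith("loc_"):
                # loc_[org]: 1-indexed position in the hairpin for organism key [org].
                locations[key[len("loc_"):]] = int(value)
            else:
                scores[key] = float(value)
        results[name] = {"loc": locations, "scores": scores}
    return results
```

Since the positions are indexed starting from 1, subtract 1 before using one as an index into a Python string.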

.py criteria files

These files are formatted for evaluation by the Python interpreter. They must generate two variables, bound to the indicated values, in the global environment:

mirscan: bound to a function that applies scores to a provided set of candidates, or can be used to examine feature value frequencies to derive scores.
fdict: bound to a dictionary whose keys are the names of features and whose values are the feature objects themselves. The contents of this dictionary will determine what aspects of the miRNA candidates are evaluated and scored.

Each value in fdict must be an instance of either string_feature or number_feature classes, both of which are provided in mirbaseModule.py. For each instance, the user must also define the following attributes:

fx: a function that takes in a single argument, args, and returns either a number or a string (depending on which class of feature it is). args is a dictionary defined in the mirscan function; it must be set up to contain all data necessary for fx to evaluate a candidate.
kl: a list of all the acceptable return values from fx.
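
The relationship between fdict, fx and kl can be sketched as follows. Everything in this sketch is a stand-in. The class StandInStringFeature only takes the place of string_feature from mirbaseModule.py, whose constructor is not described here. The feature name nuc1 and the args key 'mature' are hypothetical. The mirscan function is omitted because its signature is not specified in this section.

```python
class StandInStringFeature(object):
    """Placeholder for mirbaseModule.string_feature (an assumption, not the real class)."""
    pass

def nuc1_fx(args):
    """Return the first nucleotide of the mature sequence, with 'U' treated as 'T'.

    'mature' is a hypothetical key; a real criteria file reads whatever keys its
    own mirscan function places in args.
    """
    return args["mature"][0].upper().replace("U", "T")

# fdict maps feature names to feature objects.
fdict = {}
fdict["nuc1"] = StandInStringFeature()
fdict["nuc1"].fx = nuc1_fx                        # the evaluation function
fdict["nuc1"].kl = ["A", "T", "C", "G", "N"]      # every value fx may return

# Evaluating the feature for one hypothetical candidate:
value = fdict["nuc1"].fx({"mature": "uaucacagccagcuuugaugagc"})
```

Here value is a member of fdict["nuc1"].kl, which is the property that kl exists to guarantee. Each entry of kl would then correspond to one value line under the nuc1 feature in the .matrix file.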

Prototype criteria files for one- or two-sequence candidates can be downloaded as part of the All scripts file package. The mirscan function in either prototype can be modified by the user to support multi-sequence criteria. Each file also includes a commented-out prototype implementation of a feature instance. For examples of feature objects, see the scoring matrices used in Ruby et al., Genome Res. 2007 that are available for download as part of the Sample files file package.