These files superficially resemble FASTA-format files, but the differences are important. Each entry begins with a line that starts with a greater-than sign ('>'), as in FASTA files. The text that follows the '>' is the name of the entry and must be unique within the file. The lines that follow contain the sequences to be evaluated as candidates. This format supports multiple-sequence candidates, so each of these lines may contain only a single sequence. A key indicating the organism/genome assembly of origin precedes each sequence, and the organism key and sequence are separated by a tab.
Example:
>mir-2a-2
dm2	ATCTAAGCCTCATCAAGTGGTTGTGATATGGATACCCAACGCATATCACAGCCAGCTTTGATGAGCTAGGAT
dp3	ATCTAAGCCTCATCAAGTGGTTGTGATATGGATACCCAACGCATATCACAGCCAGCTTTGATGAGCTAGGAT
>mir-2b-1
dm2	CTTCAACTGTCTTCAAAGTGGCAGTGACATGTTGTCAACAATATTCATATCACAGCCAGCTTTGAGGAGCGTTGCGG
dp3	CTGCGACGCTCTTTAAAGTGGCGGTGACGTGTTGGTAATAATATTCATATCACAGCCAGCTTTGAGGAGCGTTGCGG
>mir-3
dm2	GATCCTGGGATGCATCTTGTGCAGTTATGTTTCAATCTCACATCACTGGGCAAAGTGTGTCTCAAGATC
dp3	GATCCTGGGATGCATTTTGTGCAGTTATGTCTACGTGATCATCCTCATCACTGGGCAAAGTGTGTCTCAGGAT

Requirements:
- All keys are unique within a file and may not contain spaces.
- The set of organism/genome assembly keys is consistent across all entries in a file, and is also consistent across all files that are to be used for training or scoring in a single round of evaluation/elimination.
- Lines beginning with a '#' will be ignored.
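The rules above can be sketched as a small reader. This is an illustrative sketch only, not part of the MiRscan scripts: the function name `parse_candidates` is hypothetical, it is written for Python 3, it reads "keys" as covering both entry names and organism keys, and it assumes that blank lines may be skipped (the format description does not say).

```python
def parse_candidates(text):
    """Return {entry_name: {organism_key: sequence}} for one file's text."""
    entries = {}
    current = None
    for line in text.splitlines():
        # '#' lines are ignored per the format; skipping blank lines is an assumption.
        if not line.strip() or line.startswith("#"):
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            if not name or " " in name:
                raise ValueError(f"bad entry name: {name!r}")
            if name in entries:
                raise ValueError(f"duplicate entry name: {name!r}")
            current = entries[name] = {}
            continue
        if current is None:
            raise ValueError("sequence line before the first '>' line")
        # Organism key and sequence are separated by a single tab.
        key, tab, sequence = line.partition("\t")
        if not tab or " " in key or not sequence.strip():
            raise ValueError(f"expected 'key<TAB>sequence': {line!r}")
        if key in current:
            raise ValueError(f"organism key {key!r} repeated in one entry")
        current[key] = sequence.strip()
    # Every entry must use the same set of organism keys.
    if len({frozenset(seqs) for seqs in entries.values()}) > 1:
        raise ValueError("organism keys differ between entries")
    return entries
```

Checking consistency across several files, as the second requirement asks, would amount to comparing the key sets returned for each file.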
These files define a foreground and a background set of miRNA hairpins. The foreground set is defined within the file, while the background set is defined by references within this file to a set of .fax-format files. Each line of the file begins with a one- or two-letter code ('b', 'cn', 'cm', or 'ch'):
b backgroundFile1.fax
b backgroundFile2.fax
b backgroundFile3.fax
cn mir-2a-2
cm dm2 uaucacagccagcuuugaugagc
ch dm2 ATCTAAGCCTCATCAAGTGGTTGTGATATGGATACCCAACGCATATCACAGCCAGCTTTGATGAGCTAGGAT
cm dp3 uaucacagccagcuuugaugagc
ch dp3 AUCUAAGCCUCAUCAAGUGGUUGUGAUAUGGAUACCCAACGCAUAUCACAGCCAGCUUUGAUGAGCUAGGAU
cn mir-2b-1
cm dm2 uaucacagccagcuuugaggagc
ch dm2 CTTCAACTGTCTTCAAAGTGGCAGTGACATGTTGTCAACAATATTCATATCACAGCCAGCTTTGAGGAGCGTTGCGG
cm dp3 uaucacagccagcuuugaggagc
ch dp3 CUGCGACGCUCUUUAAAGUGGCGGUGACGUGUUGGUAAUAAUAUUCAUAUCACAGCCAGCUUUGAGGAGCGUUGCGG
cn mir-3
cm dm2 ucacugggcaaagugugucuca
ch dm2 GATCCTGGGATGCATCTTGTGCAGTTATGTTTCAATCTCACATCACTGGGCAAAGTGTGTCTCAAGATC
cm dp3 ucacugggcaaagugugucuca
ch dp3 GAUCCUGGGAUGCAUUUUGUGCAGUUAUGUCUACGUGAUCAUCCUCAUCACUGGGCAAAGUGUGUCUCAGGAU

Requirements:
- All names are unique within a file and may not contain spaces.
- The set of organism/genome assembly keys is consistent across all entries in a file, and is also consistent across all of the listed background files.
- Each miRNA/candidate entry begins with a 'cn' line and is followed by a series of 'ch' and 'cm' lines.
- Each organism key for each miRNA/candidate must have both a 'cm' and a 'ch' line.
- Note that if this format is being used to store candidates and not for training, the 'cm' and 'b' lines may be left out.
- Lines beginning with a '#' will be ignored.
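As a rough illustration of these rules, the sketch below reads such a file into a list of background file names and a dictionary of candidates. It is not taken from the MiRscan scripts: `parse_training` and its `require_mature` flag are hypothetical names, the code targets Python 3, and it assumes that fields are separated by whitespace and that blank lines may be skipped, neither of which the description states.

```python
def parse_training(text, require_mature=True):
    """Return (background_files, {name: {organism_key: {"cm": seq, "ch": seq}}})."""
    backgrounds = []
    candidates = {}
    current = None
    for line in text.splitlines():
        # '#' lines are ignored per the format; skipping blank lines is an assumption.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()  # assumption: whitespace-separated fields
        tag = fields[0]
        if tag == "b" and len(fields) == 2:
            backgrounds.append(fields[1])
        elif tag == "cn" and len(fields) == 2:
            name = fields[1]
            if name in candidates:
                raise ValueError(f"duplicate candidate name: {name!r}")
            current = candidates[name] = {}
        elif tag in ("cm", "ch") and len(fields) == 3:
            if current is None:
                raise ValueError(f"'{tag}' line before the first 'cn' line")
            org_key, sequence = fields[1], fields[2]
            slot = current.setdefault(org_key, {})
            if tag in slot:
                raise ValueError(f"second '{tag}' line for {org_key!r}")
            slot[tag] = sequence
        else:
            raise ValueError(f"unrecognized line: {line!r}")
    # Each organism key needs a 'ch' line, and a 'cm' line too when training.
    for name, by_org in candidates.items():
        for org_key, seqs in by_org.items():
            if "ch" not in seqs or (require_mature and "cm" not in seqs):
                raise ValueError(f"{name}/{org_key}: missing 'cm' or 'ch' line")
    if len({frozenset(by_org) for by_org in candidates.values()}) > 1:
        raise ValueError("organism keys differ between candidates")
    return backgrounds, candidates
```

Passing `require_mature=False` corresponds to the case where the file only stores candidates and the 'cm' and 'b' lines are left out.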
These files store the scores associated with each possible returned value of each feature that a particular criteria file evaluates. In the file, every non-indented line holds the name of a feature/criterion. Each of the tab-indented lines that follow holds an allowed value for that feature, then another tab, then the score associated with that value for that feature. These are the required fields, but the .matrix files generated by mirscanTrainer.py also contain additional information in further tab-delimited columns: for each value, the number of instances in the foreground and background sets, respectively, followed by the frequencies of the value in the foreground and background sets, respectively.
Example:
# number = 20, fcount = 24
# training file: TestData/dme-dps.train
# Fri Oct 26 00:37:06 2007
# org keys: dm2 dp3
nuc9_s1
	A	0.593	6	3	0.25	0.15
	T	-0.384	7	8	0.292	0.4
	C	0.371	5	3	0.208	0.15
	G	-0.214	6	6	0.25	0.3
	N	-0.214	0	0	0.0	0.0
nuc9_s2
	A	1.008	6	2	0.25	0.1
	T	-0.536	7	9	0.292	0.45
	C	0.371	5	3	0.208	0.15
	G	-0.214	6	6	0.25	0.3
	N	-0.214	0	0	0.0	0.0
bp_matrix_C_n-8
	paired	0.952	15	6	0.625	0.3
	unpaired	-0.826	9	14	0.375	0.7
bp_matrix_C_n-7
	paired	0.952	15	6	0.625	0.3
	unpaired	-0.826	9	14	0.375	0.7
bp_matrix_C_n-6
	paired	0.344	14	9	0.583	0.45
	unpaired	-0.367	10	11	0.417	0.55
loop_dis_G
	0	-2.659	0	8	0.0	0.4
	1	-1.589	0	0	0.0	0.0
	2	-0.944	0	1	0.0	0.05
	3	-0.43	0	0	0.0	0.0
	4	-0.43	0	0	0.0	0.0
	5	-0.835	0	1	0.0	0.05
	6	-0.411	0	0	0.0	0.0
	7	-0.207	0	0	0.0	0.0
	8	-0.43	0	0	0.0	0.0
	9	-0.996	0	1	0.0	0.05
	10	-0.986	0	1	0.0	0.05
	11	-0.372	0	0	0.0	0.0
	12	-0.004	1	1	0.042	0.05
	13	0.582	1	0	0.042	0.0
	14	0.837	1	0	0.042	0.0
	15	1.023	3	1	0.125	0.05
	16	1.674	3	0	0.125	0.0
	17	1.785	5	0	0.208	0.0
	18	0.624	0	1	0.0	0.05
	19	1.466	4	0	0.167	0.0
	20	1.266	1	0	0.042	0.0
	21	1.229	2	0	0.083	0.0
	22	0.815	1	0	0.042	0.0
	23	0.081	0	0	0.0	0.0
	24	-0.135	1	1	0.042	0.05
	25	0.03	0	0	0.0	0.0
	26	0.257	1	0	0.042	0.0
	27	-0.587	0	1	0.0	0.05
	28	-0.388	0	0	0.0	0.0
	29	-0.185	0	0	0.0	0.0
	30	-0.163	0	0	0.0	0.0
	31	-0.163	0	0	0.0	0.0
	32	-0.163	0	0	0.0	0.0
	33	-0.229	0	0	0.0	0.0
	34	-0.807	0	0	0.0	0.0
	35	-1.614	0	3	0.0	0.15

Requirements:
- All permitted values associated with a feature must be present.
- Feature names must match fdict keys in the corresponding criteria file. There must be a 1:1 correspondence between features in the .matrix file and in the criteria file.
- Columns must be tab-separated, and value lines must be indented with tabs.
- Value lines must immediately follow their feature name lines, and each feature name may only be listed once per file.
- Lines beginning with a '#' will be ignored.
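The layout described above maps naturally onto a nested dictionary. The following is a minimal sketch, not the reader used by the MiRscan scripts: `parse_matrix` is a hypothetical name, the code targets Python 3, values are kept as strings, only the two required columns are read (any extra count and frequency columns are ignored), and skipping blank lines is an assumption.

```python
def parse_matrix(text):
    """Return {feature_name: {value: score}} from the text of a .matrix file."""
    matrix = {}
    current = None
    for line in text.splitlines():
        # '#' lines are ignored per the format; skipping blank lines is an assumption.
        if not line.strip() or line.startswith("#"):
            continue
        if line.startswith("\t"):
            # Tab-indented value line: value, score, then optional extra columns.
            if current is None:
                raise ValueError("value line before the first feature name")
            fields = line.strip().split("\t")
            if len(fields) < 2:
                raise ValueError(f"expected 'value<TAB>score': {line!r}")
            value, score = fields[0], float(fields[1])
            if value in current:
                raise ValueError(f"value {value!r} listed twice")
            current[value] = score
        else:
            # Non-indented line: a feature name, allowed once per file.
            feature = line.strip()
            if feature in matrix:
                raise ValueError(f"feature {feature!r} listed twice")
            current = matrix[feature] = {}
    return matrix
```

With the result in hand, the 1:1 correspondence requirement could be checked by comparing `set(matrix)` with the set of fdict keys from the criteria file.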
This file stores the results of miRNA/candidate scoring. Each line of the file corresponds to a particular candidate, whose name appears at the beginning of the line. Following the name, each criterion/feature name is listed, followed by the score given for that feature. In addition to the features described in the criteria file, each line gives the score sum ('totscore') and, for each organism key, the location of the predicted mature miRNA in that hairpin ('loc_' followed by the key):
mir-2a-2 totscore 54.832 loc_dp3 10 loc_dm2 10 loop_dis_G 0.582 nuc9_s1 -0.384 nuc9_s2 -0.536 bp_matrix_C_n-8 0.952 bp_matrix_C_n-7 0.952 bp_matrix_C_n-6 0.344
mir-2b-1 totscore 46.112 loc_dp3 50 loc_dm2 50 loop_dis_G 1.466 nuc9_s1 0.593 nuc9_s2 1.008 bp_matrix_C_n-8 -0.826 bp_matrix_C_n-7 0.952 bp_matrix_C_n-6 0.344
mir-3 totscore 42.026 loc_dp3 48 loc_dm2 43 loop_dis_G 1.023 nuc9_s1 0.371 nuc9_s2 0.371 bp_matrix_C_n-8 -0.826 bp_matrix_C_n-7 -0.826 bp_matrix_C_n-6 -0.367

Requirements:
- Names cannot contain spaces.
- One line per candidate.
- Score sum and location of the predicted mature miRNA in each hairpin of the candidate must be specified as described above.
- Scores must be given in floating point decimal form.
- Lines beginning with a '#' will be ignored.
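A results file of this shape can be read line by line as a name followed by label/value pairs. The sketch below is only an illustration under stated assumptions: `parse_scores` is a hypothetical name, the code targets Python 3, fields are assumed to be separated by whitespace, labels starting with 'loc_' are read as integer positions (as they appear in the example), and every other value is read as a float.

```python
def parse_scores(text):
    """Return {candidate_name: {label: value}} from the text of a results file."""
    results = {}
    for line in text.splitlines():
        # '#' lines are ignored per the format; skipping blank lines is an assumption.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()  # assumption: whitespace-separated fields
        name, pairs = fields[0], fields[1:]
        if len(pairs) % 2:
            raise ValueError(f"unpaired label or value for {name!r}")
        if name in results:
            raise ValueError(f"candidate {name!r} appears on two lines")
        values = {}
        for label, raw in zip(pairs[0::2], pairs[1::2]):
            # 'loc_<key>' entries are positions; everything else is a score.
            values[label] = int(raw) if label.startswith("loc_") else float(raw)
        if "totscore" not in values:
            raise ValueError(f"no totscore for {name!r}")
        results[name] = values
    return results
```

For instance, `sorted(results, key=lambda n: results[n]["totscore"], reverse=True)` would list candidate names from highest to lowest total score.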
These files are formatted for evaluation by the Python interpreter. They must generate two variables, bound to the indicated values, in the global environment:
Each value in fdict must be an instance of either the string_feature or the number_feature class, both of which are provided in mirbaseModule.py. For each instance, the user must also define the following attributes:
Prototype criteria files for one- or two-sequence candidates can be downloaded as part of the "All scripts" file package. The user can modify the mirscan function in either of these prototypes to support multi-sequence criteria. Each file also includes a commented-out prototype implementation of a feature instance. For examples of feature objects, see the scoring matrices used in Ruby et al., Genome Res. 2007, which are available for download as part of the "Sample files" file package.