Supplemental Data for:
Lewis et al., Cell 120, pp. 15–20
Experimental Procedures and False-Positive Estimates
Algorithmic Description for TargetScanS
The TargetScanS algorithm is a simplified version of the TargetScan algorithm
(Lewis et al., 2003) that searches multiple alignments to identify conserved
W-C hexamer seed matches to the designated seed region of the miRNA (bases 2–7),
flanked by either a W-C match to the m8 position of the miRNA or a conserved
adenosine in the t1 position of the target, designated as the t1A anchor.
Identification of Conserved 3' Pairing
A modified version of the TargetScan algorithm was used for the identification
of predicted target sites with conserved pairing to the 3' portion of the miRNA
(Figure 2). This algorithm proceeded by first finding matches to the miRNA
seed that are W-C except at one position at which they have either a conserved
G:U wobble or a conserved mismatched to the seed. This initial match was then
extended with W-C pairing to m8 and by identifying the presence of a t1A anchor
(W-C pairing to m1 was not scored). To identify pairing to the 3' portion of
the miRNA, the miRNA and target site candidate that included the seed match
and 15 mRNA bases upstream of this match were co-folded using subroutines imported
from the RNAlib C program library of the Vienna RNA package (Hofacker et al.,
1994), while incorporating constraints on the pairing of the miRNA seed. This
routine was repeated using aligned sequences from each vertebrate genome. Predicted
target sites were accepted in the 3'-pairing analysis if in each genome they
contained a contiguous helix of at least six basepairs (allowing for a single
G:U wobble), even if in the different genomes this helix involved different
3' residues of the miRNA.
MicroRNA Datasets
A set of 117 human miRNAs (Supplemental Table S1) representing 148 human miRNA
genes with membership in 62 conserved vertebrate miRNA families was assembled
using the Rfam miRNA registry (http://www.sanger.ac.uk/Software/Rfam)
and established criteria (Ambros et al., 2003). MicroRNA families were defined
by grouping miRNAs that share a common conserved seed region spanning nucleotides
2–7 (although in any analysis involving m8 matches, miR-101 and miR-144
miRNAs, which have the same seed, were regarded as separate families because
they differ at m8, bringing to 63 the total number of families for these analyses).
One representative from each miRNA family was required to be conserved with
no more than one mismatch to the sequence for mouse, rat, dog, and chicken sequences
from the UCSC genome browser multiz-8-way whole-genome alignments (Karolchik
et al., 2003; Blanchette et al., 2004). In a number of cases, a related miRNA
from the chicken genome could not be found in the multiz-8-way alignments, and
a chicken miRNA from the miRNA registry that satisfied aforementioned alignment
criteria was used in its place. To account for documented 5' heterogeneity of
miR-124 (Lagos-Quintana et al., 2002), two forms of miR-124 were included separately
among the miRNA families: the longer, less frequently observed form contains
an additional 5' U relative to the shorter form and is listed as miR-124u (Supplemental
Tables S1, S2, and S3).
MicroRNA seed sequences corresponding to selected sets of families from the
miRNA dataset were used in the analyses in Figures 1B–1G. These sets are
listed in Supplemental Table S1 and described below:
Panels 1B, 1F, and 1G: all 63 miRNAs families described above (62 in the
case of Seed or Seed + t1A searches).
Panel 1C: 48 miRNA families representing only those sequences that correspond
to seed regions (nucleotides 2–7) with no overlapping relationship with
a shifted seed sequence of another miRNA family (from the same superfamily)
were chosen to ensure the proper register of seed matches and conservation in
the surrounding bases.
Panel 1D: 9 miRNA families corresponding to miRNA sequences that have
a conserved m1 nucleotide other than U and do not have the same seed sequence
as an miRNA with a U at m1.
Panel 1E: 36 miRNA families corresponding to miRNA sequences that have
a conserved m9 nucleotide other than U and do not have the same seed sequence
as an miRNA with a U at m9.
Genomic Datasets
The chromosomal coordinates of the 3' UTRs of all human genes from the “known
genes” dataset (dated 7/2004) of the UCSC genome browser annotation database
(http://genome.ucsc.edu) were used to define an initial dataset of human 3'
UTR sequences. This set was augmented by taking the union of these regions with
analogous regions defining the 3' UTRs of overlapping human Refseq mRNAs (Pruitt
et al., 2003). The corresponding sequence coordinates were retrieved from the
UCSC annotation database multiz-8-way multiple alignments (Blanchette et al.,
2004), containing aligned orthologous sequences from recent assemblies of the
mouse (mm5, 5/2004), rat (rn3, 6/2003), dog (canFam1, 7/2004), chicken (galGal2,
2/2004), Fugu (fr1, 8/2002), and zebrafish genomes (danRer1, 11/2003). When
an annotated 3' UTR sequence overlapped an open reading frame, the overlap was
masked to prevent contamination of the 3' UTR dataset with protein-coding sequences.
Using the knownCanonical database from the UCSC genome browser, alternate isoforms
of a common gene were identified. The longest 3' UTR among each set of multiple
isoforms was chosen. The resulting dataset contained 17,850 aligned mammalian
3' UTRs (human/mouse/rat/dog) and 10,938 aligned vertebrate 3' UTRs (human/mouse/rat/dog/chicken).
For a few genes, a longer aligned mammalian 3' UTR isoform was not conserved
in chicken while a shorter 3' UTR isoform was conserved to chicken. These
rare cases resulted in a specific vertebrate 3' UTR isoform not being included
in the mammalian set.
The 5' UTR and ORF datasets were compiled using genomic coordinates from the
human RefSeq mRNA database of the UCSC genome browser to retrieve regions of
the multiz-8-way alignments in a manner analogous to the construction of the
3' UTR datasets. Isoforms of a common gene were identified after mapping
the knownCanonical database to RefSeq. As before, the single longest sequence
was chosen from each set of isoforms. In addition, all ORF sequences were
required to begin with a conserved start codon, and all protein-coding sequence
was masked in the 5' UTR dataset. The resulting dataset of 5' UTRs contained
6623 sequences, and the resulting dataset of ORFs contained 11,830 sequences.
Density of Conservation in 3' UTRs
Analysis of Figure 1F: For each 3' UTR in the dataset used in the 5-genome
analysis, the number of conserved 7-mers was counted and a measure of the density
of conservation in that 3' UTR was calculated by determining the average number
of conserved 7-mers per 1000 nt. The 3' UTRs were sorted by this density measure
and then assigned to bins such that each bin contained a sufficient number of
3' UTRs to give a total of ~8000 conserved 7-mers per bin.
Analysis of Figure 1B (in islands): The aligned 3' UTR sequences within
250 nt 5' and 3' of a conserved seed match were examined to determine a local
density of conservation. All target sites located in a ~500-nt region
with fewer than a total of 50 conserved heptamers in the upstream and downstream
windows were designated as occurring in islands of conservation. For cases where
a 3' UTR boundary occurred within 250 nt, the number of conserved 7-mers per
1000 nt was calculated and those sites with a local density of less than 100
conserved 7-mers per 1000 nt were included in the islands of conservation set.
Generation of Control Sequences
Mononucleotide, dinucleotide, hexamer, heptamer, and octamer counts and frequencies
were determined for all human 3' UTR sequences. Sets of control sequences were
designed for each seed match sequence and each extended-match variant (SeedM
+ m8, SeedM + t1A, etc.) so as to preserve the expected frequency of random
matching between miRNA seed sequences and complementary 3' UTR sequences. All
hexamers, heptamers, heptamers, and octamers were examined to identify suitable
control sequences for each miRNA seed (or augmented seed) that preserve (1)
E(SM), the 1st order Markov probability of the seed match, and (2) O(SM), the
observed count of seed matches in human UTRs within a total margin of
±7.5%. As described in Lewis et al. (2003), for a miRNA seed match heptamer
S1,S2,S3,S4,S5,S6,S7, E(SM) was equal to (PS1·PS1,S2·PS2,S3·PS3,S4·PS4,S5·PS5,S6·PS6,S7)
where PS1 was the frequency of the nucleotide S1 and PSk,Sk+1 was the conditional
frequency of the nucleotide Sk+1 given Sk at the previous position, determined
by counting dinucleotides in the UTR sequences. Sequences corresponding to known
miRNA seeds, as well as sequences known to function in mRNA processing, such
as the polyadenylation signals AAUAAA and AUUAAA and the consensus RNA binding
sequences of the puf protein family (White et al., 2001), were restricted from
use in the control sets. All possible control sequences that met these criteria
were assigned to each distinct miRNA seed sequence represented in the dataset.
For each miRNA analyzed, an estimate of the false-positive predictions was calculated
by averaging the results of each of its control sequences. These averages
were then summed to estimate the number of false positives for a set of miRNAs.
A few miRNAs were assigned only five control sequences, but most had many more.
For the analysis of target sites with pairing to the 3' end of the miRNA, control
sequences were generated by simply merging each controlled seed sequence with
the remaining ~13–16 nt region of the real miRNA that follows the seed.
For the analysis of miRNA target sites with a single G:U pair disrupting the
seed match (Figure 2), control sequences were screened further so as to contain,
on average, a similar number of instances of the 4-mer UGUA as the corresponding
real miRNAs because this 4-mer is the core conserved element of the PUM2 binding
site consensus (White et al., 2002). This extra measure was necessary for the
G:U analysis due to the fact that the set of 6-mers corresponding to seed matches
disrupted with a single G:U are enriched for the UGU 3-mer relative to 6-mers
of comparable abundance in a single genome. The enrichment for UGU is a consequence
of the enrichment for U’s and G’s obtained when specifying that
seed:seed match duplexes must be disrupted by a G:U pair.
To calculate the standard deviation of the number of mRNAs that are predicted
to be targets of a single cohort of control sequences, in which a single cohort
set consisted of one control sequence per real miRNA, a special procedure was
devised that accounted for the varying number of control sequences assigned
to each real miRNA in our set. The total number of cohort sets used was defined
as N, which was equal to the maximum number of cohorts used for a single real
miRNA in the set under consideration. The 1st listed control for each miRNA
was assigned to cohort set 1, the 2nd to cohort set 2, … , the Nth to
cohort set N. When considering cohort set n and a real miRNA with m control
sequences in which m < n, the (n mod m)th control sequence was re-chosen
to be included in the nth cohort set, thus enabling the construction of N total
cohort sets. For each cohort set, the number of predicted target mRNAs
was determined, and the standard deviation of the mean was calculated. This
single standard deviation value corresponds to the length of the error bars
above and below the average noise level in the predicted targets plots (Figures
1B, 1D, 1E, 1G, 2A, and 2B).
Analysis of Gene Ontologies
Gene ontologies were assigned to human genes from the UCSC known genes database
by crossreferencing with GO identifiers listed in the annotation database of
the UCSC genome browser (http://hgdownload.cse.ucsc.edu/).
The Gene Ontology Consortium database (Harris et al., 2004) was retrieved from
http://www.geneontology.org and biological process ontologies were compiled
for all predicted target genes of the miRNAs (five-genome SeedM + m8M + t1A
analysis) and the corresponding octamer control sequences. As in previous signal-to-noise
calculations, the hits to GO categories were averaged to determine a noise estimate
for an octamer control set and GO category.
Discussion Regarding Control Sequences for Estimating False Positives
Different approaches have been used to generate control sequences by which to
estimate the number of false-positive miRNA target predictions. Our current
approach is described above and resembles that which we used previously in mammalian
predictions (Lewis et al., 2003). To evaluate the miRanda predictions, John
et al. (2004) use an alternative approach to generate control sequences, wherein
they shuffle (randomly permute) the miRNA. However, when analyzing vertebrate
genomes, repeating a target prediction procedure with random shuffles is prone
to underestimate the false positives, thereby overestimating the predictive
accuracy of the algorithm (Lewis et al., 2003).
Pitfalls of using random shuffles to estimate the false positives can be illustrated
with the miR-125 seed heptamer (CCCUGAG, 6-nt seed plus m8). This heptamer has
663 reverse-complement matches in human UTRs, whereas, on average, random shuffles
have only 205 hits (Supplemental Table S4). It is unlikely that the difference
of 458 hits represents the number of functional target sites for miR-125. Instead,
this difference can be readily explained as an artifact of the shuffled sequences
containing an oligonucleotide composition that differs from that of the miRNAs.
For example, the miRNAs, like the UTRs and vertebrate genomes as a whole, contain
few CG dinucleotides. Therefore (since CG is palindromic), random shuffles that
create CG dinucleotides have far fewer hits to the UTRs than does the authentic
seed heptamer (Supplemental Table S4). To avoid this artifact that would unduly
favor the assessment of any algorithm that uses pairing or predicted duplex
stability for prediction, it is better to pick controls sequences that match
the relevant features of the authentic miRNAs, including compositional features
(Lewis et al., 2003). For the miR-125 heptamer, the sequence CAGUGCC would be
a more appropriate control than the typical shuffled derivative. The same principles
can be used to generate control sequences that are the same length as the miRNAs
(Lewis et al., 2003).
Comments on Overlapping Predictions
John et al. (2004) report that 46% of 400 target genes predicted by TargetScan
(Lewis et al., 2003) overlapped with the targets predicted by miRanda. However,
in most cases miRanda and TargetScan paired these UTRs to different miRNAs,
and the actual overlap of predicted regulatory relationships was only 7%.
How might the 18% overlap of the miRanda predictions with the mammalian TargetScanS
predictions be rationalized, given that a high fraction of miRanda predictions
were estimated to be false positives? The set of miRanda predictions that overlap
with TargetScanS predictions are enriched for highly conserved UTRs relative
to the full set of 4-genome TargetScanS predictions. Having observed that TargetScanS
is less accurate when applied to UTRs with higher levels of background conservation
(Figure 1F), we would expect this overlapping set to have a signal:noise lower
than the 2.2:1 signal:noise calculated for the set of more than 22,000 mammalian
TargetScanS predictions. In addition to being enriched in lower-confidence TargetScanS
predictions, this overlapping set would be expected to be enriched in higher-confidence
miRanda predictions. This is because the overlap was determined using miRanda
predictions for the subset of miRNAs used in the TargetScanS analysis, which
were the more broadly conserved miRNAs and previously have been shown to yield
higher signal:noise than do the less broadly conserved miRNAs (Lewis et al.,
2003).
The Role of Evolutionary Conservation in miRNA Target Identification
An ultimate goal of miRNA target predictions will be to reliably identify authentic
regulatory interactions without relying on conservation; this would find the
regulatory interactions specific to each species. However, neither TargetScan
nor any of the other target prediction methods are close to achieving this goal.
In the meantime, we favor the approach of choosing the control sequences such
that they give the same number of hits in a single genome as do the authentic
miRNAs, and then relying on conservation of the sites in multiple genomes to
distinguish the signal from the noise (e.g., Figure 1G). Nonetheless,
we recognize that this approach is conservative in that it will overestimate
the noise to some extent, depending on both how well the algorithm predicts
authentic targets in a single genome and the degree to which there has been
selective pressure for transcripts to avoid miRNA complementarity.
References
Ambros, V., Bartel, B., Bartel, D.P., Burge, C.B., Carrington, J.C., Chen, X.,
Dreyfuss, G., Eddy, S.R., Griffiths-Jones, S., Marshall, M., et al. (2003).
A uniform system for microRNA annotation. RNA 9, 277–279.
Blanchette, M., Kent, W.J., Riemer, C., Elnitski, L., Smit, A.F., Roskin, K.M.,
Baertsch, R., Rosenbloom, K., Clawson, H., Green, E.D., et al. (2004). Aligning
multiple genomic sequences with the threaded blockset aligner. Genome Res. 14,
708–715.
Enright, A.J., John, B., Gaul, U., Tuschl, T., Sander, C., and Marks, D.S. (2003).
Genome Biol. 5, R1.
Harris, M. A., Clark, J., Ireland, A., Lomax, J., Ashburner, M., Foulger, R.,
Eilbeck, K., Lewis, S., Marshall, B., Mungall, C., et al. (2004). The Gene Ontology
(GO) database and informatics resource. Nucleic Acids Res. 32 Database issue,
D258–D261.
Hofacker, I.L., Fontana, W., Stadler, P.F., Bonhoeffer, S., Tacker, M., and
Schuster, P. (1994). Fast folding and comparison of RNA secondary structures.
Monatshefte f Chemie 125, 167–188.
John, B., Enright, A.J., Aravin, A., Tuschl, T., Sander, C., and Marks, D.S.
(2004). Human MicroRNA targets. PLoS Biol. 2, e363.
Karolchik, D., Baertsch, R., Diekhans, M., Furey, T. S., Hinrichs, A., Lu, Y.
T., Roskin, K. M., Schwartz, M., Sugnet, C. W., Thomas, D. J., et al. (2003).
The UCSC Genome Browser Database. Nucleic Acids Res. 31, 51–54.
Kaspryzk, A., Keefe, D., Smedley, D., London, D., Spooner, W., et al. (2004)
EnsMart: A generic system for fast and flexible access to biological data. Genome
Res. 14, 160–169.
Lagos-Quintana M., Rauhut R., Yalcin A., Meyer J., Lendeckel W., Tuschl T. (2002).
Identification of tissue-specific microRNAs from mouse. Curr. Biol. 12, 735–739.
Lewis, B.P., Shih, I., Jones-Rhoades, M.W., Bartel, D.P., and Burge, C.B. (2003).
Prediction of mammalian microRNA targets. Cell 115, 787–798.
Pruitt, K.D., Tatusova, T., and Maglott, D.R. (2003). NCBI Reference Sequence
project: update and current status. Nucleic Acids Res. 31, 34–37.
White, E.K., Moore-Jarret, T., Ruley, H.E. (2001). PUM2, a novel murine puf
protein, and its consensus RNA-binding site. RNA 7, 1855–1866.
Supplemental
Figure S1. An Illustration, Modeled after Figure 2 of John et al. (2004),
Showing How Artifactual Signal above Noise Can Be Compounded when Requiring
Multiple Sites per UTR
The height of each bar indicates the number of human transcripts containing
a given number of predicted target sites. The 21-nt segments immediately upstream
or downstream of miRNA hairpins were compiled as a set of fictitious miRNAs
that will yield only false-positive predictions. TargetScanS analysis
was performed using these fictitious RNAs to generate a large set of predicted
target sites (dark blue bars). When using controlled shuffled miRNAs,
this estimate matched the number of predictions from these fictitious miRNAs
(black bars). When using simple shuffles as the control sequences, the
estimate of the noise was substantially lower (light blue bars), giving an errantly
high signal above noise. This artifact was compounded when asking for
more than one conserved hit per UTR.
Supplemental
Table S1. Supplemental Table S1. Human microRNAs Used for Target Prediction
Supplemental Table S2. Supplemental Table S2. Predicted targets conserved in
five vertebrates (Figure 1B)
Supplemental
Table S3. Supplemental Table S3. Predicted Targets Conserved in Human, Mouse,
Rat and Dog
Supplemental
Table S4. Matches for the miR-125 Seed Heptamer and Its Shuffled Derivatives