The sequences upstream of the genomic loci from which 21U-RNAs derive contain two characteristic motifs. The first motif is 34nt long and contains many positions whose nucleotide frequencies deviate strongly from those of the surrounding genomic sequence. Within this "large" motif is an 8nt core with the strongest nucleotide preferences: CTGTTTCA. The second motif is small (6nt) and encompasses the genomic position corresponding to the 5p end of the derivative 21U-RNA. The core of this motif is the 4nt sequence YRNT, where Y is any pyrimidine, R is any purine, N is any nucleotide, and T encodes the U at the 5p end of the 21U-RNA. The two motifs are separated in the genome by ~20nt.
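The YRNT notation can be made concrete with a short Python sketch. This only illustrates the degenerate-base codes as an exact pattern match; it is not the weighted scoring the module performs. The function name and the convention that `pos` is the 0-based index of the 5p T in an uppercase DNA string are assumptions made for this example.

```python
import re

# Degenerate-base codes: Y = C or T (pyrimidine), R = A or G (purine), N = any base.
YRNT = re.compile(r"[CT][AG][ACGT]T")

def has_small_motif_core(seq, pos):
    """Return True if seq[pos-3 .. pos] matches YRNT, with seq[pos] as the 5p T.

    Hypothetical helper: `seq` is uppercase DNA, `pos` is a 0-based index.
    """
    if pos < 3:
        return False  # not enough upstream sequence to hold the core
    return YRNT.fullmatch(seq[pos - 3 : pos + 1]) is not None

core_present = has_small_motif_core("GGCATTG", 5)  # examines "CATT"
```

An exact match like this discards the graded preferences at each position, which is why the module scores positions individually instead.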
Scoring a 21U-RNA upstream motif requires combining three scores: one for the match to the large motif, one for the match to the small motif, and one for the distance between the two. For a particular genomic locus, several scores are possible because the distance between the small and large motifs can vary. The score returned by the module provided here is the maximum of the available scores. For a small RNA that mapped to the genome, that is the small motif score (with the small motif placed over the 5p end of the small RNA) plus the maximum, over all allowed distances upstream of the small motif at which the large motif can be placed, of the sum of the distance score and the large motif score. For predicting 21U-RNAs, an alternative operation can be performed in which all possible placements of the small motif are evaluated given a high-scoring placement of the large motif. This gives a sense of which 21nt sequence is most likely to be expressed as a 21U-RNA as a consequence of the presence of the large motif. Note, however, that a single instance of the large motif is sometimes observed to drive expression of multiple 21U-RNAs.
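The two operations described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the module's own code: the function names are hypothetical, `seq` is uppercase DNA on the strand of the small RNA, `pos` is the 0-based index of its 5p nucleotide, and the large-motif indices (-58 to -25) are assumed to describe a 20nt spacing, so that a distance d shifts the large motif d - 20 nt further upstream. The score values are copied from the tables in this document.

```python
BASES = "ATCG"  # column order of the score tables: A, T, C, G

def parse_matrix(text):
    """Turn 'index A T C G' rows into {index: {base: score}}."""
    matrix = {}
    for line in text.strip().splitlines():
        index, *scores = line.split()
        matrix[int(index)] = dict(zip(BASES, map(float, scores)))
    return matrix

LARGE = parse_matrix("""
-58 0.3131 0.1347 -0.5629 -0.7366
-57 0.4728 -0.1317 -0.4571 -0.6599
-56 0.7073 -0.4912 -0.2761 -1.1751
-55 0.873 -0.4691 -1.1599 -1.4133
-54 0.9184 -0.4481 -2.1781 -1.1359
-53 0.7061 -0.1089 -1.6642 -1.0082
-52 0.4139 0.2455 -0.7254 -1.7616
-51 0.3161 0.3869 -1.2342 -1.3193
-50 -0.3188 0.8022 -1.2406 -1.2599
-49 -0.5645 0.4823 0.356 -0.8247
-48 -0.8621 0.6782 -0.0328 -0.5081
-47 0.8555 -1.0092 -0.5023 -0.6366
-46 -1.2509 -2.5166 2.2827 -3.0608
-45 -1.3367 1.1333 -1.2534 -1.6096
-44 -2.8371 -2.5685 -3.4609 2.4604
-43 -4.7455 1.4669 -1.9733 -4.5691
-42 -4.5691 1.5215 -4.7756 -5.4173
-41 -4.0972 1.4883 -3.3194 -3.9734
-40 -4.9661 -2.0251 2.4806 -3.6686
-39 1.2954 -2.3261 -3.0274 -1.0356
-38 0.2009 -0.7823 0.6757 -0.037
-37 0.1633 -0.4343 0.089 0.3034
-36 0.3299 0.1125 -1.1842 -0.2486
-35 0.0246 0.5508 -0.8933 -1.1935
-34 -0.7823 -0.3903 -0.0136 1.2646
-33 -1.1758 0.9036 -0.5569 -0.6116
-32 -1.774 0.955 0.1216 -1.247
-31 -0.8132 0.1107 0.5214 0.3905
-30 0.5013 0.2815 -2.4461 -1.0721
-29 -2.7576 1.016 0.2013 -1.0329
-28 1.0106 -0.477 -3.8514 -1.4241
-27 0.4982 0.5254 -3.3884 -3.4026
-26 0.9992 -1.2692 -1.2959 -0.4871
-25 0.5258 0.1859 -1.866 -1.0028
""")
SMALL = parse_matrix("""
-4 -0.3847 0.4554 -0.1808 -0.2749
-3 -2.172 0.7762 0.922 -2.0323
-2 0.969 -3.459 -3.3102 0.845
-1 0.0342 0.2174 -0.4393 -0.2077
0 -4.7791 1.475 -3.1639 -2.7011
1 -0.0594 -1.0039 0.3189 0.9288
""")
DISTANCE = {
    16: -5.9606915688, 17: -3.6775367302, 18: -2.64130835064,
    19: 1.15932881765, 20: 1.99214182316, 21: 1.30328543478,
    22: -0.337753314521, 23: -2.4875316342, 24: -3.8335895998,
    25: -5.26133885557,
}
# ASSUMPTION: the large-motif indices are laid out for a 20nt spacing.
REF_DISTANCE = 20

def motif_score(seq, pos, matrix, shift=0):
    """Sum matrix scores, reading matrix index i at seq[pos + i - shift]."""
    total = 0.0
    for index, row in matrix.items():
        i = pos + index - shift
        if i < 0 or i >= len(seq) or seq[i] not in row:
            return float("-inf")  # motif runs off the sequence or hits a non-ACGT base
        total += row[seq[i]]
    return total

def score_mapped_rna(seq, pos):
    """Score a small RNA whose 5p nucleotide is seq[pos]; returns (score, best distance)."""
    upstream = {d: s + motif_score(seq, pos, LARGE, d - REF_DISTANCE) for d, s in DISTANCE.items()}
    best_d = max(upstream, key=upstream.get)
    return motif_score(seq, pos, SMALL) + upstream[best_d], best_d

def rank_5p_ends(seq, large_start):
    """Prediction mode: rank candidate 5p positions for a large motif starting at seq[large_start]."""
    candidates = []
    for d, d_score in DISTANCE.items():
        pos = large_start + 58 + (d - REF_DISTANCE)
        candidates.append((d_score + motif_score(seq, pos, SMALL), pos, d))
    return sorted(candidates, reverse=True)

# Usage: consensus motifs spaced 20nt apart, with padding on both sides.
consensus = lambda m: "".join(max(m[i], key=m[i].get) for i in sorted(m))
seq = "A" * 5 + consensus(LARGE) + "A" * 20 + consensus(SMALL) + "A" * 20
score, distance = score_mapped_rna(seq, 5 + 34 + 20 + 4)  # distance == 20
```

In the prediction sketch the large motif score is the same for every candidate, so candidates are ranked by the small motif score plus the distance score alone. Returning the full ranked list rather than only the top hit leaves room for the case in which one large motif drives more than one 21U-RNA.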
The motif scores were calculated from nucleotide frequencies at each position upstream of each small RNA that mapped with a perfect match to a 21U-RNA-rich portion of C. elegans chromosome IV. The apparent preference for a 5p U and a 21nt length among this class of RNAs was noted and used to define the relevant portions of chromosome IV from which to gather counts, but neither was explicitly required. Pseudocounts were added to the nucleotide counts at each position in proportion to the approximate background nucleotide frequencies (34% A, 34% T, 16% G, 16% C); the total number of pseudocounts was set to the square root of the number of real counts. The score at each position is the base-2 log of the ratio of the pseudocount-adjusted foreground frequency to the background frequency (adding the same background-proportioned pseudocounts to the background leaves its frequencies unchanged).
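The calculation can be checked against the tables with a short Python sketch. The function name is hypothetical. For the distance scores, a uniform background over the ten allowed distances is an assumption; it is not stated in the text, but it reproduces the listed distance scores.

```python
import math

# Approximate background nucleotide frequencies given in the text.
BACKGROUND = {"A": 0.34, "T": 0.34, "C": 0.16, "G": 0.16}

def log_odds(counts, background):
    """Base-2 log-odds scores from raw counts (hypothetical helper).

    counts:     {category: observed count}
    background: {category: expected frequency}, summing to 1
    """
    n = sum(counts.values())
    pseudo = math.sqrt(n)  # total pseudocounts = sqrt(number of real counts)
    return {
        k: math.log2((c + pseudo * background[k]) / (n + pseudo) / background[k])
        for k, c in counts.items()
    }

# Counts at index 0 of the small motif (the 5p T), taken from the count table.
scores = log_odds({"A": 58, "T": 6551, "C": 111, "G": 158}, BACKGROUND)
# scores["T"] is close to the 1.475 listed in the score table.

# Distance counts from the count table, scored against an ASSUMED uniform background.
distance_counts = {25: 6, 24: 28, 23: 82, 22: 388, 21: 1225,
                   20: 1979, 19: 1108, 18: 73, 17: 32, 16: 1}
uniform = {d: 1 / len(distance_counts) for d in distance_counts}
distance_scores = log_odds(distance_counts, uniform)
```

Scaling the pseudocounts with the square root of the sample size means they matter less as more reads are counted, while rare events such as the single read at distance 16 still receive a finite score.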
index        A        T        C        G
-58     0.3131   0.1347  -0.5629  -0.7366
-57     0.4728  -0.1317  -0.4571  -0.6599
-56     0.7073  -0.4912  -0.2761  -1.1751
-55      0.873  -0.4691  -1.1599  -1.4133
-54     0.9184  -0.4481  -2.1781  -1.1359
-53     0.7061  -0.1089  -1.6642  -1.0082
-52     0.4139   0.2455  -0.7254  -1.7616
-51     0.3161   0.3869  -1.2342  -1.3193
-50    -0.3188   0.8022  -1.2406  -1.2599
-49    -0.5645   0.4823    0.356  -0.8247
-48    -0.8621   0.6782  -0.0328  -0.5081
-47     0.8555  -1.0092  -0.5023  -0.6366
-46    -1.2509  -2.5166   2.2827  -3.0608
-45    -1.3367   1.1333  -1.2534  -1.6096
-44    -2.8371  -2.5685  -3.4609   2.4604
-43    -4.7455   1.4669  -1.9733  -4.5691
-42    -4.5691   1.5215  -4.7756  -5.4173
-41    -4.0972   1.4883  -3.3194  -3.9734
-40    -4.9661  -2.0251   2.4806  -3.6686
-39     1.2954  -2.3261  -3.0274  -1.0356
-38     0.2009  -0.7823   0.6757   -0.037
-37     0.1633  -0.4343    0.089   0.3034
-36     0.3299   0.1125  -1.1842  -0.2486
-35     0.0246   0.5508  -0.8933  -1.1935
-34    -0.7823  -0.3903  -0.0136   1.2646
-33    -1.1758   0.9036  -0.5569  -0.6116
-32     -1.774    0.955   0.1216   -1.247
-31    -0.8132   0.1107   0.5214   0.3905
-30     0.5013   0.2815  -2.4461  -1.0721
-29    -2.7576    1.016   0.2013  -1.0329
-28     1.0106   -0.477  -3.8514  -1.4241
-27     0.4982   0.5254  -3.3884  -3.4026
-26     0.9992  -1.2692  -1.2959  -0.4871
-25     0.5258   0.1859   -1.866  -1.0028

index        A        T        C        G
-4     -0.3847   0.4554  -0.1808  -0.2749
-3      -2.172   0.7762    0.922  -2.0323
-2       0.969   -3.459  -3.3102    0.845
-1      0.0342   0.2174  -0.4393  -0.2077
0      -4.7791    1.475  -3.1639  -2.7011
1      -0.0594  -1.0039   0.3189   0.9288

distance  score
16        -5.9606915688
17        -3.6775367302
18        -2.64130835064
19        1.15932881765
20        1.99214182316
21        1.30328543478
22        -0.337753314521
23        -2.4875316342
24        -3.8335895998
25        -5.26133885557

index      A      T      C      G
-58     2792   2464    710    628
-57     3122   2044    765    663
-56     3678   1587    869    460
-55     4129   1612    465    388
-54     4262   1636    223    473
-53     3675   2077    324    518
-52     2996   2663    633    302
-51     2798   2940    441    415
-50     1792   3930    439    433
-49     1507   3143   1354    590
-48     1221   3604   1031    738
-47     4079   1100    741    674
-46      926    369   5184    115
-45      871   4951    435    337
-44      290    355     84   5865
-43       57   6246    259     32
-42       68   6488     26     12
-41      105   6340     94     55
-40       45    530   5948     71
-39     5543    425    118    508
-38     2581   1292   1693   1028
-37     2514   1652   1123   1305
-36     2825   2426    457    886
-35     2281   3297    562    454
-34     1292   1704   1045   2553
-33      977   4218    713    686
-32      636   4372   1149    437
-31     1264   2423   1520   1387
-30     3185   2731    183    495
-29      308   4562   1215    509
-28     4545   1603     61    385
-27     3178   3239     89     88
-26     4509    914    422    749
-25     3240   2554    280    520

index      A      T      C      G
-4      1784   3216    969    907
-3       497   4025   2097    259
-2      4604    187     99   1987
-1      2395   2723    808    951
0         58   6551    111    158
1       2243   1152   1376   2107

distance  counts
25        6
24        28
23        82
22        388
21        1225
20        1979
19        1108
18        73
17        32
16        1