Exploring Essential Attributes For Detecting MicroRNA Precursors From Background Sequences
Yun Zheng, Wynne Hsu, Mong Li Lee and Limsoon Wong
Department of Computer Science 
School of Computing
National University of Singapore
Singapore 117543
Email: {zhengy, whsu, leeml, wongls}@comp.nus.edu.sg
Abstract
MicroRNAs (miRNAs) have been shown to play important roles in different biological processes, especially in post-transcriptional gene regulation. The hairpin structure is a key characteristic of miRNA precursors (pre-miRNAs). Encoding these hairpin structures is a critical step in correctly detecting pre-miRNAs among background sequences, i.e., pseudo miRNA precursors. In this paper, we propose to encode the hairpin structure of a pre-miRNA with a set of features that captures both the global and local structural characteristics of pre-miRNAs. Furthermore, we find that four essential attributes, comprising both global and local features, are discriminatory for classifying human pre-miRNAs and background sequences. The experimental results show that the number of conserved essential features decreases as the phylogenetic distance between species increases. Specifically, one A-U pair in the pre-miRNAs, which produces the U at the start position of most mature miRNAs, is found to be well conserved across different species for the purpose of biogenesis.

NOTE:

The DFLearner software provided here is free for academic purposes. Please cite the following papers when using the DFLearner software in scientific publications.

Zheng Yun and Kwoh Chee Keong. Dynamic Algorithm for Inferring Qualitative Models of Gene Regulatory Networks. In Proceedings of the 3rd Computational Systems Bioinformatics Conference, CSB 2004, pages 353-364. IEEE Computer Society Press, 2004. [pdf] [slides]

Zheng Yun and Kwoh Chee Keong. Identifying simple discriminatory gene vectors with an information theory approach. In Proceedings of the 4th Computational Systems Bioinformatics Conference, CSB 2005, pages 12-23. IEEE Computer Society Press, 2005. [pdf] [slides] [supplements]

Supplements
The Discrete Function Learner (DFLearner) software (1.2MB)

Includes the software and the data files for some examples in the paper. The current version has been validated on the Windows and HP Tru64 Unix platforms.

software.zip

online help

The miREncoding software

This software takes pre-miRNA sequences, together with their secondary structures, as input and encodes each sequence into 43 features.

miREncoding.zip
Data sets data.zip
A Brief Introduction to the Discrete Function Learning algorithm Chapter 1 of the Online Help of the DFLearner
The Weka software 

In our experiments, we use the GUI version 3.4.2.

http://www.cs.waikato.ac.nz/~ml/weka/
Original data sets

We use the pre-miRNAs in release 5.0 of the miRBase Sequence Database.

Rfam:miRBase

Selection of Parameters of the Discrete Function Learning algorithm Supplementary Section S1.
Supplementary Figures
The manual binary search of minimum ε value. Figure S1.
The selection of essential attributes with the DFL algorithm. Figure S2.
Supplementary Tables
The summary of data sets. Table S1.
The prediction accuracies of different classification algorithms on the data sets with all features. Table S2.
The prediction accuracies of different classification algorithms on the data sets with 32 local features. Table S3.
The prediction accuracies of different classification algorithms on the data sets with 11 global features. Table S4.
The prediction accuracies of different classification algorithms on the data sets with the features chosen by the DFL algorithm. Table S5.
 
Supplementary Section S1: Selection of Parameters
In this section, we discuss how to select the two parameters of the Discrete Function Learning (DFL) algorithm: the expected cardinality of the essential attributes (EAs), K, and the ε value.

S1.1 The Selection of The Expected Cardinality K

Generally, if a data set has a large number of features, say several thousand, then K can be set to a small constant, such as 50, since models with a large number of features are very difficult to interpret. If the number of features is small, then K can simply be set to the number of features n.

Another use of K is to control model complexity. If limiting the number of features is more important than accuracy, then a predefined K can be set, so that the learned model has at most K features.

The expected cardinality K can also be used to incorporate prior knowledge about the number of relevant features. If such prior knowledge is available, then K can simply be set to that predetermined value.
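As a simple illustration of these guidelines, the following sketch shows how K might be chosen; the helper function and its thresholds are hypothetical and are not part of the DFLearner software.

```python
def choose_expected_cardinality(n_features, k_prior=None, cap=50):
    """Hypothetical helper illustrating the guidelines above for choosing K.

    n_features -- total number of features n in the data set
    k_prior    -- prior knowledge about the number of relevant features, if any
    cap        -- small constant used when the data set has very many features
    """
    if k_prior is not None:
        # Incorporate prior knowledge about the number of relevant features.
        return k_prior
    if n_features > 1000:
        # Many features: cap K so that the learned model stays interpretable.
        return cap
    # Few features: K can simply be set to the number of features n.
    return n_features
```

For instance, with the 43-feature pre-miRNA encoding one could pass n = 43, or instead set a smaller predefined K (such as the K = 10 used in the experiments of Figure S2) to keep the model compact.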

S1.2 The Selection of The ε Value

For a given noisy data set, the missing part of H(Y), as demonstrated in Figure 1.4, is fixed, i.e., there exists a specific minimum ε value, ε_m, with which the DFL algorithm can find the original model. If the ε value is smaller than ε_m, the DFL algorithm will not find the original model. Here, we introduce two methods to find ε_m efficiently.

First, ε_m can be found automatically by a restricted learning process. To find the minimum ε efficiently, we restrict the maximum number of subsets to be checked to \sum_{i=0}^{K-1} (n - i) ≈ Kn. In other words, we only let the DFL algorithm perform its first round of searching, as shown in Figure S1.3 of the Introduction. The search scope of ε is specified in advance. If the DFL algorithm cannot find a model for a noisy data set with the specified minimum ε value, then ε is increased by a step of 0.01. The restricted learning is repeated until the DFL algorithm finds a model at some threshold value of ε, i.e., ε_m. Since only about Kn subsets are checked in the restricted learning process, the time to find ε_m is O(Kn).

The restricted learning process can also be used to find an optimal model. To do so, we vary the ε value from 0 to the upper limit of the search scope, such as 0.8, in steps of 0.01. For each ε value, we train a model with the DFL algorithm and then validate its performance with cross validation.
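The following sketch illustrates this scan; the two callables stand in for running the DFL algorithm (restricted to its first-round search) and for the cross-validation run at a given ε, and are assumptions rather than actual DFLearner interfaces.

```python
def scan_epsilon(train_model, cross_validate, upper=0.80, step=0.01):
    """Sketch: scan epsilon over [0, upper] in steps of `step` and keep the value
    with the best cross-validation accuracy.

    train_model(eps)    -- returns a model, or None if none is found at this eps
    cross_validate(eps) -- returns the cross-validation accuracy at this eps
    """
    best_eps, best_acc = None, -1.0
    for i in range(int(round(upper / step)) + 1):
        eps = round(i * step, 2)
        if train_model(eps) is None:
            continue                      # no model found at this epsilon
        acc = cross_validate(eps)
        if acc > best_acc:
            best_eps, best_acc = eps, acc
    return best_eps, best_acc
```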

Second, the ε_m can also be found with a manual binary search method, as shown in Figure S1.6. Since ε \in [0,1), ε is specified to 0.5 in the first try. If the DFL algorithm finds a model with ε value of 0.5, then ε is specified to 0.25 in the second try. Otherwise, if the DFL algorithm can not find a model with a long time, like 10 minutes, then the DFL algorithm can be stopped and ε is specified to 0.75 in the second try. The selection process is carried out until the ε_m value is found so that the DFL algorithm can find a model with it but cannot when ε = ε_m - 0.01. This selection process is also efficient. Since ε \in [0,1), only 5 to 6 tries are needed to find the ε_m on the average.
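A sketch of this binary search is given below; model_found(ε) stands in for running the DFL algorithm by hand and reporting whether a model is found within the time limit, and the sketch assumes that a model found at some ε is also found at every larger ε.

```python
def binary_search_eps_min(model_found, step=0.01):
    """Sketch of the manual binary search for eps_m.

    model_found(eps) -- True if the DFL algorithm finds a model with this eps within
                        an acceptable time (say, 10 minutes), and False otherwise.
    Returns the smallest eps, at a resolution of `step`, for which a model is found,
    or None if no model is found anywhere in [0, 1).
    """
    lo, hi = 0, int(round(1.0 / step))    # search over integer multiples of step
    eps_m = None
    while lo < hi:
        mid = (lo + hi) // 2              # e.g. eps = 0.50 on the first try
        eps = round(mid * step, 2)
        if model_found(eps):
            eps_m, hi = eps, mid          # a model exists: search the lower half
        else:
            lo = mid + 1                  # no model: search the upper half
    return eps_m
```

Applied to the LED example below (where models exist for ε ≥ 0.26 but not for ε ≤ 0.25), this sketch converges to 0.26 in about seven evaluations, in line with the few tries reported for the manual procedure.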

We use the LED data set with 10 percent noise to illustrate the procedure. In the first try, the DFL algorithm finds a model with ε = 0.5. In the second try, the DFL algorithm cannot find a model with ε = 0.25. In the 3rd to 6th tries, the DFL algorithm finds models with the specified ε values 0.37, 0.31, 0.28 and 0.26. Since the second try has already shown that the DFL algorithm cannot find a model with ε = 0.25, 0.26 is the minimum ε value for this data set.

Figure S1. The manual binary search of minimum ε value. 

 

 

 
Figure S2. The selection of essential attributes with the DFL algorithm.

To find the best subset of EAs for the data sets, we first set the expected cardinality of the EAs, K, to 10. Next, we use the DFL algorithm to perform leave-one-out cross validation (LOOCV) on the training set (TR-C) with different ε values, from 0 to 0.8 in steps of 0.01. We find that the DFL algorithm reaches its best prediction performance in the LOOCV when ε is within [0.12, 0.13]. When all samples of the training data set are used, the distributions of the attributes are slightly different from those in the LOOCV. Thus, we try the DFL classifiers obtained over a wider region of ε, [0.1, 0.15]. Finally, we choose the DFL classifier obtained at ε = 0.11 because it shows overall better prediction accuracies for data sets 1 to 4. In this way, a subset of 4 features, {A(((, G·((, length basepair ratio, energy per nucleotide}, is chosen as the EAs for the human data sets, D1 to D4. For the other data sets, we try subsets of these 4 EAs and choose the subsets on which the 1NN algorithm introduced in Section 3.5 produces the best prediction performance, as sketched below. The selected EAs are shown in Figure 3(a).
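The subset selection for the other species can be sketched as follows, using scikit-learn's 1-nearest-neighbour classifier as a stand-in for the 1NN algorithm of Section 3.5; the feature matrices, label vectors and ASCII feature names are assumptions for illustration.

```python
from itertools import combinations

from sklearn.neighbors import KNeighborsClassifier

# The 4 EAs chosen on the human training data (ASCII names; "G.((" stands for G·(().
EAS = ["A(((", "G.((", "length_basepair_ratio", "energy_per_nucleotide"]

def best_ea_subset(X_train, y_train, X_test, y_test):
    """Try every non-empty subset of the 4 EAs and keep the one on which a
    1-nearest-neighbour classifier predicts the test set best.

    X_train, X_test -- numeric feature matrices whose columns follow the order of EAS
    y_train, y_test -- class labels (pre-miRNA vs. background)
    """
    best_subset, best_acc = None, -1.0
    for r in range(1, len(EAS) + 1):
        for cols in combinations(range(len(EAS)), r):
            clf = KNeighborsClassifier(n_neighbors=1)
            clf.fit(X_train[:, cols], y_train)
            acc = clf.score(X_test[:, cols], y_test)
            if acc > best_acc:
                best_subset, best_acc = [EAS[c] for c in cols], acc
    return best_subset, best_acc
```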

As shown in Figure S2, there is another region of ε, around 0.5, within which the DFL algorithm also achieves high prediction accuracy in cross validation. However, the DFL models use more features when ε decreases. Thus, we prefer the features chosen by the DFL algorithm when ε is in the region [0.1, 0.15].

 

Table S1. The summary of data sets.
    Data Set                          # of Samples   Class                  Species
0   TR-C (training data set)          163/168        pre-miRNA/background   human
1   TE-C1                             30             pre-miRNA              human
2   TE-C2                             1000           background             human
3   CONSERVED-HAIRPIN (T3)            2444           background             human
4   UPDATED (T4)                      39             pre-miRNA              human
5   Mus musculus (mmu)                36             pre-miRNA              other species
6   Rattus norvegicus (rno)           25             pre-miRNA              other species
7   Gallus gallus (gga)               13             pre-miRNA              other species
8   Danio rerio (dre)                 6              pre-miRNA              other species
9   Caenorhabditis briggsae (cbr)     73             pre-miRNA              other species
10  Caenorhabditis elegans (cel)      110            pre-miRNA              other species
11  Drosophila pseudoobscura (dps)    71             pre-miRNA              other species
12  Drosophila melanogaster (dme)     71             pre-miRNA              other species
13  Oryza sativa (osa)                96             pre-miRNA              other species
14  Arabidopsis thaliana (ath)        75             pre-miRNA              other species
15  Epstein Barr Virus (ebv)          5              pre-miRNA              other species
    Total (testing data sets 1-15)    4094

In this research, we use the data sets from the literature (Xue et al. 2005) to validate our approach, since it is valuable to compare against the published results. These data sets are summarized in Table S1. Data sets 0 to 4 are from human, and data sets 5 to 15 are from other species, as indicated by their names. Data set 0 is used as the training data set, and data sets 1 to 15 are used as testing data sets. In total, 4094 samples are used for testing, comprising 3444 background sequences and 650 pre-miRNAs.

The sequences of human pre-miRNAs are obtained from the miRNA Registry database (release 5.0) (Griffiths-Jones, 2004). The secondary structures of these 207 pre-miRNA sequences are predicted with RNAfold (Hofacker, 2003). Then, the 193 sequences with only 1 loop are chosen. Next, 163 of them are randomly selected as the positive samples of the training data set, i.e., TR-C in Table S1. The remaining 30 samples are used as the TE-C1 testing data set.

The background sequences in data set 2 are collected from protein coding regions (CDSs) according to the UCSC refGene annotation tables (Xue et al. 2005). The lengths of these sequences follow the same distribution as the human pre-miRNAs. RNAfold is also used to predict their secondary structures. Then, sequences with multiple loops, sequences with fewer than 18 base pairs, and sequences with free energy greater than -15 kcal/mol are removed. This leaves 8494 sequences in the data set. Among them, 168 are randomly selected as the negative samples of the TR-C data set, and another 1000, disjoint from the 168 used, are randomly chosen as the TE-C2 testing data set.
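A minimal sketch of these three filters is given below; it assumes that each candidate is available as an RNAfold-style dot-bracket structure together with its minimum free energy, and it omits the actual interaction with RNAfold.

```python
import re

def keep_candidate(dot_bracket, mfe, min_basepairs=18, max_energy=-15.0):
    """Apply the three filters described above to one candidate sequence.

    dot_bracket -- predicted secondary structure in dot-bracket notation
    mfe         -- predicted minimum free energy in kcal/mol
    """
    n_loops = len(re.findall(r"\(\.*\)", dot_bracket))  # hairpin loops: '(' dots ')'
    n_pairs = dot_bracket.count("(")                    # one '(' per base pair
    return n_loops == 1 and n_pairs >= min_basepairs and mfe <= max_energy

# Example: a single hairpin with 20 base pairs and -25.3 kcal/mol is kept.
# keep_candidate("(" * 20 + "...." + ")" * 20, -25.3)  -> True
```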

Data set 3 also consists of background sequences, which are retrieved from the region from position 56,000,001 to 57,000,000 of human chromosome 19 in the UCSC database (Karolchik et al. 2003). A window of 100 nucleotides is used to scan this region, and the sequences for which RNAfold (Hofacker, 2003) predicts a hairpin secondary structure are selected. This produces the 2444 background sequences of data set 3. Unlike data set 2, some sequences in data set 3 are likely to be true pre-miRNAs; in fact, data set 3 contains 3 known miRNAs (hsa-mir-99b, hsa-let-7e and hsa-mir-125a) (Xue et al. 2005).

Bentwich et al. (2005) reported 89 new pre-miRNAs, of which 1 has multiple loops and is removed. To remove similar sequences, BLASTCLUST with S = 80, L = 0.5 and W = 16 is applied to the remaining 88 sequences, and only one representative sequence from each cluster is kept. This produces 40 non-redundant sequences, which are further checked against the training data set. One of the 40 sequences has high similarity to the training data set and is removed. Finally, 39 sequences are retained as data set 4.

 

Table S2. The prediction accuracies of different classification algorithms on the data sets with all features.
Data Set SVM C4.5 kNN RIP.
TE-C (pre-miRNA) 100.0 100.0 100.0 93.3
TE-C (background) 96.3 91.7 93.6 89.8
CONSERVED-HAIRPIN 92.1 88.9 86.8 85.8
UPDATED 89.7 89.7 89.7 94.9
Mus musculus 94.4 83.3 91.7 88.9
Rattus norvegicus 76.0 76.0 80.0 76.0
Gallus gallus 84.6 92.3 69.2 92.3
Danio rerio 83.3 83.3 66.7 66.7
Caenorhabditis briggsae 95.9 89.0 93.2 91.8
Caenorhabditis elegans 86.4 74.5 81.8 78.2
Drosophila pseudoobscura 88.7 74.6 88.7 85.9
Drosophila melanogaster 91.5 70.4 88.7 83.1
Oryza sativa 99.0 97.9 93.8 99.0
Arabidopsis thaliana 97.3 92.0 96.0 94.7
Epstein Barr Virus 60.0 100.0 80.0 80.0
total 93.1 88.9 88.9 87.2

 

Table S3. The prediction accuracies of different classification algorithms on the data sets with 32 local features.
Data Set SVM$ C4.5 kNN RIP. Tri-SVM*
TE-C (pre-miRNA) 93.3 93.3 100.0 90.0 93.3
TE-C (background) 92.2 86.7 84.9 86.7 88.1
CONSERVED-HAIRPIN 88.4 78.2 80.1 79.6 89.0
UPDATED 92.3 87.2 89.7 92.3 92.3
Mus musculus 91.7 72.2 88.9 86.1 94.4
Rattus norvegicus 84.0 68.0 76.0 72.0 80.0
Gallus gallus 92.3 69.2 84.6 76.9 84.6
Danio rerio 83.3 66.7 83.3 83.3 66.7
Caenorhabditis briggsae 95.9 87.7 94.5 90.4 95.9
Caenorhabditis elegans 86.4 85.5 83.6 88.2 86.4
Drosophila pseudoobscura 88.7 81.7 90.1 88.7 90.1
Drosophila melanogaster 88.7 71.8 85.9 84.5 91.5
Oryza sativa 99.0 99.0 92.7 99.0 94.8
Arabidopsis thaliana 96.0 93.3 89.3 96.0 92.0
Epstein Barr Virus 100.0 80.0 80.0 80.0 100.0
total 89.9 81.4 82.7 83.0 89.1

$ For the data sets with local features, the prediction accuracies of the SVM algorithm in our study are slightly better than those reported in the literature (Xue et al. 2005). We attribute this to the encoded region used in our research: we encode the triplet local features over the whole pre-miRNA, except the first and last nucleotide, whereas only the paired regions of the pre-miRNAs are encoded into triplet features in (Xue et al. 2005). * The Tri-SVM column shows the results from the literature (Xue et al. 2005).
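The following sketch illustrates the triplet local-feature encoding discussed in this footnote, under the assumption that it follows the structure-sequence triplet scheme of Xue et al. (2005): for every position except the first and last nucleotide, the paired/unpaired states of the position and its two neighbours are combined with the identity of the middle nucleotide, giving 4 x 8 = 32 features such as A((( and G·((. The normalisation shown here is an assumption and may differ from that of the miREncoding software.

```python
from itertools import product

# All 32 triplet feature names: middle nucleotide x paired('(')/unpaired('.') pattern.
TRIPLET_FEATURES = [nt + "".join(p) for nt in "ACGU" for p in product("(.", repeat=3)]

def triplet_local_features(sequence, dot_bracket):
    """Count triplet structure-sequence features over the whole pre-miRNA,
    skipping the first and last nucleotide, and return normalised frequencies.

    sequence    -- the pre-miRNA sequence (A/C/G/U)
    dot_bracket -- its RNAfold dot-bracket structure; ')' is mapped to '('
    """
    states = dot_bracket.replace(")", "(")          # keep only paired vs. unpaired
    counts = {f: 0 for f in TRIPLET_FEATURES}
    for i in range(1, len(sequence) - 1):           # skip first and last nucleotide
        key = sequence[i] + states[i - 1:i + 2]
        if key in counts:                           # ignore characters outside A/C/G/U
            counts[key] += 1
    total = sum(counts.values()) or 1
    return {f: c / total for f, c in counts.items()}
```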

Table S4. The prediction accuracies of different classification algorithms on the data sets with 11 global features.
Data Set SVM C4.5 kNN RIP.
TE-C (pre-miRNA) 96.7 100.0 100.0 93.3
TE-C (background) 95.1 91.3 93.1 92.4
CONSERVED-HAIRPIN 88.3 87.2 84.2 87.8
UPDATED 89.7 94.9 92.3 92.3
Mus musculus 94.4 91.7 94.4 94.4
Rattus norvegicus 76.0 76.0 76.0 80.0
Gallus gallus 76.9 84.6 76.9 69.2
Danio rerio 66.7 50.0 66.7 66.7
Caenorhabditis briggsae 97.3 87.7 95.9 84.9
Caenorhabditis elegans 79.1 78.2 84.5 75.5
Drosophila pseudoobscura 85.9 81.7 85.9 73.2
Drosophila melanogaster 91.5 78.9 90.1 78.9
Oryza sativa 100.0 96.9 100.0 96.9
Arabidopsis thaliana 97.3 93.3 96.0 94.7
Epstein Barr Virus 40.0 80.0 40.0 80.0
total 90.3 88.1 87.5 88.4

 

Table S5. The prediction accuracies of different classification algorithms on the data sets with the features chosen by the DFL algorithm.
Data Set SVM C4.5 kNN RIPPER
1 TE-C1 96.7 96.7 96.7 96.7
2 TE-C2 94.2 92.5 94.5 95.2
3 T3 90.4 90.1 90.6 92.3
4 T4 94.9 92.3 97.4 94.9
5 mmu 94.4 94.4 94.4 94.4
6 rno 80.0 84.0 80.0 80.0
7 gga 92.3 100.0 100.0 100.0
8 dre 83.3 83.3 83.3 83.3
9 cbr 89.0 94.5 94.5 94.5
10 cel 88.2 96.4 96.4 96.4
11 dps 91.5 95.8 95.8 95.8
12 dme 88.7 93.0 93.0 93.0
13 osa 96.9 97.9 97.9 97.9
14 ath 100.0 100.0 100.0 100.0
15 ebv 100.0 100.0 80.0 80.0
human sensitivity 95.7 94.2 97.1 95.7
specificity 91.5 90.8 91.7 93.1
accuracy 91.6 90.9 91.8 93.2
other species sensitivity 91.9 95.7 95.5 95.4
total sensitivity 92.3 95.5 95.5 95.4
specificity 91.5 90.8 91.7 93.1
accuracy 91.6 91.6 92.3 93.5

 

References:

1. I. Bentwich, A. Avniel, Y. Karov, R. Aharonov, S. Gilad, O. Barad, A. Barzilai, P. Einat, U. Einav, E. Meiri, E. Sharon, Y. Spector, and Z. Bentwich. Identification of hundreds of conserved and nonconserved human micrornas. Nature Genetics, 37(7):766–70, 2005.

2. S. Griffiths-Jones. The microRNA Registry. Nucl. Acids Res., 32(90001):D109–111, 2004.

3. I.L. Hofacker. Vienna RNA secondary structure server. Nucl. Acids Res., 31(13):3429–3431, 2003.

4. D. Karolchik, R. Baertsch, M. Diekhans, T.S. Furey, A. Hinrichs, Y.T. Lu, K.M. Roskin, M. Schwartz, C.W. Sugnet, D.J. Thomas, R.J. Weber, D. Haussler, and W.J. Kent. The UCSC Genome Browser Database. Nucl. Acids Res., 31(1):51–54, 2003.

5. C. Xue, F. Li, T. He, G.-P. Liu, Y. Li, and X. Zhang. Classification of real and pseudo microRNA precursors using local structure-sequence features and support vector machine. BMC Bioinformatics, 6(1):310, 2005.

 

All rights reserved by Zheng Yun (2006)