Exploring Essential Attributes For Detecting MicroRNA Precursors From Background Sequences
Yun Zheng, Wynne Hsu, Mong Li Lee and Limsoon Wong
Department of Computer Science 
School of Computing
National University of Singapore
Singapore 117543
Email: {zhengy, whsu, leeml, wongls}@comp.nus.edu.sg
Abstract
MicroRNAs (miRNAs) have been shown to play important roles in different biological processes, especially in post-transcriptional gene regulation. The hairpin structure is a key characteristic of miRNA precursors (pre-miRNAs). Encoding these hairpin structures is a critical step in correctly detecting pre-miRNAs among background sequences, i.e., pseudo miRNA precursors. In this paper, we propose to encode the hairpin structure of a pre-miRNA with a set of features that captures both the global and local structural characteristics of pre-miRNAs. Furthermore, we find that four essential attributes, comprising both global and local features, are discriminatory for classifying human pre-miRNAs and background sequences. The experimental results show that the number of conserved essential features decreases as the phylogenetic distance between species increases. Specifically, one A-U pair in the pre-miRNAs, which produces the U at the start position of most mature miRNAs, is found to be well conserved across different species for the purpose of biogenesis.

NOTE:

The DFLearner software provided here is free for academic purposes. Please cite the following papers when using the DFLearner software in scientific publications.

Zheng Yun and Kwoh Chee Keong. Dynamic Algorithm for Inferring Qualitative Models of Gene Regulatory Networks. In Proceedings of the 3rd Computational Systems Bioinformatics Conference, CSB 2004, pages 353-364. IEEE Computer Society Press, 2004. [pdf] [slides]

Zheng Yun and Kwoh Chee Keong. Identifying simple discriminatory gene vectors with an information theory approach. In Proceedings of the 4th Computational Systems Bioinformatics Conference, CSB 2005, pages 12-23. IEEE Computer Society Press, 2005. [pdf] [slides] [supplements]

Supplements
The Discrete Function Learner (DFLearner) software (1.2MB)

Includes the software and the data files for some examples in the paper. The current version has been validated on the Windows and HP Tru64 Unix platforms.

software.zip

online help

The miREncoding software

This software takes pre-miRNA sequences, together with their secondary structures, as input and encodes each sequence into 43 features.

miREncoding.zip
Data sets data.zip
A Brief Introduction to the Discrete Function Learning algorithm Chapter 1 of the Online Help of the DFLearner
The Weka software 

In our experiments, we use the GUI version 3.4.2.

http://www.cs.waikato.ac.nz/~ml/weka/
Original data sets

We use the pre-miRNAs in release 5.0 of the miRBase Sequence Database.

Rfam:miRBase

Selection of Parameters of the Discrete Function Learning algorithm Supplementary Section S1.
Supplementary Figures
The manual binary search of minimum ε value. Figure S1.
The selection of essential attributes with the DFL algorithm. Figure S2.
Supplementary Tables
The summary of data sets. Table S1.
The prediction accuracies of different classification algorithms on the data sets with all features. Table S2.
The prediction accuracies of different classification algorithms on the data sets with 32 local features. Table S3.
The prediction accuracies of different classification algorithms on the data sets with 11 global features. Table S4.
The prediction accuracies of different classification algorithms on the data sets with the features chosen by the DFL algorithm. Table S5.
 
Supplementary Section S1: Selection of Parameters
In this section, we discuss how to select the two parameters of the Discrete Function Learning (DFL) algorithm: the expected cardinality of the essential attributes (EAs), K, and the ε value.

S1.1 The Selection of The Expected Cardinality K

Generally, if a data set has a large number of features, say several thousand, then K can be set to a small constant, such as 50, since models with a large number of features are very difficult to interpret. If the number of features is small, then K can simply be set to the number of features n.

Another use of K is to control model complexity. If limiting the number of features is more important than accuracy, then a predefined K can be set, so that the learned model has at most K features.

The expected cardinality K can also be used to incorporate prior knowledge about the number of relevant features. If such prior knowledge is available, then K can simply be set to that predetermined value.
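As a simple illustration of these guidelines, the following sketch shows how K might be chosen; the helper function and its thresholds are hypothetical and are not part of the DFLearner software.

```python
def choose_expected_cardinality(n_features, k_prior=None, cap=50):
    """Hypothetical helper illustrating the guidelines above for choosing K.

    n_features -- total number of features n in the data set
    k_prior    -- prior knowledge about the number of relevant features, if any
    cap        -- small constant used when the data set has very many features
    """
    if k_prior is not None:
        # Incorporate prior knowledge about the number of relevant features.
        return k_prior
    if n_features > 1000:
        # Many features: cap K so that the learned model stays interpretable.
        return cap
    # Few features: K can simply be set to the number of features n.
    return n_features
```

For instance, with the 43-feature pre-miRNA encoding one could pass n = 43, or instead set a smaller predefined K (such as the K = 10 used in the experiments of Figure S2) to keep the model compact.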

S1.2 The Selection of The ε Value

For a given noisy data set, the missing part of H(Y), as demonstrated in Figure 1.4, is fixed, i.e., there exists a specific minimum ε value, ε_m, with which the DFL algorithm can find the original model. If the ε value is smaller than ε_m, the DFL algorithm will not find the original model. Here, we introduce two methods to find ε_m efficiently.

First, ε_m can be found automatically by a restricted learning process. To find the minimum ε efficiently, we restrict the maximum number of subsets to be checked to \sum_{i=0}^{K-1} (n - i) ≈ Kn. In other words, we only let the DFL algorithm perform its first round of searching, as shown in Figure S1.3 of the Introduction. The search scope of ε is specified in advance. If the DFL algorithm cannot find a model for a noisy data set with the specified minimum ε value, then ε is increased by a step of 0.01. The restricted learning is repeated until the DFL algorithm finds a model at some threshold value of ε, i.e., ε_m. Since only about Kn subsets are checked in the restricted learning process, the time to find ε_m is O(Kn).

The restricted learning process can also be used to find an optimal model. To do so, we vary the ε value from 0 to the upper limit of the search scope, such as 0.8, in steps of 0.01. For each ε value, we train a model with the DFL algorithm and then validate its performance with cross validation.
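The following sketch illustrates this scan; the two callables stand in for running the DFL algorithm (restricted to its first-round search) and for the cross-validation run at a given ε, and are assumptions rather than actual DFLearner interfaces.

```python
def scan_epsilon(train_model, cross_validate, upper=0.80, step=0.01):
    """Sketch: scan epsilon over [0, upper] in steps of `step` and keep the value
    with the best cross-validation accuracy.

    train_model(eps)    -- returns a model, or None if none is found at this eps
    cross_validate(eps) -- returns the cross-validation accuracy at this eps
    """
    best_eps, best_acc = None, -1.0
    for i in range(int(round(upper / step)) + 1):
        eps = round(i * step, 2)
        if train_model(eps) is None:
            continue                      # no model found at this epsilon
        acc = cross_validate(eps)
        if acc > best_acc:
            best_eps, best_acc = eps, acc
    return best_eps, best_acc
```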

Second, the ε_m can also be found with a manual binary search method, as shown in Figure S1.6. Since ε \in [0,1), ε is specified to 0.5 in the first try. If the DFL algorithm finds a model with ε value of 0.5, then ε is specified to 0.25 in the second try. Otherwise, if the DFL algorithm can not find a model with a long time, like 10 minutes, then the DFL algorithm can be stopped and ε is specified to 0.75 in the second try. The selection process is carried out until the ε_m value is found so that the DFL algorithm can find a model with it but cannot when ε = ε_m - 0.01. This selection process is also efficient. Since ε \in [0,1), only 5 to 6 tries are needed to find the ε_m on the average.
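A sketch of this binary search is given below; model_found(ε) stands in for running the DFL algorithm by hand and reporting whether a model is found within the time limit, and the sketch assumes that a model found at some ε is also found at every larger ε.

```python
def binary_search_eps_min(model_found, step=0.01):
    """Sketch of the manual binary search for eps_m.

    model_found(eps) -- True if the DFL algorithm finds a model with this eps within
                        an acceptable time (say, 10 minutes), and False otherwise.
    Returns the smallest eps, at a resolution of `step`, for which a model is found,
    or None if no model is found anywhere in [0, 1).
    """
    lo, hi = 0, int(round(1.0 / step))    # search over integer multiples of step
    eps_m = None
    while lo < hi:
        mid = (lo + hi) // 2              # e.g. eps = 0.50 on the first try
        eps = round(mid * step, 2)
        if model_found(eps):
            eps_m, hi = eps, mid          # a model exists: search the lower half
        else:
            lo = mid + 1                  # no model: search the upper half
    return eps_m
```

Applied to the LED example below (where models exist for ε ≥ 0.26 but not for ε ≤ 0.25), this sketch converges to 0.26 in about seven evaluations, in line with the few tries reported for the manual procedure.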

We use the LED data set with 10 percent noise to illustrate the procedure. In the first try, the DFL algorithm finds a model with ε = 0.5. In the second try, the DFL algorithm cannot find a model with ε = 0.25. In the 3rd to 6th tries, the DFL algorithm finds models with the specified ε values 0.37, 0.31, 0.28 and 0.26. Since the second try has already shown that the DFL algorithm cannot find a model with ε = 0.25, 0.26 is the minimum ε value for this data set.

Figure S1. The manual binary search of minimum ε value. 

 

 

 
Figure S2. The selection of essential attributes with the DFL algorithm.

To find the best subset of EAs for the data sets, we first set the expected cardinality of the EAs, K, to 10. Next, we use the DFL algorithm to perform leave-one-out cross validation (LOOCV) on the training set (TR-C) with different ε values, from 0 to 0.8 in steps of 0.01. We find that the DFL algorithm reaches its best prediction performance in the LOOCV when ε is within [0.12, 0.13]. When all samples of the training data set are used, the distributions of the attributes are slightly different from those in the LOOCV. Thus, we try the DFL classifiers obtained over a wider region of ε, [0.1, 0.15]. Finally, we choose the DFL classifier obtained at ε = 0.11 because it shows overall better prediction accuracies for data sets 1 to 4. In this way, a subset of 4 features, {A(((, G·((, length basepair ratio, energy per nucleotide}, is chosen as the EAs for the human data sets, D1 to D4. For the other data sets, we try subsets of these 4 EAs and choose the subsets on which the 1NN algorithm introduced in Section 3.5 produces the best prediction performance, as sketched below. The selected EAs are shown in Figure 3(a).
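The subset selection for the other species can be sketched as follows, using scikit-learn's 1-nearest-neighbour classifier as a stand-in for the 1NN algorithm of Section 3.5; the feature matrices, label vectors and ASCII feature names are assumptions for illustration.

```python
from itertools import combinations

from sklearn.neighbors import KNeighborsClassifier

# The 4 EAs chosen on the human training data (ASCII names; "G.((" stands for G·(().
EAS = ["A(((", "G.((", "length_basepair_ratio", "energy_per_nucleotide"]

def best_ea_subset(X_train, y_train, X_test, y_test):
    """Try every non-empty subset of the 4 EAs and keep the one on which a
    1-nearest-neighbour classifier predicts the test set best.

    X_train, X_test -- numeric feature matrices whose columns follow the order of EAS
    y_train, y_test -- class labels (pre-miRNA vs. background)
    """
    best_subset, best_acc = None, -1.0
    for r in range(1, len(EAS) + 1):
        for cols in combinations(range(len(EAS)), r):
            clf = KNeighborsClassifier(n_neighbors=1)
            clf.fit(X_train[:, cols], y_train)
            acc = clf.score(X_test[:, cols], y_test)
            if acc > best_acc:
                best_subset, best_acc = [EAS[c] for c in cols], acc
    return best_subset, best_acc
```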

As shown in Figure S2, there is another region of ε, around 0.5, within which the DFL algorithm also achieves high prediction accuracy in cross validation. However, the DFL models use more features when ε decreases. Thus, we prefer the features chosen by the DFL algorithm when ε is in the region [0.1, 0.15].

 

Table S1. The summary of data sets.
    Data Set                          # of Samples   Class                  Species
0   TR-C (training data set)          163/168        pre-miRNA/background   human
1   TE-C1                             30             pre-miRNA              human
2   TE-C2                             1000           background             human
3   CONSERVED-HAIRPIN (T3)            2444           background             human
4   UPDATED (T4)                      39             pre-miRNA              human
5   Mus musculus (mmu)                36             pre-miRNA              other species
6   Rattus norvegicus (rno)           25             pre-miRNA              other species
7   Gallus gallus (gga)               13             pre-miRNA              other species
8   Danio rerio (dre)                 6              pre-miRNA              other species
9   Caenorhabditis briggsae (cbr)     73             pre-miRNA              other species
10  Caenorhabditis elegans (cel)      110            pre-miRNA              other species
11  Drosophila pseudoobscura (dps)    71             pre-miRNA              other species
12  Drosophila melanogaster (dme)     71             pre-miRNA              other species
13  Oryza sativa (osa)                96             pre-miRNA              other species
14  Arabidopsis thaliana (ath)        75             pre-miRNA              other species
15  Epstein Barr Virus (ebv)          5              pre-miRNA              other species
    Total (testing data sets 1-15)    4094

In this research, we use the data sets from the literature (Xue et al. 2005) to validate our approach, since it is valuable to compare against the published results. These data sets are summarized in Table S1. Data sets 0 to 4 are from human, and data sets 5 to 15 are from other species, as indicated by their names. Data set 0 is used as the training data set, and data sets 1 to 15 are used as testing data sets. In total, 4094 samples are used for testing, comprising 3444 background sequences and 650 pre-miRNAs.

The sequences of human pre-miRNAs are obtained from the miRNA Registry database (release 5.0) (Griffiths-Jones, 2004). The secondary structures of these 207 pre-miRNA sequences are predicted with RNAfold (Hofacker, 2003). Then, the 193 sequences with only 1 loop are chosen. Next, 163 of them are randomly selected as the positive samples of the training data set, i.e., TR-C in Table S1. The remaining 30 samples are used as the TE-C1 testing data set.

The background sequences in data set 2 are collected from protein coding regions (CDSs) according to the UCSC refGene annotation tables (Xue et al. 2005). The lengths of these sequences follow the same distribution as the human pre-miRNAs. RNAfold is also used to predict their secondary structures. Then, sequences with multiple loops, sequences with fewer than 18 base pairs, and sequences with free energy greater than -15 kcal/mol are removed. This leaves 8494 sequences in the data set. Among them, 168 are randomly selected as the negative samples of the TR-C data set, and another 1000, disjoint from the 168 used, are randomly chosen as the TE-C2 testing data set.
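A minimal sketch of these three filters is given below; it assumes that each candidate is available as an RNAfold-style dot-bracket structure together with its minimum free energy, and it omits the actual interaction with RNAfold.

```python
import re

def keep_candidate(dot_bracket, mfe, min_basepairs=18, max_energy=-15.0):
    """Apply the three filters described above to one candidate sequence.

    dot_bracket -- predicted secondary structure in dot-bracket notation
    mfe         -- predicted minimum free energy in kcal/mol
    """
    n_loops = len(re.findall(r"\(\.*\)", dot_bracket))  # hairpin loops: '(' dots ')'
    n_pairs = dot_bracket.count("(")                    # one '(' per base pair
    return n_loops == 1 and n_pairs >= min_basepairs and mfe <= max_energy

# Example: a single hairpin with 20 base pairs and -25.3 kcal/mol is kept.
# keep_candidate("(" * 20 + "...." + ")" * 20, -25.3)  -> True
```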

Data set 3 also consists of background sequences, which are retrieved from the region from position 56,000,001 to 57,000,000 of human chromosome 19 in the UCSC database (Karolchik et al. 2003). A window of 100 nucleotides is used to scan this region, and the sequences for which RNAfold (Hofacker, 2003) predicts a hairpin secondary structure are selected. This produces the 2444 background sequences of data set 3. Unlike data set 2, some sequences in data set 3 are likely to be true pre-miRNAs; in fact, data set 3 contains 3 known miRNAs (hsa-mir-99b, hsa-let-7e and hsa-mir-125a) (Xue et al. 2005).

Bentwich et al. (2005) reported 89 new pre-miRNAs, of which 1 has multiple loops and is removed. To remove similar sequences, BLASTCLUST with S = 80, L = 0.5 and W = 16 is applied to the remaining 88 sequences, and only one representative sequence from each cluster is kept. This produces 40 non-redundant sequences, which are further checked against the training data set. One of the 40 sequences has high similarity to the training data set and is removed. Finally, 39 sequences are retained as data set 4.

 

Table S2. The prediction accuracies of different classification algorithms on the data sets with all features.
Data Set SVM C4.5 kNN RIP.
TE-C (pre-miRNA) 100.0 100.0 100.0 93.3
TE-C (background) 96.3 91.7 93.6 89.8
CONSERVED-HAIRPIN 92.1 88.9 86.8 85.8
UPDATED 89.7 89.7 89.7 94.9
Mus musculus 94.4 83.3 91.7 88.9
Rattus norvegicus 76.0 76.0 80.0 76.0
Gallus gallus 84.6 92.3 69.2 92.3
Danio rerio 83.3 83.3 66.7 66.7
Caenorhabditis briggsae 95.9 89.0 93.2 91.8
Caenorhabditis elegans 86.4 74.5 81.8 78.2
Drosophila pseudoobscura 88.7 74.6 88.7 85.9
Drosophila melanogaster 91.5 70.4 88.7 83.1
Oryza sativa 99.0 97.9 93.8 99.0
Arabidopsis thaliana 97.3 92.0 96.0 94.7
Epstein Barr Virus 60.0 100.0 80.0 80.0
total 93.1 88.9 88.9 87.2

 

Table S3. The prediction accuracies of different classification algorithms on the data sets with 32 local features.
Data Set SVM$ C4.5 kNN RIP. Tri-SVM*
TE-C (pre-miRNA) 93.3 93.3 100.0 90.0 93.3
TE-C (background) 92.2 86.7 84.9 86.7 88.1
CONSERVED-HAIRPIN 88.4 78.2 80.1 79.6 89.0
UPDATED 92.3 87.2 89.7 92.3 92.3
Mus musculus 91.7 72.2 88.9 86.1 94.4
Rattus norvegicus 84.0 68.0 76.0 72.0 80.0
Gallus gallus 92.3 69.2 84.6 76.9 84.6
Danio rerio 83.3 66.7 83.3 83.3 66.7
Caenorhabditis briggsae 95.9 87.7 94.5 90.4 95.9
Caenorhabditis elegans 86.4 85.5 83.6 88.2 86.4
Drosophila pseudoobscura 88.7 81.7 90.1 88.7 90.1
Drosophila melanogaster 88.7 71.8 85.9 84.5 91.5
Oryza sativa 99.0 99.0 92.7 99.0 94.8
Arabidopsis thaliana 96.0 93.3 89.3 96.0 92.0
Epstein Barr Virus 100.0 80.0 80.0 80.0 100.0
total 89.9 81.4 82.7 83.0 89.1

$ For the data sets with local features, the prediction accuracies of the SVM algorithm in our study are slightly better than those reported in the literature (Xue et al. 2005). We attribute this to the encoded region used in our research: we encode the triplet local features over the whole pre-miRNA, except the first and last nucleotide, whereas only the paired regions of the pre-miRNAs are encoded into triplet features in (Xue et al. 2005). * The Tri-SVM column shows the results from the literature (Xue et al. 2005).
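The following sketch illustrates the triplet local-feature encoding discussed in this footnote, under the assumption that it follows the structure-sequence triplet scheme of Xue et al. (2005): for every position except the first and last nucleotide, the paired/unpaired states of the position and its two neighbours are combined with the identity of the middle nucleotide, giving 4 x 8 = 32 features such as A((( and G·((. The normalisation shown here is an assumption and may differ from that of the miREncoding software.

```python
from itertools import product

# All 32 triplet feature names: middle nucleotide x paired('(')/unpaired('.') pattern.
TRIPLET_FEATURES = [nt + "".join(p) for nt in "ACGU" for p in product("(.", repeat=3)]

def triplet_local_features(sequence, dot_bracket):
    """Count triplet structure-sequence features over the whole pre-miRNA,
    skipping the first and last nucleotide, and return normalised frequencies.

    sequence    -- the pre-miRNA sequence (A/C/G/U)
    dot_bracket -- its RNAfold dot-bracket structure; ')' is mapped to '('
    """
    states = dot_bracket.replace(")", "(")          # keep only paired vs. unpaired
    counts = {f: 0 for f in TRIPLET_FEATURES}
    for i in range(1, len(sequence) - 1):           # skip first and last nucleotide
        key = sequence[i] + states[i - 1:i + 2]
        if key in counts:                           # ignore characters outside A/C/G/U
            counts[key] += 1
    total = sum(counts.values()) or 1
    return {f: c / total for f, c in counts.items()}
```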

Table S4. The prediction accuracies of different classification algorithms on the data sets with 11 global features.
Data Set SVM C4.5 kNN RIP.
TE-C (pre-miRNA) 96.7 100.0 100.0 93.3
TE-C (background) 95.1 91.3 93.1 92.4
CONSERVED-HAIRPIN 88.3 87.2 84.2 87.8
UPDATED 89.7 94.9 92.3 92.3
Mus musculus 94.4 91.7 94.4 94.4
Rattus norvegicus 76.0 76.0 76.0 80.0
Gallus gallus 76.9 84.6 76.9 69.2
Danio rerio 66.7 50.0 66.7 66.7
Caenorhabditis briggsae 97.3 87.7 95.9 84.9
Caenorhabditis elegans 79.1 78.2 84.5 75.5
Drosophila pseudoobscura 85.9 81.7 85.9 73.2
Drosophila melanogaster 91.5 78.9 90.1 78.9
Oryza sativa 100.0 96.9 100.0 96.9
Arabidopsis thaliana 97.3 93.3 96.0 94.7
Epstein Barr Virus 40.0 80.0 40.0 80.0
total 90.3 88.1 87.5 88.4

 

Table S5. The prediction accuracies of different classification algorithms on the data sets with the features chosen by the DFL algorithm.
Data Set SVM C4.5 kNN RIPPER
1 TE-C1 96.7 96.7 96.7 96.7
2 TE-C2 94.2 92.5 94.5 95.2
3 T3 90.4 90.1 90.6 92.3
4 T4 94.9 92.3 97.4 94.9
5 mmu 94.4 94.4 94.4 94.4
6 rno 80.0 84.0 80.0 80.0
7 gga 92.3 100.0 100.0 100.0
8 dre 83.3 83.3 83.3 83.3
9 cbr 89.0 94.5 94.5 94.5
10 cel 88.2 96.4 96.4 96.4
11 dps 91.5 95.8 95.8 95.8
12 dme 88.7 93.0 93.0 93.0
13 osa 96.9 97.9 97.9 97.9
14 ath 100.0 100.0 100.0 100.0
15 ebv 100.0 100.0 80.0 80.0
human sensitivity 95.7 94.2 97.1 95.7
specificity 91.5 90.8 91.7 93.1
accuracy 91.6 90.9 91.8 93.2
other species sensitivity 91.9 95.7 95.5 95.4
total sensitivity 92.3 95.5 95.5 95.4
specificity 91.5 90.8 91.7 93.1
accuracy 91.6 91.6 92.3 93.5

 

References:

1. I. Bentwich, A. Avniel, Y. Karov, R. Aharonov, S. Gilad, O. Barad, A. Barzilai, P. Einat, U. Einav, E. Meiri, E. Sharon, Y. Spector, and Z. Bentwich. Identification of hundreds of conserved and nonconserved human micrornas. Nature Genetics, 37(7):766–70, 2005.

2. S. Griffiths-Jones. The microRNA Registry. Nucl. Acids Res., 32(90001):D109–111, 2004.

3. I.L. Hofacker. Vienna RNA secondary structure server. Nucl. Acids Res., 31(13):3429–3431, 2003.

4. D. Karolchik, R. Baertsch, M. Diekhans, T.S. Furey, A. Hinrichs, Y.T. Lu, K.M. Roskin, M. Schwartz, C.W. Sugnet, D.J. Thomas, R.J. Weber, D. Haussler, and W.J. Kent. The UCSC Genome Browser Database. Nucl. Acids Res., 31(1):51–54, 2003.

5. C. Xue, F. Li, T. He, G.-P. Liu, Y. Li, and X. Zhang. Classification of real and pseudo microRNA precursors using local structure-sequence features and support vector machine. BMC Bioinformatics, 6(1):310, 2005.

 

All rights reserved by Zheng Yun (2006)