Originally published in: "Neural Network Prediction of Translation Initiation Sites in Eukaryotes: Perspectives for EST and Genome Analysis". Proceedings of 5th International Conference on Intelligent Systems for Molecular Biology, 226-233, 1997.
# Instances: 13375
# Attributes: 927
# Classes: 2 (positive vs. negative)
Raw Data Source: Annotated vertebrate sequences
Extracted Samples: Positive samples, negative samples
ARFF-Formatted: In-frame 3-grams
Description: The original data consists of a selected set of vertebrate genomic sequences extracted from GenBank. It is used to find the translation initiation site (TIS), the position at which translation from mRNA to protein begins. Because only sequences with an annotated TIS are included in the data set, a classification model can be built to distinguish true (positive) TISs from false (negative) ones. Because the sequences are given as processed DNA rather than mRNA, the TIS appears as ATG (AUG in the mRNA). In total, there are 3312 sequences (i.e. 3312 true ATGs).
There are various ways to extract sequences and build a feature space. Here, we provide one approach: for each ATG, a window centered on that ATG is generated, extending 99 bases upstream and 99 bases downstream. Each window therefore contains 201 bases (99 + 3 + 99), each denoted by A, T, C or G. If the sequence ends before the window does, the missing positions are denoted by "N". With this strategy, we obtained 3312 true ATGs and 10191 false ATGs. For classification, a feature vector is generated for each such ATG segment using upstream in-frame 3-grams and downstream in-frame 3-grams.
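The window extraction and 3-gram step described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the code used to build the data set: the function names (`atg_windows`, `in_frame_3grams`), the `up_`/`down_` feature prefixes, and the choice to count 3-gram occurrences (rather than, say, record presence) are all hypothetical, and the actual encoding of the 927 ARFF attributes may differ.

```python
from collections import Counter

FLANK = 99  # bases kept on each side of the ATG, as in the description


def atg_windows(seq, flank=FLANK):
    """Yield (position, window) for every ATG in seq.

    Each window holds flank + 3 + flank bases; positions that fall
    outside the sequence are filled with "N".
    """
    seq = seq.upper()
    padded = "N" * flank + seq + "N" * flank
    start = seq.find("ATG")
    while start != -1:
        # In padded coordinates the ATG sits at start + flank,
        # so the window begins at index `start`.
        yield start, padded[start:start + 2 * flank + 3]
        start = seq.find("ATG", start + 1)


def in_frame_3grams(window, flank=FLANK):
    """Count non-overlapping 3-grams in frame with the central ATG.

    Assumption: features are occurrence counts, named with the
    hypothetical prefixes "up_" and "down_".
    """
    upstream = window[:flank]
    downstream = window[flank + 3:]
    features = Counter()
    # flank is a multiple of 3, so stepping by 3 from index 0 keeps
    # both flanks in the same reading frame as the ATG.
    for i in range(0, flank, 3):
        features["up_" + upstream[i:i + 3]] += 1
        features["down_" + downstream[i:i + 3]] += 1
    return dict(features)


if __name__ == "__main__":
    for pos, win in atg_windows("CCATGAAATGG"):
        print(pos, len(win), in_frame_3grams(win))
```

With 99 bases per flank, each window yields 33 upstream and 33 downstream in-frame 3-grams; labelling a window as positive or negative would additionally require the annotated TIS position for its source sequence.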
Adapted from the Kent Ridge Biomedical Data Set Repository created by Huiqing Liu and Jinyan Li. The ATG segment extraction and feature vector generation were done by Chuan Hock Koh. -Limsoon, 14/8/2017