Drosophila Cis-Regulatory Database
Accuracy of computational annotations
The computational method of TFBS annotation in uncharacterized CRMs is explained in the following paper:
Vipin Narang, Wing-Kin Sung and Ankush Mittal (2006) “Computational Annotation of Transcription Factor Binding Sites in D. melanogaster Developmental Genes,” 17th International Conference on Genome Informatics (GIW 2006), Yokohama, Japan, December 18-20, 2006.
The accuracy of the computational TFBS annotation was measured in terms of sensitivity, specificity and correlation coefficient, defined as follows.
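$$\mathrm{Sensitivity} = \frac{TP}{TP + FN}, \quad \mathrm{Specificity} = \frac{TN}{TN + FP}, \quad \mathrm{CC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FN)(TN + FP)(TP + FP)(TN + FN)}},$$
where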
TP = predicted TFBS overlaps actual TFBS,
FP = predicted TFBS overlaps actual non-TFBS,
TN = predicted non-TFBS overlaps actual non-TFBS,
FN = predicted non-TFBS overlaps actual TFBS
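As a minimal sketch of how these three measures can be computed from the four counts, the Python function below may help. The function name and the example counts are hypothetical, and the correlation coefficient is assumed here to take the standard Matthews form, which ranges from -1 to 1.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Return (sensitivity, specificity, correlation coefficient).

    Assumes at least one actual TFBS (tp + fn > 0) and at least one
    actual non-TFBS (tn + fp > 0).
    """
    sensitivity = tp / (tp + fn)  # fraction of actual TFBS that were predicted
    specificity = tn / (tn + fp)  # fraction of actual non-TFBS that were rejected
    denom = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    # Assumption: Matthews correlation coefficient; 0.0 is returned by
    # convention when the denominator vanishes.
    cc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sensitivity, specificity, cc


# Hypothetical counts, for illustration only (not taken from the database).
sn, sp, cc = classification_metrics(tp=40, fp=10, tn=90, fn=60)
print(f"sensitivity={sn:.3f} specificity={sp:.3f} cc={cc:.3f}")
```

A perfect predictor (no FP, no FN) gives a correlation coefficient of 1, and a predictor that is wrong on every item gives -1.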
In practical terms, sensitivity is the percentage of actual TFBS that were successfully predicted, specificity is the percentage of actual non-TFBS that were successfully rejected, and the correlation coefficient summarizes the balance between correct and incorrect predictions on a scale of -1 to 1. The overall classification performance is shown by the receiver operating characteristic (ROC) curve. The further the ROC curve deviates from the diagonal, the more accurately TFBS are detected and non-TFBS are rejected.
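To illustrate how an ROC curve is built, the following sketch sweeps a score threshold over a set of scored candidates and records the false positive rate (1 - specificity) against the true positive rate (sensitivity). The scores and labels are hypothetical, and this is a generic construction, not necessarily the procedure used for the database. The area under the curve is 0.5 along the diagonal and 1.0 for a perfect classifier.

```python
def roc_points(scores, labels):
    """Return ROC points as (false positive rate, true positive rate).

    scores: prediction scores, higher meaning more TFBS-like (hypothetical)
    labels: 1 for an actual TFBS, 0 for an actual non-TFBS
    Assumes both classes are present.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Accept every candidate tied at this score before emitting a point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def area_under_curve(points):
    """Trapezoidal area under the ROC curve."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical scores and labels, for illustration only.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1, 1, 0, 1, 0, 0]
curve = roc_points(scores, labels)
auc = area_under_curve(curve)
```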
The datasets for performance validation were extracted as follows. The overlapping set of 155 CRMs and 778 TFBS was used for training. The remaining 288 TFBS, which are not associated with any of the known CRMs, were used for validation. In addition, 100 sequence regions lying between adjacent CRMs were extracted. Since the 155 CRM annotations used for this purpose are of high quality, the extracted sequences can be regarded as non-CRMs with good certainty. They were used as a negative dataset to measure the rate of false positives in TFBS detection.
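The extraction of regions lying between adjacent CRMs can be sketched as a gap computation over sorted intervals. The tuple layout, the half-open coordinate convention and the example coordinates below are assumptions made for illustration; the source does not specify how the 100 regions were selected.

```python
def inter_crm_regions(crms):
    """Return the gaps between adjacent CRMs on the same chromosome.

    crms: list of (chrom, start, end) tuples with half-open coordinates
          (an assumed format, not the database's own).
    """
    by_chrom = {}
    for chrom, start, end in crms:
        by_chrom.setdefault(chrom, []).append((start, end))

    gaps = []
    for chrom, intervals in by_chrom.items():
        intervals.sort()
        prev_end = intervals[0][1]
        for start, end in intervals[1:]:
            if start > prev_end:
                # Sequence between two neighbouring CRMs.
                gaps.append((chrom, prev_end, start))
            # max() handles overlapping or nested CRM annotations.
            prev_end = max(prev_end, end)
    return gaps


# Hypothetical coordinates, for illustration only.
example_crms = [("2L", 100, 200), ("2L", 500, 650), ("2L", 180, 260), ("3R", 10, 50)]
negative_regions = inter_crm_regions(example_crms)
```

Overlapping CRM annotations are merged implicitly, so a gap is reported only where no CRM covers the sequence.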
In the dataset of 288 sequences containing TFBS, the computational annotation detected 143 TFBS correctly (sensitivity = 49.6%) with a high specificity of 95.4%. In the non-CRM sequences, an average of 4.9 TFBS predictions per 1000 bp of sequence was reported. This may be compared with the density of TFBS predicted in CRM regions, which is about 6.8 per 1000 bp. The high prediction accuracy of TFBS in the CRM sequences and the low false positive rate in non-CRM sequences support the approach. With sufficient confidence in the TFBS prediction accuracy, the set of 506 uncharacterized CRMs was annotated. A total of 9218 predictions were made, which amounts to an average of 7.96 TFBS per 1000 bp of sequence.
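The arithmetic behind these figures can be checked with a few lines of Python. The helper name is hypothetical, and the total sequence length is not stated in the text; it is only back-calculated here from the reported count and density, so it should be read as a rough estimate.

```python
def per_kb(count, total_bp):
    """Number of predictions per 1000 bp of sequence."""
    return 1000.0 * count / total_bp


# Sensitivity on the validation set: 143 of 288 TFBS detected.
validated, detected = 288, 143
sensitivity = detected / validated  # roughly 0.4965, reported as 49.6%

# Back-of-the-envelope estimate (not stated in the source): total length
# implied by 9218 predictions at 7.96 predictions per 1000 bp.
implied_bp = 9218 / 7.96 * 1000  # roughly 1.16 million bp

# Prediction density in non-CRM sequence relative to CRM regions.
density_ratio = 4.9 / 6.8  # roughly 0.72
```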
Figure: TFBS detection performance