Projects
Protein Flexibility Analysis Protein conformational changes play a critical role in vital biological functions. Because experimental data are noisy, determining salient conformational changes accurately and efficiently is a challenging problem. We have developed an efficient algorithm for analyzing the conformational changes of a protein, given its structures in two different conformations. A key element of the algorithm is a statistical test that determines the similarity of two protein structures in the presence of noise. Using data from the Protein Data Bank and the Macromolecular Movements Database, we tested the algorithm on proteins that exhibit a range of different conformational changes. The results show that our algorithm can reliably detect salient conformational changes, including well-known examples such as hinge and shear motions.
Stochastic Roadmap Simulation Many interesting properties of molecular motion are best characterized statistically by considering an ensemble of motion pathways rather than an individual one. Classic simulation techniques, such as the Monte Carlo method and molecular dynamics, generate individual pathways one at a time and are easily "trapped" in the local minima of the energy landscape. They are computationally inefficient if applied in a brute-force fashion to deal with many pathways. We introduce Stochastic Roadmap Simulation (SRS), a randomized technique for sampling molecular motion and exploring the kinetics of such motion by examining multiple pathways simultaneously.
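The idea of examining many pathways at once can be illustrated with a small sketch. It assumes that the roadmap is treated as a Markov chain whose nodes are sampled conformations, that transitions follow a Metropolis-style rule, and that the quantity of interest is the probability of reaching one set of nodes before another. The function names, the energy values, and the kT parameter are hypothetical, and this is not presented as the exact SRS formulation.

```python
import math

def transition_probs(energy, neighbors, kT=1.0):
    """Metropolis-style transition probabilities on a roadmap (assumed rule).

    energy:    dict node -> energy value
    neighbors: dict node -> list of adjacent nodes (node itself not included)
    """
    P = {}
    for i, nbrs in neighbors.items():
        row = {}
        for j in nbrs:
            accept = min(1.0, math.exp(-(energy[j] - energy[i]) / kT))
            row[j] = accept / len(nbrs)
        row[i] = 1.0 - sum(row.values())  # remaining mass stays at node i
        P[i] = row
    return P

def reach_first_prob(P, target, other, iters=10000, tol=1e-12):
    """Probability of reaching `target` before `other` from every node.

    Uses first-step analysis, q[i] = sum_j P[i][j] * q[j], solved by
    repeated sweeps. Every pathway through the roadmap contributes,
    without simulating the pathways one at a time.
    """
    q = {i: 0.0 for i in P}
    for t in target:
        q[t] = 1.0
    for _ in range(iters):
        delta = 0.0
        for i in P:
            if i in target or i in other:
                continue
            new = sum(p * q[j] for j, p in P[i].items())
            delta = max(delta, abs(new - q[i]))
            q[i] = new
        if delta < tol:
            break
    return q

# Toy roadmap: a chain 0 - 1 - 2 with equal energies.
energy = {0: 0.0, 1: 0.0, 2: 0.0}
neighbors = {0: [1], 1: [0, 2], 2: [1]}
q = reach_first_prob(transition_probs(energy, neighbors), target={2}, other={0})
```

On this toy chain the middle node is equally likely to reach either end first. Raising the energy of node 2 lowers that probability, which is the kind of ensemble property a single simulated trajectory cannot report directly.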
Algorithmic Research in Bioinformatics We are interested in the combinatorial and algorithmic aspects of bioinformatics problems. Our current research includes microarray probe design, pattern searching, genome annotation, and protein structure prediction. Some of our results: FindProbe, BAYESPROT, GeneNet, CSA and SAGA Programs, LGSFAligner (a tool to align two RNA secondary structures), and Similarity Search Using Spaced Seeds.
GENESIS: GENomE Sequencing, Indexing, and Searching Biological databases contain thousands of bio-sequences and an increasing number of 3D molecular structures. It is hard for a biologist to search them for the sequences or structures required for analysis, prediction, or reasoning. The data are so large that naive computer algorithms cannot do this job. We therefore need "smart" algorithms for both exact matching and similarity matching on sequences and 3D structures. Many techniques have been developed to solve these problems, but no perfect solution has emerged yet. Our objective is to develop an optimal solution, at least for some particular bioinformatics applications.
Management and Analysis of Gene Expression Data Gene expression data can be a valuable tool in the understanding of genes, biological networks, and cellular states. One ambitious goal in analyzing expression data is to determine how one particular gene is affected by the expression of other genes, thus deriving the gene network. Gene expression data can also be used to determine which genes are expressed as a result of certain cellular conditions. Such knowledge can help in disease diagnosis. Gene expression data typically contain a large number of columns (genes) and a small number of rows (samples). This high dimensionality presents great difficulties to existing mining algorithms. We are investigating a number of techniques for mining rules and patterns, identifying clusters, and classifying gene expression data.
Computational Systems Biology Computational Systems Biology involves studying cellular functions and their components at varying degrees of granularity. These levels range from nano-scale molecular structures (atomic level) to entire organs such as the heart and lungs (phenotype level). Our research focuses mainly on the functional aspects of cellular components, in the form of biopathways. We are especially interested in modeling and analyzing Signaling Pathways and Gene Regulatory Networks. Currently we have joint projects with the Genome Institute of Singapore and the Department of Biochemistry, NUS, modeling various pathways that are involved in important cell processes such as differentiation and apoptosis. Using these pathways as examples, we hope to develop a set of tools and a modeling methodology that produce accurate models that can be validated and used to predict new phenomena.
Pattern Spaces: Theory, Techniques, and Applications There is much previous data mining work on frequent itemsets, their closed patterns, and their generators. There are also a number of studies on emerging patterns and their borders. But the mining of odds ratio patterns, relative risk patterns, and patterns having other statistical properties frequently used in the analysis of biomedical data has never been investigated extensively, and these patterns deserve attention. In this project, we would like to (a) study in depth the theoretical properties of these patterns, (b) develop efficient algorithms for their mining, (c) develop efficient algorithms for their incremental maintenance when the underlying databases are updated, (d) investigate ways to build classifiers based on them and develop techniques for visualizing and explaining decisions made by such classifiers, and (e) apply them to biomedical data.
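For readers unfamiliar with the two statistics named above, the following minimal sketch shows how the odds ratio and relative risk of a single itemset pattern could be computed from a labeled transaction database. It uses the standard textbook definitions over a 2x2 contingency table. The function names and the toy data are hypothetical, and the sketch does not address the mining problem itself, which is the subject of the project.

```python
def contingency(transactions, labels, pattern):
    """Return the 2x2 table (a, b, c, d) for a pattern against binary labels.

    a: pattern present, case      b: pattern present, control
    c: pattern absent,  case      d: pattern absent,  control
    """
    pattern = set(pattern)
    a = b = c = d = 0
    for items, is_case in zip(transactions, labels):
        present = pattern.issubset(items)
        if present and is_case:
            a += 1
        elif present:
            b += 1
        elif is_case:
            c += 1
        else:
            d += 1
    return a, b, c, d

def odds_ratio(a, b, c, d):
    """Odds of being a case with the pattern over odds without it: (a*d)/(b*c)."""
    if b * c == 0:
        return float("inf") if a * d > 0 else float("nan")
    return (a * d) / (b * c)

def relative_risk(a, b, c, d):
    """Risk with the pattern, a/(a+b), over risk without it, c/(c+d)."""
    if a + b == 0 or c + d == 0 or c == 0:
        return float("nan") if a == 0 or a + b == 0 or c + d == 0 else float("inf")
    return (a / (a + b)) / (c / (c + d))

# Toy data: four records described by item sets, two of them cases.
transactions = [{"x"}, {"x"}, {"y"}, {"x", "y"}]
labels = [True, False, True, False]
table = contingency(transactions, labels, {"x"})
```

With a table of a=20, b=10, c=5, d=25, the odds ratio is 10 and the relative risk is 4, so the two statistics can rank the same pattern quite differently.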
Graph-Based Protein Function Prediction In this project, we investigate and develop graph-based methods for inferring protein functions without sequence homology. Most approaches to predicting protein function from protein-protein interaction data rely on the observation that a protein often shares functions with the proteins that interact with it (its level-1 neighbors). However, proteins that interact with the same proteins (i.e. level-2 neighbors) may also have a greater likelihood of sharing similar physical or biochemical characteristics. We are interested in finding out how significant the functional association between level-2 neighbors is and how it can be exploited for protein function prediction. We will also investigate how to integrate protein interaction information with other types of information to improve the sensitivity and specificity of protein function prediction, especially in the absence of sequence homology.
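The distinction between level-1 and level-2 neighbors can be made concrete with a short sketch. It assumes an undirected interaction graph given as a list of protein pairs. The simple weighted vote at the end, including the weights 1.0 and 0.5, is a hypothetical illustration of how both neighbor levels might contribute to a prediction, not the method developed in the project.

```python
from collections import Counter, defaultdict

def build_graph(edges):
    """Undirected adjacency sets from a list of (protein, protein) pairs."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-interactions
            adj[u].add(v)
            adj[v].add(u)
    return adj

def level1(adj, p):
    """Proteins that interact directly with p."""
    return set(adj.get(p, ()))

def level2(adj, p):
    """Proteins that share at least one interaction partner with p.

    A protein may be both a level-1 and a level-2 neighbor of p.
    """
    found = set()
    for n in adj.get(p, ()):
        found |= adj[n]
    found.discard(p)
    return found

def vote(adj, annotations, p, w1=1.0, w2=0.5):
    """Score candidate functions for p by weighted neighbor counting.

    w1 and w2 are arbitrary illustrative weights. Proteins that are only
    level-2 neighbors contribute w2, direct partners contribute w1.
    """
    scores = Counter()
    n1 = level1(adj, p)
    for q in n1:
        for f in annotations.get(q, ()):
            scores[f] += w1
    for q in level2(adj, p) - n1:
        for f in annotations.get(q, ()):
            scores[f] += w2
    return dict(scores)

edges = [("A", "B"), ("B", "C"), ("C", "D"), ("A", "E")]
adj = build_graph(edges)
annotations = {"B": {"kinase"}, "C": {"kinase"}, "E": {"transport"}}
scores = vote(adj, annotations, "A")
```

Here protein C never interacts with A directly, yet it raises the score of "kinase" for A because both interact with B.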
Increasing Confidence of Protein Interactomes Progress in high-throughput experimental techniques in the past decade has resulted in a rapid accumulation of protein-protein interaction (PPI) data. However, recent surveys reveal that interaction data obtained by popular high-throughput assays such as yeast-two-hybrid experiments may contain as much as 50% false positives and false negatives. As a result, carefully focused small-scale experiments are often needed to complement the large-scale methods and validate the detected interactions. The vast interactomes, however, require much more scalable and inexpensive approaches. It would therefore be useful if the list of protein-protein interactions detected by such high-throughput assays could be prioritized in some way. This project explores advances in computational techniques for assessing the reliability of protein-protein interactions detected by high-throughput methods, especially techniques that rely only on the topological information of the protein interaction network derived from such experiments.
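As an illustration of what a purely topological reliability score can look like, the sketch below ranks reported interactions by the Jaccard overlap of the two proteins' other interaction partners. This particular index is an assumption chosen for simplicity, not necessarily one of the measures studied in the project, and the function names are hypothetical.

```python
from collections import defaultdict

def build_adjacency(edges):
    """Undirected adjacency sets from a list of reported interactions."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj

def shared_neighbor_score(adj, u, v):
    """Jaccard overlap of the partners of u and v, excluding u and v themselves.

    Intuition (an assumption of this sketch): a reported interaction is more
    credible when the two proteins have many partners in common.
    """
    nu = adj[u] - {v}
    nv = adj[v] - {u}
    union = nu | nv
    return len(nu & nv) / len(union) if union else 0.0

def rank_interactions(edges):
    """Return (score, u, v) tuples, most reliable first under this index."""
    adj = build_adjacency(edges)
    scored = [(shared_neighbor_score(adj, u, v), u, v) for u, v in edges]
    return sorted(scored, key=lambda t: (-t[0], t[1], t[2]))

edges = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")]
ranked = rank_interactions(edges)
```

In this toy network the interaction between C and D ranks last because D has no other partners, so it would be a natural first candidate for experimental validation.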
Recognition of MicroRNA Precursors and Targets While the first miRNAs were discovered using experimental methods, experimental miRNA identification remains technically challenging and incomplete. This calls for the development of computational approaches to complement experimental approaches to miRNA gene identification. We propose in this project to investigate de novo miRNA precursor prediction methods. We follow the "feature generation, feature selection, and feature integration" paradigm of constructing recognition models for genomic sequences. We generate and identify features based on information in both primary sequence and secondary structure, and use these features to construct decision models for the recognition of miRNA precursors. In addition, analyzing the binding of miRNAs to their mRNA target sites reveals that many different factors determine what constitutes a good fit. We thus intend to investigate these factors in detail and to construct decision models for predicting miRNA targets. Finally, we would like to understand the role of miRNAs in a number of human diseases. In particular, we plan to begin our analysis with genes involved in muscular dystrophy, as this group of genes is among the largest and most complex-structured of human genes.
Ligand Binding to PXR The human pregnane X receptor (hPXR) is a nuclear receptor that binds to various ligands, regulating the breakdown of drugs in the human body. To study drug-drug interactions, we investigated a method for predicting potential ligand binding conformations in the binding pocket of hPXR.
RNA Structure Prediction and Comparison This project is concerned with RNA structure prediction and comparison. For a given RNA sequence, there may be more than one structure with the optimum free energy, or there may be many structures within 5% to 10% of the minimum free energy, and these may be topologically very different. Inferring which structure is truly representative of the natural structure requires additional information. When a set of homologous sequences is assumed to have a certain structure in common, this structure can be deduced by comparing the structures possible from their sequences. The assumption is very reasonable, since it is unlikely that a molecule will undergo a total change in structure during its evolution and still be functional. The process of random mutation and selection "tests" a large range of possible sequences, and those that do not have the functional structure necessary for survival are discarded. The resulting model is corroborated by every newly determined sequence that fits the perceived structure.
Tools for Design of Microarray and Analysis of Gene Expression Profiles for Disease Diagnosis and Prognosis The development of microarray technology has made possible the simultaneous monitoring of the expression of thousands of genes. This development offers great opportunities in advancing the diagnosis of diseases, the treatment of diseases, and the understanding of gene functions. This project aims to: (1) develop technologies for the design of microarrays; (2) develop tools for the analysis of gene expression profiles, especially for optimization of disease treatment; and (3) apply these tools for optimization of disease treatment, with childhood leukemias as the initial area.
DNA Feature Recognition Correct prediction of transcription start sites, translation initiation sites, gene splice sites, poly-A sites, and other functional sites from DNA sequences is an important issue in genomic research. In this project, we investigate these prediction problems using the paradigm of "feature generation, feature selection, and feature integration". There are two reasons for our interest in such a paradigm. The first reason is that standard toolboxes can be identified and used for each of the three components. For example, any statistical significance test can be used for feature selection. Similarly, any machine learning method can be used for feature integration. The main challenge is in developing a "standard" toolbox for feature generation suitable for DNA functional sites. The second reason is that features that are critical to the recognition of specific DNA functional sites are explicitly generated and selected in this paradigm. This explicitness is helpful in understanding the underlying biological mechanism of that DNA functional site.
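The three components of the paradigm can be sketched end to end in a few lines. Everything specific here is an assumption made for illustration: k-mer counts stand in for feature generation, a ranking by difference of mean counts stands in for a proper statistical significance test, and a linear score stands in for a machine learning method. The sequences and function names are hypothetical.

```python
from itertools import product

def kmer_features(seq, k=2):
    """Feature generation: count every k-mer over the alphabet ACGT."""
    counts = {"".join(p): 0 for p in product("ACGT", repeat=k)}
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:  # skip k-mers containing other symbols, e.g. N
            counts[kmer] += 1
    return counts

def mean_counts(seqs, k=2):
    """Average k-mer counts over a non-empty list of sequences."""
    totals = {}
    for s in seqs:
        for f, c in kmer_features(s, k).items():
            totals[f] = totals.get(f, 0) + c
    return {f: c / len(seqs) for f, c in totals.items()}

def select_features(pos, neg, top_n=3, k=2):
    """Feature selection: keep the k-mers whose mean counts differ most.

    This ranking is a simple stand-in for a statistical significance test.
    Returns a dict mapping each selected k-mer to its signed difference.
    """
    mp, mn = mean_counts(pos, k), mean_counts(neg, k)
    diffs = {f: mp[f] - mn[f] for f in mp}
    ranked = sorted(diffs, key=lambda f: (-abs(diffs[f]), f))
    return {f: diffs[f] for f in ranked[:top_n]}

def score(seq, weights, k=2):
    """Feature integration: linear combination of the selected features."""
    feats = kmer_features(seq, k)
    return sum(w * feats[f] for f, w in weights.items())

# Toy windows: positives contain ATG, negatives do not.
pos = ["ATGATG", "CATGCC"]
neg = ["CCCCCC", "GGGCCC"]
weights = select_features(pos, neg)
```

Because the selected k-mers and their weights are explicit, one can read off which sequence features drive the decision, which is the explicitness the paradigm is meant to provide.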
The Protein Interaction Extraction System (PIES) A large part of the information required for biology research can only be found in free-text form, as in MEDLINE abstracts, or in comment fields of relevant reports, as in GenBank feature table annotations. This information is important for many types of analysis, such as classification of proteins into functional groups, discovery of new functional relationships, maintenance of information on material and methods, extraction of protein interaction information, and so on. However, information in free-text form is very difficult for automated systems to use. The project investigates techniques and applications of natural language processing to the extraction of biological information from free text.
The Kleisli Query System Kleisli is a data transformation and integration system that can be used for any application where the data is typed, and has proven especially useful for bioinformatics applications. It extends the conventional flat relational data model supported by the query language SQL to a complex object data model supported by the collection programming language CPL. It also opens up the closed nature of commercial relational data management systems to an easily extensible system that performs complex transformations on autonomous data sources that are heterogeneous and geographically dispersed.