Research Areas
    Activities
   Conferences Hosted by SoC
  
Professional Services
    Research Funding
    Research Publications
    Internship
    Useful Links

 

  Home > Research
   
  Computational Biology
 

The frontier of medical science has rarely been as exciting and full of opportunities as it is today. From basic science to clinical research to health services research, the opportunities made available through impressive recent advances in the biomedical, physical, and computational sciences have brought us to a place of unprecedented opportunities. It is also now widely appreciated that present-day biomedical researchers are confronted by vast amounts of data from genome sequencing, microscopy, high-throughput analytical techniques for DNA, RNA and proteins, and a host of other new experimental technologies. Coupled with enormous advances in computing power, this flow of information should enable scientists to model and understand biological systems in novel ways.

Over the last decade, however, the community has come to the realisation that, to make effective use of the explosion of data, new challenges must be faced especially in the development and application of computational analysis and knowledge discovery technologies and methodologies. The SoC Computational Biology Lab was thus established. The research of our lab will benefit biological and clinical researchers by enabling extraction and inference of new knowledge and value-added information from diverse biomedical data. Our research projects and activities are described below.


Protein Structure Comparison and Database Search

Analysis of three-dimensional (3D) protein structures plays an important role in bioinformatics. Since the functions of a protein is more closely related to its 3D structure than its amino acid sequence, the study of proteins from the structural perspective can give us more valuable information about their functions.

Here, we have two main objectives: (a) develop efficient and effective methods to compare a pair of 3D protein structures, and (b) develop efficient and effective methods to search a database of 3D protein structures. The “efficiency” aspect deals with the speed of retrieval. The “effectiveness” aspect deals with the accuracy of the methods.

We have designed a fine-grained protein structure alignment method named MatAlign. It is a two-step algorithm. First, we represent 3D protein structures as 2D distance matrices, and align these matrices by means of dynamic programming in order to find the initially aligned residue pairs. Second, we refine the initial alignment iteratively into the optimal one according to an objective scoring function. We have implemented MatAlign and evaluated it against two widely used structural comparison tools, DALI and CE. On the benchmark set of 68 protein structure pairs by Fischer et al., MatAlign provides better alignment scores, according to four different criteria, than both DALI and CE in a majority of cases. MatAlign is about four times faster than DALI, and has about the same speed as CE. The current implementation runs on Win32, and the executables can be downloaded from http://www1.i2r.a-star.edu.sg/~azeyar/genesis/MatAlign/MatAlign_v11_Windows.zip.

We have also proposed a rapid protein structure database retrieval system named ProtDex2, in which we adopt the techniques used in information retrieval (IR) systems in order to perform rapid database search without having to access every structure in the database. The retrieval process is based on the inverted index constructed on the feature vectors of the relationships between the secondary structure elements (SSEs) of all the protein structures in the database. Experimental results show that ProtDex2 is very much faster than two commonly used detailed structural comparison methods, DALI and CE, yet does not sacrifice too much accuracy of comparison. When comparing with a fast database scan method, Topscan, ProtDex2 is much faster and also slightly more accurate. The software is available at http://www1.i2r. a-star.edu.sg/~azeyar/genesis/ProtDex2.


Protein Structure and Protein Motion

Protein conformational flexibility plays a critical role in vital biological functions, such as immune protection and enzymatic catalysis. An example is the “flap” motion of HIV protease, a major inhibitory drug target for AIDS therapy. The flaps, located near the reactive site of HIV protease, must open to allow a ligand to bind and then close to form direct contacts with the ligand. Such motion is an essential means for proteins to perform their functions, and therefore provides an important link between structure and function, a central relationship in molecular biology.

We envision that future protein databases will contain not only static structures of proteins, as the Protein Data Bank (PDB) does currently, but also the motion of proteins. Users can submit queries on protein motions to study the relationship between structure and function, just as they submit queries on sequences and static structures today. Such information will greatly enhance our ability to understand and predict ligand-protein and protein-protein interactions during processes such as for pharmaceutical drug design. Towards this goal, we are developing geometric computation tools to explore, analyse and model protein conformational flexibility.

We have developed a new method called Stochastic Roadmap Simulation (SRS) for studying protein folding kinetics and have used it to computationally predict experimental quantities in protein folding. Interesting properties of molecular motion are best characterised statistically by considering an ensemble of motion pathways rather than an individual one. Classic simulation techniques, such as the Monte Carlo method and molecular dynamics, generate individual pathways one at a time and are easily “trapped” in the local minima of the energy landscape. They are computationally inefficient if applied in a brute-force fashion to deal with many pathways. SRS uses a randomised technique for sampling molecular motion and exploring the kinetics of such motion by examining multiple pathways simultaneously.

We have developed an algorithm and the related software called pFlexAna for detection of protein conformational changes from experimental data obtained through, for example, X-ray crystallography. Due to noise in the data, determining salient conformational changes accurately and efficiently is a challenging problem. A key element of pFlexAna is a statistical test that determines the similarity of two protein structures in the presence of noise. Using data from the Protein Data Bank and the Macromolecular Movements Database, we tested the algorithm on proteins that exhibit a range of different conformational changes. Results show that our algorithm can reliably detect salient conformational changes, including wellknown examples such as hinge and shear.

With collaborators from Yong Loo Lin School of Medicine in NUS, we are studying the protein-protein interaction between Bcl-2 and Rac1.

Increasing Reliability of Protein Interactomes

Progress in high-throughput experimental techniques in the past decade has resulted in the rapid accumulation of protein-protein interaction (PPI) data. However, recent surveys reveal that interaction data obtained by popular high-throughput assays such as yeast-two-hybrid experiments may contain as much as 50% false positives and false negatives. As a result, further carefully focused small-scale experiments are often needed to complement the large-scale methods to validate the detected interactions. However, the vast interactomes require much more scalable and inexpensive approaches.  

Here, we have three main objectives: (a) identify properties that characterise true-positive and falsepositive PPIs, (b) develop efficient and effective methods for assessing the reliability of PPIs reported in high-throughput assays, and (c) develop efficient and effective methods for identifying false-negative PPIs and new protein complexes from high-throughput assays.

We have developed several methods for assessing the reliability of a PPI, given a graph derived from highthroughput protein interaction experiments. The pairs of “interacting” proteins that are ranked highly by these methods are shown more likely to be true-positive interacting pairs. The most interesting feature of these methods is that they are able to rank the reliability of an interaction between a pair of proteins using only the topology of the interactions between that pair of proteins and their neighbours within a short radius.

Our methods can be roughly divided into two groups. The first group is represented by the functional similarity weighting (FS-Weight) and CD Distance indices. This group of indices attempt to provide abstract mathematical characterisations of networks of reliable protein-protein interactions. They are based on the hypothesis that true-positive interactions are likely to be characterised by dense cross connections in the derived interaction graph. The second group is represented by the meso-scale motifs (NeMoFinder) indices. This group of indices attempt to provide explicit motifs of network connections that are associated with reliable protein interactions.

The NeMoFinder, CD Distance, and FS-Weight are far superior (30-50% better correlated with function homologeneity, localisation coherence, and gene expression correlation) than previous methods (e.g., interaction generality index) for detecting false positives in protein interaction experiments. CD Distance and FS-Weight are also very fast to compute. Furthermore, with small modifications such as incorporation of interacting motif pairs, these indices can also be used for false negative detection with high confidence.

Protein Function Prediction

Although sequence similarity search has proven useful for protein function prediction in many cases, it has fundamental limitations. First, only a fraction of newly discovered sequences have identifiable homologous genes in the current databases. Second, the most prominent vertebrate organisms in GenBank have only a fraction of their genomes present in finished sequences. New bioinformatics methods should allow inference of protein function using “associative analysis” of functional properties to complement traditional sequence homology-based methods. Associative properties that have been used to infer function not evident from sequence homology include: cooccurrence of proteins in operons or the genome context, proteins sharing common domains in fusion proteins, proteins in the same pathway, proteins with correlated gene expression patterns, etc.

In this project, we investigate and develop methods for inferring protein functions without sequence homology. In particular, (a) we find out how significant functional association between level-2 neighbours is, (b) we investigate how they can be exploited for protein function prediction in a graph-based framework, and (c) we investigate how to integrate protein interaction information with other types of information to improve the sensitivity and specificity of protein function prediction in a graphbased framework. At the end of the project, we expect to have developed a robust and powerful system to predict protein functions, even in the absence of sequence homology.

We have developed FS-Weight, a method for assigning functions to a protein based on the functions of direct and indirect interaction partners of that protein. The FS-Weight method predicts the functions of a protein in two steps: (1) It assigns a weight (the FWWeight) to each of its level-1 and level-2 neighbours by estimating its functional similarity with the protein using the local topology of the interaction network as well as the reliability of experimental sources. (2) It scores each function based on its weighted frequency in these neighbours. FS-Weight makes predictions with better precision and recall compared to other protein interaction based methods (e.g., Neighbour Counting and Chi-Square) for seven model genomes we have tested. FS-Weight is also robust in the presence of noise in the protein interaction data, maintaining consistent prediction performance even when a large number of interactions are randomly added to or deleted in the interaction data.

We have also proposed a method called LaMoFinder to annotate network motifs with biological information associated with the proteins in a protein-protein interaction network. Our method is specifically devised to handle a large labelling space and the sophisticated Gene Ontology scheme in which proteins have been annotated. As a result, we have captured not only the topological shapes of the motifs, but also the biological context in which they occur in the labelled network motifs. We have also demonstrated how the network motifs labelled by LaMoFinder can be used to predict the functions of unknown proteins in the PPI network. The superior performance of our method against other current prediction methods confirms that network motifs are indeed adequately enriched by LaMoFinder for more sophisticated biological applications such as protein function prediction.We have also developed a technique for identifying potential annotation errors in protein sequence databases. These databases are crucial to protein function prediction that uses the principle of “guilt by association” in sequence similarity. However, these databases are known to contain many annotation errors. We have derived a statistical method to filter and assess the reliability of data from these databases. Our experiments show that we can effectively detect mis-annotated sequence data.

Identifying Functional Elements in Human and Other Genomes

Interactions between macromolecules play many essential roles, e.g., metabolic reactions and signal transduction. The interactions can also occur in many combinations, such as protein-protein, protein-DNA and protein-RNA. Protein interactions with DNA and RNA are the primary mechanisms for controlling gene expression. What is needed is a recognition code that maps from the protein sequence to a pattern that describes the family of DNA binding sites – the functional elements. Identification of functional elements in the human genome is fundamental to our understanding of cell functions including how these codes orchestrate the complex network of gene transcription, the transcriptome, and interactions in distinct locations.

We have two objectives here: (a) develop methods for accurate identification of transcription factor binding sites and other regulatory sites, and (b) develop methods for inferring the interactions of transcription factors and other functional elements.

We have developed an algorithm called LocalMotif to detect localised motifs in long regulatory sequences. A novel score function predicts the region of localisation of the motif. This score is combined with other scoring measures including Z-score and relative entropy to detect the motif. The algorithm is optimised for fast processing of long regulatory sequences. Tests on simulated and real datasets confirm that LocalMotif accurately determines the region of localisation of motifs and automatically discovers the biologically relevant motifs, which can be detected by other motiffinding algorithms only when the search is restricted to the relevant interval.

Management and Analysis of Gene Expression Data

The development of microarray technology has made possible the simultaneous monitoring of the expression of thousands of genes. This development offers great opportunities in advancing the diagnosis of diseases, the treatment of diseases, and the understanding of gene functions.

Our main goals are: (a) develop techniques to derive gene regulatory networks from gene expression data, (b) develop technologies for the design of microarrays, and (c) develop tools for the optimisation of disease treatment based on gene expression profiles.

We have developed a new conditional dependence learning algorithm to learn a gene regulatory network from gene expression data. Besides the pairwise correlation of the gene expression profiles of a pair of genes, our algorithm considers three additional factors: (a) collaboration among regulators, (b) formation of regulatory complex, and (c) variable time delay to learn the gene network. Experiments on both artificial and real-life gene expression datasets validate the effectiveness of the algorithm.

We have also proposed an efficient algorithm to identify time-lagged co-regulated gene clusters. The algorithm facilitates localised comparison and processes several genes simultaneously to generate detailed and complete timelagged information for genes/gene clusters. Our results show that the algorithm is not only efficient, but also delivers more reliable and detailed information on time-lagged co-regulation between genes/gene clusters, compared to existing methods such as the event method and the edge detection method.

Analysing Mass Spectra for Identifying Proteins

Proteomics is useful for understanding the expression of proteins in cells at different levels, at different time points, and in different forms. Such an understanding is critical to drug discovery and medical advances. Mass spectrometers are the predominant tool to accomplish some of the primary goals of proteomics.

For example, identification of proteins, determination of expression level of proteins, and determination of post-translational modifications, sites and types. Due to limitations in mass spectrometers and the associated software tools, most of the MS/MS data generated by mass spectrometers are rejected because they are not interpretable by currently available software. Furthermore, the remaining data usually contain many false positives. In addition, the sensitivity and precision of mass spectrometers vary greatly.

Our objective here is two-fold: (a) develop efficient, accurate and robust algorithms for protein identification from tandem mass spectra, and (b) develop efficient, accurate and robust algorithms for alignment of noisy mass spectra.

Most peptide sequencing algorithms currently handle spectra of charge 1 or 2, and have not been designed to handle higher-charge spectra. We have recently given a characterisation of multi-charge spectra by generalising existing models and analysed spectra with charges up to 5. Our analysis shows that higher charge peaks are present and they contribute significantly to prediction of a complete peptide. They also help explain why existing algorithms do not perform well on multi-charge spectra.

We have further proposed an algorithm, GST-SPC, for de novo sequencing of multi-charge mass spectra. It computes a peptide sequence that is optimal with respect to shared peaks count from among all sequences that are derived from multi-charge strong tags. GST-SPC uses a larger set of multi-charge strong tags than existing methods, which improves the theoretical upper bound on performance. Experimental results confirm the improvement.

Annotating the Genome Using Paired-End-diTag

Annotating the whole genome is an expensive and time-consuming task. Recently, a new technology called Paired-End-diTag (PET) has been developed. This technology allows us to find the 20 bps of the two ends of genes/transcripts in a high-throughput and cost-effective way. When used with chromatin immunoprecipitation (ChIP), it can also precisely extract PETs surrounding all transcription factor binding sites in the whole genome. Since PETs are short and the genome is big, the bioinformatics challenge is to have a method that can align the PETs onto the genome accurately and efficiently.

Our main aim is to develop a tool to help biologists process PET data. The tool should be accurate and efficient. In addition, cancer cell-lines may generate some abnormal genes known as fusion genes due to genome rearrangement in cancer cells. Our second aim is to discover these genes.

We have developed a mapping algorithm that can efficiently and accurately align PETs onto the genome. This allows us to discover numerous novel genes and novel transcription factor binding sites in the human genome and the mouse genome. In addition, we have discovered a number of fusion genes in cancer cells. To help biologists, our algorithm is implemented as a pipeline PET-Tool that allows biologists to manage and process PET sequence data.


RNA Secondary Structure Prediction

Due to advances in sequencing technologies, many RNA sequences have been discovered. Moreover, only a few of their structures have been deduced. On the other hand, the chemical and biological properties of many RNAs (such as tRNAs) are determined primarily by their secondary structures. Therefore, determining the secondary structures of RNAs is becoming one of the most important topics in bioinformatics.

We consider two problems related to RNA: (a) inferring the RNA secondary structure of an RNA sequence by comparing it with another RNA sequence whose secondary structure is known, and (b) extracting local sub-structures shared by two RNA secondary structures, including where the sub-structure may be the functional part of the two RNAs. We expect two RNAs having a similar function to generally have a similar structure. We have devised a polynomial time dynamic programming algorithm to infer the secondary structure for an RNA sequence based on the known secondary structure of another functionally similar RNA.

For the local sub-structure problem, we propose modelling the local structural motif as a local gapped subforest. Our model is general enough to describe a structural motif such as the SECIS-motif. We have further proposed a polynomial time algorithm to compute the local gapped subforest of two RNA secondary structures.


Recognition of MicroRNA Precursors and their Targets

MicroRNAs (miRNAs) are small non-coding RNAs of about 22 nucleotides long. They play important regulatory functions in eukaryotic gene expression through mRNA degradation or translation inhibition. The regulatory functions of miRNAs range from cell proliferation, fat metabolism, neuronal patterning in nematodes, neurological diseases, modulation of hematopoietic lineage differentiation in mammals, development, cell death, cancer and control of leaf and flower development in plants.

Our objectives are as follows. (a) Experimental miRNA identification is technically challenging and incomplete. This calls for the development of computational approaches to complement experimental approaches to miRNA gene identification. We propose in this project to investigate de novo miRNA precursor prediction methods. (b) Analysing the binding of miRNAs to their mRNA target sites reveals that many different factors determine what constitutes a good fit. We intend to investigate these factors in detail and to construct decision models for predicting miRNA targets. (c) Finally, we would like to understand the role of miRNAs in a number of human diseases.

We follow the “feature generation, feature selection, and feature integration” paradigm of constructing recognition models for genomics sequences. We have generated and identified features based on information in both primary sequence and secondary structure, and used these features to construct accurate decision models for the recognition of miRNA precursors.

Sequence Indexing and Database Search

One of the important daily tasks of biologists is performing homology search. Currently, heuristic methods such as BLAST are used for this task. Their running time is linear to the size of the database (for instance, the human genome is of size 3 billion bases). As typical sequence databases rapidly increase in size, the performance of the existing tools is set to deteriorate. Thus, it is important to have solutions to improve the performance of database search.

Our aim is to improve the performance of database search by utilising sequence indexing techniques. We consider two directions. The first direction is to create a compressed index. The second direction is to rely on disk-based indexing. For compressed indexing, we have devoted our effort on the compressed suffix array, which is a compressed index of biological sequences. We have developed a number of approximate string matching algorithms utilising the compressed suffix array to speed up database search. Our solution has shown advantages in specialised homology search in the biology domain. For disk-based indexing, we have developed the CPS-tree, which is a compact partitioned suffix tree on disk. For exact pattern search, the performance of the CPS-tree is very good and the pattern search time is independent of the genome length.

Tag SNP Selection and Association Study

An SNP (Single Nucleotide Polymorphism) is a specific location in our genome where different people have different DNA bases. Millions of SNPs in human beings account for the major differences among different individuals. Studying SNPs is helpful to understanding genetic diseases.

There are a number of bioinformatics problems related to SNPs. We target two particular problems: (a) the tag SNP selection problem, and (b) association study. Since there are millions of SNPs, it is difficult to investigate all of them. The tag SNP selection problem is the selection of a subset of informative SNPs to represent the major variations among individuals. Given a set of SNPs for a set of individuals, an association study is the finding of a combination of SNPs which has the potential to cause a genetic disease.

We have developed a method for tag SNP selection. Our method efficiently selects a minimum set of tag SNPs that maximises the independence among the selected SNPs. We have also developed a method for association study. Unlike most of the previous methods which perform single SNP association, our method considers the association of variable-size haplotypes. Through regularised regression analysis, we tackle the problem of multiple degrees of freedom in haplotype test. Our method can handle a large number of haplotypes in association analyses more efficiently and effectively than currently available approaches can.

Phylogenetic Tree and Network Reconstruction

A phylogenetic tree, also known as a cladogram or a dendrogram, is a tree describing several life forms and their relations. A phylogenetic network is a generalisation of a phylogenetic tree. In addition to the ancestral-descendent mutational relationship, phylogenetic networks capture evolutionary events such as horizontal gene transfer and hybridisation. Constructing a phylogenetic tree/network is helpful to the understanding of the history of life and the analysis of rapidly mutating viruses such as HIV. The phylogenetic tree is also an important tool in comparative genomics, where the comparison of multiple species can help predict protein structure, gene expression pattern, etc. It is also relevant as a tool for reconstructing phylogeny.

Our aim is to develop methods for constructing phylogenetic trees and networks. In particular, we are interested in approaches based on combining trees.

We have developed accurate and efficient methods for constructing phylogenetic trees and networks. For phylogenetic networks, we provide the first polynomial time algorithm for combining a dense set of triplets into a galled phylogenetic network. Also, we have generalised the method to reconstruct a network by combining a set of phylogenetic trees. For phylogenetic trees, we study the supertree problem. We have a fixed-parameter polynomial time algorithm to construct a maximum agreement supertree of a set of phylogenetic trees.

Developing Bio-Dynamical Systems and Analysing the Regulatory Behaviour of Biological Pathways

Mathematical modelling has been increasingly recognised as a useful tool in the biomedical sciences. Biological models provide a coherent framework for interpreting data. They can uncover new phenomena to explore, identify key factors of a system, link different levels of details, enable the formalisation of intuitive understanding, screen unpromising hypotheses, predict variable inaccessible to measurement, and expand the range of questions that can meaningfully be asked.

We focus on the study of bio-dynamical systems, which are mathematical models of the dynamic behaviours of biological systems. Specifically, we intend to develop efficient computational representations, analysis techniques, and algorithms for modelling and reasoning about bio-dynamical systems. Our goal is to provide computational tools that enable biologists to design experiments more effectively and thus accelerate the process of biological discovery.

We have developed a decompositional approach to parameter estimation for bio-dynamical systems based on the hybrid functional Petri net. It exploits the structure of a large pathway model and breaks it into smaller components, whose parameters can then be estimated independently. This leads to significant improvements in computational efficiency. We have tested our approach on a detailed model of the Akt and MAPK pathways with two known and one hypothesised crosstalk mechanisms. Our simulation results exhibit good correlation with experimental data, and they have yielded positive evidence in support of the hypothesised crosstalk between the two pathways.

The faculty members involved in computational biology research are:

  • HSU David
  • HSU Wynne
  • LEE Mong Li
  • LEONG Hon Wai
  • OOI Beng Chin
  • SUNG Wing Kin, Ken
  • TAN Kian Lee
  • THIAGARAJAN P. S.
  • TUNG Kum Hoe, Anthony
  • WONG Limsoon


© Copyright 2001-08 National University of Singapore. All Rights Reserved