School of Computing > Research > Computational Biology

Computational Biology : Topics

Simulation and Analysis of Protein Motion

Protein motion and conformational flexibility plays a critical role in vital biological functions, such as immune protection and enzymatic catalysis. An example is the “flap” motion of HIV protease, a major inhibitory drug target for AIDS therapy. The flaps, located near the reactive site of HIV protease, must open to allow a ligand to bind and then close to form direct contacts with the ligand. Such motion is an essential means which proteins rely on to perform their functions, and therefore provides an important link between structure and function, a central relationship in molecular biology.

We envision that future protein databases will contain not only static structures of proteins, as the Protein Data Bank (PDB) does currently, but also the motion of proteins. Users can submit queries on protein motions to study the relationship between structure and function, just as they submit queries on sequences and static structures today. Such information will greatly enhance our ability to understand and predict ligand-protein and protein-protein interactions during processes such as pharmaceutical drug design. Towards this goal, we are developing geometric and probabilistic computation tools to explore, analyze, and model protein motion and conformational flexibility.

Due to noise in data, determining salient conformational changes accurately and efficiently is a challenging problem. We developed an algorithm and related software called pFlexAna for detection of protein conformational changes from experimental data obtained through, e.g., X-ray crystallography. A key element of pFlexAna is a statistical test that determines the similarity of two protein structures in the presence of noise. Using data from the Protein Data Bank and the Macromolecular Movements Database, we tested the algorithm on proteins that exhibit a range of different conformational changes. Results show that our algorithm can reliably detect salient conformational changes, including well-known examples such as hinge and shear.

Molecular dynamics simulation is a well-established method for studying protein motion at the atomic scale. However, it is computationally intensive and generates massive amounts of data, the sheer size of which often becomes an obstacle to biological insights. We proposed using Markov models with hidden states to construct simplified models of protein motion at long timescales, as many important kinetic and dynamic properties of proteins ultimately depend on such motions. The Markovian states in such models represent potentially overlapping probabilistic distributions over a protein’s conformation space. Our method produced a 3-state model that was as equal good as, and yet simpler than, a widely accepted 6-state model for predicting the long-timescale motions of alanine dipeptide. We also used the constructed Markov models to estimate important kinetic and dynamic quantities for protein folding, in particular, mean first-passage time.

 

Prediction of Protein Interactions, Complexes, and Function

Progress in high-throughput experimental techniques in the past decade has resulted in a rapid accumulation of protein-protein interaction (PPI) data. However, interaction data obtained by popular high-throughput assays such as yeast-two-hybrid experiments contain as much as 50% false positives and false negatives. Furthermore, the PPI networks resulting from these assays are still essentially an in vitro scaffold. Further progress in computational analyses techniques and experimental methods is needed to reliably deduce in vivo protein interactions, to distinguish between permanent and transient interactions, to distinguish between direct proteins binding from membership in the same protein complex and to distinguish protein complexes from functional modules.

We aim to advance computational techniques for: (a) assessing the reliability of PPIs reported in high-throughput assays; (b) identifying false-negative PPIs; (c) deducing new protein complexes from high-throughput assays; and (d) inferring protein function for previously unannotated proteins.

To deal with noise in PPI networks produced by high-throughput assays, we developed the “iterated CD distance” method for assessing the reliability of a PPI, given a PPI network derived from high-throughput protein interaction experiments. The method estimated the likelihood of the interaction of a pair of proteins by applying expectation maximization on their shared neighbourhood in the PPI network. The new PPI’s predicted by this method were superior to those made by other PPI-network-based methods (e.g., 30-50% better correlated with localization coherence). Moreover, we showed that protein complex prediction methods would benefit significantly from having their input PPI networks cleansed by this method.

In order to build on the numerous previous works on PPI prediction, we further proposed a probabilistic framework to integrate multiple types of information to predict PPI’s. A variety of existing PPI prediction techniques are integrated, including domain–domain interactions, interaction motifs, paralogous interactions, and protein function similarity. We also used closed itemset mining to identify domain combination pairs associated with interacting proteins to derive complex interaction rules that are not covered by the existing approaches. Evaluation demonstrated that integrating multiple predictions from the different approaches using our framework significantly outperformed any individual prediction. We participated in the DREAM2 protein–protein subnetwork prediction challenge; and our entry outperformed those of other participants by a clear margin.

The large amount of PPI’s produced by high-throughput assays makes it possible to predict protein complexes from PPI networks. We developed an algorithm called CMC (clustering-based on maximal cliques) to discover protein complexes from PPI networks weighted by our iterated CD distance. CMC first generates all the maximal cliques from the PPI networks, and then removes or merges highly overlapped clusters based on their interconnectivity. Tests showed that CMC is an effective approach to protein complex prediction from protein interaction network, achieving higher precision while retaining recall at levels comparable to previous methods.

Besides cleaning PPI networks and predicting protein complexes, we also tried to discover motifs of protein interaction sites. We developed a fast heuristic algorithm to predict interaction motif pairs from a set of protein sequences based on PPI network. Existing algorithms could take days to process a set of 5000 protein sequences with about 20,000 interactions. Our algorithm was able to locate the correct motif pair in such a dataset (actually the yeast interactome) in 45 minutes. Moreover, we derived a lower bound result for the p-value of a motif pair in order for it to be distinguishable from random motif pairs.

Lastly, we developed a scalable and efficient function prediction framework, called IWA (Integrated Weighted Averaging), that integrates diverse information using simple weighting strategies. The simplicity of the approach makes it possible to make predictions based on on-the-fly information fusion. In addition to its great efficiency, IWA performs exceptionally well against existing approaches. In the presence of cross-genome information, IWA makes even better protein function predictions.

 


Identifying Regulatory Signals in Genomes

Interactions between macromolecules play many essential roles---e.g., metabolic reactions and signal transduction---and occur in many combinations, such as protein-protein, protein-DNA, and protein-RNA. Protein interactions with DNA and RNA are the primary mechanisms for controlling gene expressions. What is needed is a recognition code that maps from the protein sequence to a pattern that describes the family of DNA binding sites---the functional elements. Identification of functional elements in the human genome is fundamental to our understanding of cell functions---how these codes orchestrate the complex network of gene transcription, the transcriptome, and interactions in distinct locations.

We aim to: a) Develop methods for accurate identification of transcription factor binding sites and other regulatory sites. (b) Develop methods for inferring the interactions of transcription factors and other functional elements.

The performance of current de novo motif finders is far from satisfactory and no motif finder performs consistently well over all datasets. We identified several key observations on how to utilize the results from individual motif finders and designed a novel ensemble method, MotifVoter, to predict the motifs and binding sites. In terms of sensitivity and precision, MotifVoter outperformed standalone motif finders and ensemble methods significantly on Tompa’s benchmark, Escherichia coli, and ChIP-Chip datasets.

We also developed an algorithm called LocalMotif to detect nucleotide motifs that are localized with respect to a given biological landmark. In LocalMotif, a novel score function was combined with other scoring measures including Z-score and relative entropy to detect the motif. The approach successfully discovered biologically relevant motifs and their intervals of localization in scenarios where the motifs could not be discovered by general motif finding tools. It was useful for discovering multiple co-localized motifs in a set of regulatory sequences, such as those identified by ChIP-Seq.

ChIP-seq is becoming the main approach to the genome-wide study of protein–DNA interactions and histone modifications. However, it is difficult to identify weak ChIP signals from background noise.  We proposed a linear signal–noise model, in which a noise rate was introduced to represent the fraction of noise in a ChIP library. We then developed an iterative algorithm to estimate the noise rate using a control library, and derived a library-swapping strategy for the false discovery rate estimation. These approaches were integrated in a general-purpose framework, named CCAT (Control-based ChIP-seq Analysis Tool), for the significance analysis of ChIP-seq. CCAT predicted significantly more ChIP-enriched sites than the previous methods.

Among different epigenetic modifications, the differential histone modification sites (DHMSs) are of great interest to study the dynamic nature of epigenetic and gene expression regulations among cell types, stages and environmental responses. We proposed  an approach called ChIPDiff for the genome-wide comparison of histone modification sites identified by ChIP-seq. ChIPDiff used a hidden Markov model (HMM) to infer the states of histone modification changes at each genomic location. We evaluated the performance of ChIPDiff by comparing the H3K27me3 modification sites between mouse embryonic stem cell (ESC) and neural progenitor cell (NPC). We demonstrated that the H3K27me3 DHMSs identified were of high sensitivity, specificity and technical reproducibility. ChIPDiff was further applied to uncover the differential H3K4me3 and H3K36me3 sites between different cell states. Interesting biological discoveries were achieved from such comparison in our study.

Lastly, we developed a computational model of promoters of human histone genes, and used the model to identify regions across the human genome having similar structure as promoters of histone genes. Such regions were predicted as potential genomic regulatory regions of genes that might be co-regulated with histone genes. Our study represented one of the most comprehensive computational analyses conducted thus far on a genome-wide scale of promoters of human histone genes. The study suggested a number of other human genes that share a high similarity of promoter structure with the histone genes and thus are highly likely to be co-regulated, and consequently co-expressed, with the histone genes. We also found that there are a large number of intergenic regions across the genome with their structures similar to promoters of histone genes. These regions may be promoters of yet unidentified genes, or may represent remote control regions that participate in regulation of histone and histone-coregulated gene transcription initiation.

Management and Analysis of Gene Expression Data

 

The development of microarray technology has made possible the simultaneous monitoring of the expressions of thousands of genes. This development offers great opportunities in advancing the diagnosis of diseases, the treatment of diseases, and the understanding of gene functions. However, existing works fall short on several issues: they provide little information on the interplay between selected genes; the collection of pathways that can be used, evaluated, and ranked against the observed expression data is limited; and a comprehensive set of rules for reasoning about relevant molecular events has not been compiled and formalized.

Our main goals are to: (a) develop techniques to extract and known pathways from multiple sources; (b) develop techniques to derive gene regulatory networks from gene expression data; (c) develop technologies for the design of microarrays; (d) develop tools for optimization of disease treatment based on gene expression profiles.

In an era of increasingly complex biological datasets, one of the key steps in gene functional analysis comes from clustering genes based on co-expression. Biclustering algorithms can identify gene clusters with local co-expressed patterns, which are more likely to define genes functioning together than global clustering methods. However, these algorithms are not effective in uncovering gene regulatory networks because the mined biclusters lack genes that may be critical in the function but may not be co-expressed with the clustered genes. We introduced a biclustering method called SKeleton Biclustering (SKB), which builds high quality biclusters from microarray data, creates relationships among the biclustered genes based on Gene Ontology annotations, and identifies genes that are missing in the biclusters. SKB thus defines inter-bicluster and intra-bicluster functional relationships. The delineation of functional relationships and incorporation of such missing genes may help biologists to discover biological processes that are important in a given study and provides clues for how the processes may be functioning together.

A second issue in the clustering of gene expression data is the large number of genes---as the number of dimensions in a dataset increase, distance measures used by many clustering algorithms become increasingly meaningless. We proposed to use a sliding-window approach to partition the dimensions in gene expression data to preserve significant clusters. We call this model nCluster model. The sliding-window approach generates more bins than the grid-based approach, thus it incurs higher mining cost. We developed a deterministic algorithm, called Maxn-Cluster, to mine nClusters efficiently. Our experiments showed that (a) the nCluster model could indeed preserve clusters that were shattered by the grid-based approach on synthetic datasets; (b) the nCluster model produced more significant clusters than the grid-based approach on two real gene expression datasets; and (c) MaxnCluster was efficient in mining maximal nClusters.

Another issue is that existing 3D clustering algorithms on gene x sample x time expression data do not consider the time lags between correlated gene expression patterns. Besides, they either ignore the correlation on time subseries, or disregard the continuity of the time series, or only validate pure shifting or pure scaling coherent patterns instead of the general shifting-and-scaling patterns. We proposed an efficient algorithm, LagMiner, to identify time-lagged co-regulated gene clusters. The algorithm could identify interesting clusters satisfying the constraints of regulation, coherence, minimum gene number, minimum sample subspace size and minimum time periods length. Experiments showed that LagMiner was effective, scalable and parameter-robust.

 

Analyzing Mass Spectra for Identifying Proteins

 

Proteomics is useful for understanding the expression of proteins in cells at different levels, at different time points, and in different forms. Such an understanding is critical to drug discovery and medical advances. Mass spectrometers are the predominant tool to accomplish some of the primary goals of proteomics. For example, identification of proteins, determination of expression level of proteins, and determination of post-translational modifications, sites, and types. Due to limitations in mass spectrometers and associated software tools, most of the MS/MS data generated by mass spectrometers are rejected because they are not interpretable by currently available software. Furthermore, the remaining data usually contain many false positives. In addition, the sensitivity and precision of mass spectrometers vary greatly.

Our goals are to: (a) Develop efficient, accurate, and robust algorithms for protein identification from tandem mass spectra. (b) Develop efficient, accurate, and robust algorithms for alignment of noisy mass spectra.

Tandem mass spectrometry is one of the most commonly used techniques in peptide sequencing. We studied general issues in de novo sequencing, and proposed a general preprocessing scheme that performed binning, pseudo peak introduction, and noise removal. We also studied the anti-symmetry problem and assumptions related to it, and proposed a more realistic way to handle it. We integrated our findings on preprocessing and the anti-symmetry problem with some current models for peptide sequencing, resulting in improved accuracies for de novo peptide sequencing.

Another category of approaches for peptide identification consists of database search algorithms. They return peptide sequences that match the parent mass of the experimental spectrum via some scoring functions. They generally do not perform well for peptides with post-translational modifications (PTMs). We improved on our previous algorithm PepSOM by incorporating a de novo algorithm to generate multi-charge strong tags, introducing better scoring functions to score candidate peptides based on these tags. Experiments on spectra with simulated and real PTMs confirmed that our algorithm was fast and accurate for identifying PTMs..

Bioinformatics for Virology Research

Viruses have the potential to spread rapidly and infect a large proportion of the human population. They constitute one of the most serious health threats to the human population. High-throughput sequencing technologies have brought about an explosion of the number of complete genomes of viruses. Yet despite this wealth of viral genome sequence data, these data has been largely used for evolution and epidemiology studies with little direct clinical application. So it is timely to use these data to develop technologies and bioinformatics tools that have a greater impact on clinical decision making.

We aim to address important issues in (a) detection of viruses, (b) resequencing of viruses, and (c) recombination breakpoint analysis of viruses.

Random-tagged primers are a novel technique for amplification of unknown virus sequences. However, blindly using such primers without checking their suitability on the target genome does not guarantee genome-wide amplification of the virus in the patient sample. We developed a model and an associated algorithm (LOMA) to predict amplification efficiency of random-tagged primer. LOMA was much more sensitive and efficient than previous methods in generating random-tagged primers---the resulting coverage by LOMA approached 90%.

We further developed a statistics-based pathogen detection algorithm to be used with DNA microarrays. This algorithm (PDA) analyzed the distribution of probe signal intensities relative to in silico r-signatures (recognition signature probe sets) based on a novel weighted Kullback-Leibler divergence that was more sensitive to tail of the distribution.  This algorithm was able to detect and identify the presence of multiple viruses. In our validation experiments, it even correctly detected viruses in the test samples that were initially missed by gold-standard RT-PCR assays.

Resequencing microarrays have been used to obtain whole-genome primary sequences for orthopoxviruses, biothreat viruses, etc. The reported studies mainly use platform accompanying software that employs probabilistic base-calling algorithms. However, these methods are susceptible to hybridization noise caused by factors such as poor probe quality, poor amplification or mutations. We developed a novel resequencing microarray that is able to accommodate mutation hotspots and an algorithm to resolve multiple mutations in probe sequence in the microarray. The algorithm (EvolSTAR) exploits the characteristic profile of mutations from wild-type sequence calls. Validation experiments showed it performed far better than popular competing methods.

Recombination detection is important before inferring phylogenetic relationships. This will lead to a better understanding of pathogen evolution, more accurate genotyping and advancements in vaccine development. We developed an algorithm (RB-Finder) to detect recombination breakpoint. It used the novel idea of longest common subsequences (ie. contigs) assembled from (k,m)-mers as the unit of similarity measurement . This definition of contigs was not the usual definition, but it was the right one to use in this context as it enforced an even distribution of mismatches in the contigs. This simple but intuitive idea overcame inaccuracy of earlier methods that used base-by-base comparison. In particular, it was able to distinguish regions of high mutation rates from recombination breakpoints, which previous methods were unable to do.

Furthermore, we achieved strong practical impact in these works. For instance, almost all H1N1 Singapore strains have been sequenced using our approach. Presently, Mexico is also trying our approach.

Sequence Indexing and Database Search

 

One of the important daily tasks of biologists is performing homology search. Currently, heuristic methods like BLAST are used for this task, whose running time is linear to the size of the database (for instance, human genome is of size 3 billion bases). As the size of typical sequence databases is increasing rapidly, the performance of these tools is going to deteriorate. Thus it is important to have solutions to improve the performance of database search.

Our aim is to improve performance by utilizing sequence indexing techniques. We consider two directions. The first direction is to create compressed index. The second one is to rely on disk-based indexing.

For compressed indexing, we devoted our effort on compressed suffix array, which is a compressed index of biological sequences. We have developed a number of approximate string matching algorithms utilizing the compressed suffix array to speedup the database search. Our solution has shown advantages in some specialized homology searching problem in the biology domain. For disk-based indexing, we have developed CPS-tree, which is a compact partitioned suffix tree on disk. For exact pattern searching, the performance of CPS-tree is very good and the pattern searching time is independent of the genome length.

 

Tag SNP Selection and Association Study

Human genome harbors millions of common single nucleotide polymorphisms (SNPs) and other types of genetic variations. These genetic variations play an important role in understanding the correlation between genetic variations and human diseases and the body's responses to prescribed drugs. The discovery of such genetic factors contributing to variations in drug response, efficiency, and toxicity has come to be known as pharmacogenomics.

There are a number of bioinformatics problem related to SNPs. We target two particular problems: (a) tag SNP selection problem and (b) association study. Since there are millions of SNPs, it is difficult to investigate all the SNPs. The tag SNP selection problem is the selection of a subset of informative SNPs to represent the major variations among the individuals. Given a set of SNPs for a set of individuals, an association study is the finding of a combination of SNPs which has the potential to cause a genetic disease.

Tag-SNP selection algorithms based on the r2 LD statistic have gained popularity because r2 is directly related to statistical power to detect disease associations. Most of existing r2 based algorithms use pairwise LD. We proposed an efficient algorithm called FastTagger to calculate multi-marker tagging rules and select tag SNPs based on multi-marker LD. FastTagger used several techniques to reduce running time and memory consumption. Results showed that FastTagger was several times faster than existing multi-marker-based tag SNP selection algorithms, and it consumed much less memory at the same time. As a result, FastTagger could work on chromosomes containing more than 100 k SNPs using length-3 tagging rules. FastTagger also produced smaller sets of tag SNPs than existing multi-marker based algorithms. The generated tagging rules could also be used for genotype imputation at >96% accuracy when r2 ≥ 0.9.

We also developed a method for association study. Unlike most of the previous works which performed single SNP association, our method considered the association of variable-size haplotypes. Through regularized regression analysis, we tackled the problem of multiple degrees of freedom in haplotype test. Our method could handle a large number of haplotypes in association analyses more efficiently and effectively than currently available approaches.

 

Reconstructing Phylogenetic Tree, Network, and Gene Clusters

A phylogenetic tree, also known as a cladogram or a dendrogram, is a tree describing several life forms and their relations. Phylogenetic network is a generalization of phylogenetic tree. In addition to the ancestral-descendent mutational relationship, phylogenetic networks capture evolutionary events such as horizontal gene transfer and hybridization. Constructing a phylogenetic tree/network is helpful to understanding the history of life and to analyzing rapidly mutating viruses like HIV. Phylogenetic tree is also an important tool in the comparative genomic. Through comparing multiple species, it helps to predict protein structure, gene expression pattern, etc. Hence, it is important to have efficient and accurate tools for reconstructing phylogeny.

Our aim is to develop methods for constructing phylogenetic tree and network and for detecting gene clusters.

The identification of conserved gene clusters is an important step towards understanding genome evolution and predicting the function of genes. Gene team is a model for conserved gene clusters that takes into account the position of genes on a genome. Existing algorithms for finding gene teams require the user to specify the maximum distance between adjacent genes in a team. However, determining suitable values for this parameter, δ, is non-trivial. Instead of trying to determine a single best value, we proposed constructing thegene team tree (GTT), which is a compact representation of all gene teams for every possible value of δ. Our algorithm for computing the GTT extended existing gene team mining algorithms without increasing their time complexity. We computed the GTT for E. coli and B. subtilis. We also described how to compute the GTT for multi-chromosomal genomes and illustrated this using the GTT for the human and mouse genomes.

Moreover, we also proposed a novel pairwise gene cluster model that combined the notion of bidirectional best hits with the r-window model introduced in 2003 by Durand and Sankoff. The bidirectional best hit (BBH) constraint removed the need to specify the minimum number of shared genes in the r-window model and improved the relevance of the results. We designed a subquadratic time algorithm to compute the set of BBH r-window gene clusters efficiently. We applied our cluster model to the comparative analysis of E. coli K-12 and B. subtilis. An analysis of the most significant BBH r-window gene cluster showed that they were known operons.

We also developed accurate and efficient methods for constructing phylogenetic tree and network. For phylogenetic network, we gave the first polynomial time algorithm for combining a dense set of triplets into a galled phylogenetic network. Further, we generalized the method to reconstruct a network by combining a set of phylogenetic trees. For phylogenetic tree, we studied the supertree problem. We developed a fixed-parameter polynomial time algorithm to construct a maximum agreement supertree of a set of phylogenetic trees.

 

Modeling, simulation and analysis of Biological Pathways

 

Mathematical modeling is being increasingly recognized as a useful tool in the biomedical sciences. They can uncover new phenomena to explore, identify key factors of a system, link different level of details, enable a formalization of intuitive understanding, screen unpromising hypotheses, predict variable inaccessible to measurement, and expand the range of questions that can meaningfully be asked.

We study models of bio-pathways consisting of large networks of bio-chemical reactions whose kinetics are described by Ordinary Differential equations (ODEs). Our goal is to develop computational techniques using which the dynamics defined by such models can be studied in a scalable and efficient manner. In particular, we have been developing techniques for tackling basic analysis problems such as parameter estimation and sensitivity analysis.

Parameter estimation of large bio-pathway models is an important and difficult problem. To reduce the prohibitive computational cost, one approach is to decompose a large model into components and estimate their parameters separately. However, the decomposed components often share common parts that may have conflicting parameter estimates. We proposed to use belief propagation to reconcile these independent estimates in a principled manner and computed new estimates that were globally consistent and fitted well with data. An important advantage of our approach in practice was that it naturally handled incomplete or noisy data. Preliminaryresults based on synthetic data showed good performance.

Systems of ordinary differential equations (ODEs) are often used to model the dynamics of complex biological pathways. However, these ODEs system will generally not admit closed form solutions. Instead, one will have to resort to a large number of numerically generated trajectories to studythe dynamics. This motivated us to construct a discrete state model as a probabilistic approximation of the ODE dynamics by discretizing the value space and the time domain. We then sampled a representative set of trajectories and exploited the discretization and the structure of the signaling pathway to encode these trajectories compactly as a dynamic Bayesian network. As a result, many interesting pathway properties could be analyzed efficiently through standard Bayesian inference techniques. We tested our method on a model of EGF-NGF signaling pathway and the results were very promising. This method is being been applied –in collaboration with biologists- to study the innate immune response system in a comprehensive manner as well DNA damage-repair pathways.

Lastly, pathway model construction is often an inherently incremental process, with new pathway players and interactions continuously being discovered and additional experimental data being generated. So we have also focused on the problem of performing model parameter estimation incrementally by integrating new experimental data into an existing model. We used a probabilistic graphical model known as the factor graph to represent pathway parameter estimates. When new data arrived, the parameter estimates were refined efficiently by belief propagation to the factor graph. A key advantage of our approach was that the factor graph model contained enough information  about the old data, and used only new data to refine the parameter estimates without requiring explicit access to the old data. To test this approach, we applied it to the Akt-MAPK pathways. The results showed that our new approach could obtain parameter estimates that fitted the data well and refined them incrementally when new data arrived.

 

Back

*** Best viewed with Internet Explorer 7 & above, Firefox, Chrome and Safari ***