Areas of research in Computer Science:
The frontier of medical science has rarely been as exciting
and full
of opportunities as it is today. From basic science to clinical
research to health services research, the opportunities made available
through impressive recent advances in the biomedical, physical, and
computational sciences have brought us to a place of unprecedented
opportunities. It is also now widely appreciated that present-day
biomedical researchers are confronted by vast amounts of data from
genome sequencing, microscopy, high-throughput analytical techniques
for DNA, RNA and proteins, and a host of other new experimental
technologies. Coupled with enormous advances in computing power, this
flow of information should enable scientists to model and understand
biological systems in novel ways.
Over the last decade, however, the community has come to the
realisation that, to make effective use of the explosion of data, new
challenges must be faced especially in the development and application
of computational analysis and knowledge discovery technologies and
methodologies. The SoC Computational Biology Lab was thus established.
The research of our lab will benefit biological and clinical
researchers by enabling extraction and inference of new knowledge and
value-added information from diverse biomedical data. Our research
projects and activities are described below.
Protein Structure Comparison and Database Search
Analysis of three-dimensional (3D) protein structures plays an
important role in bioinformatics. Since the functions of a protein is
more closely related to its 3D structure than its amino acid sequence,
the study of proteins from the structural perspective can give us more
valuable information about their functions.
Here, we have two main objectives: (a) develop efficient and
effective methods to compare a pair of 3D protein structures, and (b)
develop efficient and effective methods to search a database of 3D
protein structures. The “efficiency” aspect deals
with the
speed of retrieval. The “effectiveness” aspect
deals with
the accuracy of the methods.
We have designed a fine-grained protein structure alignment
method
named MatAlign. It is a two-step algorithm. First, we represent 3D
protein structures as 2D distance matrices, and align these matrices by
means of dynamic programming in order to find the initially aligned
residue pairs. Second, we refine the initial alignment iteratively into
the optimal one according to an objective scoring function. We have
implemented MatAlign and evaluated it against two widely used
structural comparison tools, DALI and CE. On the benchmark set of 68
protein structure pairs by Fischer et al., MatAlign provides better
alignment scores, according to four different criteria, than both DALI
and CE in a majority of cases. MatAlign is about four times faster than
DALI, and has about the same speed as CE. The current implementation
runs on Win32, and the executables can be downloaded from http://www1.i2r.a-star.edu.sg/~azeyar/genesis/MatAlign/MatAlign_v11_Windows.zip.
We have also proposed a rapid protein structure database
retrieval
system named ProtDex2, in which we adopt the techniques used in
information retrieval (IR) systems in order to perform rapid database
search without having to access every structure in the database. The
retrieval process is based on the inverted index constructed on the
feature vectors of the relationships between the secondary structure
elements (SSEs) of all the protein structures in the database.
Experimental results show that ProtDex2 is very much faster than two
commonly used detailed structural comparison methods, DALI and CE, yet
does not sacrifice too much accuracy of comparison. When comparing with
a fast database scan method, Topscan, ProtDex2 is much faster and also
slightly more accurate. The software is available at http://www1.i2r.
a-star.edu.sg/~azeyar/genesis/ProtDex2.
Protein Structure and Protein Motion
Protein conformational flexibility plays a critical role in
vital biological functions, such as immune protection and enzymatic
catalysis. An example is the “flap” motion of HIV
protease, a major inhibitory drug target for AIDS therapy. The flaps,
located near the reactive site of HIV protease, must open to allow a
ligand to bind and then close to form direct contacts with the ligand.
Such motion is an essential means for proteins to perform their
functions, and therefore provides an important link between structure
and function, a central relationship in molecular biology.
We envision that future protein databases will contain not
only static structures of proteins, as the Protein Data Bank (PDB) does
currently, but also the motion of proteins. Users can submit queries on
protein motions to study the relationship between structure and
function, just as they submit queries on sequences and static
structures today. Such information will greatly enhance our ability to
understand and predict ligand-protein and protein-protein interactions
during processes such as for pharmaceutical drug design. Towards this
goal, we are developing geometric computation tools to explore, analyse
and model protein conformational flexibility.
We have developed a new method called Stochastic Roadmap
Simulation (SRS) for studying protein folding kinetics and have used it
to computationally predict experimental quantities in protein folding.
Interesting properties of molecular motion are best characterised
statistically by considering an ensemble of motion pathways rather than
an individual one. Classic simulation techniques, such as the Monte
Carlo method and molecular dynamics, generate individual pathways one
at a time and are easily “trapped” in the local
minima of the energy landscape. They are computationally inefficient if
applied in a brute-force fashion to deal with many pathways. SRS uses a
randomised technique for sampling molecular motion and exploring the
kinetics of such motion by examining multiple pathways simultaneously.

We have developed an algorithm and the related software called
pFlexAna for detection of protein conformational changes from
experimental data obtained through, for example, X-ray crystallography.
Due to noise in the data, determining salient conformational changes
accurately and efficiently is a challenging problem. A key element of
pFlexAna is a statistical test that determines the similarity of two
protein structures in the presence of noise. Using data from the
Protein Data Bank and the Macromolecular Movements Database, we tested
the algorithm on proteins that exhibit a range of different
conformational changes. Results show that our algorithm can reliably
detect salient conformational changes, including wellknown examples
such as hinge and shear.
With collaborators from Yong Loo Lin School of Medicine in
NUS, we are studying the protein-protein interaction between Bcl-2 and
Rac1.
Increasing Reliability of Protein Interactomes
Progress in high-throughput experimental techniques in the
past decade has resulted in the rapid accumulation of protein-protein
interaction (PPI) data. However, recent surveys reveal that interaction
data obtained by popular high-throughput assays such as
yeast-two-hybrid experiments may contain as much as 50% false positives
and false negatives. As a result, further carefully focused small-scale
experiments are often needed to complement the large-scale methods to
validate the detected interactions. However, the vast interactomes
require much more scalable and inexpensive approaches.
Here, we have three main objectives: (a) identify properties
that characterise true-positive and falsepositive PPIs, (b) develop
efficient and effective methods for assessing the reliability of PPIs
reported in high-throughput assays, and (c) develop efficient and
effective methods for identifying false-negative PPIs and new protein
complexes from high-throughput assays.
We have developed several methods for assessing the
reliability of a PPI, given a graph derived from highthroughput protein
interaction experiments. The pairs of “interacting”
proteins that are ranked highly by these methods are shown more likely
to be true-positive interacting pairs. The most interesting feature of
these methods is that they are able to rank the reliability of an
interaction between a pair of proteins using only the topology of the
interactions between that pair of proteins and their neighbours within
a short radius.
Our methods can be roughly divided into two groups. The first
group is represented by the functional similarity weighting (FS-Weight)
and CD Distance indices. This group of indices attempt to provide
abstract mathematical characterisations of networks of reliable
protein-protein interactions. They are based on the hypothesis that
true-positive interactions are likely to be characterised by dense
cross connections in the derived interaction graph. The second group is
represented by the meso-scale motifs (NeMoFinder) indices. This group
of indices attempt to provide explicit motifs of network connections
that are associated with reliable protein interactions.
The NeMoFinder, CD Distance, and FS-Weight are far superior
(30-50% better correlated with function homologeneity, localisation
coherence, and gene expression correlation) than previous methods
(e.g., interaction generality index) for detecting false positives in
protein interaction experiments. CD Distance and FS-Weight are also
very fast to compute. Furthermore, with small modifications such as
incorporation of interacting motif pairs, these indices can also be
used for false negative detection with high confidence.
Protein Function Prediction
Although sequence similarity search has proven useful for
protein function prediction in many cases, it has fundamental
limitations. First, only a fraction of newly discovered sequences have
identifiable homologous genes in the current databases. Second, the
most prominent vertebrate organisms in GenBank have only a fraction of
their genomes present in finished sequences. New bioinformatics methods
should allow inference of protein function using “associative
analysis” of functional properties to complement traditional
sequence homology-based methods. Associative properties that have been
used to infer function not evident from sequence homology include:
cooccurrence of proteins in operons or the genome context, proteins
sharing common domains in fusion proteins, proteins in the same
pathway, proteins with correlated gene expression patterns, etc.

In this project, we investigate and develop methods for
inferring protein functions without sequence homology. In particular,
(a) we find out how significant functional association between level-2
neighbours is, (b) we investigate how they can be exploited for protein
function prediction in a graph-based framework, and (c) we investigate
how to integrate protein interaction information with other types of
information to improve the sensitivity and specificity of protein
function prediction in a graphbased framework. At the end of the
project, we expect to have developed a robust and powerful system to
predict protein functions, even in the absence of sequence homology.
We have developed FS-Weight, a method for assigning functions
to a protein based on the functions of direct and indirect interaction
partners of that protein. The FS-Weight method predicts the functions
of a protein in two steps: (1) It assigns a weight (the FWWeight) to
each of its level-1 and level-2 neighbours by estimating its functional
similarity with the protein using the local topology of the interaction
network as well as the reliability of experimental sources. (2) It
scores each function based on its weighted frequency in these
neighbours. FS-Weight makes predictions with better precision and
recall compared to other protein interaction based methods (e.g.,
Neighbour Counting and Chi-Square) for seven model genomes we have
tested. FS-Weight is also robust in the presence of noise in the
protein interaction data, maintaining consistent prediction performance
even when a large number of interactions are randomly added to or
deleted in the interaction data.
We have also proposed a method called LaMoFinder to annotate
network motifs with biological information associated with the proteins
in a protein-protein interaction network. Our method is specifically
devised to handle a large labelling space and the sophisticated Gene
Ontology scheme in which proteins have been annotated. As a result, we have captured
not only the topological shapes of the motifs, but also the biological
context in which they occur in the labelled network
motifs. We have also demonstrated how the network
motifs labelled by LaMoFinder can be used to predict
the functions of unknown proteins in the PPI network.
The superior performance of our method against other
current prediction methods confirms that network
motifs are indeed adequately enriched by LaMoFinder
for more sophisticated biological applications such as
protein function prediction.We have also developed a technique for identifying
potential annotation errors in protein sequence
databases. These databases are crucial to protein
function prediction that uses the principle of “guilt by
association” in sequence similarity. However, these
databases are known to contain many annotation
errors. We have derived a statistical method to filter
and assess the reliability of data from these databases.
Our experiments show that we can effectively detect
mis-annotated sequence data.
Identifying Functional Elements in Human and Other Genomes
Interactions between macromolecules play many essential roles, e.g.,
metabolic reactions and signal transduction. The interactions can also
occur in many combinations, such as protein-protein, protein-DNA and
protein-RNA. Protein interactions with DNA and RNA are the primary
mechanisms for controlling gene expression. What is needed is a
recognition code that maps from the protein sequence to a pattern that
describes the family of DNA binding sites – the functional
elements. Identification of functional elements in the human genome is
fundamental to our understanding of cell functions including how these
codes orchestrate the complex network of gene transcription, the
transcriptome, and interactions in distinct locations.
We have two objectives here: (a) develop methods for accurate
identification of transcription factor binding sites and other
regulatory sites, and (b) develop methods for inferring the
interactions of transcription factors and other functional elements.
We have developed an algorithm called LocalMotif to detect localised
motifs in long regulatory sequences. A novel score function predicts
the region of localisation of the motif. This score is combined with
other scoring measures including Z-score and relative entropy to detect
the motif. The algorithm is optimised for fast processing of long
regulatory sequences. Tests on simulated and real datasets confirm that
LocalMotif accurately determines the region of localisation of motifs
and automatically discovers the biologically relevant motifs, which can
be detected by other motiffinding algorithms only when the search is
restricted to the relevant interval.
Management and Analysis of Gene Expression Data
The development of microarray technology has made possible the
simultaneous monitoring of the expression of thousands of genes. This
development offers great opportunities in advancing the diagnosis of
diseases, the treatment of diseases, and the understanding of gene
functions.
Our main goals are: (a) develop techniques to derive gene
regulatory networks from gene expression data, (b) develop technologies
for the design of microarrays, and (c) develop tools for the
optimisation of disease treatment based on gene expression profiles.
We have developed a new conditional dependence learning
algorithm to learn a gene regulatory network from gene expression data.
Besides the pairwise correlation of the gene expression profiles of a
pair of genes, our algorithm considers three additional factors: (a)
collaboration among regulators, (b) formation of regulatory complex,
and (c) variable time delay to learn the gene network. Experiments on
both artificial and real-life gene expression datasets validate the
effectiveness of the algorithm.
We have also proposed an efficient algorithm to identify
time-lagged co-regulated gene clusters. The algorithm facilitates
localised comparison and processes several genes simultaneously to
generate detailed and complete timelagged information for genes/gene
clusters. Our results show that the algorithm is not only efficient,
but also delivers more reliable and detailed information on time-lagged
co-regulation between genes/gene clusters, compared to existing methods
such as the event method and the edge detection method.
Analysing Mass Spectra for Identifying Proteins
Proteomics is useful for understanding the expression of
proteins in cells at different levels, at different time points, and in
different forms. Such an understanding is critical to drug discovery
and medical advances. Mass spectrometers are the predominant tool to
accomplish some of the primary goals of proteomics.

For example, identification of proteins, determination of
expression level of proteins, and determination of post-translational
modifications, sites and types. Due to limitations in mass
spectrometers and the associated software tools, most of the MS/MS data
generated by mass spectrometers are rejected because they are not
interpretable by currently available software. Furthermore, the
remaining data usually contain many false positives. In addition, the
sensitivity and precision of mass spectrometers vary greatly.
Our objective here is two-fold: (a) develop efficient,
accurate and robust algorithms for protein identification from tandem
mass spectra, and (b) develop efficient, accurate and robust algorithms
for alignment of noisy mass spectra.
Most peptide sequencing algorithms currently handle spectra of
charge 1 or 2, and have not been designed to handle higher-charge
spectra. We have recently given a characterisation of multi-charge
spectra by generalising existing models and analysed spectra with
charges up to 5. Our analysis shows that higher charge peaks are
present and they contribute significantly to prediction of a complete
peptide. They also help explain why existing algorithms do not perform
well on multi-charge spectra.
We have further proposed an algorithm, GST-SPC, for de novo
sequencing of multi-charge mass spectra. It computes a peptide sequence
that is optimal with respect to shared peaks count from among all
sequences that are derived from multi-charge strong tags. GST-SPC uses
a larger set of multi-charge strong tags than existing methods, which
improves the theoretical upper bound on performance. Experimental
results confirm the improvement.
Annotating the Genome Using Paired-End-diTag
Annotating the whole genome is an expensive and time-consuming
task. Recently, a new technology called Paired-End-diTag (PET) has been
developed. This technology allows us to find the 20 bps of the two ends
of genes/transcripts in a high-throughput and cost-effective way. When
used with chromatin immunoprecipitation (ChIP), it can also precisely
extract PETs surrounding all transcription factor binding sites in the
whole genome. Since PETs are short and the genome is big, the
bioinformatics challenge is to have a method that can align the PETs
onto the genome accurately and efficiently.

Our main aim is to develop a tool to help biologists process
PET data. The tool should be accurate and efficient. In addition,
cancer cell-lines may generate some abnormal genes known as fusion
genes due to genome rearrangement in cancer cells. Our second aim is to
discover these genes.
We have developed a mapping algorithm that can efficiently and
accurately align PETs onto the genome. This allows us to discover
numerous novel genes and novel transcription factor binding sites in
the human genome and the mouse genome. In addition, we have discovered
a number of fusion genes in cancer cells. To help biologists, our
algorithm is implemented as a pipeline PET-Tool that allows biologists
to manage and process PET sequence data.
RNA Secondary Structure Prediction
Due to advances in sequencing technologies, many RNA sequences
have been discovered. Moreover, only a few of their structures have
been deduced. On the other hand, the chemical and biological properties
of many RNAs (such as tRNAs) are determined primarily by their
secondary structures. Therefore, determining the secondary structures
of RNAs is becoming one of the most important topics in bioinformatics.
We consider two problems related to RNA: (a) inferring the RNA
secondary structure of an RNA sequence by comparing it with another RNA
sequence whose secondary structure is known, and (b) extracting local
sub-structures shared by two RNA secondary structures, including where
the sub-structure may be the functional part of the two RNAs. We expect
two RNAs having a similar function to generally have a similar
structure. We have devised a polynomial time dynamic programming
algorithm to infer the secondary structure for an RNA sequence based on
the known secondary structure of another functionally similar RNA.
For the local sub-structure problem, we propose modelling the
local structural motif as a local gapped subforest. Our model is
general enough to describe a structural motif such as the SECIS-motif.
We have further proposed a polynomial time algorithm to compute the
local gapped subforest of two RNA secondary structures.
Recognition of MicroRNA Precursors and their Targets
MicroRNAs (miRNAs) are small non-coding RNAs of about 22
nucleotides long. They play important regulatory functions in
eukaryotic gene expression through mRNA degradation or translation
inhibition. The regulatory functions of miRNAs range from cell
proliferation, fat metabolism, neuronal patterning in nematodes,
neurological diseases, modulation of hematopoietic lineage
differentiation in mammals, development, cell death, cancer and control
of leaf and flower development in plants.

Our objectives are as follows. (a) Experimental miRNA
identification is technically challenging and incomplete. This calls
for the development of computational approaches to complement
experimental approaches to miRNA gene identification. We propose in
this project to investigate de novo miRNA precursor prediction methods.
(b) Analysing the binding of miRNAs to their mRNA target sites reveals
that many different factors determine what constitutes a good fit. We
intend to investigate these factors in detail and to construct decision
models for predicting miRNA targets. (c) Finally, we would like to
understand the role of miRNAs in a number of human diseases.
We follow the “feature generation, feature
selection, and feature integration” paradigm of constructing
recognition models for genomics sequences. We have generated and
identified features based on information in both primary sequence and
secondary structure, and used these features to construct accurate
decision models for the recognition of miRNA precursors.
Sequence Indexing and Database Search
One of the important daily tasks of biologists is performing
homology search. Currently, heuristic methods such as BLAST are used
for this task. Their running time is linear to the size of the database
(for instance, the human genome is of size 3 billion bases). As typical
sequence databases rapidly increase in size, the performance of the
existing tools is set to deteriorate. Thus, it is important to have
solutions to improve the performance of database search.
Our aim is to improve the performance of database search by
utilising sequence indexing techniques. We consider two directions. The
first direction is to create a compressed index. The second direction
is to rely on disk-based indexing. For compressed indexing, we have
devoted our effort on the compressed suffix array, which is a
compressed index of biological sequences. We have developed a number of
approximate string matching algorithms utilising the compressed suffix
array to speed up database search. Our solution has shown advantages in
specialised homology search in the biology domain. For disk-based
indexing, we have developed the CPS-tree, which is a compact
partitioned suffix tree on disk. For exact pattern search, the
performance of the CPS-tree is very good and the pattern search time is
independent of the genome length.
Tag SNP Selection and Association Study
An SNP (Single Nucleotide Polymorphism) is a specific location
in our genome where different people have different DNA bases. Millions
of SNPs in human beings account for the major differences among
different individuals. Studying SNPs is helpful to understanding
genetic diseases.
There are a number of bioinformatics problems related to SNPs.
We target two particular problems: (a) the tag SNP selection problem,
and (b) association study. Since there are millions of SNPs, it is
difficult to investigate all of them. The tag SNP selection problem is
the selection of a subset of informative SNPs to represent the major
variations among individuals. Given a set of SNPs for a set of
individuals, an association study is the finding of a combination of
SNPs which has the potential to cause a genetic disease.
We have developed a method for tag SNP selection. Our method
efficiently selects a minimum set of tag SNPs that maximises the
independence among the selected SNPs. We have also developed a method
for association study. Unlike most of the previous methods which
perform single SNP association, our method considers the association of
variable-size haplotypes. Through regularised regression analysis, we
tackle the problem of multiple degrees of freedom in haplotype test.
Our method can handle a large number of haplotypes in association
analyses more efficiently and effectively than currently available
approaches can.
Phylogenetic Tree and Network Reconstruction
A phylogenetic tree, also known as a cladogram or a
dendrogram, is a tree describing several life forms and their
relations. A phylogenetic network is a generalisation of a phylogenetic
tree. In addition to the ancestral-descendent mutational relationship,
phylogenetic networks capture evolutionary events such as horizontal
gene transfer and hybridisation. Constructing a phylogenetic
tree/network is helpful to the understanding of the history of life and
the analysis of rapidly mutating viruses such as HIV. The phylogenetic
tree is also an important tool in comparative genomics, where the
comparison of multiple species can help predict protein structure, gene
expression pattern, etc. It is also relevant as a tool for
reconstructing phylogeny.
Our aim is to develop methods for constructing phylogenetic
trees and networks. In particular, we are interested in approaches
based on combining trees.
We have developed accurate and efficient methods for
constructing phylogenetic trees and networks. For phylogenetic
networks, we provide the first polynomial time algorithm for combining
a dense set of triplets into a galled phylogenetic network. Also, we
have generalised the method to reconstruct a network by combining a set
of phylogenetic trees. For phylogenetic trees, we study the supertree
problem. We have a fixed-parameter polynomial time algorithm to
construct a maximum agreement supertree of a set of phylogenetic trees.
Developing Bio-Dynamical Systems and Analysing the Regulatory
Behaviour of Biological Pathways
Mathematical modelling has been increasingly recognised as a
useful tool in the biomedical sciences. Biological models provide a
coherent framework for interpreting data. They can uncover new
phenomena to explore, identify key factors of a system, link different
levels of details, enable the formalisation of intuitive understanding,
screen unpromising hypotheses, predict variable inaccessible to
measurement, and expand the range of questions that can meaningfully be
asked.
We focus on the study of bio-dynamical systems, which are
mathematical models of the dynamic behaviours of biological systems.
Specifically, we intend to develop efficient computational
representations, analysis techniques, and algorithms for modelling and
reasoning about bio-dynamical systems. Our goal is to provide
computational tools that enable biologists to design experiments more
effectively and thus accelerate the process of biological discovery.
We have developed a decompositional approach to parameter
estimation for bio-dynamical systems based on the hybrid functional
Petri net. It exploits the structure of a large pathway model and
breaks it into smaller components, whose parameters can then be
estimated independently. This leads to significant improvements in
computational efficiency. We have tested our approach on a detailed
model of the Akt and MAPK pathways with two known and one hypothesised
crosstalk mechanisms. Our simulation results exhibit good correlation
with experimental data, and they have yielded positive evidence in
support of the hypothesised crosstalk between the two pathways.
The faculty members involved in
computational biology research are:
- HSU David
- HSU Wynne
- LEE Mong Li
- LEONG Hon Wai
- OOI Beng Chin
- SUNG Wing Kin, Ken
- TAN Kian Lee
- THIAGARAJAN P. S.
- TUNG Kum Hoe, Anthony
- WONG Limsoon
|