Some Bioinformatics and Structural Biology Applications That Can Benefit From A Global High-Speed Network

Notes for the APAN Meeting on 21 October 1997

Prasanna Kolatkar
BioInformatics Centre
Singapore
Limsoon Wong
Institute of Systems Science
Singapore

Nature of Many Bioinformatics Problems

Require access to data sources that are
- Highly heterogeneous
- Geographically distributed
- Highly complex
- Constantly evolving
- High in volume
Require solutions that involve multiple carefully sequenced steps
Require information to be passed smoothly between the steps
Require increasing amounts of computation
Require increasing amounts of visualisation

Example: Querying Protein Patents

If you have many proteins to work on, you may want to choose those that have not already been patented and that have a potentially large number of unpatented superfamily sequences. The picture below depicts how you can access some remote databases and software to help you.

Let's look at this in slightly more detail. One question you may want to ask to help broaden your claims is: find sequences in the same superfamily as your protein that are dissimilar to all patented sequences. Here is a simple way to use our Kleisli Query System to examine remote data and answer this question:

  [ (#title: Z.#title ,
     #accession: Z.#accession,
     #uid: Z.#uid)
  |

    ! find SCOP domains X that SEQ relates closely to

    \X <- process SEQ using scop-blast,

    ! find representative seqs Y in the superfamily of X

    \xinfo <- process <#sidinfo: X.#accession> using scop,
    \sf <- process <#numsid: xinfo.#type.#sf> using scop,
    \sfuid <- scop-accn2uid (sf),
    \Y <- process <#get: sfuid.#uid> using scop-index,

    ! make sure Y is different from all patented seqs

    set-isempty { x | \x <- process Y.#seq using patent-blast},

    ! find all seqs Z that look like Y

    \Z <- process Y.#seq using nr-blast ];
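
For readers unfamiliar with comprehension syntax, the same control flow can be sketched in Python. This is only an illustration of how the generators and the filter nest, not how Kleisli itself runs the query. The `drivers` dictionary and the function name are hypothetical: each entry stands in for one of the remote sources named in the query and is assumed to return an iterable of records.

```python
from typing import Callable, Dict, Iterable, List

# Hypothetical stand-in for the remote drivers named in the query: each name
# maps to a function that takes a request and returns an iterable of records.
Drivers = Dict[str, Callable[[object], Iterable]]


def unpatented_superfamily_hits(seq: str, drivers: Drivers) -> List[dict]:
    """Mirror the generators and the filter of the query, in the same order."""
    return [
        {"title": z["title"], "accession": z["accession"], "uid": z["uid"]}
        # find SCOP domains X that SEQ relates closely to
        for x in drivers["scop-blast"](seq)
        # find representative sequences Y in the superfamily of X
        for xinfo in drivers["scop"]({"sidinfo": x["accession"]})
        for sf in drivers["scop"]({"numsid": xinfo["type"]["sf"]})
        for sfuid in drivers["scop-accn2uid"](sf)
        for y in drivers["scop-index"]({"get": sfuid["uid"]})
        # keep Y only if it has no hit at all among the patented sequences
        if not any(True for _ in drivers["patent-blast"](y["seq"]))
        # find all sequences Z that look like Y
        for z in drivers["nr-blast"](y["seq"])
    ]
```

Each `for` clause corresponds to one `<-` generator in the query, and the `if` clause corresponds to the `set-isempty` test.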

Here is a screendump of our patent query demo.

Example: Querying Protein Interactions

Suppose you have a collection of protein sequences and want to make a quick guess at possible pairwise interactions between these proteins using the existing literature. The picture below describes a possible Kleisli-based strategy that relies on remote databases.

The interaction map produced above can then be visualized. A possible (made-up) visualization is shown in the picture below. Links attached to "bubbles" point you to literature about your proteins. Links attached to "triangles" point you to literature about the interaction of the proteins connected by the arcs. The size and colour of the bubbles and triangles indicate the amount of information available.
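
One very simple way to make such a literature-based guess is to count how many abstracts mention both proteins of a pair. The sketch below is an assumption offered for illustration, not the strategy shown in the picture. The function name is hypothetical, and matching names by case-insensitive substring is deliberately crude. The resulting counts could serve as the "amount of information" that sets the size of a triangle.

```python
from collections import Counter
from itertools import combinations
from typing import Iterable, Tuple


def co_mention_counts(abstracts: Iterable[str],
                      proteins: Iterable[str]) -> Counter:
    """Count, for each protein pair, the abstracts that mention both names."""
    names = sorted(set(proteins))
    counts: Counter = Counter()
    for text in abstracts:
        lowered = text.lower()
        # crude assumption: a protein is "mentioned" if its name is a substring
        mentioned = [p for p in names if p.lower() in lowered]
        # every unordered pair of mentioned proteins gains one supporting abstract
        for pair in combinations(mentioned, 2):
            counts[pair] += 1
    return counts
```

Pairs are keyed in alphabetical order, so `("RAF", "RAS")` and `("RAS", "RAF")` are never counted separately.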

Example: Real-Time Control of Structural Biology Experiments

Many large databases, such as the Protein Data Bank, involve transferring large files of three-dimensional information. In addition, massive data sets are collected at synchrotrons from diffraction experiments in which protein crystals are exposed to X-rays.

A researcher with access to a high-speed network can visualize these data in real time, allowing him or her to control the experiment from a remote location such as Singapore.

Here is an example of browsing data over the web at the Brookhaven synchrotron. This is a movie of data frames of a protein crystal being exposed to X-rays. A data set can comprise 100-900 frames, with each frame being 8-10 MB. If the data are coming over live, you can call up Brookhaven to direct the experiment on the fly.
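
To get a feel for what these figures mean for the network, the sketch below works out the size of a data set and a rough transfer time. The link speeds are hypothetical values chosen only for illustration, and the calculation assumes 1 MB equals 8 megabits and ignores protocol overhead.

```python
def dataset_size_mb(frames: int, frame_mb: float) -> float:
    """Total size of a data set in MB."""
    return frames * frame_mb


def transfer_seconds(size_mb: float, link_mbit_per_s: float) -> float:
    """Idealised transfer time: 1 MB taken as 8 megabits, no overhead."""
    return size_mb * 8 / link_mbit_per_s


if __name__ == "__main__":
    smallest = dataset_size_mb(100, 8)    # 100 frames of 8 MB  -> 800 MB
    largest = dataset_size_mb(900, 10)    # 900 frames of 10 MB -> 9000 MB
    for rate in (1.5, 45.0, 155.0):       # hypothetical link speeds, Mbit/s
        print(f"{rate:6.1f} Mbit/s: "
              f"{transfer_seconds(smallest, rate) / 60:8.1f} min to "
              f"{transfer_seconds(largest, rate) / 60:8.1f} min")
```

Under these assumptions a data set ranges from 800 MB to 9000 MB, so following the experiment live depends heavily on the speed of the link.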

Example: Interactive 3-D Visualization for Crystallography

Synchrotron and other structural data can also be downloaded. The downloaded data can then be processed and analyzed to construct an electron density map, which is used to build a model of the protein.

Luis Serra's group at the Institute of Systems Science is collaborating with Duncan McRee at Scripps Institute to apply their virtual reality workbench to X-ray crystallography. This will allow researchers at two remote sites to share the same three-dimensional information and to co-analyze and discuss it. Such interactive 3-D visualization places heavy demands on high-speed networks.

Some Opportunities

For computing scientists, a global high-speed network represents opportunities for research in areas like:

Optimization and distribution problems in the context of non-standard data
Methods and standards for the querying and reliable exchange of bulk data
Algorithms, tools, and standards for wide-area data mining
Methods for migration of computation and threads
Technology for high-bandwidth real-time interaction

For biologists, it means opportunities to investigate and collaborate on many problems on a large scale:

Protein-protein interactions
Gene expression
Sequence annotation
Crystallographic analysis