Compressed index

Participants: Chandana, Jing Quan Lim, Hoang, Wing-Kai Hon, Tak-Wah Lam, Sadakane, Wing-Kin Sung


Background

DNA sequences, which hold the code of life for every living organism, can be represented as strings over the four characters A, C, G, and T. Thanks to advances in biotechnology, the complete sequences of a number of living organisms, including human, are already known. Advances in sequencing have also generated a large amount of sequencing data. This raises two problems. First, the datasets are too big: sequencing can easily generate hundreds of billions of data items per individual. Second, for analysis, biologists need tools that can efficiently locate the positions of an arbitrary pattern in a long DNA sequence. However, genomic data is long, and searching for a pattern by scanning the DNA sequence linearly is time-consuming.

Objectives

To handle the huge amount of data, we can use compression techniques. To solve the pattern searching problem, we can use a data structure. The challenge is to address both issues at the same time.
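As a rough illustration of how one structure can serve both purposes, the sketch below counts pattern occurrences using backward search over the Burrows-Wheeler transform, the idea behind FM-index-style structures. This page does not say which structures the project uses, so the choice of technique is an assumption, and the names build_bwt_index and count_occurrences are hypothetical. The sketch also stores its occurrence table uncompressed and builds the suffix array naively, so it shows only the search logic, not the space savings a real compressed index would achieve.

```python
def build_bwt_index(text):
    """Build a toy BWT-based index for `text` (assumed to contain no '$')."""
    text += "$"  # unique terminator, sorts before A, C, G, T
    # Naive suffix array construction; fine for short illustrative inputs only.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT: the character preceding each suffix in sorted order.
    bwt = "".join(text[i - 1] for i in sa)

    # counts[c] = number of occurrences of c in the text.
    counts = {}
    for ch in bwt:
        counts[ch] = counts.get(ch, 0) + 1

    # first[c] = number of characters in the text strictly smaller than c.
    first = {}
    total = 0
    for ch in sorted(counts):
        first[ch] = total
        total += counts[ch]

    # occ[c][i] = number of occurrences of c in bwt[:i].
    # Stored in full here; a real compressed index would use a
    # succinct rank structure instead (assumption of this sketch).
    occ = {ch: [0] * (len(bwt) + 1) for ch in counts}
    for i, ch in enumerate(bwt):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)

    return first, occ, len(bwt)


def count_occurrences(pattern, index):
    """Count (possibly overlapping) occurrences of a non-empty pattern."""
    first, occ, n = index
    lo, hi = 0, n  # current suffix array range [lo, hi)
    # Process the pattern from its last character to its first.
    for ch in reversed(pattern):
        if ch not in first:
            return 0
        lo = first[ch] + occ[ch][lo]
        hi = first[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo


index = build_bwt_index("ACGTACGTTACG")
print(count_occurrences("ACG", index))
```

Each step of the search touches only the tables, never the original text, so the query cost depends on the pattern length rather than on the length of the DNA sequence.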

In this project, we aim to create indexing data structures that are compressed. There are a few subproblems.

Software

Selected Publications


Last updated: 30/12/2015, Wing-Kin Sung.