Algorithms for Suffix Arrays and Compressed Suffix Arrays

Participants: Kunihiko Sadakane, Tak-Wah Lam, Keng Sung, Swee Seong Wong.


Background

Suffix arrays and compressed suffix arrays are important full-text indexing technologies frequently used for indexing texts without word boundary, such as DNA or protein sequences. However, constructing these arrays requires a lot of space, making these data-structures impractical. For example, constructing CSA for human genome requires at least 24G RAM. In this project, we aim to develop efficient algorithms for searching and constructing suffix arrays, compressed suffix arrays, and other types of full-text indices.

Achievements

We are pioneer in efficient algorithms for searching and constructing full-text indices like suffix arrays and compressed suffix arrays. Up to now, no known algorithm can build a full-text index for a big dataset using reasonable computing resources. For instance, we are the only group who can construct a full-text index for the whole human genome using a PC with 2G RAM within 2 hours.

We have achieved good results in computer science (e.g., our FOCS paper in 2003). We have also applied our algorithms to solve real biology problems (e.g., our Nature Method paper in 2005 and Cell paper in 2006). One of our algorithms won the Japan Forum on Information Technology Award in 2003. We have also worked on pattern searching. Utilizing our full-text indices, we can do approximate searching more efficiently than many previous solutions (e.g., our ESA in 2006).

Selected Publications


Last updated: 1/5/07, Limsoon Wong.