       Find probe (online version)  

 

                             

 

 


In recent years, oligo microarray (DNA chip) technology has had a significant impact on genomic studies. Many fields, such as gene discovery, drug discovery, toxicological research and disease diagnosis, stand to benefit from its use. A microarray is an orderly arrangement of thousands of DNA fragments, where each fragment is a probe (or fingerprint) of a gene/cDNA. Each probe must be uniquely associated with a particular gene/cDNA; otherwise, the performance of the microarray suffers.

Existing algorithms usually select probes using the criteria of homogeneity, sensitivity, and specificity. We propose one additional criterion, uniformity, which further improves the quality of the selected probes. For efficiency, existing algorithms reduce the running time by employing heuristics, at the cost of accuracy.

Instead, we use filtering techniques that avoid redundant computation while maintaining accuracy. With the new algorithm, optimal short (20 bases) or long (50 or 70 bases) probes can be computed efficiently for large genomes.

Our Contribution

FindProbe

Our algorithm selects good probes based on the criteria of homogeneity, sensitivity and specificity as proposed by Lockhart.

Homogeneity is the ability of a probe to hybridize at a given experimental temperature. The melting temperature of a probe is the temperature at which 50% of the probe molecules are hybridized to their complementary strand. For a probe p to be a good probe for its intended target, the melting temperature of p should be close to the specified experimental temperature.
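As a rough illustration of the homogeneity check, the sketch below estimates the melting temperature with the Wallace rule (2 degrees C per A/T base, 4 degrees C per G/C base), a textbook approximation for short oligos. This page does not say which melting temperature model FindProbe uses, so the rule, the function names and the 5-degree tolerance are assumptions made for illustration only.

```python
def wallace_tm(probe):
    """Estimate the melting temperature (degrees C) with the Wallace rule.

    Assumption: the Wallace rule stands in for whatever model the real
    program uses; it is only reasonable for short oligos.
    """
    probe = probe.upper()
    at = probe.count("A") + probe.count("T")
    gc = probe.count("G") + probe.count("C")
    if at + gc != len(probe):
        raise ValueError("probe must contain only A, C, G and T")
    return 2.0 * at + 4.0 * gc


def is_homogeneous(probe, experiment_temp, tolerance=5.0):
    """Accept a probe whose estimated Tm is within `tolerance` degrees
    of the experimental temperature (the tolerance is a hypothetical value)."""
    return abs(wallace_tm(probe) - experiment_temp) <= tolerance
```

For longer probes (50 or 70 bases) a nearest-neighbor thermodynamic model would be the more usual choice; the filter logic around it would stay the same.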

Sensitivity is the ability of a probe to detect low-abundance mRNAs. This key performance feature of microarrays can be jeopardized by probes that form significant secondary structures. It is therefore important to reject probes with high self-complementarity and to select probes with minimal secondary structure.
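One simple way to picture a self-complementarity check is to look for the longest stretch of the probe whose reverse complement also occurs in the probe, since such a stretch could pair with another part of the same molecule. The sketch below does this with a standard dynamic program. It is not the scoring used by FindProbe: it ignores loop length and thermodynamics, and the names and the `max_stem` threshold are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def longest_self_complementary_stem(probe):
    """Length of the longest substring of `probe` whose reverse complement
    also occurs in `probe`.

    This equals the longest common substring of the probe and its own
    reverse complement, computed row by row.
    """
    probe = probe.upper()
    rc = reverse_complement(probe)
    best = 0
    prev = [0] * (len(rc) + 1)
    for i in range(1, len(probe) + 1):
        curr = [0] * (len(rc) + 1)
        for j in range(1, len(rc) + 1):
            if probe[i - 1] == rc[j - 1]:
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


def passes_sensitivity_filter(probe, max_stem=4):
    """Reject probes with a long potential stem (threshold is an assumption)."""
    return longest_self_complementary_stem(probe) <= max_stem
```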

Specificity measures the uniqueness of a probe to its corresponding gene in the genome. A probe that is unique to its gene in terms of sequence similarity minimizes the chance of cross-hybridization. This step is computationally intensive and takes up most of the running time in probe design programs. By applying the pigeonhole principle, however, we greatly speed up the specificity filter: our algorithm finds and checks only the exact regions of the genome that could potentially cause cross-hybridization. Since these regions are small compared to the entire genome, we avoid redundant checks. Most importantly, our approach is not heuristic and is thus able to filter out all "bad" probes.
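The pigeonhole argument can be sketched as follows: if a genome region of the same length as the probe differs from it in at most k positions, and the probe is cut into k + 1 disjoint pieces, then at least one piece must match the region exactly. Only regions around such exact piece matches need a full comparison, and no qualifying region is missed. The code below is a minimal sketch of that idea, assuming Hamming distance as the similarity measure; the exact filter in the paper may differ, and all names here are hypothetical. A real implementation would look pieces up in an index rather than scanning with `str.find`.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def split_into_pieces(length, pieces):
    """Split range(length) into `pieces` disjoint, near-equal (start, end) parts."""
    base, extra = divmod(length, pieces)
    bounds, start = [], 0
    for i in range(pieces):
        end = start + base + (1 if i < extra else 0)
        bounds.append((start, end))
        start = end
    return bounds


def cross_hybridization_sites(probe, genome, max_mismatches):
    """Return all genome positions where the probe aligns with at most
    `max_mismatches` mismatches, verifying only pigeonhole candidates."""
    m = len(probe)
    if m < max_mismatches + 1:
        raise ValueError("probe is too short to split into k + 1 pieces")
    candidates = set()
    for start, end in split_into_pieces(m, max_mismatches + 1):
        piece = probe[start:end]
        pos = genome.find(piece)
        while pos != -1:
            window_start = pos - start  # where the whole probe would begin
            if 0 <= window_start <= len(genome) - m:
                candidates.add(window_start)
            pos = genome.find(piece, pos + 1)
    # Full comparison only on the (usually few) candidate windows.
    return sorted(s for s in candidates
                  if hamming(probe, genome[s:s + m]) <= max_mismatches)
```

Because every window within the mismatch bound contains an exact copy of at least one piece, the candidate set is complete, which is what separates this kind of filtering from a heuristic.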

Download: Paper

FindProbe_v2

To improve accuracy, we propose the uniformity criterion, which selects probes with a highly uniform distribution of mismatches. This is important because the distribution of similar sequences within a probe affects its reliability. The uniformity filter eliminates probes that have many long substrings appearing in other genes. Since neither the probe nor its substrings are likely to hybridize with incorrect genes, the resulting probe set is more accurate.

Uniformity measures the mismatch distribution of a probe against all non-target sequences in the genome. A probe with a good mismatch distribution, termed a uniform probe, has two properties:

  • The longest common substring (lcs) between the probe and any non-target sequence must be as short as possible.

  • The longest common substring at every position of the probe with any non-target sequence must be minimal. This ensures that every part of the probe is unique to its target sequence. We measure the overall uniqueness of the probe by a statistic which we call the mismatch distribution penalty.

The idea of our uniformity rule is to report all regions of a probe that may cause cross-hybridization. In the Methods section of the paper, we describe a fast and practical technique to compute the lcs of a probe. In addition, we formally introduce the mismatch distribution penalty statistic, which leads to a better ranking of probes in terms of specificity. Together with the Hamming distance, this lets us filter out more non-specific probes (i.e., probes that may cause cross-hybridization) than methods based on the longest common substring or the Hamming distance alone.
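The two properties of a uniform probe can be made concrete with a small sketch. For every start position in the probe, it records the length of the longest substring beginning there that also occurs in some non-target sequence; the maximum of this profile is the lcs length. Reading "at every position" as "starting at every position" is one interpretation, the quadratic dynamic program is a stand-in for the faster technique mentioned above, and the function names are hypothetical. The mismatch distribution penalty itself is not defined on this page, so the sketch stops at the profile.

```python
def common_substring_profile(probe, other):
    """profile[i] = length of the longest substring of `probe` that starts
    at position i and occurs somewhere in `other`."""
    n, m = len(probe), len(other)
    profile = [0] * n
    nxt = [0] * (m + 1)  # match lengths for probe[i+1:] against other[j:]
    for i in range(n - 1, -1, -1):
        cur = [0] * (m + 1)
        for j in range(m - 1, -1, -1):
            if probe[i] == other[j]:
                cur[j] = nxt[j + 1] + 1
                if cur[j] > profile[i]:
                    profile[i] = cur[j]
        nxt = cur
    return profile


def uniformity_profile(probe, non_targets):
    """Position-wise maximum of the profiles over all non-target sequences."""
    combined = [0] * len(probe)
    for seq in non_targets:
        for i, length in enumerate(common_substring_profile(probe, seq)):
            if length > combined[i]:
                combined[i] = length
    return combined


def longest_common_substring_length(probe, non_targets):
    """Length of the lcs between the probe and any non-target sequence."""
    return max(uniformity_profile(probe, non_targets), default=0)
```

A probe whose combined profile is low everywhere satisfies both properties; a probe with a single long entry fails the first, and one with many moderately long entries is the case the second property is meant to catch.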

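Table 1 - Benchmark results for short probe design

Benchmark results of our algorithm for designing short probes (20- to 25-mers) for the four genomes. Each entry gives the probe length and the running time; a dash means no result is listed.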

 

Genome         Size (bp)  Genes  Li and Stormo      Rouillard et al.  Rahmann             FindProbe             FindProbe_v2
E. coli        4662239    400    23-mers, 1.5 days  -                 -                   20-mers, 70 seconds   20-mers, 49 seconds
S. pombe       7098029    4897   -                  -                 -                   20-mers, 302 seconds  20-mers, 167 seconds
S. cerevisiae  8953158    6343   24-mers, 4 days    -                 25-mers, < 2 hours  20-mers, 358 seconds  20-mers, 193 seconds
N. crassa      38044343   10895  -                  -                 25-mers, < 4 hours  20-mers, 53 minutes   20-mers, 31 minutes

Table 2 - Benchmark results for long probe design

Benchmark results of our algorithm for designing long probes (50-mers) for the four genomes. Each entry gives the probe length and the running time; a dash means no result is listed.

 

 

Genome         Size (bp)  Genes  Li and Stormo  Rouillard et al.  Rahmann  FindProbe             FindProbe_v2
E. coli        4662239    400    -              -                 -        50-mers, 191 seconds  50-mers, 3 minutes
S. pombe       7098029    4897   -              -                 -        50-mers, 30 minutes   50-mers, 20 minutes
S. cerevisiae  8953158    6343   -              50-mers, 1 day    -        50-mers, 50 minutes   50-mers, 33 minutes
N. crassa      38044343   10895  -              -                 -        50-mers, 214 minutes  50-mers, 143 minutes

 

Download Program: FindProbe_v2.zip

 


Send mail to gunjanch@comp.nus.edu.sg with questions or comments about this web site.
Last modified: 01/27/04