SCHOOL OF COMPUTING, NUS
POSTGRADUATE SEMINAR BY MS XU XIN

Data Mining Techniques in Gene Expression Data Analysis

Executive Classroom, SoC-1 level 5
1 September 2006, 10.00am

Abstract:
With the rapid development of microarray chip technology, gene expression data are being generated in huge quantities. An indispensable task of data mining, as a result, is to extract useful biological information from gene expression data effectively and efficiently. However, high dimensionality and complex relationships among genes pose great challenges for existing data mining methods. In this thesis, we systematically study the problems that state-of-the-art data mining algorithms face on gene expression data in three areas: class association rule mining, associative classification, and subspace clustering of genes with nonlinear and shifting-and-scaling correlations. Specifically, we propose the concept of top-k covering rule groups (TopKRGs) for each gene expression sample, and design a row-wise mining algorithm to discover TopKRGs efficiently. We further develop a new associative classifier by combining the nl rules consisting of the most significant genes, selected by an entropy test on the top-k covering rule groups. To address the nonlinear correlation problem and the shifting-and-scaling correlation problem, we introduce our Curler and RegMiner algorithms respectively, which identify subsets of genes that exhibit nonlinear or shifting-and-scaling correlation patterns across a subset of conditions. Extensive experimental studies are conducted on synthetic and real-life datasets, and the results show the effectiveness and efficiency of our algorithms. While we mainly use gene expression data in our study, our algorithms can also be applied to high-dimensional data in other domains.