1
|
- Anthony K. H. Tung
- Department of Computer Science
- National University of Singapore
|
2
|
- Find closed patterns which occur frequently among genes.
- Find rules which associate certain combinations of columns with the
class of the rows
- Gene1,Gene10,Gene1001 -> Cancer
|
3
|
- Large number of patterns/rules
- number of possible column combinations is extremely high
- Solution: Concept of a closed pattern
- Patterns that are found in exactly the same set of rows are grouped
together and represented by their upper bound
- Example: the following patterns are found in rows 2, 3 and 4
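As a hedged sketch of this grouping (the toy table, item names and row numbers below are made up for illustration), the upper bound of such a group is the closure of any pattern in it: the intersection of all rows that contain the pattern.

```python
def closure(pattern, rows):
    """Upper bound of the group containing `pattern`: the intersection
    of every row that contains the pattern.  All patterns with the same
    supporting row set share this closure."""
    pattern = set(pattern)
    support = [set(r) for r in rows if pattern <= set(r)]
    return ''.join(sorted(set.intersection(*support)))

# Toy table: patterns containing {a, e} occur in rows 2, 3 and 4 only,
# so their whole group is represented by the single closed pattern "aeh".
rows = ["abcd", "aehx", "aeh", "aehy"]
print(closure("ae", rows))   # -> aeh
print(closure("e", rows))    # closure of "e" alone is also "aeh" here
```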
|
4
|
- Most existing frequent pattern discovery algorithms perform searches in
the column/item enumeration space, i.e. systematically testing various
combinations of columns/items
- For datasets with 1,000-100,000 columns, this search space is enormous
- Instead, we adopt a novel row/sample enumeration algorithm for this
purpose. CARPENTER (SIGKDD’03) is the FIRST algorithm which adopts this
approach
|
5
|
- Each node in the lattice represents a combination of columns/items
- An edge exists from node A to node B if A is a subset of B and A
differs from B by only 1 column/item
- Search can be done
|
6
|
- Each node in the lattice represents a combination of columns/items
- An edge exists from node A to node B if A is a subset of B and A
differs from B by only 1 column/item
- Search can be done depth first
- Keep an edge from parent to child only if the parent is a prefix of
the child
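A minimal sketch of this depth-first, prefix-preserving traversal (item names are illustrative): each node is extended only with items that come after its last item, so every combination is visited exactly once.

```python
def enumerate_itemsets(items):
    """Depth-first traversal of the set-enumeration tree over `items`.

    Each node is extended only with items later in the ordering, i.e.
    only the prefix-based parent-to-child edges are kept, so every
    combination of items is visited exactly once."""
    items = sorted(items)
    out = []
    def dfs(node, start):
        for i in range(start, len(items)):
            child = node + [items[i]]
            out.append(''.join(child))
            dfs(child, i + 1)
    dfs([], 0)
    return out

print(enumerate_itemsets("abc"))
# -> ['a', 'ab', 'abc', 'ac', 'b', 'bc', 'c']
```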
|
7
|
|
8
|
|
9
|
- To avoid searching the large column/item enumeration space, our mining
algorithms search for patterns/rules in the sample/row enumeration space
- Our algorithms do not fit into the column/item enumeration framework
- They are not YAARMA (Yet Another Association Rules Mining Algorithm)
- Column/item enumeration algorithms simply do not scale for microarray
datasets
|
10
|
- CARPENTER(SIGKDD'03)
- Find closed patterns using row enumeration
- FARMER(SIGMOD’04)
- Find interesting rule groups and build classifiers based on them
- COBBLER(SSDBM'04)
- Combined row and column enumeration for tables with a large number of
rows and columns
- FARMER's demo (VLDB'04)
- Balance the scale: 3 row enumeration algorithms vs >50 column
enumeration algorithms
|
11
|
|
12
|
|
13
|
- Removing rows that appear in all tuples of the transposed table will
not affect the results
|
14
|
- If a rule has been discovered before, we can prune the enumeration
below this node
- This is because all rules below this node have been discovered before
- For example, at node {3,4}, if we find that {aeh} has already been
discovered, we can prune off all branches below it
|
15
|
- Example: From TT|{1}, we can see that the support of any possible
pattern below node {1} will be at most 5 rows.
|
16
|
- What if classes exist? What more can we do?
- Pruning with interestingness measures
- Minimum confidence
- Minimum chi-square
- Generate lower bounds for classification/prediction
|
17
|
- Concept of a rule group/equivalence class
- Rules supported by exactly the same set of rows are grouped together
- Example: the following rules are derived from rows 2, 3 and 4 with 66%
confidence
|
18
|
- In addition, find only interesting rule groups (IRGs) based on some
measures:
- minconf: the rules in the rule group can predict the class on the RHS
with high confidence
- minchi: there is high correlation between LHS and RHS of the rules
based on chi-square test
- Other measures like lift, entropy gain, conviction etc. can be handled
similarly
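As a sketch of how the two main measures are computed for a single rule group (the counts below are illustrative, not from a real dataset; the chi-square is the standard test on the 2x2 table of LHS membership vs class membership):

```python
def confidence_and_chi2(n, n_lhs, n_rhs, n_both):
    """Confidence and chi-square of a rule LHS -> class RHS.

    n: total rows; n_lhs: rows matching the LHS; n_rhs: rows of class
    RHS; n_both: rows matching the LHS and of class RHS."""
    conf = n_both / n_lhs
    chi2 = 0.0
    cells = [
        (n_lhs,     n_rhs,     n_both),                      # LHS,  RHS
        (n_lhs,     n - n_rhs, n_lhs - n_both),              # LHS, ~RHS
        (n - n_lhs, n_rhs,     n_rhs - n_both),              # ~LHS, RHS
        (n - n_lhs, n - n_rhs, n - n_lhs - n_rhs + n_both),  # ~LHS,~RHS
    ]
    for row_total, col_total, observed in cells:
        expected = row_total * col_total / n
        chi2 += (observed - expected) ** 2 / expected
    return conf, chi2

# A rule matching 3 rows, 2 of which carry the predicted class, in a
# table of 6 rows where 3 rows belong to that class:
conf, chi2 = confidence_and_chi2(n=6, n_lhs=3, n_rhs=3, n_both=2)
print(round(conf, 2), round(chi2, 2))   # -> 0.67 0.67
```

A rule group passes the filters when its confidence exceeds minconf and its chi-square exceeds minchi.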
|
19
|
|
20
|
- Example: In TT|{2,3} on the right, the confidence of any rule below
node {2,3} is at most 4/5
|
21
|
|
22
|
- Example: an upper bound rule with antecedent A = abcde and two rows
(r1: abcf) and (r2: cdeg)
- Initialize lower bounds: {a}, {b}, {c}, {d}, {e}
- Add "abcf" -> new lower bounds: {d}, {e}
- Add "cdeg" -> new lower bounds: {ad}, {bd}, {ae}, {be}
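A hedged sketch of this incremental computation (the function name and representation are my own): the lower bounds are the minimal subsets of the antecedent contained in none of the given rows, and each new row forces the candidates it swallows to grow by one item the row lacks.

```python
def lower_bounds(upper, rows):
    """Minimal subsets of `upper` contained in none of `rows`.

    `upper` is the antecedent of the upper bound rule; `rows` are rows
    outside the rule group's supporting row set.  Each row is processed
    once: candidates swallowed by the row are extended with one item
    the row lacks, then non-minimal sets are discarded."""
    upper = set(upper)
    cands = [{i} for i in sorted(upper)]          # start from singletons
    for row in rows:
        hit = set(row) & upper                    # row's items inside `upper`
        survivors = [c for c in cands if not c <= hit]
        for c in (c for c in cands if c <= hit):
            for extra in sorted(upper - hit):
                grown = c | {extra}
                if grown not in survivors:
                    survivors.append(grown)
        # keep only minimal candidates
        cands = [c for c in survivors if not any(o < c for o in survivors)]
    return sorted(''.join(sorted(c)) for c in cands)

print(lower_bounds("abcde", ["abcf", "cdeg"]))   # -> ['ad', 'ae', 'bd', 'be']
```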
|
23
|
- In general, CARPENTER/FARMER can be implemented in many ways:
- For our case, we assume the dataset fits into main memory and use a
pointer-based algorithm similar to BUC
|
24
|
- Efficiency of FARMER
- On five real-life datasets
- lung cancer (LC), breast cancer (BC), prostate cancer (PC), ALL-AML
leukemia (ALL), colon tumor (CT)
- Varying minsup, minconf, minchi
- Benchmark against
- CHARM [ZaHs02] ICDM'02
- Bayardo’s algorithm (ColumE) [BaAg99] SIGKDD'99
- Usefulness of IRGs
|
25
|
|
26
|
|
27
|
- Generate the upper bounds of IRGs
- Rank the upper bounds, thus ranking the IRGs;
- Apply coverage pruning on the IRGs;
- Predict the test data based on the IRGs that cover it.
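The steps above might be sketched as follows. This is a simplified assumption of the procedure, not FARMER's exact implementation: rules are given pre-ranked as (antecedent, class) pairs, coverage pruning keeps a rule only if it covers some still-uncovered training row, and prediction uses the best-ranked kept rule covering the test sample.

```python
def coverage_prune(ranked_irgs, train_rows, k=1):
    """Scan ranked rules best-first; keep a rule only if it covers at
    least one training row that fewer than k kept rules cover so far."""
    counts = [0] * len(train_rows)
    kept = []
    for items, label in ranked_irgs:
        hits = [i for i, row in enumerate(train_rows)
                if set(items) <= set(row) and counts[i] < k]
        if hits:
            kept.append((items, label))
            for i in hits:
                counts[i] += 1
    return kept

def predict(sample, kept_irgs, default):
    """Predict with the best-ranked kept rule whose antecedent the test
    sample covers; fall back to a default class otherwise."""
    for items, label in kept_irgs:
        if set(items) <= set(sample):
            return label
    return default

# Toy ranked rules and training rows (illustrative only):
irgs = [("aeh", "Cancer"), ("ae", "Cancer"), ("bc", "Normal")]
kept = coverage_prune(irgs, ["aehx", "aeh", "bcd"])
print(predict("aehz", kept, default="Normal"))   # -> Cancer
print(predict("xyz", kept, default="Normal"))    # -> Normal
```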
|
28
|
|
29
|
- FARMER is much more efficient than existing algorithms
- There is evidence that IRGs are useful for classifying microarray
datasets
|
30
|
- Extend CARPENTER to handle datasets with a large number of both
columns and rows
- Switch dynamically between column and row enumeration based on the
estimated cost of processing
|
31
|
|
32
|
|
33
|
|
34
|
- The naïve idea of switching based only on row and feature counts does
not work well
- Instead, estimate the required computation of each enumeration
sub-tree, i.e. row enumeration sub-tree or feature enumeration sub-tree
- Estimate the maximal level of enumeration for each child sub-tree
- Example of estimating the maximal level of enumeration:
- Suppose r=10, S(f1)=0.8, S(f2)=0.5, S(f3)=0.5, S(f4)=0.3 and minsup=2
- S(f1)*S(f2)*S(f3)*r = 2 ≥ minsup
- S(f1)*S(f2)*S(f3)*S(f4)*r = 0.6 < minsup
- Then the estimated deepest node under f1 is f1f2f3
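The arithmetic above can be sketched as a small estimator (S(f) is the selectivity of feature f; the values are those on the slide):

```python
def estimated_max_depth(r, selectivities, minsup):
    """Estimated deepest enumeration level under a node.

    r: number of rows; selectivities: S(f) for the features available
    below the node, in enumeration order.  The estimated support after
    adding each feature is the running product; stop one step before it
    drops below minsup."""
    est, depth = float(r), 0
    for s in selectivities:
        if est * s < minsup:
            break
        est *= s
        depth += 1
    return depth

# r=10, S(f1)=0.8, S(f2)=0.5, S(f3)=0.5, S(f4)=0.3, minsup=2:
# 10*0.8*0.5*0.5 = 2 >= 2, but *0.3 gives 0.6 < 2, so depth is 3 (f1f2f3).
print(estimated_max_depth(10, [0.8, 0.5, 0.5, 0.3], minsup=2))  # -> 3
```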
|
35
|
|
36
|
|
37
|
|
38
|
- [1] Using Transposition for Pattern Discovery from Microarray Data.
Francois Rioult (GREYC CNRS), Jean-Francois Boulicaut (INSA Lyon),
Bruno Cremileux (GREYC CNRS), Jeremy Besson (INSA Lyon)
- Views the presence and absence of genes in the samples as a binary
matrix and performs a transposition of the matrix, which is essentially
our transposed table. The enumeration methods are otherwise the same
|
39
|
- [2] Mining Coherent Gene Clusters from Gene-Sample-Time Microarray Data.
D. Jiang, Jian Pei, M. Ramanathan, C. Tang and A. Zhang. (Industrial
full paper, Runner-up for the best application paper award). SIGKDD’2004
|
40
|
|
41
|
|
42
|
|
43
|
|
44
|
- [2] Pruning Rule 3.2 (Pruning subsumed sets). At a node v = {si1, …,
sik}, if {si1, …, sik} U Tail is a subset of some maximal coherent
sample set, then the subtree of the node can be pruned.
|
45
|
- The sample/row enumeration framework has been successfully adopted by
other groups in mining microarray datasets
- We are proud of our contribution as the group that produced the first
row/sample enumeration algorithm, CARPENTER, and are happy that other
groups also find the method useful
- However, citations from these groups would have been nice. After all,
academic integrity is the most important thing for a researcher.
|
46
|
|
47
|
- Many datasets in bioinformatics have very different characteristics
from those that have been previously studied
- These characteristics can work either against you or for you
- In the case of microarray datasets with a large number of columns but
a small number of rows/samples, we turned what was against us to our
advantage
- Row/sample enumeration
- Pruning strategy
- We showed how our methods have been adapted by other groups to produce
useful algorithms for mining microarray datasets
|
48
|
- Thank you!!!
- atung@comp.nus.edu.sg
- www.comp.nus.edu.sg/~atung/sfu_talk.pdf
|