Sample-Wise Enumeration Methods for Mining Microarray Datasets
Anthony K. H. Tung
Department of Computer Science
National University of Singapore

A Microarray Dataset
Find closed patterns which occur frequently among genes.
Find rules which associate certain combination of the columns that affect the class of the rows
Gene1,Gene10,Gene1001 -> Cancer

Challenge I
Large number of patterns/rules
number of possible column combinations is extremely high
Solution: Concept of a closed pattern
Patterns are found in exactly the same set of rows are grouped together and represented by their upper bound
Example: the following patterns are found in row 2,3 and 4

Challenge II
Most existing frequent pattern discovery algorithms perform searches in the column/item enumeration space i.e. systematically testing various combination of columns/items
For datasets with 1000-100,000 columns, this search space is enormous
Instead we adopt a novel row/sample enumeration algorithm for this purpose. CARPENTER (SIGKDD’03) is the FIRST algorithm which adopt this approach

Column/Item Enumeration Lattice
Each nodes in the lattice represent a combination of columns/items
An edge exists from node A to B if A is subset of B and A differ from B by only 1 column/item
Search can be done

Column/Item Enumeration Lattice
Each nodes in the lattice represent a combination of columns/items
An edge exists from node A to B if A is subset of B and A differ from B by only 1 column/item
Search can be done depth first
Keep edges from parent to child only if child is the prefix of parent

General Framework for Column/Item Enumeration

A Multidimensional View

Sample/Row Enumeration Algorihtms
To avoid searching the large column/item enumeration space, our mining algorithm search for patterms/rules in the sample/row enumeration space
Our algorithms does not fitted into the column/item enumeration algorithms
They are not YAARMA (Yet Another Association Rules Mining Algorithm)
Column/item enumeration algorithms simply does not scale for microarray datasets

Existing Row/Sample Enumeration Algorithms
CARPENTER(SIGKDD'03)
Find closed patterns using row enumeration
FARMER(SIGMOD’04)
Find interesting rule groups and building classifiers based on them
COBBLER(SSDBM'04)
Combined row and column enumeration for tables with large number of rows and columns
FARMER's demo (VLDB'04)
Balance the scale: 3 row enumeration algorithms vs >50 column enumeration algorithms

Concepts of CARPENTER

Row Enumeration

Pruning Method 1
Removing rows that appear in all tuples of transposed table will not affect results

Pruning method 2
if a rule is discovered before, we can prune enumeration below this node
Because all rules below this node has been discovered before
For example, at node 34, if we found that {aeh} has been found, we can prune off all branches below it

Pruning Method 3: Minimum Support
Example: From TT|{1}, we can see that the support of all possible pattern below node {1} will be at most 5 rows.

From CARPENTER to FARMER
What if classes exists ? What more can we do ?
Pruning with Interestingness Measure
Minimum confidence
Minimum chi-square
Generate lower bounds for classification/ prediction

Interesting Rule Groups
Concept of a rule group/equivalent class
rules supported by exactly the same set of rows are grouped together
Example: the following rules are derived from row 2,3 and 4 with 66% confidence

Pruning by Interestingness Measure
In addition, find only interesting rule groups (IRGs) based on some measures:
minconf: the rules in the rule group can predict the class on the RHS with high confidence
minchi: there is high correlation between LHS and RHS of the rules based on chi-square test
Other measures like lift, entropy gain, conviction etc. can be handle similarly

Ordering of Rows: All Class C before ~C

Pruning Method: Minimum Confidence
Example: In TT|{2,3} on the right, the maximum confidence of all rules below node {2,3} is at most 4/5

Pruning method: Minimum chi-square

Finding Lower Bound, MineLB
Example: An upper bound rule with antecedent A=abcde and two rows (r1 : abcf ) and (r2 : cdeg)
Initialize lower bounds {a, b, c, d, e}
add “abcf”--- new lower {d ,e}
Add “cdeg”--- new lower  bound{ad, bd, ae, be}

Implementation
In general, CARPENTER FARMER can be implemented in many ways:
FP-tree
Vertical format
For our case, we assume the dataset can be fitted into the main memory and used pointer-based algorithm similar to BUC

Experimental studies
Efficiency of FARMER
On five real-life dataset
lung cancer (LC), breast cancer (BC) , prostate cancer (PC), ALL-AML leukemia (ALL), Colon Tumor(CT)
Varying minsup, minconf, minchi
Benchmark against
CHARM [ZaHs02] ICDM'02
Bayardo’s algorithm (ColumE) [BaAg99] SIGKDD'99
Usefulness of IRGs
Classification

Example results--Prostate

Example results--Prostate

Naive Classification Approach
Generate the upper bounds of IRGs
Rank the upper bounds, thus ranking the IRGs;
Apply coverage pruning on the IRGs;
Predict the test data based on the IRGs that it covers.

Classification results

Summary of Experiments
FARMER is much more efficient than existing algorithms
There are evidences to show that IRGs is useful for classification of microarray datasets

COBBLER: Combining Column and Row Enumeration
Extend CARPENTER to handle datasets with both large number of columns and rows
Switch dynamically between column and row enumeration based on estimated cost of processing

Single Enumeration Tree

Dynamic Enumeration Tree

Dynamic Enumeration Tree

Switching Condition
Naïve idea of switching based on row number and feature number does not work well
to estimate the required computation of an enumeration sub-tree, i.e.,  row enumeration sub-tree or feature enumeration sub-tree.
Estimate the maximal level of enumeration for each children sub-tree
Example of estimating the maximal level of enumeration:
Suppose r=10, S(f1)=0.8, S(f2)=0.5, S(f3)=0.5, S(f4)=0.3 and minsup=2
S(f1)*S(f2)*S(f3)*r =2 ≥ minsup
S(f1)*S(f2)*S(f3)*S(f4)*r =0.6 < minsup
Then the estimated deepest node under f1 is f1f2f3

Switching Condition

Switching Condition

Length and Row ratio

Extension of our work by other groups (with or without citation)
[1]  Using transposition for pattern discovery from microarray data, Francois Rioult (GREYC CNRS), Jean-Francois Boulicaut (INSA Lyon), Bruno Cremileux (GREYC CNRS), Jeremy Besson (INSA Lyon)
See the presence and absence of genes in the sample as a binary matrix. Perform a transposition of the matrix which is essentially our transposed table. Enumeration methods are the same otherwise.

Extension of our work by other groups (with or without citation) II
[2] Mining Coherent Gene Clusters from Gene-Sample-Time Microarray Data. D. Jiang, Jian Pei, M. Ramanathan, C. Tang and A. Zhang. (Industrial full paper, Runner-up for the best application paper award). SIGKDD’2004

Extension of our work by other groups (with or without citation) III

Extension of our work by other groups (with or without citation) IV

Extension of our work by other groups (with or without citation) V

Extension of our work by other groups (with or without citation) VI

Extension of our work by other groups (with or without citation) VII
[2] Pruning Rule 3.2 (Pruning subsumed sets). At a node v = {si… sik} if {si1,…sik} U Tail is a subset of some maximal coherent sample set, then the subtree of the node can be pruned.

Extension of our work (Conclusion)
The sample/enumeration framework had been successfully adopted by other groups in mining microarray datasets
We are proud of our contribution as the group the produce the first row/sample enumeration algorithm CARPENTER and is happy that other groups also find the method useful
However, citations from these groups would have been nice. After all academic integrity is the most important things for a researcher.

Future Work: Generalize Framework for Row Enumeration Algorithms?

Conclusions
Many datasets in bioinformatics have very different characteristics compared to those that has been previously studied
These characteristics can either work against you or for you
In the case of microarray datasets with large number columns but small number of rows/samples, we turn what is against us to our advantage
Row/Sample enumeration
Pruning strategy
We show how our methods have been modified by other groups to produce useful algorithm for mining microarray datasets

"Thank you!!!"
Thank you!!!
atung@comp.nus.edu.sg
www.comp.nus.edu.sg/~atung/sfu_talk.pdf