Sample-Wise Enumeration Methods for Mining Microarray Data

Anthony K. H. Tung

Department of Computer Science

National University of Singapore

A Microarray Dataset

1000 -100,000 columns

•Find closed patterns which occur frequently among genes.
•Find rules which associate certain combination of the columns that affect the class of the rows
–Gene1,Gene10,Gene1001 -> Cancer

Challenge I

“a”however not part of the group

Challenge II

•Most existing frequent pattern discovery algorithms perform searches in the column/item enumeration space i.e. systematically testing various combination of columns/items
•For datasets with 1000-100,000 columns, this search space is enormous
•Instead we adopt a novel sample enumeration algorithm for this purpose. CARPENTER (SIGKDD’03) is the FIRST algorithm which adopt this approach

Column/Item Enumeration Lattice

Oval: a,c

Column/Item Enumeration Lattice

•Each nodes in the lattice represent a combination of columns/items
•An edge exists from node A to B if A is subset of B and A differ from B by only 1 column/item
•Search can be done depth first
•Keep edges from parent to child only if child is the prefix of parent
Oval: a,c

General Framework for Column/Item Enumeration

Read-based

Write-based

Point-based

Association Mining

Apriori[AgSr94],

DIC

Eclat, MaxClique[Zaki01], FPGrowth[HaPe00]

Hmine

Sequential Pattern Discovery

GSP[AgSr96]

SPADE [Zaki98,Zaki01], PrefixSpan[PHPC01]

Iceberg Cube

Apriori[AgSr94]

BUC[BeRa99], H-Cubing [HPDW01]

A Multidimensional View

read

write

Sample/Row Enumeration Algorihtms

•To avoid searching the large column/item enumeration space, our mining algorithm search for patterms/rules in the sample/row enumeration space
•Our algorithms does not fitted into the column/item enumeration algorithms
•They are not YAARMA (Yet Another Association Rules Mining Algorithm)
•Column/item enumeration algorithms simply does not scale for microarraydatasets

Existing Sample Enumeration Algorithms

CARPENTER( SIGKDD'03)
–Find closed patterns using row enumeration
FARMER( SIGMOD’04)
–Find interesting rule groups and building classifiers based on them
COBBLER( SSDBM'04)
–Combined row and column enumeration for tables with large number of rows and columns
•FARMER 's demo (VLDB'04)
•Balance the scale: 3 row enumeration algorithms vs>50 column enumeration algorithms

Concepts of CARPENTER

Example Table

Transposed Table,TT

TT|{2,3}

Row Enumeration

Pruning Method 1

•Removing rows that appear in all tuplesof transposed table will not affect results

r2 r3{aeh}

r4 has 100% support in the conditional table of “r2r3”, therefore branch “r2 r3r4”will be pruned.

r2 r3 r4 {aeh}

TT|{2,3}

Pruning method 2

Pruning Method 3: Minimum Support

•Example: From TT|{1}, we can see that the support of all possible pattern below node {1} will be at most 5 rows.

From CARPENTER to FARMER

•What if classes exists ? What more can we do ?
•Pruning with Interestingness Measure
–Minimum confidence
–Minimum chi-square
•Generate lower bounds for classification/ prediction

Interesting Rule Groups

•Concept of a rule group/equivalent class
–rules supported by exactly the same set of rows are grouped together
•Example: the following rules are derived from row 2,3 and 4 with 66% confidence

ae-->C (66%)

a-->C however is not in the group

Pruning by Interestingness Measure

•In addition, find only interesting rule groups (IRGs) based on some measures:
–minconf: the rules in the rule group can predict the class on the RHS with high confidence
–minchi: there is high correlation between LHS and RHS of the rules based on chi-square test
•Other measures like lift, entropy gain, conviction etc. can be handle similarly

Ordering of Rows: All Class C before ~C

Pruning Method: Minimum Confidence

•Example: In TT|{2,3} on the right, the maximum confidence of all rules below node {2,3} is at most 4/5

TT|{2,3}

Pruning method: Minimum chi-square

TT|{2,3}

C

~C

Total

A

max=5

min=1

Computed

~A

Computed

Computed

Computed

Constant

Constant

Constant

Text Box: Same as in computing maximum confidence

Finding Lower Bound, MineLB

Implementation

•In general, CARPENTER FARMER can be implemented in many ways:
–FP-tree
–Vertical format
•For our case, we assume the dataset can be fitted into the main memory and used pointer-based algorithm similar to BUC

Experimental studies

•Efficiency of FARMER
–On five real-life dataset
•lung cancer (LC), breast cancer (BC) , prostate cancer (PC), ALL-AML leukemia (ALL), Colon Tumor(CT)
–Varying minsup, minconf, minchi
–Benchmark against
•CHARM [ZaHs02] ICDM'02
•Bayardo’salgorithm (ColumE) [BaAg99] SIGKDD'99
•Usefulness of IRGs
–Classification

Example results--Prostate

Example results--Prostate

Naive Classification Approach

•Generate the upper bounds of IRGs
•Rank the upper bounds, thus ranking the IRGs;
•Apply coverage pruning on the IRGs;
•Predict the test data based on the IRGsthat it covers.

Classification results

Summary of Experiments

•FARMER is much more efficient than existing algorithms
•There are evidences to show that IRGsis useful for classification of microarraydatasets

COBBLER: Combining Column and Row Enumeration

•Extend CARPENTER to handle datasets with both large number of columns and rows
•Switch dynamically between column and row enumeration based on estimated cost of processing

{ }

ab

{r1}

ad {r2}

acd{ r2}

bcd

{ }

{ }

r1r3r4 { }

r2r4{d }

Single Enumeration Tree

Feature enumeration

Row enumeration

r1

a b c

r2

a c d

r3

b c

r4

d

Dynamic Enumeration Tree

{ }

r1

{bc}

r1

{c}

r2

{d }

Feature enumerationto Row enumeration

ab

{r1}

abcd

{ }

a {r1r2}

abc: {r1}

ac: {r1r2}

acd: {r2}

r1

bc

r2

cd

br

1

cr

1 r2

dr

2

Dynamic Enumeration Tree

r4

{d}

c{r2r3 }

d {r4 }

b{r1 }

Row enumerationto Feature Enumeration

c{r1r2 }

bc{r1 }

r1 {abc}

r1r3r4 { }

ac: {r1r2}

bc: {r1r3}

c: {r1r2r3}

Switching Condition

Naïve idea of switching based on row number and feature number does not work well
to estimate the required computation of an enumeration sub-tree, i.e., row enumeration sub-tree or feature enumeration sub-tree.
Estimate the maximal level of enumeration for each children sub-tree
Example of estimating the maximal level of enumeration:
Suppose r=10, S(f1)=0.8, S(f2)=0.5, S(f3)=0.5, S(f4)=0.3 and minsup=2
S(f1)*S(f2)*S(f3)*r =2 minsup
S(f1)*S(f2)*S(f3)*S(f4)*r =0.6 < minsup
Then the estimated deepest node under f1 is f1f2f3

Switching Condition

Switching Condition

Length and Row ratio

Synthetic data

Extension of our work by other groups (with or without citation)

[1] Using transposition for pattern discovery from microarraydata , Francois Rioult(GREYC CNRS), Jean-Francois Boulicaut(INSA Lyon), Bruno Cremileux(GREYC CNRS), Jeremy Besson(INSA Lyon)
•See the presence and absence of genes in the sample as a binary matrix. Perform a transposition of the matrix which is essentially our transposed table. Enumeration methods are the same otherwise.

Extension of our work by other groups (with or without citation) II

•[2 ] Mining Coherent Gene Clusters from Gene-Sample-Time Microarray Data . D. Jiang, Jian Pei, M. Ramanathan, C. Tang and A. Zhang. (Industrial full paper, Runner-up for the best application paper award). SIGKDD’2004

Gene1

Gene2

Gene3

Gene4

Sample1

Sample2

.

.

.

SampleN-1

SampleN

A gene in two samples are say to be coherent if their time series satisfied a certain matching condition

In CARPENTER, a gene in two samples are say to be matching if their expression in the two samples are almost the same

Extension of our work by other groups (with or without citation) IV

[2] Try to find a subset of samples S such that a subset of genes G is coherent for each pair of samples in S. |S|>mins, |G|>ming

In CARPENTER, we try to find a subset of samples S in which a subset of genes G is similar in expression level for each pair of samples in S. |S|>mins, |G|>0

Gene1

Gene2

Gene3

Gene4

S1

1.23

S2

1.34

.

.

.

SN-1

1.52

SN

[2] Perform sample-wise enumeration and remove genes that are not pairwisecoherent across the samples enumerated

CARPENTER: Perform sample-wise enumeration and remove genes that does not have the same expression level across the samples enumerated

Extension of our work by other groups (with or without citation) VI

Text Box: From [2]: Pruning Rule 3.1 (Pruning small sample sets). At a node v = fsi1 ; : : : ; sikg, the subtree of v can be pr

TT|{3,4}

Extension of our work (Conclusion)

•The sample/enumeration framework had been successfully adopted by other groups in mining microarraydatasets
•We are proud of our contribution as the group the produce the first row/sample enumeration algorithm CARPENTER and is happy that other groups also find the method useful
•However, citations from these groups would have been nice. After all academic integrity is the most important things for a researcher.

Future Work: Generalize Framework for Row Enumeration Algorithms?

Conclusions

•Many datasets in bioinformatics have very different characteristics compared to those that has been previously studied
•These characteristics can either work against you or for you
•In the case of microarraydatasets with large number columns but small number of rows/samples, we turn what is against us to our advantage
–Row/Sample enumeration
–Pruning strategy
•We show how our methods have been modified by other groups to produce useful algorithm for mining microarraydatasets

Thank you!!!

atung@comp.nus.edu.sg

www.comp.nus.edu.sg/~atung/sfu_talk.pdf