Sample-Wise Enumeration Methods for Mining Microarray Datasets

Anthony K. H. Tung

Department of Computer Science

National University of Singapore

A Microarray Dataset

Find closed patterns which occur frequently among genes.

Find rules which associate certain combination of the columns that affect the class of the rows

Gene1,Gene10,Gene1001 -> Cancer

Challenge I

Large number of patterns/rules

number of possible column combinations is extremely high

Solution: Concept of a closed pattern

Patterns are found in exactly the same set of rows are grouped together and represented by their upper bound

Example: the following patterns are found in row 2,3 and 4

Challenge II

Most existing frequent pattern discovery algorithms perform searches in the column/item enumeration space i.e. systematically testing various combination of columns/items

For datasets with 1000-100,000 columns, this search space is enormous

Instead we adopt a novel row/sample enumeration algorithm for this purpose. CARPENTER (SIGKDD’03) is the FIRST algorithm which adopt this approach

Column/Item Enumeration Lattice

Each nodes in the lattice represent a combination of columns/items

An edge exists from node A to B if A is subset of B and A differ from B by only 1 column/item

Search can be done

General Framework for Column/Item Enumeration

A Multidimensional View

Sample/Row Enumeration Algorihtms

To avoid searching the large column/item enumeration space, our mining algorithm search for patterms/rules in the sample/row enumeration space

Our algorithms does not fitted into the column/item enumeration algorithms

They are not YAARMA (Yet Another Association Rules Mining Algorithm)

Column/item enumeration algorithms simply does not scale for microarray datasets

Existing Row/Sample Enumeration Algorithms

CARPENTER(SIGKDD'03)

Find closed patterns using row enumeration

FARMER(SIGMOD’04)

Find interesting rule groups and building classifiers based on them

COBBLER(SSDBM'04)

Combined row and column enumeration for tables with large number of rows and columns

FARMER's demo (VLDB'04)

Balance the scale: 3 row enumeration algorithms vs >50 column enumeration algorithms

Concepts of CARPENTER

Row Enumeration

Pruning Method 1

Removing rows that appear in all tuples of transposed table will not affect results

Pruning method 2

if a rule is discovered before, we can prune enumeration below this node

Because all rules below this node has been discovered before

For example, at node 34, if we found that {aeh} has been found, we can prune off all branches below it

Pruning Method 3: Minimum Support

Example: From TT|_{1}, we can see that the support of all possible pattern below node {1} will be at most 5 rows.

From CARPENTER to FARMER

What if classes exists ? What more can we do ?

Pruning with Interestingness Measure

Minimum confidence

Minimum chi-square

Generate lower bounds for classification/ prediction

Interesting Rule Groups

Concept of a rule group/equivalent class

rules supported by exactly the same set of rows are grouped together

Example: the following rules are derived from row 2,3 and 4 with 66% confidence

Pruning by Interestingness Measure

In addition, find only interesting rule groups (IRGs) based on some measures:

minconf: the rules in the rule group can predict the class on the RHS with high confidence

minchi: there is high correlation between LHS and RHS of the rules based on chi-square test

Other measures like lift, entropy gain, conviction etc. can be handle similarly

Ordering of Rows: All Class C before ~C

Pruning Method: Minimum Confidence

Example: In TT|{2,3} on the right, the maximum confidence of all rules below node {2,3} is at most 4/5

Pruning method: Minimum chi-square

Finding Lower Bound, MineLB

Example: An upper bound rule with antecedent A=abcde and two rows (r1 : abcf ) and (r2 : cdeg)

Initialize lower bounds {a, b, c, d, e}

add “abcf”--- new lower {d ,e}

Add “cdeg”--- new lower bound{ad, bd, ae, be}

Implementation

In general, CARPENTER FARMER can be implemented in many ways:

FP-tree

Vertical format

For our case, we assume the dataset can be fitted into the main memory and used pointer-based algorithm similar to BUC

Experimental studies

Efficiency of FARMER

On five real-life dataset

lung cancer (LC), breast cancer (BC) , prostate cancer (PC), ALL-AML leukemia (ALL), Colon Tumor(CT)

Varying minsup, minconf, minchi

Benchmark against

CHARM [ZaHs02] ICDM'02

Bayardo’s algorithm (ColumE) [BaAg99] SIGKDD'99

Usefulness of IRGs

Classification

Example results--Prostate

Naive Classification Approach

Generate the upper bounds of IRGs

Rank the upper bounds, thus ranking the IRGs;

Apply coverage pruning on the IRGs;

Predict the test data based on the IRGs that it covers.

Classification results

Summary of Experiments

FARMER is much more efficient than existing algorithms

There are evidences to show that IRGs is useful for classification of microarray datasets

COBBLER: Combining Column and Row Enumeration

Extend CARPENTER to handle datasets with both large number of columns and rows

Switch dynamically between column and row enumeration based on estimated cost of processing

Single Enumeration Tree

Dynamic Enumeration Tree

Switching Condition

Naïve idea of switching based on row number and feature number does not work well

to estimate the required computation of an enumeration sub-tree, i.e., row enumeration sub-tree or feature enumeration sub-tree.

Estimate the maximal level of enumeration for each children sub-tree

Example of estimating the maximal level of enumeration:

Suppose r=10, S(f1)=0.8, S(f2)=0.5, S(f3)=0.5, S(f4)=0.3 and minsup=2

S(f1)*S(f2)*S(f3)*r =2 ≥ minsup

S(f1)*S(f2)*S(f3)*S(f4)*r =0.6 < minsup

Then the estimated deepest node under f1 is f1f2f3

Switching Condition

Length and Row ratio

Extension of our work by other groups (with or without citation)

[1] Using transposition for pattern discovery from microarray data, Francois Rioult (GREYC CNRS), Jean-Francois Boulicaut (INSA Lyon), Bruno Cremileux (GREYC CNRS), Jeremy Besson (INSA Lyon)

See the presence and absence of genes in the sample as a binary matrix. Perform a transposition of the matrix which is essentially our transposed table. Enumeration methods are the same otherwise.

Extension of our work by other groups (with or without citation) II

[2] Mining Coherent Gene Clusters from Gene-Sample-Time Microarray Data. D. Jiang, Jian Pei, M. Ramanathan, C. Tang and A. Zhang. (Industrial full paper, Runner-up for the best application paper award). SIGKDD’2004

Extension of our work by other groups (with or without citation) III

Extension of our work by other groups (with or without citation) IV

Extension of our work by other groups (with or without citation) V

Extension of our work by other groups (with or without citation) VI

Extension of our work by other groups (with or without citation) VII

[2] Pruning Rule 3.2 (Pruning subsumed sets). At a node v = {s_i… s_ik} if {si1,…s_ik} U Tail is a subset of some maximal coherent sample set, then the subtree of the node can be pruned.

Extension of our work (Conclusion)

The sample/enumeration framework had been successfully adopted by other groups in mining microarray datasets

We are proud of our contribution as the group the produce the first row/sample enumeration algorithm CARPENTER and is happy that other groups also find the method useful

However, citations from these groups would have been nice. After all academic integrity is the most important things for a researcher.

Future Work: Generalize Framework for Row Enumeration Algorithms?

Conclusions

Many datasets in bioinformatics have very different characteristics compared to those that has been previously studied

These characteristics can either work against you or for you

In the case of microarray datasets with large number columns but small number of rows/samples, we turn what is against us to our advantage

Row/Sample enumeration

Pruning strategy

We show how our methods have been modified by other groups to produce useful algorithm for mining microarray datasets

"Thank you!!!"

Thank you!!!

atung@comp.nus.edu.sg

www.comp.nus.edu.sg/~atung/sfu_talk.pdf


	Anthony K. H. Tung
	Department of Computer Science
	National University of Singapore


	Find closed patterns which occur frequently among genes.
	Find rules which associate certain combination of the columns that affect the class of the rows
		Gene1,Gene10,Gene1001 -> Cancer


	Large number of patterns/rules
		number of possible column combinations is extremely high
	Solution: Concept of a closed pattern
		Patterns are found in exactly the same set of rows are grouped together and represented by their upper bound
	Example: the following patterns are found in row 2,3 and 4


	Most existing frequent pattern discovery algorithms perform searches in the column/item enumeration space i.e. systematically testing various combination of columns/items
	For datasets with 1000-100,000 columns, this search space is enormous
	Instead we adopt a novel row/sample enumeration algorithm for this purpose. CARPENTER (SIGKDD’03) is the FIRST algorithm which adopt this approach


	Each nodes in the lattice represent a combination of columns/items
	An edge exists from node A to B if A is subset of B and A differ from B by only 1 column/item
	Search can be done


	Each nodes in the lattice represent a combination of columns/items
	An edge exists from node A to B if A is subset of B and A differ from B by only 1 column/item
	Search can be done depth first
	Keep edges from parent to child only if child is the prefix of parent


	To avoid searching the large column/item enumeration space, our mining algorithm search for patterms/rules in the sample/row enumeration space
	Our algorithms does not fitted into the column/item enumeration algorithms
	They are not YAARMA (Yet Another Association Rules Mining Algorithm)
	Column/item enumeration algorithms simply does not scale for microarray datasets


	CARPENTER(SIGKDD'03)
		Find closed patterns using row enumeration
	FARMER(SIGMOD’04)
		Find interesting rule groups and building classifiers based on them
	COBBLER(SSDBM'04)
		Combined row and column enumeration for tables with large number of rows and columns
	FARMER's demo (VLDB'04)
	Balance the scale: 3 row enumeration algorithms vs >50 column enumeration algorithms


	Removing rows that appear in all tuples of transposed table will not affect results


	if a rule is discovered before, we can prune enumeration below this node
		Because all rules below this node has been discovered before
		For example, at node 34, if we found that {aeh} has been found, we can prune off all branches below it


	Example: From TT\|_{1}, we can see that the support of all possible pattern below node {1} will be at most 5 rows.


	What if classes exists ? What more can we do ?
	Pruning with Interestingness Measure
		Minimum confidence
		Minimum chi-square
	Generate lower bounds for classification/ prediction


	Concept of a rule group/equivalent class
		rules supported by exactly the same set of rows are grouped together
	Example: the following rules are derived from row 2,3 and 4 with 66% confidence


	In addition, find only interesting rule groups (IRGs) based on some measures:
		minconf: the rules in the rule group can predict the class on the RHS with high confidence
		minchi: there is high correlation between LHS and RHS of the rules based on chi-square test
	Other measures like lift, entropy gain, conviction etc. can be handle similarly


	Example: In TT\|{2,3} on the right, the maximum confidence of all rules below node {2,3} is at most 4/5


		Example: An upper bound rule with antecedent A=abcde and two rows (r1 : abcf ) and (r2 : cdeg)
		Initialize lower bounds {a, b, c, d, e}
		add “abcf”--- new lower {d ,e}
		Add “cdeg”--- new lower bound{ad, bd, ae, be}


	In general, CARPENTER FARMER can be implemented in many ways:
		FP-tree
		Vertical format
	For our case, we assume the dataset can be fitted into the main memory and used pointer-based algorithm similar to BUC


Efficiency of FARMER
	On five real-life dataset
		lung cancer (LC), breast cancer (BC) , prostate cancer (PC), ALL-AML leukemia (ALL), Colon Tumor(CT)
	Varying minsup, minconf, minchi
	Benchmark against
		CHARM [ZaHs02] ICDM'02
		Bayardo’s algorithm (ColumE) [BaAg99] SIGKDD'99
Usefulness of IRGs
	Classification


	Generate the upper bounds of IRGs
	Rank the upper bounds, thus ranking the IRGs;
	Apply coverage pruning on the IRGs;
	Predict the test data based on the IRGs that it covers.


	FARMER is much more efficient than existing algorithms
	There are evidences to show that IRGs is useful for classification of microarray datasets


	Extend CARPENTER to handle datasets with both large number of columns and rows
	Switch dynamically between column and row enumeration based on estimated cost of processing


	Naïve idea of switching based on row number and feature number does not work well
	to estimate the required computation of an enumeration sub-tree, i.e., row enumeration sub-tree or feature enumeration sub-tree.
		Estimate the maximal level of enumeration for each children sub-tree
	Example of estimating the maximal level of enumeration:
		Suppose r=10, S(f1)=0.8, S(f2)=0.5, S(f3)=0.5, S(f4)=0.3 and minsup=2
		S(f1)S(f2)S(f3)*r =2 ≥ minsup
		S(f1)S(f2)S(f3)S(f4)r =0.6 < minsup
		Then the estimated deepest node under f1 is f1f2f3


	[1] Using transposition for pattern discovery from microarray data, Francois Rioult (GREYC CNRS), Jean-Francois Boulicaut (INSA Lyon), Bruno Cremileux (GREYC CNRS), Jeremy Besson (INSA Lyon)
	See the presence and absence of genes in the sample as a binary matrix. Perform a transposition of the matrix which is essentially our transposed table. Enumeration methods are the same otherwise.


	[2] Mining Coherent Gene Clusters from Gene-Sample-Time Microarray Data. D. Jiang, Jian Pei, M. Ramanathan, C. Tang and A. Zhang. (Industrial full paper, Runner-up for the best application paper award). SIGKDD’2004


	[2] Pruning Rule 3.2 (Pruning subsumed sets). At a node v = {s_i… s_ik} if {si1,…s_ik} U Tail is a subset of some maximal coherent sample set, then the subtree of the node can be pruned.


	The sample/enumeration framework had been successfully adopted by other groups in mining microarray datasets
	We are proud of our contribution as the group the produce the first row/sample enumeration algorithm CARPENTER and is happy that other groups also find the method useful
	However, citations from these groups would have been nice. After all academic integrity is the most important things for a researcher.


	Many datasets in bioinformatics have very different characteristics compared to those that has been previously studied
	These characteristics can either work against you or for you
	In the case of microarray datasets with large number columns but small number of rows/samples, we turn what is against us to our advantage
		Row/Sample enumeration
		Pruning strategy
	We show how our methods have been modified by other groups to produce useful algorithm for mining microarray datasets


	Thank you!!!
	atung@comp.nus.edu.sg
	www.comp.nus.edu.sg/~atung/sfu_talk.pdf