Sample-Wise Enumeration Methods for Mining Microarray Datasets

Anthony K. H. Tung
Department of Computer Science
National University of Singapore
A Microarray Dataset

Find closed patterns which occur frequently among genes.

Find rules which associate certain combinations of the columns that affect the class of the rows.

Gene1, Gene10, Gene1001 -> Cancer
Challenge I

Large number of patterns/rules: the number of possible column combinations is extremely high.

Solution: the concept of a closed pattern.

Patterns that are found in exactly the same set of rows are grouped together and represented by their upper bound.

Example: the following patterns are found in rows 2, 3 and 4.
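The grouping can be sketched in Python on a toy table (hypothetical items, standing in for genes): the upper bound that represents the whole group is simply the intersection of the supporting rows' item sets.

```python
# Toy transaction table (hypothetical data): row id -> items in that row.
table = {
    1: {"a", "b", "c"},
    2: {"a", "c", "e", "h"},
    3: {"a", "c", "e", "h", "p"},
    4: {"a", "c", "e", "h", "q"},
}

def closed_pattern(rows):
    """Upper bound of the pattern group supported by exactly these rows:
    the intersection of their item sets."""
    return set.intersection(*(table[r] for r in rows))

print(sorted(closed_pattern({2, 3, 4})))  # -> ['a', 'c', 'e', 'h']
```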
Challenge II

Most existing frequent pattern discovery algorithms perform searches in the column/item enumeration space, i.e. they systematically test various combinations of columns/items.

For datasets with 1,000-100,000 columns, this search space is enormous.

Instead we adopt a novel row/sample enumeration algorithm for this purpose. CARPENTER (SIGKDD'03) is the FIRST algorithm to adopt this approach.
Column/Item Enumeration Lattice

Each node in the lattice represents a combination of columns/items.

An edge exists from node A to node B if A is a subset of B and A differs from B by only one column/item.

Search can be done depth first.

Keep edges from parent to child only if the parent is a prefix of the child.
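The depth-first, prefix-based traversal can be sketched as follows (a generic set-enumeration sketch, not the authors' code): each node extends its prefix only with items that come after it, so every combination is visited exactly once.

```python
def enumerate_itemsets(items, prefix=()):
    """Depth-first walk of the set-enumeration tree: a node's children
    extend its prefix with items ordered after its last item."""
    for i, item in enumerate(items):
        node = prefix + (item,)
        yield node
        yield from enumerate_itemsets(items[i + 1:], node)

for node in enumerate_itemsets(["a", "b", "c"]):
    print(node)
# Visits (a), (a,b), (a,b,c), (a,c), (b), (b,c), (c): all 7 combinations.
```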
General Framework for
Column/Item Enumeration
A Multidimensional View
Sample/Row Enumeration Algorithms

To avoid searching the large column/item enumeration space, our mining algorithms search for patterns/rules in the sample/row enumeration space.

Our algorithms do not fit into the column/item enumeration framework.

They are not YAARMA (Yet Another Association Rules Mining Algorithm).

Column/item enumeration algorithms simply do not scale for microarray datasets.
Existing Row/Sample Enumeration Algorithms

CARPENTER (SIGKDD'03)
Finds closed patterns using row enumeration.

FARMER (SIGMOD'04)
Finds interesting rule groups and builds classifiers based on them.

COBBLER (SSDBM'04)
Combines row and column enumeration for tables with large numbers of rows and columns.

FARMER's demo (VLDB'04)

Balance the scale: 3 row enumeration algorithms vs. >50 column enumeration algorithms.
Concepts of CARPENTER
Row Enumeration
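The row-enumeration idea can be sketched in miniature (a simplified reconstruction, not CARPENTER's actual implementation, which works on the transposed table and applies the pruning methods below): enumerate row sets depth-first, keep the intersection of each row set's items as its pattern, and report the pattern as closed when the row set is exactly its full support.

```python
def row_enumerate(table, minsup):
    """Toy row enumeration over a dict {row_id: item_set} with integer
    row ids; returns the closed patterns with support >= minsup."""
    rows = sorted(table)
    closed = set()

    def dfs(row_set, pattern):
        if len(row_set) >= minsup:
            # closedness check: the row set must equal the full support
            support = {r for r in rows if pattern <= table[r]}
            if support == row_set:
                closed.add(frozenset(pattern))
        last = max(row_set) if row_set else min(rows) - 1
        for r in rows:
            if r > last:  # extend only with later rows (no repeats)
                new_pattern = pattern & table[r] if row_set else set(table[r])
                if new_pattern:
                    dfs(row_set | {r}, new_pattern)

    dfs(set(), set())
    return closed
```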
Pruning Method 1

Removing rows that appear in all tuples of the transposed table will not affect the results.
Pruning Method 2

If a rule has been discovered before, we can prune the enumeration below this node, because all rules below this node have been discovered before.

For example, at node {3,4}, if we find that {aeh} has already been discovered, we can prune off all branches below it.
Pruning Method 3: Minimum Support

Example: from TT|{1}, we can see that the support of any pattern below node {1} will be at most 5 rows.
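This bound amounts to a one-line check (hypothetical helper name): the deepest descendant of a node can cover at most the current rows plus all remaining candidate rows.

```python
def can_prune(current_rows, remaining_rows, minsup):
    """Minimum-support pruning for row enumeration: if even taking every
    remaining row cannot reach minsup, the whole subtree is skipped."""
    return len(current_rows) + len(remaining_rows) < minsup

# Node {1} with 4 candidate rows left can support at most 5 rows,
# so it is pruned when minsup is 6.
print(can_prune({1}, {2, 3, 4, 5}, 6))  # -> True
```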
From CARPENTER to FARMER

What if classes exist? What more can we do?

Pruning with interestingness measures:
Minimum confidence
Minimum chi-square

Generate lower bounds for classification/prediction.
Interesting Rule Groups

Concept of a rule group/equivalence class:
rules supported by exactly the same set of rows are grouped together.

Example: the following rules are derived from rows 2, 3 and 4 with 66% confidence.
Pruning by Interestingness Measure

In addition, find only interesting rule groups (IRGs) based on some measures:

minconf: the rules in the rule group can predict the class on the RHS with high confidence.

minchi: there is high correlation between the LHS and RHS of the rules based on the chi-square test.

Other measures like lift, entropy gain, conviction etc. can be handled similarly.
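Both measures are simple functions of a rule group's support sets; a sketch for a binary class (hypothetical helper names, standard 2x2 chi-square):

```python
def confidence(pattern_rows, class_rows):
    """Confidence of 'pattern -> class': the fraction of the pattern's
    supporting rows that belong to the class."""
    return len(pattern_rows & class_rows) / len(pattern_rows)

def chi_square(pattern_rows, class_rows, n_rows):
    """Chi-square statistic of the 2x2 contingency table LHS vs. RHS."""
    a = len(pattern_rows & class_rows)   # pattern and class
    b = len(pattern_rows) - a            # pattern, not class
    c = len(class_rows) - a              # class, not pattern
    d = n_rows - a - b - c               # neither
    num = n_rows * (a * d - b * c) ** 2
    den = (a + b) * (c + d) * (a + c) * (b + d)
    return num / den if den else 0.0
```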
Ordering of Rows: All
Class C before ~C
Pruning Method: Minimum
Confidence
Example: in TT|{2,3} on the right, the maximum confidence of all rules below node {2,3} is at most 4/5.
Pruning method: Minimum
chi-square
Finding Lower Bound,
MineLB
Example: an upper bound rule with antecedent A = abcde and two rows (r1: abcf) and (r2: cdeg).

Initialize lower bounds: {a, b, c, d, e}

Add "abcf" --- new lower bounds: {d, e}

Add "cdeg" --- new lower bounds: {ad, bd, ae, be}
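The update step in this example can be sketched as follows (a toy reconstruction of the MineLB idea, not the paper's data structures): a lower bound not contained in the new row survives; a contained one is extended with each antecedent item the row lacks, and only minimal sets are kept.

```python
def update_lower_bounds(bounds, row, antecedent):
    """Update the minimal lower bounds of an upper-bound rule with
    antecedent `antecedent` when a new row's item set `row` arrives."""
    candidates = []
    for b in bounds:
        if b <= row:  # bound is contained in the row: must be extended
            candidates.extend(b | {x} for x in antecedent - row)
        else:         # bound already distinguishes the row: keep it
            candidates.append(b)
    # keep only minimal sets
    return [b for b in candidates if not any(o < b for o in candidates)]

A = set("abcde")
bounds = [{x} for x in A]                             # {a},{b},{c},{d},{e}
bounds = update_lower_bounds(bounds, set("abcf"), A)  # -> {d},{e}
bounds = update_lower_bounds(bounds, set("cdeg"), A)  # -> {ad},{bd},{ae},{be}
```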
Implementation

In general, CARPENTER and FARMER can be implemented in many ways:

FP-tree
Vertical format

For our case, we assume the dataset fits into main memory, and we use a pointer-based algorithm similar to BUC.
Experimental Studies

Efficiency of FARMER:

On five real-life datasets:
lung cancer (LC), breast cancer (BC), prostate cancer (PC), ALL-AML leukemia (ALL), colon tumor (CT)

Varying minsup, minconf, minchi.

Benchmarked against:
CHARM [ZaHs02] ICDM'02
Bayardo's algorithm (ColumE) [BaAg99] SIGKDD'99

Usefulness of IRGs:
Classification
Example results--Prostate
Example results--Prostate
Naive Classification Approach

Generate the upper bounds of the IRGs.

Rank the upper bounds, thus ranking the IRGs.

Apply coverage pruning on the IRGs.

Predict the test data based on the IRGs that cover it.
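The prediction step can be sketched as a first-match rule lookup (a simplified stand-in; the actual coverage-based scoring is more involved):

```python
def classify(test_items, ranked_irgs, default_class):
    """Predict with the highest-ranked IRG whose antecedent is contained
    in the test sample's items; fall back to a default class."""
    for antecedent, label in ranked_irgs:
        if antecedent <= test_items:
            return label
    return default_class

# Hypothetical ranked IRGs: (antecedent item set, predicted class).
irgs = [({"a", "d"}, "Cancer"), ({"b"}, "Normal")]
print(classify({"a", "b", "c"}, irgs, "Unknown"))  # -> Normal
```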
Classification results
Summary of Experiments

FARMER is much more efficient than existing algorithms.

There is evidence that IRGs are useful for the classification of microarray datasets.
COBBLER: Combining Column and Row Enumeration

Extends CARPENTER to handle datasets with both a large number of columns and a large number of rows.

Switches dynamically between column and row enumeration based on the estimated cost of processing.
Single Enumeration Tree
Dynamic Enumeration Tree
Dynamic Enumeration Tree
Switching Condition

The naive idea of switching based on row count and feature count does not work well.

Instead, estimate the required computation of an enumeration sub-tree, i.e., a row enumeration sub-tree or a feature enumeration sub-tree.

Estimate the maximal level of enumeration for each child sub-tree.

Example of estimating the maximal level of enumeration:

Suppose r=10, S(f1)=0.8, S(f2)=0.5, S(f3)=0.5, S(f4)=0.3 and minsup=2.

S(f1)*S(f2)*S(f3)*r = 2 ≥ minsup
S(f1)*S(f2)*S(f3)*S(f4)*r = 0.6 < minsup

Then the estimated deepest node under f1 is f1f2f3.
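The depth estimate in the example can be computed as follows (hypothetical helper; S(f) denotes the selectivity of feature f): multiply selectivities in order until the expected support would drop below minsup.

```python
def estimated_depth(selectivities, n_rows, minsup):
    """Estimate the deepest node of a feature-enumeration sub-tree:
    the number of features whose combined expected support stays >= minsup."""
    expected = float(n_rows)
    depth = 0
    for s in selectivities:
        if expected * s < minsup:
            break
        expected *= s
        depth += 1
    return depth

# S = [0.8, 0.5, 0.5, 0.3], r = 10, minsup = 2: depth 3, i.e. node f1f2f3.
print(estimated_depth([0.8, 0.5, 0.5, 0.3], 10, 2))  # -> 3
```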
Switching Condition
Switching Condition
Length and Row ratio
Extension of our work by other groups (with or without citation)

[1] Using Transposition for Pattern Discovery from Microarray Data. Francois Rioult (GREYC CNRS), Jean-Francois Boulicaut (INSA Lyon), Bruno Cremilleux (GREYC CNRS), Jeremy Besson (INSA Lyon).

Sees the presence and absence of genes in the samples as a binary matrix, and performs a transposition of the matrix, which is essentially our transposed table. The enumeration methods are otherwise the same.
Extension of our work by
other groups (with or without citation) II
|
|
|
[2] Mining Coherent Gene
Clusters from Gene-Sample-Time Microarray Data. D. Jiang, Jian Pei, M.
Ramanathan, C. Tang and A. Zhang. (Industrial full paper, Runner-up for the
best application paper award). SIGKDD’2004 |
Extension of our work by
other groups (with or without citation) III
Extension of our work by
other groups (with or without citation) IV
Extension of our work by
other groups (with or without citation) V
Extension of our work by
other groups (with or without citation) VI
Extension of our work by
other groups (with or without citation) VII
[2] Pruning Rule 3.2 (Pruning subsumed sets). At a node v = {s_i1, ..., s_ik}, if {s_i1, ..., s_ik} ∪ Tail is a subset of some maximal coherent sample set, then the subtree of the node can be pruned.
Extension of our work (Conclusion)

The sample/row enumeration framework has been successfully adopted by other groups in mining microarray datasets.

We are proud of our contribution as the group that produced the first row/sample enumeration algorithm, CARPENTER, and are happy that other groups also find the method useful.

However, citations from these groups would have been nice. After all, academic integrity is the most important thing for a researcher.
Future Work: Generalize
Framework for Row Enumeration Algorithms?
Conclusions

Many datasets in bioinformatics have very different characteristics compared to those that have been previously studied.

These characteristics can work either against you or for you.

In the case of microarray datasets with a large number of columns but a small number of rows/samples, we turn what is against us to our advantage:

Row/sample enumeration
Pruning strategies

We show how our methods have been modified by other groups to produce useful algorithms for mining microarray datasets.
"Thank you!!!"
|
|
|
Thank you!!! |
|
atung@comp.nus.edu.sg |
|
www.comp.nus.edu.sg/~atung/sfu_talk.pdf |