1
|
- Anthony K. H. Tung
- Department of Computer Science
- National University of Singapore
|
2
|
- Find closed patterns which occur frequently among genes.
- Find rules which associate certain combinations of columns with the
class of the rows
- Gene1,Gene10,Gene1001 -> Cancer
|
3
|
- Large number of patterns/rules
- number of possible column combinations is extremely high
- Solution: Concept of a closed pattern
- Patterns that are found in exactly the same set of rows are grouped
together and represented by their upper bound
- Example: the following patterns are found in rows 2, 3 and 4
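As a hedged sketch of this grouping (the toy table, item names and row numbers below are made up for illustration), the upper bound of such a group is the closure of any pattern in it: the intersection of all rows that contain the pattern.

```python
def closure(pattern, rows):
    """Upper bound of the group containing `pattern`: the intersection
    of every row that contains the pattern.  All patterns with the same
    supporting row set share this closure."""
    pattern = set(pattern)
    support = [set(r) for r in rows if pattern <= set(r)]
    return ''.join(sorted(set.intersection(*support)))

# Toy table: patterns containing {a, e} occur in rows 2, 3 and 4 only,
# so their whole group is represented by the single closed pattern "aeh".
rows = ["abcd", "aehx", "aeh", "aehy"]
print(closure("ae", rows))   # -> aeh
print(closure("e", rows))    # closure of "e" alone is also "aeh" here
```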
|
4
|
- Most existing frequent pattern discovery algorithms perform searches in
the column/item enumeration space, i.e. systematically testing various
combinations of columns/items
- For datasets with 1,000-100,000 columns, this search space is enormous
- Instead, we adopt a novel row/sample enumeration algorithm for this
purpose. CARPENTER (SIGKDD’03) is the FIRST algorithm which adopts this
approach
|
5
|
- Each node in the lattice represents a combination of columns/items
- An edge exists from node A to node B if A is a subset of B and A
differs from B by only 1 column/item
- Search can be done
|
6
|
- Each node in the lattice represents a combination of columns/items
- An edge exists from node A to node B if A is a subset of B and A
differs from B by only 1 column/item
- Search can be done depth first
- Keep an edge from parent to child only if the parent is a prefix of
the child
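A minimal sketch of this depth-first, prefix-preserving traversal (item names are illustrative): each node is extended only with items that come after its last item, so every combination is visited exactly once.

```python
def enumerate_itemsets(items):
    """Depth-first traversal of the set-enumeration tree over `items`.

    Each node is extended only with items later in the ordering, i.e.
    only the prefix-based parent-to-child edges are kept, so every
    combination of items is visited exactly once."""
    items = sorted(items)
    out = []
    def dfs(node, start):
        for i in range(start, len(items)):
            child = node + [items[i]]
            out.append(''.join(child))
            dfs(child, i + 1)
    dfs([], 0)
    return out

print(enumerate_itemsets("abc"))
# -> ['a', 'ab', 'abc', 'ac', 'b', 'bc', 'c']
```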
|
7
|
|
8
|
|
9
|
- To avoid searching the large column/item enumeration space, our mining
algorithms search for patterns/rules in the sample/row enumeration space
- Our algorithms do not fit into the column/item enumeration framework
- They are not YAARMA (Yet Another Association Rules Mining Algorithm)
- Column/item enumeration algorithms simply do not scale for microarray
datasets
|
10
|
- CARPENTER(SIGKDD'03)
- Find closed patterns using row enumeration
- FARMER(SIGMOD’04)
- Find interesting rule groups and build classifiers based on them
- COBBLER(SSDBM'04)
- Combined row and column enumeration for tables with a large number of
rows and columns
- FARMER's demo (VLDB'04)
- Balance the scale: 3 row enumeration algorithms vs >50 column
enumeration algorithms
|
11
|
|
12
|
|
13
|
- Removing rows that appear in all tuples of the transposed table will
not affect the results
|
14
|
- If a rule has been discovered before, we can prune the enumeration
below this node
- This is because all rules below this node have been discovered before
- For example, at node {3,4}, if we find that {aeh} has already been
discovered, we can prune off all branches below it
|
15
|
- Example: From TT|{1}, we can see that the support of any possible
pattern below node {1} will be at most 5 rows.
|
16
|
- What if classes exist? What more can we do?
- Pruning with interestingness measures
- Minimum confidence
- Minimum chi-square
- Generate lower bounds for classification/prediction
|
17
|
- Concept of a rule group/equivalence class
- Rules supported by exactly the same set of rows are grouped together
- Example: the following rules are derived from rows 2, 3 and 4 with 66%
confidence
|
18
|
- In addition, find only interesting rule groups (IRGs) based on some
measures:
- minconf: the rules in the rule group can predict the class on the RHS
with high confidence
- minchi: there is high correlation between LHS and RHS of the rules
based on chi-square test
- Other measures like lift, entropy gain, conviction etc. can be handled
similarly
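As a sketch of how the two main measures are computed for a single rule group (the counts below are illustrative, not from a real dataset; the chi-square is the standard test on the 2x2 table of LHS membership vs class membership):

```python
def confidence_and_chi2(n, n_lhs, n_rhs, n_both):
    """Confidence and chi-square of a rule LHS -> class RHS.

    n: total rows; n_lhs: rows matching the LHS; n_rhs: rows of class
    RHS; n_both: rows matching the LHS and of class RHS."""
    conf = n_both / n_lhs
    chi2 = 0.0
    cells = [
        (n_lhs,     n_rhs,     n_both),                      # LHS,  RHS
        (n_lhs,     n - n_rhs, n_lhs - n_both),              # LHS, ~RHS
        (n - n_lhs, n_rhs,     n_rhs - n_both),              # ~LHS, RHS
        (n - n_lhs, n - n_rhs, n - n_lhs - n_rhs + n_both),  # ~LHS,~RHS
    ]
    for row_total, col_total, observed in cells:
        expected = row_total * col_total / n
        chi2 += (observed - expected) ** 2 / expected
    return conf, chi2

# A rule matching 3 rows, 2 of which carry the predicted class, in a
# table of 6 rows where 3 rows belong to that class:
conf, chi2 = confidence_and_chi2(n=6, n_lhs=3, n_rhs=3, n_both=2)
print(round(conf, 2), round(chi2, 2))   # -> 0.67 0.67
```

A rule group passes the filters when its confidence exceeds minconf and its chi-square exceeds minchi.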
|
19
|
|
20
|
- Example: In TT|{2,3} on the right, the confidence of any rule below
node {2,3} is at most 4/5
|
21
|
|
22
|
- Example: an upper bound rule with antecedent A = abcde and two rows
(r1: abcf) and (r2: cdeg)
- Initialize lower bounds: {a}, {b}, {c}, {d}, {e}
- Add "abcf" -> new lower bounds: {d}, {e}
- Add "cdeg" -> new lower bounds: {ad}, {bd}, {ae}, {be}
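A hedged sketch of this incremental computation (the function name and representation are my own): the lower bounds are the minimal subsets of the antecedent contained in none of the given rows, and each new row forces the candidates it swallows to grow by one item the row lacks.

```python
def lower_bounds(upper, rows):
    """Minimal subsets of `upper` contained in none of `rows`.

    `upper` is the antecedent of the upper bound rule; `rows` are rows
    outside the rule group's supporting row set.  Each row is processed
    once: candidates swallowed by the row are extended with one item
    the row lacks, then non-minimal sets are discarded."""
    upper = set(upper)
    cands = [{i} for i in sorted(upper)]          # start from singletons
    for row in rows:
        hit = set(row) & upper                    # row's items inside `upper`
        survivors = [c for c in cands if not c <= hit]
        for c in (c for c in cands if c <= hit):
            for extra in sorted(upper - hit):
                grown = c | {extra}
                if grown not in survivors:
                    survivors.append(grown)
        # keep only minimal candidates
        cands = [c for c in survivors if not any(o < c for o in survivors)]
    return sorted(''.join(sorted(c)) for c in cands)

print(lower_bounds("abcde", ["abcf", "cdeg"]))   # -> ['ad', 'ae', 'bd', 'be']
```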
|
23
|
- In general, CARPENTER/FARMER can be implemented in many ways:
- For our case, we assume the dataset fits into main memory and use a
pointer-based algorithm similar to BUC
|
24
|
- Efficiency of FARMER
- On five real-life datasets
- lung cancer (LC), breast cancer (BC), prostate cancer (PC), ALL-AML
leukemia (ALL), colon tumor (CT)
- Varying minsup, minconf, minchi
- Benchmark against
- CHARM [ZaHs02] ICDM'02
- Bayardo’s algorithm (ColumE) [BaAg99] SIGKDD'99
- Usefulness of IRGs
|
25
|
|
26
|
|
27
|
- Generate the upper bounds of IRGs
- Rank the upper bounds, thus ranking the IRGs;
- Apply coverage pruning on the IRGs;
- Predict the test data based on the IRGs that cover it.
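The steps above might be sketched as follows. This is a simplified assumption of the procedure, not FARMER's exact implementation: rules are given pre-ranked as (antecedent, class) pairs, coverage pruning keeps a rule only if it covers some still-uncovered training row, and prediction uses the best-ranked kept rule covering the test sample.

```python
def coverage_prune(ranked_irgs, train_rows, k=1):
    """Scan ranked rules best-first; keep a rule only if it covers at
    least one training row that fewer than k kept rules cover so far."""
    counts = [0] * len(train_rows)
    kept = []
    for items, label in ranked_irgs:
        hits = [i for i, row in enumerate(train_rows)
                if set(items) <= set(row) and counts[i] < k]
        if hits:
            kept.append((items, label))
            for i in hits:
                counts[i] += 1
    return kept

def predict(sample, kept_irgs, default):
    """Predict with the best-ranked kept rule whose antecedent the test
    sample covers; fall back to a default class otherwise."""
    for items, label in kept_irgs:
        if set(items) <= set(sample):
            return label
    return default

# Toy ranked rules and training rows (illustrative only):
irgs = [("aeh", "Cancer"), ("ae", "Cancer"), ("bc", "Normal")]
kept = coverage_prune(irgs, ["aehx", "aeh", "bcd"])
print(predict("aehz", kept, default="Normal"))   # -> Cancer
print(predict("xyz", kept, default="Normal"))    # -> Normal
```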
|
28
|
|
29
|
- FARMER is much more efficient than existing algorithms
- There is evidence that IRGs are useful for classifying microarray
datasets
|
30
|
- Extend CARPENTER to handle datasets with a large number of both
columns and rows
- Switch dynamically between column and row enumeration based on the
estimated cost of processing
|
31
|
|
32
|
|
33
|
|
34
|
- The naïve idea of switching based only on row and feature counts does
not work well
- Instead, estimate the required computation of each enumeration
sub-tree, i.e. row enumeration sub-tree or feature enumeration sub-tree
- Estimate the maximal level of enumeration for each child sub-tree
- Example of estimating the maximal level of enumeration:
- Suppose r=10, S(f1)=0.8, S(f2)=0.5, S(f3)=0.5, S(f4)=0.3 and minsup=2
- S(f1)*S(f2)*S(f3)*r = 2 ≥ minsup
- S(f1)*S(f2)*S(f3)*S(f4)*r = 0.6 < minsup
- Then the estimated deepest node under f1 is f1f2f3
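The arithmetic above can be sketched as a small estimator (S(f) is the selectivity of feature f; the values are those on the slide):

```python
def estimated_max_depth(r, selectivities, minsup):
    """Estimated deepest enumeration level under a node.

    r: number of rows; selectivities: S(f) for the features available
    below the node, in enumeration order.  The estimated support after
    adding each feature is the running product; stop one step before it
    drops below minsup."""
    est, depth = float(r), 0
    for s in selectivities:
        if est * s < minsup:
            break
        est *= s
        depth += 1
    return depth

# r=10, S(f1)=0.8, S(f2)=0.5, S(f3)=0.5, S(f4)=0.3, minsup=2:
# 10*0.8*0.5*0.5 = 2 >= 2, but *0.3 gives 0.6 < 2, so depth is 3 (f1f2f3).
print(estimated_max_depth(10, [0.8, 0.5, 0.5, 0.3], minsup=2))  # -> 3
```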
|
35
|
|
36
|
|
37
|
|
38
|
- [1] Using Transposition for Pattern Discovery from Microarray Data.
Francois Rioult (GREYC CNRS), Jean-Francois Boulicaut (INSA Lyon),
Bruno Cremileux (GREYC CNRS), Jeremy Besson (INSA Lyon)
- Views the presence and absence of genes in the samples as a binary
matrix and performs a transposition of the matrix, which is essentially
our transposed table. The enumeration methods are otherwise the same
|
39
|
- [2] Mining Coherent Gene Clusters from Gene-Sample-Time Microarray Data.
D. Jiang, Jian Pei, M. Ramanathan, C. Tang and A. Zhang. (Industrial
full paper, Runner-up for the best application paper award). SIGKDD’2004
|
40
|
|
41
|
|
42
|
|
43
|
|
44
|
- [2] Pruning Rule 3.2 (Pruning subsumed sets). At a node v = {si1, …,
sik}, if {si1, …, sik} U Tail is a subset of some maximal coherent
sample set, then the subtree of the node can be pruned.
|
45
|
- The sample/row enumeration framework has been successfully adopted by
other groups in mining microarray datasets
- We are proud of our contribution as the group that produced the first
row/sample enumeration algorithm, CARPENTER, and are happy that other
groups also find the method useful
- However, citations from these groups would have been nice. After all,
academic integrity is the most important thing for a researcher.
|
46
|
|
47
|
- Many datasets in bioinformatics have very different characteristics
from those that have been previously studied
- These characteristics can work either against you or for you
- In the case of microarray datasets with a large number of columns but
a small number of rows/samples, we turned what was against us to our
advantage
- Row/sample enumeration
- Pruning strategy
- We showed how our methods have been adapted by other groups to produce
useful algorithms for mining microarray datasets
|
48
|
- Thank you!!!
- atung@comp.nus.edu.sg
- www.comp.nus.edu.sg/~atung/sfu_talk.pdf
|