Feng Pan, Gao Cong, Xu Xin, Anthony K. H. Tung
email: {panfeng, conggao, xuxin, atung}@comp.nus.edu.sg
Abstract

The problem of mining frequent closed patterns has received considerable attention recently as it promises to have much less redundancy compared to discovering all frequent patterns. Existing algorithms can be separated into two groups: feature (column) enumeration and row enumeration. Feature enumeration algorithms like CHARM and CLOSET+ are efficient for datasets with a small number of features and a large number of rows, since the number of feature combinations to be enumerated is small. Row enumeration algorithms like CARPENTER, on the other hand, are more suitable for datasets (e.g., bioinformatics data) with a large number of features and a small number of rows. Both groups of algorithms, however, encounter problems on datasets that have a large number of rows as well as a large number of features.

In this paper, we describe a new algorithm called COBBLER which can efficiently mine such datasets. COBBLER is designed to dynamically switch between feature enumeration and row enumeration, depending on the characteristics of the data encountered during mining. As such, each portion of the dataset can be processed using the most suitable method, making the mining more efficient. Several experiments on real-life and synthetic datasets show that COBBLER is an order of magnitude better than previous closed pattern mining algorithms like CHARM, CLOSET+ and CARPENTER.
1. Introduction

The problem of mining frequent closed patterns has received considerable attention recently as it promises to have much less redundancy compared to discovering all frequent patterns [8]. Existing algorithms can be separated into two groups: feature (column) enumeration and row enumeration. In feature enumeration algorithms like CHARM [9] and CLOSET+ [7], combinations of features are tested systematically to look for frequent closed patterns. Such an approach is suitable for datasets with a small number of features and a large number of rows, since the number of feature combinations to be tested will be small.

However, for bioinformatics data with a large number of features and a small number of rows, the performance of these algorithms deteriorates due to the large number of feature combinations. (Although column is a more suitable term here, we will use the term feature in this paper to avoid potential confusion in the technical discussion.) To get around this problem, the algorithm CARPENTER [3] was developed to perform row enumeration on bioinformatics datasets instead. CARPENTER is a row enumeration algorithm which looks for frequent closed patterns by testing various combinations of rows. Since bioinformatics datasets have a small number of rows and a large number of features, the number of row combinations will be much smaller than the number of feature combinations. As such, row enumeration algorithms like CARPENTER will be more efficient than feature enumeration algorithms on these kinds of datasets.
From the above, it is natural to make two observations.
First, different datasets have different characteristics and thus require different enumeration methods in order to make closed pattern mining efficient. Furthermore, since these algorithms typically focus on processing different subsets of the data during mining, the characteristics of the data subset being handled will change from one subset to another. For example, a dataset that has many more rows than features may be partitioned into sub-datasets with more features than rows. Therefore a single feature enumeration method or a single row enumeration method may become inefficient in some phases of the enumeration even if it is the better choice at the start of the algorithm. As such, it makes sense to switch the enumeration method dynamically as different subsets of the data are being processed.
Second, both classes of algorithms have problems handling datasets with a large number of features and a large number of rows. This can be seen if we understand the basic philosophy of these algorithms. In both classes of algorithms, the aim is to reduce the amount of data being considered by searching in the smaller enumeration space. For example, when performing feature enumeration, the number of rows being considered decreases as the feature set grows. It is thus possible to partition the large number of rows into smaller subsets for efficient mining. However, for datasets with a large number of rows and a large number of features, adopting only a single enumeration method makes it difficult to reduce the data being considered in the other dimension.
Motivated by these observations, we derive a new algorithm called COBBLER in this paper. (COBBLER stands for Combining Row and Column Enumeration; the letter 'b' is counted twice here.) COBBLER is designed to automatically switch between feature enumeration and row enumeration during the mining process based on the characteristics of the data subset being considered. As experiments will show later, such an approach produces good results when handling different kinds of datasets. Experiments show that COBBLER outperforms other closed pattern mining algorithms like CHARM [9], CLOSET+ [7] and CARPENTER [3].
In the next section, we will introduce some preliminaries and give our problem definition. The COBBLER algorithm will be explained in Section 3. To show the advantage of COBBLER's dynamic enumeration, experiments are conducted on both real-life and synthetic datasets in Section 4. Section 5 introduces some of the related work for this paper. We conclude our discussion in Section 6.
2. Preliminaries

We will give a problem description and define some notation for further discussion.

We denote our dataset as D. Let the set of binary features/columns be F = {f1, f2, ..., fm} and let the set of rows be R = {r1, r2, ..., rn}. We abuse our notation slightly by saying that a row ri contains a feature fj if ri has a value of 1 in fj. Thus we can also write fj ∈ ri.

For example, in Figure 1(a), the dataset has 5 features represented by the alphabet set {a, b, c, d, e} and there are 5 rows, r1, r2, ..., r5, in the dataset. The first row r1 contains the feature set {a, c, d}, i.e. these binary features have a value of "1" for r1. To simplify notation, we will hereafter use row numbers to represent a set of rows; for example, "23" will be used to denote the row set {r2, r3}. Likewise, a feature set like {a, c, d} will also be represented as acd.
Here, we give two concepts called feature support set and row support set.

Definition 2.1 Feature Support Set, R(F')
Given a set of features F' ⊆ F, we use R(F') to denote the maximal set of rows that contain F'.

Definition 2.2 Row Support Set, F(R')
Given a set of rows R' ⊆ R, we use F(R') to denote the largest set of features that are common among the rows in R'.
Example 1 R(F') and F(R')
Consider the table in Figure 1(a). Let F' be the feature set {d, e}; then R(F') = {r2, r4}, since both r2 and r4 contain F' and no other rows in the table contain F'. Also let R' be the set of rows {r2, r3}; then F(R') = {b, e}, since both feature b and feature e occur in r2 and r3 and no other features occur in both r2 and r3.
Definition 2.3 Support, |R(F')|
Given a set of features F', the number of rows in the dataset that contain F' is called the support of F'. Using the earlier definition, we can denote the support of F' as |R(F')|.

Definition 2.4 Closed Pattern
A set of features F' is called a closed pattern if there exists no F'' such that F' ⊂ F'' and |R(F'')| = |R(F')|.

Definition 2.5 Frequent Closed Pattern
A set of features F' is called a frequent closed pattern if (1) |R(F')|, the support of F', is higher than a minimum support threshold; and (2) F' is a closed pattern.
| ri | features |
| r1 | a, c, d |
| r2 | a, b, d, e |
| r3 | b, e |
| r4 | b, c, d, e |
| r5 | a, b, c, e |

(a) Original Example Table, T

| fj | rows |
| a | 1, 2, 5 |
| b | 2, 3, 4, 5 |
| c | 1, 4, 5 |
| d | 1, 2, 4 |
| e | 2, 3, 4, 5 |

(b) Transposed Table, TT

Figure 1. Running Example
Let us illustrate these notions with another example below.

Example 2 Given that minsup = 1, the feature set {b, e} is a frequent closed pattern in the table of Figure 1(a), since it occurs four times in the table. The feature set {d, e}, on the other hand, is not a frequent closed pattern although it occurs two times in the table, which is more than the minsup threshold. This is because it has a superset {b, d, e} with |R({d, e})| = |R({b, d, e})|.
We now define our problem as follows. Problem Definition: Given a dataset D which contains records that are subsets of a feature set F, our problem is to discover all frequent closed patterns with respect to a user-specified support threshold minsup.
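To make these definitions concrete, the following Python sketch (ours, not part of the original paper) computes R(F'), F(R'), the support and the closedness check on the running example of Figure 1(a); the dictionary-based representation of the table is purely an illustrative choice.

```python
# Illustrative sketch only: naive computation of R(F'), F(R') and closedness
# on the running example of Figure 1(a).

# Running example: row id -> set of features
T = {
    1: {'a', 'c', 'd'},
    2: {'a', 'b', 'd', 'e'},
    3: {'b', 'e'},
    4: {'b', 'c', 'd', 'e'},
    5: {'a', 'b', 'c', 'e'},
}

def R(F_prime):
    """Feature support set: maximal set of rows containing every feature in F'."""
    return {r for r, feats in T.items() if F_prime <= feats}

def F(R_prime):
    """Row support set: largest feature set common to all rows in R'."""
    rows = [T[r] for r in R_prime]
    return set.intersection(*rows) if rows else set()

def is_closed(F_prime):
    """F' is closed iff no proper superset has the same support, i.e. F(R(F')) == F'."""
    return F(R(F_prime)) == set(F_prime)

print(R({'d', 'e'}))                               # {2, 4}
print(F({2, 3}))                                   # {'b', 'e'}
print(len(R({'b', 'e'})), is_closed({'b', 'e'}))   # 4 True
print(len(R({'d', 'e'})), is_closed({'d', 'e'}))   # 2 False (superset {'b','d','e'})
```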
To illustrate our algorithm, we will use the tables in Figure 1 as a running example. Figure 1(a) is the original table T and Figure 1(b) is the transposed version of T, denoted TT. In TT, the tuple identifiers are the features of T, while the entries of each tuple are row ids of T. A row number i appears in the tuple of feature fj in TT if and only if feature fj occurs in row ri in T. For example, since feature "c" occurs in r1, r4 and r5 in the original table, row ids "1", "4" and "5" occur in tuple "c" of the transposed table. To avoid confusion, we will hereafter use tuples to refer to the rows of the transposed table and rows to refer to the rows of the original table.
Algorithms for discovering closed patterns can be viewed as a search in an enumeration tree. An enumeration tree can be either a feature enumeration tree or a row enumeration tree. Figure 2(a) shows a feature enumeration tree in which each possible combination of features is represented as a unique node in the tree. Node "ab" in the tree, for example, represents the feature combination {a, b}, while the bracket below it (i.e. {r2, r5}) indicates that rows r2 and r5 contain ab. Algorithms like CHARM and CLOSET+ find closed patterns by performing a depth-first search (DFS) in the feature enumeration tree, starting from the root. By imposing an order ≺_f on the features, each possible combination of features will be systematically visited following a lexicographic order. In Figure 2(a), the order of enumeration will be a, ab, abc, abcd, abcde, abce, abd, ..., e (in the absence of any optimization and pruning strategies).
Figure 2. Traditional feature and row enumeration trees: (a) feature enumeration tree; (b) row enumeration tree.
| ri | remaining features |
| r2 | d, e |
| r3 | e |
| r4 | c, d, e |
| r5 | c, e |

(a) {b}-conditional table, T|{b}

| fj | remaining rows |
| a | 5 |
| d | 4 |

(b) {1,2}-conditional transposed table, TT|{1,2}

Figure 3. Conditional Tables
The concept of a row enumeration tree is similar to that of a feature enumeration tree, except that in a row enumeration tree each possible combination of rows (instead of features), R', is represented as a node in the tree. Figure 2(b) shows a row enumeration tree. Node "12" in the figure represents the row combination {r1, r2}, while the bracket "ad" below it denotes the fact that "ad" is found in both r1 and r2 (i.e. F({r1, r2}) = {a, d}). Again, by imposing an order ≺_r on the rows, a row enumeration algorithm like CARPENTER will be able to visit each possible combination of rows in a DFS manner on the enumeration tree. The order of nodes visited in Figure 2(b) will be 1, 12, 123, 1234, 12345, 1235, 124, ..., 45, 5 when no pruning strategies are adopted.
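As an illustration (our own sketch, assuming the feature order a ≺ b ≺ c ≺ d ≺ e and the row order r1 ≺ ... ≺ r5 used in the running example), the following code generates the lexicographic depth-first orders behind the two trees of Figure 2 and labels each node with its bracket, R(F') or F(R').

```python
# Illustrative sketch: the lexicographic DFS orders behind Figure 2.
T = {1: {'a','c','d'}, 2: {'a','b','d','e'}, 3: {'b','e'},
     4: {'b','c','d','e'}, 5: {'a','b','c','e'}}

def R(F_prime):                     # rows containing all features in F'
    return {r for r, feats in T.items() if F_prime <= feats}

def F(R_prime):                     # features common to all rows in R'
    return set.intersection(*(T[r] for r in R_prime))

def dfs(items, prefix=()):
    """All non-empty combinations of `items`, in lexicographic depth-first order."""
    for i, x in enumerate(items):
        node = prefix + (x,)
        yield node
        yield from dfs(items[i + 1:], node)

# Feature enumeration tree of Figure 2(a): a, ab, abc, ..., e.
for node in dfs(['a', 'b', 'c', 'd', 'e']):
    print(''.join(node), R(set(node)))

# Row enumeration tree of Figure 2(b): 1, 12, 123, ..., 45, 5.
for node in dfs([1, 2, 3, 4, 5]):
    print(''.join(map(str, node)), F(set(node)))
```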
Regardless of row or feature enumeration, searches in the enumeration tree are simulated by successive generation of conditional (original) tables and conditional transposed tables, defined as follows.

Definition 3.1 Conditional Table, T|X
Let X be a subset of features. Given the original table T, the X-conditional original table, denoted T|X, is a subset of rows from T such that: (1) a row of T is in T|X if and only if it contains X; (2) the features of X and all features having higher order than the features of X according to ≺_f (i.e., ranked before them) are removed from each row in T|X.
Example 3 Let the original table in Figure 1(a) be T. When the node "b" in the enumeration tree of Figure 2(a) is visited, an X-conditional table T|X (note: X = {b}) is created, as shown in Figure 3(a). From T|X we can infer that there are 4 rows which contain "b".
Definition 3.2 Conditional Transposed Table, TT|X
Let X be a subset of rows (in the original table). Given the transposed table TT, the X-conditional transposed table, denoted TT|X, is a subset of tuples from TT such that: (1) a tuple of TT is in TT|X if and only if its feature occurs in every row of X; (2) the rows of X and all rows having higher order than the rows of X according to ≺_r (i.e., ranked before them) are removed from each tuple in TT|X.
Example 4 Let the transposed table in Figure 1(b) be TT. When the node "12" in the row enumeration tree of Figure 2(b) is visited, an X-conditional transposed table TT|X (note: X = {1, 2}) is created, as shown in Figure 3(b). The inference we make from TT|X is slightly different from the one we make in the earlier example. Here we can infer that {a, d} occurs in two rows of the dataset (i.e. r1 and r2).
In both Example 3 and Example 4, it is easy to see that the number of rows (tuples) in the conditional (transposed) table is reduced as the search moves down the enumeration tree. This enhances the efficiency of mining since the number of rows (tuples) being processed at deeper levels of the tree is also reduced. Furthermore, the conditional (transposed) table of a node can be easily obtained from that of its parent. Searching the enumeration tree is thus a successive generation of conditional tables, where the conditional table at each node is obtained by scanning the conditional table of its parent node.
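The following sketch (ours; it assumes the feature order of the running example, and the function names are our own) illustrates Definitions 3.1 and 3.2 and the successive generation of conditional tables from a parent's conditional table.

```python
# Illustrative sketch of Definitions 3.1 and 3.2 on the running example.
T = {1: {'a','c','d'}, 2: {'a','b','d','e'}, 3: {'b','e'},
     4: {'b','c','d','e'}, 5: {'a','b','c','e'}}
FEATURE_ORDER = ['a', 'b', 'c', 'd', 'e']        # assumed order <_f
rank = {f: i for i, f in enumerate(FEATURE_ORDER)}

def conditional_table(table, X):
    """X-conditional table T|X: keep rows containing X, and in each such row keep
    only the features ranked after the last feature of X (Definition 3.1)."""
    last = max(rank[f] for f in X)
    return {r: {f for f in feats if rank[f] > last}
            for r, feats in table.items() if X <= feats}

def conditional_transposed(table, Y):
    """Y-conditional transposed table TT|Y for a row set Y: one tuple per feature
    common to all rows of Y, keeping only row ids larger than max(Y) (Definition 3.2)."""
    common = set.intersection(*(table[r] for r in Y))
    top = max(Y)
    return {f: {r for r, feats in table.items() if f in feats and r > top}
            for f in common}

print(conditional_table(T, {'b'}))        # rows 2..5 with later-ranked features, cf. Figure 3(a)
print(conditional_transposed(T, {1, 2}))  # {'a': {5}, 'd': {4}}, cf. Figure 3(b)

# Successive generation: T|{b,c} can be obtained from T|{b} rather than from T.
T_b = conditional_table(T, {'b'})
T_bc = {r: {f for f in feats if rank[f] > rank['c']}
        for r, feats in T_b.items() if 'c' in feats}
print(T_bc)                               # {4: {'d','e'}, 5: {'e'}}
```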
As we can see, the basic characteristic of a row enumeration tree or a feature enumeration tree is that the tree is static. The current solution is to make a selection between these approaches based on the characteristics of T at the start of the algorithm. For datasets with many rows and few features, algorithms like CHARM [9] and CLOSET+ [7] that search in the feature enumeration tree will be more efficient since the number of possible feature combinations is small. However, when the number of features is much larger than the number of rows, a row enumeration algorithm like CARPENTER [3] was shown to be much more efficient.
There are two motivations for adopting a more dynamic approach.
First, the characteristics of the conditional tables can be different from those of the original table. Since the number of rows (or tuples) is reduced as we move down the enumeration tree, it is possible that a table T which has more rows than features initially could have this characteristic reversed for its conditional tables T|X (i.e. more features than rows). As such, it makes sense to adopt a different enumeration approach as the data characteristics change.
Second, for datasets with a large number of rows and also a large number of features, a combination of row and feature enumeration can help to reduce both the number of rows and the number of features being considered in the conditional tables, thus enhancing the efficiency of mining.
Next, we illustrate with a simple example what we mean by dynamic switching of the enumeration method:
Example 5 Consider the table T in Figure 1(a). Let us assume that the order for features, ≺_f, is a ≺ b ≺ c ≺ d ≺ e and the order for rows, ≺_r, is r1 ≺ r2 ≺ ... ≺ r5. Suppose we first perform feature enumeration, generating the {b}-conditional table (shown earlier in Figure 3(a)) followed by the {b,c}-conditional table in Figure 4(a). To switch to row enumeration, T|{b,c} is first transposed to create TT(T|{b,c}) in Figure 4(b). (TT stands for transposed.) Since only rows 4 and 5 appear in the tuples of TT(T|{b,c}), we next perform row enumeration on row 4, which gives TT(T|{b,c})|{4} in Figure 4(c). From TT(T|{b,c})|{4}, we see that features "d" and "e" are both in row 4. Thus, we can conclude that only 1 row (i.e. row 4) contains the feature set {b,c} ∪ {d,e} = {b,c,d,e} ({b,c} is obtained from feature enumeration while {d,e} is obtained from row enumeration).
| ri | remaining features |
| r4 | d, e |
| r5 | e |

(a) T|{b,c}

| fj | remaining rows |
| d | 4 |
| e | 4, 5 |

(b) TT(T|{b,c})

| fj | remaining rows |
| d |  |
| e | 5 |

(c) TT(T|{b,c})|{4}

Figure 4. Conditional Tables
Figures 5(a) and 5(b) show examples of possible dynamic enumeration trees that could be generated from table T in our running example. In Figure 5(a), we highlight the path linking nodes "b", "bc" and "4" as they correspond to the nodes visited in Example 5. Switching from row enumeration to feature enumeration is also possible, as shown in Figure 5(b).
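For concreteness, the following sketch (ours) replays the switch of Example 5 in code: feature enumeration down to node "bc", transposition of T|{b,c}, and row enumeration on row 4. The helper cond is an illustrative construction, not the paper's implementation.

```python
# Illustrative walkthrough of Example 5: feature enumeration to node "bc",
# then a switch to row enumeration on row 4.
T = {1: {'a','c','d'}, 2: {'a','b','d','e'}, 3: {'b','e'},
     4: {'b','c','d','e'}, 5: {'a','b','c','e'}}
rank = {f: i for i, f in enumerate(['a', 'b', 'c', 'd', 'e'])}   # assumed order <_f

def cond(table, X):
    """X-conditional table: rows containing X, keeping features ranked after X."""
    last = max(rank[f] for f in X)
    return {r: {f for f in feats if rank[f] > last}
            for r, feats in table.items() if X <= feats}

T_bc = cond(T, {'b', 'c'})
print(T_bc)                        # {4: {'d','e'}, 5: {'e'}}  (Figure 4(a))

# Transpose T|{b,c} to switch to row enumeration (Figure 4(b)).
TT_bc = {}
for r, feats in T_bc.items():
    for f in feats:
        TT_bc.setdefault(f, set()).add(r)
print(TT_bc)                       # {'d': {4}, 'e': {4, 5}}

# Row enumeration on row 4 (Figure 4(c)): the surviving tuples are exactly the
# features of row 4, so {b,c} + {d,e} = {b,c,d,e} is contained in one row only.
feats_in_4 = {f for f, rows in TT_bc.items() if 4 in rows}
print({'b', 'c'} | feats_in_4)     # {'b', 'c', 'd', 'e'}
```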
Like previous algorithms, COBBLER performs a depth-first search on the enumeration tree. To ensure a systematic search, enumeration is done based on ≺_r for row enumeration and on ≺_f for feature enumeration.
To formalize the actual enumeration switching procedure, let us first divide all the nodes in our dynamic enumeration tree into two classes, row enumerated nodes and feature enumerated nodes. As the names imply, a row enumerated node is a node which represents a subset of rows R' being enumerated, while a feature enumerated node is a node which represents a subset of features F' being enumerated. For example, in Figure 5(a), the node "bc" is a feature enumerated node while its child node "4" is a row enumerated node.
Definition 3.3 Feature to Row Enumeration Switch
Let N be a feature enumerated node representing the feature subset F', and let R(F') be the rows containing F' in T. In addition, let f_l be the lowest ranking feature in F' based on ≺_f. A switch from feature to row enumeration will follow these steps:
Figure 5. Dynamic enumeration trees: (a) switching from feature-wise to row-wise enumeration; (b) switching from row-wise to feature-wise enumeration.
1. Create a transposed table TT(T|F') such that: (a) there is a tuple for each feature fj having lower rank than f_l; (b) given a tuple in TT(T|F') representing a feature fj, the tuple contains all rows ri such that ri ∈ R(F') and ri ∈ R({fj}).

2. Perform row enumeration on TT(T|F') following the order ≺_r.
Example 6 For example, in Figure 5(a), while node "ab" enumerates feature sets, its descendants will switch to enumerate row sets. The sub-tree of node "ab" will create a transposed table with one tuple for each of the features c, d and e, since c, d and e are of lower rank than b in ≺_f. Since R({a, b}) = {r2, r5}, the tuples in the enumeration table will only contain subsets of {2, 5}. We thus have the enumerating order 2, 25, 5 on the transposed table.
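A small sketch of the feature-to-row switch of Definition 3.3, applied to node "ab" of Example 6 (our illustration, with the running example's table and feature order assumed):

```python
# Illustrative sketch of the feature-to-row switch (Definition 3.3).
T = {1: {'a','c','d'}, 2: {'a','b','d','e'}, 3: {'b','e'},
     4: {'b','c','d','e'}, 5: {'a','b','c','e'}}
FEATURE_ORDER = ['a', 'b', 'c', 'd', 'e']     # assumed order <_f
rank = {f: i for i, f in enumerate(FEATURE_ORDER)}

def R(F_prime):
    return {r for r, feats in T.items() if F_prime <= feats}

def switch_to_rows(F_prime):
    """Build TT(T|F'): one tuple per feature ranked after the last feature of F';
    the tuple of feature f holds the rows that contain both F' and f."""
    last = max(rank[f] for f in F_prime)
    rows_FP = R(F_prime)
    return {f: rows_FP & R({f})
            for f in FEATURE_ORDER if rank[f] > last}

# Node "ab" of Example 6: R({a,b}) = {2,5}, tuples for c, d, e.
print(switch_to_rows({'a', 'b'}))   # {'c': {5}, 'd': {2}, 'e': {2, 5}}
# Row enumeration then proceeds over subsets of {2, 5}: 2, 25, 5.
```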
To define the procedure for switching from row to feature enumeration, we first introduce the concept of a direct feature enumerated ancestor.

Definition 3.4 Direct Feature Enumerated Ancestor, DFA(N)
Given a row enumerated node N, its nearest ancestor which enumerates feature subsets is called its direct feature enumerated ancestor, DFA(N). In addition, we will use F'_DFA to denote the feature set represented by DFA(N). The root node of the enumeration tree can be considered to enumerate both a row set and a feature set (the empty set). For example, in Figure 5(b), the direct feature enumerated ancestor of node "24" is the root node, since no feature enumeration takes place on the path from the root to "24".
Definition 3.5 Row to Feature Enumeration Switch
Let N be a row enumerated node representing the row subset R', and let F(R') be the maximal set of features that is found in every row of R' in T. In addition, let F'_DFA be the feature set represented by DFA(N), and let f_l be the lowest ranking feature in F'_DFA based on ≺_f. A switch from row to feature enumeration will follow these steps:

1. Create a table T' such that for each row ri in R(F'_DFA), a corresponding row ri' is in T' with: (a) ri' ⊆ F(R'); (b) a feature f is in ri' if and only if f is in ri, f is in F(R'), and f has lower rank than f_l according to ≺_f.

2. Perform feature enumeration on T' following the order ≺_f.
In essence, a row to feature enumeration switch creates a conditional table T' such that all feature combinations that are supersets of F'_DFA but subsets of F(R') can be tested systematically based on feature enumeration.
Example 7 For example, in Figure 5(b), while node "24" enumerates row sets, its descendants will switch to enumerate feature sets. T' will thus be generated for finding all frequent closed patterns that are subsets of {b, d, e} (i.e. F({r2, r4})) but supersets of the empty feature set (since that is the feature set of the DFA of node "24"). Since R(∅) contains rows r1, r2, r3, r4 and r5, we will create 5 corresponding rows r1', ..., r5' such that ri' = ri ∩ {b, d, e}; for example, r1' = {d} and r2' = r4' = {b, d, e}. Based on ≺_f, the enumeration order will be b, bd, bde, be, d, de, e.
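Correspondingly, the sketch below (ours) builds the table T' of Definition 3.5 for the row enumerated node "24" of Example 7; since the DFA of "24" is taken to be the root, F'_DFA is empty and no additional features are dropped.

```python
# Illustrative sketch of the row-to-feature switch (Definition 3.5).
T = {1: {'a','c','d'}, 2: {'a','b','d','e'}, 3: {'b','e'},
     4: {'b','c','d','e'}, 5: {'a','b','c','e'}}

def F(R_prime):
    return set.intersection(*(T[r] for r in R_prime))

def R(F_prime):
    return {r for r, feats in T.items() if F_prime <= feats}

def switch_to_features(R_prime, F_dfa=frozenset()):
    """Build T': for each row in R(F_dfa), keep only the features that are in F(R')
    and not already in F_dfa (features ranked before the last feature of F_dfa would
    be dropped as well; here F_dfa is empty, so nothing extra is removed)."""
    target = F(R_prime)
    return {r: (T[r] & target) - set(F_dfa) for r in R(set(F_dfa))}

# Node "24" of Example 7: F({2,4}) = {b,d,e}, F'_DFA = {} so all five rows survive.
print(switch_to_features({2, 4}))
# {1: {'d'}, 2: {'b','d','e'}, 3: {'b','e'}, 4: {'b','d','e'}, 5: {'b','e'}}
# Feature enumeration on T' then visits b, bd, bde, be, d, de, e.
```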
Having specified the operations for switching the enumeration method, we next prove that no frequent closed patterns are missed by our algorithm. Our main argument here is that switching the enumeration method at a node N will not affect the set of closed patterns that are tested at the descendants of N. We first prove that this is true for switching from feature to row enumeration.
Lemma 3.1 Given a feature enumerated node N, let T_r be the enumeration subtree rooted at N after switching from feature to row enumeration. Let T_f be the imaginary subtree rooted at node N if there is no switch in enumeration method. Let FCP(T_r) be the set of frequent closed patterns found in enumeration tree T_r and FCP(T_f) be the set of frequent closed patterns found in enumeration tree T_f. We claim that FCP(T_r) = FCP(T_f).

Proof: We first prove that FCP(T_f) ⊆ FCP(T_r) and then that FCP(T_r) ⊆ FCP(T_f).

Suppose node N represents the feature set F'. Assume that in T_f, a depth-first search produces a frequent closed pattern P. In this case P = F' ∪ F_a, with F_a being the additional feature set that is added onto F' when searching in subtree T_f. It can be deduced that R(P) ⊆ R(F') because F' ⊆ P. Since P is a frequent closed pattern, it is also a frequent pattern within the rows of R(F'). Let R' ⊆ R(F') be the unique maximal set of rows that contain P. It is easy to see that R' will also be enumerated in T_r, since all combinations of rows in R(F') are enumerated in T_r. We can now see that both F' (since R' ⊆ R(F')) and F_a are contained in every row of R', which means that P will be enumerated in T_r. Since every closed pattern enumerated in T_f is thus enumerated in T_r, FCP(T_f) ⊆ FCP(T_r).

On the other hand, assume that P is a frequent closed pattern that is found under T_r. Let R' be the row combination enumerated in subtree T_r that gives P (i.e. P = F(R')). Since T_r essentially enumerates all row combinations from R(F'), we know R' ⊆ R(F') and thus F' is in every row of R'. By the definition of F(R'), we know F' ⊆ P, which means that all rows containing P are in R(F'). Since T_f will enumerate all combinations of features occurring in the rows of R(F'), we know P will be enumerated in T_f. Since every closed pattern enumerated in T_r is thus enumerated in T_f, FCP(T_r) ⊆ FCP(T_f).

We can now conclude that FCP(T_r) = FCP(T_f), since FCP(T_f) ⊆ FCP(T_r) and FCP(T_r) ⊆ FCP(T_f).
We next look at the procedure for switching from row to feature enumeration. Our argument goes along the same lines as Lemma 3.1.
Lemma 3.2 Given a row enumerated node N, let T_f be the enumeration subtree rooted at N after switching from row to feature enumeration. Let T_r be the imaginary subtree rooted at node N if there is no switch in enumeration method. Let FCP(T_r) be the set of frequent closed patterns found in enumeration tree T_r and FCP(T_f) be the set of frequent closed patterns found in T_f. We claim that FCP(T_f) = FCP(T_r).

We omit the proof of Lemma 3.2 due to lack of space. The gist of the proof is, however, similar to that of Lemma 3.1.
With Lemma 3.1 and Lemma 3.2, we are sure that the set of frequent closed patterns found by our dynamic enumeration tree is equal to the set found by a pure row enumeration or a pure feature enumeration tree. Therefore, by a depth-first search of the dynamic enumeration tree, we can be sure that all the frequent closed patterns in the database will be found. It is obvious that a complete traversal of the dynamic enumeration tree is not efficient, and pruning methods must be introduced to prune off unnecessary searches. Before we explain these methods, we will first introduce the framework of our algorithm in the next section.
Our formal algorithm is shown in Figure 6 and the details about the subroutines are in Figure 7.
We use both the original table T and the transposed table TT in our algorithm, with infrequent features removed. Our algorithm involves recursive computation of conditional tables and conditional transposed tables for performing a depth-first traversal of the dynamic enumeration tree. Each conditional table represents a feature enumerated node while each conditional transposed table represents a row enumerated node. For example, the {a, b}-conditional table represents the node "ab" in Figure 5(a), while the {2, 5}-conditional transposed table represents the node "25" in Figure 5(b).
Algorithm COBBLER
Input: original table T, transposed table TT, feature set F, row set R and support level minsup.
Output: the complete set of frequent closed patterns, FCP.

Figure 6. The Main Algorithm
After setting FCP, the set of frequent closed patterns, to be empty, our algorithm checks a switching condition to decide whether to perform row enumeration or feature enumeration. Depending on the switching condition, either subroutine RowMine or subroutine FeatureMine is called.

The subroutine RowMine takes in three parameters, TT|X, R' and FCP. TT|X is an X-conditional transposed table, while R' contains the set of rows that will be considered for row enumeration according to ≺_r. FCP contains the frequent closed patterns which have been found so far. Steps 1 to 3 in the subroutine perform the counting and pruning. We will delay all discussion on pruning to Section 3.5. Step 4 in the subroutine outputs the frequent closed pattern. The switching condition is checked in Step 5 to decide whether a row enumeration or a feature enumeration will be executed next. Based on this condition, the subroutine either continues to Step 6 for row enumeration or to Step 7 for feature enumeration. Note that the RowMine subroutine is essentially no different from the row enumeration algorithm CARPENTER in [3], except for Step 7 where we switch to feature enumeration. Since CARPENTER is proven to be correct and Lemma 3.2 has shown that the switch to feature enumeration does not affect our result, we know that the RowMine subroutine will output the correct set of frequent closed patterns.

The subroutine FeatureMine takes in three parameters, T|X, F' and FCP. T|X is an X-conditional original table. F' contains the set of features that will be considered for feature enumeration according to ≺_f. FCP contains the frequent closed patterns which have been found so far. Steps 1 to 3 perform counting and pruning, and their explanation will also be given in a later section. Step 4 outputs the frequent closed pattern, while Step 5 checks the switching condition to decide on the enumeration method. Based on the switching condition, the subroutine either continues to Step 6 for feature enumeration or to Step 7 for row enumeration. We again note that the FeatureMine subroutine is essentially no different from other feature enumeration algorithms like CHARM [9] and CLOSET+ [7], except for Step 7 where we switch to row enumeration. Since these algorithms are proven to be correct and Lemma 3.1 has shown that the switch to row enumeration does not affect our result, we know that the FeatureMine subroutine will output the correct set of frequent closed patterns.
We can observe that the recursive computation stops when, in RowMine, the set R' becomes empty, or when, in FeatureMine, the set F' becomes empty. We will delay the discussion of the switching condition to the next section.
Subroutine RowMine(TT|X, R', FCP)
Parameters: TT|X: an X-conditional transposed table; R': a subset of rows which have not been considered in the enumeration; FCP: the set of frequent closed patterns that have been found. Its numbered steps perform counting and pruning, output the frequent closed pattern, check the switching condition, and recurse via RowMine or FeatureMine, as described in the text.

Subroutine FeatureMine(T|X, F', FCP)
Parameters: T|X: an X-conditional original table; F': a subset of features which have not been considered in the enumeration; FCP: the set of frequent closed patterns that have been found. Its numbered steps mirror those of RowMine, with the roles of rows and features exchanged.

Figure 7. The RowMine and FeatureMine Subroutines
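To make the control flow of Figures 6 and 7 concrete, here is a much simplified Python skeleton of the mutual recursion between FeatureMine and RowMine (our own sketch: the pruning steps are omitted and the switching condition is reduced to a trivial size comparison instead of the cost estimate of Section 3.4, so it should not be read as the actual COBBLER implementation).

```python
# Simplified COBBLER-style skeleton (illustration only): mutual recursion between
# feature enumeration and row enumeration, without the pruning steps of Figure 7.
T = {1: {'a','c','d'}, 2: {'a','b','d','e'}, 3: {'b','e'},
     4: {'b','c','d','e'}, 5: {'a','b','c','e'}}
FEATURES = sorted({f for feats in T.values() for f in feats})
MINSUP = 1
FCP = set()   # frequent closed patterns found so far

def rows_of(F_prime):
    return frozenset(r for r, feats in T.items() if set(F_prime) <= feats)

def feats_of(R_prime):
    return frozenset(set.intersection(*(T[r] for r in R_prime))) if R_prime else frozenset()

def record(F_prime):
    """Keep F' if it is frequent and closed (closure check: F(R(F')) == F')."""
    if F_prime and len(rows_of(F_prime)) >= MINSUP and feats_of(rows_of(F_prime)) == F_prime:
        FCP.add(F_prime)

def switch_to_row_enumeration(F_prime, candidates):
    """Placeholder switching condition: the real criterion compares the estimated
    enumeration costs of Section 3.4; here we just compare candidate counts."""
    return len(rows_of(F_prime)) < len(candidates)

def feature_mine(F_prime, candidates):
    record(F_prime)
    if switch_to_row_enumeration(F_prime, candidates):
        row_mine(frozenset(), sorted(rows_of(F_prime)), base=F_prime)
        return
    for i, f in enumerate(candidates):
        feature_mine(F_prime | {f}, candidates[i + 1:])

def row_mine(R_prime, candidates, base=frozenset()):
    record(base | feats_of(R_prime))
    # (A symmetric switch back to feature enumeration could be inserted here.)
    for i, r in enumerate(candidates):
        row_mine(R_prime | {r}, candidates[i + 1:], base)

feature_mine(frozenset(), FEATURES)
print(sorted(''.join(sorted(p)) for p in FCP))
```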
3.4. Switching Condition

Switching conditions are used to decide whether to switch from row enumeration to feature enumeration or vice versa. To determine this, our main idea is to estimate the enumeration cost of the subtree at a node and to select the smaller one between a feature enumeration subtree and a row enumeration subtree.
The enumeration cost of a tree can be estimated from two components, the size of the tree and the computation cost at each node of the tree. The size of a tree is judged based on the estimated number of nodes it contains while the computation cost at a node is measured using the estimated number of rows (or features) that will be processed at the node.
For example, if a feature enumeration tree T_e contains m nodes N_1, ..., N_m and node N_i will process l_i rows, the enumeration cost of T_e is the sum of the l_i. To simplify the explanation, we will focus on estimating the enumeration cost of a feature enumeration tree; the estimation of the enumeration cost of a row enumeration tree is similar.

Assume that a feature enumeration tree T_e is rooted at a node N which represents X and R(X), and contains sub-nodes N_1, ..., N_m. Let N_i correspond to the conditional table T|X_i. We give some definitions below.
- X_i = X ∪ {f_i}, the feature set represented by node N_i.
- freq(f_j, T|X_i), the frequency of feature f_j in T|X_i.
- n_i = |R(X_i)|, the number of rows the conditional table T|X_i contains.
- H(N_i), the estimated maximum height of the subtree rooted at node N_i.
Given one of the nodes N_i representing the feature set X_i, we first use a simple probability deduction to calculate H(N_i). Suppose the node on level H(N_i) of the subtree is represented as N_{i,H}; we then calculate P(N_{i,H}), the estimated number of relevant rows being processed at the node N_{i,H}. Assume that the set of features which have not been considered at N_i is {f_1, f_2, ..., f_s} and that they are sorted in descending order of freq(f_j, T|X_i). Let k be the largest value such that

n_i · ∏_{j=1..k} ( freq(f_j, T|X_i) / n_i ) ≥ minsup,

i.e. adding one more feature, f_{k+1}, would bring this estimate below minsup. Then we calculate H(N_i) and P(N_{i,H}) as

H(N_i) = k  and  P(N_{i,H}) = n_i · ∏_{j=1..k} ( freq(f_j, T|X_i) / n_i ).

Intuitively, H(N_i) corresponds to the expected maximum number of levels of enumeration that will take place before support pruning takes effect.
Thus the estimated enumeration cost at node N_{i,H} is c · P(N_{i,H}), where c is the average processing time per row.

On the path from node N_i to node N_{i,H}, the node at level d represents the feature set X_i ∪ {f_1, ..., f_d} and its estimated enumeration cost is

c · n_i · ∏_{j=1..d} ( freq(f_j, T|X_i) / n_i ).

Let E(N_i) be the estimated enumeration cost of enumerating through the entire path from node N_i to node N_{i,H}:

E(N_i) = Σ_{d=0..H(N_i)} ( c · n_i · ∏_{j=1..d} ( freq(f_j, T|X_i) / n_i ) ).
Figure 8. Entire and simplified enumeration trees: (a) the entire feature enumeration tree T_e; (b) the simplified feature enumeration tree T_e'.
Figure 8(a) shows the entire representation of the feature enumeration tree T_e. Figure 8(b) is a simplified enumeration tree T_e' of T_e in which only the longest path in each sub-tree rooted at a node N_i is retained. The estimated enumeration cost of T_e' is the sum of the E(N_i), and we use it as the estimated enumeration cost of T_e. Therefore, the estimated enumeration cost of the feature enumeration tree is

Cost(T_e) = Σ_{i=1..m} E(N_i).

The estimated enumeration cost of a row enumeration tree is computed in a similar way. Having computed these two estimated values, we select the search method that has the smaller estimated enumeration cost for the next level of enumeration.
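The following sketch shows one way to compute this kind of cost estimate (it reflects our reading of the formulas above: the expected number of surviving rows after adding the j most frequent remaining features is taken to be n_i · ∏ freq/n_i, and the path is cut off once that estimate drops below minsup). The constant c and the input frequencies are placeholders, not values from the paper.

```python
# Illustrative sketch of the enumeration-cost estimate of Section 3.4.
# The independence-style estimate used here is our reading of the paper's
# "simple probability deduction"; c, freqs and minsup are placeholder inputs.

def path_cost(freqs, n, minsup, c=1.0):
    """Estimated cost E(N_i) of the longest path below a child node N_i.

    freqs  : frequencies of the not-yet-considered features in T|X_i,
             sorted in descending order
    n      : number of rows in T|X_i
    minsup : minimum support threshold
    c      : average processing time per row (placeholder constant)
    """
    cost, expected_rows = c * n, float(n)       # cost of node N_i itself
    for f in freqs:                             # walk down the estimated longest path
        expected_rows *= f / n                  # expected rows after adding one feature
        if expected_rows < minsup:              # support pruning expected to stop here
            break
        cost += c * expected_rows
    return cost

def tree_cost(children, n, minsup, c=1.0):
    """Estimated cost of a feature enumeration subtree: sum of E(N_i) over the
    simplified tree of Figure 8(b), i.e. one longest path per child N_i."""
    return sum(path_cost(sorted(freqs, reverse=True), n, minsup, c)
               for freqs in children)

# Toy usage: two candidate children with hypothetical feature frequencies.
print(tree_cost(children=[[4, 3, 2], [3, 1]], n=5, minsup=2))
```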
3.5. Prune Method
Both subroutines RowMine and FeatureMine apply pruning strategies. We will only give a brief discussion here, since the strategies were developed in previous work and are not the emphasis of this paper.

The correctness of pruning strategies 1 and 2 used in subroutine RowMine has been proven in [3]. Here we will only prove the correctness of the pruning strategy applied in subroutine FeatureMine.

In Step 3 of subroutine FeatureMine, all the features which occur in every row of the X-conditional original table T|X are removed from T|X and are considered to be already enumerated. We prove its correctness with the following lemma.
Lemma 3.3 Let T|X be an X-conditional original table and let Y be the set of features which occur in every row of T|X. Given any feature subset F' found in T|X, we have R(X ∪ F') = R(X ∪ F' ∪ Y).

Proof: By definition, R(X ∪ F') contains the set of rows, all of which contain the feature set X ∪ F'. Since the features in Y occur in every row of T|X, and T|(X ∪ F') is a subset of T|X, these features also occur in every row of T|(X ∪ F'). Thus, the set of rows in T|(X ∪ F') is exactly the set of rows in T|(X ∪ F' ∪ Y). From this, we can conclude that R(X ∪ F') = R(X ∪ F' ∪ Y).
Example 8 As an example to illustrate Lemma 3.3, let us consider the {b}-conditional table in Figure 3(a). Since feature "e" occurs in every row of T|{b}, we can conclude that R({b}) = R({b, e}) = {2, 3, 4, 5}. Thus, we need not create T|{b,e} in our search, and feature "e" need not be considered for further enumeration down that branch of the enumeration tree.
Lemma 3.3 shows that all the frequent closed patterns found in the X-conditional table T|X will contain the feature set Y, since for each feature set F' found in T|X we can form its superset F' ∪ Y with R(X ∪ F') = R(X ∪ F' ∪ Y). Thus it is correct to remove Y from all the rows of T|X and consider Y to be already enumerated.
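As an illustration of this pruning (our sketch, using the {b}-conditional table of Figure 3(a)), the features occurring in every row of the conditional table can be factored out once and appended to every pattern reported below that node:

```python
# Illustrative sketch of the pruning justified by Lemma 3.3.
# T|{b} from Figure 3(a): every row of the conditional table contains 'e',
# so 'e' can be removed from the table and appended to X = {'b'} directly.

T_b = {2: {'d', 'e'}, 3: {'e'}, 4: {'c', 'd', 'e'}, 5: {'c', 'e'}}

def split_common(cond_table):
    """Return (Y, reduced table): Y is the set of features occurring in every row,
    and the reduced table has Y removed from each row."""
    Y = set.intersection(*cond_table.values()) if cond_table else set()
    reduced = {r: feats - Y for r, feats in cond_table.items()}
    return Y, reduced

Y, reduced = split_common(T_b)
print(Y)        # {'e'}  -> every pattern found under node "b" can be extended by 'e'
print(reduced)  # {2: {'d'}, 3: set(), 4: {'c', 'd'}, 5: {'c'}}
# Consequently R({'b'} | F') == R({'b'} | F' | {'e'}) for any F' found in T|{b},
# so branches that would add 'e' separately need not be explored.
```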
To show the feasibility of implementation, we give some details about the implementation of COBBLER.
Figure 9. Conditional Pointer Lists: (a) feature enumeration conditional pointer list at node "a"; (b) row enumeration conditional pointer list at node "1".
The data structure used for enumeration in COBBLER is similar to the one used in CARPENTER. The dataset is organized in a table, and memory pointers pointing to various positions in the table are organized in a conditional pointer list [3]. Since we enumerate both rows and features in COBBLER, we maintain two sets of conditional pointer lists, for the original table T and the transposed table TT respectively. The conditional pointer list for row enumeration is the same as the conditional pointer list used in CARPENTER, while the conditional pointer list for feature enumeration is created simply by replacing the feature ids with row ids and pointing them to the original table T. Figure 9 gives an example of a feature enumeration conditional pointer list and a row enumeration conditional pointer list. Most of the operations we perform to maintain the conditional pointer lists are similar to CARPENTER. Interested readers are referred to [3] for details.
4. Performance
In this section we compare the performance of COBBLER against other algorithms. All our experiments were performed on a PC with a Pentium IV 2.4GHz CPU, 1GB of RAM and a 30GB hard disk. Algorithms were coded in standard C.
Algorithms: We compare COBBLER against two other closed pattern discovery algorithms, CHARM [9] and CLOSET+ [7]. CHARM and CLOSET+ are both feature enumeration algorithms. We also compared the performance of CARPENTER [3] and COBBLER, but since COBBLER's performance is always better than CARPENTER's, we do not present the results for CARPENTER here. To make a fair comparison, CHARM and CLOSET+ are also run in main memory after one disk scan is done to load the datasets.
Datasets: We choose one real-life dataset and one synthetic dataset to analyze the performance of COBBLER. The characteristics of the two datasets are shown in the table below.
Dataset | # items | # rows | row length |
thrombin | 139351 | 1316 | 29745 |
synthetic data | 100000 | 15000 | 1700 |
As we can see, the two datasets have different characteristics. The thrombin dataset consists of compounds tested for their ability to bind to a target site on thrombin, a key receptor in blood clotting. Each compound is described by a single feature vector comprised of a class value (A for active, I for inactive) and 139,351 binary features, which describe the three-dimensional properties of the molecule. The synthetic dataset is generated by the IBM data generator. It is a dense dataset and contains long frequent patterns even at relatively high support values.
Parameters: Three parameters are varied in our experiments: minimum support (minsup), row ratio (r) and length ratio (l). The parameter minimum support, minsup, is the minimum support threshold explained earlier. The parameters r and l are used to vary the size of the synthetic dataset for the scalability tests. The parameter row ratio, r, has a value above 0. It is used to generate new datasets with different numbers of rows using the IBM data generator. All datasets with different row ratios r were generated using the same set of parameters, except that each time the number of rows is scaled in proportion to r. The parameter length ratio, l, has a value between 0 and 1. It is used to generate new datasets with a different average row size from the original synthetic dataset listed in the table above. A dataset with a length ratio of l retains on average a fraction l of the columns in the original dataset; columns to be retained are randomly selected for each row. The default values of l and r correspond to the original synthetic dataset described above. Because the real-life data is very different from the synthetic dataset, we only use r and l for the synthetic dataset.
http://www.biostat.wisc.edu/ page/Thrombin.testset.zip
4.1. Varying Minimum Support
In this set of experiments, we set r and l to their default values and vary the minimum support. Because of the different characteristics of the two datasets, we vary the minimum support in different ranges. The thrombin dataset is relatively sparse, so its minimum support is varied over a range of low values. The synthetic dataset is relatively dense and the number of frequent items is quite sensitive to the minimum support, so its minimum support is varied over a smaller range of relatively high values.
Figures 10 and 11 show how COBBLER compares against CHARM and CLOSET+ as minsup is varied. We can observe that on the real-life dataset, CLOSET+ performs worst most of the time, while CHARM performs best when minsup is relatively high; when minsup is decreased to low values, COBBLER performs best. This is because when minsup is high, the structure of the dataset after removing all the infrequent items is relatively simple. Because the characteristics of the data subset seldom change during the enumeration, COBBLER uses only one of the enumeration methods and becomes either a pure feature enumeration algorithm or a pure row enumeration algorithm. The advantage of COBBLER's dynamic enumeration cannot be seen, and therefore COBBLER is outperformed by CHARM, which is a highly optimized feature enumeration algorithm.

With the decrease of minsup, the structure of the dataset after removing infrequent items becomes more complex. COBBLER begins to switch between the feature enumeration method and the row enumeration method according to the varying characteristics of the data subset. Therefore COBBLER outperforms CHARM at low minsup on the real-life dataset.

On the synthetic dataset, COBBLER performs best most of the time, since the synthetic dataset is dense and complex enough. CHARM performs worst on this dataset, even at very high minsup. This is due to the fact that the synthetic dataset is very dense, which results in a very large feature enumeration space for CHARM.
4.2. Varying Length Ratio
In this set of experiments, we vary the size of the synthetic dataset by changing the length ratio l. We set minsup to a fixed value, set r to its default value, and vary l. If l is set to too small a value, the generated dataset becomes too sparse for any interesting results. Figure 12 shows the performance comparison of COBBLER, CHARM and CLOSET+ on the synthetic dataset when we vary l. For CHARM and CLOSET+, it takes too much time to run on the dataset with the largest value of l, so that result is not included in Figure 12. As we can see from the graph, COBBLER outperforms CHARM and CLOSET+ in most cases. CHARM is always the worst among the three algorithms, and both COBBLER and CLOSET+ are an order of magnitude better than it. CLOSET+ has a steep increase in run time as the length ratio is increased. Its performance is as good as COBBLER's when l is low, but it is soon outperformed by COBBLER when l is increased to higher values.

COBBLER's performance is not significantly better than CLOSET+'s at low l values because a low value of l destroys many of the frequent patterns in the dataset, making the dataset sparse. This causes COBBLER to perform pure feature enumeration and lose the advantage of dynamic enumeration. With the increase of l, the dataset becomes more complex and COBBLER shows its advantage over CLOSET+ and also CHARM.
4.3. Varying Row Ratio
In this set of experiments, we vary the size of the synthetic dataset by varying the row ratio r. We set minsup to a fixed value, set l to its default value, and vary r. Figure 13 shows the performance comparison of COBBLER, CHARM and CLOSET+ on the synthetic dataset when we vary r. As we can see, with the increase in the number of rows, the datasets become more complex and COBBLER's dynamic enumeration strategy shows its advantage over the other two algorithms. In all cases, COBBLER outperforms CHARM and CLOSET+ by an order of magnitude and also has the smoothest increase in run time.

As can be seen, in all the experiments we conducted, COBBLER outperforms CLOSET+ in most cases and outperforms CHARM when the dataset becomes complicated, for increased r and l or decreased minsup. This result also demonstrates that COBBLER is efficient on datasets with different characteristics, as it uses combined row and feature enumeration and can switch between these two enumeration methods according to the characteristics of the dataset during the search process.
5. Related Work
Frequent pattern mining [1, 2, 6, 10] is a vital topic that has received a significant amount of attention during the past decade. The number of frequent patterns in a large data set can be very large, and many of these frequent patterns may be redundant. To reduce the frequent patterns to a compact size, mining frequent closed patterns has been proposed. The following are some recent advances in mining frequent closed patterns.
CLOSET [5] and CLOSET+ [7] are two algorithms which discover closed patterns by depth-first feature enumeration. CLOSET uses a frequent pattern tree (FP-tree) structure for a compressed representation of the dataset. CLOSET+ is an updated version of CLOSET. In CLOSET+, a hybrid tree-projection method is implemented, and it builds conditional projected tables in two different ways according to the density of the dataset. As shown in our experiments, both CLOSET and CLOSET+ are unable to handle long datasets due to their pure feature enumeration strategy.
CHARM [9] is a feature enumeration algorithm for mining frequent closed patterns. Like CLOSET+, CHARM performs depth-first feature enumeration. But instead of using the FP-tree structure, CHARM uses a vertical format to store the dataset, in which a list of row ids is stored for each feature. These row id lists are then merged during the feature enumeration to generate new row id lists that represent the corresponding feature sets in the enumeration tree. In addition, a technique called diffset is used to reduce the size of the row id lists and the computational complexity of merging them.
Another algorithm for mining frequent closed patterns is CARPENTER [3]. CARPENTER is a pure row enumeration algorithm. It discovers frequent closed patterns by performing depth-first row enumeration combined with efficient search pruning techniques. CARPENTER is especially designed to mine frequent closed patterns in datasets containing a large number of columns and a small number of rows.
6. Conclusion
In this paper, we proposed an algorithm called COBBLER which can dynamically switch between row and feature enumeration for frequent closed pattern discovery. COBBLER automatically selects an enumeration method according to the characteristics of the dataset before and during the enumeration. This dynamic strategy helps COBBLER to deal with different kinds of datasets, including large, dense datasets that have varying characteristics on different data subsets. Experiments show that our approach yields a good payoff, as COBBLER outperforms existing frequent closed pattern discovery algorithms like CLOSET+, CHARM and CARPENTER on several kinds of datasets. In the future, we will look at how COBBLER can be extended to handle datasets that cannot fit into main memory.
References
[1] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proc. 1994 Int. Conf. Very Large Data Bases (VLDB’94), pages 487–499, Sept. 1994.
[2] H. Mannila, H. Toivonen, and A. I. Verkamo. Efficient algorithms for discovering association rules. In Proc. AAAI’94 Workshop Knowledge Discovery in Databases (KDD’94).
[3] F. Pan, G. Cong, and A. K. H. Tung. Carpenter: Finding closed patterns in long biological datasets. In Proc. Of ACM-SIGKDD Int’l Conference on Knowledge Discovery and Data Mining, 2003.
[4] J. Pei, J. Han, H. Lu, S. Nishio, S. Tang, and D. Yang. H-mine: Hyper-structure mining of frequent patterns in large databases. In Proc. IEEE 2001 Int. Conf. Data Mining (ICDM'01), November 2001.
[5] J. Pei, J. Han, and R. Mao. CLOSET: An efficient algorithm for mining frequent closed itemsets. In Proc. 2000 ACMSIGMOD Int. Workshop Data Mining and Knowledge Discovery (DMKD’00).
Figure 12. Varying l (synthetic data). Figure 13. Varying r (synthetic data).
[6] P. Shenoy, J. Haritsa, S. Sudarshan, G. Bhalotia, M. Bawa, and D. Shah. Turbo-charging vertical mining of large databases. In Proc. 2000 ACM-SIGMOD Int. Conf. Management of Data (SIGMOD’00), pages 22–23, Dallas, TX, May 2000.
[7] J. Wang, J. Han, and J. Pei. Closet+: Searching for the best strategies for mining frequent closed itemsets. In Proc. 2003 ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD’03), Washington, D.C., Aug 2003.
[8] M. Zaki. Generating non-redundant association rules. In
Proc. 2000 Int. Conf. Knowledge Discovery and Data Mining (KDD’00), 2000.
[9] M. Zaki and C. Hsiao. Charm: An efficient algorithm for closed association rule mining. In Proc. of SDM 2002, 2002.
[10] M. J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li. New algorithms for fast discovery of association rules. In Proc. 1997 Int. Conf. Knowledge Discovery and Data Mining (KDD’97), pages 283–286, Newport Beach, CA, Aug. 1997.