Frequent Pattern Mining

Learning Objectives

After this unit, students should be able to

  • explain frequent pattern mining.
  • describe the applications of frequent pattern mining.
  • define association rule along with their support and confidence.
  • describe apriori algorithm.

Frequent pattern mining is a fundamental task in data mining that involves discovering recurring patterns, relationships, or associations in large datasets. One of the most common applications of frequent pattern mining is in market basket analysis, where it helps to discover associations between items bought together frequently by customers. For example, it might reveal that customers who buy bread often also buy butter, helping retailers to optimize product placements or promotions. It is also used in other fields such as web usage mining and bioinformatics (gene sequence patterns).

Before we move on to the algorithms for identifying frequent patterns in the data, let's first define the necessary terminology.

Terminology

Please consider the following collection of transactions. Each transaction contains a collection of items bought in a grocery store.

  • Transaction #1. bread, yogurt
  • Transaction #2. bread, milk, cereal, eggs
  • Transaction #3. bread, yogurt, milk, cheese, cereal
  • Transaction #4. bread, milk, cereal

The following key concepts establish a foundational basis for frequent pattern mining algorithms. Examples are given in reference to the toy transaction dataset above.

| Concept | Definition | Example |
|---|---|---|
| Itemset | A collection of items. | \(\{milk\}\), \(\{bread, yogurt\}\) |
| \(K\)-itemset | An itemset containing \(K\) items. | \(\{milk\}\) is a \(1\)-itemset. |
| Support of Itemset | The fraction of total transactions that contain all items in the itemset. | Support of \(\{milk, bread\}\) is \(3/4\). |
| Association Rule | A rule of the form "if \(X\) is bought, \(Y\) is likely to be bought", denoted \(X \Rightarrow Y\), where \(X\) and \(Y\) are itemsets. | \(\{yogurt\} \Rightarrow \{milk\}\) |
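As a sanity check on these definitions, the support values above can be computed directly from the toy transactions. The following is a minimal Python sketch (the variable and function names are illustrative, not from any particular library):

```python
# Toy transaction dataset from the text, with each transaction as a set of items.
transactions = [
    {"bread", "yogurt"},
    {"bread", "milk", "cereal", "eggs"},
    {"bread", "yogurt", "milk", "cheese", "cereal"},
    {"bread", "milk", "cereal"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    count = sum(1 for t in transactions if itemset <= t)  # subset test
    return count / len(transactions)

print(support({"milk", "bread"}, transactions))  # 0.75, i.e., 3 of 4 transactions
```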

One can derive many association rules from a given dataset, but not every rule provides insight into the underlying patterns; for example, the trivial rule \(\{bread, milk\} \Rightarrow \{milk\}\) always holds. The interestingness of an association rule is quantified using its confidence, defined as follows:

\[ Confidence(X \Rightarrow Y) = \frac{Support(X \cup Y)}{Support(X)} \]
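This formula translates directly into code. A small sketch, reusing the toy transactions from the Terminology section (names are illustrative):

```python
# Toy transaction dataset from the Terminology section.
transactions = [
    {"bread", "yogurt"},
    {"bread", "milk", "cereal", "eggs"},
    {"bread", "yogurt", "milk", "cheese", "cereal"},
    {"bread", "milk", "cereal"},
]

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(1 for t in transactions if set(itemset) <= t) / len(transactions)

def confidence(X, Y):
    """Confidence(X => Y) = Support(X union Y) / Support(X)."""
    return support(set(X) | set(Y)) / support(X)

print(confidence({"yogurt"}, {"milk"}))         # 0.5: half of yogurt buyers also buy milk
print(confidence({"bread", "milk"}, {"milk"}))  # 1.0: a trivial rule always has full confidence
```

Note that the trivial rule \(\{bread, milk\} \Rightarrow \{milk\}\) achieves confidence \(1\), which is why confidence alone does not measure interestingness.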

Based on its support and confidence, an association rule \(X \Rightarrow Y\) can be classified into one of the following regions.

  • Low Confidence - Low Support. The items rarely occur in the dataset, and \(Y\) seldom appears with \(X\); such rules are usually discarded.
  • High Confidence - Low Support. The items rarely occur in the dataset, but when \(X\) is bought, \(Y\) usually is as well.
  • Low Confidence - High Support. \(X\) occurs frequently, but \(Y\) does not often appear together with it.
  • High Confidence - High Support. The items occur frequently, and \(Y\) usually appears together with \(X\); these are the most actionable rules.

Most frequent pattern mining algorithms generate only those association rules that exceed a pre-specified minimum support and minimum confidence.

Apriori Algorithm

The Apriori algorithm is a foundational algorithm in frequent pattern mining. It is a bottom-up approach that iteratively generates candidate itemsets and prunes those that do not meet the minimum support threshold. It rests on the Apriori principle: if an itemset is frequent, then all of its subsets must also be frequent (equivalently, if an itemset is infrequent, none of its supersets can be frequent).

The algorithm first scans the dataset to find individual items (1-itemsets) that meet a specified minimum support threshold. Then, it combines these frequent 1-itemsets to form 2-itemsets, and so on, pruning itemsets that do not meet the minimum support along the way. This process continues until no further frequent itemsets can be found. Although Apriori is easy to understand and implement, its performance can suffer due to the multiple dataset scans required, especially in large datasets. More efficient algorithms like FP-growth are used for large datasets.
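The level-wise procedure described above can be sketched in Python as follows. This is a simplified teaching implementation, not an optimized one; it assumes transactions are given as sets of items and `min_support` is a fraction:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return a dict mapping each frequent itemset (frozenset) to its support."""
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    # Level 1: frequent individual items (1-itemsets).
    items = {i for t in transactions for i in t}
    current = {frozenset([i]) for i in items
               if support(frozenset([i])) >= min_support}
    frequent = {s: support(s) for s in current}

    k = 2
    while current:
        # Candidate generation: join frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune: every (k-1)-subset must itself be frequent (Apriori principle).
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent
                             for s in combinations(c, k - 1))}
        # Scan the dataset to keep only candidates meeting the support threshold.
        current = {c for c in candidates if support(c) >= min_support}
        frequent.update({c: support(c) for c in current})
        k += 1
    return frequent

# Toy dataset from the Terminology section, with min_support = 0.5.
transactions = [
    {"bread", "yogurt"},
    {"bread", "milk", "cereal", "eggs"},
    {"bread", "yogurt", "milk", "cheese", "cereal"},
    {"bread", "milk", "cereal"},
]
frequent = apriori(transactions, min_support=0.5)
```

On this dataset, for example, \(\{bread, milk, cereal\}\) is found frequent with support \(3/4\), while eggs and cheese are pruned at level 1.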

Lift

Lift is another key metric used in evaluating the strength of an association rule in frequent pattern mining. It measures how much more likely two items are to occur together than they would be if they were statistically independent.

Lift of an association rule \(X \Rightarrow Y\) is defined as:

\[ Lift(X \Rightarrow Y) = \frac{Support(X \cup Y)}{Support(X) \times Support(Y)} \]

Lift shows the degree of association between two items compared to what would be expected by chance. A lift value:

  • Greater than 1: indicates a positive correlation (items in \(X\) and \(Y\) are more likely to occur together than by random chance).
  • Equal to 1: implies that the items are independent.
  • Less than 1: indicates a negative correlation (the items occur together less often than expected by chance).

While confidence measures the strength of a rule in a dataset, it does not account for the popularity of the items. For instance, if \(5\%\) of transactions contain both bread and milk, the support of the itemset \(\{bread, milk\}\) is \(0.05\). If the support of bread is \(0.2\) and the support of milk is \(0.3\), the lift for \(\{bread, milk\}\) is \(0.05 / (0.2 \times 0.3) \approx 0.83\). A lift below \(1\) indicates that bread and milk appear together slightly less often than would be expected if they were independent.
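The arithmetic in this example can be verified with a one-line helper. The support values below are the ones assumed in the example, not derived from the toy dataset:

```python
def lift(support_xy, support_x, support_y):
    """Lift(X => Y) = Support(X union Y) / (Support(X) * Support(Y))."""
    return support_xy / (support_x * support_y)

# Values from the bread-and-milk example in the text.
print(round(lift(0.05, 0.2, 0.3), 2))  # 0.83: below 1, a mild negative correlation
```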