Data Mining 12 Frequent Itemsets and Association Rules
Frequent itemsets and association rules: the problem is to find sets of items that are bought together. These itemsets can be efficiently mined by exploiting the Apriori principle (the anti-monotonicity of the support measure).
Frequent Itemset Mining
Frequent itemsets: collections of items that appear together in at least a minimum number of customer baskets.
Set of items $\mathcal{I} = \{i_1, \dots, i_d\}$, set of transaction identifiers $\mathcal{T} = \{t_1, \dots, t_N\}$.
The dataset $\mathcal{D}$ is a set of pairs $(t, B)$ with $t \in \mathcal{T}$ and $B \subseteq \mathcal{I}$.
A basket $B$ is the item set of a transaction $t$.
An itemset $I \in \mathcal{P}(\mathcal{I})$ is a subset of items, where $\mathcal{P}(\mathcal{I})$ is the powerset of $\mathcal{I}$, i.e., the collection of all subsets of $\mathcal{I}$.
Definition of support: For an itemset $I$ we define the support as the number of transactions that contain itemset $I$, i.e., $\sigma(I) = |\{(t, B) \in \mathcal{D} : I \subseteq B\}|$.
The frequent itemset mining problem finds all the itemsets whose support is $\sigma(I) \ge minsup$. The support threshold can be expressed as an absolute number or as a percentage of the transactions, similarly to frequent subgraph mining in the last lecture. If we have $d$ items, we have potentially $2^d$ itemsets. Itemsets are typically represented in a lattice structure, in which the root is the empty set and each node is the union of the itemsets in its parent nodes. The question is how to find frequent itemsets efficiently.
A naive solution is to count each itemset individually. This requires constructing $2^d$ itemsets and computing the frequency of each of them, reaching $O(2^d \cdot N \cdot w)$ time and $O(2^d)$ space, where $N$ is the number of transactions and $w$ is the size of the largest transaction. 👎IMPRACTICAL!
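A minimal sketch of this naive enumeration on a toy basket database (the item names and the `minsup` value are made up for illustration):

```python
from itertools import combinations

# Hypothetical transaction database: each basket is a set of items.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
minsup = 3  # minimum support expressed as an absolute count

def support(itemset, baskets):
    """Number of baskets that contain every item of `itemset`."""
    return sum(1 for basket in baskets if itemset <= basket)

# Naive mining: enumerate all 2^d - 1 non-empty itemsets and count each one.
items = sorted(set().union(*baskets))
frequent = {}
for k in range(1, len(items) + 1):
    for candidate in combinations(items, k):
        s = support(set(candidate), baskets)
        if s >= minsup:
            frequent[candidate] = s

print(frequent)  # e.g. ('bread', 'milk') -> 3, ('beer', 'diapers') -> 3, ...
```

Even on this toy example with 6 distinct items the loop examines 63 itemsets; with realistic catalogues the $2^d$ enumeration is hopeless.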
The Apriori Algorithm
If $I, J$ are two itemsets s.t. $I \subseteq J$, then the support $\sigma(I) \ge \sigma(J)$. We say that the support is anti-monotone.
The algorithm exploits the Apriori property by means of an iterative process that alternates frequent itemset generation and candidate generation. Let $C_k$ be the candidate itemsets of size $k$ and let $L_k$ be the frequent itemsets of size $k$. The Apriori algorithm starts from the intuition that, instead of generating all the itemsets, the candidates of size $k+1$ should only contain frequent itemsets of size $k$.
Pseudo-code
Input: Data $\mathcal{D}$, items $\mathcal{I}$, minimum support $minsup$.
$k \leftarrow 1$; $C_1 \leftarrow \{\{i\} : i \in \mathcal{I}\}$
while $C_k \neq \emptyset$ do
// frequent itemset generation
- Scan the DB to find which itemsets in $C_k$ are frequent and put them into $L_k$
// candidate generation
- Use $L_k$ to generate a collection $C_{k+1}$ of candidate itemsets of size $k+1$; $k \leftarrow k + 1$
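A hedged Python sketch of this level-wise loop, assuming an absolute support threshold and a `generate_candidates` helper that implements the self-join and pruning described next (that helper is sketched after the candidate-generation bullets below):

```python
from collections import defaultdict

def apriori(baskets, minsup):
    """Level-wise Apriori sketch: alternate support counting and candidate generation.

    `baskets` is a list of sets of items; `minsup` is an absolute count.
    Relies on a `generate_candidates(L_k)` helper, sketched further below.
    """
    # C_1: every individual item is a candidate.
    candidates = {frozenset([item]) for basket in baskets for item in basket}
    frequent = {}  # frozenset -> support, accumulated over all sizes k
    while candidates:
        # Frequent itemset generation: scan the DB and count every candidate in C_k.
        counts = defaultdict(int)
        for basket in baskets:
            for c in candidates:
                if c <= basket:
                    counts[c] += 1
        level = {c: s for c, s in counts.items() if s >= minsup}  # this is L_k
        frequent.update(level)
        # Candidate generation: build C_{k+1} from L_k (self-join + pruning).
        candidates = generate_candidates(set(level))
    return frequent
```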
Candidate generation. It constructs the candidate itemsets of size $k+1$ by combining frequent itemsets of size $k$. For simplicity, we assume that the items inside each itemset are ordered, and that the itemsets in $L_k$ are ordered as well. We can generate candidates of size $k+1$ by joining two itemsets of size $k$ that share the first $k-1$ items. In database terms, we are computing a self-join of $L_k$ with itself using the first $k-1$ items as joining attributes.
Once a new candidate is generated, we can check whether all its ordered subsets of size $k$ are also frequent, i.e., whether they are in the set $L_k$. This ensures the correct application of the Apriori principle. The generation of candidates proceeds as follows (see the sketch after this list):
- Self-join: create the set $C_{k+1}$ by joining frequent $k$-itemsets that share the first $k-1$ items
- Pruning step: remove a candidate if any of its $k$-subsets is not frequent.
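A sketch of these two steps, under the same assumptions as the loop above (itemsets stored as frozensets; items are orderable, so sorted tuples give well-defined prefixes):

```python
def generate_candidates(frequent_k):
    """Build C_{k+1} from the frequent k-itemsets L_k (self-join + pruning sketch).

    `frequent_k` is a set of frozensets, all of the same size k.
    """
    # Represent each itemset as a sorted tuple so that prefixes are well defined.
    ordered = sorted(tuple(sorted(itemset)) for itemset in frequent_k)
    candidates = set()
    for i in range(len(ordered)):
        for j in range(i + 1, len(ordered)):
            a, b = ordered[i], ordered[j]
            # Self-join: merge two k-itemsets that share the first k-1 items.
            if a[:-1] == b[:-1]:
                candidate = frozenset(a) | frozenset(b)
                # Pruning: every k-subset of the candidate must itself be frequent.
                if all(candidate - {item} in frequent_k for item in candidate):
                    candidates.add(candidate)
    return candidates
```

For $k = 1$ the shared prefix is empty, so the self-join simply produces all pairs of frequent items.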
Frequent itemset generation. The important step is counting the support of each candidate in $C_k$. This can be sped up by using a hash map; a better solution is a hash tree.
Itemsets of size 2. For typical shop-basket data, the number of candidate pairs is by far the largest among all the candidate sets. It is therefore important to have efficient, in-memory optimizations. There are 2 approaches (see the indexing sketch after this list):
- counting all pairs using a triangular matrix: requires 4 bytes per pair;
- keeping a table of triples $[i, j, c]$, where $c$ is the count of the pair of items $\{i, j\}$: requires 12 bytes per pair, but only for those pairs with a positive count.
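A small sketch of the triangular-matrix option: once the items are renumbered $0 \dots n-1$, the count of a pair $\{i, j\}$ lives at a fixed offset of a single flat array, so no per-pair key has to be stored (the row-major layout below is one common convention, not the only one):

```python
class TriangularPairCounter:
    """Counts of all item pairs packed into one flat array (one counter per pair)."""

    def __init__(self, n_items):
        self.n = n_items
        self.counts = [0] * (n_items * (n_items - 1) // 2)

    def _index(self, i, j):
        # Position of the pair (i, j), 0 <= i < j < n: all pairs whose smaller
        # item is < i come first, then (i, j) is the (j - i - 1)-th pair of row i.
        before_row_i = i * self.n - i * (i + 1) // 2
        return before_row_i + (j - i - 1)

    def add(self, i, j):
        if i > j:
            i, j = j, i
        self.counts[self._index(i, j)] += 1

    def get(self, i, j):
        if i > j:
            i, j = j, i
        return self.counts[self._index(i, j)]

counter = TriangularPairCounter(4)
counter.add(2, 1)
print(counter.get(1, 2))  # 1
```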
Factors Affecting the Complexity
- Minimum support threshold
- Dimensionality of the dataset
- Size of Database
- Average transaction width. It may increase the max length of frequent itemsets and the number of traversals of the hash tree.
Association Rule Mining
It is a refinement of frequent itemsets, in which an itemset is frequent conditionally on another itemset. E.g., $X \Rightarrow Y$ means that when the items in $X$ occur in a basket, the items in $Y$ also occur.
Definition of association rule. It is an implication of the form $X \Rightarrow Y$, where $X$ and $Y$ are disjoint itemsets. Its confidence is the fraction of transactions containing $X$ in which $Y$ also appears, i.e., $conf(X \Rightarrow Y) = \frac{\sigma(X \cup Y)}{\sigma(X)}$.
The association rule mining task finds all rules with support at least $minsup$ and confidence at least $minconf$. The algorithm is Apriori plus an additional step to generate the rules. The algorithm proceeds as follows:
- Frequent itemset generation. Generate all itemsets whose support is at least $minsup$.
- Rule generation. For each frequent itemset, generate rules with confidence at least $minconf$, where a rule is a binary partitioning of the frequent itemset into a left-hand side and a right-hand side.
So the pending issue for the new algorithm is how to generate such rules.
Rule Generation
The naive way: enumerate all the LHS and RHS. The process takes, for each frequent itemset $I$, all non-empty proper subsets $A \subset I$ and generates the rules of the form $A \Rightarrow I \setminus A$ that satisfy the minimum confidence.
Does confidence possess the desired Apriori property? Not in the strict sense.
Luckily, an Apriori-like property holds for rules generated from the same itemset.
If we move items from the LHS to the RHS, the confidence cannot increase, as the support in the denominator of the confidence increases (or stays the same). As such, if a rule $A \Rightarrow I \setminus A$ generated from an itemset $I$ has low confidence, then all rules generated from $I$ whose RHS is a superset of $I \setminus A$ also have low confidence and can be pruned.
So we generate a candidate rule by merging two rules that share the same prefix in the RHS. This process is essentially the Apriori algorithm applied to the RHS.
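A hedged sketch of rule generation, assuming `supports` maps every frequent itemset (as a frozenset) to its support, e.g. the output of the Apriori sketch above; the naive version below tries every non-empty proper subset as LHS, whereas the Apriori-style pruning would additionally skip any RHS that is a superset of an RHS that already failed the confidence test:

```python
from itertools import combinations

def generate_rules(supports, minconf):
    """Generate rules X -> Y with conf = sigma(X u Y) / sigma(X) >= minconf."""
    rules = []
    for itemset, sigma_xy in supports.items():
        if len(itemset) < 2:
            continue  # a rule needs a non-empty LHS and RHS
        for r in range(1, len(itemset)):
            for lhs in combinations(itemset, r):
                lhs = frozenset(lhs)
                # Every subset of a frequent itemset is frequent, so sigma(X) is known.
                conf = sigma_xy / supports[lhs]
                if conf >= minconf:
                    rules.append((lhs, itemset - lhs, conf))
    return rules
```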
FP-Tree and FP-Growth
A smarter and more intricate algorithm that exploits a tree structure called the FP-tree.
FP-tree construction. An FP-tree is a prefix tree that contains a compressed representation of the transaction database. Each transaction corresponds to a path in the tree, and transactions with a common prefix overlap on the same nodes.
The first step before constructing the tree is sorting the items inside each transaction in a fixed order, to ensure the uniqueness of the prefixes.
- The construction of the FP-tree starts from an empty root node. Each node has a label corresponding to an item and a counter corresponding to the number of transactions whose path passes through the node.
- For each transaction follow the prefix-path and increase the counters on the way.
- If at some point the next item has no matching child (the existing prefix ends), we expand the path by adding new nodes with counter 1.
- Add a pointer between nodes that refer to the same item.
We add a header table that points to the first occurrence (from the left of the tree) of each item; the item pointers then link all the nodes carrying the same item.
The final size of the tree depends on the redundancy in the transactions. In the best case, all transactions are the same and the FP-tree is a single path. In the worst case, all the transactions are different and the size of the tree (with its pointers and counters) is bigger than that of the original data.
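A minimal construction sketch, assuming transactions are given as sets of items and `minsup` is an absolute count; the node fields and the decreasing-support ordering are illustrative choices:

```python
class FPNode:
    def __init__(self, item, parent):
        self.item = item          # item label (None for the root)
        self.count = 0            # number of transactions whose path passes through this node
        self.parent = parent
        self.children = {}        # item -> FPNode
        self.next = None          # node-link to another node carrying the same item

def build_fp_tree(baskets, minsup):
    # First pass: item supports, used to drop infrequent items and fix the ordering.
    freq = {}
    for basket in baskets:
        for item in basket:
            freq[item] = freq.get(item, 0) + 1
    freq = {item: c for item, c in freq.items() if c >= minsup}

    root = FPNode(None, None)
    header = {}  # item -> first node of the node-link chain for that item
    for basket in baskets:
        # Keep frequent items only, sorted by decreasing support (ties broken by name).
        path = sorted((i for i in basket if i in freq), key=lambda i: (-freq[i], i))
        node = root
        for item in path:
            child = node.children.get(item)
            if child is None:
                # We fell off the existing prefix: extend the path with a new node.
                child = FPNode(item, node)
                node.children[item] = child
                # Hook the new node into the node-link chain for this item.
                child.next = header.get(item)
                header[item] = child
            child.count += 1
            node = child
    return root, header
```

Transactions that share their most frequent items then share the upper part of their paths, which is where the compression comes from.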
FP-Growth is an algorithm that takes the FP-tree as input and performs a divide-and-conquer strategy to enumerate all the frequent itemsets.
Pseudo-code
Itemset mining with the FP-tree
Input: The FP-tree, minimum support $minsup$
Output: All frequent itemsets and their support
- Proceed in a divide-and-conquer manner
- for each item $i$ do
- Consider the candidate itemsets ending with $i$
- Recursion on the prefix paths of $i$
The meta-algorithm works bottom-up: for each possible last item, it considers the itemsets ending with that item, extending them with items that precede it in the ordering.
The items are lexicographically ordered and the header table points to the first occurrence of each item.
FP-Growth
Input: The FP-tree, minimum support $minsup$
Output: All frequent itemsets and their support
- Proceed in divide-and-conquer manner
- for each suffix item $i$ do
- Construct the prefix tree for $i$, compute its support using the header table and the pointers
- If $i$ is frequent then
- Recompute support
- Prune infrequent items
- Prune leaves and recur
However, the counters of the items need to be updated, because the global FP-tree contains the counts for all transactions, including those not ending with the considered suffix. The counts propagate bottom-up from the leaves to the root.
Once the counters are updated, if any node's item has support less than $minsup$, the node is pruned and its subtree is attached to the node's parent. After pruning, the suffix is removed from the tree and the computation proceeds recursively bottom-up.
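A sketch of the first step of this recursion, reusing the hypothetical `FPNode`/`build_fp_tree` structures from the construction sketch above: collect the prefix paths of a given suffix item (its conditional pattern base) by following the header table and node-links, and recompute the item supports restricted to those paths.

```python
def conditional_pattern_base(header, item):
    """Prefix paths, with their counts, of every occurrence of `item` in the FP-tree."""
    base = []
    node = header.get(item)
    while node is not None:
        path = []
        parent = node.parent
        while parent is not None and parent.item is not None:  # stop at the root
            path.append(parent.item)
            parent = parent.parent
        if path:
            base.append((list(reversed(path)), node.count))
        node = node.next  # next occurrence of `item` via the node-link
    return base

def conditional_supports(base):
    """Support of each item within a conditional pattern base (used for pruning)."""
    supports = {}
    for path, count in base:
        for item in path:
            supports[item] = supports.get(item, 0) + count
    return supports
```

FP-Growth would then prune the items whose conditional support is below `minsup`, build a conditional FP-tree from `base`, and recurse with the suffix extended by `item`.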
Pros and Cons
👍
- Efficient
- Allows computing the support along with the frequent itemsets.
👎
- Efficiency depends on the compaction factor
- If the tree is bushy, the number of sub-problems to solve is large