Frequent itemsets and association rules.

This problem aims to find sets of items that are frequently bought together. These itemsets can be mined efficiently by exploiting the Apriori principle of the support measure.

Frequent Itemset Mining

Frequent itemsets: collections of objects that appear a minimum number of times in the baskets of the customers.

Set of $d$ items $\mathcal I$, set of $n$ transaction identifiers $\mathcal T$.

The dataset $\mathcal D$ is a set of pairs $(i,t)\in\mathcal I\times\mathcal T$.

A basket is the set of items of a transaction $t$, $t(\mathcal D)=\{i_1,\dots,i_{|t|}\}$.

An itemset is a subset of items $I\subseteq\mathcal I$, i.e., an element of $2^{\mathcal I}$, the powerset containing all subsets of $\mathcal I$.

Definition of support: For an itemset $I$ we define the support $\sigma(I)$ as the number of transactions that contain itemset $I$, i.e., $\sigma(I)=|\{t\in\mathcal T \mid I\subseteq t(\mathcal D)\}|$.

The frequent itemset mining problem finds all the itemsets $I\subseteq\mathcal I$ with support $\sigma(I)\ge minsup$. The support can be expressed as an absolute count or as a percentage, similarly to frequent subgraph mining in the last lecture. If we have $d$ items, there are potentially $2^d$ itemsets. Itemsets are typically represented in a lattice structure, in which the root is the empty set and each node is the union of the itemsets in its parent nodes. The question is how to find frequent itemsets efficiently.

A naive solution: count each itemset individually. This requires constructing $2^d$ itemsets and computing the frequency of each of them, reaching $\mathcal O(2^d n w)$ time and $\mathcal O(d)$ space, where $w$ is the size of the largest transaction. 👎 IMPRACTICAL!
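
As an illustration, here is a minimal Python sketch of the naive approach (the toy baskets and the `minsup` value are made up for the example): all $2^d$ candidate itemsets are enumerated and the whole database is scanned for each of them.

```python
from itertools import combinations

# Toy transaction database: each basket is a set of items (illustrative data).
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
items = sorted(set().union(*baskets))
minsup = 3  # absolute support threshold

# Naive mining: enumerate all 2^d candidate itemsets and scan the whole
# database once per candidate -> O(2^d * n * w) comparisons overall.
frequent = {}
for k in range(1, len(items) + 1):
    for candidate in combinations(items, k):
        support = sum(1 for basket in baskets if set(candidate) <= basket)
        if support >= minsup:
            frequent[candidate] = support

print(frequent)  # e.g. ('bread', 'diapers') -> 3
```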

The Apriori Algorithm

If $X,Y\in 2^{\mathcal I}$ are two itemsets s.t. $X\subseteq Y$, then $\sigma(X)\ge\sigma(Y)$. We say that the support is anti-monotone.

The algorithm exploits the Apriori property by means of an iterative process that alternates frequent itemset generation and candidate generation. Let $C_k$ be the candidate itemsets of size $k$ and $L_k$ the frequent itemsets of size $k$. The Apriori algorithm starts from the intuition that, instead of generating all the itemsets, the candidates of size $k$ should only contain frequent itemsets of size $k-1$ as subsets.

Pseudo-code

Input: Data $\mathcal D$, items $\mathcal I$, minimum support $minsup$.

  1. $k\gets 1$

  2. $C_1\gets\mathcal I$

  3. while $C_k\neq\emptyset$ do

    // frequent itemset generation

    • Scan the DB to find which itemsets in $C_k$ are frequent and put them into $L_k$

    // candidate generation

    • Use $L_k$ to generate a collection of candidate itemsets $C_{k+1}$ of size $k+1$.
    • $k\gets k+1$

Candidate generation. This step constructs the candidate itemsets of size $k+1$ by combining frequent itemsets of size $k$. For simplicity, we assume that the items within an itemset are ordered, and that the itemsets in $L_k$ are ordered as well. We can generate candidates of size $k+1$ by joining two itemsets of size $k$ that share the first $k-1$ items. In database terms, we are computing a self-join on the list of frequent itemsets of size $k$, using the first $k-1$ columns as join attributes.

Once a new candidate is generated, we check whether all its ordered subsets of size $k$ are also frequent, i.e., whether they are in the set $L_k$. This ensures the correct application of the Apriori principle. The generation of candidates $C_{k+1}$ proceeds as follows:

  1. Self-join $L_k$: create the set $C_{k+1}$ by joining frequent $k$-itemsets that share the first $k-1$ items
  2. Pruning step: remove a candidate from $C_{k+1}$ if any of its $k$-subsets is not frequent.
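
A minimal Python sketch of these two steps, assuming frequent itemsets are represented as sorted tuples (the function name `generate_candidates` and the toy $L_2$ are illustrative, not from the lecture):

```python
from itertools import combinations

def generate_candidates(frequent_k):
    """Self-join L_k on the first k-1 items, then prune candidates that
    have an infrequent k-subset. `frequent_k` holds sorted k-tuples."""
    freq = set(frequent_k)
    ordered = sorted(freq)
    candidates = set()
    for a in ordered:
        for b in ordered:
            # Join condition: agree on the first k-1 items, differ on the last.
            if a[:-1] == b[:-1] and a[-1] < b[-1]:
                cand = a + (b[-1],)
                # Pruning: every k-subset of the candidate must be frequent.
                if all(sub in freq for sub in combinations(cand, len(cand) - 1)):
                    candidates.add(cand)
    return candidates

# L_2 = {ab, ac, ad, bc}: the join yields abc, abd, acd; pruning removes
# abd and acd because bd and cd are not frequent, leaving only abc.
L2 = {("a", "b"), ("a", "c"), ("a", "d"), ("b", "c")}
print(generate_candidates(L2))  # {('a', 'b', 'c')}
```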

Frequent itemset generation. The important step here is counting the support of each candidate in $C_k$. Counting can be sped up by using a hash map; an even better solution is a hash-tree.
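
A sketch of the hash-map variant in Python (the hash-tree is not shown): instead of scanning the database once per candidate, each basket emits its $k$-subsets, which are looked up in a dictionary of counters. Note that enumerating all $k$-subsets of a wide basket can itself be expensive, which is what the hash-tree mitigates; the function name and toy data are illustrative.

```python
from itertools import combinations

def count_support(baskets, candidates, k):
    """Count candidate supports with a single pass over the baskets,
    using a hash map from candidate itemset to counter."""
    counts = {c: 0 for c in candidates}          # candidate -> count
    for basket in baskets:
        # Emit the k-subsets of the basket and look them up in the hash map.
        for subset in combinations(sorted(basket), k):
            if subset in counts:
                counts[subset] += 1
    return counts

baskets = [{"a", "b", "c"}, {"a", "c"}, {"a", "d"}, {"b", "c"}]
C2 = {("a", "b"), ("a", "c"), ("b", "c")}
print(count_support(baskets, C2, 2))
# {('a', 'b'): 1, ('a', 'c'): 2, ('b', 'c'): 2}
```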

Itemsets of size $k=2$. For typical shop-basket data, the number of candidate itemsets of size $k=2$ is by far the largest among all values of $k$. It is therefore important to have efficient, in-memory optimizations. There are two approaches:

  1. counting all pairs using a triangular matrix

    requires 4 bytes per pair

  2. keeping a table of triples $[i,j,c]$, where $c$ is the count of the pair of items $(i_i,i_j)$

    requires 12 bytes per pair, but only for the pairs with a positive count; this wins when fewer than about one third of all possible pairs actually occur.
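
To make the first approach concrete, here is a small Python sketch of the triangular-matrix layout, mapping each pair $(i,j)$ with $i<j$ of $d$ items to a position in a flat array of counters (the index formula is the standard row-major upper-triangular one; the toy baskets are illustrative):

```python
d = 5                               # number of distinct items, indexed 0..d-1
counts = [0] * (d * (d - 1) // 2)   # one counter per unordered pair

def pair_index(i, j):
    """Flat index of the pair (i, j), i < j, in a row-major
    upper-triangular matrix with the diagonal removed."""
    assert 0 <= i < j < d
    return i * (2 * d - i - 1) // 2 + (j - i - 1)

# Count every pair occurring in each basket (items already mapped to 0..d-1).
baskets = [[0, 1, 3], [1, 3, 4], [0, 1, 3, 4]]
for basket in baskets:
    basket = sorted(basket)
    for a in range(len(basket)):
        for b in range(a + 1, len(basket)):
            counts[pair_index(basket[a], basket[b])] += 1

print(counts[pair_index(1, 3)])  # pair (1, 3) occurs in all 3 baskets -> 3
```

In a memory-tight implementation `counts` would be a packed array of 4-byte integers; the Python list is only for illustration.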

Factors Affecting the Complexity

  • Minimum support threshold
  • Dimensionality of the dataset
  • Size of Database
  • Average transaction width. It may increase the maximum length of frequent itemsets and the number of traversals of the hash tree.

Association Rule Mining

It is a refinement of frequent itemsets, in which an itemset is frequent conditionally on another itemset. E.g., $\{a,b\}\Rightarrow\{c,d,e\}$ means that when items $a,b$ occur in a transaction, items $c,d,e$ also occur.

Definition of association rule. It is an implication of the form $X\Rightarrow Y$. Its confidence is the fraction of transactions containing $X$ that also contain $Y$, i.e., $\gamma(X\Rightarrow Y)=\frac{\sigma(X\cup Y)}{\sigma(X)}$.
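
A tiny Python illustration of the definition on made-up baskets, computing $\gamma(X\Rightarrow Y)$ as the ratio of the two supports:

```python
def support(itemset, baskets):
    """Number of baskets that contain the itemset."""
    return sum(1 for b in baskets if itemset <= b)

def confidence(lhs, rhs, baskets):
    # gamma(X => Y) = sigma(X u Y) / sigma(X)
    return support(lhs | rhs, baskets) / support(lhs, baskets)

baskets = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]
print(confidence({"a", "b"}, {"c"}, baskets))  # sigma(abc)/sigma(ab) = 1/2 = 0.5
```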

The association rule mining task finds all rules with support at least $minsup$ and confidence at least $minconf$. The algorithm is Apriori plus an additional step to generate the rules. It proceeds as follows:

  1. Frequent itemset generation. Generate all itemsets whose support is at least $minsup$
  2. Rule generation. For each frequent itemset, generate the rules with confidence at least $minconf$, where a rule is a partitioning of the frequent itemset into a left-hand side and a right-hand side.

So the pending issue for the new algorithm is how to generate such rules.

Rule Generation

The naive way: enumerate all possible LHS/RHS splits. For each frequent itemset $S$, take every subset $L\subset S$ and generate the rule $S\setminus L\Rightarrow L$ if it satisfies the minimum confidence.

Does confidence possess the desired Apriori property? Not in the strict sense.

Luckily, the Apriori property holds for rules generated from the same itemset.

If we move items from the LHS to the RHS, the support in the denominator of the confidence can only increase, so the confidence can only decrease. As such, if a rule $X\Rightarrow Y$ generated from the same itemset has confidence below $minconf$, all rules whose RHS is a superset of $Y$ also have low confidence and can be pruned.

So we generate a candidate rule by merging two rules that share the same prefix in the RHS. This process is essentially the Apriori algorithm applied to the RHS.
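
A sketch of this RHS-growing procedure in Python, for a single frequent itemset. It assumes a dictionary `supports` mapping every frequent itemset (as a frozenset) to its support count, as would be produced by the Apriori phase; the function name and the toy counts are illustrative.

```python
from itertools import combinations

def rules_from_itemset(itemset, supports, minconf):
    """Generate high-confidence rules from one frequent itemset by growing
    the RHS level-wise (Apriori on the RHS)."""
    itemset = frozenset(itemset)
    rules = []
    # Level 1: single-item right-hand sides.
    rhs_level = [frozenset([i]) for i in itemset]
    while rhs_level:
        confident = []
        for rhs in rhs_level:
            lhs = itemset - rhs
            if not lhs:
                continue
            conf = supports[itemset] / supports[lhs]
            if conf >= minconf:
                rules.append((lhs, rhs, conf))
                confident.append(rhs)            # only confident RHSs are extended
        # Merge surviving RHSs that differ in exactly one item (self-join).
        rhs_level = {a | b for a, b in combinations(confident, 2)
                     if len(a | b) == len(a) + 1}
    return rules

# Toy supports (hypothetical counts, consistent with anti-monotonicity).
supports = {frozenset(s): c for s, c in [
    ("a", 6), ("b", 5), ("c", 4),
    ("ab", 4), ("ac", 3), ("bc", 3), ("abc", 3)]}
for lhs, rhs, conf in rules_from_itemset("abc", supports, minconf=0.7):
    print(set(lhs), "=>", set(rhs), round(conf, 2))
```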

FP-Tree and FP-Growth

A smarter and more intricate algorithm that exploits a tree structure called the FP-tree.

FP-tree construction. An FP-tree is a prefix tree that contains a compressed representation of the transaction database. Each transaction corresponds to a path in the tree, and transactions that share a prefix overlap on the same nodes.

The first step before constructing the tree is sorting the items within each transaction, so that common prefixes are mapped to the same path.

  1. The construction of the FP-tree starts from an empty root node. Each node has a label corresponding to an item and a counter, corresponding to the number of transactions whose path contains that node.
  2. For each transaction, follow its prefix path and increase the counters on the way.
  3. If at some point there is no matching child, we expand the path by adding new nodes with counter 1.
  4. Add a pointer (node-link) between nodes that refer to the same item.

We also add a header table that points, for each item, to its first occurrence (from the left of the tree); the node-links then connect all occurrences of the same item.

The final size of the tree depends on the redundancy in the transactions. In the best case, all transactions are identical and the FP-tree is a single path. In the worst case, all the transactions are different and the tree is larger than the data itself.
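
A minimal Python sketch of the construction (class and function names are illustrative; items are ordered by decreasing support, a common but not mandatory choice):

```python
from collections import defaultdict

class Node:
    def __init__(self, item, parent):
        self.item, self.parent, self.count = item, parent, 0
        self.children = {}                       # item -> child Node

def build_fp_tree(baskets, minsup):
    """Two-pass FP-tree construction sketch."""
    support = defaultdict(int)
    for basket in baskets:                       # first pass: item supports
        for item in basket:
            support[item] += 1
    ranked = sorted(support.items(), key=lambda kv: -kv[1])
    order = {item: rank for rank, (item, sup) in enumerate(ranked) if sup >= minsup}

    root = Node(None, None)
    header = defaultdict(list)                   # header table: item -> node-links
    for basket in baskets:                       # second pass: insert prefix paths
        node = root
        for item in sorted((i for i in basket if i in order), key=order.get):
            if item not in node.children:        # no matching child: extend the path
                child = Node(item, node)
                node.children[item] = child
                header[item].append(child)       # chain nodes of the same item
            node = node.children[item]
            node.count += 1                      # increase counters along the path
    return root, header

baskets = [{"a", "b"}, {"b", "c", "d"}, {"a", "b", "d"}, {"a", "b", "c"}]
root, header = build_fp_tree(baskets, minsup=2)
# Summing the counters over each item's node-links recovers the item supports.
print({item: sum(n.count for n in nodes) for item, nodes in header.items()})
```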

FP-Growth is an algorithm that takes the FP-tree as input and performs a divide-and-conquer strategy to enumerate all the frequent itemsets.

Pseudo-code

Itemsets mining with FP-Tree

Input: The FP-tree, minimum support $minsup$

Output: All frequent itemsets and their support

  1. Proceed in divide-and-conquer manner
  2. for each item $i\in\mathcal I$ do
    • Consider candidate itemsets $C$ ending with $i$
    • Recurse on $C$

The meta-algorithm works bottom-up: for each possible last item, it considers itemsets whose other items precede it in the ordering.

The items are lexicographically ordered and the header table contains the first occurrence of each item.

FP-Growth

Input: The FP-tree, minimum support $minsup$

Output: All frequent itemsets and their support

  1. Proceed in divide-and-conquer manner
  2. for each suffix $X$ do
    • Construct a prefix-tree for $X$, compute support using the header table and pointers
    • If $X$ is frequent then
      • Recompute support
      • Prune infrequent items
      • Prune leaves and recur

However, the item counters need to be updated, because the global FP-tree contains the counts over all transactions, including those that do not end with the considered suffix. The counters are propagated bottom-up, from the leaves to the root.

Once the counters are updated, if any label has support less than $minsup$, the node is pruned and its subtree is attached to the node's parent. After pruning, the suffix is removed from the tree and the computation proceeds recursively bottom-up.
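
Below is a simplified Python sketch of this suffix-based divide-and-conquer. To stay short, it recurses directly on conditional pattern bases (lists of prefix paths with counts) instead of materializing each conditional FP-tree with node-links, but the pruning of infrequent items and the bottom-up growth of the suffix follow the scheme described above; the function name and the toy baskets are illustrative.

```python
from collections import defaultdict

def fp_growth(pattern_base, minsup, suffix=()):
    """FP-Growth-style recursion on a conditional pattern base:
    `pattern_base` is a list of (item tuple, count) pairs and `suffix`
    is the itemset already fixed. Yields (frequent itemset, support)."""
    support = defaultdict(int)
    for items, count in pattern_base:
        for item in items:
            support[item] += count
    # Each frequent item becomes the new last element of the suffix.
    for item in sorted(i for i, s in support.items() if s >= minsup):
        new_suffix = (item,) + suffix
        yield new_suffix, support[item]
        # Conditional pattern base for the new suffix: the prefix paths that
        # contain `item`, restricted to frequent items preceding it in the
        # ordering, with the counts carried over.
        conditional = []
        for items, count in pattern_base:
            if item in items:
                prefix = tuple(i for i in items if i < item and support[i] >= minsup)
                if prefix:
                    conditional.append((prefix, count))
        yield from fp_growth(conditional, minsup, new_suffix)

baskets = [("a", "b"), ("b", "c", "d"), ("a", "b", "d"), ("a", "b", "c")]
print(dict(fp_growth([(b, 1) for b in baskets], minsup=2)))
# e.g. {('a',): 3, ('b',): 4, ('a', 'b'): 3, ('c',): 2, ('b', 'c'): 2,
#       ('d',): 2, ('b', 'd'): 2}
```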

Pros and Cons

👍

  • Efficient
  • Allows computing the support along with the frequent itemsets.

👎

  • Efficiency depends on the compaction factor
  • If the tree is bushy, the number of sub-problems to solve is large