Association Rule Mining

Association rule mining is a popular method for discovering meaningful co-occurrence patterns in large transaction databases. Let I = {i1, i2,..., in} be a set of n binary attributes called items. Let D = {t1, t2,..., tm} be a set of transactions called the database. Each transaction in D has a unique transaction ID and contains a subset of the items in I. An association rule is defined as an implication of the form X ⇒ Y where X, Y ⊆ I and X ∩ Y = Ø. The item sets X and Y are called the antecedent (left-hand side or LHS) and consequent (right-hand side or RHS) of the rule, respectively. The support supp(X) of an item set X is defined as the proportion of transactions in the database that contain the item set. Note that the support of an association rule X ⇒ Y is supp(X ∪ Y). The confidence of a rule is defined as conf(X ⇒ Y) = supp(X ∪ Y) / supp(X). Confidence can be interpreted as an estimate of the conditional probability P(Y | X). In addition to support and confidence, SMILE also reports lift and leverage to measure statistical dependence.

For example, the rule {onions, potatoes} ⇒ {burger} found in supermarket data suggests that customers who buy onions and potatoes together are also likely to buy burgers. Such patterns are useful for product placement and promotion strategies.
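To make these definitions concrete, the sketch below computes support, confidence, lift, and leverage for the rule {onions, potatoes} ⇒ {burger} by hand. The five-transaction database is invented for illustration, and none of this uses the SMILE API:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class Measures {
    public static void main(String[] args) {
        // A toy database of five transactions (invented for illustration).
        List<Set<String>> db = List.of(
                Set.of("onions", "potatoes", "burger"),
                Set.of("onions", "potatoes"),
                Set.of("potatoes", "burger"),
                Set.of("onions", "potatoes", "burger"),
                Set.of("milk"));

        Set<String> x = Set.of("onions", "potatoes"); // antecedent X
        Set<String> y = Set.of("burger");             // consequent Y

        double suppX  = support(db, x);               // supp(X) = 3/5
        double suppY  = support(db, y);               // supp(Y) = 3/5
        double suppXY = support(db, union(x, y));     // supp(X ∪ Y) = 2/5

        double confidence = suppXY / suppX;           // estimate of P(Y | X)
        double lift       = suppXY / (suppX * suppY); // > 1 means positive association
        double leverage   = suppXY - suppX * suppY;   // difference from independence

        System.out.printf("supp=%.2f conf=%.2f lift=%.2f leverage=%.2f%n",
                suppXY, confidence, lift, leverage);
        // prints: supp=0.40 conf=0.67 lift=1.11 leverage=0.04
    }

    // Proportion of transactions that contain every item of the item set.
    static double support(List<Set<String>> db, Set<String> itemset) {
        return (double) db.stream().filter(t -> t.containsAll(itemset)).count() / db.size();
    }

    static Set<String> union(Set<String> a, Set<String> b) {
        Set<String> u = new HashSet<>(a);
        u.addAll(b);
        return u;
    }
}
```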

Association rules are usually required to satisfy a user-specified minimum support and a user-specified minimum confidence at the same time. Association rule generation is usually split up into two separate steps:

  • First, minimum support is applied to find all frequent item sets in a database (i.e. frequent item set mining).
  • Second, these frequent item sets and the minimum confidence constraint are used to form rules.

Frequent Itemset Mining

Finding all frequent item sets in a database is difficult since it involves searching all possible item sets (item combinations). The set of possible item sets is the power set over I (the set of items) and has size 2^n - 1 (excluding the empty set, which is not a valid item set). Although the size of the power set grows exponentially in the number of items n, efficient search is possible using the downward-closure property of support (also called anti-monotonicity), which guarantees that all subsets of a frequent item set are also frequent, and thus all supersets of an infrequent item set must be infrequent.
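The effect of downward closure can be seen in a naive level-wise (Apriori-style) miner — a didactic sketch, not SMILE's implementation — where a (k+1)-candidate is counted against the database only if every one of its k-subsets survived the previous level:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

public class Apriori {
    /** Level-wise frequent item set mining with downward-closure pruning. */
    static Map<Set<Integer>, Long> frequent(List<Set<Integer>> db, long minSupport) {
        Map<Set<Integer>, Long> result = new HashMap<>();
        // Level 1: all singleton item sets present in the database.
        Set<Set<Integer>> level = db.stream().flatMap(Set::stream)
                .map(i -> Set.of(i)).collect(Collectors.toSet());
        while (!level.isEmpty()) {
            // Count candidates against the database; keep the frequent ones.
            Set<Set<Integer>> survivors = new HashSet<>();
            for (Set<Integer> c : level) {
                long count = db.stream().filter(t -> t.containsAll(c)).count();
                if (count >= minSupport) {
                    result.put(c, count);
                    survivors.add(c);
                }
            }
            // Candidate generation: join survivors into (k+1)-sets, pruning any
            // candidate that has an infrequent k-subset (downward closure).
            Set<Set<Integer>> next = new HashSet<>();
            for (Set<Integer> a : survivors)
                for (Set<Integer> b : survivors) {
                    Set<Integer> u = new HashSet<>(a);
                    u.addAll(b);
                    if (u.size() == a.size() + 1 && allSubsetsFrequent(u, survivors))
                        next.add(u);
                }
            level = next;
        }
        return result;
    }

    // Every k-subset of a (k+1)-candidate must itself be frequent.
    static boolean allSubsetsFrequent(Set<Integer> candidate, Set<Set<Integer>> frequentK) {
        for (Integer item : candidate) {
            Set<Integer> sub = new HashSet<>(candidate);
            sub.remove(item);
            if (!frequentK.contains(sub)) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        List<Set<Integer>> db = List.of(
                Set.of(1, 3), Set.of(2), Set.of(4), Set.of(2, 3, 4), Set.of(2, 3),
                Set.of(2, 3), Set.of(1, 2, 3, 4), Set.of(1, 3), Set.of(1, 2, 3), Set.of(1, 2, 3));
        frequent(db, 3).forEach((set, count) -> System.out.println(set + " (" + count + ")"));
    }
}
```

With minimum support 3, the toy database in main yields eight frequent item sets, e.g. {1, 3} with count 5 and {2, 3} with count 6.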

In practice, we may consider only the maximal frequent item sets, bypassing all of their subsets. An item set is maximal frequent if none of its immediate supersets is frequent.

For a maximal frequent item set, even though we know that all of its subsets are frequent, we don't know their actual supports, which are essential for finding the association rules within the item sets. If the final goal is association rule mining, we would rather discover closed frequent item sets. An item set is closed if none of its immediate supersets has the same support as the item set.
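Both properties can be checked mechanically. The sketch below classifies the frequent item sets of a toy database (the support counts are illustrative) as maximal and/or closed by inspecting their immediate supersets:

```java
import java.util.Map;
import java.util.Set;

public class ClosedMaximal {
    // Frequent item sets with their support counts (illustrative toy numbers).
    static final Map<Set<Integer>, Integer> FREQ = Map.of(
            Set.of(1), 5, Set.of(2), 7, Set.of(3), 8, Set.of(4), 3,
            Set.of(1, 2), 3, Set.of(1, 3), 5, Set.of(2, 3), 6,
            Set.of(1, 2, 3), 3);

    // Maximal: no immediate superset is frequent at all.
    static boolean isMaximal(Set<Integer> s) {
        return FREQ.keySet().stream()
                .noneMatch(t -> t.size() == s.size() + 1 && t.containsAll(s));
    }

    // Closed: no immediate superset has the same support.
    static boolean isClosed(Set<Integer> s) {
        return FREQ.keySet().stream()
                .noneMatch(t -> t.size() == s.size() + 1 && t.containsAll(s)
                        && FREQ.get(t).equals(FREQ.get(s)));
    }

    public static void main(String[] args) {
        for (Set<Integer> s : FREQ.keySet())
            System.out.println(s + " support=" + FREQ.get(s)
                    + " maximal=" + isMaximal(s) + " closed=" + isClosed(s));
    }
}
```

Here {1} is not closed because its superset {1, 3} has the same support 5, while {1, 2, 3} and {4} are both maximal (no frequent immediate superset) and therefore also closed.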

Some well-known algorithms for frequent item set mining are Apriori, Eclat, and FP-Growth. Apriori is the best-known algorithm for mining association rules. It uses a breadth-first search strategy to count the support of item sets and a candidate generation function that exploits the downward-closure property of support. Eclat is a depth-first search algorithm using set intersection.

FP-growth (frequent pattern growth) algorithm employs an extended prefix-tree (FP-tree) structure to store the database in a compressed form. The FP-growth algorithm is currently one of the fastest approaches to discover frequent item sets. FP-growth adopts a divide-and-conquer approach to decompose both the mining tasks and the databases. It uses a pattern fragment growth method to avoid the costly process of candidate generation and testing used by Apriori.

The basic idea of the FP-growth algorithm can be described as a recursive elimination scheme: in a preprocessing step, delete from the transactions all items that are not individually frequent, i.e., that do not appear in a user-specified minimum number of transactions. Then select all transactions that contain the least frequent remaining item (least frequent among those that are frequent) and delete this item from them. Recurse to process the obtained reduced (also known as projected) database, remembering that the item sets found in the recursion share the deleted item as a prefix. On return, remove the processed item from the database of all transactions and start over, i.e., process the next least frequent item, and so on. In these processing steps, the prefix tree, which is enhanced by links between the branches, is exploited to quickly find the transactions containing a given item and also to remove this item from the transactions after it has been processed.
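Ignoring the prefix tree, the recursive elimination scheme itself can be sketched on plain transaction lists. This is a didactic sketch only; SMILE's FPTree and FPGrowth realize the same idea far more efficiently:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

public class RecursiveElimination {
    /** Mines frequent item sets by recursive elimination (lists instead of an FP-tree). */
    static void mine(List<Set<Integer>> db, long minSupport,
                     List<Integer> prefix, Map<List<Integer>, Long> out) {
        // Count item frequencies; keep only frequent items, least frequent first.
        Map<Integer, Long> count = db.stream().flatMap(Set::stream)
                .collect(Collectors.groupingBy(i -> i, Collectors.counting()));
        List<Integer> items = count.entrySet().stream()
                .filter(e -> e.getValue() >= minSupport)
                .sorted(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey).toList();

        for (int i = 0; i < items.size(); i++) {
            Integer item = items.get(i);
            // prefix + item is frequent; its support is item's count in this database.
            List<Integer> found = new ArrayList<>(prefix);
            found.add(item);
            out.put(found, count.get(item));

            // Projected database: transactions containing item, restricted to
            // the items not yet processed (those later in the frequency order).
            Set<Integer> remaining = new HashSet<>(items.subList(i + 1, items.size()));
            List<Set<Integer>> projected = db.stream()
                    .filter(t -> t.contains(item))
                    .map(t -> { Set<Integer> s = new HashSet<>(t); s.retainAll(remaining); return s; })
                    .filter(s -> !s.isEmpty())
                    .toList();
            mine(projected, minSupport, found, out);
        }
    }

    public static void main(String[] args) {
        List<Set<Integer>> db = List.of(
                Set.of(1, 3), Set.of(2), Set.of(4), Set.of(2, 3, 4), Set.of(2, 3),
                Set.of(2, 3), Set.of(1, 2, 3, 4), Set.of(1, 3), Set.of(1, 2, 3), Set.of(1, 2, 3));
        Map<List<Integer>, Long> out = new HashMap<>();
        mine(db, 3, new ArrayList<>(), out);
        out.forEach((set, supp) -> System.out.println(set + " (" + supp + ")"));
    }
}
```

On the toy database in main with minimum support 3, this discovers the same eight frequent item sets (with the same counts) as the FP-growth examples later in this section.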

When the input item sets are already in memory, use the methods below. The parameter itemsets is the item set database, where each row is an item set; rows may have different lengths. The item identifiers must be in [0, n), where n is the number of items. Duplicate items within a row are tolerated and treated as a single occurrence. Note that each row may be reordered during the call. The parameter minSupport is the required minimum support of item sets in terms of frequency. The output is a lazy Stream<ItemSet>.


    def fpgrowth(minSupport: Int, itemsets: Array[Array[Int]]): Stream[ItemSet]
    

    public class FPTree {
        public static FPTree of(int minSupport, int[][] itemsets);
        public static FPTree of(double minSupport, int[][] itemsets);
    }

    public class FPGrowth {
        public static Stream<ItemSet> apply(FPTree tree);
    }

    // ItemSet is a record
    // int[] items();
    // int support();
          

    fun fpgrowth(minSupport: Int, itemsets: Array<IntArray>): Stream<ItemSet>
    

In practice, the raw input is often too large to fit in memory. To handle this case, SMILE accepts Supplier<Stream<int[]>>. The supplier is invoked twice internally: once for item frequency counting, and once to build the FP-Tree.


    def fptree(minSupport: Int, supplier: Supplier[Stream[Array[Int]]]): FPTree

    def fpgrowth(tree: FPTree): Stream[ItemSet]
    

    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.Arrays;
    import java.util.function.Supplier;
    import java.util.stream.Stream;

    Supplier<Stream<int[]>> supplier = () -> {
        try {
            return Files.lines(Path.of("transactions.dat"))
                    .map(line -> Arrays.stream(line.split("\\s+"))
                            .mapToInt(Integer::parseInt)
                            .toArray());
        } catch (Exception ex) {
            throw new RuntimeException(ex);
        }
    };

    FPTree tree = FPTree.of(1500, supplier);   // absolute support
    FPTree tree2 = FPTree.of(0.003, supplier); // relative support (0.3%)
          

    fun fptree(minSupport: Int, supplier: Supplier<Stream<IntArray>>): FPTree

    fun fpgrowth(tree: FPTree): Stream<ItemSet>
          

Below is a complete example on an in-memory toy dataset.


    smile> val itemsets = Array(
        Array(1, 3),
        Array(2),
        Array(4),
        Array(2, 3, 4),
        Array(2, 3),
        Array(2, 3),
        Array(1, 2, 3, 4),
        Array(1, 3),
        Array(1, 2, 3),
        Array(1, 2, 3)
    )

    smile> fpgrowth(3, itemsets).forEach { itemset => println(itemset) }
    4 (3)
    1 (5)
    2 1 (3)
    3 2 1 (3)
    3 1 (5)
    2 (7)
    3 2 (6)
    3 (8)
    

    smile> import smile.association.*;
    smile> import java.util.Arrays;

    smile> int[][] itemsets = {
                {1, 3},
                {2},
                {4},
                {2, 3, 4},
                {2, 3},
                {2, 3},
                {1, 2, 3, 4},
                {1, 3},
                {1, 2, 3},
                {1, 2, 3}
            }
    itemsets ==> int[10][] { int[2] { 1, 3 }, int[1] { 2 }, int[1] ...  3 }, int[3] { 1, 2, 3 } }

    smile> var tree = FPTree.of(0.3, itemsets)
    tree ==> smile.association.FPTree@2a7b6f69

    smile> System.out.println(tree.size())
    10

    smile> System.out.println(tree.minSupport())
    3

    smile> FPGrowth.apply(tree).forEach(set ->
               System.out.printf("%s support=%d%n",
                   Arrays.toString(set.items()), set.support()))
    [4] support=3
    [1] support=5
    ...
    [3, 2] support=6
    [3] support=8
          

    >>> import smile.association.*
    >>> val itemsets = arrayOf(
        intArrayOf(1, 3),
        intArrayOf(2),
        intArrayOf(4),
        intArrayOf(2, 3, 4),
        intArrayOf(2, 3),
        intArrayOf(2, 3),
        intArrayOf(1, 2, 3, 4),
        intArrayOf(1, 3),
        intArrayOf(1, 2, 3),
        intArrayOf(1, 2, 3)
    )

    >>> fpgrowth(3, itemsets).forEach { println(it) }
    4 (3)
    1 (5)
    2 1 (3)
    3 2 1 (3)
    3 1 (5)
    2 (7)
    3 2 (6)
    3 (8)
    

Each row above is a frequent item set with its raw support count. For larger datasets, stream transactions from a file and consume the output lazily.


    smile> val tree = fptree(1000, () => {
      smile.util.Paths.getTestDataLines("transaction/kosarak.dat").map { line =>
          line.split(" ").map(_.toInt).toArray
      }
    })

    smile> fpgrowth(tree).limit(10).forEach { itemset => println(itemset) }
    5634 (1000)
    3805 (1001)
    3376 (1001)
    2279 (1001)
    6333 (1002)
    243 (1002)
    808 (1003)
    3875 (1004)
    2265 (1004)
    996 (1004)
    

    smile> var data = (java.util.function.Supplier<java.util.stream.Stream<int[]>>) () -> {
                try {
                    return java.nio.file.Files.lines(java.nio.file.Path.of("transactions.dat"))
                            .map(line -> java.util.Arrays.stream(line.split("\\s+"))
                                .mapToInt(Integer::parseInt)
                                .toArray());
                } catch (java.io.IOException ex) {
                    throw new java.io.UncheckedIOException(ex);
                }
            }

    smile> var tree = FPTree.of(0.003, data)
    tree ==> smile.association.FPTree@3dddbe65

    smile> long nSets = FPGrowth.apply(tree).count()
    nSets ==> 711424

    smile> long nRules = ARM.apply(0.5, tree).count()
    nRules ==> 1302458
          

    >>> import smile.util.*;
        import java.util.function.*;
        import java.util.stream.*;
        class Parser : Supplier<Stream<IntArray>> {
            override fun get(): Stream<IntArray> {  
               return smile.util.Paths.getTestDataLines("transaction/kosarak.dat").map { line ->
                   line.split(" ").map({ w -> w.toInt() }).toIntArray()
               }
            }
        }
        val tree = fptree(1000, Parser())

    >>> fpgrowth(tree).limit(10).forEach { println(it) }
    5634 (1000)
    3805 (1001)
    3376 (1001)
    2279 (1001)
    6333 (1002)
    243 (1002)
    808 (1003)
    3875 (1004)
    2265 (1004)
    996 (1004)
    

For large inputs, tune minSupport first and filter rules early (e.g. by lift/leverage) to reduce memory pressure and improve throughput.
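Because the rule stream is lazy, such a filter runs per rule as each one is generated, so rejected rules are discarded immediately instead of accumulating. The Rule record below is only a stand-in for SMILE's AssociationRule, and the threshold values are arbitrary:

```java
import java.util.List;
import java.util.stream.Stream;

public class RuleFilter {
    // Stand-in for SMILE's AssociationRule record, for illustration only.
    record Rule(int[] antecedent, int[] consequent,
                double confidence, double lift, double leverage) {}

    // Keep only positively associated, non-trivial rules; with a lazy stream,
    // this predicate is applied before anything downstream is materialized.
    static List<Rule> interesting(Stream<Rule> rules) {
        return rules.filter(r -> r.lift() > 1.1 && r.leverage() > 0.01).toList();
    }

    public static void main(String[] args) {
        Stream<Rule> rules = Stream.of(
                new Rule(new int[]{11}, new int[]{6}, 0.89, 1.47, 0.1039),
                new Rule(new int[]{2},  new int[]{1}, 0.43, 0.86, -0.05));
        System.out.println(interesting(rules).size());  // prints 1
    }
}
```

The same filter chain works unchanged on the Stream<AssociationRule> returned by ARM.apply, since the record exposes lift() and leverage() accessors.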

Association Rules

After mining frequent itemsets, generate association rules with a confidence threshold. SMILE computes support, confidence, lift, and leverage for each rule.


    def arm(minSupport: Int, confidence: Double, itemsets: Array[Array[Int]]): Stream[AssociationRule]

    def arm(confidence: Double, tree: FPTree): Stream[AssociationRule]
    

    public class ARM {
        public static Stream<AssociationRule> apply(double confidence, FPTree tree);
    }

    // AssociationRule is a record
    // int[] antecedent();
    // int[] consequent();
    // double support(), confidence(), lift(), leverage();
          

    fun arm(minSupport: Int, confidence: Double, itemsets: Array<IntArray>): Stream<AssociationRule>

    fun arm(confidence: Double, tree: FPTree): Stream<AssociationRule>
    
    

The API is similar to fpgrowth with an additional confidence parameter in [0, 1].


    smile> arm(0.6, tree).limit(10).forEach { rule => println(rule) }
    (11) => (6) support = 32.73% confidence = 89.00% lift = 1.47 leverage = 0.1039
    (3, 11) => (6) support = 14.51% confidence = 89.09% lift = 1.47 leverage = 0.0462
    (1) => (6) support = 13.34% confidence = 66.89% lift = 1.10 leverage = 0.0123
    (3, 1) => (6) support = 5.84% confidence = 68.28% lift = 1.12 leverage = 0.0064
    (6, 1) => (11) support = 8.70% confidence = 65.17% lift = 1.77 leverage = 0.0379
    (11, 1) => (6) support = 8.70% confidence = 93.70% lift = 1.54 leverage = 0.0306
    (6, 3, 1) => (11) support = 3.81% confidence = 65.30% lift = 1.78 leverage = 0.0167
    (3, 11, 1) => (6) support = 3.81% confidence = 93.73% lift = 1.54 leverage = 0.0134
    (218) => (6) support = 7.85% confidence = 87.67% lift = 1.44 leverage = 0.0241
    (3, 218) => (6) support = 3.43% confidence = 87.86% lift = 1.45 leverage = 0.0106
    

    smile> ARM.apply(0.5, tree).limit(10).forEach(System.out::println)
    AssociationRule([3] => [1], support=50.0%, confidence=62.5%, lift=1.25, leverage=0.100)
    AssociationRule([1] => [3], support=50.0%, confidence=100.0%, lift=1.25, leverage=0.100)
    AssociationRule([2] => [1], support=30.0%, confidence=42.9%, lift=0.86, leverage=-0.050)
    ...
          

    >>> arm(0.6, tree).limit(10).forEach { println(it) }
    (11) => (6) support = 32.73% confidence = 89.00% lift = 1.47 leverage = 0.1039
    (3, 11) => (6) support = 14.51% confidence = 89.09% lift = 1.47 leverage = 0.0462
    (1) => (6) support = 13.34% confidence = 66.89% lift = 1.10 leverage = 0.0123
    (3, 1) => (6) support = 5.84% confidence = 68.28% lift = 1.12 leverage = 0.0064
    (6, 1) => (11) support = 8.70% confidence = 65.17% lift = 1.77 leverage = 0.0379
    (11, 1) => (6) support = 8.70% confidence = 93.70% lift = 1.54 leverage = 0.0306
    (6, 3, 1) => (11) support = 3.81% confidence = 65.30% lift = 1.78 leverage = 0.0167
    (3, 11, 1) => (6) support = 3.81% confidence = 93.73% lift = 1.54 leverage = 0.0134
    (218) => (6) support = 7.85% confidence = 87.67% lift = 1.44 leverage = 0.0241
    (3, 218) => (6) support = 3.43% confidence = 87.86% lift = 1.45 leverage = 0.0106
    