Apriori algorithm for association rule learning

Apriori is one of the most popular algorithms for generating association rules. Employing the anti-monotonicity property, it is able to process large volumes of data within a reasonable amount of time. Here we analyze the operation of the algorithm and peculiarities of its implementation.

Modern databases are very large, reaching giga-and terabytes in volume, and keep growing. Therefore, generation of association rules requires efficient scalable algorithms that can solve the problem within an acceptable time frame. One of such algorithms will be discussed in this article. We will describe Apriori algorithm. The terminology and notation that we will use are given in the article "Introduction to the Analysis of Association Rules".

To be able to apply the algorithm, you need to pre-process the data as follows:

  • convert all data to binary format;
  • change the data structure.
Transaction numberItem nameQuantity
1001А2
1001D3
1001E1
1002А2
1002F1
1003B2
1003А2
1003C2
.........

Table 1. Standard Transaction Database View

TIDABCDEFGHIK...
10011001100000...
10021000010000...
10031110000010...

Table 2. Normalized Form

The number of columns in the table is equal to the number of items present in the set of transactions D. Each record corresponds to a transaction, where 1 in the corresponding column means that the item is present in the transaction, and 0 means otherwise. Note that the initial view of the table may differ from the one shown in Table 1. The main thing is that the data should be converted to a normalized form, otherwise the algorithm will not be applicable.

All items in the table are sorted in alphabetical order (if they are numbers, they should be arranged in numerical order). This is not a coincidence.

In the first step, you need to find frequent item sets, and then, in the second step — determine rules using them. The number of items in a set is called the size of the set, and a set consisting of  k items is called a k-length item set.

Anti-Monotonicity Property

Identification of frequent item sets is an operation that requires a lot of computation power and, consequently, a lot of time. A primitive approach to solving of this problem is a simple search of all possible item sets. This will require O(^{|2I|}) operations, where |I| is the number of items. Apriori uses one of the support properties stating that support for any item set cannot exceed the minimum support for any of its subsets.

For example, the support for a 3-length item set {Bread, Butter, Milk} will always be less than or equal to the support for a 2-length item sets {Bread, Butter}, {Bread, Milk}, {Butter, Milk}. The point is that any transaction containing {Bread, Butter, Milk} must also contain {Bread, Butter}, {Bread, Milk}, {Butter, Milk} and this rule cannot be reversed.

This property is called anti-monotonicity and serves to reduce the dimensionality of the search space. Hadn't we discovered this property, finding multi-item sets would be almost an impossible task due to the exponential growth of the volume of calculations.

The anti-monotonicity property can also be described with a different wording: the support decreases or remains the same as the size of the item set increases. From all of the above it follows that any  k-length item set will be frequent if and only if all of its (k-1) -length item subsets are frequent.

All possible item sets from IIcan be represented as a grid starting with an empty set, then at level 1 there are 1-length item sets, at level 2 — 2-length item sets, and so on. At level k , there are k-length item sets associated with all their (k-1) -length item subsets.

Let's study Figure 1, which illustrates the item set I - {A, B, C, D}. Let's assume that item set \{ A, B \} has a support below the specified threshold and, accordingly, is not frequent. Then, according to the anti-monotonicity property, all its supersets are also not frequent and are discarded. This entire branch, starting with \{ A, B \}, is highlighted in blue. This heuristics can significantly reduce the search space.

Apriori Algorithm

At the first step of the algorithm, frequent 1-length item sets are counted. To do this, you need to go through the entire data set and calculate the support required for them, i.e. how many times it occurs in the database.

The next steps will consist of two parts: generation of potentially frequent item sets (candidates) and counting of the support required for the candidates.

The algorithm described above can be written as the following pseudocode:

  1. F_1 = \{ Frequent 1-length item sets \};
  2.   \text{ for } (k=2, F_{k-1} \verb!<>! \varnothing, k\verb!++! )
  3.     \{
  4.       C_k = \{ generation of candidates \ k based on \ Apriorigen ( F_{k-1} ) \}
  5.       for all transactions t∈D
  6.       \{
  7.         C_t = subset(C_k,t) // Removal of redundant rules
  8.           for all candidates c \in C_t
  9.           c.count\verb!++!
  10.       \}
  11.       F_k = [c \in C_k | c.count > = minsupp ] // Selection of candidates
  12.     \}
  13. Result \cup F_k

 

Let's describe the candidate generation function. This time, there is no need to access the database again. To get k-length item sets, we use (k-1) -length item sets that were defined in the previous step and proven to be frequent.

We remember that our source set is stored in an ordered form. Candidate generation will also consist of two steps.

Packing. Each candidate C_k will be formed through expanding a frequent set of size (k-1) performed by means of adding an item from another (k-1) -length item set. Let's describe the algorithm for this Apriorigen function as a small SQL-like query.

INSERT INTO

          C_k

SELECT

          p.item_1, p.item_2, ..., p.item_{k-1}, q.item_{k-1}

FROM

          F_{k-1} p, F_{k-1} q

WHERE

          p.item_1=q.item_1

          p.item_2=q.item_2,...,

          p.item_{k-2}=q.item_{k-2}

          p.item_{k-1}<q.item_{k-1}

Removal of redundant rules. Based on the anti-monotonicity property, you should delete all sets with C_k, if at least one of its (k-1) subsets are not frequent.

The next task after the candidate generation is to count the support for each candidate. It is obvious that the number of candidates can be very large, thus an effective counting method is needed. The most trivial approach is to compare each transaction with each candidate. But this is not the best solution.

It is much faster and more efficient to use an approach based on storage of candidates in a hash-tree. The internal nodes of the tree contain hash-tables with references to descendants, and the leaves contain hash-tables with references to candidates. We can use this tree to quickly calculate the support required for the candidates.

A hash-tree is built each time candidates are formed. Initially, the tree consists only of the root, which is a leaf and does not contain any candidate sets. Each time a new candidate is formed, it is entered in the root of the tree and so on, until the number of candidates in the root list exceeds a certain threshold.

As soon as the number of candidates exceeds the threshold, the root is converted to a hash-table, i.e. it becomes an internal node, and has leaf descendants created for it. All the examples are distributed among the descendant nodes according to the hash-values of the items included into the set, and so on. Each new candidate is hashed on the internal nodes, until it reaches the first leaf node, where it will be stored until the number of sets again exceeds the threshold.

A hash-tree with candidates of the sets has been built; now using the hash-tree, it is easy to calculate the support required for each candidate. To do this, you need to "run" each transaction through the tree and expand the counter mechanisms for those candidates the items of which are also contained in the transaction, i.e. C_k∩Ti=C_k. At the root level, the hash function will be applied to each item in the transaction.

After it, at the second level, the hash function will be applied to the second tier items, and so on. The k-item is hashed at level k. This process goes on until we reach the leaf. If the candidate stored in the leaf is a subset of the transaction under consideration, we need to increase the support counter for this candidate by one.

After each transaction from the source dataset has been "run" through the tree, you can check whether the candidate support values reach the minimum threshold. Candidates for which this condition is met are transferred to the "frequent" category. Moreover, you should remember the support of the set, which will be useful for extraction of rules. The same procedure is used to find (k+1) -length item sets, etc.

After all frequently encountered item sets have been found, you can start generating the rules.

Extraction of the rules is not as time-consuming as the previous task. First, to calculate the confidence of the rule, it is sufficient enough to know the support of the set itself and that of the set the rule is conditioned upon. For example, there is a frequent set \{ A, B, C \} and we need to calculate the confidence for rule AB⇒C. We know the support of the set itself, but its subset \{ A, B \}, upon which the rule is conditioned, is often found as well due to the anti-monotonicity property, and therefore we know its support. taking that into account we can easily calculate the confidence. This saves us from excessive viewing of the transaction database, which would be required if this support was unknown.

To extract a rule from the frequent set F, find all its non-empty subsets. And for each subset s we can formulate the following rule:s ⇒ (F - s) if the confidence of rule conf(s ⇒ (F - s)) = supp(F)/supp(s) is equal or exceeds the threshold minconf.

Note that the numerator remains constant. Then, the confidence is minimal, if the denominator has the maximum value; this happens, when the rule condition contains a set consisting of one item. All supersets of a given set have a smaller or equal support and, consequently, a higher confidence value. This property can be used when extracting the rules.

If we start extracting the rules, considering only one item in the rule condition at first, and if this rule has the necessary support, then all rules conditioned upon the supersets of this item will also have a confidence value above the specified threshold. For example, if a rule A ⇒ BCDE satisfies the minimum confidence threshold minconf, then AB ⇒ CDE also satisfies it.

To extract all the rules, a recursive procedure must be used. Important warning: any rule made up of a frequent set must contain all items of the set. For example, if a set consists of items \{ A, B, C \} , then the rule A⇒B shall not be considered.

Reference list

  1. R. Agrawal, R. Srikant. «Fast Discovery of Association Rules», In Proc. of the 20th International Conference on VLDB, Santiago, Chile, September 1994.
  2. R. Agrawal, T. Imielinski, A. Swami. 1993. Mining Associations between Sets of Items in Massive Databases. In Proc. of the 1993 ACM-SIGMOD Int’l Conf. on Management of Data, 207-216.

See also

Genetic algorithms as a mathematical apparatus
Genetic algorithms aim to solve optimization and modeling tasks by iterating, variation and combining target parameters with the help of mechanisms similar to biological evolution.
Scalable CLOPE algorithm for clustering categorical data
Splitting of categorical and transactional data sets into groups with similar attributes into large databases is the most important task of data mining. In most cases, traditional clustering...
Clustering algorithms on Data Mining
This article is an attempt to systematize and give a holistic view of the latest achievements in the development of effective approaches to data clustering. The purpose of this article was not...