The apriori algorithm is a canonical algorithm from the field of [[knowledge discovery in databases]] that identifies frequent subsets.
The support is the probability of the itemset occurring. Define a minimum support to determine which itemsets to identify. The key insight is that if X is an infrequent itemset (i.e., support is less than minimum support), any superset containing X are also infrequent. This is called apriori pruning.
The pseudocode is
1. scan dataset to get freq $1$-itemseets
2. generate candidate $(k+1)$-itemsets from feq. $k$-itemsets
3. scan dataset to remove infrequent candidate $(k+1)$-itemsets
4. stop when no more frequent or candidate itemsets occur
```python
from itertools import combinations as iter_combinations
def findsubsets(s, n):
return list(sorted((combinations(s,n))))
def items_from_frequent_itemsets(frequent_itemset):
return sorted({item for keys in frequent_itemset for item in keys})
def generate_frequent_itemsets(dataset, support, items, n=1, frequent_items=None):
len_transactions = len(dataset)
frequent_itemsets = {}
if frequent_items is None:
frequent_items = {}
if n == 1:
for item in items:
count = sum(item in transaction for transaction in dataset.values())
if count >= support * len_transactions:
frequent_itemsets[item] = count
else:
candidate_sets = findsubsets(frequent_items.keys(), n)
for candidate in candidate_sets:
count = sum(all(i in transaction for i in candidate) for transaction in dataset.values())
if count >= support * len_transactions:
frequent_itemsets[candidate] = count
return frequent_itemsets
```