Table of Contents
- Introduction to Association Rule Mining
- The Apriori Algorithm
- Apriori Algorithm using the mlxtend package
- Association Rule Mining using Native Code
Association rule mining is an unsupervised learning technique in which we intend to find relations between elements. It uses joint and conditional probabilities to create strong association rules, and it forms the foundation layer for collaborative filtering. It does have disadvantages, though, chief among them that it is computationally expensive. I have already covered the conceptual part of Association Rule Mining in my previous blog, Association Rule Mining for Collaborative Filtering, where I also discussed collaborative filtering, content-based filtering and hybrid filtering as recommendation systems in detail. That implementation was done in R, and here I am showcasing a similar implementation in Python. Towards the end, I am going to show you how you can start using your own native Python code to run the functionality of an Association Rule Mining algorithm, without using a pre-built package.
The Apriori algorithm for association rule mining uses an iterative breadth-first search from a bottom-up perspective. That means it reads the elements in each row and creates baskets from the row data. The frequent itemsets are created from the smallest to the largest combinatory sequence, and the support and confidence parameters are then used to prune the final subset of rules. Since it uses a hash tree structure for the candidate item counts, it can be quite memory intensive and inefficient, but with good infrastructure these inefficiencies can be ignored. This is also one of the reasons why you may want to explore writing your own native code.
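To make the bottom-up, level-wise idea concrete, here is a minimal sketch of the first two levels of the search. This is not mlxtend code; the toy transactions, the min_support value and the helper names are my own illustrative assumptions.

# a minimal sketch of the level-wise (breadth-first, bottom-up) idea behind Apriori
from itertools import combinations

transactions = [{"oil", "soap", "butter"},
                {"oil", "ghee", "soap"},
                {"wax", "soap", "ghee", "oil"},
                {"ghee", "wax"}]
min_support = 0.5  # keep itemsets present in at least half of the baskets

def support(itemset):
    # fraction of baskets that contain every element of the itemset
    return sum(itemset <= basket for basket in transactions) / len(transactions)

# level 1: frequent individual items (the bottom of the bottom-up search)
items = sorted({item for basket in transactions for item in basket})
frequent_singles = [frozenset([item]) for item in items
                    if support(frozenset([item])) >= min_support]

# level 2: candidate pairs built only from the frequent singles, then pruned by support again
candidate_pairs = {a | b for a, b in combinations(frequent_singles, 2)}
frequent_pairs = [pair for pair in candidate_pairs if support(pair) >= min_support]
print(frequent_singles)
print(frequent_pairs)

Each subsequent level is built the same way: candidates are grown only from the itemsets that survived the previous level, which is exactly the pruning that keeps the search tractable.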
The mlxtend package provides quite a good implementation in Python. We will be using the shopping data set.
# import the pandas package for handling the data ingest
import pandas as pd
# import the numpy package for NaN handling in the pandas object
import numpy as np
# read the sample data set downloaded from the link
raw_data = pd.read_csv("C:/Users/Mohan/Downloads/shopping.csv")
# data ingest as a pandas data frame reads the nulls as NaN; use replace to fill them with empty strings
# inplace=True will replace and overwrite the data in the pandas object
raw_data.replace(np.nan, "", inplace=True)
# convert the dataframe to a list of lists, as it is computationally less memory intensive
data_list = raw_data.iloc[:, 1:33].values.tolist()
# install the package mlxtend once (you can do this in the conda prompt or terminal)
# pip install mlxtend
# The TransactionEncoder is used for converting the data, which is right now in a list-of-lists format,
# to a DataFrame of True/False values or to a sparse format (recommended with a high number of distinct articles)
# Below is a non-sparse implementation; later I will show a sparse implementation
from mlxtend.preprocessing import TransactionEncoder
transform_data_encode = TransactionEncoder()
transformed_data = transform_data_encode.fit(data_list).transform(data_list)
df_ready_for_mining = pd.DataFrame(transformed_data, columns=transform_data_encode.columns_)
# drop the first column, created because the NULLs in the data were transformed into a label candidate
df_ready_for_mining.drop(df_ready_for_mining.columns[0], axis=1, inplace=True)
# import the apriori modules from mlxtend
from mlxtend.frequent_patterns import apriori, association_rules
# View your data before mining for any abnormality
df_ready_for_mining.head()
# Create the frequent itemsets using the apriori functionality of mlxtend
# Kept a low value for support so as to get a base of a large number of frequent itemsets
frequent_candidates = apriori(df_ready_for_mining, min_support=0.001, use_colnames=True)
# View the frequent itemset collection
frequent_candidates.head()
# Run the association_rules function on the mined frequent itemsets
rules = association_rules(frequent_candidates, metric="lift", min_threshold=1)
# Inspect your rules with filters
rules[(rules['lift'] >= 4) & (rules['confidence'] >= 0.5)]
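If the rule table becomes large, a convenient way to inspect it is to sort by a metric and look at the strongest rules first. This is a small optional example of mine, not part of the original walkthrough.

# sort the mined rules by lift and look at the top entries
rules.sort_values(by='lift', ascending=False)[['antecedents', 'consequents', 'support', 'confidence', 'lift']].head(10)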
We will reuse the code up to the data list preparation, the same recipe as in the case above. When the data has a high number of candidate elements, we can implement a sparse data structure to save on storage space. Apart from this memory-saving attribute, the rest of the procedure remains the same. Let's get into this part now.
# The TransactionEncoder is used for converting the data, which is right now in a list-of-lists format,
# to a sparse DataFrame of True/False values
from mlxtend.preprocessing import TransactionEncoder
sparse_transform_data_encode = TransactionEncoder()
sparse_transformed_data = sparse_transform_data_encode.fit(data_list).transform(data_list, sparse=True)
sparse_df_ready_for_mining = pd.DataFrame.sparse.from_spmatrix(sparse_transformed_data, columns=sparse_transform_data_encode.columns_)
# drop the first column, created because the NULLs in the data were treated as a label
sparse_df_ready_for_mining.drop(sparse_df_ready_for_mining.columns[0], axis=1, inplace=True)
# import the apriori modules from mlxtend
from mlxtend.frequent_patterns import apriori, association_rules
# View your data before mining for any abnormality
sparse_df_ready_for_mining.head()
# Create the frequent itemsets from the sparse data frame
sparse_frequent_candidates = apriori(sparse_df_ready_for_mining, min_support=0.001, use_colnames=True)
# Run the association_rules function on the mined frequent itemsets
sparse_rules = association_rules(sparse_frequent_candidates, metric="lift", min_threshold=1)
# Inspect your rules with filters
sparse_rules[(sparse_rules['lift'] >= 4) & (sparse_rules['confidence'] >= 0.5)]
# let's check the size of the standard object and the sparse object we created.
# The only reason we are using the sparse form is space and memory optimization.
# import the system module
import sys
# get the size of the two objects to compare
sys.getsizeof(df_ready_for_mining)
sys.getsizeof(sparse_df_ready_for_mining)
# The sparse data form consumes less space
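As an optional extra check of my own (assuming the two objects created above), you can also report how sparse the encoded data actually is and compare memory in bytes using pandas' own accounting.

# fraction of cells that actually hold a True value; the lower this is, the more the sparse form saves
print(sparse_df_ready_for_mining.sparse.density)
# memory footprint in bytes of the dense and the sparse frames
print(df_ready_for_mining.memory_usage(deep=True).sum())
print(sparse_df_ready_for_mining.memory_usage(deep=True).sum())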
# Load all dependencies
import numpy as np
import itertools
from itertools import combinations
import collections
from collections import Counter
from itertools import chain
import pandas as pd
# use the below data, with text-labelled elements
data_set = [("oil", "soap", "butter"),
            ("oil", "ghee", "soap"),
            ("wax", "soap", "ghee", "oil"),
            ("ghee", "wax")]
# empty list to collect the candidate combinations
comb = []
# create the itemset elements of every length for every transaction
# sorting each transaction ensures the same itemset always produces the same tuple
for i in range(len(data_set)):
    for j in range(len(data_set[i])):
        comb.append(list(combinations(sorted(data_set[i]), j + 1)))
# Flatten to a list of tuples for further processing
comb_ = itertools.chain.from_iterable(comb)
flattened_comb = list(comb_)
# Itemset and count
final_Hash = collections.Counter(flattened_comb)
print(final_Hash)
# Convert to a dataframe with proper headings
df = pd.DataFrame(final_Hash.items(), columns=['item', 'Frequency'])
# Compute the support of each itemset as the fraction of transactions it occurs in
df['support'] = df['Frequency'] / len(data_set)
print(df)
You can experiment with the native code from here on to understand more. Check the formulae for confidence and lift and try to build your own code; a minimal starting sketch follows below. I have not added more on the filtering of the rules here, as it is already covered in the R implementation. I recommend you try that in Python as well.
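As a starting point, here is a minimal sketch that computes confidence and lift for one illustrative rule, (oil) -> (soap), on the toy data_set above. The support_of helper and the chosen rule are my own additions, not part of the original code.

# compute confidence and lift for a single rule from the native data_set
n_transactions = len(data_set)

def support_of(itemset):
    # support = fraction of transactions containing every element of the itemset
    return sum(set(itemset) <= set(t) for t in data_set) / n_transactions

antecedent = ("oil",)
consequent = ("soap",)
rule_items = antecedent + consequent

confidence = support_of(rule_items) / support_of(antecedent)   # P(consequent | antecedent)
lift = confidence / support_of(consequent)                     # confidence relative to P(consequent)
print(confidence, lift)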
About the Author: