Association Rule Mining in Python
Table of Contents
- Introduction to Association Rule Mining
- The Apriori Algorithm
- Apriori Algorithm using the mlxtend package
- Apriori Algorithm Normal Approach
- Apriori Algorithm Sparse Data Approach
- Association Rule Mining using Native Code
Introduction to Association Rule Mining
Association rule mining is an unsupervised learning technique in which we intend to find relations between elements. It uses the principles of joint and conditional probabilities to create strong association rules. This technique is a foundation layer for collaborative filtering. There are some disadvantages though: for instance, it is computationally expensive. I have already covered the conceptual part of Association Rule Mining in my previous blog, Association Rule Mining for Collaborative Filtering, where I discussed Collaborative Filtering, Content-based Filtering and Hybrid Filtering as recommendation systems in detail. That implementation was done in the R language, and now I am showcasing a similar implementation in Python. Towards the end, I am going to show you how you can start using your own native Python code to run the functionality of an Association Rule Mining algorithm, without using a pre-built package.
The Apriori Algorithm
The Apriori algorithm for Association Rule Mining uses a breadth-first search iteratively, from a bottom-up perspective. That means it reads the elements in each row and creates baskets from the row data. The frequent itemsets are created from the smallest to the largest combinatory sequence. The support and confidence parameters are then used to prune the final subset of rules. Since it uses a hash tree structure for candidate itemset counting, it is very memory intensive and can be inefficient. With good infrastructure these inefficiencies can be ignored, but this is also one of the reasons why you can explore writing your own native code.
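Before getting into the package-based implementation, here is a minimal sketch of how support, confidence and lift are computed for a single candidate rule. The transactions and item names below are made up purely for illustration.
# toy transactions, purely for illustration
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk", "butter"},
]
n = len(transactions)
# support(X) is the fraction of transactions that contain the itemset X
def support(itemset):
    return sum(itemset <= t for t in transactions) / n
# metrics for the candidate rule {bread} -> {milk}
sup_bread = support({"bread"})          # 3/4
sup_milk = support({"milk"})            # 3/4
sup_both = support({"bread", "milk"})   # 2/4
confidence = sup_both / sup_bread       # P(milk | bread) = 0.67
lift = confidence / sup_milk            # 0.89; a lift below 1 means no positive association
print(confidence, lift)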
Apriori Algorithm using the mlxtend package
The mlxtend package provides quite a good implementation in Python. We will be using the shopping data set.
# import the pandas package for handling data ingest
import pandas as pd
# import the numpy package for NaN handling in the pandas object
import numpy as np
# read the sample data set downloaded from the link
raw_data = pd.read_csv("C:/Users/Mohan/Downloads/shopping.csv")
# reading the data into a pandas DataFrame makes the empty cells come in as NaN; use replace to fill them with empty strings
# inplace=True will replace and overwrite the data in the pandas object
raw_data.replace(np.nan,"",inplace=True)
# convert the DataFrame to a list of lists as it is computationally less memory intensive
data_list=raw_data.iloc[:,1:33].values.tolist()
# install the mlxtend package once (you can do this in the conda prompt or terminal)
# pip install mlxtend
# The TransactionEncoder is used to convert the data, which is currently in list of lists format,
# to a DataFrame of True/False values, or to a sparse format, recommended when there is a high number of distinct articles
# Below is a non-sparse implementation; later I will show a sparse implementation
from mlxtend.preprocessing import TransactionEncoder
transform_data_encode = TransactionEncoder()
transformed_data = transform_data_encode.fit(data_list).transform(data_list)
df_ready_for_mining = pd.DataFrame(transformed_data, columns=transform_data_encode.columns_)
# drop the first column, which was created because the empty strings in the data were encoded as an item label
df_ready_for_mining.drop(df_ready_for_mining.columns[[0]],axis=1,inplace=True)
# import the apriori modules from mlxtend
from mlxtend.frequent_patterns import apriori,association_rules
# View your data before mining for any abnormality
df_ready_for_mining.head()
# Create the frequent itemsets using the apriori function of mlxtend
# A low value is kept for support so as to get a large base of frequent itemsets
frequent_candidates = apriori(df_ready_for_mining, min_support=0.001, use_colnames=True)
# View the frequent itemset collection
frequent_candidates.head()
# Run the association_rules function on the mined frequent itemsets
rules = association_rules(frequent_candidates, metric="lift", min_threshold=1)
# Inspect your rules with filters
rules[ (rules['lift'] >= 4) & (rules['confidence'] >= 0.5) ]
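The rules object returned by association_rules is a regular pandas DataFrame, so as a small optional step you can sort it to look at the strongest rules first; the column names used below are the ones association_rules produces.
# sort the mined rules by lift and inspect the strongest ones
top_rules = rules.sort_values(by="lift", ascending=False)
top_rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']].head(10)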
Apriori Algorithm using the mlxtend package with Sparse Data Format
We will reuse the code up to the data list preparation from the recipe above. When the data has a high number of candidate elements, we can use a sparse data structure to save on storage space. Apart from this memory-saving attribute, the rest of the procedure remains the same. Let's get into this part now.
# The TransactionEncoder is used to convert the data, which is currently in list of lists format,
# to a sparse DataFrame of True/False values
from mlxtend.preprocessing import TransactionEncoder
sparse_transform_data_encode = TransactionEncoder()
sparse_transformed_data = sparse_transform_data_encode.fit(data_list).transform(data_list, sparse=True)
sparse_df_ready_for_mining = pd.DataFrame.sparse.from_spmatrix(sparse_transformed_data, columns=sparse_transform_data_encode.columns_)
# drop the first column, which was created because the empty strings in the data were encoded as an item label
sparse_df_ready_for_mining.drop(sparse_df_ready_for_mining.columns[[0]],axis=1,inplace=True)
# import the apriori modules from mlxtend
from mlxtend.frequent_patterns import apriori,association_rules
# View your data before mining for any abnormality
sparse_df_ready_for_mining.head()
# Create the frequent itemsets using the apriori function on the sparse DataFrame
sparse_frequent_candidates = apriori(sparse_df_ready_for_mining, min_support=0.001, use_colnames=True)
# Run the association_rules function on the mined frequent itemsets
sparse_rules = association_rules(sparse_frequent_candidates, metric="lift", min_threshold=1)
# Inspect your rules with filters
sparse_rules[ (sparse_rules['lift'] >= 4) & (sparse_rules['confidence'] >= 0.5) ]
# Let's check the size of the standard object and the sparse object we created.
# The only reason we are using the sparse format is space and memory optimization.
# import the system module
import sys
# get size of the two objects to compare
sys.getsizeof(df_ready_for_mining)
sys.getsizeof(sparse_df_ready_for_mining)
# The Sparse data form consumes less space
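Another way to compare the two objects is pandas' own memory_usage method, which breaks the usage down per column; the lines below are a small optional sketch of that check.
# memory_usage(deep=True) reports the memory used by each column; sum it for a total
print(df_ready_for_mining.memory_usage(deep=True).sum())
print(sparse_df_ready_for_mining.memory_usage(deep=True).sum())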
Association Rule Mining using Native Code
# Load all dependencies
import numpy as np
import itertools
from itertools import combinations
import collections
from collections import Counter
from itertools import chain
import pandas as pd
#use the below data
data_set=[ ("oil","soap","butter") , ("oil","ghee","soap"), ("wax","soap","ghee","oil"), ("ghee","wax")] # Use text labelled elements
#empty list
comb=[]
# create itemset elements of every size from each basket
# sort each basket first so that the same itemset always maps to the same tuple
for i in range(len(data_set)):
    for j in range(len(data_set[i])):
        comb.append(list(combinations(sorted(data_set[i]), j+1)))
# Flatten into a single list of itemset tuples for further processing
comb_ = itertools.chain.from_iterable(comb)
flattened_comb = list(comb_)
# Count the occurrences of each itemset
final_Hash = collections.Counter(flattened_comb)
print(final_Hash)
# Convert to a DataFrame with proper headings
df = pd.DataFrame(list(final_Hash.items()), columns=['item', 'Frequency'])
# Compute the support of each itemset: its frequency divided by the number of transactions
df['support'] = df['Frequency'] / len(data_set)
print(df)
You can experiment with the native code from here on to understand more. Check the formulas for confidence and lift and try to build your own code; a minimal sketch to get you started is given below. I have not added more on the filtering of the rules over here as it is already covered in the R implementation. I recommend you try that part in Python as well.
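As a starting point, here is a minimal sketch of how confidence and lift could be derived from the frequency table built above. It assumes the baskets were sorted before generating combinations, as in the loop above; the helper names are my own and not part of any package.
# number of transactions, used as the denominator for support
n_transactions = len(data_set)
# look up the support of any itemset from the Counter built above
def itemset_support(itemset):
    return final_Hash[tuple(sorted(itemset))] / n_transactions
# confidence(X -> Y) = support(X union Y) / support(X)
# lift(X -> Y) = confidence(X -> Y) / support(Y)
def rule_metrics(antecedent, consequent):
    sup_xy = itemset_support(antecedent + consequent)
    conf = sup_xy / itemset_support(antecedent)
    lift = conf / itemset_support(consequent)
    return conf, lift
# example rule: oil -> soap
print(rule_metrics(("oil",), ("soap",)))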
About the Author:
Mohan Rai
Mohan Rai is an alumnus of IIM Bangalore; he has completed his MBA from the University of Pune and a Bachelor of Science (Statistics) from the University of Pune. He is a Certified Data Scientist by EMC. Mohan is a learner and has been enriching his experience throughout his career by exposing himself to several opportunities in the capacity of an Advisor, Consultant and Business Owner. He has more than 18 years' experience in the field of Analytics and has worked as an Analytics SME on domains ranging from IT, Banking, Construction, Real Estate, Automobile, Component Manufacturing and Retail. His functional scope covers areas including Training, Research, Sales, Market Research, Sales Planning, and Market Strategy.