Association Rule Mining in Python

Table of Contents

Introduction to Association Rule Mining
The Apriori Algorithm
Apriori Algorithm using mlxtend package
1. Apriori Algorithm Normal Approach
2. Apriori Algorithm Sparse Data Approach
Association Rule Mining using Native Code

Introduction to Association Rule Mining

Association rule mining is an unsupervised learning technique that finds relations between elements. It uses joint and conditional probabilities to create strong association rules, and it is the foundation layer for collaborative filtering. It has disadvantages, though: for one, it is computationally expensive. I covered the conceptual part of association rule mining in my previous blog, Association Rule Mining for Collaborative Filtering, where I also discussed collaborative filtering, content-based filtering and hybrid filtering as recommendation systems in detail. That implementation was done in R; here I showcase a similar implementation in Python. Towards the end, I show how you can start using your own native Python code to run the functionality of an association rule mining algorithm without a pre-built package.
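
As a rough guide, the joint and conditional probabilities mentioned above map onto the three standard rule metrics. The notation is an assumption made for illustration (T is the set of transactions, X and Y are disjoint itemsets) and is not taken from the earlier R blog.

```latex
\begin{align*}
\operatorname{support}(X) &= \frac{|\{\, t \in T : X \subseteq t \,\}|}{|T|} \approx P(X) \\
\operatorname{confidence}(X \Rightarrow Y) &= \frac{\operatorname{support}(X \cup Y)}{\operatorname{support}(X)} \approx P(Y \mid X) \\
\operatorname{lift}(X \Rightarrow Y) &= \frac{\operatorname{confidence}(X \Rightarrow Y)}{\operatorname{support}(Y)}
\end{align*}
```

A lift above 1 suggests that X and Y occur together more often than they would if they were independent.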

The Apriori Algorithm

The Apriori algorithm for association rule mining uses an iterative breadth-first search from a bottom-up perspective. It reads the elements in each row and treats every row as a basket. The frequent itemsets are created from the smallest combinations up to the largest, with each level built from the frequent itemsets of the level before. The support and confidence parameters are then used to prune the final subset of rules. Since it uses a hash tree structure to count candidate itemsets, it is very memory intensive and can be inefficient on large data. With good infrastructure these inefficiencies can be ignored. This is also one of the reasons why you can explore writing your own native code.
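
The bottom-up, level-by-level search described above can be sketched in a few lines of plain Python. This is a minimal sketch under stated assumptions, not the mlxtend implementation: the function name apriori_sketch is hypothetical, the toy baskets are the ones used in the native code section of this post, and candidates are counted with a simple scan rather than a hash tree.

```python
from itertools import combinations


def apriori_sketch(transactions, min_support):
    """Return {frozenset(itemset): support} for every frequent itemset.

    Hypothetical helper for illustration only.
    """
    baskets = [frozenset(t) for t in transactions]
    n = len(baskets)

    def support(itemset):
        # Fraction of baskets that contain every item of the itemset
        return sum(itemset <= basket for basket in baskets) / n

    def keep_frequent(candidates):
        # Score each candidate and keep those that reach min_support
        scored = {c: support(c) for c in candidates}
        return {c: s for c, s in scored.items() if s >= min_support}

    # Level 1: frequent single items
    current = keep_frequent({frozenset([item]) for b in baskets for item in b})
    frequent = {}
    k = 1
    while current:
        frequent.update(current)
        k += 1
        # Join step: merge pairs of frequent (k-1)-itemsets into k-itemsets
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        current = keep_frequent(candidates)
    return frequent


transactions = [
    ("oil", "soap", "butter"),
    ("oil", "ghee", "soap"),
    ("wax", "soap", "ghee", "oil"),
    ("ghee", "wax"),
]
frequent = apriori_sketch(transactions, min_support=0.5)

# Print the itemsets level by level
for itemset, s in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

The prune step is what keeps the search manageable: an itemset is only counted if all of its smaller subsets were already found to be frequent.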

Apriori Algorithm using mlxtend package

The mlxtend package is quite a good implementation in Python. We will be using the shopping data set.

    # import the pandas package for handling data ingest 
    import pandas as pd

    # import the numpy package for NaN handling in the pandas object
    import numpy as np

    # read the sample data set downloaded from the link
    raw_data = pd.read_csv("C:/Users/Mohan/Downloads/shopping.csv")
    
    # pandas reads the nulls in the file as NaN; use replace to fill them with empty strings
    # inplace=True overwrites the data in the pandas object
    raw_data.replace(np.nan,"",inplace=True)
    
    # convert the dataframe to a list of lists, as it is less memory intensive
    data_list=raw_data.iloc[:,1:33].values.tolist()


    # install the mlxtend package once (you can do this in the conda prompt or terminal)
    # pip install mlxtend

    # The TransactionEncoder converts the data, which is currently a list of lists,
    # into a DataFrame of True/False values or into a sparse format; the sparse format
    # is recommended when there is a high number of distinct articles.
    # Below is a non-sparse implementation; a sparse implementation is shown later.
    from mlxtend.preprocessing import TransactionEncoder
    transform_data_encode = TransactionEncoder()
    transformed_data = transform_data_encode.fit(data_list).transform(data_list)
    df_ready_for_mining = pd.DataFrame(transformed_data, columns=transform_data_encode.columns_)

    # drop the first column: the NULLs in the data were encoded as an item label of their own
    df_ready_for_mining.drop(df_ready_for_mining.columns[[0]],axis=1,inplace=True)
    
    # import the apriori modules from mlxtend
    from mlxtend.frequent_patterns import apriori,association_rules

    # View your data before mining for any abnormality
    df_ready_for_mining.head()

    # Create the frequent itemsets using the apriori function of mlxtend
    # A low support value is used so as to get a large base of frequent itemsets
    frequent_candidates = apriori(df_ready_for_mining, min_support=0.001, use_colnames=True)

    # View the frequent itemset collection
    frequent_candidates.head()

    # Run the association_rules function on the mined frequent itemsets
    rules = association_rules(frequent_candidates, metric="lift", min_threshold=1)

    # Inspect your rules with filters
    rules[ (rules['lift'] >= 4) & (rules['confidence'] >= 0.5) ]

Apriori Algorithm using mlxtend package with Sparse Data Format

We reuse the same recipe as above up to the preparation of the data list. When the data has a high number of candidate elements, we can use a sparse data structure to save storage space. Apart from this memory saving, the rest of the procedure remains the same. Let's get into this part now.

    # The TransactionEncoder converts the data, which is currently a list of lists,
    # into a sparse matrix of True/False values, which is then wrapped in a sparse DataFrame
    from mlxtend.preprocessing import TransactionEncoder
    sparse_transform_data_encode = TransactionEncoder()
    sparse_transformed_data = sparse_transform_data_encode.fit(data_list).transform(data_list, sparse=True)
    sparse_df_ready_for_mining = pd.DataFrame.sparse.from_spmatrix(sparse_transformed_data, columns=sparse_transform_data_encode.columns_)
    

    # drop the first column: the NULLs in the data were encoded as an item label of their own
    sparse_df_ready_for_mining.drop(sparse_df_ready_for_mining.columns[[0]],axis=1,inplace=True)
    
    # import the apriori modules from mlxtend
    from mlxtend.frequent_patterns import apriori,association_rules

    # View your data before mining for any abnormality
    sparse_df_ready_for_mining.head()

    # Create the frequent itemsets from the sparse DataFrame
    sparse_frequent_candidates = apriori(sparse_df_ready_for_mining, min_support=0.001, use_colnames=True)
    
    # Run the association_rules function on the mined frequent itemsets
    sparse_rules = association_rules(sparse_frequent_candidates, metric="lift", min_threshold=1)

    # Inspect your rules with filters
    sparse_rules[ (sparse_rules['lift'] >= 4) & (sparse_rules['confidence'] >= 0.5) ]

    # Let's check the size of the standard object and the sparse object we created.
    # The only reason for using the sparse format is space and memory optimization.
    # import the system module
    import sys
    
    # get the size in bytes of the two objects to compare
    print(sys.getsizeof(df_ready_for_mining))
    print(sys.getsizeof(sparse_df_ready_for_mining))

    # The sparse data form consumes less space
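
The size comparison above depends on the shopping data set. As a rough, self-contained illustration, the sketch below builds a hypothetical one-hot table from random data and compares the two formats with pandas' memory_usage. The names dense_demo and sparse_demo, the table shape and the 2% fill rate are all assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for an encoded transaction table:
# 1000 baskets x 50 items, with roughly 2% of the cells set to True
rng = np.random.default_rng(0)
dense_demo = pd.DataFrame(
    rng.random((1000, 50)) < 0.02,
    columns=[f"item_{i}" for i in range(50)],
)

# Store only the True cells; False is treated as the implicit fill value
sparse_demo = dense_demo.astype(pd.SparseDtype(bool, fill_value=False))

# Total bytes used by each DataFrame, including the index
dense_bytes = dense_demo.memory_usage(deep=True).sum()
sparse_bytes = sparse_demo.memory_usage(deep=True).sum()
print("dense bytes :", dense_bytes)
print("sparse bytes:", sparse_bytes)
```

The saving shrinks as the table fills up, so the sparse format mostly pays off when each basket holds only a small share of the distinct articles.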

Association Rule Mining using Native Code

    # Load all dependencies
    from collections import Counter
    from itertools import chain, combinations
    import pandas as pd

    # Use the sample data below (text-labelled elements, one tuple per transaction)
    data_set=[ ("oil","soap","butter") , ("oil","ghee","soap"), ("wax","soap","ghee","oil"), ("ghee","wax")]

    # Empty list to collect the itemsets of each transaction
    comb=[]

    # Create every itemset, from size 1 up to the basket size, for each transaction.
    # Sorting the basket first makes ("oil","soap") and ("soap","oil") count as the same itemset.
    for basket in data_set:
        for size in range(1, len(basket) + 1):
            comb.append(list(combinations(sorted(basket), size)))

    # Flatten to a single list of tuples for further processing
    flattened_comb = list(chain.from_iterable(comb))

    # Itemset and count
    final_Hash = Counter(flattened_comb)
    print(final_Hash)

    # Convert to dataframe with proper headings
    df=pd.DataFrame(final_Hash.items(), columns=['item', 'Frequency'])

    # Support of an itemset = transactions containing it / total number of transactions
    df['support']=df['Frequency']/len(data_set)
    print(df)

You can experiment with the native code from here on to understand more. Check the formulas for confidence and lift and try to build your own code. I have not added the filtering of the rules here, as it is already covered in the R implementation. I recommend you try it in Python.
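
If you want something to compare your own attempt against, here is one possible way to go from itemset counts to rules. It is a minimal sketch that assumes the standard definitions (confidence is the support of the whole itemset divided by the support of the antecedent; lift is the confidence divided by the support of the consequent). It is not a port of the R implementation, and the names itemset_counts, support and rules are hypothetical.

```python
from collections import Counter
from itertools import combinations

data_set = [
    ("oil", "soap", "butter"),
    ("oil", "ghee", "soap"),
    ("wax", "soap", "ghee", "oil"),
    ("ghee", "wax"),
]
n_transactions = len(data_set)

# Count every itemset of every basket; frozenset keys ignore item order
itemset_counts = Counter(
    frozenset(combo)
    for basket in data_set
    for size in range(1, len(set(basket)) + 1)
    for combo in combinations(sorted(set(basket)), size)
)


def support(itemset):
    # Share of transactions that contain the whole itemset
    return itemset_counts[frozenset(itemset)] / n_transactions


# Split every itemset of two or more items into antecedent -> consequent
rules = {}
for itemset in itemset_counts:
    for size in range(1, len(itemset)):
        for combo in combinations(sorted(itemset), size):
            antecedent = frozenset(combo)
            consequent = itemset - antecedent
            confidence = support(itemset) / support(antecedent)
            lift = confidence / support(consequent)
            rules[(antecedent, consequent)] = {
                "support": support(itemset),
                "confidence": confidence,
                "lift": lift,
            }

# Filter the rules, in the same spirit as the mlxtend filters earlier
for (antecedent, consequent), m in rules.items():
    if m["lift"] >= 1 and m["confidence"] >= 0.5:
        print(sorted(antecedent), "->", sorted(consequent), m)
```

This version enumerates every itemset without any support pruning, so it is only practical for small baskets like the ones in this toy data set.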

About the Author:

Mohan Rai

Mohan Rai is an alumnus of IIM Bangalore. He completed his MBA and his Bachelor of Science (Statistics) at the University of Pune. He is a Certified Data Scientist by EMC. Mohan is a learner and has been enriching his experience throughout his career by exposing himself to several opportunities in the capacity of an Advisor, Consultant and a Business Owner. He has more than 18 years’ experience in the field of Analytics and has worked as an Analytics SME on domains ranging from IT, Banking, Construction, Real Estate, Automobile, Component Manufacturing and Retail. His functional scope covers areas including Training, Research, Sales, Market Research, Sales Planning, and Market Strategy.
