Table Of Content

## Introduction

In business it is important to know about your customers and their needs, but it is very difficult and time consuming to provide personalized service for each customer so, we group them on different characteristic and buying behavior which is known as Customer segmentation.

Grouping customers can help in understanding their behavior and also help company to do target marketing which increase the sales of company. One of the methods for customer segmentation is RFM analysis.

In this article we will learn about what RFM modeling is and how we can do this in retail sector.

## RFM Modeling

In RFM modeling R stands for Recency, F stands for Frequency and M stands for Monetary. In this analysis customers grouped on the basis of their purchase history, how recently they purchased any product from company (Recency), how often they purchase (Frequency) and how much did they buy (Monetary).

We assign value of R, F and M for each customer; let’s take an example of retail product to understand it. Suppose a customer bought two products in an interval of one months of each price of 10K, here the value of R is 1 as last transaction of customer was in last month and he bought only one product in a month, value of F will be 2 as customer bought total 2 products and M will be 20K as he bought two product each of 10K.

So, till now we have assigned the value to a customers but the question is how we can find the best customer among them.

Are they from who spend the most money in your product? But what happen if they bought year ago from you and they are no longer your customer, are they still your best customers? Probably the answer is no. Considering only one aspect to group them is not a good thing and not gives the desired result, that’s why we use the RFM model which combines all three aspects and rank them accordingly.

## Import Dataset and Libraries

Dataset contains the retail transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based store. Click this download retail data link.

### Importing necessary libraries

```    import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

```

```    retail = pd.read_excel("Online Retail.xlsx")

InvoiceNo StockCode                          Description  Quantity           InvoiceDate  UnitPrice  CustomerID         Country
0    536365    85123A   WHITE HANGING HEART T-LIGHT HOLDER         6   2010-12-01 08:26:00       2.55     17850.0  United Kingdom
1    536365     71053                  WHITE METAL LANTERN         6   2010-12-01 08:26:00       3.39     17850.0  United Kingdom
2    536365    84406B       CREAM CUPID HEARTS COAT HANGER         8   2010-12-01 08:26:00       2.75     17850.0  United Kingdom
3    536365    84029G  KNITTED UNION FLAG HOT WATER BOTTLE         6   2010-12-01 08:26:00       3.39     17850.0  United Kingdom
4    536365    84029E       RED WOOLLY HOTTIE WHITE HEART.         6   2010-12-01 08:26:00       3.39     17850.0  United Kingdom

```
```    retail.info()

RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
#   Column       Non-Null Count   Dtype
---  ------       --------------   -----
0   InvoiceNo    541909 non-null  object
1   StockCode    541909 non-null  object
2   Description  540455 non-null  object
3   Quantity     541909 non-null  int64
4   InvoiceDate  541909 non-null  datetime64[ns]
5   UnitPrice    541909 non-null  float64
6   CustomerID   406829 non-null  float64
7   Country      541909 non-null  object
dtypes: datetime64[ns](1), float64(2), int64(1), object(4)
memory usage: 33.1+ MB

```
```    retail.describe()

Quantity      UnitPrice     CustomerID
count  541909.000000  541909.000000  406829.000000
mean        9.552250       4.611114   15287.690570
std       218.081158      96.759853    1713.600303
min    -80995.000000  -11062.060000   12346.000000
25per       1.000000       1.250000   13953.000000
50per       3.000000       2.080000   15152.000000
70per       0.000000       4.130000   16791.000000
max     80995.000000   38970.000000   18287.000000

```
```    retail.isnull().sum()

InvoiceNo           0
StockCode           0
Description      1454
Quantity            0
InvoiceDate         0
UnitPrice           0
CustomerID     135080
Country             0
dtype: int64

```

As we can see that the dataset contains null values in Customer ID and Description column, we have to remove the null values as Customer ID contains unique values.

```    retail.dropna(subset=['CustomerID'],how='all',inplace=True)
retail.isnull().sum()

InvoiceNo      0
StockCode      0
Description    0
Quantity       0
InvoiceDate    0
UnitPrice      0
CustomerID     0
Country        0
dtype: int64

```

Now we don’t have any null value in our dataset, so we can move further now. Let’s see the number of unique customer id.

```    retail['CustomerID'].nunique()

4372

```

Now let’s see the quantity and unit price distribution

```    retail['Quantity'].value_counts()
1      73314
12     60033
2      58003
6      37688
4      32183

-51         1
95         1
-162        1
94         1
342        1
Name: Quantity, Length: 436, dtype: int64

```

We have negative value for quantity which can not be possible so we have to remove this values.

```    retail = retail[retail['Quantity']>0]
# verify if its taken care of
retail['Quantity'].min()
1

```

In dataset both date and time is given but we have to deal only with date so create a new column which contains date only.

```    import datetime as dt
retail['date'] = pd.DatetimeIndex(retail['InvoiceDate']).date

InvoiceNo StockCode                          Description  Quantity          InvoiceDate  UnitPrice  CustomerID         Country        date
0    536365    85123A   WHITE HANGING HEART T-LIGHT HOLDER         6   2010-12-01 08:26:00       2.55     17850.0  United Kingdom  2010-12-01
1    536365     71053                  WHITE METAL LANTERN         6   2010-12-01 08:26:00       3.39     17850.0  United Kingdom  2010-12-01
2    536365    84406B       CREAM CUPID HEARTS COAT HANGER         8   2010-12-01 08:26:00       2.75     17850.0  United Kingdom  2010-12-01
3    536365    84029G  KNITTED UNION FLAG HOT WATER BOTTLE         6   2010-12-01 08:26:00       3.39     17850.0  United Kingdom  2010-12-01

```

## Creating RFM Table

Till now our half part is done, now we have to calculate RFM values and make a column for them and then group the customers according to RFM values.

Firstly we will calculate R value

```    recency = retail.groupby(by='CustomerID', as_index=False)['date'].max()
recency.columns = ['CustomerID','LastPurshaceDate']
recent_date = recency.LastPurshaceDate.max()
print(recent_date)

2011-12-09

```
```    recency['Recency'] = recency['LastPurshaceDate'].apply(lambda x: (recent_date - x).days)

CustomerID LastPurshaceDate  Recency
0     12346.0       2011-01-18      325
1     12347.0       2011-12-07        2
2     12348.0       2011-09-25       75
3     12349.0       2011-11-21       18
4     12350.0       2011-02-02      310

```
```    # Drop duplicates
df = retail
df.drop_duplicates(subset=['InvoiceNo', 'CustomerID'], keep="first", inplace=True)

```

Now we will calculate the value of frequency

```    # Calculate the frequency of purchases
frequency = df.groupby(by=['CustomerID'], as_index=False)['InvoiceNo'].count()
frequency.columns = ['CustomerID','Frequency']

CustomerID  Frequency
0     12346.0          1
1     12347.0          7
2     12348.0          4
3     12349.0          1
4     12350.0          1

```

At last we will create monetary value

```    # Create column total cost
df['TotalCost'] = df['Quantity'] * df['UnitPrice']
monetary = df.groupby(by='CustomerID',as_index=False).agg({'TotalCost': 'sum'})
monetary.columns = ['CustomerID','Monetary']

CustomerID  Monetary
0     12346.0  77183.60
1     12347.0    163.16
2     12348.0    331.36
3     12349.0     15.00
4     12350.0     25.20

```

Now we will combine all this value in new dataset

```    # Create RFM Table
# Merge recency dataframe with frequency dataframe
rfm = recency.merge(frequency,on='CustomerID')
# Merge with monetary dataframe
rfm_df = rfm.merge(monetary,on='CustomerID')
# Use CustomerID as index
rfm_df.set_index('CustomerID',inplace=True)

LastPurshaceDate  Recency  Frequency  Monetary
CustomerID
12346.0          2011-01-18      325          1  77183.60
12347.0          2011-12-07        2          7    163.16
12348.0          2011-09-25       75          4    331.36
12349.0          2011-11-21       18          1     15.00
12350.0          2011-02-02      310          1     25.20

```

We don’t need the last purchase date column, so we can remove it

```    rfm_df = rfm_df.drop(['LastPurshaceDate'], axis=1)

Recency  Frequency  Monetary
CustomerID
12346.0         325          1  77183.60
12347.0           2          7    163.16
12348.0          75          4    331.36
12349.0          18          1     15.00
12350.0         310          1     25.20

data = rfm_df.copy()

```

## Finding number of clusters

Now, we have to find the best number of clusters or group for dataset, which can be done by elbow method.

```    # Scaling the Data
from sklearn.preprocessing import StandardScaler, normalize
sc = StandardScaler()
data_scaled = sc.fit_transform(data)
data_scaled.shape

(4339, 3)

data_scaled

array([[ 2.32967293e+00, -4.24674873e-01,  2.45807187e+01],
[-9.00448767e-01,  3.54080191e-01, -4.27122694e-02],
[-1.70421263e-01, -3.52973410e-02,  1.10612625e-02],
...,
[-8.50446884e-01, -2.94882363e-01, -8.26459904e-02],
[-8.90448391e-01,  1.52221279e+00, -7.35345418e-02],
[-5.00433697e-01, -1.65089852e-01, -6.91706374e-02]])

# Applying Elbow Method
from sklearn.cluster import KMeans
score_1 = []
cluster = range(1,15)
for i in cluster:
kmeans = KMeans(n_clusters = i)
kmeans.fit(data_scaled)
score_1.append(kmeans.inertia_)

```
```    plt.plot(score_1, 'bx-')
plt.title('Finding the right number of clusters')
plt.xlabel('Clusters')
plt.ylabel('Scores')
plt.show()

```

Figure 1 : Number of Clusters

From above graph we can say that the best number of cluster for this is 3 or 4. We will go with 4 numbers of clusters; you can try 3 numbers of clusters also.

```    kmeans = KMeans(4)
kmeans.fit(data_scaled)
data['cluster'] = kmeans.labels_

Recency  Frequency  Monetary  cluster
CustomerID
12350.0         310          1     25.20        1
12353.0         204          1     19.90        1
12354.0         232          1     20.80        1
12355.0         214          1     30.00        1
12361.0         287          1     23.40        1
12365.0         291          2    335.69        1
12373.0         311          1     19.50        1
12377.0         315          2     57.00        1
12383.0         184          5    102.40        1
12386.0         337          2     98.00        1

Recency  Frequency  Monetary  cluster
CustomerID
12346.0         325          1  77183.60        2
12748.0           0        210   3841.31        2
12971.0           3         86   3952.36        2
13089.0           2         97   5389.39        2
13408.0           1         62   4682.30        2
13694.0           3         50   7519.06        2
13798.0           1         57   8194.26        2
14156.0           9         55   6010.73        2
14527.0           2         55    613.21        2
14606.0           1         93   1023.97        2

```
```    sns.boxplot(data.cluster,data.Recency)

```

Figure 2 : Recency Distribution as a parameter in Clusters

```    sns.boxplot(data.cluster,data.Frequency)

```

Figure 3 : Frequency Distribution as a parameter in Clusters

```    sns.boxplot(data.cluster,data.Monetary)

```

Figure 4 : Monetary Distribution as a parameter in Clusters

## Conclusion

A company can get benefits from customer segmentation as it is easy for them to target their customer in more personalized way and do target marketing. With the help of RFM modeling it is easy to do customer segmentation or group them with similar characteristic and business insights can be made. It is helpful in various sectors like financial, marketing, sales, etc. You can read this interesting article on Survival models