Table of Contents

  1. Introduction
  2. Imbalanced Classes vs Unbalanced Classes
  3. Types of Imbalance
  4. Why it is important to handle imbalanced data
  5. How to handle imbalanced data
    1. Using data resampling techniques
      1. Random Under Sampling
      2. Random Over Sampling
      3. SMOTE (Synthetic Minority Oversampling Technique)
    2. Using algorithmic ensembling techniques
      1. Bagging Based Balancing Method
      2. Boosting Based Balancing Method
  6. Performing class balancing on the Telecom Churn Dataset
  7. Conclusion

 

Introduction

Imbalanced classes arise when one class has many more examples than the other(s). The data is skewed toward a particular class, which biases the predictions of a machine learning model toward that class.

It occurs only in classification problems, since the target variable is discrete. It affects not only binary classification but also, often more severely, multiclass classification.

Example - Suppose we are building a dog vs cat classifier and our dataset contains only 20% dog images and 80% cat images. Such a dataset is known as imbalanced data, and its classes are imbalanced classes.
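
To check whether a dataset is imbalanced, a quick look at the class proportions is enough. Below is a minimal sketch using pandas; the label values are made up to mirror the example above.

    import pandas as pd

    # 80 cat labels and 20 dog labels, as in the example above
    labels = pd.Series(["cat"] * 80 + ["dog"] * 20, name="label")
    print(labels.value_counts(normalize=True))
    # cat    0.8
    # dog    0.2  -> the data is skewed toward "cat"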

 

Imbalanced Classes vs Unbalanced Classes

Now the question arises: what is the difference between imbalanced classes and unbalanced classes? The answer lies in their definitions. We already know what imbalanced classes are, so let us look at unbalanced classes: an unbalanced class is one that was balanced originally but is no longer balanced, typically because of preprocessing or splitting of the dataset.

Example of Imbalanced Classes:

Suppose you have 10 candies of 2 flavors, mango and orange: 4 mango-flavored and 6 orange-flavored. This is a case of imbalanced classes, as there are fewer mango-flavored candies than orange-flavored ones.

Example of Unbalanced Classes:

Suppose you again have 10 candies of 2 flavors, mango and orange, this time with the same number of candies of each flavor. For training and testing you take 4 mango-flavored candies but only 3 orange-flavored ones. Even though the data was balanced initially, the split has made it unbalanced.
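
One hedged sketch of how to avoid such split-induced unbalance: scikit-learn's train_test_split accepts a stratify argument that keeps the class proportions identical in the train and test sets. The candy labels below are invented for illustration.

    import pandas as pd
    from sklearn.model_selection import train_test_split

    # 6 mango and 6 orange candies: balanced before splitting
    candies = pd.Series(["mango"] * 6 + ["orange"] * 6)
    train, test = train_test_split(candies, test_size=0.5,
                                   stratify=candies, random_state=0)
    print(train.value_counts())  # 3 mango, 3 orange: still balanced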

 

Types of Imbalance

Imbalance is classified by how large the difference between the classes actually is; based on that difference we define the degree of imbalance.

Slight Imbalance - Here the difference between the classes is insignificant. Class distributions in the range of roughly 40:60 fall into this category.

Severe Imbalance - Here the difference between the classes is significant. Distributions where the majority class covers roughly 70% to 99% of the data (e.g., a 10:90 split) fall into this category.
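
As a rough illustration, the degree of imbalance can be read off the majority-class share. The helper below is a heuristic sketch using the cut-offs described above; it is not a standard library function.

    import pandas as pd

    def imbalance_degree(labels: pd.Series) -> str:
        # share of the most frequent class
        majority_share = labels.value_counts(normalize=True).max()
        if majority_share <= 0.6:
            return "slight imbalance"
        if majority_share >= 0.7:
            return "severe imbalance"
        return "in-between"

    print(imbalance_degree(pd.Series([0] * 90 + [1] * 10)))  # severe imbalance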

 

Why it is important to handle imbalanced data

The main reason we try to remove the imbalance between classes is that it strongly affects the quality of our model. Why is that? Standard machine learning classification algorithms are biased toward the classes with a large number of examples; an event that occurs infrequently is treated as a rare event. Classes with little data are therefore treated like noise and often ignored, so the model ends up predicting mostly the majority classes. This typically yields a high accuracy score but a poor F1-score.
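
A small synthetic demonstration of this effect: a baseline that always predicts the majority class reaches 90% accuracy on a 90:10 dataset, yet its F1-score on the minority class is 0. This sketch uses scikit-learn's DummyClassifier on made-up data.

    import numpy as np
    from sklearn.dummy import DummyClassifier
    from sklearn.metrics import accuracy_score, f1_score

    y = np.array([0] * 90 + [1] * 10)   # 90:10 imbalance
    X = np.zeros((100, 1))              # features don't matter here

    clf = DummyClassifier(strategy="most_frequent").fit(X, y)
    y_pred = clf.predict(X)
    print(accuracy_score(y, y_pred))    # 0.9 -> looks good
    print(f1_score(y, y_pred))          # 0.0 -> minority class never predicted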

 

How to handle imbalanced data

Using data resampling techniques

This is the data-level approach, in which we resample the data so that each class ends up with a comparable number of examples.

 

Random Under Sampling

In this sampling technique, only a fraction of the majority class (the class with more data) is kept, so that its size comes down to that of the minority class.

However, this kind of sampling discards a large amount of data and may therefore keep the model from reaching its full potential.
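
The idea can be sketched with plain pandas, assuming a DataFrame df with a binary target column (both names are placeholders); the imblearn version is shown in the worked example later.

    import pandas as pd

    def random_undersample(df, target, seed=0):
        # keep only as many rows of each class as the minority class has
        minority_size = df[target].value_counts().min()
        parts = [group.sample(n=minority_size, random_state=seed)
                 for _, group in df.groupby(target)]
        return pd.concat(parts).sample(frac=1, random_state=seed)  # shuffle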

 

Random Over Sampling

In this sampling technique, the minority class (the class with less data) is replicated to reduce the gap between the sizes of the majority and minority classes.

However, this kind of sampling can cause overfitting, since it introduces a lot of redundancy into the dataset.
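
The mirror-image sketch for random oversampling, under the same assumed df/target layout: minority rows are duplicated (sampled with replacement) until every class matches the majority class size.

    import pandas as pd

    def random_oversample(df, target, seed=0):
        # grow every class to the size of the majority class
        majority_size = df[target].value_counts().max()
        parts = [group.sample(n=majority_size, replace=True, random_state=seed)
                 for _, group in df.groupby(target)]
        return pd.concat(parts).sample(frac=1, random_state=seed)  # shuffle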

 

SMOTE (Synthetic Minority Oversampling Technique)

This technique avoids the overfitting caused by random oversampling, where the added data is an exact replica of existing rows. SMOTE takes a subset of the minority class and creates synthetic instances from it by interpolating between each sample and its nearest minority-class neighbors. These synthetic instances differ from the original rows and are added to the original dataset; the resulting dataset is then used to train the classification models.

However, it tends to be ineffective for high-dimensional data.
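
A minimal sketch of plain SMOTE from the imbalanced-learn library on synthetic data; k_neighbors controls how many minority neighbours are considered when interpolating each synthetic point.

    from collections import Counter
    from sklearn.datasets import make_classification
    from imblearn.over_sampling import SMOTE

    X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
    print(Counter(y))                                  # roughly 900:100
    X_res, y_res = SMOTE(k_neighbors=5, random_state=0).fit_resample(X, y)
    print(Counter(y_res))                              # both classes equal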

 

Using algorithmic ensembling techniques

In this approach, we use ensemble algorithms that combine resampling with model training, correcting the class imbalance at the algorithm level.

 

Bagging Based Balancing Method

Here the data is divided into several different subsets (bootstrap samples), a separate model is trained on each subset, and the final result is obtained by aggregating the outputs of all these models.
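
One concrete option, sketched here on synthetic data, is imbalanced-learn's BalancedBaggingClassifier, which resamples each bootstrap sample so that every base estimator is trained on balanced data before the outputs are aggregated.

    from imblearn.ensemble import BalancedBaggingClassifier
    from sklearn.datasets import make_classification

    X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
    # default base estimator is a decision tree; each of the 10 trees
    # is fitted on a balanced bootstrap sample
    bbc = BalancedBaggingClassifier(n_estimators=10, random_state=0)
    bbc.fit(X, y)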

 

Boosting Based Balancing Method

It uses two boosting methods, AdaBoost and Gradient Tree Boosting, to perform the balancing of classes. The basic intuition is to combine base/weak classifiers, each giving only average results, into a strong learner.
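
As one hedged example, imbalanced-learn's RUSBoostClassifier follows the AdaBoost scheme but randomly undersamples the majority class at each boosting round; the sketch below uses synthetic data.

    from imblearn.ensemble import RUSBoostClassifier
    from sklearn.datasets import make_classification

    X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
    # 50 weak learners combined sequentially, each fitted on resampled data
    rusboost = RUSBoostClassifier(n_estimators=50, random_state=0)
    rusboost.fit(X, y)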

 

Performing class balancing on the Telecom Churn Dataset

    # importing the necessary libraries 
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    import imblearn
    from sklearn.linear_model import LogisticRegression
    from sklearn.preprocessing import OneHotEncoder
    from sklearn.metrics import confusion_matrix
    from sklearn.metrics import classification_report
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score
    from imblearn.under_sampling import RandomUnderSampler
    from imblearn.over_sampling import RandomOverSampler
    from imblearn.combine import SMOTETomek
    import warnings
    warnings.filterwarnings('ignore')
  
    # Reading the csv data and checking the columns
    # Download the churn data from this link
    data = pd.read_csv("churn-data.csv")
    data.columns

Index(['customerID', 'gender', 'SeniorCitizen', 'Partner', 'Dependents',
       'tenure', 'PhoneService', 'MultipleLines', 'InternetService',
       'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport',
       'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling',
       'PaymentMethod', 'MonthlyCharges', 'TotalCharges', 'Churn'],
      dtype='object')

    # Taking a first glance at the data
    pd.set_option('display.max_columns', None)
    data.head()

   customerID  gender  SeniorCitizen Partner Dependents  tenure PhoneService     MultipleLines InternetService OnlineSecurity OnlineBackup DeviceProtection TechSupport StreamingTV StreamingMovies        Contract PaperlessBilling              PaymentMethod  MonthlyCharges TotalCharges Churn
0  7590-VHVEG  Female              0     Yes         No       1           No  No phone service             DSL             No          Yes               No          No          No              No  Month-to-month              Yes           Electronic check           29.85        29.85    No
1  5575-GNVDE    Male              0      No         No      34          Yes                No             DSL            Yes           No              Yes          No          No              No        One year               No               Mailed check           56.95       1889.5    No
2  3668-QPYBK    Male              0      No         No       2          Yes                No             DSL            Yes          Yes               No          No          No              No  Month-to-month              Yes               Mailed check           53.85       108.15   Yes
3  7795-CFOCW    Male              0      No         No      45           No  No phone service             DSL            Yes           No              Yes         Yes          No              No        One year               No  Bank transfer (automatic)           42.30      1840.75    No
4  9237-HQITU  Female              0      No         No       2          Yes                No     Fiber optic             No           No               No          No          No              No  Month-to-month              Yes           Electronic check           70.70       151.65   Yes


 

    # Getting information about the columns
    data.info()

RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 
 17  PaymentMethod     7043 non-null   object 
 18  MonthlyCharges    7043 non-null   float64
 19  TotalCharges      7043 non-null   object 
 20  Churn             7043 non-null   object 
dtypes: float64(1), int64(2), object(18)      
memory usage: 1.1+ MB                         

    # Checking how large the gap between the target classes is
    # (value_counts keeps labels and counts aligned, unlike pairing unique() with value_counts())
    counts = data["Churn"].value_counts()
    sns.barplot(x=counts.index, y=counts.values)

Figure 1: Visual representation of the imbalanced dataset

    # This function encodes columns whose values are in Yes/No format;
    # note that it maps "No" to 1 and everything else (e.g. "Yes") to 0
    def encodingforYes(series):
        if series=="No":
            return 1
        else:
            return 0

    # Encoding gender: "Male" -> 1, anything else -> 0
    def encodingforgender(series):
        if series=="Male":
            return 1
        else:
            return 0

    # Columns whose values are in Yes/No format
    columns = ['Partner','Dependents','PhoneService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection'
             ,'TechSupport', 'StreamingTV', 'StreamingMovies','PaperlessBilling','Churn']

    # Iterating over the columns list and applying the encoding function
    for i in columns:
        data[i] = data[i].apply(encodingforYes)

    data['gender']=data['gender'].apply(encodingforgender)

    # Function to perform one-hot encoding on a categorical column
    def onehot(column):
        encoder = OneHotEncoder(sparse=False)
        encoded = encoder.fit_transform(np.array(column).reshape(-1,1))
        # encoder.categories_ lists the levels in the same order as the encoded columns
        newheading = [str(i)+str(column.name) for i in encoder.categories_[0]]
        subframe = pd.DataFrame(encoded,columns=newheading)
        return subframe

    # Iterating over the categorical columns
    onehotcolumns=['MultipleLines','InternetService','Contract','PaymentMethod']

    for i in onehotcolumns:
        column = data[i]
        data.drop(i,axis=1,inplace=True,)
        frame = onehot(column)
        data = pd.concat([data,frame],axis=1)

    # Converting object type data into numeric type, as Logistic Regression can't handle object data
    def conversion(value):
        if(value==" "):
            return '0'
        else:
            return value
    data['TotalCharges'] = data['TotalCharges'].apply(conversion)    
    data['TotalCharges'] = pd.to_numeric(data['TotalCharges'])

    # Dropping unwanted column
    data.drop("customerID",axis=1,inplace=True)
    len(data.columns)

29

    # Initializing a Simple LogisticRegression object 
    lr = LogisticRegression()

    # Distributing data in terms of independent and dependent variables
    X = data.drop('Churn',axis=1)
    y = data['Churn']

    # Splitting data into train and test sets (stratified so both keep the original churn ratio)
    X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.3,stratify=y,random_state=42)

    # Fitting data
    lr.fit(X_train,y_train)

    # Predicting the data 
    y_pred = lr.predict(X_test)

    # Checking the precision, recall, f1-score and accuracy (classification_report expects y_true first)
    print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.56      0.66      0.60       470
           1       0.90      0.85      0.87      1643

    accuracy                           0.81      2113
   macro avg       0.73      0.75      0.74      2113
weighted avg       0.82      0.81      0.81      2113

    # Now performing undersampling with sampling_strategy=1, which means both classes end up the same size
    undersampler = RandomUnderSampler(sampling_strategy=1)

    X_under,y_under = undersampler.fit_resample(X,y)


    counts_under = y_under.value_counts()
    sns.barplot(x=counts_under.index, y=counts_under.values)

Figure 2: Balanced data using under sampling

    print(y_under.value_counts())

1    1869
0    1869
Name: Churn, dtype: int64


    X_under_train,X_under_test,y_under_train,y_under_test = train_test_split(X_under,y_under,test_size=0.3)
    lr.fit(X_under_train,y_under_train)
    y_under_pred = lr.predict(X_under_test)
    print(classification_report(y_under_test,y_under_pred))

              precision    recall  f1-score   support

           0       0.80      0.72      0.76       608
           1       0.71      0.79      0.75       514

    accuracy                           0.75      1122
   macro avg       0.76      0.76      0.75      1122
weighted avg       0.76      0.75      0.75      1122

You can see that after undersampling the scores dropped, since we lost some of the information contained in the majority class.

    # Now performing oversampling with sampling_strategy=1, which means both classes end up the same size
    oversampler = RandomOverSampler(sampling_strategy=1)

    X_over,y_over = oversampler.fit_resample(X,y)


    counts_over = y_over.value_counts()
    sns.barplot(x=counts_over.index, y=counts_over.values)

Figure 3: Balanced data using over sampling

    print(y_over.value_counts())

1    5174
0    5174
Name: Churn, dtype: int64

    X_over_train,X_over_test,y_over_train,y_over_test = train_test_split(X_over,y_over,test_size=0.3)
    lr.fit(X_over_train,y_over_train)
    y_over_pred = lr.predict(X_over_test)
    print(classification_report(y_over_test,y_over_pred))

              precision    recall  f1-score   support

           0       0.81      0.75      0.78      1670
           1       0.73      0.79      0.76      1435

    accuracy                           0.77      3105
   macro avg       0.77      0.77      0.77      3105
weighted avg       0.77      0.77      0.77      3105

You can see that after random oversampling the scores are still lower, as the duplicated rows add redundancy and encourage overfitting.

 

    # Now performing resampling using SMOTETomek (SMOTE oversampling combined with Tomek-links cleaning) with sampling_strategy=1
    smote = SMOTETomek(sampling_strategy=1)

    X_smote,y_smote = smote.fit_resample(X,y)

    counts_smote = y_smote.value_counts()
    sns.barplot(x=counts_smote.index, y=counts_smote.values)

Figure 4: Balanced data using SMOTETomek

    print(y_smote.value_counts())

1    4724
0    4724
Name: Churn, dtype: int64

    X_smote_train,X_smote_test,y_smote_train,y_smote_test = train_test_split(X_smote,y_smote,test_size=0.3)
    lr.fit(X_smote_train,y_smote_train)
    y_smote_pred = lr.predict(X_smote_test)
    print(classification_report(y_smote_test,y_smote_pred))

              precision    recall  f1-score   support

           0       0.81      0.79      0.80      1455
           1       0.79      0.80      0.80      1401

    accuracy                           0.80      2856
   macro avg       0.80      0.80      0.80      2856
weighted avg       0.80      0.80      0.80      2856

You can see that after resampling with SMOTETomek the scores improved, since the synthetic samples differ from the original data rather than merely duplicating it.

 

Conclusion

Whenever a dataset contains imbalanced classes, we can't rely directly on our machine learning model's accuracy metric, as the output is generally skewed toward the majority class, so it is very important to perform data resampling. You should also read through this article on Cross Validation in ML Models; it will help you with techniques for validating your models.

 

About the Authors:

Utkarsh Bahukhandi

Utkarsh Bahukhandi is a B.Tech undergraduate from Maharaja Agrasen Institute of Technology. He is a data science enthusiast who explores challenging projects in the ML and DS niche, such as Natural Language Processing and Computer Vision.

 

Mohan Rai

Mohan Rai is an alumnus of IIM Bangalore; he has completed his MBA from University of Pune and Bachelor of Science (Statistics) from University of Pune. He is a Certified Data Scientist by EMC. Mohan is a learner and has been enriching his experience throughout his career by exposing himself to several opportunities in the capacity of an Advisor, Consultant and a Business Owner. He has more than 18 years' experience in the field of Analytics and has worked as an Analytics SME on domains ranging from IT, Banking, Construction, Real Estate, Automobile, Component Manufacturing and Retail. His functional scope covers areas including Training, Research, Sales, Market Research, Sales Planning, and Market Strategy.