Understanding Data Pipelines For Machine Learning Using Sci Kit Learn A Blog Around Data Pipelines And End To End Machine Learning In Python.

Table of Content

Inspiration
Dataset exploration
Dependency Packages Installation in your python environment
Exploratory Data Analysis
Missing value treatment
Outlier Treatment
1. What to do with these outliers
Feature engineering
Model Building
Conclusion

Inspiration

The objective of this project is to identify the presence of diabetes in patient with the help of machine learning (Mathematical complicated calculation, Pun intended). This would help the doctors and insurance companies to save their time and money respectively. I am saying this because to check whether you have diabetes or not you have to go through various tests which may cost you time and money.

Let’s think you have health insurance then they have to bear that cost right. Now if insurance company have some predictive capability to predict the probability of a person having diabetes, it will help them to test only higher probability candidates. This obviously reduces company expenses.

This project is the approach to build that predictive capability using machine learning algorithm by way of automated pipelines. This is an end to end project.

Dataset exploration

First step in ML project is to understand the problem and dataset. Here the datasets we will be using consists of several medical variables (features) and one target variable (Outcome).

Pregnancy: This indicates number of times a female subject was pregnant
Glucose: Glucose concentration a 2 hour in an oral glucose tolerance test (mg/dl) - A 2-hour value between 140 and 200 mg/dL (7.8 and 11.1 mmol/L) is called impaired glucose tolerance. This is called "pre- diabetes." It means you are at increased risk of developing diabetes over time. A glucose level of 200 mg/dL (11.1 mmol/L) or higher is used to diagnose diabetes.
BloodPressure: Diastolic blood pressure (mm Hg)- If Diastolic B.P > 90 means High B.P (High Probability of Diabetes) Diastolic B.P < 60 means low B.P (Less Probability of Diabetes)
Skin Thickness: Triceps skinfold thickness (mm) -A value used to estimate body fat. Normal Triceps SkinFold Thickness in women is 23mm. Higher thickness leads to obesity and chances of diabetes increases.
Insulin: 2-Hour Serum Insulin (mu U/ml) - Normal Insulin Level 16-166 mIU/L Values above this range can be alarming.
BMI: Body Mass Index (weight in kg/ height in m2) - Body Mass Index of 18.5 to 25 is within the normal range BMI between 25 and 30 then it falls within the overweight range. A BMI of 30 or over falls within the obese range.
DiabetesPedigreeFunction: It provides information about diabetes history in relatives and genetic relationship of those relatives with patients. Higher Pedigree Function means patient is more likely to have diabetes.
Age: Age in years
Outcome: Class variable (either 0 or 1)- where ‘0’ denotes having diabetes and ‘1’ denotes patient having diabetes The dependent variable is whether the patient is having diabetes or not.

Dependency Packages Installation in your python environment

We will install packages in python by using “pip” followed by install package name. For example, write “ pip install sklearn ” and run that tab when you are using jupyter notebook.

Required packages are as below:

Pandas- to perform dataframe operations
Numpy- to perform array operations
Matplotlib- to plot charts
Seaborn – to plot charts
Missingo- anlysing the missing values
sklearn – to build the model
pickle- to save and load the model

Loading the libraries and dataset

    # import required libraries
    import pandas as pds
    import numpy as np
    import matplotlib.pyplot as plot
    import seaborn as sn
    %matplotlib inline
    import warnings
    warnings.filterwarnings('ignore')
    import missingno as msno

Download the Diabetic data set by clicking here.

    # Loading the dataset( Assumption : data file is in working directory )
    diabetes_df=pds.read_csv('health_care_diabetes.csv')

Exploratory Data Analysis

For classification problems we need a balanced data set. We will check imbalance in our data by using value_counts function which in turn counts the rows of each class in one column.

    # Check if the data is a Balanced one or not
    print(diabetes_df.Outcome.value_counts())
    diabetes_df.Outcome.value_counts().plot(kind='barh')

0 500

1 268

Name: Outcome, dtype: int64

Distribution of the target variable to check whether its balanced data set

Figure 1 : Distribution of the target variable (Diabetic - Yes/No) to confirm unbalanced data set problem

Key observations from figure 1,

we got an imbalanced dataset

    # Compute basic statistic for our data 
    # Columns with non-numerical data would be not be displayed unless we specify the parameter include="all"

    #transpose the result use '.T'
    diabetes_df.describe().T

Basic statistic for the diabetics data set

Figure 2 : Descriptive Statistics for the Diabetic data set

Key observations from figure 2,

After looking at the mean, median and mode along with the histogram we can say that most of features are skewed just because of the zero values concentration on one side, and having some outliers
On following columns, zero does not make sense and indicates presence of missing value.
1. Glucose
2. BloodPressure
3. SkinThickness
4. Insulin
5. BMI
As total observation are 768, so for now we can say there is no missing values.

Recommended to replace zero values with NaN for better understanding and interpretation of dataset.

Missing value treatment

    # making a copy of dataframe
    df_copy = diabetes_df.copy(deep = True)
    #replace zero values as nan in relevent columns
    for i in df_copy[['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']]:
        df_copy[i].replace(0, np.nan, inplace= True)

    # we will plot missing matrix map with missingno package
    msno.bar(df_copy, figsize=(8,6),fontsize=10)

Identifying the missing values in the data by looking at the count distribution

Figure 3 : Identifying the missing values in the data by looking at the count distribution

Now we know there are so many missing values. There are two ways to handle this issue, we can delete the whole row containing missing value or deleting the whole column if missing value percentage is more than 60%. Here in this dataset we got missing values in insulin column but we know that for predicting diabetes insulin is best feature thus we will keep all the features as it is and we will impute missing values by using mean value of that column.

    # split data in two dataframes based on Outcome to impute NaN values
    df1=df_copy[(df_copy.Outcome==0)]
    df2=df_copy[(df_copy.Outcome==1)]
    df1.fillna(df1.mean(),inplace=True)
    df2.fillna(df2.mean(),inplace=True)

    df_copy1=pds.concat((df1,df2),axis=0)
    df_copy1.sort_values(by='Age',inplace=True)
    df_copy1.reset_index(drop=True,inplace=True)
    df_copy1.isna().sum()

Pregnancies 0
Glucose 0
BloodPressure 0
SkinThickness 0
Insulin 0
BMI 0
DiabetesPedigreeFunction 0
Age 0
Outcome 0
dtype: int64

we have successfully imputed the missing values.

Outlier Treatment

We will plot boxplot to analyse and understand the extreme values , so that we can better understand how many values are out of the interquartile range and which features are having more number of outliers.

    # Outliers Visualization
    def plot_outliers(df):
        df_name = df.columns
        fig, axs = plot.subplots(1, len(df_name), figsize=(20, 10))

        for i, col in enumerate(df_name):
            axs[i].set_title(col)
            axs[i].boxplot(df[col])
        fig.suptitle('Outliers');

    plot_outliers(df_copy1)

checking the outliers in diabetics data using box plots

Figure 4 : Checking the outliers in diabetics data using box plots

What to do with these outliers

From above boxplot and function output we came to know that there are some outliers which are very extreme values.

Outliers are having some useful information or they are just because of typo error. we will verify it by removing extreme values at first....If accuracy not achieved then we will drop all outlier's.
Insulin and DiabetesPedigreefunction having many outlier's. so we will remove only extreme values just to avoid data loss.

Note:

Outliers leads to make wrong decisions by making wrong if else loop in decesion trees whenever outlier is a typo and also its not easy for liniear methods to handle outliers.

But when it is genuine and containing useful information at that time droping it will loss of information. In short model will not be able to be generalised.

    # Insulin value above 420 are extreme and making tail longer so we will drop them
    df_copy1=df_copy1[(df_copy1.Insulin<=420)]

    # skinthickness above 80 has only one datapoint so we will drop it
    df_copy1=df_copy1[(df_copy1.SkinThickness<=80)]

    # Blood Pressure below 20 doesn't make sense so we will drop it
    df_copy1=df_copy1[(df_copy1.BloodPressure>=20)]

    # BMI above 60 has only one datapoint so we will drop it
    df_copy1=df_copy1[(df_copy1.BMI<=60)]

    # DiabetesPedigreeFunction value above 2 are extreme and making tail longer so we will drop them
    df_copy1=df_copy1[(df_copy1.DiabetesPedigreeFunction<=2)]

    # Age value above 80 are extreme and making tail longer so we will drop them
    df_copy1=df_copy1[(df_copy1.Age<=80)]

    # Pregnancies value above 15 are extreme and making tail longer so we will drop them
    df_copy1=df_copy1[(df_copy1.Pregnancies<=15)]
    df_clean=df_copy1.copy(deep=True)

    # dataframe shape before and after outlier treatment
    print('Before outlier removal dataset shape is', df_copy.shape)
    print('After outlier removal dataset shape is', df_copy1.shape)

Before outlier removal dataset shape is (768, 9)
After outlier removal dataset shape is (744, 9)

We have successfully removed those extreme values from the dataset which may have created more noise to build a model, resulting in lower score.

Further we will do some feature engineering and EDA to better understand the features and creating newer ones.

Feature engineering

We will further analyse the dataset distribution using pairplot method available in seaborn.

    # pairplot 
    sn.pairplot(df_clean, hue='Outcome')

Look at the distribution of diabetic data using pairplot in seaborn

Figure 5 : Look at the distribution of diabetic data using pairplot in seaborn

    # close look on scatter plot of Glucose vs Age which seems to be corelated
    sn.scatterplot(x = df_clean['Glucose'], y = df_clean['Age'], hue = "Outcome", data = df_clean)

Scatter plot of glucose and age highlights the correlation

Figure 6 : Scatter plot of glucose and age highlights the correlation

Observations :

From above pairplot diagonal KDE plot we can conclude that our clean data looks normally distributed. Except Pregnancies and Age and DiabetesPedigreeFunction these features slightly right skewed
Scatterplots showing data points are not easily linearly separable as they are overlapped, thus linear classification algorithms will not be able to give higher accuracy.
Pairplot shows 'BloodPressure' and 'DiabetesPedigreeFunction' has similar KDE distribution for both Outcomes so it might not suitable for separation
Glucose has close relation with insulin and Age and most of healthy peopels found within a range of Glucose 40 to 120 and data is more concentrated there, Over this Glucose range we can predict most probably the person is Diabetic

we will create new features depending on what we come to know from our EDA

    df_clean['AG_Prob_D']= np.where(((df_clean.Glucose<=150) & (df_clean.Age<30)),0,1)

    df_clean['IG_Prob_D']= np.where(((df_clean.Insulin<=150)&(df_clean.Glucose<=130)),0,1)

    df_clean['Preg_DPF']= df_clean.DiabetesPedigreeFunction*df_clean.Pregnancies

Correlation Heatmap

    # correlation of features
    corr=df_clean.corr()
    sn.heatmap(corr[(corr >= 0.2) & (corr >= -0.2)],annot=True)

Correlation grid for diabetic data set

Figure 7 : Correlation grid for diabetic data set

Observations:

DiabetesPedigreeFunction and Blood pressure has no linear corelation with outcome more than 0.2. So these variables will not be helpful with linear classification models
No feature is highly corelated with outcome. Insulin, Glucose and newly created features are having slightly higher corelation with outcome.

Model Building

Train Test Split (Hold Out Method) : A mechanism of identifying the train and test data splits, usually a 2/3 blocked splits. If there is surety of absence of bias-variance anomaly, then the modeller can decide whether he want's a 70:30 split or 70:20:10 split. In certain cases, we may go for 50:50, specifically when we expect over learning by the model.

Cross Validation: Instead of segmenting the data into train, test and validation sets, or train test splits, we can get equal sized blocks called as folds. As a programmer, you can decide, if you want 3/5/7... folds. One of the folds will be the test set and the other combined would be the train set. There are many variations within this adding randomness vs sequential.

Stratify : Stratify parameter makes sure that the number of cases in the population data is maintained in the same proportion in the split data sets.

    # Its always recommended to set a seed for reproducibility, so that you can simulate the same sets for experimenation and rectification of models
    SEED = 10

    # converting and splitting  features and label into numpy array train and test data
    from sklearn.model_selection import train_test_split
    features, label = df_clean.iloc[:,[1,2,3,4,5,7,9,10,11]],df_clean.iloc[:,[8]]

    # train, test split(we will be validate our model on 20?ta with test cases and will build model with CV)

    X_train, X_test, y_train, y_test = train_test_split(features, label, test_size=0.2, random_state=SEED,stratify=label)

What is Imbalanced Dataset

In order to make our dataset Balanced we will use imblearn package. With the help of oversampling we will get a balanced dataset for modeling so that our model will not tends towards predicting more zero class rather than one class for which we are interested.

Note: we are considering that we are focused to classify correctly the Diabetic(outcome=1) person. (If both outcomes have same weight then there is no need of oversampling.)

    print("Before Over Sampling, counts: Features {} and Label {}".format(X_train.shape,y_train.shape)) 
    print("{}".format(y_train.Outcome.value_counts())) 
    # import SMOTE module from imblearn library 
    # Because of permission, installing from package managers like Anaconda Navigator may not work
    # Hence use Anaconda prompt or command prompt or terminal 
    # pip install -U imbalanced-learn
    from imblearn.over_sampling import SMOTE 
    sm = SMOTE(random_state = 64) 
    X_train_os, y_train_os = sm.fit_resample(X_train, y_train) 
  
    print("After Over Sampling, counts: Features {} and Label {}".format(X_train_os.shape,y_train_os.shape)) 
    print("{}".format(y_train_os.Outcome.value_counts()))

Before Over Sampling, counts: Features (595, 9) and Label (595, 1)
0 394
1 201
Name: Outcome, dtype: int64

After Over Sampling, counts: Features (788, 9) and Label (788, 1)
1 394
0 394
Name: Outcome, dtype: int64

Build the Machine Learning Model

To build an efficient model, we need to compare the performance of various machine learning classification algorithms. So that we can achieve best F1 score. F1 score is the harmonic mean of recall and precision.

    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.model_selection import cross_validate, cross_val_score,ShuffleSplit
    from sklearn.metrics import accuracy_score, confusion_matrix ,classification_report,roc_auc_score,auc
    from sklearn.svm import SVC
    from sklearn.linear_model import LogisticRegression
    from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
    from sklearn.neighbors import KNeighborsClassifier

    cv_split = ShuffleSplit(n_splits = 10, test_size = 0.1, train_size = 0.9, random_state = 0)
    pipeline = Pipeline([
        ('scale', StandardScaler()), # Step1 - normalize data
        ('clf', LogisticRegression()) # Step2 - classifier
    ])
    pipeline.steps


    clfs = []
    clfs.append(LogisticRegression())
    clfs.append(SVC())
    clfs.append(KNeighborsClassifier(n_neighbors=5))
    clfs.append(RandomForestClassifier(max_depth=6))          
    clfs.append(GradientBoostingClassifier(n_estimators=100))

    results=[]
    results1=[]
    results2=[]
    results3=[]
    for classifier in clfs:
        pipeline.set_params(clf = classifier)
        scores = cross_validate(pipeline, X_train, y_train, cv=cv_split, return_train_score=True)
        # default parameter fit for comparision with CV Score and Default fit score
        pipeline.set_params(clf = classifier)
        pipeline.fit(X_train,y_train)
        train_acc=pipeline.score(X_train, y_train)
        test_acc=pipeline.score(X_test, y_test)
        print('---------------------------------')
        print(str(classifier))
        print('-----------------------------------')
        for key, values in scores.items():
                if key=='test_score':
                    results.append({'Test_mean_Score': values.mean(), 'Test_Std': values.std()})
        for key, values in scores.items():
                if key=='train_score':
                    results1.append({'Train_mean_Score': values.mean(), 'Train_Std': values.std()})
        for key, values in scores.items():
                if key=='fit_time':
                    results2.append({'Fit_time': values.mean()})

        results3.append({'Def_Train_score':train_acc,'Def_Test_score':test_acc})
    results=pds.DataFrame(results,index=['LR','SVC','KNN','RF','GB'])
    results1=pds.DataFrame(results1,index=['LR','SVC','KNN','RF','GB'])
    results2=pds.DataFrame(results2,index=['LR','SVC','KNN','RF','GB'])
    results3=pds.DataFrame(results3,index= ['LR','SVC','KNN','RF','GB'])

    Compare_models=pds.concat((results,results1,results3,results2),axis=1)
    Compare_models.sort_values(by='Test_mean_Score',ascending=False)

Evaluation Metrics of models created using ML Pipelines

Figure 8 : Evaluation Metrics of models created using ML Pipelines

From figure 8, it is clear that Gradient Boosting model is performing very well with cross validation and with default setting also, so we will go with hyper tuning for Gradient Boosting model.

Hyper Tuning for Gradient Boosted Algorithm

    from sklearn.model_selection import RandomizedSearchCV
    pipeline.set_params(clf= GradientBoostingClassifier())
    pipeline.steps

    param_grid = {
        'clf__loss' : ['deviance', 'exponential'],
        'clf__n_estimators' : [20,35,55,75,85,95],
        'clf__random_state': np.arange(1,50),
        'clf__max_depth': [3,4,5,6,7,8,10]
    }

    gbm_grid = RandomizedSearchCV(pipeline,param_grid, cv=cv_split, n_iter = 10 )
    gbm_grid.fit(X_train, y_train)
    gbm_grid=gbm_grid.best_estimator_
    gbm_grid
    y_predict = gbm_grid.predict(X_test)
    y_prob= gbm_grid.predict_proba(X_test)
    accuracy = accuracy_score(y_test,y_predict)
    print(accuracy)
    print('auc=', roc_auc_score(y_test,y_prob[:,1]))

    print(classification_report(y_test,y_predict))
    confusion_matrix(y_test,y_predict)

0.8859060402684564

auc= 0.9648484848484848

precision recall f1-score support

0 0.94 0.89 0.91 99

1 0.80 0.88 0.84 50

accuracy 0.89 149

macro avg 0.87 0.88 0.88 149

weighted avg 0.89 0.89 0.89 149

array([[88, 11],

[ 6, 44]], dtype=int64)

In our final output we got F1 score of 84% for class 1 and 91% for class 0. This is pretty good score and we can go further to use this model by saving the model using pickle library as below:

    import pickle

    pickle.dump(gbm_grid, open("diabetes.pickle.dat", "wb"))

we can reuse this model again and again by using below code.

    model=pickle.load(open("diabetes.pickle.dat", "rb"))
    model

Conclusion

Automated pipelines are used exhaustively in product development wherein organizations offer cloud services for Auto ML. What we saw in this blog was an implementation on a pretty generic and exhausively used ML dataset - Diabetics data set. You should first execute this program once interpreting the details as you execute. Once done, I would suggest using a different data set and see, if it works for you. Although we have used pipelines for model selection, when it comes to hyper parameter tuning, we have again gone manual. As a next step, you can try automating that part by used of pipelines and user defined function. In some other blog, I would cover that as well. Do comment and let me know your inputs and any challenges faced while implementing this end to end project. Recommend you to read this blog on Machine learning using only Numpy.

About the Author:

Ganesh Kumar

Ganeshkumar Patel, is a Mechanical Engineer and applies data science for his process improvement. He loves to play with data and algorithms. During his spare time he likes to pen down his knowledge as a blogger.

Mohan Rai

Mohan Rai is an Alumni of IIM Bangalore , he has completed his MBA from University of Pune and Bachelor of Science (Statistics) from University of Pune. He is a Certified Data Scientist by EMC.Mohan is a learner and has been enriching his experience throughout his career by exposing himself to several opportunities in the capacity of an Advisor, Consultant and a Business Owner. He has more than 18 years’ experience in the field of Analytics and has worked as an Analytics SME on domains ranging from IT, Banking, Construction, Real Estate, Automobile, Component Manufacturing and Retail. His functional scope covers areas including Training, Research, Sales, Market Research, Sales Planning, and Market Strategy.

Understanding Data Pipelines For Machine Learning using Sci-kit learn: A blog around data pipelines and end to end machine learning in python.

Required packages are as below:

Loading the libraries and dataset

Observations :

Correlation Heatmap

Modal

Are you sure to delete this information ?