Table of Contents

  1. What is KNN?
  2. Where do you use KNN?
  3. What is K in K Nearest Neighbors?
  4. How does KNN work?
  5. What are the distance-based methods used in KNN?
  6. Disadvantages of KNN
  7. Problem solved on Breast Cancer Dataset using KNN
    1. Initializing libraries
    2. Importing dataset
    3. Dataset Variable Terms Explained
    4. Finding missing values
    5. Exploratory Data Analysis
    6. Sorting out the outlier issue
    7. Using Standardisation
    8. Splitting of Dataset to test and train
    9. Building KNeighbors classifier using simulation of different k values
    10. Building KNeighbors classifier
    11. Predicting for train data
    12. Finding the precision, recall, f1-score and support
    13. Finding accuracy Score for Train Data
    14. Predicting Test Data
    15. Accuracy Score for Test Data
    16. Pickle model as a file for usage at later stage
  8. Conclusion and Summary

 

 

What is KNN?

KNN stands for K-Nearest Neighbors, a supervised machine learning algorithm. It is a non-parametric method that solves both classification and regression problems. It predicts values from the nearest neighbors using distance-based measures; commonly used distance measures include Euclidean, Minkowski and Manhattan.

 

Where do you use KNN?

KNN is used for both classification and regression problems. It works well for small datasets and is particularly suitable for clinical datasets. For classification problems the output is the class the point belongs to, and in the case of regression the output is the average of the values of the neighbors considered.

 

What is the K in K Nearest Neighbors?

The letter K is the parameter that specifies the number of nearest neighbors. To predict the value of an unknown point in a classification or regression problem, the K value determines how many nearest neighbors vote on (or contribute to) the value for the unlabelled data point. The K value is always a positive integer.

 

How does KNN work?

KNN predicts an unknown value in an interesting way, namely by taking the majority vote among the classes of its nearest neighbors. It is like concluding that a person is a doctor because the majority of his nearest neighbors are doctors rather than engineers. The algorithm takes the K value, computes the distance between the unknown point and the labelled points using the chosen distance measure (such as Euclidean, Minkowski, Manhattan or chi-square), finds the K nearest neighbors by that distance, and predicts the class of the unknown point as the class with the highest number of votes among those neighbors. KNN does not really learn from the given dataset; it simply stores it and uses it at prediction time through the nearest neighbors, which is why it is called a lazy algorithm. For instance, if the unknown value is a colour and the majority of the neighbors are orange, the unknown point will be classified as orange.


The number given to the K value should always be considered carefully during model building, since K can affect accuracy and prediction and even lead to overfitting or underfitting of the data. Ideally an odd number is preferred, so that there is always a clear majority decision for every unlabelled data point.
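
The majority-vote idea can be captured in a few lines of code. Below is a minimal, hypothetical sketch (not the scikit-learn implementation used later in this article) that classifies one unknown point by taking the majority vote among its K nearest neighbors under Euclidean distance:

    import numpy as np
    from collections import Counter

    def knn_predict(X_train, y_train, x_new, k=3):
        # distance from the unknown point to every labelled point
        distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
        # indices of the k closest neighbors
        nearest = np.argsort(distances)[:k]
        # majority vote among the neighbors' classes
        return Counter(y_train[nearest]).most_common(1)[0][0]

    # toy usage: two classes, predict the class of a new point
    X_train = np.array([[1, 1], [1, 2], [5, 5], [6, 5]])
    y_train = np.array([0, 0, 1, 1])
    print(knn_predict(X_train, y_train, np.array([5, 6]), k=3))   # prints 1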

 

What are the distance-based methods used in KNN?

Distance-based methods used in KNN include Euclidean distance, Minkowski distance, chi-square distance, cosine similarity, Hamming distance, Chebyshev distance and Mahalanobis distance. Note that the variables in the dataset should be scaled to a common scale before applying any distance-based measure. Removal of outliers is also important: since KNN relies on distances, data points lying very far away from the rest of the data can adversely affect the distance measurement.
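
As a quick illustration of how these measures differ, the snippet below computes the Euclidean, Manhattan and Minkowski distances between two points; this is just a small sketch and assumes scipy is installed (Minkowski with p=2 is the same as Euclidean, and with p=1 the same as Manhattan):

    from scipy.spatial import distance

    a = [2.0, 4.0, 6.0]
    b = [5.0, 8.0, 10.0]

    print(distance.euclidean(a, b))        # sqrt(9 + 16 + 16) = 6.40...
    print(distance.cityblock(a, b))        # Manhattan distance: 3 + 4 + 4 = 11
    print(distance.minkowski(a, b, p=2))   # identical to the Euclidean result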

 

Disadvantages of KNN

  1. KNN cannot be used for large datasets.
  2. KNN does not learn from the dataset; prediction is always based on voting among the nearest neighbors, so the predicted values will not always be correct.
  3. KNN becomes very slow when working with a large amount of data.
  4. It is sensitive to outliers, and the data must be scaled to a common unit.
  5. KNN requires a large amount of memory for storing and processing large datasets.

 

Problem solved on Breast Cancer Dataset using KNN

 

STEP 1 : Initializing libraries

    import pandas as pd
    import numpy as np
    import seaborn as sns 
    import matplotlib.pyplot as plt

STEP 2 : Importing dataset

Click here to download the dataset.

    # use pandas read_csv to ingest the downloaded data into your python environment
    data = pd.read_csv(r"E:\Imurgence\dataR2.csv")

STEP 3 : Dataset Variable Terms Explained

    #Age of person (years)
    #BMI of person (kg/m2)
    #Glucose level (mg/dL) - concentration of blood sugar; a typical reference range is 70 to 126 mg/dL.
    #Insulin level (µU/mL) - Insulin is a hormone produced by cells in the pancreas to control the amount of glucose (a type of sugar) flowing in the blood and its absorption by the body. A normal fasting insulin level is between 5 and 15 µU/mL.
    #HOMA - Homeostatic Model Assessment (an index of insulin resistance)
    #Leptin (ng/mL) - helps inhibit hunger and regulate energy balance, so the body does not trigger hunger responses when it does not need energy.
    #Adiponectin (µg/mL) - a protein hormone and adipokine involved in regulating glucose levels and fatty acid breakdown. In humans it is encoded by the ADIPOQ gene and is produced primarily in adipose tissue, but also in muscle and even in the brain.
    #Resistin (ng/mL) - Resistin increases bad cholesterol in the liver, which can result in heart disease.
    #MCP.1 (pg/dL) - Monocyte chemoattractant protein-1 (MCP-1/CCL2) is one of the key chemokines that regulate the migration and infiltration of monocytes/macrophages.

    #Classification labels:
    #1 for healthy controls
    #2 for patients

    data


Figure 1 : Coimbra data set

    #Number of rows and columns in the dataset
    print(len(data), len(data.columns))

116 10

 

STEP 4 : Finding missing values

    data.isna().sum()


Figure 2 : Check missing value in data set

 

STEP 5 : Exploratory Data Analysis

    #Type of Data in Each Variable
    data.info()


Figure 3 : Data type of individual variables in the Coimbra dataset

 

Is multicollinearity a problem in KNN?

KNN is a non-parametric method and multicollinearity is not part of its assumptions. The heat map below shows the correlation among the independent variables in the dataset; we will not consider multicollinearity as we go ahead.
 

    #Heatmap to find correlation
    plt.subplots(figsize=(20,20))
    sns.heatmap(data.corr(),cmap='RdYlGn',annot=True)

Figure 4 : Visualization of the Correlation grid

 

    #Column names
    data.columns

Index(['Age', 'BMI', 'Glucose', 'Insulin', 'HOMA', 'Leptin', 'Adiponectin',
       'Resistin', 'MCP.1', 'Classification'],
      dtype='object')

 

STEP 6 : Sorting out the outlier issue

As discussed earlier, KNN is adversely affected by outliers since it is a distance-based method. We now find the outliers and remove them, since they increase variability and might affect prediction when a distance-based measure is used.

 

    #No outliers for age
    sns.boxplot(data['Age'])

Figure 5 : Boxplot of age

 

    #NO outliers for BMI
    sns.boxplot(data['BMI'])

Figure 6 : Boxplot of BMI

 

    #Some outliers are present for Glucose and the data is skewed
    sns.boxplot(data['Glucose'])

Figure 7 : Boxplot of Glucose

 

    #Outliers are present in Insulin
    sns.boxplot(data['Insulin'])

Figure 8 : Boxplot of Insulin

 

    #Lots of outliers in HOMA
    sns.boxplot(data['HOMA'])

Figure 9 : Boxplot of HOMA

 

    #Distribution plot of HOMA (histplot with kde replaces the deprecated distplot)
    sns.histplot(data['HOMA'], kde=True, stat='density')

Figure 10 : Distribution plot of HOMA

 

    #Outliers present for Leptin
    sns.boxplot(data['Leptin'])

Figure 11 : Boxplot of Leptin

 

    #Outliers present for Adiponectin
    sns.boxplot(data['Adiponectin'])

Figure 12 : Boxplot of Adiponectin

 

    #Outliers present for Resistin
    sns.boxplot(data['Resistin'])

Figure 13 : Boxplot of Resistin

 

    #Outliers present for MCP.1
    sns.boxplot(data['MCP.1'])

Figure 14 : Boxplot of MCP

 

    #Removing outliers since they may affect prediction for KNN (quantile method)
    cancer=data.copy()

    insulinQ1=cancer['Insulin'].quantile(0.25)
    insulinQ3=cancer['Insulin'].quantile(0.75)
    insulinIQR=insulinQ3-insulinQ1
    lowerliminsulin=insulinQ1-1.5*insulinIQR
    upperliminsulin=insulinQ3+1.5*insulinIQR
    insulrem=cancer[(cancer['Insulin']>lowerliminsulin)&(upperliminsulin > cancer['Insulin'])]
 

    sns.boxplot(insulrem['Glucose'])

Figure 15 : Boxplot of Glucose after censoring Insulin outliers

 

    glucoseQ1=insulrem['Glucose'].quantile(0.25)
    glucoseQ3=insulrem['Glucose'].quantile(0.75)
    glucoseIQR=glucoseQ3-glucoseQ1
    upperlimglucose=glucoseQ3+1.5*glucoseIQR
    lowerlimglucose=glucoseQ1-1.5*glucoseIQR
    glucoserem=insulrem[(insulrem['Glucose'] > lowerlimglucose)&(upperlimglucose > insulrem['Glucose'])]
   
    sns.boxplot(glucoserem['HOMA'])

Figure 16 : Boxplot of HOMA after censoring Glucose outliers

 

    homaQ1=glucoserem['HOMA'].quantile(0.25)
    homaQ3=glucoserem['HOMA'].quantile(0.75)
    homaIQR=homaQ3-homaQ1
    upperlimhoma=homaQ3+1.5*homaIQR
    lowerlimhoma=homaQ1-1.5*homaIQR
    homarem=glucoserem[(glucoserem['HOMA'] > lowerlimhoma)&(upperlimhoma > glucoserem['HOMA'])]
    
    sns.boxplot(homarem['Adiponectin'])
    

Figure 17 : Boxplot of Adiponectin after censoring HOMA outliers

 

    AdiponectinQ1=homarem['Adiponectin'].quantile(0.25)
    AdiponectinQ3=homarem['Adiponectin'].quantile(0.75)
    AdiponectinIQR=AdiponectinQ3-AdiponectinQ1
    upperlimAdiponectin=AdiponectinQ3+1.5*AdiponectinIQR
    lowerlimAdiponectin=AdiponectinQ1-1.5*AdiponectinIQR
    adirem=homarem[(homarem['Adiponectin'] > lowerlimAdiponectin)&(upperlimAdiponectin > homarem['Adiponectin'])]

    sns.boxplot(adirem['Resistin'])

Figure 18 : Boxplot of Resistin after censoring Adiponectin outliers

 

    resistinQ1=adirem['Resistin'].quantile(0.25)
    resistinQ3=adirem['Resistin'].quantile(0.75)
    resistinIQR=resistinQ3-resistinQ1
    lowerlimresistin=resistinQ1-1.5*resistinIQR
    upperlimresistin=resistinQ3+1.5*resistinIQR
    Resistinrem=adirem[(adirem['Resistin'] > lowerlimresistin)&(upperlimresistin > adirem['Resistin'])]
    
    sns.boxplot(Resistinrem['Leptin'])

Figure 19 : Boxplot of Leptin after censoring Resistin outliers

 

    LeptinQ1=Resistinrem['Leptin'].quantile(0.25)
    LeptinQ3=Resistinrem['Leptin'].quantile(0.75)
    LeptinIQR=LeptinQ3-LeptinQ1
    lowerlimLeptin=LeptinQ1-1.5*LeptinIQR
    upperlimLeptin=LeptinQ3+1.5*LeptinIQR
    leptinrem=Resistinrem[(Resistinrem['Leptin'] > lowerlimLeptin)&(upperlimLeptin > Resistinrem['Leptin'])]
 
    sns.boxplot(leptinrem['MCP.1'])

Figure 20 : Boxplot of MCP after censoring Leptin outliers

 

    MCPQ1=leptinrem['MCP.1'].quantile(0.25)
    MCPQ3=leptinrem['MCP.1'].quantile(0.75)
    MCPIQR=MCPQ3-MCPQ1
    lowerlimMCP=MCPQ1-1.5*MCPIQR
    upperlimMCP=MCPQ3+1.5*MCPIQR
    mcprem=leptinrem[(leptinrem['MCP.1'] > lowerlimMCP)&(upperlimMCP > leptinrem['MCP.1'])]
    mcprem.shape
 
    sns.boxplot(mcprem['MCP.1'])

Figure 21 : Boxplot of MCP in the final data
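
The chained filtering above applies the same 1.5 * IQR rule to one column after another. A hypothetical refactor (not part of the original workflow) that produces the same result with a small helper function could look like this:

    def remove_outliers_iqr(df, column, factor=1.5):
        # keep rows whose value lies strictly between Q1 - factor*IQR and Q3 + factor*IQR
        q1 = df[column].quantile(0.25)
        q3 = df[column].quantile(0.75)
        iqr = q3 - q1
        lower, upper = q1 - factor * iqr, q3 + factor * iqr
        return df[(df[column] > lower) & (df[column] < upper)]

    # apply the rule column by column, in the same order as above
    cleaned = data.copy()
    for col in ['Insulin', 'Glucose', 'HOMA', 'Adiponectin', 'Resistin', 'Leptin', 'MCP.1']:
        cleaned = remove_outliers_iqr(cleaned, col)
    print(cleaned.shape)   # should match mcprem.shape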

 

    # create the features from data
    X=mcprem.iloc[:,0:9]

    # create the target variable from data
    Y=mcprem.iloc[:,9]

 

STEP 7 : Using Standardisation to bring all values to one scale since KNN is a distance-based method

    from sklearn.preprocessing import StandardScaler
    ss=StandardScaler()
    X=ss.fit_transform(X)
    X=pd.DataFrame(X)

STEP 8 : Splitting of Dataset to test and train

    from sklearn.model_selection import train_test_split
    xtrain,xtest,ytrain,ytest=train_test_split(X,Y,test_size=0.3)
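
Note that the scaler above was fitted on the full dataset before splitting. A variation many practitioners prefer, so that the test set stays completely unseen, is to split first and fit the scaler on the training portion only; a minimal sketch of that ordering (using the same column positions) would be:

    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler

    # split on the unscaled features, then fit the scaler on the training set only
    xtr_raw, xte_raw, ytr, yte = train_test_split(mcprem.iloc[:, 0:9], mcprem.iloc[:, 9], test_size=0.3)
    scaler = StandardScaler()
    xtr_scaled = scaler.fit_transform(xtr_raw)
    xte_scaled = scaler.transform(xte_raw)   # reuse the training statistics on the test set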

STEP 9 : Building KNeighbors classifier using simulation of different k values

    #Importing KNeighbors Classifier from sklearn
    #Finding accuracies on train data and test data with Euclidean distance (Minkowski metric with default p=2)
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.metrics import accuracy_score
    for x in range(5,10,2):
        knn=KNeighborsClassifier(n_neighbors=x,metric='minkowski',weights='distance')
        knn.fit(xtrain,ytrain)
        train_ypred=knn.predict(xtrain)
        acc_train_score=accuracy_score(train_ypred,ytrain)
        test_ypred=knn.predict(xtest)
        acc_test_score=accuracy_score(test_ypred,ytest)
        print(f'Accuracy score for  train data and test data is {acc_train_score} and {acc_test_score} respectively for {x} neighbours')

Accuracy score for train data and test data is 1.0 and 0.7142857142857143 respectively for 5 neighbours

Accuracy score for train data and test data is 1.0 and 0.7857142857142857 respectively for 7 neighbours

Accuracy score for train data and test data is 1.0 and 0.7142857142857143 respectively for 9 neighbours

The train accuracies from the model building above with odd numbers of neighbors (5, 7 and 9) are all 100%, which is expected here because weights='distance' makes each training point its own zero-distance nearest neighbor. The test accuracies differ: 71.42% for 5 neighbors and 78.57% for 7 neighbors, so taking 7 neighbors is the better choice for model building. This example clearly shows that the number of neighbors has a strong impact on voting and prediction, so choosing the right number of neighbors is important, and it should preferably be an odd number.
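
Instead of comparing a handful of hand-picked values, the search for K can also be automated with cross-validation. The sketch below is an optional extension (not part of the original workflow) that uses GridSearchCV to try the odd values from 3 to 15 on the training data:

    from sklearn.model_selection import GridSearchCV
    from sklearn.neighbors import KNeighborsClassifier

    param_grid = {'n_neighbors': list(range(3, 16, 2))}   # odd values 3, 5, ..., 15
    grid = GridSearchCV(KNeighborsClassifier(metric='minkowski', weights='distance'),
                        param_grid, cv=5, scoring='accuracy')
    grid.fit(xtrain, ytrain)
    print(grid.best_params_, grid.best_score_)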

 

STEP 10 : Building the KNeighbors classifier using Euclidean distance and 7 neighbors as optimal

    knn=KNeighborsClassifier(n_neighbors=7,metric='minkowski',weights='distance')
    knn.fit(xtrain,ytrain)

STEP 11 : Predicting for train data

    trainypred=knn.predict(xtrain)

STEP 12 : Finding the precision, recall, f1-score and support

    from sklearn.metrics import classification_report
    print(classification_report(ytrain, trainypred))   # classification_report expects (y_true, y_pred)


Figure 22 : Classification report on train data set
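
If you also want to see the individual prediction counts behind these scores, a confusion matrix is a common companion to the classification report; this optional snippet uses the same fitted model:

    from sklearn.metrics import confusion_matrix

    # rows are the actual classes (1 = healthy controls, 2 = patients), columns are the predicted classes
    print(confusion_matrix(ytrain, trainypred))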

 

STEP 13 : Finding accuracy Score for Train Data

    accuracy_score(trainypred,ytrain)

1.0

 

STEP 14 : Predicting Test Data

    testypredicted=knn.predict(xtest)

STEP 15 : Accuracy Score for Test Data

    from sklearn.metrics import accuracy_score
    accuracy_score(testypredicted,ytest)

0.7857142857142857

STEP 16 : Pickle the model to a file for use at a later stage

    import pickle
    #Save our model as a pickle to a file
    pickle.dump(knn, open("my_knn_model.pickle.dat", "wb"))

    # delete the existing knn model from the environment
    del knn

    #Load the pickled object from the file
    load_knn=pickle.load(open("my_knn_model.pickle.dat", "rb"))

    # Use the loaded model to make predictions
    load_knn.predict(xtest)

Conclusion and Summary

As you can see, the dataset used is a small clinical dataset with fewer than 150 records, which makes it very suitable for building a KNN model. There were many outliers in almost all the independent variables, and these had to be removed. We scaled the dataset before model building, and the model was then built using 7 nearest neighbors with Euclidean distance. We obtained accuracies of 100% on the train data and 78.57% on the test data. Also note that we did not specify a seed while splitting the dataset, so the train and test sets will contain different records on each run and your accuracy and precision will vary accordingly (see the sketch below for a reproducible split). The precision and recall for the classes are at a good level. To learn more about the subject in detail, we recommend going through these free video courses. Always choose the best value of K for the best accuracy while avoiding overfitting or underfitting of the dataset; you can try different K values and different distance methods to arrive at the best optimised model. Remember to use KNN for small datasets; for large datasets it is better to use other parametric or non-parametric models.
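
If you want the split (and therefore the accuracies) to be reproducible between runs, you can pass a fixed seed to train_test_split; any integer works, 42 below is just an example:

    from sklearn.model_selection import train_test_split
    xtrain, xtest, ytrain, ytest = train_test_split(X, Y, test_size=0.3, random_state=42)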

 

 

About the Authors:

S. Ebenezer

S. Ebenezer is a Data Science intern at Simple & Real Analytics. With a strong passion for Data Science, he loves contributing to open source platforms. Apart from being a creative learner, he loves working on challenging opportunities. He holds a Bachelor's degree in Mechanical Engineering.

 

Mohan Rai

Mohan Rai is an alumnus of IIM Bangalore; he has completed his MBA from the University of Pune and a Bachelor of Science (Statistics) from the University of Pune. He is a Certified Data Scientist by EMC. Mohan is a learner and has been enriching his experience throughout his career by exposing himself to several opportunities in the capacity of an Advisor, Consultant and Business Owner. He has more than 18 years' experience in the field of Analytics and has worked as an Analytics SME on domains ranging from IT, Banking, Construction, Real Estate, Automobile, Component Manufacturing and Retail. His functional scope covers areas including Training, Research, Sales, Market Research, Sales Planning, and Market Strategy.