## What is KNN?

KNN stands for K-Nearest Neighbors, a supervised machine learning algorithm and a non-parametric method that solves both classification and regression problems. It predicts values for new data points from their nearest neighbors using distance-based measures. Common distance measures include Euclidean, Minkowski and Manhattan.

## Where do you use KNN?

KNN is used for both classification and regression problems. KNN works well for small datasets and is well suited to clinical datasets. For classification problems, the output is the class the point belongs to; for regression, the output is the average of the values of the neighbors considered.
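As a minimal sketch of the two modes, scikit-learn provides both a classifier and a regressor. The toy data below is invented purely for illustration:

```
import numpy as np
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

# Toy data, invented purely for illustration.
X = np.array([[1], [2], [3], [10], [11], [12]])
y_class = np.array([0, 0, 0, 1, 1, 1])             # class labels
y_reg = np.array([1.0, 1.2, 0.9, 5.1, 5.0, 4.8])   # continuous targets

# Classification: the predicted label is the majority class of the k neighbors.
clf = KNeighborsClassifier(n_neighbors=3).fit(X, y_class)
print(clf.predict([[2.5]]))   # -> [0]

# Regression: the prediction is the average of the k neighbors' values.
reg = KNeighborsRegressor(n_neighbors=3).fit(X, y_reg)
print(reg.predict([[2.5]]))   # -> mean of 1.0, 1.2 and 0.9
```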

## What is the K in K-Nearest Neighbors?

The letter K is the parameter that sets the number of nearest neighbors. To predict the value of an unknown point in a classification or regression problem, the k-value determines how many of the nearest neighbors vote on (or contribute to) the value for the unlabelled data point. The K value is always a positive integer.

## How does KNN work?

KNN predicts an unknown value in an interesting way: by taking the highest vote among the classes of its nearest neighbors. It is like guessing that a person is a doctor because the majority of the professions of their nearest neighbors are doctors rather than engineers. The KNN algorithm takes the K nearest neighbors of an unknown data point, computing the distance between each point and the unknown point with the given distance measure (such as Euclidean, Minkowski, Manhattan or chi-square). After finding the nearest neighbors by distance, it predicts the class of the unknown point by taking the class with the highest number of votes among those neighbors. KNN does not learn from the given dataset; it simply stores it and uses the nearest neighbors at prediction time, which is why it is called a lazy algorithm. If the unknown value is a color, for example, it will be predicted as orange when the majority of its neighbors are orange.

The number given to the k-value should always be considered carefully during model building, since the k-value affects accuracy and prediction and can lead to overfitting or underfitting of the data. Ideally an odd number is preferred, so that the vote always produces a decision for every unlabelled data point.
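To make the voting idea concrete, here is a minimal from-scratch sketch in plain NumPy (the function and data below are our own illustration, not part of any library). Note there is no training step at all, which is exactly why KNN is called a lazy algorithm:

```
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=5):
    """Predict the class of x_new by majority vote of its k nearest neighbors."""
    # Euclidean distance from x_new to every training point.
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Indices of the k closest training points.
    nearest = np.argsort(distances)[:k]
    # Majority vote among the neighbors' labels.
    votes = Counter(y_train[nearest])
    return votes.most_common(1)[0][0]

# Tiny made-up example: two clusters of labelled points.
X_train = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]])
y_train = np.array(['orange', 'orange', 'orange', 'blue', 'blue', 'blue'])
print(knn_predict(X_train, y_train, np.array([1.5, 1.5]), k=3))  # -> 'orange'
```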

## What are the distance-based methods used in KNN?

Distance-based methods used in KNN include Euclidean distance, Minkowski distance, chi-square distance, cosine similarity, Hamming distance, Chebyshev distance and Mahalanobis distance. Note that the variables in the dataset should be scaled to a common scale before applying any distance-based measure. Removal of outliers is also important: since KNN is a distance-based algorithm, data points lying very far away from the rest of the data can adversely affect the distance measurements.
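A quick sketch of how the most common measures are computed for two points; SciPy provides ready-made implementations of all of them:

```
import numpy as np
from scipy.spatial import distance

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, 3.0])

# Euclidean: straight-line distance, sqrt(sum((a_i - b_i)^2)).
print(distance.euclidean(a, b))    # 3.6055...

# Manhattan (cityblock): sum of absolute coordinate differences.
print(distance.cityblock(a, b))    # 5.0

# Minkowski generalises both: p=1 is Manhattan, p=2 is Euclidean.
print(distance.minkowski(a, b, p=3))

# Chebyshev: the largest coordinate-wise difference.
print(distance.chebyshev(a, b))    # 3.0
```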

## Disadvantages of KNN

1. KNN cannot be used for large datasets.
2. KNN doesn't learn the dataset; predictions are always based on a vote among the nearest neighbors, so the predicted values won't always be correct.
3. KNN becomes very slow when using large amounts of data.
4. It is sensitive to outliers, and the data must be scaled to a common unit.
5. KNN requires a lot of memory for the storage and processing of large datasets.

## Problem solved on the Breast Cancer Dataset using KNN

### STEP 1 : Initializing libraries

```
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
```

### STEP 2 : Importing dataset

```
# Use pandas read_csv to ingest the downloaded data into your Python environment.
# The file name below is an assumption: replace it with the path to your copy
# of the UCI Breast Cancer Coimbra dataset.
data = pd.read_csv('dataR2.csv')
```

### STEP 3 : Dataset Variable Terms Explained

```
# Age of person (years)
# BMI of person (kg/m2)
# Glucose level (mg/dL) - concentration of blood sugar; normal blood sugar in humans ranges between 70 and 126 mg/dL.
# Insulin level (µU/mL) - insulin is a hormone produced by cells in the pancreas to control the amount of glucose (a type of sugar) flowing in the blood and its absorption by the body. Normal fasting insulin is between 5 and 15 µU/mL.
# HOMA - Homeostatic Model Assessment
# Leptin (ng/mL) - helps inhibit hunger and regulate energy balance, so the body does not trigger hunger responses when it does not need energy.
# Adiponectin (µg/mL) - a protein hormone and adipokine involved in regulating glucose levels and fatty acid breakdown. In humans it is encoded by the ADIPOQ gene and is produced primarily in adipose tissue, but also in muscle and even in the brain.
# Resistin (ng/mL) - resistin increases bad cholesterol in the liver, which can result in heart disease.
# MCP.1 (pg/dL) - Monocyte chemoattractant protein-1 (MCP-1/CCL2) is one of the key chemokines that regulate migration and infiltration of monocytes/macrophages.

# Classification labels:
# 1 for healthy controls
# 2 for patients

data
```
Figure 1 : Coimbra data set

```
# Number of rows and columns in the data set
print(len(data), len(data.columns))
```
`116 10`

### STEP 4 : Finding missing values

```
data.isna().sum()
```
Figure 2 : Check missing values in the data set

### STEP 5 : Exploratory Data Analysis

```
# Type of data in each variable
data.info()
```
Figure 3 : Data types of individual variables in the Coimbra dataset

Is multicollinearity a problem in KNN?

KNN is a non-parametric method, and multicollinearity is not part of its assumptions. The heat map below shows the correlation among the independent variables in the dataset. We are not going to consider multicollinearity as we go ahead.

```
# Heatmap to find correlations
plt.subplots(figsize=(20,20))
sns.heatmap(data.corr(), cmap='RdYlGn', annot=True)
```
Figure 4 : Visualization of the correlation grid

```
# Column values
data.columns
```

```
Index(['Age', 'BMI', 'Glucose', 'Insulin', 'HOMA', 'Leptin', 'Adiponectin',
       'Resistin', 'MCP.1', 'Classification'],
      dtype='object')
```

### STEP 6 : Sorting out the outlier issue

As discussed earlier, KNN is adversely affected by outliers since it is a distance-based method. We now find the outliers and remove them, since they might increase variability and affect prediction.

```
# No outliers for Age
sns.boxplot(data['Age'])
```
Figure 5 : Boxplot of Age

```
# No outliers for BMI
sns.boxplot(data['BMI'])
```
Figure 6 : Boxplot of BMI

```
# Some outliers are present for Glucose and the data is skewed
sns.boxplot(data['Glucose'])
```
Figure 7 : Boxplot of Glucose

```
# Outliers are present in Insulin
sns.boxplot(data['Insulin'])
```
Figure 8 : Boxplot of Insulin

```
# Lots of outliers in HOMA
sns.boxplot(data['HOMA'])
```
Figure 9 : Boxplot of HOMA

```
# Distribution plot of HOMA
# (Note: distplot is deprecated in newer versions of seaborn; histplot/displot are the replacements.)
sns.distplot(data['HOMA'])
```
Figure 10 : Distribution plot of HOMA

```
# Outliers present for Leptin
sns.boxplot(data['Leptin'])
```
Figure 11 : Boxplot of Leptin

```
# Outliers present for Adiponectin
sns.boxplot(data['Adiponectin'])
```
Figure 12 : Boxplot of Adiponectin

```
# Outliers present for Resistin
sns.boxplot(data['Resistin'])
```
Figure 13 : Boxplot of Resistin

```
# Outliers present for MCP.1
sns.boxplot(data['MCP.1'])
```
Figure 14 : Boxplot of MCP

```
# Removing outliers since they may affect prediction for KNN (IQR/quantile method)
cancer = data.copy()

insulinQ1 = cancer['Insulin'].quantile(0.25)
insulinQ3 = cancer['Insulin'].quantile(0.75)
insulinIQR = insulinQ3 - insulinQ1
lowerliminsulin = insulinQ1 - 1.5*insulinIQR
upperliminsulin = insulinQ3 + 1.5*insulinIQR
insulrem = cancer[(cancer['Insulin'] > lowerliminsulin) & (upperliminsulin > cancer['Insulin'])]

sns.boxplot(insulrem['Glucose'])
```
Figure 15 : Boxplot of Glucose after removing Insulin outliers

```
glucoseQ1 = insulrem['Glucose'].quantile(0.25)
glucoseQ3 = insulrem['Glucose'].quantile(0.75)
glucoseIQR = glucoseQ3 - glucoseQ1
upperlimglucose = glucoseQ3 + 1.5*glucoseIQR
lowerlimglucose = glucoseQ1 - 1.5*glucoseIQR
glucoserem = insulrem[(insulrem['Glucose'] > lowerlimglucose) & (upperlimglucose > insulrem['Glucose'])]

sns.boxplot(glucoserem['HOMA'])
```
Figure 16 : Boxplot of HOMA after removing Glucose outliers

```
homaQ1 = glucoserem['HOMA'].quantile(0.25)
homaQ3 = glucoserem['HOMA'].quantile(0.75)
homaIQR = homaQ3 - homaQ1
upperlimhoma = homaQ3 + 1.5*homaIQR
lowerlimhoma = homaQ1 - 1.5*homaIQR
homarem = glucoserem[(glucoserem['HOMA'] > lowerlimhoma) & (upperlimhoma > glucoserem['HOMA'])]

sns.boxplot(homarem['Adiponectin'])
```
Figure 17 : Boxplot of Adiponectin after removing HOMA outliers

```
AdiponectinQ1 = homarem['Adiponectin'].quantile(0.25)
AdiponectinQ3 = homarem['Adiponectin'].quantile(0.75)
AdiponectinIQR = AdiponectinQ3 - AdiponectinQ1
lowerlimAdiponectin = AdiponectinQ1 - 1.5*AdiponectinIQR
upperlimAdiponectin = AdiponectinQ3 + 1.5*AdiponectinIQR
adirem = homarem[(homarem['Adiponectin'] > lowerlimAdiponectin) & (upperlimAdiponectin > homarem['Adiponectin'])]

sns.boxplot(adirem['Resistin'])
```
Figure 18 : Boxplot of Resistin after removing Adiponectin outliers

```
resistinQ1 = adirem['Resistin'].quantile(0.25)
resistinQ3 = adirem['Resistin'].quantile(0.75)
resistinIQR = resistinQ3 - resistinQ1
lowerlimresistin = resistinQ1 - 1.5*resistinIQR
upperlimresistin = resistinQ3 + 1.5*resistinIQR
Resistinrem = adirem[(adirem['Resistin'] > lowerlimresistin) & (upperlimresistin > adirem['Resistin'])]

sns.boxplot(Resistinrem['Leptin'])
```
Figure 19 : Boxplot of Leptin after removing Resistin outliers

```
LeptinQ1 = Resistinrem['Leptin'].quantile(0.25)
LeptinQ3 = Resistinrem['Leptin'].quantile(0.75)
LeptinIQR = LeptinQ3 - LeptinQ1
lowerlimLeptin = LeptinQ1 - 1.5*LeptinIQR
upperlimLeptin = LeptinQ3 + 1.5*LeptinIQR
leptinrem = Resistinrem[(Resistinrem['Leptin'] > lowerlimLeptin) & (upperlimLeptin > Resistinrem['Leptin'])]

sns.boxplot(leptinrem['MCP.1'])
```
Figure 20 : Boxplot of MCP after removing Leptin outliers

```
MCPQ1 = leptinrem['MCP.1'].quantile(0.25)
MCPQ3 = leptinrem['MCP.1'].quantile(0.75)
MCPIQR = MCPQ3 - MCPQ1
lowerlimMCP = MCPQ1 - 1.5*MCPIQR
upperlimMCP = MCPQ3 + 1.5*MCPIQR
mcprem = leptinrem[(leptinrem['MCP.1'] > lowerlimMCP) & (upperlimMCP > leptinrem['MCP.1'])]
mcprem.shape

sns.boxplot(mcprem['MCP.1'])
```
Figure 21 : Boxplot of MCP in the final data
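The repeated quantile filtering above follows a single pattern, so it could equally be wrapped in a small helper function. Here is a sketch; the function name `remove_outliers_iqr` is ours, not part of the original code:

```
def remove_outliers_iqr(df, column, factor=1.5):
    """Keep only the rows whose value in `column` lies within the IQR fences."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - factor * iqr, q3 + factor * iqr
    return df[(df[column] > lower) & (df[column] < upper)]

# Equivalent to the step-by-step filtering above:
cleaned = data.copy()
for col in ['Insulin', 'Glucose', 'HOMA', 'Adiponectin', 'Resistin', 'Leptin', 'MCP.1']:
    cleaned = remove_outliers_iqr(cleaned, col)
```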

```
# Create the features from the data
X = mcprem.iloc[:, 0:9]

# Create the target variable from the data
Y = mcprem.iloc[:, 9]
```

### STEP 7 : Using standardisation to bring all values to one scale since KNN is a distance-based method

```
from sklearn.preprocessing import StandardScaler
ss = StandardScaler()
X = ss.fit_transform(X)
X = pd.DataFrame(X)
```
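As a quick optional sanity check, each scaled column should now have a mean close to 0 and a standard deviation close to 1:

```
# After StandardScaler, every feature is centred at 0 with (approximately) unit variance.
print(X.mean().round(2))
print(X.std().round(2))
```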

### STEP 8 : Splitting the dataset into train and test sets

```
from sklearn.model_selection import train_test_split
# Note: no random_state is set, so every run produces a different split.
xtrain, xtest, ytrain, ytest = train_test_split(X, Y, test_size=0.3)
```

### STEP 9 : Building KNeighbors classifier by simulating different k values

```
# Importing KNeighborsClassifier from sklearn
# Finding accuracies on train data and test data with Euclidean distance (the default p=2)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

for x in range(5, 10, 2):
    knn = KNeighborsClassifier(n_neighbors=x, metric='minkowski', weights='distance')
    knn.fit(xtrain, ytrain)
    train_ypred = knn.predict(xtrain)
    acc_train_score = accuracy_score(ytrain, train_ypred)
    test_ypred = knn.predict(xtest)
    acc_test_score = accuracy_score(ytest, test_ypred)
    print(f'Accuracy score for train data and test data is {acc_train_score} and {acc_test_score} respectively for {x} neighbours')
```

Accuracy score for train data and test data is 1.0 and 0.7142857142857143 respectively for 5 neighbours

Accuracy score for train data and test data is 1.0 and 0.7857142857142857 respectively for 7 neighbours

Accuracy score for train data and test data is 1.0 and 0.7142857142857143 respectively for 9 neighbours

The train accuracies from the model building above with odd numbers of neighbors (5, 7, 9) are all 100%, while the test accuracies differ, with their own highs and lows. (The train accuracy is always 100% here because with weights='distance' each training point is its own nearest neighbor at zero distance.) For 5 neighbors the test accuracy is 71.42%, whereas for 7 neighbors it is 78.57%, so 7 neighbors is the better choice for model building. This example clearly shows that the number of neighbors has a vital impact on voting and prediction. Hence choosing the right number of neighbors is important, and the number of neighbors should be odd so that the vote cannot tie.
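A single train/test split can be noisy on such a small dataset, so cross-validation is a more robust way to pick k. Below is a minimal sketch using scikit-learn's GridSearchCV; the grid of k values is our own choice:

```
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Search odd k values with 5-fold cross-validation on the training data.
param_grid = {'n_neighbors': [3, 5, 7, 9, 11]}
grid = GridSearchCV(KNeighborsClassifier(metric='minkowski', weights='distance'),
                    param_grid, cv=5, scoring='accuracy')
grid.fit(xtrain, ytrain)
print(grid.best_params_, grid.best_score_)
```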

### STEP 10 : Building KNeighbors classifier using Euclidean distance and 7 neighbors as optimal

```
knn = KNeighborsClassifier(n_neighbors=7, metric='minkowski', weights='distance')
knn.fit(xtrain, ytrain)
```

### STEP 11 : Predicting for train data

```
trainypred = knn.predict(xtrain)
```

### STEP 12 : Finding the precision, recall, f1-score and support

```
from sklearn.metrics import classification_report
# classification_report expects (y_true, y_pred)
print(classification_report(ytrain, trainypred))
```
Figure 22 : Classification report on the train data set

### STEP 13 : Finding accuracy Score for Train Data

```
accuracy_score(ytrain, trainypred)
```
`1.0`

### STEP 14 : Predicting Test Data

```
testypredicted = knn.predict(xtest)
```

### STEP 15 : Accuracy Score for test Data

```
from sklearn.metrics import accuracy_score
accuracy_score(ytest, testypredicted)
```
`0.7857142857142857`

### STEP 16 : Pickling the model to a file for later use

```
import pickle

# Save our model as a pickle to a file
pickle.dump(knn, open("my_knn_model.pickle.dat", "wb"))

# Delete the existing knn model from the environment
del knn

# Load the pickled object from the file
loaded_knn = pickle.load(open("my_knn_model.pickle.dat", "rb"))

# Use the loaded model to make predictions
print(loaded_knn.predict(xtest))
```

## Conclusion and Summary

As you can see, the dataset taken is a small clinical dataset with fewer than 150 records, which makes it very suitable for building a KNN model. There were many outliers in almost all of the independent variables, and these had to be removed. We scaled the dataset before model building. The model was then built using 7 nearest neighbours with Euclidean distance as the measure, giving accuracies of 100% on the train data and 78.57% on the test data. Also note that we haven't specified a seed for the train/test split, so the train and test sets will contain different records on each run, and your accuracy and precision will vary accordingly. The precision and recall for both classes stand at a good score. Always aim to take the best value of K for the best accuracy while avoiding overfitting or underfitting of the dataset; you can try different k values and different distance methods to arrive at the best optimised model. Remember to use KNN for small datasets; for large datasets it is better to use other parametric or non-parametric models.