Table of Contents

- What is KNN?
- Where do you use KNN?
- What is the K in K Nearest Neighbors?
- How does KNN work?
- What are the distance-based methods used in KNN?
- Disadvantages of KNN
- Problem solved on Breast Cancer Dataset using KNN
- Initializing libraries
- Importing dataset
- Dataset Variable Terms Explained
- Finding missing values
- Exploratory Data Analysis
- Sorting Outlier issue
- Using Standardisation
- Splitting of Dataset to test and train
- Building KNeighbors classifier using simulation of different k values
- Building KNeighbors classifier
- Predicting for train data
- Finding the precision, recall, f1-score, support
- Finding accuracy Score for Train Data
- Predicting Test Data
- Accuracy Score for test Data
- Pickle model as a file for usage at later stage
- Conclusion and Summary

## What is KNN?

KNN stands for K-Nearest Neighbors. It is a supervised, non-parametric machine learning algorithm that can solve both classification and regression problems. It predicts the value of a data point from its nearest neighbors, which it finds with a distance measure such as Euclidean, Minkowski or Manhattan distance.

## Where do you use KNN?

KNN is used for both classification and regression problems. It works well on small datasets, which makes it a good fit for many clinical datasets. For classification problems, the output is the class the point belongs to; for regression, the output is the average of the values of the neighbors considered.
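The two kinds of output described above can be sketched in a few lines. This is only an illustration: the neighbor labels and values below are made up, and the variable names are not from the tutorial.

```python
from collections import Counter
from statistics import mean

# Hypothetical classes and target values of the k = 5 nearest neighbors of one query point.
neighbor_classes = ["healthy", "patient", "patient", "healthy", "patient"]
neighbor_values = [4.0, 5.5, 6.0, 4.5, 5.0]

# Classification: the most common class among the neighbors is the prediction.
predicted_class = Counter(neighbor_classes).most_common(1)[0][0]

# Regression: the prediction is the average of the neighbors' target values.
predicted_value = mean(neighbor_values)

print(predicted_class)  # patient
print(predicted_value)  # 5.0
```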

## What is the K in K Nearest Neighbors?

The letter K is the parameter that sets the number of nearest neighbors. To predict an unknown point in a classification or regression problem, KNN needs a k-value, which determines how many of the nearest neighbors vote on the value of the unlabelled data point. The K value is always a positive integer.

## How does KNN work?

KNN predicts an unknown value in an interesting way: it takes the majority vote of the classes of its nearest neighbors. It is like guessing that a person is a doctor because more of that person's nearest neighbors are doctors than engineers.

The algorithm measures the distance between the unknown point and every known point using the chosen distance measure (such as Euclidean, Minkowski, Manhattan or chi-square) and keeps the K closest points. It then predicts the class of the unknown point as the class with the most votes among those neighbors. For example, if the unknown value is a color and most of its neighbors are orange, the unknown point is classified as orange.

KNN does not learn a model from the dataset; it simply stores the data and uses the nearest neighbors at prediction time. Hence it is called a lazy algorithm.
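The procedure described above can be written from scratch in a few lines. The sketch below assumes Euclidean distance and a plain (unweighted) majority vote; the function name `knn_predict` and the toy points are made up for illustration and are not part of scikit-learn or of the tutorial's dataset.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote of its k nearest neighbors."""
    # Distance from the query to every stored training point (Euclidean).
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest points and count their labels.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy data (made up): two features per point, two color classes.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (6.0, 6.0), (6.5, 7.0), (7.0, 6.5)]
labels = ["orange", "orange", "orange", "blue", "blue", "blue"]

print(knn_predict(points, labels, query=(2.0, 2.0), k=3))  # orange
print(knn_predict(points, labels, query=(6.0, 6.5), k=3))  # blue
```

Note that there is no training step: all the work happens inside `knn_predict`, which is what makes the algorithm lazy.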

The k-value should always be chosen with care during model building, since it affects accuracy and can lead to overfitting (k too small) or underfitting (k too large). Ideally an odd number is preferred so that a two-class vote cannot end in a tie and every unlabelled point gets a decision.

## What are the distance-based methods used in KNN?

The distance-based methods used in KNN include Euclidean distance, Minkowski distance, chi-square distance, cosine similarity, Hamming distance, Chebyshev distance and Mahalanobis distance. Note that the variables in the dataset should be brought to a common scale before applying any distance-based measure; otherwise variables with large numeric ranges dominate the distance. Removing outliers is also important: because KNN relies on distances, data points lying very far away from the rest of the data can distort the distance measurements.
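A few of these measures are simple enough to compute by hand. The sketch below uses hypothetical helper functions (not library calls) to show that Manhattan and Euclidean distance are the Minkowski distance with p = 1 and p = 2, alongside Chebyshev and Hamming distance.

```python
def minkowski(a, b, p):
    """Minkowski distance of order p between two equal-length vectors."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

def chebyshev(a, b):
    """Largest absolute difference along any single coordinate."""
    return max(abs(x - y) for x, y in zip(a, b))

def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

a = (0.0, 0.0)
b = (3.0, 4.0)

manhattan = minkowski(a, b, p=1)  # |3| + |4|
euclidean = minkowski(a, b, p=2)  # sqrt(3^2 + 4^2)

print(manhattan, euclidean, chebyshev(a, b))  # 7.0 5.0 4.0
print(hamming("10110", "10011"))              # 2
```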

## Disadvantages of KNN

- KNN is impractical for large datasets.
- KNN does not learn the dataset; predictions always come from a vote among the nearest neighbors, so they will not always be correct.
- KNN becomes very slow on large amounts of data.
- It is sensitive to outliers, and the data must be brought to a common scale.
- KNN requires a lot of memory to store and process large datasets.
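To see why scale matters, the sketch below compares nearest neighbors before and after standardisation. The numbers are made up, and `standardise` and `nearest_to_first` are hypothetical helpers; the population standard deviation is used, which is also what scikit-learn's `StandardScaler` uses.

```python
import math
from statistics import mean, pstdev

def standardise(column):
    """Rescale a list of numbers to zero mean and unit standard deviation."""
    mu, sigma = mean(column), pstdev(column)
    return [(value - mu) / sigma for value in column]

def nearest_to_first(rows):
    """Index of the row closest (Euclidean) to rows[0], excluding rows[0] itself."""
    return min(range(1, len(rows)), key=lambda i: math.dist(rows[0], rows[i]))

# Made-up values: one feature in years, one measured in much larger units.
age = [30.0, 70.0, 32.0, 75.0]
mcp = [500.0, 510.0, 600.0, 1500.0]

raw_rows = list(zip(age, mcp))
scaled_rows = list(zip(standardise(age), standardise(mcp)))

print(nearest_to_first(raw_rows))     # 1: the large-unit feature decides
print(nearest_to_first(scaled_rows))  # 2: both features now contribute
```

On the raw values, the 40-year age gap to row 1 counts for less than the 100-unit gap in the second feature to row 2. After scaling, the person of almost the same age becomes the nearest neighbor.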

## Problem solved on Breast Cancer Dataset using KNN

### STEP 1 : Initializing libraries

```
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
```

### STEP 2 : Importing dataset

Download the dataset (the Coimbra breast cancer data, saved here as `dataR2.csv`) before running this step.

```
# Use pandas read_csv to load the downloaded data into your Python environment.
# The raw string (r"...") keeps the backslashes in the Windows path literal.
data = pd.read_csv(r"E:\Imurgence\dataR2.csv")
```

### STEP 3 : Dataset Variable Terms Explained

- Age: age of the person (years)
- BMI: body mass index of the person (kg/m2)
- Glucose (mg/dL): concentration of sugar in the blood. Average blood sugar is between 70 and 126 mg/dL.
- Insulin (µU/mL): insulin is a hormone produced by cells in the pancreas to control the amount of glucose (a type of sugar) flowing in the blood and its absorption by the body. The normal fasting insulin level is between 5 and 15 µU/mL.
- HOMA: homeostatic model assessment
- Leptin (ng/mL): helps inhibit hunger and regulate energy balance, so the body does not trigger hunger responses when it does not need energy.
- Adiponectin (µg/mL): a protein hormone and adipokine involved in regulating glucose levels and fatty acid breakdown. In humans it is encoded by the ADIPOQ gene and is produced primarily in adipose tissue, but also in muscle and even in the brain.
- Resistin (ng/mL): resistin increases bad cholesterol in the liver, which can result in heart disease.
- MCP-1 (pg/dL): monocyte chemoattractant protein-1 (MCP-1/CCL2) is one of the key chemokines that regulate migration and infiltration of monocytes/macrophages.
- Classification labels: 1 for healthy controls, 2 for patients

__Figure 1 : Coimbra data set__

```
# Number of rows and columns in the dataset
print(len(data), len(data.columns))
```

`116 10`

### STEP 4 : Finding missing values

```
# Count the missing values in each column
data.isna().sum()
```

__Figure 2 : Check missing value in data set__

### STEP 5 : Exploratory Data Analysis

```
#Type of Data in Each Variable
data.info()
```

__Figure 3 : Data type of individual variables in coimbra dataset__

Is multicollinearity a problem in KNN?

KNN is a non-parametric method, and the absence of multicollinearity is not one of its assumptions. The heat map below shows the correlation among the variables in the dataset. We will not consider multicollinearity any further as we go ahead.

```
# Heatmap to find correlation
plt.subplots(figsize=(20, 20))
sns.heatmap(data.corr(), cmap='RdYlGn', annot=True)
```

__Figure 4 : Visualization of the Correlation grid__

```
#column Values
data.columns
```

```
Index(['Age', 'BMI', 'Glucose', 'Insulin', 'HOMA', 'Leptin', 'Adiponectin',
'Resistin', 'MCP.1', 'Classification'],
dtype='object')
```

### STEP 6 : Sorting Outlier issue

As discussed earlier, KNN is adversely affected by outliers because it is a distance-based method. We therefore look for outliers in each variable and reduce them, since the extra variability they introduce can hurt prediction.

```
# No outliers for Age
sns.boxplot(x=data['Age'])
```

__Figure 5 : Boxplot of age__

```
# No outliers for BMI
sns.boxplot(x=data['BMI'])
```

__Figure 6 : Boxplot of BMI__

```
# Some outliers are there for Glucose and the data is skewed
sns.boxplot(x=data['Glucose'])
```

__Figure 7 : Boxplot of Glucose__

```
# Outliers are present in Insulin
sns.boxplot(x=data['Insulin'])
```

__Figure 8 : Boxplot of Insulin__

```
# Lots of outliers in HOMA
sns.boxplot(x=data['HOMA'])
```

__Figure 9 : Boxplot of HOMA__

```
# Distribution plot of HOMA
# (distplot is deprecated in recent seaborn releases; sns.histplot(data['HOMA'], kde=True) is the replacement)
sns.distplot(data['HOMA'])
```

__Figure 10 : Distribution plot of HOMA__

```
# Outliers present for Leptin
sns.boxplot(x=data['Leptin'])
```

__Figure 11 : Boxplot of Leptin__

```
# Outliers present for Adiponectin
sns.boxplot(x=data['Adiponectin'])
```

__Figure 12 : Boxplot of Adiponectin__

```
# Outliers present for Resistin
sns.boxplot(x=data['Resistin'])
```

__Figure 13 : Boxplot of Resistin__

```
# Outliers present for MCP.1
sns.boxplot(x=data['MCP.1'])
```

__Figure 14 : Boxplot of MCP__

```
# Remove outliers with the quantile (IQR) method, since they may affect KNN predictions
cancer = data.copy()
insulinQ1 = cancer['Insulin'].quantile(0.25)
insulinQ3 = cancer['Insulin'].quantile(0.75)
insulinIQR = insulinQ3 - insulinQ1
lowerliminsulin = insulinQ1 - 1.5 * insulinIQR
upperliminsulin = insulinQ3 + 1.5 * insulinIQR
insulrem = cancer[(cancer['Insulin'] > lowerliminsulin) & (upperliminsulin > cancer['Insulin'])]
sns.boxplot(x=insulrem['Glucose'])
```

__Figure 15 : Boxplot of Glucose in censored Insulin outliers__

```
glucoseQ1 = insulrem['Glucose'].quantile(0.25)
glucoseQ3 = insulrem['Glucose'].quantile(0.75)
glucoseIQR = glucoseQ3 - glucoseQ1
upperlimglucose = glucoseQ3 + 1.5 * glucoseIQR
lowerlimglucose = glucoseQ1 - 1.5 * glucoseIQR
glucoserem = insulrem[(insulrem['Glucose'] > lowerlimglucose) & (upperlimglucose > insulrem['Glucose'])]
sns.boxplot(x=glucoserem['HOMA'])
```

__Figure 16 : Boxplot of HOMA in censored Glucose outliers__

```
homaQ1 = glucoserem['HOMA'].quantile(0.25)
homaQ3 = glucoserem['HOMA'].quantile(0.75)
homaIQR = homaQ3 - homaQ1
upperlimhoma = homaQ3 + 1.5 * homaIQR
lowerlimhoma = homaQ1 - 1.5 * homaIQR
homarem = glucoserem[(glucoserem['HOMA'] > lowerlimhoma) & (upperlimhoma > glucoserem['HOMA'])]
sns.boxplot(x=homarem['Adiponectin'])
```

__Figure 17 : Boxplot of Adiponectin in censored HOMA outliers__

```
AdiponectinQ1 = homarem['Adiponectin'].quantile(0.25)
AdiponectinQ3 = homarem['Adiponectin'].quantile(0.75)
AdiponectinIQR = AdiponectinQ3 - AdiponectinQ1
upperlimAdiponectin = AdiponectinQ3 + 1.5 * AdiponectinIQR
lowerlimAdiponectin = AdiponectinQ1 - 1.5 * AdiponectinIQR
adirem = homarem[(homarem['Adiponectin'] > lowerlimAdiponectin) & (upperlimAdiponectin > homarem['Adiponectin'])]
sns.boxplot(x=adirem['Resistin'])
```

__Figure 18 : Boxplot of Resistin in censored Adiponectin outliers__

```
resistinQ1 = adirem['Resistin'].quantile(0.25)
resistinQ3 = adirem['Resistin'].quantile(0.75)
resistinIQR = resistinQ3 - resistinQ1
lowerlimresistin = resistinQ1 - 1.5 * resistinIQR
upperlimresistin = resistinQ3 + 1.5 * resistinIQR
Resistinrem = adirem[(adirem['Resistin'] > lowerlimresistin) & (upperlimresistin > adirem['Resistin'])]
sns.boxplot(x=Resistinrem['Leptin'])
```

__Figure 19 : Boxplot of Leptin in censored Resistin outliers__

```
LeptinQ1 = Resistinrem['Leptin'].quantile(0.25)
LeptinQ3 = Resistinrem['Leptin'].quantile(0.75)
LeptinIQR = LeptinQ3 - LeptinQ1
lowerlimLeptin = LeptinQ1 - 1.5 * LeptinIQR
upperlimLeptin = LeptinQ3 + 1.5 * LeptinIQR
leptinrem = Resistinrem[(Resistinrem['Leptin'] > lowerlimLeptin) & (upperlimLeptin > Resistinrem['Leptin'])]
sns.boxplot(x=leptinrem['MCP.1'])
```

__Figure 20 : Boxplot of MCP in censored Leptin outliers__

```
MCPQ1 = leptinrem['MCP.1'].quantile(0.25)
MCPQ3 = leptinrem['MCP.1'].quantile(0.75)
MCPIQR = MCPQ3 - MCPQ1
lowerlimMCP = MCPQ1 - 1.5 * MCPIQR
upperlimMCP = MCPQ3 + 1.5 * MCPIQR
mcprem = leptinrem[(leptinrem['MCP.1'] > lowerlimMCP) & (upperlimMCP > leptinrem['MCP.1'])]
mcprem.shape
sns.boxplot(x=mcprem['MCP.1'])
```

__Figure 21 : Boxplot of MCP in final data__
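The quantile steps above repeat the same pattern for every variable, so they could be wrapped in one helper. The sketch below is an assumption about how that might look: `remove_iqr_outliers` is a hypothetical function name, and the toy frame contains made-up numbers rather than the Coimbra data.

```python
import pandas as pd

def remove_iqr_outliers(df, column, factor=1.5):
    """Return the rows of `df` whose `column` value lies strictly inside the IQR fences."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - factor * iqr, q3 + factor * iqr
    return df[(df[column] > lower) & (df[column] < upper)]

# Toy frame (made-up numbers) with one obvious outlier in each column.
toy = pd.DataFrame({
    "Insulin": [4.0, 5.0, 6.0, 7.0, 8.0, 60.0],
    "Glucose": [90.0, 85.0, 200.0, 92.0, 88.0, 95.0],
})

# Filter one column after another, as in the steps above:
# each quantile is computed on the rows that survived the previous filter.
cleaned = toy
for col in ["Insulin", "Glucose"]:
    cleaned = remove_iqr_outliers(cleaned, col)

print(cleaned)
```

Applied to the tutorial's data, the loop would run over Insulin, Glucose, HOMA, Adiponectin, Resistin, Leptin and MCP.1 in that order, starting from `data.copy()`.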

```
# Create the features from the data (the first nine columns)
X = mcprem.iloc[:, 0:9]
# Create the target variable from the data (the Classification column)
Y = mcprem.iloc[:, 9]
```

### STEP 7 : Using Standardisation to bring all values to one unit since KNN is a distance-based method

```
from sklearn.preprocessing import StandardScaler

ss = StandardScaler()
X = ss.fit_transform(X)
X = pd.DataFrame(X)
```

### STEP 8 : Splitting of Dataset to test and train

```
from sklearn.model_selection import train_test_split

xtrain, xtest, ytrain, ytest = train_test_split(X, Y, test_size=0.3)
```
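The steps above scale the whole dataset first and split it afterwards without a seed. A common variation, shown here only as a sketch on synthetic data, splits first with a fixed `random_state` and fits the scaler on the training rows alone, so that no statistics from the test rows reach the model. The array names and the random data are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the nine features and the 1/2 class labels.
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(100, 9))
y_demo = rng.integers(1, 3, size=100)

# random_state fixes the shuffle, so the same split is produced on every run;
# stratify keeps the class proportions similar in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# Learn the mean and standard deviation from the training rows only,
# then apply the same transformation to both parts.
scaler = StandardScaler().fit(X_tr)
X_tr_s = scaler.transform(X_tr)
X_te_s = scaler.transform(X_te)

print(X_tr_s.shape, X_te_s.shape)  # (70, 9) (30, 9)
```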

### STEP 9 : Building KNeighbors classifier using simulation of different k values

```
# Import the KNeighbors classifier from sklearn
# Find accuracies on train and test data with Euclidean distance (minkowski with the default p=2)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

for x in range(5, 10, 2):
    knn = KNeighborsClassifier(n_neighbors=x, metric='minkowski', weights='distance')
    knn.fit(xtrain, ytrain)
    train_ypred = knn.predict(xtrain)
    acc_train_score = accuracy_score(train_ypred, ytrain)
    test_ypred = knn.predict(xtest)
    acc_test_score = accuracy_score(test_ypred, ytest)
    # Print statement reconstructed from the output shown below
    print("Accuracy score for train data and test data is", acc_train_score,
          "and", acc_test_score, "respectively for", x, "neighbours")
```

Accuracy score for train data and test data is 1.0 and 0.7142857142857143 respectively for 5 neighbours

Accuracy score for train data and test data is 1.0 and 0.7857142857142857 respectively for 7 neighbours

Accuracy score for train data and test data is 1.0 and 0.7142857142857143 respectively for 9 neighbours

The train accuracies from the models built above with an odd number of neighbors (5, 7, 9) are all the same, 100%, while the test accuracies differ. For 5 and 9 neighbors the test accuracy is 71.43%, whereas for 7 neighbors it is 78.57%, so 7 neighbors is the better choice for the model. This example shows that the number of neighbors has a strong effect on voting and prediction. Hence choosing the right number of neighbors is important, and an odd number is preferred for a two-class problem so that votes cannot tie.
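Because a single random split can favour one k by chance, another way to compare k values is cross-validation. The sketch below is not the method used in this tutorial; it uses synthetic data from `make_classification` as a stand-in for the real features, and the variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic two-class data with nine features (an assumption, not the Coimbra data).
X_demo, y_demo = make_classification(
    n_samples=120, n_features=9, n_informative=5, random_state=0
)

cv_scores = {}
for k in range(3, 16, 2):  # odd k values from 3 to 15
    # The pipeline refits the scaler inside every fold before fitting KNN.
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    cv_scores[k] = cross_val_score(model, X_demo, y_demo, cv=5).mean()

best_k = max(cv_scores, key=cv_scores.get)
print(best_k, round(cv_scores[best_k], 3))
```

Each score is the mean accuracy over five folds, so the comparison depends less on which records happen to land in the test set.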

### STEP 10 : Building KNeighbors classifier using Euclidean Distance and 7 neighbors as optimal

```
knn = KNeighborsClassifier(n_neighbors=7, metric='minkowski', weights='distance')
knn.fit(xtrain, ytrain)
```

### STEP 11 : Predicting for train data

```
trainypred = knn.predict(xtrain)
```

### STEP 12 : Finding the precision, recall, f1-score, support

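```
from sklearn.metrics import classification_report

# Report precision, recall, f1-score and support on the train data
print(classification_report(ytrain, trainypred))
```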

__Figure 22 : Classification report on train data set__
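The four columns of the classification report can be reproduced by hand for one class. The sketch below uses made-up labels and a hypothetical helper, `precision_recall_f1`; it assumes the chosen class occurs at least once in both the true and the predicted labels, so no division by zero can occur.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall, f1-score and support for the class `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # true positives
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false positives
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # false negatives
    precision = tp / (tp + fp)  # share of predicted positives that are correct
    recall = tp / (tp + fn)     # share of actual positives that were found
    f1 = 2 * precision * recall / (precision + recall)
    support = tp + fn           # number of true samples of this class
    return precision, recall, f1, support

# Made-up labels: 1 = healthy control, 2 = patient.
y_true = [1, 1, 1, 2, 2, 2, 2, 2]
y_pred = [1, 2, 2, 2, 2, 1, 2, 2]

precision, recall, f1, support = precision_recall_f1(y_true, y_pred, positive=2)
print(round(precision, 3), round(recall, 3), round(f1, 3), support)  # 0.667 0.8 0.727 5
```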

### STEP 13 : Finding accuracy Score for Train Data

```
accuracy_score(trainypred,ytrain)
1.0
```
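A train accuracy of exactly 1.0 is largely a consequence of `weights='distance'`: when the model predicts a point it has already stored, that point is its own nearest neighbor at distance zero, so its own label decides the vote. The sketch below illustrates this on synthetic data (an assumption, not the Coimbra data) by comparing the two weighting schemes on the training set.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class data with nine continuous features and no duplicate points.
X_demo, y_demo = make_classification(
    n_samples=120, n_features=9, n_informative=5, random_state=0
)

train_acc = {}
for weights in ("uniform", "distance"):
    model = KNeighborsClassifier(n_neighbors=7, weights=weights).fit(X_demo, y_demo)
    # Predict the same rows the model was fitted on.
    train_acc[weights] = accuracy_score(y_demo, model.predict(X_demo))

print(train_acc)
```

With `weights='distance'` the training accuracy comes out as 1.0 regardless of k, so it says little about how well the model generalises; the test accuracy is the more informative number.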

### STEP 14 : Predicting Test Data

```
testypredicted = knn.predict(xtest)
```

### STEP 15 : Accuracy Score for test Data

```
from sklearn.metrics import accuracy_score

accuracy_score(testypredicted, ytest)
```

`0.7857142857142857`

### STEP 16 : Pickle model as a file for usage at later stage

```
import pickle

# Save our model as a pickle to a file
with open("my_knn_model.pickle.dat", "wb") as f:
    pickle.dump(knn, f)

# Delete the existing knn model from the environment
del knn

# Load the pickled object from the file
with open("my_knn_model.pickle.dat", "rb") as f:
    load_knn = pickle.load(f)

# Use the loaded model to make predictions
load_knn.predict(xtest)
```

## Conclusion and Summary

As you can see, the dataset is a small clinical dataset with fewer than 150 records, which makes it well suited to building a KNN model. Almost all the independent variables contained outliers, which had to be removed. We scaled the dataset before model building. The model was then built using 7 nearest neighbours with Euclidean distance as the distance measure. We got train and test accuracies of 100% and 78.57% respectively, and the precision and recall for the classes are good.

Note that we did not specify a seed when splitting the dataset, so your train and test sets will contain different records, and your accuracy and precision will vary accordingly.

Always look for the value of K that gives the best accuracy while avoiding overfitting or underfitting. You can try different k values and different distance methods to arrive at the best model. Remember to use KNN for small datasets; for large datasets it is better to use other parametric or non-parametric models.
