# Understanding the KNN-Algorithm

KNN (k-nearest neighbors) is a supervised learning algorithm that can be used for both regression and classification problems, although it is mostly used for classification. KNN predicts the class of a test point by calculating the distance between that point and every training point, and then selecting the k training points that are closest to it.

1. Suppose we have a training dataset whose points belong to one of two classes, and a new test point that we need to assign to one of those classes.

2. The k-NN algorithm calculates the distance between the test point and every point in the training data.

3. After calculating the distances, it selects the k training points that are nearest to the test point. Let's assume the value of k is 3 for our example.

Here, with k = 3, two of the three nearest neighbors are green and one is red, so KNN classifies the test point as green. This is a classification example; for regression, the predicted value is the mean of the target values of the k selected points. Calculating the distance from the test point to every training point in this way is known as brute-force KNN. There are several ways of calculating the distance, such as Euclidean, Manhattan and Hamming.
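The steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: the function names, the toy points and the target values are all made up for the example, and Euclidean distance is assumed.

```python
import math
from collections import Counter

def k_nearest(train_points, query, k):
    """Return the indices of the k training points closest to `query`."""
    # Brute force: measure the distance from the query to every training point.
    distances = [(math.dist(point, query), i) for i, point in enumerate(train_points)]
    distances.sort()
    return [i for _, i in distances[:k]]

def knn_classify(train_points, train_labels, query, k=3):
    """Classification: majority vote among the k nearest neighbors."""
    neighbors = k_nearest(train_points, query, k)
    votes = Counter(train_labels[i] for i in neighbors)
    return votes.most_common(1)[0][0]

def knn_regress(train_points, train_targets, query, k=3):
    """Regression: mean of the targets of the k nearest neighbors."""
    neighbors = k_nearest(train_points, query, k)
    return sum(train_targets[i] for i in neighbors) / k

# Toy data, invented for illustration only.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (5.0, 5.0), (6.0, 5.5), (3.0, 3.0)]
labels = ["green", "green", "green", "red", "red", "red"]
targets = [10, 12, 14, 30, 32, 20]
query = (2.5, 2.5)

print(knn_classify(points, labels, query, k=3))  # green (2 green votes, 1 red)
print(knn_regress(points, targets, query, k=3))
```

With these toy points, the three nearest neighbors of the query are one red point and two green points, which mirrors the green=2, red=1 vote described above.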

Euclidean Distance:

• The most common method.
• For two points P(p1,p2) and Q(q1,q2), the Euclidean distance is ED(p,q) = sqrt((q1-p1)^2 + (q2-p2)^2)
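As a quick worked example, the formula can be written as a small helper. The function name and the two points are chosen here purely for illustration; the helper also works for more than two coordinates.

```python
import math

def euclidean_distance(p, q):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((qi - pi) ** 2 for pi, qi in zip(p, q)))

p = (1, 2)
q = (4, 6)
# Differences are 3 and 4, so the distance is sqrt(9 + 16) = 5.
print(euclidean_distance(p, q))  # 5.0
```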

Manhattan Distance:

• The distance is the sum of the absolute differences between the corresponding coordinates of two points.
• MD(p,q) = |q1-p1| + |q2-p2|
• Less influenced by outliers than Euclidean distance, and often preferred for very high-dimensional datasets.
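The sketch below computes the Manhattan distance for a pair of made-up points, and then gives one rough way to see the outlier claim: when a single coordinate differs by a large amount, it accounts for a larger share of the squared Euclidean distance than of the Manhattan distance. The helper name and the points are assumptions for the example.

```python
def manhattan_distance(p, q):
    """Sum of the absolute coordinate differences."""
    return sum(abs(qi - pi) for pi, qi in zip(p, q))

print(manhattan_distance((1, 2), (4, 6)))  # 7  (|4-1| + |6-2|)

# One coordinate (the last) differs far more than the others.
a = (0, 0, 0)
b = (1, 1, 10)
diffs = [abs(bi - ai) for ai, bi in zip(a, b)]

# Share of the total contributed by the large difference.
manhattan_share = max(diffs) / sum(diffs)
euclidean_share = max(diffs) ** 2 / sum(d ** 2 for d in diffs)

print(round(manhattan_share, 2))  # 0.83
print(round(euclidean_share, 2))  # 0.98
```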

Brute-force KNN is computationally very expensive on large datasets, so tree-based methods such as the kd-tree and the Ball Tree are often used instead. They are usually more efficient than brute force, and the Ball Tree in particular can work better on high-dimensional datasets.
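In scikit-learn, the search strategy is selected with the `algorithm` parameter. The sketch below, on random data generated only for this example, suggests that the three strategies return the same neighbors and differ only in how they find them.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Random data, generated only for this illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
queries = rng.normal(size=(5, 3))

results = {}
for algorithm in ("brute", "kd_tree", "ball_tree"):
    nn = NearestNeighbors(n_neighbors=3, algorithm=algorithm).fit(X)
    distances, indices = nn.kneighbors(queries)
    results[algorithm] = (distances, indices)

# The neighbor distances should agree across strategies (up to rounding).
print(np.allclose(results["brute"][0], results["kd_tree"][0]))
print(np.allclose(results["brute"][0], results["ball_tree"][0]))
```

The choice therefore affects speed and memory rather than the predictions themselves.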

Choosing the Value of K:

• A low k-value gives high variance and low bias, so there is a risk of overfitting the data.
• A high k-value gives lower variance but higher bias, so there is a risk of underfitting the data.
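One way to observe this trade-off is to compare training and test accuracy for several values of k. The sketch below uses a synthetic dataset from `make_classification`, not the dataset used later in this article, and the specific k values are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data with some label noise, for illustration only.
X, y = make_classification(n_samples=400, n_features=5, n_informative=3,
                           n_redundant=0, flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

scores = {}
for k in (1, 5, 15, 51, 151):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    scores[k] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print(f"k={k:>3}  train={scores[k][0]:.3f}  test={scores[k][1]:.3f}")
```

With k=1, every training point is its own nearest neighbor, so the training accuracy is perfect even though the labels are noisy. A large gap between training and test accuracy is a typical sign of overfitting, while both scores dropping together at very large k points to underfitting.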

Is the KNN algorithm a lazy learner? Yes, it is. Algorithms like SVM, logistic regression and Bayesian classifiers are eager learners: they generalize from the training dataset before they see the test dataset. In other words, they build a model from the training dataset and then use it to make predictions on the test dataset. KNN is a lazy learner, so it waits for the test dataset and only then starts generalizing from the training dataset. Lazy algorithms do less work while training and more while testing.
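The lazy behavior can be made visible with a toy class that counts distance computations. `LazyKNN` and its attribute names are hypothetical and written only for this illustration: `fit` just stores the data, and all the distance work happens when a prediction is requested.

```python
import math
from collections import Counter

class LazyKNN:
    """Toy KNN classifier that counts how many distances it computes."""

    def __init__(self, k=3):
        self.k = k
        self.distance_calls = 0

    def fit(self, points, labels):
        # "Training" only memorizes the data; no model is built.
        self.points = list(points)
        self.labels = list(labels)
        return self

    def predict_one(self, query):
        # All the work happens here: one distance per stored training point.
        scored = []
        for point, label in zip(self.points, self.labels):
            self.distance_calls += 1
            scored.append((math.dist(point, query), label))
        scored.sort(key=lambda pair: pair[0])
        votes = Counter(label for _, label in scored[: self.k])
        return votes.most_common(1)[0][0]

model = LazyKNN(k=3).fit(
    [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)],
    ["a", "a", "a", "b", "b", "b"],
)
calls_after_fit = model.distance_calls       # 0: nothing computed yet
prediction = model.predict_one((0.5, 0.5))
calls_after_predict = model.distance_calls   # 6: one per training point

print(calls_after_fit, prediction, calls_after_predict)  # 0 a 6
```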

I am going to show you an example of the KNN algorithm using the Diabetes dataset from the UCI Repository. The steps that I am going to perform are:

1. Data Cleaning
2. Data Visualization
3. Model Creation
4. Selecting a k-value using GridSearchCV
5. Model Validation using cross_val_score

```python
# importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.metrics import accuracy_score

# notebook-only magic: render plots inline (remove when running as a script)
%matplotlib inline

# reading data
data = pd.read_csv('diabetes.csv')
data.head()
```

There are no missing values in the dataset; however, there are outliers present and some of the variables are skewed.
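Checks of this kind can be done with a few pandas calls. The sketch below runs on a small made-up DataFrame, not on the diabetes data: the column names are hypothetical, and the 1.5 × IQR rule is just one common convention for flagging outliers, assumed here for illustration.

```python
import pandas as pd

# Toy data with hypothetical column names, for illustration only.
toy = pd.DataFrame({
    "feature_a": [0, 0, 10, 20, 30, 40, 50, 600],  # one extreme value
    "feature_b": [1, 2, 3, 4, 5, 6, 7, 8],         # evenly spread
})

# Missing values per column.
missing_per_column = toy.isnull().sum()

# Skewness per column (values far from 0 indicate a skewed variable).
skewness = toy.skew()

# Flag values outside 1.5 * IQR from the quartiles as outliers.
q1 = toy.quantile(0.25)
q3 = toy.quantile(0.75)
iqr = q3 - q1
outlier_mask = (toy < q1 - 1.5 * iqr) | (toy > q3 + 1.5 * iqr)
outliers_per_column = outlier_mask.sum()

print(missing_per_column)
print(skewness)
print(outliers_per_column)
```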

```python
# `df` (features) and `y` (target) are assumed to come from the
# data-cleaning step, which is not shown here
x_train, x_test, y_train, y_test = train_test_split(df, y, test_size=0.3, random_state=42)

# baseline model with the default hyperparameters
knn = KNeighborsClassifier()
knn.fit(x_train, y_train)
# KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
#                      metric_params=None, n_jobs=None, n_neighbors=5, p=2,
#                      weights='uniform')

y_pred = knn.predict(x_test)
knn.score(x_train, y_train)  # accuracy on the training set
# 0.8435754189944135

print("The accuracy score is : ", accuracy_score(y_test, y_pred))
# The accuracy score is :  0.7272727272727273

# search over the neighbor-search algorithm, the leaf size and k
param_grid = {
    'algorithm': ['ball_tree', 'kd_tree', 'brute'],
    'leaf_size': [18, 20, 25, 27, 30, 32, 34],
    'n_neighbors': [3, 5, 7, 9, 10, 11, 12, 13],
}
gridsearch = GridSearchCV(knn, param_grid).fit(x_train, y_train)
gridsearch.best_params_
# {'algorithm': 'ball_tree', 'leaf_size': 18, 'n_neighbors': 11}

# refit with the best parameters found by the grid search
knn = KNeighborsClassifier(algorithm='ball_tree', leaf_size=18, n_neighbors=11)
knn.fit(x_train, y_train)
y_pred = knn.predict(x_test)
knn.score(x_train, y_train)  # accuracy on the training set
# 0.8063314711359404

print("The accuracy score is : ", accuracy_score(y_test, y_pred))
# The accuracy score is :  0.7445887445887446

# cross-validated accuracy of the tuned model: mean (standard deviation)
cv_results = cross_val_score(knn, x_test, y_test)
msg = "%s: %f (%f)" % ('KNN', cv_results.mean(), cv_results.std())
print(msg)
# KNN: 0.807018 (0.040441)
```