KNN Classification using Scikit Learn

Vishakha Ratnakar
6 min read · Jan 29, 2022


Implementation of the K-nearest neighbors algorithm using scikit-learn

We saw the workflow of a supervised machine learning model in my last blog (Workflow-of-Supervised-Learning-Algorithms). Let’s now study one of these supervised learning algorithms by putting it into practice according to that workflow.

Overview of KNN Model

Using KNN we can solve both classification and regression problems. In this blog, we will focus on the classification task.

K-Nearest Neighbors (KNN) is a supervised machine learning model. KNN makes predictions based on how similar the training observations are to new, incoming, unlabeled observations. KNN is also called a lazy learning algorithm: a lazy learner stores the data but does not build a model until new, unlabeled input is passed to it. As a result, learning takes less time, whereas classification takes more time.

The number of nearest neighbors or “K” is the main deciding factor in KNN. The majority voting principle lies at the core of KNN learning.

KNN Classification

Assume we have a dataset with two classes, “blue” and “green”. Now we must classify the new data point “?” to determine whether it belongs to the blue or the green class. Let the value of K be 3. The algorithm finds the three closest data points: two of them are green, while one is blue. As a result, the new data point is assigned to the majority class, which is “green”.
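As a minimal runnable sketch of this idea (the points and labels below are made up purely for illustration), scikit-learn’s KNeighborsClassifier can reproduce the majority vote with K = 3:

# Toy illustration of majority voting with K = 3 (made-up points)
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Two classes: 0 = "blue", 1 = "green"
X_toy = np.array([[1, 1], [1, 2], [2, 1],    # blue cluster
                  [6, 6], [6, 7], [7, 5]])   # green cluster
y_toy = np.array([0, 0, 0, 1, 1, 1])

knn_toy = KNeighborsClassifier(n_neighbors=3)
knn_toy.fit(X_toy, y_toy)

# The new point "?" sits near the green cluster, so its three
# nearest neighbours vote it into class 1 ("green")
print(knn_toy.predict([[6, 5]]))   # -> [1]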

Working of KNN Algorithm

  1. The KNN algorithm calculates the distance from the test data point to each row of training data.
  2. The distance can be calculated using the Euclidean, Manhattan, Hamming, or Minkowski distance. Generally, Euclidean distance is used.
  3. The distances are then sorted in ascending order, and the top K values are chosen.
  4. The test data point is classified according to the majority class among those K neighbors (a short sketch of these steps follows below).
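To make these steps concrete, here is a rough plain-NumPy sketch of the prediction for a single test point (the function and variable names are made up; this is not the scikit-learn implementation used later in this post):

# Illustrative NumPy sketch of the four steps above (names are made up)
import numpy as np
from collections import Counter

def knn_predict_one(x_test, X_train, y_train, k=3):
    # X_train and y_train are assumed to be NumPy arrays
    # Steps 1-2: Euclidean distance from the test point to every training row
    distances = np.sqrt(((X_train - x_test) ** 2).sum(axis=1))
    # Step 3: sort the distances and keep the indices of the k smallest
    nearest = np.argsort(distances)[:k]
    # Step 4: majority vote among the labels of the k nearest neighbours
    return Counter(y_train[nearest]).most_common(1)[0][0]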

Implementation

Step 1: Data Gathering

The first step is to read the data that we will use as input. For this example, we will use the Titanic dataset, loaded with the pandas library.

# Import initial libraries and the dataset
import numpy as np
import pandas as pd

df = pd.read_csv('data/titanic.csv')
df.head()
  • 891 observations
  • The 9 features are PassengerId, Pclass, Name, Sex, Age, Ticket, Fare, Cabin, and Embarked.
  • Target label: Survived (0 = did not survive, 1 = survived)

Step 2: Data pre-processing

  • Filling Missing values

We check the number of missing values for each feature using the “isnull” function. Calling the “sum” function on the result gives the number of missing values in each column.

# Finding the number of missing values for each feature
df.isnull().sum()

Age, Cabin and Embarked are three columns that have missing values.

We can observe that the Age column has a total of 177 missing values. Because this column is numerical, we can fill in the missing values with a statistic such as the mean, median, or mode. Here we use the mean of the non-null values to fill in the missing data.

# Finding and replacing the missing Age values with the mean of the non-null values
df['Age'] = df['Age'].replace(np.nan, df['Age'].mean())
df['Age'][:10]
  • Encoding of Categorical data

Then we use a basic “search and replace” strategy to encode the categorical data. “male” and “female” are the categorical values in the “Sex” column: male is replaced with 0, while female is replaced with 1.

# Encoding categorical data to numerical
Gender = {"Sex": {"male": 0, "female": 1}}
df = df.replace(Gender)
df.head()
  • Removing Irrelevant data

Irrelevant features can decrease the accuracy of the model and cause it to learn from information that has no bearing on the target. These features can be removed automatically or manually.

Features such as Name, PassengerId, Ticket, Cabin, and Embarked do not add useful information for predicting the target variable, so we remove these columns using the pandas drop function.

df.drop(['Name', 'PassengerId', 'Ticket', 'Embarked', 'Cabin'], axis=1, inplace=True)
df.head()

After pre-processing the data, we have the final dataset ready.

Note that different methods for preparing data are available, such as one-hot encoding, standard scaling, and so on.
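For instance, if we had kept the Embarked column, a rough sketch of one-hot encoding it with pandas and standardising the numeric columns with scikit-learn’s StandardScaler could look like this (not part of the pipeline used in this post):

# Optional pre-processing sketch (not used in this post)
from sklearn.preprocessing import StandardScaler

# One-hot encode a categorical column such as Embarked with pandas
df_encoded = pd.get_dummies(df, columns=['Embarked'])

# Standardise numeric columns so no single feature dominates the distance calculation
scaler = StandardScaler()
df_encoded[['Age', 'Fare']] = scaler.fit_transform(df_encoded[['Age', 'Fare']])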

Step 3: Decide on a model.

As the target variable is categorical, we will perform a classification task with the KNN algorithm.

Step 4: Split the Dataset.

First, we will split the dataset into inputs and a target. The input will be every column except ‘Survived’, which is the target variable. For this purpose, we use the pandas “drop” function.

# Split the dataset into input and target features
X = df.drop('Survived', axis=1)
y = df.Survived

The dataset will then be divided into training and testing data. The training data will be used to train the model, while the testing data will be used to see how well the model performs on data that hasn’t been seen before.

The function “train_test_split” in Scikit-Learn can be used to split our dataset.

# Split the input and target features into training and testing sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

The test_size parameter has been set to 0.3. This means that 30% of the data will be used for testing, while the remaining 70% will be used for training. Setting random_state to 1 guarantees that the split will be the same each time the code is run.

Step 5: Train the Model.

The next step is to train the model.

from sklearn.neighbors import KNeighborsClassifier

# Create the KNN classifier
knn = KNeighborsClassifier(n_neighbors=3)
# Fit the classifier to the training data
knn.fit(X_train, y_train)

We use scikit-learn’s built-in KNeighborsClassifier for this purpose. “n_neighbors” has been set to 3, so the majority class among the three closest neighbors will be used to label a new data point. Next, we call the “fit” method to train the model, passing two arguments: X_train and y_train.

Step 6: Evaluation

The model is now ready to generate predictions based on the test data. We use the “predict” function to make predictions.

# Make predictions on the test set
y_pred = knn.predict(X_test)

Finally, the accuracy_score function from sklearn.metrics is used to calculate the accuracy.

from sklearn.metrics import accuracy_score, confusion_matrix

knn_accuracy = round(accuracy_score(y_test, y_pred), 4)
print("Accuracy", knn_accuracy * 100)

Our model is 70.15% accurate.

A confusion matrix can be used to understand the performance of a classification model more thoroughly. It shows where our model gets confused when making predictions.

# Confusion matrix
print(confusion_matrix(y_test, y_pred))

We can see the values TN = 125, FP = 28, FN = 52, TP = 63.

Here TP = True Positive, FP = False Positive, FN = False Negative, TN = True Negative.
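For a binary 0/1 problem, scikit-learn’s confusion_matrix puts actual classes in rows and predicted classes in columns, so the four values can be unpacked directly:

# Unpack the binary confusion matrix: [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)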

Advantage

KNN does not make any assumptions about the underlying data distribution. As new training data is added, the model adjusts automatically for the given value of K, since no model is built until prediction time. Because KNN is a lazy learner, training is very fast, although prediction can take longer.

Disadvantage

The fundamental issue with the KNN algorithm is the curse of dimensionality. With a small number of input variables KNN performs well, but as the number grows it struggles to predict the output for new data points. Second, determining the best value of K for classification is difficult. KNN also does not perform well with imbalanced data.
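One common way to cope with the difficulty of choosing K is to try several values and compare their cross-validated accuracy; a quick sketch (the range of K values here is arbitrary) could look like this:

# Sketch: compare cross-validated accuracy for several values of K
from sklearn.model_selection import cross_val_score

for k in range(1, 16, 2):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    print("K =", k, "accuracy =", round(scores.mean(), 3))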

You can find the complete code along with the dataset on Github.

Thanks for reading.


Vishakha Ratnakar

Masters in Data Analytics from National University of Ireland, Galway. LinkedIn: www.linkedin.com/in/vishakha-ratnakar