KNN Classifier Implementation: Best Practices and Tips (PART I)

Madhuri Patil
7 min read · Nov 18, 2023


KNN, or k-nearest neighbors, is a simple yet powerful machine learning algorithm used for classification and regression tasks.

It rests on a basic idea: similar data points tend to cluster together in the feature space. By examining the nearest data points, KNN applies the principle that a point's neighbors can provide valuable information about its class or value.

KNN is a non-parametric, instance-based algorithm. It doesn't build a model during training like a model-based algorithm; instead, it memorizes the entire training dataset.

When a prediction is needed, KNN examines the entire training dataset to find the 'k' nearest neighbors of the query point (the data point to be predicted).

You can follow these basic steps to build the KNN algorithm:

  1. Choose a value for k, the number of nearest neighbors to consider.
  2. For a new data point, calculate the distance to all the training data points using a distance metric such as Euclidean, Manhattan, or Minkowski distance.
  3. Find the k closest training data points to the new query point based on the distance metric.
  4. Assign a class (or value) to the query point based on those k closest training points, e.g., by majority vote for classification.
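
To make these steps concrete, here is a minimal from-scratch sketch in NumPy. It assumes Euclidean distance and majority voting; the function name knn_predict and the variables X_train, y_train, and query are illustrative, not part of any particular library.

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, query, k=3):
    # Step 2: compute the Euclidean distance from the query point
    # to every training point.
    distances = np.linalg.norm(X_train - query, axis=1)
    # Step 3: find the indices of the k closest training points.
    nearest = np.argsort(distances)[:k]
    # Step 4: assign the most common class among those k neighbors.
    return Counter(y_train[nearest]).most_common(1)[0][0]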

In this article, you will learn how to apply the KNN algorithm to build a classifier using the scikit-learn library in Python.

Scikit-learn is a popular and easy-to-use Machine Learning library that provides various tools and algorithms for data analysis and modeling.

Along with scikit-learn, we will use the NumPy library for numerical computations, Pandas for data manipulation, and Matplotlib and Seaborn for data visualization.

Additionally, you can use Jupyter Notebook or any other code editor for the implementation.
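
Assuming a typical setup, the imports used in the rest of this article would look something like this (the aliases np, pd, plt, and sns are common conventions, not requirements):

# Common imports assumed throughout this article
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns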

K-neighbor Classifier

Data Points and Features

At the heart of KNN are data points, each representing an observation in the dataset. Each of these data points has a set of attributes that describe it. Now, these features can be numerical, categorical, or a mix of both, depending on the nature of the data.

One thing you must consider before working with KNN is data transformation, as KNN works only with numerical data.

It is recommended to transform data into a numerical format and standardize or normalize it before using the KNN algorithm.
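
As a hedged example, here is one common way to standardize features with scikit-learn's StandardScaler before fitting KNN (X stands for the feature matrix; in practice, fit the scaler on the training split only and reuse it to transform the test split so no information leaks from the test set):

from sklearn.preprocessing import StandardScaler

# Standardize features to zero mean and unit variance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)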

For demonstration purposes, we will use randomly generated data. Here, we are not seeking to create an accurate model, but rather trying to understand how the algorithm works.

So, let’s keep it simple. We will use a dataset of 1000 samples with two features. The scatterplot below shows the distribution of our randomly generated data values.
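
The article doesn't show how the data was generated; one way to create a comparable toy dataset (an assumption, not the author's exact code) is with scikit-learn's make_blobs:

import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_blobs

# 1000 samples, two features, two clusters (classes 0 and 1)
X, y = make_blobs(n_samples=1000, n_features=2, centers=2, random_state=42)

# Scatterplot of the two clusters, colored by class label
sns.scatterplot(x=X[:, 0], y=X[:, 1], hue=y)
plt.show()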

In the above plot, you can see there are two clusters, one for each class. The data points in blue represent class 0, and those in orange class 1.

Before implementing the KNN algorithm, let’s split the data into training and test sets to evaluate the model performance on unseen data and to avoid overfitting.

One of the common tools for that is the train_test_split function from the sklearn.model_selection module.

# Split the dataset into training and testing sets in a ratio of 80:20
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2,
                                                    random_state=42)

Model Building

To implement KNN, scikit-learn provides the KNeighborsClassifier class in the neighbors module.

Now, we import and initialize the class. This is where you can set parameters such as the number of neighbors to use via the n_neighbors parameter. Let’s set it to 3.

# Import the classifier
from sklearn.neighbors import KNeighborsClassifier

# Create a KNN classifier instance
clf = KNeighborsClassifier(n_neighbors=3)

In the above example, we instantiate the classifier with three neighbors and keep all other settings at their defaults.

Model Training

Next, we need to fit the model on the training data. For that, we’ll use the classifier’s fit method, passing the training set as arguments.

# Fit the model on training data
clf.fit(x_train, y_train)

Here, KNeighborsClassifier stores the entire training set in memory so that it can compute the neighbors at prediction time.

Model Predictions

Now, we make predictions on the test data using the classifier’s predict method.

# Make predictions
predictions = clf.predict(x_test)

For each data point in the test set, the classifier computes its three nearest neighbors (since we set the number of neighbors to three) and picks the most common class among them.

Model Evaluation

To evaluate the model’s performance on the test data, we can use the accuracy_score function from scikit-learn. It tells us how well our model generalizes to unseen data.

from sklearn.metrics import accuracy_score
score = accuracy_score(y_test, predictions)

print(f"Test Accuracy: {score}")
# Output - Test Accuracy: 0.97

You can see that our model is about 97% accurate, meaning the model predicted the class correctly for 97% of the samples in the test dataset.
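
Equivalently, the classifier’s score method computes the same accuracy directly from the test features and labels:

# score() predicts on x_test internally and compares against y_test
score = clf.score(x_test, y_test)
print(f"Test Accuracy: {score}")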

Analyzing KNeighborsClassifier

The classifier provides the kneighbors method. It returns information about the k nearest neighbors of a query point.

It returns two arrays: the distances from the query point to each of its nearest neighbors, and the indices of those neighbors in the training set.

Using this information, let’s analyze the prediction for a single query point using two classifiers defined with different numbers of neighbors.
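
As a rough sketch of that comparison (not the author’s exact code; f1 and f2 denote the two feature values of the chosen query point, as in the later snippets):

# Fit two classifiers that differ only in the number of neighbors
clf_1 = KNeighborsClassifier(n_neighbors=1).fit(x_train, y_train)
clf_2 = KNeighborsClassifier(n_neighbors=2).fit(x_train, y_train)

query = [[f1, f2]]

# Distances and training-set indices of the nearest neighbors
print(clf_1.kneighbors(query))
print(clf_2.kneighbors(query))

# Predictions for the same query point
print(clf_1.predict(query))
print(clf_2.predict(query))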

The above plot shows the predictions made by two different KNN models, one with a single neighbor and the other with two nearest neighbors, for the same query point, which is shown as a plus marker in the plots.

The plus data point is colored the same as the class label predicted by the classifier model.

In the left figure, the classifier finds the single nearest neighbor and considers only that neighbor when making the prediction for the query point.
You can see that the new data point is predicted as class 2.

In the right figure, the classifier considers two nearest neighbors to make the prediction. We can see that one neighbor belongs to class 0 (blue dot) while the other belongs to class 2 (orange dot).

However, now our model predicts the same query point as class 0, marked in blue, even though the query point is slightly closer to the orange data point than to the blue one.

So, what is happening here? Let’s figure it out by exploring the other parameters of the KNN classifier.

The get_params() method returns all the parameters of the classifier and their current values.

## Get the parameters
>>> clf.get_params()
{'algorithm': 'auto',
'leaf_size': 30,
'metric': 'minkowski',
'metric_params': None,
'n_jobs': None,
'n_neighbors': 2,
'p': 2,
'weights': 'uniform'}

This is the list of parameters for the classifier with two nearest neighbors. Notice the last parameter, weights; it is currently set to 'uniform', which is also its default value.

This means the classifier uses uniform weights when predicting the query point: the prediction is made by a simple majority vote among the nearest neighbors, treating all of them equally.

Therefore, in the case of a tie, there is no reason for the classifier to prefer one class over the other.

Let’s first understand how predictions are made.

For classification, KNN employs a majority voting mechanism among the ‘k’ neighbors. It counts the occurrences of each class within the ‘k’ neighbors and assigns the class with the highest count as the predicted class to the query point.

When the weights are uniform, scikit-learn uses the mode function from the scipy.stats module to find the most common class label among the nearest neighbors.

If more than one value is tied for the highest count, only the smallest of them is returned. In our case, the two classes are in competition: each occurs exactly once, so the smaller class label is returned, namely class 0.
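
You can check this tie-breaking behaviour directly; SciPy documents that mode returns the smallest of the tied values:

from scipy import stats

# Both labels occur exactly once; mode() breaks the tie by
# returning the smallest value, so class 0 wins.
tie = stats.mode([0, 2])
print(tie.mode)  # -> 0 (possibly wrapped in an array, depending on the SciPy version)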

# Get the nearest neighbors information
>>> neighbors = clf.kneighbors([[f1, f2]])
>>> neighbors
(array([[0.30681531, 0.52552856]]), array([[737, 30]], dtype=int64))
>>> y_train[737]
2
>>> y_train[30]
0

Therefore, it doesn’t matter at all that the blue point is farther away from the query point. What matters is the smaller class label.

However, KNN is a distance-based algorithm, so wouldn’t it be better to take distance into account when making the decision?

For that, you can change the weights parameter from uniform to distance. Now, the classifier assigns each nearest neighbor a weight equal to the inverse of its distance from the query point.

Therefore, shorter distances get higher weights and vice versa.
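
As a quick back-of-the-envelope check using the two distances returned by kneighbors above:

# Distances from the kneighbors output above
d_class2 = 0.30681531   # nearest neighbor, labeled class 2
d_class0 = 0.52552856   # second neighbor, labeled class 0

# Inverse-distance weights
w_class2 = 1 / d_class2   # ~3.26
w_class0 = 1 / d_class0   # ~1.90

# Class 2 accumulates more weight, so it wins the vote.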

Now, let’s see the effect of weights on prediction if we set weights to distance.

# Predictions of new query point with 2 nearest neighbors and distance weights.
>>> clf = KNeighborsClassifier(n_neighbors=2, weights='distance')
>>> clf.fit(x_train, y_train)

>>> clf.predict([[f1, f2]])
array([2])

Now the classifier clf predicts that the query point belongs to class 2 instead of class 0, because the prediction takes distance into account rather than falling back on the smaller class label.

However, using distance weights is not always a good choice. Sometimes uniform weights give better results.

In this article, we discussed the implementation of the k-nearest neighbors algorithm using scikit-learn. We also analyzed the classifier with one and with two nearest neighbors, and discussed the effect of uniform and distance weights on its predictions.

In the next part, we will explore some best practices and tips for optimizing our classifier, and the effect of different values of ‘k’ on its performance using decision boundaries. You can read that article here.

I hope this article helps you understand the workings of the K Nearest Neighbor algorithm.

Thank you so much for reading! 🙏😊

🔗 Affiliate links
Master machine learning with scikit-learn: check out this amazing course by Kevin at Data School if you want to learn machine learning in depth with scikit-learn.
