Introduction to Machine Learning: k-Nearest Neighbors

Grace Zhang
Oct 25, 2018


As someone new to the world of Machine Learning, the very first algorithm I learned was k-NN (k-Nearest Neighbors). Even though it is one of the simplest algorithms you will ever encounter, it is actually quite useful.

How k-NN Works

Here’s the basic idea of how k-NN works: suppose the dataset contains two classes, Class A and Class B, and we have a new data point, the red, pentagon-shaped point on the plot below, whose class we want to predict. When k = 3, the 3 nearest neighbors of the new data point include 2 Class B points and 1 Class A point. The simple majority vote is Class B, so we predict that the new data point belongs to Class B.

Image source: https://helloacm.com/a-short-introduction-to-k-nearest-neighbors-algorithm/

To generalize, here are the steps of how k-NN works (a minimal code sketch of these steps follows the list):

  1. Get the labeled data ready
  2. Pick an appropriate k
  3. Get the new sample to classify
  4. Select the k entries that are closest to the new sample
  5. Take a simple majority vote to pick a category for the new sample
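
To make these steps concrete, here is a minimal from-scratch sketch in Python. The toy data, the variable names (X_train, y_train, new_point), and the use of Euclidean distance are illustrative assumptions, not part of the original example.

    import numpy as np
    from collections import Counter

    def knn_predict(X_train, y_train, new_point, k=3):
        """Classify new_point by a simple majority vote of its k nearest neighbors."""
        # Step 4: compute the (Euclidean) distance from the new sample to every labeled point
        distances = np.linalg.norm(X_train - new_point, axis=1)
        nearest = np.argsort(distances)[:k]          # indices of the k closest points
        # Step 5: simple majority vote among the k neighbors' labels
        votes = Counter(y_train[nearest])
        return votes.most_common(1)[0][0]

    # Steps 1-3: labeled data, a chosen k, and a new sample to classify
    X_train = np.array([[1.0, 1.1], [1.2, 0.9], [5.0, 5.2], [5.1, 4.8]])
    y_train = np.array(["A", "A", "B", "B"])
    print(knn_predict(X_train, y_train, new_point=np.array([4.9, 5.0]), k=3))  # -> "B"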

Properties of k-NN Algorithm

k-NN (k-Nearest Neighbors) is a supervised, non-parametric, instance-based classification model.

It is a supervised classification model because the data we have are labeled and we want to classify new data points into the existing categories. Take the well-known Iris dataset as an example: it contains 3 types of iris plants with 50 instances each, and each data point is labeled with the type of iris plant it belongs to. Without the labels, we would not be able to make predictions for new data points. Note, however, that the k-NN algorithm can also be used as a regression model; in that case we usually use the average (or median) of the k nearest neighbors' values to predict the output.
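
As a quick illustration of the regression use case, here is a minimal sketch using scikit-learn's KNeighborsRegressor, which predicts the average of the k nearest neighbors' target values; the toy data below is made up purely for illustration.

    import numpy as np
    from sklearn.neighbors import KNeighborsRegressor

    X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])  # single feature
    y = np.array([1.2, 1.9, 3.1, 3.9, 5.2])            # continuous target

    reg = KNeighborsRegressor(n_neighbors=3)           # predict the mean of 3 neighbors
    reg.fit(X, y)                                      # "fitting" just stores the data
    print(reg.predict([[2.5]]))                        # average of the 3 closest targets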

It is non-parametric because there is no assumed functional form of the data; before making the prediction, we do not make any assumptions about the data distribution. This also leads to the other property of the k-NN algorithm: it is instance-based, so there is no training phase. As long as we have the labeled data and the data points we want to make predictions for, we can make the predictions directly.

Two Parameters of k-NN

Although the k-NN algorithm is non-parametric, there are two parameters we usually set when building the model: k (the number of neighbors that will vote) and the distance metric.

There are no strict rules for selecting k; it depends on the dataset as well as your experience. When k is small, the prediction is easily affected by noise. As k grows, the impact of outliers is reduced, but more bias is introduced (consider increasing k all the way to n, the number of data points in the dataset: the prediction then simply becomes the majority class of the whole dataset). A common practical approach is to try several values of k and compare them, as sketched below.
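
One practical way to compare candidate values of k is cross-validation. Here is a minimal sketch on the Iris dataset; the candidate values of k below are arbitrary choices for illustration, not recommendations.

    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)
    for k in (1, 3, 5, 11, 21):
        scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
        print(f"k={k:>2}  mean accuracy={scores.mean():.3f}")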

The selection of the other parameter, the distance metric, also depends on the problem at hand. By default, people use Euclidean distance (the L2 norm), but Manhattan distance and Minkowski distance can also be good choices in certain scenarios.
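
Here is a small sketch comparing the three metrics on a single pair of points using SciPy; in scikit-learn, the same choice is exposed through the metric parameter of KNeighborsClassifier (for example metric="manhattan", or metric="minkowski" together with p).

    from scipy.spatial.distance import euclidean, cityblock, minkowski

    a, b = [1.0, 2.0], [4.0, 6.0]
    print(euclidean(a, b))       # L2 norm: 5.0
    print(cityblock(a, b))       # Manhattan / L1 norm: 7.0
    print(minkowski(a, b, p=3))  # Minkowski distance with p = 3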

Pros and Cons of k-NN

There are multiple advantages of using k-NN:

  1. It is a simple machine learning model that is easy to implement and interpret.
  2. There is no training phase of the model.
  3. There are no prior assumptions about the distribution of the data. This is especially helpful when the data do not follow any well-known distribution.
  4. Believe it or not, k-NN often achieves relatively high accuracy.

Of course, there are disadvantages of the model:

  1. High memory requirements. We need to store all the training data in memory in order to run the model.
  2. Computationally expensive. Recall that the model works by selecting the k nearest neighbors, which means we need to compute the distance between the new data point and all the existing data points, and that is computationally expensive.
  3. Sensitive to noise. If we pick a very small k, the prediction will be heavily affected by any noise in the data.

Implementation of k-NN

It is actually quite easy to implement the k-NN algorithm using the Python scikit-learn library, and you can even implement the algorithm from scratch. Here is a great example: Introduction to k-Nearest Neighbors: Simplified (with implementation in Python).
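
As a minimal sketch, here is what a scikit-learn version might look like on the Iris dataset mentioned earlier; the choices of k = 3, the test split size, and the random seed are arbitrary assumptions for illustration.

    from sklearn.datasets import load_iris
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

    knn = KNeighborsClassifier(n_neighbors=3)
    knn.fit(X_train, y_train)                  # for k-NN, "fit" simply stores the training data
    print(accuracy_score(y_test, knn.predict(X_test)))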

The End

This is a summary of my understanding of the k-NN algorithm after taking my first real machine learning class in the MS in Data Science program at the University of San Francisco. I hope this summary helps, and I am more than happy to take any comments.

References:

MSDS621 Machine Learning by Brian Spiering

A Quick Introduction to K-Nearest Neighbors Algorithm by Adi Bronshtein

A Complete Guide to K-Nearest-Neighbors with Applications in Python and R by Kevin Zakka
