Machine Learning. k-Nearest Neighbors.

Give us a message if you’re interested in Blockchain and FinTech software development or just say Hi at Pharos Production Inc.

Or follow us on YouTube to learn more about Software Architecture, Distributed Systems, Blockchain, High-load Systems, Microservices, and Enterprise Design Patterns.

Pharos Production Youtube channel

Today we will talk about a simple classification task, one of the basic machine learning examples you can find in various articles and books: flower classification. We will also dive into the details of the k-Nearest Neighbors algorithm and its options.

Iris Classifier.

Brief.

We will use the Iris dataset from Scikit-Learn. We're going to classify every new iris flower by its 4 measured features: petal width and length, and sepal width and length. As the target classification result, we should assign each flower to one of 3 available classes: Setosa, Virginica, and Versicolour.

3 target classes

Data Set Characteristics.

Here are the full dataset characteristics. We have 50 instances in each class, with 4 attributes defined as real numbers measured in cm. And, as mentioned above, 3 target classes.

Data Set Characteristics:
:Number of Instances: 150 (50 in each of three classes)
:Number of Attributes: 4 numeric, predictive attributes and the class
:Attribute Information:
- sepal length in cm
- sepal width in cm
- petal length in cm
- petal width in cm
- class:
  - Iris-Setosa
  - Iris-Versicolour
  - Iris-Virginica

Let’s get all available information from the dataset and split it into 75% training data and 25% test data. We keep 25% of the data unavailable for training to be sure our model generalizes well. That means it will work well with new data and not only with the data it was trained on.

Capital X is used because we follow the convention from math here: X is a matrix and y is a vector. We also fix the random state of the pseudo-random number generator to be sure we get the same split every time we run the code.

Data set description
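
A minimal sketch of this step, assuming scikit-learn's load_iris and train_test_split (the default test_size of 0.25 gives the 75/25 split); the print statements correspond to the output below.

# Load the Iris dataset and split it into training and test sets.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris_dataset = load_iris()
print("Keys of iris_dataset:\n{}".format(iris_dataset.keys()))
print("Target names: {}".format(iris_dataset['target_names']))
print("Feature names:\n{}".format(iris_dataset['feature_names']))
print("Shape of data: {}".format(iris_dataset['data'].shape))
print("First 5 rows of data:\n{}".format(iris_dataset['data'][:5]))
print("Shape of target: {}".format(iris_dataset['target'].shape))
print("Target:\n{}".format(iris_dataset['target']))

# X is the feature matrix, y is the label vector; random_state=0 makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset['data'], iris_dataset['target'], random_state=0)
print("X_train shape: {}, y_train shape: {}".format(X_train.shape, y_train.shape))
print("X_test shape: {}, y_test shape: {}".format(X_test.shape, y_test.shape))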
Keys of iris_dataset: 
dict_keys(['target', 'data', 'feature_names', 'target_names', 'DESCR'])

Target names: ['setosa' 'versicolor' 'virginica']

Feature names:
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']

Shape of data: (150, 4)

First 5 rows of data:
[[ 5.1 3.5 1.4 0.2]
[ 4.9 3. 1.4 0.2]
[ 4.7 3.2 1.3 0.2]
[ 4.6 3.1 1.5 0.2]
[ 5. 3.6 1.4 0.2]]

Shape of target: (150,)

Target:
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2]

X_train shape: (112, 4), y_train shape: (112,)
X_test shape: (38, 4), y_test shape: (38,)

It’s a good idea to look through the dataset by plotting a scatter matrix of all feature combinations. We also need to be sure there are no abnormalities in the data, for example some petals measured in inches and some in cm. We will use the Pandas library to build a pair plot of all possible pairs of features.

Scatter Plot
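
A minimal sketch of the pair plot, assuming pandas' scatter_matrix and the mglearn helper package (mglearn.cm3 is just a three-color colormap); it continues from the X_train / y_train split above.

# Build a DataFrame from the training data, labeled with the feature names.
import pandas as pd
import matplotlib.pyplot as plt
import mglearn

iris_dataframe = pd.DataFrame(X_train, columns=iris_dataset['feature_names'])

# Scatter matrix of all feature pairs, with points colored by the training labels.
pd.plotting.scatter_matrix(iris_dataframe, c=y_train, figsize=(15, 15),
                           marker='o', hist_kwds={'bins': 20}, s=60,
                           alpha=.8, cmap=mglearn.cm3)
plt.show()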

And this is the result. As you can see, the three classes are well separated by the petal and sepal measurements. That means a model will be able to learn to separate them.

Scatter Matrix

Supervised Model.

Now let’s train a simple classifier. We will use a k-Nearest Neighbors classifier this time. It’s not the best one, but it should work pretty well when you have a small number of features and the classes in the dataset are clearly separable from each other. The algorithm searches for the nearest point in the training set and assigns that training point’s label to the new data point. It can also take a vote among the k closest neighbors, but for now it’s just one. The accuracy is 97%, pretty good!

Classifier.
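
A minimal sketch of the classifier step with scikit-learn's KNeighborsClassifier, continuing from the split above; the measurements of the new flower are illustrative.

# Train a 1-nearest-neighbor classifier on the iris training data.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)

# Predict the class of a new, previously unseen flower (illustrative measurements).
X_new = np.array([[5.0, 2.9, 1.0, 0.2]])
prediction = knn.predict(X_new)
print("Prediction: {}".format(iris_dataset['target_names'][prediction]))

# Accuracy on the held-out test set, roughly 0.97 for this split.
print("Test set score: {:.2f}".format(knn.score(X_test, y_test)))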

k-Nearest Neighbors.

Now let’s take a look at k-NN itself. It is arguably the simplest machine learning algorithm. To make a prediction for a new data point, the algorithm finds the closest data points in the training set. k-NN can be used both for classification and for regression tasks.

Classification.

In the simplest version, k-NN takes the closest point from the training set and assigns its class to the new point being predicted. Instead of considering only the closest neighbor, we can also consider an arbitrary number of neighbors, the k nearest. In that case we use voting to assign the label: for each test point, we count how many neighbors belong to each class and assign the most frequent class.

k-NN classifier

Let’s try it on the synthetic forge dataset from the mglearn package. The result is a visualization of the decision boundaries for 1, 3, and 9 neighbors, produced as in the sketch below. As you can see, a single neighbor creates a boundary that follows the training data closely, while more neighbors lead to a smoother decision boundary. A smoother boundary corresponds to a simpler model.

k-NN classifier
k-NN classifier decision boundaries
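
A sketch of how such a comparison can be produced, assuming the mglearn helper package and matplotlib are available.

# Compare decision boundaries for 1, 3, and 9 neighbors on the forge dataset.
import matplotlib.pyplot as plt
import mglearn
from sklearn.neighbors import KNeighborsClassifier

X, y = mglearn.datasets.make_forge()

fig, axes = plt.subplots(1, 3, figsize=(10, 3))
for n_neighbors, ax in zip([1, 3, 9], axes):
    clf = KNeighborsClassifier(n_neighbors=n_neighbors).fit(X, y)
    # Shade the regions assigned to each class, then overlay the training points.
    mglearn.plots.plot_2d_separator(clf, X, fill=True, eps=0.5, ax=ax, alpha=.4)
    mglearn.discrete_scatter(X[:, 0], X[:, 1], y, ax=ax)
    ax.set_title("{} neighbor(s)".format(n_neighbors))
plt.show()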

Let’s check how accuracy depends on the number of neighbors. We will use the real-world breast cancer dataset from Scikit-Learn. With one nearest neighbor, the prediction on the training set is perfect. With more neighbors, the model becomes simpler and training accuracy drops. With one neighbor, test accuracy is low, indicating that using the single nearest neighbor leads to a model that is too complex. On the other hand, with 10 neighbors the model is too simple and performance is even worse. The best performance is in the middle, at around six neighbors; even the worst test accuracy is around 88%, which may still be acceptable.

Checking how accuracy depends on the number of neighbors
Neighbors-accuracy plot
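
A sketch of this experiment, assuming scikit-learn's load_breast_cancer; the stratified split and the random state are illustrative choices.

# Record training and test accuracy for 1 to 10 neighbors on the breast cancer data.
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=66)

training_accuracy, test_accuracy = [], []
neighbors_settings = range(1, 11)
for n_neighbors in neighbors_settings:
    clf = KNeighborsClassifier(n_neighbors=n_neighbors).fit(X_train, y_train)
    training_accuracy.append(clf.score(X_train, y_train))  # accuracy on training set
    test_accuracy.append(clf.score(X_test, y_test))        # accuracy on held-out data

plt.plot(neighbors_settings, training_accuracy, label="training accuracy")
plt.plot(neighbors_settings, test_accuracy, label="test accuracy")
plt.xlabel("n_neighbors")
plt.ylabel("Accuracy")
plt.legend()
plt.show()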

Regression.

There is also a regression variant of the k-NN algorithm. Let’s look at how it uses the training set to make a prediction. When using multiple nearest neighbors, the prediction is the average, or mean, of the relevant neighbors. The score method for regression returns the R² score, the coefficient of determination, which yields a value between 0 and 1: 1 is a perfect prediction, and 0 corresponds to a constant model that just predicts the mean of the training set responses.

k-NN regression
k-NN snippet
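
A minimal sketch of k-NN regression with scikit-learn's KNeighborsRegressor, using the synthetic wave dataset from the mglearn helper package.

# Fit a 3-nearest-neighbors regressor and score it with R^2 on the test set.
import mglearn
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

X, y = mglearn.datasets.make_wave(n_samples=40)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

reg = KNeighborsRegressor(n_neighbors=3)  # prediction is the mean of the 3 nearest targets
reg.fit(X_train, y_train)
print("Test set predictions:\n{}".format(reg.predict(X_test)))
print("Test set R^2: {:.2f}".format(reg.score(X_test, y_test)))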

Let’s see what the predictions made by k-NN look like for different numbers of neighbors. Using only a single neighbor, each point in the training set has an obvious influence on the predictions, and the predicted values go through all of the data points. Considering more neighbors leads to smoother predictions, but these do not fit the training data as well.

k-NN regression samples
k-NN prediction distribution based on number of neighbors
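
A sketch of how these prediction curves can be drawn, evaluating each model on a dense line of input points; the dataset and split mirror the regression sketch above.

# Compare regression predictions for 1, 3, and 9 neighbors on the wave dataset.
import numpy as np
import matplotlib.pyplot as plt
import mglearn
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

X, y = mglearn.datasets.make_wave(n_samples=40)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
line = np.linspace(-3, 3, 1000).reshape(-1, 1)  # dense grid of points to predict on

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for n_neighbors, ax in zip([1, 3, 9], axes):
    reg = KNeighborsRegressor(n_neighbors=n_neighbors).fit(X_train, y_train)
    ax.plot(line, reg.predict(line))                 # prediction curve
    ax.plot(X_train, y_train, '^', markersize=8)     # training data
    ax.plot(X_test, y_test, 'v', markersize=8)       # test data
    ax.set_title("{} neighbor(s)".format(n_neighbors))
plt.show()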

Conclusion.

There are two important parameters to the k-NN classifier: the number of neighbors and how it measures the distance between points. In practice, a small number of neighbors like 3 or 5 often works well. By default, Euclidean distance is used, which also works well in many cases. The model is very easy to understand and gives reasonable performance without many adjustments on small training sets. On a large training set, the model can be slow. It also doesn’t perform well on datasets with many features (more than 100 or so) and does particularly badly on sparse datasets (where most features are 0).

That’s all for now. We should say thanks to the author of this book. Feel free to buy it, it’s really cool. All source code is available in the GitHub repo.

Thanks for reading!


Dmytro Nasyrov
Pharos Production

We build high-load software. Pharos Production founder and CTO.