K-Nearest Neighbors: All you need to know

Pushkar Raj · Analytics Vidhya · Nov 11, 2019

I have never been a fan of showing off my knowledge by blogging rather than writing papers or even a book, but I felt rusty explaining K-Nearest Neighbors (K-NN) to one of my colleagues earlier today. Hence, in this post I'm explaining everything you need to know about K-NN.

You are the average of the five people living around you.

Table of Contents:

  1. Introduction
  2. How it works
  3. What is this 'K' in K-NN (choosing K is also called hyperparameter tuning in this case)
  4. Creating Movies Recommender System using K-NN
  5. Where to go from here
Nearest Neighbor
My neighbor does this, so I'm doing it as well.

{1} Introduction

Despite being considered one of the simplest machine learning algorithms (perhaps because of the easy-to-relate examples around us), K-NN has proved to be remarkably useful for solving real-world problems.

When to use:

It is a supervised machine learning algorithm (the output labels are provided) that can solve both classification and regression problems. Although it is most widely used for classification, it is effective for regression as well, though that usage is far less common.
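To make the distinction concrete, here is a minimal sketch (using scikit-learn on a tiny made-up dataset, not part of this post's data) of K-NN doing both jobs:

# Toy sketch: the same K-NN idea used for classification and for regression
import numpy as np
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

X = np.array([[1, 1], [2, 1], [8, 8], [9, 9]])  # made-up feature matrix
y_class = np.array([0, 0, 1, 1])                # class labels
y_reg = np.array([1.0, 1.2, 8.5, 9.1])          # continuous targets

clf = KNeighborsClassifier(n_neighbors=3).fit(X, y_class)
reg = KNeighborsRegressor(n_neighbors=3).fit(X, y_reg)

print(clf.predict([[2, 2]]))  # majority vote among the 3 nearest labels
print(reg.predict([[2, 2]]))  # mean of the 3 nearest targets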

{2} How does it work

Let's understand K-NN with a simple example where we have to find the value of the empty circle based on the distance to its nearest neighbors (mind you, I've used the Minkowski distance of degree 2, i.e. Euclidean distance, here). Within the red boundary we can see that the single nearest value to our empty circle is negative, so we assign the empty circle a negative value. Obvious, right?

My Nearest-Friend

NOTE: Ideally, you should not choose an even value of K while running K-NN, since a 50–50 split among the neighbors creates an ambiguous tie.

Different ways to calculate the distance

Minkowski distance (Euclidean, Manhattan, etc.)

Chebyshev distance

Cosine Similarity

Hamming distance
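As a quick illustration of these metrics (the vectors are made up for demonstration), scipy exposes all of them directly:

# Toy vectors to illustrate the distance measures listed above
import numpy as np
from scipy.spatial import distance

a = np.array([1, 0, 2, 3])
b = np.array([2, 1, 0, 3])

print(distance.minkowski(a, b, p=2))  # Minkowski with p=2 is the Euclidean distance
print(distance.minkowski(a, b, p=1))  # Minkowski with p=1 is the Manhattan distance
print(distance.chebyshev(a, b))       # largest coordinate-wise difference
print(1 - distance.cosine(a, b))      # cosine similarity (scipy returns cosine *distance*)
print(distance.hamming([1, 0, 1, 1], [1, 1, 0, 1]))  # fraction of positions that differ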

Now, let's look at the picture below, where we need to find the value of the empty circle on the basis of its neighbors, but we've just run into a dilemma of choice.

Change in the predicted value as the value of K changes

The inner purple boundary suggests the value is negative, while the outer yellow boundary says it's positive. That's where the concept of K in K-NN comes in.

{3} What is this ‘K’ in K-NN

Let's consider a real example to find the value of K. This is what our dataset looks like, and we're going to find the value of K for this problem.

Now, there are two popular ways to find the value of K in K-NN:

(a) Elbow Curve

(b) Grid Search

Let's look at the Elbow Curve to find the value of K.

Elbow Curve

So, our elbow curve says the value of K = 8.

So, once we find the value of K, the problem is essentially solved. Now all we need to do is predict the values and check our accuracy measures. Let's have a look at it (you may find the Notebook here).

Accuracy Score
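Neither the notebook nor the plots are reproduced here, but a minimal sketch of how such an elbow curve and accuracy score might be produced (assuming X_train, X_test, y_train, y_test already exist from a train/test split of the dataset) looks roughly like this:

# Sketch: elbow curve and accuracy check for K-NN
# (X_train, X_test, y_train, y_test are assumed to exist from a train/test split)
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

error_rate = []
for k in range(1, 30):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    error_rate.append(1 - accuracy_score(y_test, knn.predict(X_test)))

plt.plot(range(1, 30), error_rate, marker='o')
plt.xlabel('K')
plt.ylabel('error rate')
plt.title('Elbow Curve')  # pick the K where the curve flattens out (K = 8 above)
plt.show()

# the Grid Search alternative mentioned above
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(KNeighborsClassifier(), {'n_neighbors': list(range(1, 30))}, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_, accuracy_score(y_test, grid.predict(X_test)))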

{4} Movies Recommender System using K-NN

Recommender systems can be loosely broken down into three categories:

>>> Content Based Systems: These utilize a series of discrete characteristics of an item in order to recommend additional items with similar properties.

>>> Collaborative Filtering Systems: These build a model from a user's past behavior (items previously purchased or selected and/or numerical ratings given to those items) as well as similar decisions made by other users. The model is then used to predict items (or ratings for items) that the user may have an interest in.

>>> Hybrid Systems: These combine the previous two approaches. Most businesses prefer a hybrid approach in their production recommender systems.

Step 1: Importing necessary libraries

# Import necessary libraries
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
%matplotlib inline
from IPython.display import Image
# utils import
from fuzzywuzzy import fuzz

Step 2: Reading the dataset

# read data
df_movies = pd.read_csv(
    'http://khansingh.xyz/Datasets/movies.csv',
    usecols=['movieId', 'title'],
    dtype={'movieId': 'int32', 'title': 'str'})

df_ratings = pd.read_csv(
    'http://khansingh.xyz/Datasets/ratings.csv',
    usecols=['userId', 'movieId', 'rating'],
    dtype={'userId': 'int32', 'movieId': 'int32', 'rating': 'float32'})

Note: Although KNN makes no assumptions about the underlying data distribution, it does rely on item feature similarity. When KNN makes an inference about a movie, it calculates the "distance" between the target movie and every other movie in its database, ranks those distances, and returns the top K nearest movies as the most similar recommendations.

Step 3: Data Manipulation

from scipy.sparse import csr_matrix

# pivot ratings into movie features
df_movie_features = df_ratings.pivot(
    index='movieId',
    columns='userId',
    values='rating'
).fillna(0)

# create mapper from movie title to index
movie_to_idx = {
    movie: i for i, movie in
    enumerate(list(df_movies.set_index('movieId').loc[df_movie_features.index].title))
}

# convert dataframe of movie features to scipy sparse matrix
mat_movie_features = csr_matrix(df_movie_features.values)

Now, have a look at our derived dataset, which is sparse for obvious reasons.

Sparse table

Let's make the table easier to read by using movie titles as columns:

# df_complete is df_ratings merged with df_movies on movieId (built in the notebook)
df_movie_features_name = df_complete.pivot_table(
    index='userId',
    columns='title',
    values='rating'
).fillna(0)

Step 4: Data Visualization

# df_movie_cnt holds the number of ratings per movie (see the sketch before the filter below)
ax = df_movie_cnt \
    .sort_values('movie_cnt', ascending=False) \
    .reset_index(drop=True) \
    .plot(
        figsize=(12, 8),
        title='Rating Frequency of All Movies',
        fontsize=12
    )
ax.set_xlabel("movie Id")
ax.set_ylabel("number of ratings")

Now, let's look at this graph.

Getting an idea of our dataset

As we can see from the data, the vast majority of people aren't interested in rating movies, so we consider only roughly the top 40% of our data.
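The plot above and the filter below rely on a few helper frames (df_movie_cnt, df_users_cnt, df_ratings_drop_movies) that are built in the notebook but not shown in the post; a minimal sketch of how they can be constructed from df_ratings (the popularity threshold of 50 is an assumption mirroring the user filter) is:

# Sketch of the helper frames used above and below (reconstructed; see the notebook)
# number of ratings each movie received
df_movie_cnt = df_ratings.groupby('movieId').size().to_frame('movie_cnt')

# keep only movies with at least `popularity_thres` ratings (threshold assumed here)
popularity_thres = 50
popular_movies = list(set(df_movie_cnt.query('movie_cnt >= @popularity_thres').index))
df_ratings_drop_movies = df_ratings[df_ratings.movieId.isin(popular_movies)]

# number of movies each user has rated (column named movie_cnt to match the filter below)
df_users_cnt = df_ratings_drop_movies.groupby('userId').size().to_frame('movie_cnt')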

# filter data: keep only users who have rated at least `ratings_thres` movies
ratings_thres = 50
active_users = list(set(df_users_cnt.query('movie_cnt >= @ratings_thres').index))
df_ratings_drop_users = df_ratings_drop_movies[df_ratings_drop_movies.userId.isin(active_users)]
print('shape of original ratings data: ', df_ratings.shape)
print('shape of ratings data after dropping both unpopular movies and inactive users: ', df_ratings_drop_users.shape)

Step 5: Now, one of the most crucial parts of building the model: choosing the appropriate distance measure for the nearest-neighbor search

Notice that we now have a pretty high-dimensional dataset to deal with. While the pivot was necessary, high dimensionality is not kind to KNN. Euclidean distance becomes unhelpful in high dimensions because all vectors end up almost equidistant from the search query vector (the target movie's feature vector). Instead, we will use cosine similarity for the nearest-neighbor search.

Cosine similarity is advantageous because even if two similar vectors or documents are far apart by Euclidean distance (due to the size of the documents), chances are they are still oriented close together. The smaller the angle, the higher the cosine similarity.

When plotted in a multi-dimensional space, where each dimension corresponds to a feature in the feature vector, cosine similarity captures the orientation (the angle) of the feature vectors and not their magnitude. If you care about the magnitude, compute the Euclidean distance instead.
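As a quick worked example (with made-up vectors, not from the dataset), two rating vectors that point in almost the same direction keep a cosine similarity near 1 even when their magnitudes, and hence their Euclidean distance, differ wildly:

# Toy example: cosine similarity ignores magnitude, Euclidean distance does not
import numpy as np

u = np.array([1.0, 2.0, 3.0])
v = 10 * u + np.array([0.5, -0.5, 0.0])  # much longer vector, almost the same direction

cos_sim = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
euclid = np.linalg.norm(u - v)

print(cos_sim)  # close to 1: nearly identical orientation
print(euclid)   # large: dominated by the difference in magnitude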

There is also another popular approach to nearest-neighbor search in high-dimensional data: locality-sensitive hashing.

from sklearn.neighbors import NearestNeighbors
model_knn = NearestNeighbors(metric='cosine', algorithm='brute', n_neighbors=20, n_jobs=-1)

Fitting the model:

model_knn.fit(mat_movie_features)

Let's add a fuzzy title matcher and a recommendation helper to make the predictions readable:

def fuzzy_matching(mapper, fav_movie, verbose=True):
    """
    Return the index of the closest title match via fuzzy ratio.
    If no match is found, return None.

    Parameters
    ----------
    mapper: dict, map movie title name to index of the movie in data
    fav_movie: str, name of user input movie
    verbose: bool, print log if True

    Return
    ------
    index of the closest match
    """
    match_tuple = []
    # get match
    for title, idx in mapper.items():
        ratio = fuzz.ratio(title.lower(), fav_movie.lower())
        if ratio >= 60:
            match_tuple.append((title, idx, ratio))
    # sort by fuzzy ratio, best match first
    match_tuple = sorted(match_tuple, key=lambda x: x[2])[::-1]
    if not match_tuple:
        print('Oops! No match is found')
        return None
    if verbose:
        print('Found possible matches in our database: {0}\n'.format([x[0] for x in match_tuple]))
    return match_tuple[0][1]


def make_recommendation(model_knn, data, mapper, fav_movie, n_recommendations):
    """
    Return top n similar movie recommendations based on user's input movie.

    Parameters
    ----------
    model_knn: sklearn model, knn model
    data: movie-user matrix
    mapper: dict, map movie title name to index of the movie in data
    fav_movie: str, name of user input movie
    n_recommendations: int, top n recommendations

    Return
    ------
    list of top n similar movie recommendations
    """
    # fit
    model_knn.fit(data)
    # get input movie index
    print('You have input movie:', fav_movie)
    idx = fuzzy_matching(mapper, fav_movie, verbose=True)
    if idx is None:
        return
    # inference
    print('Recommendation system start to make inference')
    print('......\n')
    distances, indices = model_knn.kneighbors(data[idx], n_neighbors=n_recommendations + 1)
    # get list of raw idx of recommendations (dropping the queried movie itself at position 0)
    raw_recommends = \
        sorted(list(zip(indices.squeeze().tolist(), distances.squeeze().tolist())), key=lambda x: x[1])[:0:-1]
    # get reverse mapper
    reverse_mapper = {v: k for k, v in mapper.items()}
    # print recommendations
    print('Recommendations for {}:'.format(fav_movie))
    for i, (idx, dist) in enumerate(raw_recommends):
        print('{0}: {1}, with distance of {2}'.format(i + 1, reverse_mapper[idx], dist))

Step 6: Let’s predict the movie:

my_favorite = 'Pulp Fiction (1994)'

make_recommendation(
    model_knn=model_knn,
    data=mat_movie_features,
    fav_movie=my_favorite,
    mapper=movie_to_idx,
    n_recommendations=10)

The recommendation

Although we've successfully recommended movies (link to the Notebook is here), in the process we've also effectively identified two shortcomings of item-based collaborative filtering:

1. Popularity bias: the recommender is prone to recommending popular items.

2. Item cold-start problem: the recommender fails to recommend new or little-known items because they have few or no interactions; imagine a new movie with no or very few ratings.

{5} Where to go from here

Improvements to the current system

This movie recommender could be further improved using "Matrix Factorization".

In a real-world setting, the vast majority of movies receive very few ratings, or even none at all. We are looking at an extremely sparse matrix, with more than 99% of the entries missing. With such a sparse matrix, which ML algorithms can be trained to make reliable inferences? This data-scarcity problem has to be addressed to build a better model, and Matrix Factorization is the solution.

And that will be covered in my next post.

References:

Springer’s An Introduction to Statistical Learning

MIT OpenCourseWare for machine learning

Introductory Applied Machine Learning (IAML) course at the University of Edinburgh

Analytics Vidhya, Medium, KDnuggets
