Fast Custom KNN in Sklearn Using Cython

Reda Boumahdi
Published in Sicara's blog · Jul 5, 2017

Read the original article on Sicara’s blog here.

Scikit-learn is a great library for machine learning, but it can be quite slow on some problems, especially when you plug in custom components such as custom metrics. k-NN (or KNN) is an intuitive algorithm for classification or regression. I will mainly use it for classification, but the same principle works for regression and for other algorithms that use custom metrics.

Let’s dive into how you can implement a fast custom KNN in Scikit-learn.

A quick taste of Cython

The fundamental nature of Cython can be summed up as follows: Cython is Python with C data types.

Cython is Python code that gets compiled into a C file and built into a library. Calls to this library are faster than calls to plain Python code. How fast? See for yourself!

First, let’s create a simple loop in Python, for instance like this:
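The snippet itself is not reproduced in this extract; a minimal sketch of such a pure-Python loop (the file name python_loop.py and the bound n are illustrative assumptions) could be:

# python_loop.py -- sums the first n integers with a plain Python loop
def python_loop(n):
    total = 0
    for i in range(n):
        total += i
    return total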

Then, let’s do the same in Cython:
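Again the original snippet is missing here; a plausible Cython equivalent, with C type declarations added (file name cython_loop.pyx assumed), is:

# cython_loop.pyx -- the same loop, with C type declarations for Cython
def cython_loop(int n):
    cdef long long total = 0
    cdef int i
    for i in range(n):
        total += i
    return total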

To build the Cython library, the command line is:

python setup.py build_ext --inplace
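The setup.py this command refers to is not shown in the extract; a minimal version, assuming the Cython file is named cython_loop.pyx, would be:

# setup.py -- compiles the .pyx file into an importable extension module
from distutils.core import setup
from Cython.Build import cythonize

setup(ext_modules=cythonize("cython_loop.pyx"))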

Then we need to execute the main file:
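The main file is also missing from this extract; a sketch that times both versions (reusing the python_loop and cython_loop names assumed above, with an illustrative problem size) could look like:

# main.py -- times the pure-Python loop against the compiled Cython loop
import time

from python_loop import python_loop
from cython_loop import cython_loop

n = 100000000  # illustrative problem size

start = time.time()
python_loop(n)
print("Python loop: %.2f s" % (time.time() - start))

start = time.time()
cython_loop(n)
print("Cython loop: %.2f s" % (time.time() - start))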

Screenshot of the output of main.py

Surprise… Cython is 1000 times faster!

What is KNN?

KNN is a great tool when you want interpretable outputs rather than a black box. The idea behind this classification algorithm is to compare a new point (the green circle) to the K most similar points in the data set (the closest ones), and to give it the most represented label among them (square or triangle).

K-NN classification

In the following case, if K = 3, the algorithm will predict a triangle, if K = 5, the algorithm will predict a square. Thus, the choice of K is quite important.

The main computational cost lies in calculating all the distances between the current point and all the remaining points. Let’s see how we can apply what we learned about Cython in the first part of this tutorial to K-NN.

Why use a custom distance?

When it comes to data, the more features you can use, the better. Often, you will face continuous and discrete features at the same time. In this case, a classic distance may fail to give good results, and a custom distance needs to be chosen. There are infinitely many possible distances, and you can combine them to create your own for better predictive power… just like the Avengers, they become better when you combine them!

Here you can find a list of some common metrics that are implemented in Scikit-learn:

  • Euclidean : sqrt(sum((x - y)^2))
  • Manhattan : sum(|x - y|)
  • Chebyshev : max(|x - y|)
  • Minkowski with parameter p : sum(|x - y|^p)^(1/p)
  • W-Minkowski with parameters p,w : sum(w * |x - y|^p)^(1/p)
  • S-Euclidean with parameter V : sqrt(sum((x - y)^2 / V))
  • Mahalanobis with parameter V : sqrt((x - y)' V^-1 (x - y))

Click here to learn more about importing these distances with Scikit-learn.
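As a quick illustration (a small sketch; the metric name, its parameter and the data points are arbitrary), these built-in metrics can be obtained through DistanceMetric:

import numpy as np
from sklearn.neighbors import DistanceMetric

X = np.array([[0.0, 0.0], [3.0, 4.0]])

# Minkowski distance with p=3 between the two points
dist = DistanceMetric.get_metric('minkowski', p=3)
print(dist.pairwise(X))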

As an example, custom metrics are useful for recommendation. Imagine if Booking wanted to suggest hostels based on what you have already booked in the past, using the price (price_i), the location (geoloc_i) and the stars (stars_i) as features. To apply K-NN we must be careful to compare features on comparable scales. One way to overcome this difficulty is to normalize; another one is to use the following distance:

α‖geoloc_1-geoloc_2‖² + β‖price_1-price_2‖² + γ‖stars_1-stars_2‖²

And to choose α, β and γ so that the predictions are as accurate as possible.
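In code, that weighted distance could look like the sketch below (the weight values and the feature ordering are illustrative assumptions):

import numpy as np

# hypothetical weights for location, price and stars
ALPHA, BETA, GAMMA = 1.0, 0.5, 2.0

def booking_distance(x, y):
    # feature vectors laid out as [latitude, longitude, price, stars]
    geoloc = np.sum((x[:2] - y[:2]) ** 2)
    price = (x[2] - y[2]) ** 2
    stars = (x[3] - y[3]) ** 2
    return ALPHA * geoloc + BETA * price + GAMMA * stars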

Custom distance syntax

The first step is the definition of our custom distance. We define a metric in a file called cython_metric.pyx. My custom distance takes two arrays and returns a double.
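The contents of cython_metric.pyx are not included in this extract; a sketch of what it might contain, using a plain Euclidean distance written with C types, is:

# cython_metric.pyx -- Euclidean distance with C-typed loop variables
def cython_euclidean(double[:] x, double[:] y):
    cdef double total = 0.0
    cdef int i
    for i in range(x.shape[0]):
        total += (x[i] - y[i]) ** 2
    return total ** 0.5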

In simple Python, it would look like this.
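The original snippet is missing here as well; a plausible pure-Python equivalent of the metric above would be:

# python_metric.py -- the same Euclidean distance in pure Python
def python_euclidean(x, y):
    total = 0.0
    for i in range(len(x)):
        total += (x[i] - y[i]) ** 2
    return total ** 0.5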

Then in another file, we load the data and call KNeighborsClassifier from Scikit-learn, and give it our custom distance.

We also output the time taken by the algorithm and its accuracy.
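That driver file is not reproduced in this extract; a sketch of what it could look like, using the iris dataset as an illustrative stand-in for the article’s data and timing the fit and score steps, is:

# knn_main.py -- fits a KNN classifier that uses the compiled custom metric
import time

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

from cython_metric import cython_euclidean  # compiled from cython_metric.pyx

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5, metric=cython_euclidean)

start = time.time()
knn.fit(X_train, y_train)
accuracy = knn.score(X_test, y_test)
print("Time: %.3f s, accuracy: %.3f" % (time.time() - start, accuracy))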

Performance benchmark

For the same distance (Euclidean), we compare the performance of Python code vs Cython code by running the previous code several times for different numbers of observations. We can then plot the following graph.
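The benchmark script and the resulting graph are not included in this extract; a sketch of how such a benchmark could be run (the sizes and the random data are arbitrary, and python_euclidean / cython_euclidean are the functions assumed above) is:

# benchmark.py -- compares the Python and Cython metrics for growing data sets
import time

import matplotlib.pyplot as plt
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

from cython_metric import cython_euclidean
from python_metric import python_euclidean

sizes = [200, 500, 1000, 2000]  # illustrative numbers of observations
timings = {"python": [], "cython": []}

for n in sizes:
    X = np.random.rand(n, 4)
    y = np.random.randint(0, 2, size=n)
    for name, metric in [("python", python_euclidean), ("cython", cython_euclidean)]:
        knn = KNeighborsClassifier(n_neighbors=5, metric=metric)
        start = time.time()
        knn.fit(X, y)
        knn.predict(X)
        timings[name].append(time.time() - start)

plt.plot(sizes, timings["python"], label="python metric")
plt.plot(sizes, timings["cython"], label="cython metric")
plt.xlabel("number of observations")
plt.ylabel("time (s)")
plt.legend()
plt.show()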

Conclusion

Python is a great language for solving many problems. If you care about performance and want to speed up part of your code, you can move it into a Cython module. As we did with the distance calculation, your code will run much, much faster! I hope this tutorial helps your algorithms learn blazing-fast!

External links

Did you like this tutorial? Do you want to hire us? Click here!
