Simple words: K nearest neighbors algorithm — Machine Learning

Bruno Fosados
4 min read · Oct 31, 2019


How can you teach a computer to clap? Mmm… not that easy, right? So we’re going to learn about the K nearest neighbors algorithm and how it works in machine learning applications.

When do people tend to clap? When something is really cool for some reason, or perhaps because they have some personal preference about the subject, or simply because they feel like it. In this article we’re going to teach our computer to clap at good jokes, so let’s get started.

First… how can we define a good joke? Well, a good joke must have some characteristics, or features: it could talk about animals versus real life, or maybe be a knock-knock joke, which tends to be kind of ridiculous; it could be long or short; or perhaps it could be told by Kevin Hart or Ellen DeGeneres, who have different skills for telling a joke, and so on…

So from this we can say that there are factors, or features, of a joke that make it good or bad. For these features we’re going to use a scale from 0 to 1 (e.g. 0.45, 0.98), and we’re going to define the actual list of features.

We’re going to use two features: “duration” (was the length of the joke good enough?) and “telling skill” (how well was the joke told?), with values of 0.55 and 0.61 respectively. In a real-life example there could be as many features as needed.
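The original embedded snippet doesn’t survive here, so below is a minimal sketch of what it could look like; the variable name `marshmallow_joke` and the feature keys `duration` and `telling_skill` are just illustrative choices:

```python
# A minimal sketch (the original embedded snippet isn't shown here).
# Our joke, described by two features rated on a 0-to-1 scale:
#   duration      -> was the length of the joke good enough?
#   telling_skill -> how well was the joke told?
marshmallow_joke = {
    "duration": 0.55,
    "telling_skill": 0.61,
}

print(marshmallow_joke)
```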

In the previous code we’ve defined our joke and its features and assigned some values. In real life these values would need to be measured by some method so that they reflect reality. For this example, think of it as if we asked somebody to rate both features between 0 and 1, and that was the result.

Now, suppose we’ve asked some people about 25 different jokes and collected some data. With this data we’re going to teach our computer to tell a good joke from a bad one, based on their features, using the K nearest neighbors algorithm. This process is called training.
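The real survey data isn’t included here, so as a stand-in, here is a sketch that fabricates 25 plausible ratings at random. The numbers, and the assumption that good jokes score higher on both features, are purely illustrative:

```python
import numpy as np

# Hypothetical survey results: 25 jokes, each rated on our two
# features (duration, telling skill), plus a label we already know:
# 1 = good joke, 0 = bad joke.
rng = np.random.default_rng(42)

# Purely illustrative: good jokes tend to score higher on both features.
good = rng.uniform(0.5, 1.0, size=(13, 2))
bad = rng.uniform(0.0, 0.5, size=(12, 2))

X = np.vstack([good, bad])         # 25 samples x 2 features
y = np.array([1] * 13 + [0] * 12)  # 25 labels

print(X.shape, y.shape)  # (25, 2) (25,)
```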

So what is all the fuss about the KNN (K nearest neighbors) algorithm anyway? In plain words, it says that points that are close together tend to be similar. Here is the Wikipedia page about KNN in case you want to get more technical about it.

[Figure: example of KNN classification. By Antti Ajanki AnAj — Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=2170282]

In this picture, our joke (the one about the marshmallow, remember?) is represented by the green circle. We don’t know yet how it is going to be labeled, or classified, by our computer: good joke or bad joke.

All the other figures, squares and triangles, represent jokes that we know for sure are good or bad; they are the 25 jokes we collected earlier. Blue squares are good jokes and red triangles are bad jokes.

We can also see two delimiters: a continuous black line that forms a circle around some figures, and a bigger dashed one. K nearest neighbors uses a factor called ‘K’; you can play with the ‘K’ factor, making it bigger or smaller, to change how the computer classifies our marshmallow joke. The K factor is represented by these delimiters.

If we use the dashed black line as the K factor (k = 5, the total number of items inside it, not counting our green one), our green point is going to be classified as blue, because 3 blue squares > 2 red triangles: we have more blue squares than red triangles. But if we use the small delimiter (k = 3), the outcome is going to be red, because 2 red triangles > 1 blue square.

Wait, what if we have the same number of figures, say 2 red triangles == 2 blue squares? In that case the class of the figure closest to the green circle (the test sample) is chosen, which here would be the red triangle. The sketch below shows both cases.
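To make this concrete, here is a small from-scratch sketch (not the article’s original code) with made-up coordinates arranged like the picture: with k = 3 the two red triangles win, with k = 5 the three blue squares win, and ties go to the class of the single closest point:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, query, k):
    """Classify `query` by majority vote among its k nearest neighbors.
    Ties are broken in favor of the class of the single closest neighbor,
    as described above (one simple convention; libraries may differ)."""
    distances = np.linalg.norm(X_train - query, axis=1)
    nearest = np.argsort(distances)[:k]          # indices of the k closest points
    votes = Counter(y_train[nearest])
    top = votes.most_common()
    if len(top) > 1 and top[0][1] == top[1][1]:  # tie between classes
        return y_train[nearest[0]]               # class of the closest point
    return top[0][0]

# Made-up points arranged like the picture: query at the origin,
# two red triangles closest, then three blue squares a bit farther out.
X_train = np.array([[0.10, 0.0], [0.0, 0.15],                 # red, red
                    [0.20, 0.0], [0.0, 0.30], [0.35, 0.0]])   # blue x3
y_train = np.array(["red", "red", "blue", "blue", "blue"])
query = np.array([0.0, 0.0])

print(knn_predict(X_train, y_train, query, k=3))  # red  (2 red vs 1 blue)
print(knn_predict(X_train, y_train, query, k=5))  # blue (3 blue vs 2 red)
```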

So how do we set the ‘K’ factor? The answer is testing. Remember the data on 25 jokes that we collected earlier? We’re going to split this data: one part for the training process and one part for testing purposes. We can split the data in different proportions, but a good start is 80–20: 80% of the data for training our computer and 20% for testing our ‘K’ factor, in order to find the most accurate one.
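Here is what that split could look like with scikit-learn’s `train_test_split`, reusing the illustrative 25-joke dataset from the earlier sketch:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# The illustrative 25-joke dataset from the sketch above.
rng = np.random.default_rng(42)
X = np.vstack([rng.uniform(0.5, 1.0, size=(13, 2)),   # good jokes
               rng.uniform(0.0, 0.5, size=(12, 2))])  # bad jokes
y = np.array([1] * 13 + [0] * 12)

# 80% of the jokes for training, 20% held out for testing 'K'.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
print(len(X_train), "training jokes,", len(X_test), "test jokes")  # 20, 5
```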

At this point you should understand the basics of KNN and how it works. Now we’re going to get a little more technical: we’re going to actually make your computer clap at good jokes!

We are going to use: Python 3, scikit-learn (sklearn) and Matplotlib.
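The original gist isn’t embedded here, so the following is a sketch of what the training and prediction step might look like with scikit-learn and Matplotlib. The dataset is the same illustrative one as above, and the “clap” is just a print statement:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier

# Our hypothetical 25-joke dataset (features: duration, telling skill).
rng = np.random.default_rng(42)
X = np.vstack([rng.uniform(0.5, 1.0, size=(13, 2)),   # good jokes
               rng.uniform(0.0, 0.5, size=(12, 2))])  # bad jokes
y = np.array([1] * 13 + [0] * 12)                     # 1 = good, 0 = bad

# Train a KNN classifier on all 25 jokes.
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X, y)

# The marshmallow joke: duration 0.55, telling skill 0.61.
green_joke = np.array([[0.55, 0.61]])
prediction = model.predict(green_joke)[0]

if prediction == 1:
    print("clap clap clap -- good joke!")
else:
    print("... silence. Bad joke.")

# Plot the jokes and our green point, like the picture above.
plt.scatter(X[y == 1, 0], X[y == 1, 1], marker="s", label="good jokes")
plt.scatter(X[y == 0, 0], X[y == 0, 1], marker="^", label="bad jokes")
plt.scatter(green_joke[:, 0], green_joke[:, 1], c="green", label="green_joke")
plt.xlabel("duration")
plt.ylabel("telling skill")
plt.legend()
plt.show()
```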

Try setting different values for the `green_joke` variable. Are they good jokes or bad ones?

Congrats! Now your computer can actually tell if your jokes are good or bad. Don’t get sad if they’re bad ones, you’ll get there!

Now, if you want to find the best ‘K’ factor, you need to train your model with different ‘K’s. Check the for loop below, where we iterate over different ‘K’ values and measure the accuracy of each:
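The original loop isn’t embedded here either; this is a sketch of what it might look like, again on the illustrative dataset:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Same illustrative dataset as above.
rng = np.random.default_rng(42)
X = np.vstack([rng.uniform(0.5, 1.0, size=(13, 2)),
               rng.uniform(0.0, 0.5, size=(12, 2))])
y = np.array([1] * 13 + [0] * 12)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Try several 'K' values and keep the one with the best test accuracy.
best_k, best_accuracy = None, 0.0
for k in range(1, 11):
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X_train, y_train)
    accuracy = model.score(X_test, y_test)  # fraction of correct predictions
    print(f"k={k}: accuracy={accuracy:.2f}")
    if accuracy > best_accuracy:
        best_k, best_accuracy = k, accuracy

print("best k:", best_k)
```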

In conclusion: the more accurately your feature data is measured, the better predictions you are going to get.

If you liked this, please give it a clap, and follow me for more interesting articles in ‘simple words’.

Have a good one!
