Machine Learning: A Theoretical Approach to Predictive Models with the K-Nearest Neighbors Algorithm

Rubén Acevedo
Published in Dev Environment
5 min read · Dec 8, 2020

“Computers are able to see, hear and learn. Welcome to the future.” — Dave Waters.

Machine learning’s explosive growth is now a fact rather than a vision. Many useful tools built on it have appeared in the last two years, and implementing intelligent solutions has become a matter of life and death both for emerging companies and for those that want to survive the Industry 4.0 revolution.

But what is machine learning, and how can we use it to our advantage? In which situations can we successfully implement a machine learning solution? How does it work?

Definition

Machine learning is an application of AI that gives systems the ability to learn and improve from experience automatically, without being explicitly programmed. It takes existing mathematical approaches, empowers them with computational resources, and lifts statistics and data analysis to a whole new level. We can apply it in many fields: accurately predicting large numbers of numerical or categorical variables to generate insights, clustering data to build recommendation systems and targeted marketing campaigns, and other uses you can check below:

Spotify, Netflix, Google, banks, marketing agencies, corporate analytics and many other companies and sectors are already exploiting the advantages of machine learning, and in the link below you can find an article with the top 10 companies that use machine learning on a regular basis:

Machine Learning Algorithms

There are many types of algorithms capable of predicting variables, and the most important thing to understand is that not all of them work for a given situation.

Depending on the type of data, or the type of information you want to extract from that data, some types of algorithms will be more useful than others.

That is an important thing to understand about machine learning, but because this is a theoretical approach, I won’t go deeper into it here.

I’ll just leave this article here about this topic:

Understanding Machine Learning

Now that we know what machine learning is, we need to understand how it works. And there’s no better way to understand it than with a practical example.

Let’s imagine that we want to find out when a person prefers to go out (Thursdays and Fridays, or Saturdays and Sundays), whether it’s going to the movies, spending the afternoon in a park, or eating at a restaurant…

Is it possible? Yes.

This is a hypothetical situation, which means the data I’m using here was made up by me, with the sole goal of explaining how our thinking should work, so you can easily understand how machine learning algorithms operate.

Now, in this example, we gathered a group of people and asked them when they prefer to go out: Thursdays and Fridays, or Saturdays and Sundays.

We then separated them by age and salary.

This is the chart that represents the results:

We can see here that there’s a linear relation between age, salary and the days of the week on which people are more likely to go out. But how can we use machine learning to predict that consistently?
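To make this walkthrough concrete, here is a minimal sketch of what that made-up dataset could look like in Python. Every age, salary and label below is invented purely for illustration, just like the data behind the chart.

```python
# Hypothetical survey results: (age, monthly salary in US$, preferred days to go out).
# All values are made up for the sake of the example.
people = [
    (22, 1200, "Thu/Fri"), (25, 1500, "Thu/Fri"), (27, 1800, "Thu/Fri"),
    (24, 1400, "Thu/Fri"), (29, 2500, "Thu/Fri"),
    (31, 2900, "Sat/Sun"), (35, 3400, "Sat/Sun"), (38, 3800, "Sat/Sun"),
    (42, 4500, "Sat/Sun"), (33, 3100, "Sat/Sun"),
]
```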

Visualizing the problem

In this situation, we will need to estimate whether a 29-year-old with a US$ 2,700 salary will tend to go out on Thursdays and Fridays or on Saturdays and Sundays.

Now that we have the data to analyze, the best thing to do is to represent it on the same chart as the rest:

The yellow dot represents the data whose classification we do not know.

Now that we have a graphical representation of our problem, we can start using machine learning techniques to find the most likely classification for this data point.
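If you want to reproduce a chart like this one, a simple matplotlib scatter plot is enough. The sketch below reuses the invented dataset from above and marks the 29-year-old earning US$ 2,700 as the yellow dot.

```python
import matplotlib.pyplot as plt

# Same invented dataset as before: (age, monthly salary in US$, preferred days).
people = [
    (22, 1200, "Thu/Fri"), (25, 1500, "Thu/Fri"), (27, 1800, "Thu/Fri"),
    (24, 1400, "Thu/Fri"), (29, 2500, "Thu/Fri"),
    (31, 2900, "Sat/Sun"), (35, 3400, "Sat/Sun"), (38, 3800, "Sat/Sun"),
    (42, 4500, "Sat/Sun"), (33, 3100, "Sat/Sun"),
]

# One scatter series per class, so each gets its own color and legend entry.
for label, color in [("Thu/Fri", "tab:blue"), ("Sat/Sun", "tab:orange")]:
    ages = [age for age, _, d in people if d == label]
    salaries = [sal for _, sal, d in people if d == label]
    plt.scatter(ages, salaries, c=color, label=label)

# The point whose classification we do not know yet.
plt.scatter([29], [2700], c="gold", edgecolors="black", label="unknown")
plt.xlabel("Age")
plt.ylabel("Monthly salary (US$)")
plt.legend()
plt.show()
```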

In this case, we will use the concept of KNN, K-nearest neighbors.

K-nearest neighbors

So, what is KNN after all?

K-nearest neighbors is a machine learning algorithm that uses geometric position within a chart to identify patterns and predict the classification of a given data point (or group of points) based on the surrounding data.

Now, in the chart below, we will select the five data points closest to the one we want to predict:

We can see that there is a majority of orange icons (3–2), representing people who prefer to go out on Saturdays and Sundays. This gives us a 60% probability that this person will be classified with the orange icon.
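That majority vote is simple enough to write by hand. The sketch below uses the same invented data, plain Euclidean distance on (age, salary) and K = 5; in a real project you would scale the features first, since salary differences dwarf age differences numerically.

```python
import math
from collections import Counter

# Same invented dataset: (age, monthly salary in US$, preferred days).
people = [
    (22, 1200, "Thu/Fri"), (25, 1500, "Thu/Fri"), (27, 1800, "Thu/Fri"),
    (24, 1400, "Thu/Fri"), (29, 2500, "Thu/Fri"),
    (31, 2900, "Sat/Sun"), (35, 3400, "Sat/Sun"), (38, 3800, "Sat/Sun"),
    (42, 4500, "Sat/Sun"), (33, 3100, "Sat/Sun"),
]

def knn_predict(query, data, k=5):
    # Sort all known points by their Euclidean distance to the query point.
    by_distance = sorted(data, key=lambda p: math.dist((p[0], p[1]), query))
    # Take the labels of the k nearest points and count the votes.
    labels = [label for _, _, label in by_distance[:k]]
    winner, votes = Counter(labels).most_common(1)[0]
    return winner, votes / k

# With this particular invented data the vote is 3-2 for "Sat/Sun" (60%).
print(knn_predict((29, 2700), people, k=5))
```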

But why not expand the number of data used to calculate this probability?

The answer to this question is simple: Randomness.

The more data points we use as a parameter, the more the algorithm tends toward randomness: the prediction ends up being driven by the overall majority of the data rather than by its location on the Cartesian plane.

Imagine that we increased the value of K (the value that defines how many data points are used to predict our yellow dot, let’s call it X) so much that it encompassed most of our data. In that situation X would no longer be defined by the proximity of the data, but by the overall amount of data in the plane.

Therefore, the best way to work with the value of K is to use a moderate value that can predict our X as accurately as possible.

Finding K

The best way to find the correct K value is to use a method known as train/test split.

In this method we split the data whose classification we already know into two different sets, one for training and one for testing.

We train the model on the training set with several different values of K and evaluate each one against the test set. The value of K with the best results is chosen.

This is the most practical method for finding the best K value for a given data group.
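With a library such as scikit-learn, the whole train/test procedure fits in a few lines. The sketch below again uses the invented dataset, holds out part of it for testing, fits a KNN model for each candidate K, and keeps the one with the best test accuracy; with a dataset this tiny it is only an illustration, and in practice you would use cross-validation and feature scaling.

```python
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Same invented data: X holds (age, monthly salary in US$), y the preferred days.
X = [(22, 1200), (25, 1500), (27, 1800), (24, 1400), (29, 2500),
     (31, 2900), (35, 3400), (38, 3800), (42, 4500), (33, 3100)]
y = ["Thu/Fri"] * 5 + ["Sat/Sun"] * 5

# Hold out 30% of the labelled data as the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

best_k, best_score = None, 0.0
for k in range(1, len(X_train) + 1):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    score = model.score(X_test, y_test)  # accuracy on the held-out test set
    if score > best_score:
        best_k, best_score = k, score

print("Best K:", best_k, "with test accuracy:", best_score)
```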

Conclusion

I believe this theoretical summary helps to clarify some doubts about how we can use machine learning to predict certain situations and data.

The use of these techniques is limited only by our imagination, as mathematics applies to almost every situation on our planet, and it is already bringing about many changes in business, health, engineering and sports.

If you have any questions or suggestions, please let me know.
