ML methods: Regression and classification

Badreldin Mostafa
Published in unpack
5 min read · Mar 8, 2021

“Any sufficiently advanced technology is indistinguishable from magic.” (Arthur C. Clarke)

Since time immemorial humans have tried a myriad of methods to predict the future and devised ways to recognize friends from foes; our very survival depended on it. From the high priests of the Pharaoh to the oracles of Delphi, throughout history we tried everything to identify the best time to start a war, plant trees, build cities. We tried to predict people’s personalities from the stars under which they were born; we used spirits, crystal balls, palm reading, and tarot cards. Only in modern times did we start relying on the scientific method. But our goals remained the same: predicting and classifying. The only difference is that we have a more sober view of how the world actually works and a consciousness of our predictive limitations.

Machine Learning and AI are nudging us a step forward in improving our predictive and classification powers. While there are plenty of machine learning methods and uses, this article will briefly introduce two of the main ones: regression and classification.

You probably made acquaintance with regression in a statistics class at some point, but if you are like me, your memory might be a bit rusty.

Ever met a seasoned real-estate agent? You tell her a location, an apartment size, a direction, etc., and she reflexively gives you an accurate price estimate for this hypothetical apartment. It takes years in the market to build up that intuition. While I am always wary of prices quoted by my friends in real estate, I am still fascinated by the accuracy of their estimates. So, what if we could teach machines to reason in a similar fashion and give even better estimates?

There are those beautiful things in mathematics called functions: you feed them a number x on one side and they give you an output y.

y = f(x)

X is called the independent variable and y the dependent variable (it depends on the value of x, doesn’t it?). In the case of our apartments, the price is y; it depends on one or more variables (a single x or a vector of values x1, x2, x3, …, xN) like the location, size, position, etc. The function f(x) is what links them, and it is what we want the machine to figure out. A good function will approximate the price very accurately. A simple linear relationship between the dependent variable and one independent variable, say price vs. size, might take the form f(x) = m*x + b, where m is the slope of the line, or the intensity of the relationship between y and x. All else equal, we expect the price to increase as the apartment size increases, but by how much? A positive relationship makes m positive, and its value reflects the increase in price for each unit increase in size. b is the intercept with the y-axis at x equals zero; while I am sure no one wants to buy a zero-sized apartment, there are always fixed costs involved, like the legal paperwork, that you pay regardless of size.
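
In code, such a price line is just a one-line function. The slope and intercept below are made-up numbers for illustration only, not values fitted from any real market data:

```python
# A hypothetical linear price model: price = m * size + b.
# m (price per square meter) and b (fixed costs like paperwork)
# are invented values, not fitted from real data.
def price(size_m2, m=2000.0, b=5000.0):
    """Estimate an apartment price from its size alone."""
    return m * size_m2 + b

# Each extra square meter adds m to the estimate;
# even a zero-sized apartment still "costs" the fixed b.
print(price(80))  # 2000 * 80 + 5000 = 165000.0
print(price(0))   # just the fixed costs: 5000.0
```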

To train a computer to do the job of the real estate agent, it needs a way to figure out the m and the b of the line that predicts the price with the highest accuracy given a certain size.

In real life you might find multiple apartments of the same size at different price points, so it is not so straightforward. The trick is to collect actual data on real apartments with their corresponding prices (we call that supervised learning, as we already know the right answers) and let the algorithm find the best-fit line that explains most of the answers. After all, we are only giving an estimate of what a price should be. In the case of a linear relationship, the model keeps tweaking the values of m and b to find the line that minimizes the squared error: the distance between each actual data point (price vs. size) and where the line would have predicted the price to be given that size. We need this error to be as small as possible, as that makes for better predictions and makes our model behave like a true real-estate expert. This is a simplistic view of the model; handling all the other variables requires more advanced numerical techniques beyond the scope of this article.
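
For a single variable, the best-fit line even has a closed-form solution. Here is a minimal sketch using a handful of invented (size, price) pairs; a real model would be fitted on far more data and more variables:

```python
# Closed-form least-squares fit of y = m*x + b.
# The (size, price) pairs below are invented for illustration.
sizes = [50, 60, 80, 100, 120]           # square meters
prices = [110000, 125000, 170000, 205000, 245000]

n = len(sizes)
mean_x = sum(sizes) / n
mean_y = sum(prices) / n

# m minimizes the sum of squared vertical distances between
# the line and the actual data points; b then anchors the line
# so that it passes through the mean of the data.
m = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, prices)) \
    / sum((x - mean_x) ** 2 for x in sizes)
b = mean_y - m * mean_x

print(round(m, 2), round(b, 2))  # slope and intercept of the best-fit line
```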

Regression’s close cousin is classification. While with regression we were trying to find a number (the price of the apartment) given the X variables, classification takes the X input variables of a data observation and recognizes that observation as belonging to a specific class. Take for instance a health survey where each data observation X is a set of health variables from one specific person: X1 is height, X2 weight, X3 muscle percentage, X4 gender, etc. The output y is a health-status class: underweight, fit, overweight, obese, or morbidly obese. As with regression, we build the model with labeled data to train it, so that when it sees new data from a person who wasn’t in the original training set, it can classify that person’s health status based on the input vector {X1, X2, X3, …, XN}. There are several classification methods, like logistic regression, decision trees, k-nearest neighbors (k-NN), and random forests, but those are beyond the scope of this article.
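
To make this concrete, here is a toy version of one of those methods, k-nearest neighbors: a new observation gets the majority label of its closest training points. The heights, weights, and labels below are invented for illustration, not medical guidance:

```python
import math
from collections import Counter

# A minimal k-nearest-neighbors classifier on invented health data.
# Each observation is (height_cm, weight_kg) with a known label.
training = [
    ((180, 60), "underweight"),
    ((175, 70), "fit"),
    ((170, 68), "fit"),
    ((165, 90), "overweight"),
    ((160, 95), "overweight"),
]

def classify(point, k=3):
    """Label a new observation by majority vote of its k closest neighbors."""
    neighbors = sorted(training, key=lambda item: math.dist(point, item[0]))
    votes = Counter(label for _, label in neighbors[:k])
    return votes.most_common(1)[0][0]

# (172, 69) sits closest to the two "fit" examples, so two of its
# three nearest neighbors vote "fit".
print(classify((172, 69)))  # fit
```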

A last word on these methods: they are both supervised learning methods, trained on labeled data sets where the right answer is known beforehand. One of the problems we want to avoid when training this type of model is overfitting. That is when the model memorizes all the data instead of, well, understanding the relationships between the inputs and outputs. We want our models to be smart, not just regurgitating what they have learned. A 100% accuracy might be a warning sign of overfitting. To avoid that, we usually split the labeled data set into three parts: training, verification, and validation data. We train the model on the training set, then check its performance on data it hasn’t seen before; this is where the verification and validation sets come in. Since we already know the correct labels (values in the case of regression, classes in the case of classification), we have a way to judge the model’s performance.
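
The split itself is straightforward. This sketch shuffles a synthetic labeled data set and divides it 60/20/20; those proportions are a common convention, not a rule:

```python
import random

# Sketch of splitting labeled data into three parts.
# The data set is synthetic: 100 (x, y) pairs from a known line plus noise.
random.seed(0)
data = [(x, 2 * x + 1 + random.uniform(-0.5, 0.5)) for x in range(100)]

random.shuffle(data)  # avoid any ordering bias before splitting
n = len(data)
train = data[: int(0.6 * n)]                    # fit the model here
validation = data[int(0.6 * n): int(0.8 * n)]   # tune it / spot overfitting
test = data[int(0.8 * n):]                      # final check on unseen data

print(len(train), len(validation), len(test))  # 60 20 20
```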

With Regression and Classification, we are just scratching the surface of machine learning. But I hope you can see the potential of machines being able to predict and classify accurately. Machine learning is the closest thing we have to real magic in modern times.
