Solve problems with machine learning — what is machine learning anyway? An introduction to regression models for a nontechnical audience.
Until three months ago, machine learning was a total mystery to me. I had no idea what it was, what to use it for, let alone how to configure it. The only thing I knew was that it’s trendy and all the geeks are talking about it.
So, what is machine learning anyway?
It is a subset of artificial intelligence: an approach that applies statistical modeling to known data. The goal is to use the outcomes of our models to answer specific questions, and different models help answer different types of questions. What’s powerful about machine learning is that a model can learn the relationships among a large number of data points from a big data set far faster than humans ever could.
With the development and availability of computing power, applications of machine learning are seen everywhere in our lives. Regression and classification are typical machine learning problems and models. There are also image and voice recognition, natural language processing, and so on. I will write a series of articles about different types of problems and models, and in this one I am going to focus on regression.
Regression models generate a number, or numbers, called predictions, as the outcome of their learning process. We call this learning process model training: we train our models to learn the relationship between all the identified data points and the target value. Here are a few examples of regression problems:
- How much is the selling price of a house, based on square footage, number of bedrooms, number of baths, number of garage spaces, size of the kitchen, and so on?
- An estimate of website traffic, based on the type of marketing event, the launch time of the event, its duration, the marketing tools and social media platforms used, the time of day, and so on.
- How much rain will an area receive, based on the time of the year, the incoming direction of the wind, the elevation of the area, and so on?
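To make the first example above concrete, here is a minimal sketch of a regression model in Python. It predicts a house’s selling price from a single input, square footage, using simple least-squares line fitting; the numbers are invented for illustration, and a real model would use more inputs and more data.

```python
# A toy regression model: predict a house's selling price from
# square footage alone, fit with ordinary least squares.
# All numbers below are made up for illustration.

def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical training data: (square footage, selling price in $1000s)
sqft  = [1000, 1500, 2000, 2500, 3000]
price = [200,  280,  360,  440,  520]

slope, intercept = fit_line(sqft, price)

def predict(square_footage):
    """The trained model: turn an input into a predicted price."""
    return slope * square_footage + intercept

print(round(predict(1800)))  # predicted price in $1000s for an 1800 sq ft house
```

Training here is just computing the slope and intercept from known data; once trained, the model answers a question (“how much for an 1800 sq ft house?”) with a number.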
An example closer to daily life: based on data about the money I spend on meals every day and my eating habits, how much money am I going to spend on food next Saturday? People who know me well will probably say “A lot,” but if I have the data and a regression model configured, after I enter the required input, my model will answer my question with a number. You might think, “Sure, it sounds cool, but is the number right? Can it be right? How right can it be?”
First, my actual spending is likely to differ from the prediction, because when it comes to the future, we don’t know what will happen until it does. For example, I might look at the number and decide I have spent too much on food and need to rein in my spending, or I might feel I don’t spend enough and decide to go big with a luxury meal on Saturday. Could I make an effort to spend exactly what the model predicts? Sure, why not.
The first point I am trying to make is that since we use known data to train and validate our models, we can check whether the outcome is right or wrong by generating metrics that tell us how right or how wrong it is. When it comes to predicting the future, the actual number is likely to differ, because many of the variables are unpredictable and keep changing. Still, a prediction is a prompt: it can advocate for certain decisions, urge action to proactively handle a situation, or help us prepare for an undesirable outcome.
Second, how do we know how likely our model is to be right or wrong, and by how much? We can use metrics to understand whether our model is producing meaningful outcomes. Although there are various metrics to indicate how good or bad our models are, most of them are variations on the difference between the predicted values and the actual values. This difference is called the residual. By feeding residuals into different loss functions, we can examine our models from different angles, which helps us understand how to make a model better and whether the changes we have been making are working.
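Residuals and loss functions can be sketched in a few lines of Python. The daily spending numbers below are invented for illustration; the two loss functions shown, mean absolute error and mean squared error, are among the most common ways residuals are turned into a single score for a model.

```python
# A minimal sketch of how residuals feed into evaluation metrics.
# The actual/predicted values are invented for illustration.

actual    = [12.0, 8.5, 15.0, 10.0]   # what I really spent each day ($)
predicted = [11.0, 9.0, 13.5, 10.5]   # what the model predicted ($)

# residual = actual value - predicted value
residuals = [a - p for a, p in zip(actual, predicted)]

# Two common loss functions built on residuals:
mae = sum(abs(r) for r in residuals) / len(residuals)   # mean absolute error
mse = sum(r ** 2 for r in residuals) / len(residuals)   # mean squared error

print(residuals)  # [1.0, -0.5, 1.5, -0.5]
print(mae, mse)
```

The two scores tell slightly different stories: mean absolute error treats every dollar of error equally, while mean squared error punishes large misses much more heavily, which is why both are worth watching when tuning a model.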
I am going to stop here because explaining the different metrics gets math heavy. Text me or email me if you are interested in knowing more about machine learning models and understanding the evaluation metrics better.