2. First Look into Machine Learning

Rithesh K
10 min read · Apr 3, 2024


This is a part of the Machine Learning series.

  1. An Introduction to the World of Machine Learning
  2. This post
  3. Probability for Machine Learning — The Basics

The Data

So, as we discussed in the previous post, data is the foundation of Machine Learning. Your algorithm is only as good as the data you have. Having poor data is like reading a poorly written textbook: you could be a great learner, but you can only learn as much as the book has to teach. What if the book does not have enough content on the material you want to know? Or worse, what if it has incorrect information? You might pick that up while learning, and so will your Machine Learning model from a misplaced data point. Such an out-of-place data point is called an outlier (it lies outside the expected behavior) and needs to be treated. There could also be noise in the data you collected (because of measurement error, faulty instruments, or human error, say you copied the values wrong). You must take care of all of this before feeding the data to a model. This step is called Data Cleaning, and we will look into it later.

Sometimes, the data could be in a format that is not friendly to the model. For example, there could be missing values in the table (perhaps accidentally deleted). Or your data might have words like 'Yes', 'No', 'Red', 'Blue', etc. Machine Learning models work best with numbers and do not understand letters (after all, algorithms perform operations on numbers, and Machine Learning models are no exception). Your data might have values ranging from 100 to 1000 in one field and from 0 to 1 in another. We must handle all these scenarios and ensure the values are consistent.

But more importantly, the data might have many fields that may or may not be necessary for the task. You don't need all of them, or you can combine a few fields and use the new one instead. This process is called Feature Engineering, one of the critical steps in Data Preprocessing. Again, we will discuss this in later posts. One rule of thumb in Machine Learning: the data for the model should have consistent values but few fields (features), just enough to represent the problem.
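To make this concrete, here is a minimal sketch of the kind of cleaning and preprocessing described above, using pandas and scikit-learn. The column names and the tiny table are invented for illustration; which steps you actually need depends entirely on your data.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# A tiny made-up table with the problems described above:
# a missing value, a text field, and columns on very different scales.
df = pd.DataFrame({
    "income": [350.0, 820.0, None, 610.0],   # 100-1000 range, one missing value
    "score": [0.2, 0.9, 0.5, 0.7],           # 0-1 range
    "owns_car": ["Yes", "No", "Yes", "No"],  # words, not numbers
})

# Fill the missing value (here, with the column mean).
df["income"] = df["income"].fillna(df["income"].mean())

# Turn the 'Yes'/'No' words into numbers the model can use.
df["owns_car"] = df["owns_car"].map({"Yes": 1, "No": 0})

# Bring all numeric fields to a comparable 0-1 scale.
df[["income", "score"]] = MinMaxScaler().fit_transform(df[["income", "score"]])

print(df)
```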

Now that the data is ready, let's move on to the model.

The model

Now, the data is clean and consistent, ready to be picked up by our model. The fields in the data are part of the input for the model, and the output desired from the model will be the target, which ideally matches the actual output given by the model. In this post, and probably in future posts, too, we will represent inputs by x, target by t, and the output of the model by y.

For the model to learn the problem, we need data with both the input and the target (so we can train the model to take in the input and give out the target). This is the training dataset. However, we also need to check whether the model is learning correctly. We cannot use the same dataset both to train and to check the model (it was trained on those exact points, so of course it will do well on them). So, we keep aside a part of the input-target dataset as the test dataset and use the rest to train the model. We say our model performs well only after testing it on the test dataset and seeing that its outputs are close to the actual targets.
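As a small sketch of this split (the arrays and the 80/20 ratio below are just illustrative), scikit-learn's train_test_split does exactly this kind of separation:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative data: inputs x and targets t (any arrays of matching length work).
x = np.arange(100).reshape(-1, 1)
t = 2.0 * x.ravel() + np.random.randn(100)

# Keep 20% of the input-target pairs aside as the test set.
x_train, x_test, t_train, t_test = train_test_split(x, t, test_size=0.2, random_state=0)

print(len(x_train), "training points,", len(x_test), "test points")
```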

Figure 1: The training process of a Machine Learning model. We pick a data point from the training dataset, feed the input (x) to the model, get the output (y), and compare it with the target (t). We measure how close the output is to the target and send feedback to the model so that it can adjust itself appropriately.

The two forms of Machine Learning

The type of Machine Learning problem described above falls under supervised learning, where you have to produce the desired target from the inputs. You supervise the model's training process by setting the target for the training data and then expect an output for future input data. There are two types of problems within supervised learning: regression and classification. A regression problem is one where the target is continuous, meaning you are trying to predict a number; for example, predicting tomorrow's stock prices or the traffic flow at certain times of the day. A classification problem, on the other hand, is one where the target is discrete and you are trying to assign the inputs to a group. Weather forecasting is one example: you have two groups, Rain and No Rain, and you classify the inputs into one of the two classes. Another example is deciding whether a given image shows a dog or a cat.

(There is a nice bit of history behind the origin of the term 'regression' in statistics and machine learning, and it has little to do with the fields themselves. Look it up!)

There are problems where the training data only has inputs and no target. These belong to unsupervised learning. What do we do in those scenarios? We could try to find interesting patterns and groups in the data (a clustering problem), project it into a simpler form (say, with fewer fields) so it is easier to understand (a visualization problem), or work out how the data is distributed, i.e., which regions of values are densely populated and which are barren (a density estimation problem). Unsupervised learning is generally used to understand the domain of the problem and how the data is distributed in the dataset. It may also be used in the data preprocessing stage of supervised learning to understand the data and tweak our model appropriately.
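As one small, hedged example of the clustering idea (the two-dimensional blobs below are synthetic, chosen only so the groups are easy to see), k-means from scikit-learn assigns each unlabelled point to one of k groups:

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic unlabelled data: two blobs of points with no targets attached.
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=[0, 0], scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=[5, 5], scale=0.5, size=(50, 2))
data = np.vstack([blob_a, blob_b])

# Ask k-means to find 2 groups purely from the structure of the inputs.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(data)
print(labels[:5], labels[-5:])  # points from the two blobs get different labels
```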

There's a third "group" of Machine Learning problems where the data is a mix of the two: some data points have a target, but most are unlabelled (they do not have a target). This usually happens when you have a lot of data (millions or billions of images, for example) but not enough resources to label all of it. This is semi-supervised learning, where only a part of the data is supervised and the rest is not. But essentially, it is supervised learning without enough labelled data, so we can consider it a combination of supervised and unsupervised learning instead of a whole different group.

We will discuss all these types of problems in detail in later posts. But for now, let's try out a very simple Machine Learning algorithm.

Example — Polynomial Curve Fitting

Let's start with a very simple dataset (Table 1). The data has one numeric input field (x) and one numeric output (the target). Since this is a Machine Learning example, let's say the data is already cleaned and preprocessed, and we know that the field x influences the value of the target. We just don't know how.

Table 1: The Data. In our example, the input is just one field (x) with an output (target). We need to find the relation between them and use that to predict the target values for future values of x. The data is noisy, though.

It is hard to find the relationship just by looking at the table; the values look like a jumble of numbers. However, we can plot them on a graph to see what they look like (Figure 2).

Figure 2: The Plot of the Data points. We can see a pattern now. It looks like it’s following a curve.

From the plot, we see that the data roughly follows a curve. Let's try to find that curve by writing a formula that takes in x and gives out the target. This is called curve fitting: finding the curve that fits our data. We can then use this formula to predict the value of the target for unknown values of x.

Equation 1: The target in the data is expressed as a function of the input. In reality, we might not perfectly find this function, because of the noise in the data we collect. We find a function that gives a good approximation of the target.
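The equation image itself is not reproduced here; based on the caption, it is presumably of the form:

```latex
% The target as a function of the input; the model's output y only
% approximates this function because the collected data is noisy.
t = f(x), \qquad y \approx f(x)
```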

The next question is, what type of formula? A straight line? A wavy curve? Or something like a sine or cosine curve? This is our choice. We need to decide what kind of formula fits the data points well.

For this example, we will look at polynomial functions. A polynomial function is one whose output is a sum of powers of the input, each multiplied by a coefficient. Equation 2 shows the general form of a polynomial function. Now, all we need to do is find the value of n and the values of the a's in the function so that the value of y matches the target in the data.

Equation 2: The output of the model as a polynomial function of the input. Input x is in the data, so our task is to find the best value of n (the highest power of x to consider) and the values of all the a’s (the coefficients of the powers of x)
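Written out, the general polynomial described in the caption is presumably:

```latex
y = a_0 + a_1 x + a_2 x^2 + \dots + a_n x^n = \sum_{i=0}^{n} a_i x^i
```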

Let’s try with a simple value for n, setting it to 1. Now we have a linear function (Equation 3). The problem is reduced to finding the values of a0 and a1, much simpler than before.

Equation 3: A linear function on x. It’s called linear because if you plot this on a graph for different values of x, you get a straight line.
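In symbols, with n = 1 the polynomial reduces to:

```latex
y = a_0 + a_1 x
```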

We still haven’t discussed how to find those values, though. Let’s come to that later. For now, we will plug in different values of a0 and a1 and see how well the function works. But how do we calculate how good (or bad) our model is? For that, we have error functions (also called loss functions). Error functions calculate how far the model’s output is from the target.

One way is just to take the difference between them. Find the difference between the values of y and the target for each data point (don’t forget to take the absolute difference — we don’t need the sign) and add up all the differences. As simple as this sounds, this is a commonly used loss function called Least Absolute Deviation (LAD) or L1 loss.

Equation 4: Least Absolute Deviation (LAD) of the model. The higher the value of LAD, the worse your model is.
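Written out (indexing the N data points by k is my notation, not necessarily the post's), the LAD score is:

```latex
\mathrm{LAD} = \sum_{k=1}^{N} \lvert y_k - t_k \rvert
```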

We will discuss more error functions in later posts.

Now that we can measure how our model performs, we check the LAD score with different values of a0 and a1 for the data and settle on the values that give the smallest LAD score. Problem solved.
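A minimal sketch of this brute-force search in Python, under the assumption of a small synthetic dataset and a coarse grid of candidate coefficients (real libraries use far better optimisation methods):

```python
import numpy as np

# Synthetic noisy data standing in for Table 1.
rng = np.random.default_rng(1)
x = np.linspace(0, 1, 10)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.1, size=x.shape)

def lad(a0, a1):
    """Least Absolute Deviation of the linear model y = a0 + a1 * x."""
    y = a0 + a1 * x
    return np.sum(np.abs(y - t))

# Try many (a0, a1) pairs and keep the one with the smallest LAD score.
candidates = np.linspace(-2, 2, 201)
best = min(((lad(a0, a1), a0, a1) for a0 in candidates for a1 in candidates))
print("best LAD = %.3f with a0 = %.2f, a1 = %.2f" % best)
```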

…Or is it, though? Remember that all we found was the best linear function for our data. Is a linear function good enough? If we plot the best linear function for our data (Figure 3), we see that it is nowhere close to the data points. The function oversimplifies the relationship; our model is not complex enough to capture the pattern in the points. This is called underfitting.

Figure 3: Linear curve fitting of the dataset (a0 = 0.045, a1 = 1.027). The red dots are the points in the data, while the green dotted line is the best linear function we can get. But this is way too generalized.

What do we do if our model is underfitting? We increase the complexity of the model; in this case, that means increasing the value of n. Let's try n = 4. We now have five values of a to find (a0 through a4), a harder task, but still doable. And the plot of the best function (Figure 4) looks like a good fit!

Figure 4: Polynomial curve fitting of dimension 4. The red dots are the points in the data, while the blue dotted line is the best function we can get. The curve looks really good! A good representation of the data with the difference being very small.

So increasing the value of n gave a better-fitting curve. Why not use a much higher value of n, then? Let's try n = 10. The best-fit curve (Figure 5) is a little too complicated. It passes through all the data points perfectly but has created a few extra humps. This is because our data was a little noisy, and the model captured the noise along with the underlying pattern, causing this unexpected behavior.

Figure 5: Polynomial curve fitting of dimension 10. The red dots are the points in the data, while the blue dotted line is the best function we can get. The curve is too complex and has captured every data point, even with its noise.

The problem shows up when the function is used on new data points. If the new points lie along a simpler curve (like in Figure 4), this model is a bad choice. This is the case of overthinking your data and getting worse performance than a simpler curve would give. This is called overfitting.
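A hedged way to see this numerically: numpy's polyfit minimises squared error rather than the LAD score above, but fitting the same noisy synthetic curve with degrees 1, 4, and 10 typically shows the pattern described, with the degree-10 fit scoring best on the points it was trained on and worse on fresh points from the same curve.

```python
import numpy as np

rng = np.random.default_rng(2)

def make_data(n):
    """Noisy samples from the same underlying sine curve."""
    x = np.linspace(0, 1, n)
    return x, np.sin(2 * np.pi * x) + rng.normal(scale=0.1, size=n)

# 11 training points, so a degree-10 polynomial can pass through every one of them.
x_train, t_train = make_data(11)
x_test, t_test = make_data(50)   # fresh points the fits have never seen

for degree in (1, 4, 10):
    coeffs = np.polyfit(x_train, t_train, degree)              # least-squares fit
    train_err = np.abs(np.polyval(coeffs, x_train) - t_train).mean()
    test_err = np.abs(np.polyval(coeffs, x_test) - t_test).mean()
    print(f"degree {degree:2d}: train error {train_err:.3f}, test error {test_err:.3f}")
```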

You don't want your model to either underfit or overfit. In general, it is not hard to tell if the model is underfitting: the loss scores will be really bad, and you need to increase the complexity of the model. However, it is tough to see if the model is overfitting. The loss on the training data will be the lowest, and it is easy to think that the model is at its best.

To ensure the model does not overfit, we keep aside a part of the dataset as testing data and use the rest as training data. The testing data is used only to see how the model performs; it is not part of the training process. Even if the model overfits the training data, it has never seen the testing data, so an overfit model will perform poorly there.

Another way to take care of overfitting is to penalize the model if it gets too complicated. We take the error function we defined earlier and add a term that depends on the coefficients of the polynomial function. This is called regularisation. One of the regularisation ideas is to add the (absolute values of the) coefficients to the loss function and use that as the new loss. This is called L1 regularisation (Equation 5).

Equation 5: L1 regularisation. The sum of the absolute values of the coefficients (the penalty) is multiplied by lambda before adding to the loss function. We can control the strength of the penalty through lambda.
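Written out with the notation used for the LAD score above (again, the exact symbols are my assumption), the regularised loss is:

```latex
\mathrm{Loss} = \sum_{k=1}^{N} \lvert y_k - t_k \rvert + \lambda \sum_{i=0}^{n} \lvert a_i \rvert
```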

The lambda term in the regularisation equation controls how much the penalty influences the total loss. A higher lambda penalizes model complexity more heavily and might harm even good models. We will discuss more on regularisation later.
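For a rough feel of the effect of the penalty strength, here is a sketch using scikit-learn's Lasso, which applies an L1 penalty to the coefficients (its loss is squared error rather than LAD, and its alpha plays the role of lambda, so this is an analogue of Equation 5 rather than a literal implementation):

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import PolynomialFeatures

# Synthetic noisy curve and degree-10 polynomial features of the input.
rng = np.random.default_rng(3)
x = np.linspace(0, 1, 15).reshape(-1, 1)
t = np.sin(2 * np.pi * x).ravel() + rng.normal(scale=0.1, size=15)
features = PolynomialFeatures(degree=10, include_bias=False).fit_transform(x)

# A larger penalty pushes more coefficients to exactly zero, i.e. a simpler model.
for alpha in (1e-4, 1e-1):
    model = Lasso(alpha=alpha, max_iter=100_000).fit(features, t)
    print(f"alpha={alpha:g}: {np.sum(model.coef_ != 0)} non-zero coefficients")
```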

With this, we can end this post. We have discussed the case of a simple curve-fitting function to find the pattern in the data.

Next post: A small visit to the world of Probability.

Reference

Bishop, Christopher M. Pattern Recognition and Machine Learning. New York: Springer, 2006.
