Machine learning frontiers: modelling basics

Luiz doleron
7 min read · Feb 1, 2022


Artificial Intelligence (2001), copyright Warner Bros. Pictures

If the world rebooted today, which modelling concepts must you know to get machine learning started up again?

Despite scientific efforts, it is not possible to understand every single force acting on a physical or social phenomenon. This is due to three classes of limitations:

  • theoretical ignorance: the relationships and theoretical rules governing a given phenomenon are not completely known
  • practical ignorance: the absence of complete observations such as facts, measurements, and experimental readings
  • laziness: the full set of forces acting on a given phenomenon is so huge that it is impossible to list them all or to calculate their results precisely

Machine learning uses function approximations to deal with scenarios like these. These function approximations are called models.

Modelling

In the context of machine learning, modelling is the process of finding useful models through a training process.

A model is said to be useful when it performs well on unseen data, i.e., data not used during its construction. In a nutshell:

Of course, we want to avoid underfitted and overfitted models. Let's consider the following synthetic scenario to understand how.

Synthetic data to the rescue

Synthetic data is a valuable resource. It makes it easier to understand the behavior of models and algorithms before applying them to real data. Here and in what follows, we will use synthetic data to gain insight into the main concerns of machine learning modelling.

Let’s suppose that we somehow know the generative source governing the phenomenon under study:

The blue line represents the source function, which we miraculously already know. This particular function is the periodic sine wave f(x) = sin(x). The JavaScript code to generate this sine data is:
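For example, a minimal sketch of such a generator (the sampling interval [0, 2π] and the sample count are illustrative assumptions, not necessarily the original choices):

```javascript
// Sample the source function f(x) = sin(x) at evenly spaced points.
function generateSineData(points = 100, xMax = 2 * Math.PI) {
  const data = [];
  for (let i = 0; i < points; i++) {
    const x = (i / (points - 1)) * xMax;
    data.push({ x: x, y: Math.sin(x) });
  }
  return data;
}

const sineData = generateSineData();
```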

The red crosses in the chart are measurements obtained from the real phenomenon by an experimental procedure. Note that, due to different sources of noise, these experimental readings do not lie exactly on the generative sine curve.

> The normal distribution is a good representation for this type of noise. Indeed, the central limit theorem states that the sum of many independent random variables is approximately normally distributed.
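Such measurements can be simulated by adding Gaussian noise to the source signal. A sketch using the Box-Muller transform (the noise standard deviation of 0.2 and the sample count are illustrative assumptions):

```javascript
// Draw one sample from the standard normal distribution
// using the Box-Muller transform.
function randomNormal() {
  let u = 0, v = 0;
  while (u === 0) u = Math.random(); // avoid log(0)
  while (v === 0) v = Math.random();
  return Math.sqrt(-2.0 * Math.log(u)) * Math.cos(2.0 * Math.PI * v);
}

// Simulate noisy measurements of f(x) = sin(x).
function generateMeasurements(points = 40, xMax = 2 * Math.PI, sigma = 0.2) {
  const data = [];
  for (let i = 0; i < points; i++) {
    const x = Math.random() * xMax;
    data.push({ x: x, y: Math.sin(x) + sigma * randomNormal() });
  }
  return data;
}
```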

In real scenarios, we don't even know the shape of the generative source signal. Usually, we have access only to the experimental data:

In the rest of this article, we will look for ways to obtain a good approximation of the generative source signal, assuming that we know neither its shape nor its formula.

Approximating functions

Roughly speaking, training algorithms aim to find an approximation function (or model) given the training data. To see this in action, let's set the training data to be 67% of the original experimental data, keeping the remaining 33% aside for later validation:

> Splitting the data into training and validation sets is called hold-out. Common training split percentages are 67%, 80%, 90%, and 99%.
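A hold-out split can be sketched as follows (shuffling before splitting is a common precaution; the helper name and default ratio are illustrative choices, not from the original code):

```javascript
// Shuffle a copy of the data (Fisher-Yates) and split it into
// training and validation sets.
function holdOut(data, trainRatio = 0.67) {
  const shuffled = data.slice();
  for (let i = shuffled.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [shuffled[i], shuffled[j]] = [shuffled[j], shuffled[i]];
  }
  const cut = Math.round(shuffled.length * trainRatio);
  return {
    training: shuffled.slice(0, cut),
    validation: shuffled.slice(cut)
  };
}
```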

To keep things as simple as possible, in this experiment we will use basic schoolbook models: lines (also known as degree-1 polynomials), third-degree polynomials (cubic curves), and so on. Using the least squares method, we can find the following approximations:
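For reference, a least squares polynomial fit can be sketched in plain JavaScript by solving the normal equations; this is a minimal illustration, adequate for the low degrees used here, not necessarily the method used in the article's original code:

```javascript
// Fit a polynomial of the given degree by ordinary least squares,
// building the normal equations (V^T V) c = V^T y from the Vandermonde
// matrix and solving them with Gaussian elimination.
function polyFit(data, degree) {
  const n = degree + 1;
  // Augmented matrix [V^T V | V^T y], accumulated point by point.
  const A = Array.from({ length: n }, () => new Array(n + 1).fill(0));
  for (const { x, y } of data) {
    const powers = [1];
    for (let k = 1; k <= 2 * degree; k++) powers.push(powers[k - 1] * x);
    for (let i = 0; i < n; i++) {
      for (let j = 0; j < n; j++) A[i][j] += powers[i + j];
      A[i][n] += powers[i] * y;
    }
  }
  // Gaussian elimination with partial pivoting.
  for (let col = 0; col < n; col++) {
    let pivot = col;
    for (let r = col + 1; r < n; r++) {
      if (Math.abs(A[r][col]) > Math.abs(A[pivot][col])) pivot = r;
    }
    [A[col], A[pivot]] = [A[pivot], A[col]];
    for (let r = col + 1; r < n; r++) {
      const f = A[r][col] / A[col][col];
      for (let c = col; c <= n; c++) A[r][c] -= f * A[col][c];
    }
  }
  // Back substitution; coeffs[k] multiplies x^k.
  const coeffs = new Array(n).fill(0);
  for (let i = n - 1; i >= 0; i--) {
    let s = A[i][n];
    for (let j = i + 1; j < n; j++) s -= A[i][j] * coeffs[j];
    coeffs[i] = s / A[i][i];
  }
  return coeffs;
}

// Evaluate the fitted polynomial at x (Horner's rule).
function polyEval(coeffs, x) {
  return coeffs.reduceRight((acc, c) => acc * x + c, 0);
}
```

Note that for high degrees the normal equations become numerically ill-conditioned; production code would use a QR or SVD based solver instead.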

Which model approximates the training data best? Checking the images, we can see that the 9th-degree polynomial curve passes over almost every training point, whereas the other curves are only more or less close. But how can we quantify this proximity?

A good way to answer this question is the mean squared error, or MSE, the average of the squared differences between predictions and observations:

MSE = (1/N) Σᵢ (Ŷᵢ − Yᵢ)²

MSE is the average of the squared differences between the predicted value Ŷ and the observed value Y. The predicted value is the value output by the model, whereas the observed value is the original value in the dataset. The implementation of MSE for our particular one-dimensional data is pretty straightforward:
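A minimal sketch of such an implementation:

```javascript
// Mean squared error between predicted and observed values.
function mse(predicted, observed) {
  if (predicted.length !== observed.length) {
    throw new Error("arrays must have the same length");
  }
  let sum = 0;
  for (let i = 0; i < predicted.length; i++) {
    const diff = predicted[i] - observed[i];
    sum += diff * diff;
  }
  return sum / predicted.length;
}
```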

For error indicators like MSE, the smaller, the better. Applying MSE to the training data results in:

Based on that performance, we might be led to think that the best model is the 9th-degree polynomial approximation. Of course, this is a wrong conclusion: applying MSE to the validation data provides a fairer view of the actual model performance:

The chart above uses a logarithmic scale! It clearly shows that the 9th-degree polynomial model performs poorly on the validation data, even though it achieves high performance on the training data. In other words, the chart shows that the 9th-degree polynomial model suffers from overfitting.

The linear model, on the other hand, performs poorly on both the training and validation sets. This is called underfitting. In a real experiment, both the linear and the 9th-degree models would be discarded.

Famous picture of someone deploying an overfitted model in production

The most important lesson here is: models are evaluated using data not used in their training.

Now we know how to detect overfitted models. But what makes a model overfit? How can we avoid it?

Causes of over and under fitting

The most common cause of overfitting is model complexity. We can summarize model complexity as the number of free parameters in a model. In the case of the 9th-degree polynomial, there are 10 free parameters to fit the data. The more free parameters, the more prone to overfitting the model is.


In the opposite direction, models with few parameters are prone to underfitting. This is the case when using a linear model to approximate a (non-linear) sine wave function.

The choice of model complexity is one of the most significant decisions to be made during the modelling phase. Automating this decision within the training algorithm is an area of active research in machine learning.

> Reducing the number or the influence of free parameters is usually known as regularization. We shall discuss regularization in detail in another article.

Other sources of over and under fitting

There are other causes of under- and overfitting, usually related to data quality. In particular, having too little data is a big problem, with a strong influence on both underfitting and overfitting.

> The data acquisition and preparation process is key to the success of modelling. We shall talk about it in a forthcoming article.

Another cause of over/underfitting is the choice of training hyperparameters. The training process is covered in the next article of this series.

Finally, what makes a model useful?

In our previous experiment, the models with intermediate complexity (3rd and 5th degree) showed the best balance between training and validation performance. This balance is the main criterion in model selection.

But what happens in cases like this, when two or more different models have approximately the same performance? It is simple: select the simplest one!

The principle of preferring less complex models over more complex ones is known as Occam's razor. Checking our previous example, we find that the 3rd- and 5th-degree polynomial models have shapes very similar to the original source signal (the blue line). Hence, following Occam's razor, we would choose the 3rd-degree polynomial as the final selected model.

> In real-time applications, the simplest model is also the fastest one. Thus, between two models with the same performance, the simpler one is always the one chosen.

Conclusion

In this article, we discussed the basic topics of modelling in the context of machine learning.

Concepts like underfitting and overfitting were illustrated using a synthetic scenario and simple school-level polynomial functions.

In real scenarios, more complex models take their place and proper iterative training algorithms are used. Even so, the core modelling subjects discussed here, such as training and validation sets, over/underfitting, and model complexity, remain present and equally valid.

Code

The code used in this article was written in JavaScript. You can find it in this gist or using the fiddle below:

If JavaScript is not your preferred language, no worries: it is not hard to port this code to other languages such as Python, Java, or C++. If you have any questions, do not hesitate to ask me at: doleron gmail com
