Supervised Learning and its Shortcomings

Raul Incze
Cognifeed
Mar 6, 2019 · 7 min read

This is the first in a series of articles describing the “magic” behind Cognifeed. The aim of the series is to provide a high-level explanation of the Machine Learning (ML) concepts that make Cognifeed work and of how they all interweave.

As previously stated, our cognitive feed relies on the concept of active learning. Intuitively, active learning entails algorithms that can “ask questions” in order to learn faster and with fewer examples. But what does that actually mean? To understand the benefits of active machine learning, we first need to understand what supervised learning is. This blog post aims to provide a very short introduction to the concept of supervised learning while emphasizing a few of its shortcomings. If you’re already familiar with this highly popular type of ML, feel free to skim the rest of this post and hold tight for the upcoming one on active learning.

Supervised learning

Supervised learning can be seen as simply mapping inputs to outputs. The learning algorithm discovers these mappings by churning through a number of (X, y) pairs of training samples. Here X is the input and y is the output, which is known at training time. The output can either be a label (from a fixed, discrete set of predetermined labels) or a number (a continuous value in a set interval).

If the output is a label, we’re dealing with the problem of classification. Let’s suppose we want to train an algorithm to tell if the sky in a picture is clear or cloudy. In this case our input X will consist of the image’s pixels and y will be a label that’s either clear or cloudy. During the supervised training process we feed the algorithm pictures alongside their labels. This way we make it learn correlations between the values (colours) of the pixels, our input, and the output. It will probably end up learning that if there are many blue pixels where the sky would be in a picture, there’s a high probability that the sky is clear. After the training process we can feed new X inputs to the algorithm and let it use these correlations to predict the y by itself. This is called inference.
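To make the (X, y) mechanics concrete, here is a minimal sketch in Python using scikit-learn. The data is random noise standing in for flattened pixel values; it only illustrates the train-then-infer workflow described above, not Cognifeed’s actual pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-in data: 200 fake 32x32 RGB "images" flattened into pixel vectors (X),
# each paired with a label (y): 0 = clear sky, 1 = cloudy sky.
X_train = rng.random((200, 32 * 32 * 3))
y_train = rng.integers(0, 2, size=200)

# Training: the model churns through the (X, y) pairs and learns correlations
# between pixel values and labels.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Inference: a new, unlabelled image goes in and the model predicts y on its own.
X_new = rng.random((1, 32 * 32 * 3))
print(model.predict(X_new))  # e.g. [1] -> "cloudy"
```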

Another perspective on supervised learning is that of function estimation. For our example this implies that there exists a function f(X) = y that perfectly maps our pixels (X) to a clear/cloudy label (y). The model tries to find an estimate f’ of this function that minimizes its mistakes (errors) on the pairs we feed it.
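In symbols (my notation, not the original post’s), with ℓ a loss function that measures how far a single prediction is from the true label, training searches over candidate functions g for the estimate

```latex
f' = \arg\min_{g} \sum_{i=1}^{n} \ell\big(g(X_i),\, y_i\big)
```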

These functions, especially ones with hundreds of inputs (such as an image’s pixels), can be highly complex. So let’s pick a simpler example and see how the learning process unfolds when trying to estimate the sign function. This is a binary classification problem (two classes, + and -) with a single input, a real number. More formally, we’re trying to estimate the following function (for simplicity’s sake, let’s assume 0 counts as positive; the label +1 corresponds to + while -1 corresponds to -):
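Written out from the description above, the target function is:

```latex
f(x) = \begin{cases} +1 & \text{if } x \ge 0 \\ -1 & \text{if } x < 0 \end{cases}
```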

Our algorithm will try to find a separation boundary between these two classes on the x axis. The gray circles are our training examples, accumulating as we randomly sample them. Notice how every relevant example weighs in on moving the orange separation boundary. At the end we have a fair approximation of our function based on the training samples. The whole thing is learnable with O(n) complexity, where n is the number of training samples we have. Roughly speaking, this means that after at most n steps we can be sure we’ve found the best separation boundary. Of course, given this limited dataset of points, we will end up with an absolute error ε equal to 0.5. We can’t learn the function perfectly.

Now let’s consider that instead of a fixed set of samples we take the entirety of real numbers. This means that we now have a continuous distribution rather than a discrete dataset. In our case, the distribution generating the data is one-dimensional and uniform. We can sample points from this distribution (pick them at random) and use them to train our estimator. Under this scenario we can reduce ε as much as we want by simply sampling more points. It’s been proven that the sample complexity of learning such a threshold is O(1/ε), where ε is our desired error. For our example, if we want our error to be no greater than 0.01, on the order of 100 samples will suffice (with high probability).
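One quick way to get a feel for this is to simulate it. The sketch below uses my own toy learner (not the exact procedure from the animation above): draw n uniform samples from [-1, 1], place the boundary halfway between the rightmost negative and leftmost positive training point, and measure the misclassified fraction of the interval. The average error shrinks roughly like 1/n.

```python
import numpy as np

rng = np.random.default_rng(0)

def boundary_error(n):
    """Learn a threshold for sign(x) from n uniform samples and return its error."""
    x = rng.uniform(-1.0, 1.0, size=n)
    y = np.where(x >= 0, 1, -1)                              # true labels
    left = x[y == -1].max() if (y == -1).any() else -1.0     # rightmost negative sample
    right = x[y == 1].min() if (y == 1).any() else 1.0       # leftmost positive sample
    boundary = (left + right) / 2
    return abs(boundary) / 2                                 # misclassified fraction of [-1, 1]

for n in (10, 100, 1000):
    mean_err = np.mean([boundary_error(n) for _ in range(500)])
    print(f"n = {n:4d}  ->  average error ~ {mean_err:.4f}")
```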

A more complex, two-dimensional dataset than our example, generated with playground.tensorflow.org. Here the separation boundary (the white circle) has been perfectly learned by a very simple neural network. You can play with this example and watch it learn in real time by going to the link above!

Of course, for more complex data such as images, finding an approximation of our target function is difficult. Various architectures have been proposed over the years that make use of prior assumptions about the input data. For images, deep convolutional networks were built on top of the assumption that pixels within an image are spatially correlated (there is a continuity that “makes sense” in neighbouring pixels). Such assumptions let the learner fit the data properly using fewer parameters and fewer training samples.
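As an illustration of that prior, here is a minimal convolutional classifier in PyTorch (a generic sketch, not any specific published architecture). The convolution layers slide small filters across the image and reuse the same few weights at every position, which is exactly the “neighbouring pixels are correlated” assumption at work.

```python
import torch
from torch import nn

# Minimal convolutional classifier for 32x32 RGB images, two classes
# (e.g. clear vs. cloudy). Each convolution only looks at small pixel
# neighbourhoods and shares its weights across the whole image.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),                        # 32x32 -> 16x16
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),                        # 16x16 -> 8x8
    nn.Flatten(),
    nn.Linear(32 * 8 * 8, 2),               # two output classes
)

images = torch.randn(4, 3, 32, 32)          # a batch of 4 fake images
print(model(images).shape)                  # torch.Size([4, 2])
```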

This learner, or model, is described by two sets of parameters. One of them defines the topology of the learner; these are often called meta-parameters or hyperparameters. The other set consists of the parameters that are tweaked during training (learned) in order to reach a better function estimate. In the case of neural networks, these parameters are called weights.
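A tiny sketch of the distinction (illustrative values only): the numbers we pick by hand fix the topology, and the weights that live inside the resulting layers are what training adjusts.

```python
from torch import nn

# Hyperparameters: chosen by us before training; they define the topology.
hidden_units = 16
learning_rate = 1e-2   # would be handed to the optimizer during training

# Weights: created by the chosen topology and tweaked during training.
model = nn.Sequential(
    nn.Linear(1, hidden_units), nn.ReLU(),
    nn.Linear(hidden_units, 1),
)
n_weights = sum(p.numel() for p in model.parameters())
print(f"hidden_units={hidden_units} gives {n_weights} learnable weights")  # 49
```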

Scientifically accurate representation of a baby model (described by its hyperparameters) getting fat on data (adjusting its weights). Notice the frustration on the researcher’s face as they have to go and tweak the hyperparameters.

Yet even if we know the right learner topology for our data, most modern architectures still have a large number of parameters that need to be tweaked. For instance ResNet-152, one of the most popular architectures, has 60,344,232 parameters. This translates into a large number of training samples needed to make such a model converge to a desirable function estimate from scratch, starting from random noise.
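You can verify a figure of this order yourself with torchvision (the exact count differs slightly between implementations, so don’t expect it to match the number above to the last digit):

```python
from torchvision import models

resnet = models.resnet152()   # randomly initialised, no pre-trained weights
n_params = sum(p.numel() for p in resnet.parameters())
print(f"ResNet-152: {n_params:,} trainable parameters")   # on the order of 60 million
```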

Sample efficiency

We live in the world of big data. Large companies gather insane amounts of information on users and their behaviour. When it comes to Machine Learning, data, or rather the quantity of data, is what creates competitive advantage. This happens, at least partially, because most of the machine learning methods in use are sample-inefficient.

Let’s return to the field of computer vision for a second and compare man and machine. It is highly likely that you can teach a human being to distinguish between different kinds of fruit by showing them a single example of each. On the other hand, for current convolutional networks to learn (in the absence of transfer learning) you will need to feed them thousands of images of fruits and their labels. If we feed only one example of each type, the learner won’t be able to generalize the concept of what that fruit is. It will simply memorize raw correlations in those particular pictures, correlations that hold strictly for them. Given a new instance to evaluate, even one ever so slightly different from the one it learned from, the algorithm will fail to label it correctly. This is what the ML literature calls overfitting.

From a mathematical point of view, this behaviour of machine learners is very much expected. What is fascinating is our ability to quickly learn new concepts when they are presented to us. In the case of vision, maybe it is because our visual processing centre has a topology that is extremely data efficient. Or maybe it’s because of our power to use already learned patterns to quickly identify objects. In this way we humans learn these tasks at a higher representational level, leveraging the power of what in ML is called representation learning (more on this subject in a future post). Could it be that a fairer comparison would be one between a learning algorithm and a newborn infant that does not yet have these patterns developed? But enough philosophy; let’s get back to some math and estimate the quantity of data needed to learn something.

Or… at least try to. When it comes to deep learning and real datasets full of noise, the math gets tricky. The mathematical bounds in these cases are still being refined, challenged and researched. Everything’s quite abstract and symbolic, and we’re not going to go through it in this article. But if I sparked your interest you should read On the Computational Efficiency of Training Neural Networks and How Many Samples are Needed to Estimate a Convolutional Neural Network? The short answer to “how many samples do we need” is as perplexing as the math: way too many, yet fewer than the theory says they should need given the mathematical entities that are doing the learning.

Label efficiency and other issues

Let’s suppose for a moment that we have the data. All of it. We’ve spent days or maybe weeks collecting it. Unfortunately, as we mentioned earlier, for supervised learning to work we also need labels. All of this data has to be manually annotated by a human (or a collective of humans) before any algorithm can be trained on it. Most of the time this process is even more tedious than collecting the data itself.

Furthermore, after collecting and annotating the data, we sometimes realize that we’ve ended up with an unbalanced dataset. This means that some classes are over-represented in the dataset while others are under-represented. A model trained on such a dataset might end up having a prediction bias towards the over-represented classes, or it might not learn to correctly label the under-represented ones at all. There are tricks to deal with this, such as subsampling, which in broad terms means using from each class at most as many samples as the most under-represented class has (see the sketch below). But this way we’re not only wasting a lot of precious samples and their labels but also introducing sampling bias into the mix!
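Here is what that subsampling trick looks like in practice, as a rough sketch (the function name and data are made up for illustration):

```python
import numpy as np

def subsample_balanced(X, y, seed=0):
    """Keep at most as many samples per class as the rarest class has."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_keep = counts.min()
    keep_idx = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_keep, replace=False)
        for c in classes
    ])
    return X[keep_idx], y[keep_idx]

# 900 "cloudy" vs. 100 "clear" samples -> 100 of each after subsampling,
# throwing away 800 labelled examples in the process.
X = np.random.rand(1000, 5)
y = np.array([1] * 900 + [0] * 100)
X_bal, y_bal = subsample_balanced(X, y)
print(np.unique(y_bal, return_counts=True))
```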

Sample efficiency, label efficiency, unbalanced datasets. Such a headache! Can we address all of these at once? Partially yes, and we will explore how in a future blog post. (Spoiler: it’s active learning)

Stay tuned!

Remarks: Special thanks to Petra Ivascu and Vescan Flaviu for proofreading this piece and for their valuable suggestions.


Raul Incze
Cognifeed

Fighting to bring machine learning to as many products and businesses as possible, automating processes and improving the living experience.