How learning is feasible

Published in Poatek · May 21, 2024 · 7 min read

We have all heard about Artificial Intelligence, mostly through its branch of Machine Learning. But how can machines learn? Through statistics.

In short, it’s the idea of understanding the whole through a sample, by extracting the important parts of the data. This text will try to demystify this process without going too deep into the theory.

If you want to go further into Statistical Learning Theory, I recommend the book “Learning From Data”, whose lectures this text is based on, as well as Vapnik’s lectures at MIT and “Foundations of Machine Learning”.

1. Formalizing the problem:

Okay, so let’s do some machine learning. We have a problem that has a pattern (this is assumed), and we want to better understand and predict this pattern with the data that we have.

First, we’ll need an input X and an output Y, making our data a set of (X, Y) pairs. X is what we have, and Y is what we want to be able to predict.

This means that there’s a function (our target function) that can output Y given X:

data = (x1, y1), … , (xN, yN)
f : X -> Y

We consider a “Hypothesis Set” as the set of candidate formulas (all possible formulas that do what we want, i.e., receive an x and return a y’). Some functions will perform better than others, so we must pick the best one as our final hypothesis. This best formula, the one whose outputs are closest to the real Ys of our data, is called g.

Then, the learning part is simply the process of finding this g.
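To ground the notation, here is a minimal sketch in Python (the data, the threshold-style hypothesis set, and all names are illustrative choices, not something prescribed by the theory). The data comes from an unknown target f, the Hypothesis Set H is a family of simple threshold functions, and “learning” just picks the hypothesis g with the smallest error on the data.

import numpy as np

# Data: pairs (x_n, y_n) produced by some unknown target f (here, a hidden threshold).
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=50)
y = np.sign(X - 0.3)                      # we only ever see these outputs, not f itself

# Hypothesis Set H: threshold functions h_t(x) = sign(x - t), one candidate per t.
thresholds = np.linspace(-1, 1, 201)

def in_sample_error(t):
    # Fraction of training points that hypothesis h_t gets wrong.
    return np.mean(np.sign(X - t) != y)

# "Learning" = choosing the hypothesis g in H with the smallest in-sample error.
g = min(thresholds, key=in_sample_error)
print(f"final hypothesis g(x) = sign(x - {g:.2f}), in-sample error = {in_sample_error(g):.2f}")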

2. Probability is everything:

Now, the part we need to work out is how we can understand the behavior (and, in this case, find g) with only a limited amount of data compared to the entire distribution.

For a better example (one which is also present in the lecture from Professor Yaser), think of a bin full of green and red marbles. There is a distribution of colors that is unknown to us: say, a fraction mu of red marbles and 1 - mu of green marbles. If we grab a handful of marbles (of size N), we’ll have another distribution in our hands: a fraction nu of red marbles.

Image from LFD Lecture 2 — Yaser Abu-Mostafa

Nu and mu can be different, but the sample frequency will likely be close to the bin frequency! In other words, given a large enough sample N, nu is probably close to mu, within an error margin ε. This is captured by the formula below, which states that the probability of the difference between mu and nu exceeding ε is at most a value that depends on both the sample size and the error margin.

Image from LFD Lecture 2 — Yaser Abu-Mostafa
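In symbols (with ε standing for the error margin), the inequality in the figure reads:

P[\,|\nu - \mu| > \epsilon\,] \le 2e^{-2\epsilon^2 N}

The right-hand side shrinks exponentially as the sample size N grows, which is why a larger handful of marbles is more trustworthy.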

We just defined Hoeffding’s Inequality: the statement “mu = nu” is probably (given a large enough sample) approximately (within the error margin) correct.

PS: There are other forms of this inequality, with different degrees of generality, but this is the basic gist of it.
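A quick way to see the inequality at work is to simulate the bin. In this sketch the numbers (mu = 0.4, N = 100, ε = 0.1) are arbitrary illustrative choices:

import numpy as np

# Simulate many independent samples of size N from a bin with a fraction mu of red marbles.
rng = np.random.default_rng(1)
mu, N, eps, trials = 0.4, 100, 0.1, 100_000

samples = rng.random((trials, N)) < mu      # True = red marble
nu = samples.mean(axis=1)                   # sample frequency of red, one value per trial

empirical = np.mean(np.abs(nu - mu) > eps)  # how often the sample misleads us by more than eps
bound = 2 * np.exp(-2 * eps**2 * N)         # Hoeffding's upper bound for that probability

print(f"empirical P[|nu - mu| > {eps}] = {empirical:.4f}  <=  Hoeffding bound {bound:.4f}")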

3. Applying the Inequality

Now, this works for one bin, but in machine learning we’re effectively choosing between multiple bins, with more than one hypothesis. Remember the Hypothesis Set? Each bin corresponds to one hypothesis, and there are several in total. This means that for each bin there is a new mu and nu: M hypotheses in total, all belonging to the Hypothesis Set.

Image from LFD Lecture 2 — Yaser Abu-Mostafa

With that, let’s also change the notation.

  • Nu is the error frequency in the sample, so it’s the “in-sample error” Ein(h) for a given bin h
  • In the same way, mu is the error frequency over the whole bin, the “out-of-sample error” Eout(h)

Not only that, but we also have to account for all the other bins, so let’s rewrite the inequality considering M bins instead of just one.

Rewriting the inequality gives the following:

Image from LFD Lecture 2 — Yaser Abu-Mostafa
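Spelled out, the union-bound version for M hypotheses is:

P[\,|E_{in}(g) - E_{out}(g)| > \epsilon\,] \le 2M\,e^{-2\epsilon^2 N}

Each of the M bins contributes one Hoeffding term, which is where the factor M comes from.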

What does it mean?

Now, we account for every one of the M hypotheses in the Hypothesis Set we’re considering (every bin).

Also, we’re analyzing the error margin for the final hypothesis g instead of a single fixed h. Since g is chosen after looking at the data, it could turn out to be any of the M hypotheses, so we bound the worst case over all M bins (a union bound).

4. Cutting down on the size

Great, we have M. But here’s the bad news: M is too large. We’re counting every hypothesis in the set (every bin), and that number is usually infinite.

We have to replace M with something smaller. Since the number of hypotheses can be infinite, we count dichotomies instead: how each hypothesis labels our own data points X, ignoring how it behaves everywhere else. Considering the image below, we ignore most of the plane and focus only on the data points we have.

Image from LFD Lecture 5 — Yaser Abu-Mostafa

This drastically reduces the possible number. For a classification problem h : X -> {-1, +1}, a dichotomy is a labeling h : {x1, x2, …, xN} -> {-1, +1}. So there are at most 2^N of them (two possibilities for each of the N points).

As such, we need a function that counts the maximum number of dichotomies over any N points, effectively getting the most out of the Hypothesis Set H: the maximum number of distinct ways that N points can be classified using hypotheses from H. With that, we eliminate the overlap in M, where different hypotheses label the points identically.

The notation mH(N) denotes this: the growth function. It depends only on H and N, and the larger it is, the more expressive and complex the hypothesis set. It is also sometimes written as ΠH(N).

As we said, the maximum number of dichotomies for a given N is 2^N.
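As a concrete illustration, here is a sketch in the spirit of the “positive rays” hypothesis set from the LFD lectures: h_t(x) = sign(x - t) on N points of the real line. Its growth function is N + 1, far below 2^N (the code and variable names are illustrative, not from the article):

import numpy as np

def count_dichotomies(points):
    # Count the distinct labelings that positive rays h_t(x) = sign(x - t) can produce.
    points = np.sort(points)
    # One representative threshold below all points, between each pair, and above all points.
    cuts = np.concatenate(([points[0] - 1.0],
                           (points[:-1] + points[1:]) / 2,
                           [points[-1] + 1.0]))
    dichotomies = {tuple(np.where(points > t, 1, -1)) for t in cuts}
    return len(dichotomies)

for N in range(1, 8):
    pts = np.random.default_rng(N).uniform(0, 1, N)
    print(f"N={N}: dichotomies={count_dichotomies(pts)}, N+1={N + 1}, 2^N={2 ** N}")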

5. The hard part: mH(N) as a polynomial

Let’s work on replacing M with the growth function in the inequality. For that, we need mH(N) to be bounded by a polynomial.

How do we prove that? Using breakpoints and the VC dimension. Basically, we will bound the growth function by a polynomial.

The definition of a breakpoint is simple: if no data set of size k can be shattered by the Hypothesis Set H, then k is a breakpoint for H.

What is shattering? A hypothesis set shatters a set of points if it can “pick out” every possible subset of them, i.e., produce every possible labeling, as per the image below. For the example of 3 data points (N = 3), every single combination is achievable, but not for N = 4. This means that 4 is a breakpoint.

Image from HandWiki — VC Dimension
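To make shattering concrete, here is a minimal sketch (illustrative code, not from the article) that checks, via a linear-programming feasibility test, which labelings of a point set can be produced by 2D linear classifiers h(x) = sign(w·x + b):

from itertools import product
import numpy as np
from scipy.optimize import linprog

def separable(X, y):
    # Is there (w, b) with y_i * (w . x_i + b) >= 1 for every point? (LP feasibility test)
    n, d = X.shape
    A = -y[:, None] * np.hstack([X, np.ones((n, 1))])   # rows encode -y_i*(w.x_i + b) <= -1
    res = linprog(c=np.zeros(d + 1), A_ub=A, b_ub=-np.ones(n),
                  bounds=[(None, None)] * (d + 1), method="highs")
    return res.success

def count_dichotomies(X):
    return sum(separable(X, np.array(labels, dtype=float))
               for labels in product([-1, 1], repeat=len(X)))

three = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])              # 3 points in general position
four = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])   # 4 points forming a square
print(count_dichotomies(three), "of", 2 ** 3)   # 8 of 8: the 3 points are shattered
print(count_dichotomies(four), "of", 2 ** 4)    # 14 of 16: this set of 4 is not shattered

For 2D linear classifiers, no set of 4 points can be shattered, so 4 is a breakpoint for that hypothesis set.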

Now, the VC dimension (named after Vapnik and Chervonenkis) is the largest integer d for which mH(d) = 2^d. Basically, it’s the integer just before the smallest breakpoint, so Dvc = k - 1.

And here’s the magic: define B(N, k) as the maximum number of dichotomies on N points when k is a breakpoint. It can be bounded by a sum of binomial coefficients, which is how the Dvc enters the picture (the detailed proof is in the LFD lectures). The highest power of N in that sum is N^Dvc.
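In symbols, the resulting bound (a form of Sauer’s lemma, proved in the LFD lectures) is:

m_H(N) \le B(N, k) \le \sum_{i=0}^{k-1} \binom{N}{i} = \sum_{i=0}^{d_{vc}} \binom{N}{i}

which is a polynomial in N of degree k - 1 = Dvc.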

With that, if there is a breakpoint, then the growth function is bounded by a polynomial in N. If there is no breakpoint, it equals 2^N.

Finally, we can replace M, which counted all hypotheses (and could be infinite), with the growth function, which is polynomially bounded.

This leaves us with the VC Inequality, shown below. The 2N appears because the proof uses a sample of 2N points instead of just N; that substitution also brings in the extra constants (the 4 in front and the 1/8 in the exponent).

Image from LFD Lecture 6 — Yaser Abu-Mostafa
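Written out, the VC inequality is:

P[\,|E_{in}(g) - E_{out}(g)| > \epsilon\,] \le 4\,m_H(2N)\,e^{-\frac{1}{8}\epsilon^2 N}

Because m_H(2N) grows only polynomially when a breakpoint exists, the exponential term eventually wins and the right-hand side goes to zero as N grows.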

6. Conclusion

So, in this text, we’ve covered the basis for extracting general information from a sample, as well as how the sample size and the error margin relate to the probability of successfully predicting an output Y.

Reordering the inequality, we get the following:
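In the form used in the LFD lectures, with probability at least 1 - δ:

E_{out}(g) \le E_{in}(g) + \sqrt{\frac{8}{N}\ln\frac{4\,m_H(2N)}{\delta}}

The square-root term is the model complexity penalty, often written Ω(N, H, δ).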

The best takeaway from this is the notion of model complexity:

  • As Dvc increases, Ein decreases, but the model complexity penalty Ω increases

  • The minimum out-of-sample error is at some intermediate Dvc
  • By replacing an infinite number of hypotheses with the growth function, we derived the VC inequality, which relates sample size, error margin, and model complexity to the probability of successful prediction.
Image from LFD Lecture 4 — Yaser Abu-Mostafa
Gustavo Quintao Torres Castro
