What Makes Machine Learning Possible

Machines consume a lot of data to learn. Why such an appetite for data?

Shubajit Saha
AI Graduate
12 min read · Jan 25, 2019


Memorization goes as far back as our childhood. As kids we are taught to memorize the multiplication table of 2. As we grow older, we start to multiply bigger and bigger numbers, and we learn how to multiply them instead of memorizing the results.

This starts our journey towards a more complex form of understanding and inferring knowledge — Learning!

As students we used to answer questions we had never seen before! (Most of us, anyway.) Answering from what you remember is Memorization, but what is really exciting to AI people is a machine that LEARNS.

But how do machines learn?

Remember the time when you were a student and you used to appear for exams. What fundamental process helped you answer the questions? Yes — Studying!

What helped you to figure out that the question in front of you is linked somehow to what you read earlier?

It is Learning! And it is, indeed, a very specific type of learning. It is learning from data.

Credits: xkcd

Learning Theory

Scientists have spent countless hours understanding how machines learn. They have, with 0 creativity, named the field — Learning Theory!

Learning theory is a very mathematical and detail-oriented field. Let's get an intuitive feel for some of the concepts, with a little bit of math ;)

The Fundamental Questions

  1. Why do we need to learn from data?
  2. What is meant by ‘learning’ from data?
  3. Is it even possible to learn from data?

1. Why do we need to learn from data?

Let us start with the Cancer Detection Problem as discussed in a very beautiful article here. We have information about 1) Tumor Size and 2) Color. We don't have any other information, AND we want to detect cancer from this data alone.

A sample of data points for cancer detection

Now suppose there were other factors to consider while deciding the nature of the tumor: for example, patient history, family history, mitotic rate, nuclear grade, etc. These factors can have a very complex relationship with the nature of the tumor.

Discovering the exact relationship might take years of research and billions of dollars! So what do you do?

DATA! — Realize that you have the past records of all the patients who have reported a tumor. So although you don't have an "analytical solution", or a formula to predict cancer, you can still construct an "empirical solution", i.e. a solution that seems to agree well with the observations and works most of the time.

Just by looking at the data of 1) Tumor Size and 2) Color Intensity we can construct an empirical solution. That is where data comes in. Data helps you build an ‘empirical solution’.

An analytical solution is one based on rigorous analysis and reasoning, and is possible only when all the factors related to a problem are identified and their relationships with each other are well understood. Empirical solutions, on the other hand, are based on data and observations: we want to come up with a solution that agrees with the data. An empirical solution might not be correct every time, but sometimes that is not required.
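To make the distinction concrete, here is a minimal sketch in Python. The records and the hidden "size > 5 means malignant" rule are invented purely for illustration; the point is that the classifier below recovers a working threshold from observations alone, without ever being handed an analytical formula.

```python
import random

random.seed(0)

# Hypothetical synthetic records: (tumor_size, color_intensity, label).
# The hidden rule "size > 5 means malignant" stands in for the unknown biology.
data = []
for _ in range(200):
    size, color = random.uniform(0, 10), random.uniform(0, 1)
    data.append((size, color, int(size > 5.0)))

def fit_threshold(data):
    """Empirical solution: pick the size threshold that best agrees with the data."""
    best_t, best_acc = 0.0, -1.0
    for i in range(21):                 # candidate thresholds 0.0, 0.5, ..., 10.0
        t = i * 0.5
        acc = sum(int(size > t) == label for size, _, label in data) / len(data)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc

threshold, accuracy = fit_threshold(data)
print(threshold, accuracy)
```

The fitted threshold agrees with the observations even though we never wrote down the generating rule; that is all an empirical solution promises.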

2. What is meant by ‘learning’ from data?

Let's say we have the data. How do we know that the machine is able to learn something?

A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E
— Tom Mitchell

Well, that's a lot of jargon in a single sentence! Let us try to break it down and understand the basic traits of learning with respect to the cancer detection problem.

  • The program should be solving a task T (in our case, the cancer detection problem). Understandably, we should put in such effort only when there is no feasible analytical solution to T.
  • The program has access to experience E, i.e. data (in our case, the historical records of previous patients).
  • A performance measure P. Here P refers to how well or poorly our program performs on the task. In our case, for simplicity, we can think of P as the percentage of correctly detected tumors, i.e. accuracy. (Note that P should be evaluated on data that is not part of the experience E.)

Thus we can say a computer program, a.k.a. a machine, is learning when P increases with E. In our case, the program is learning if its accuracy of tumor detection on previously unseen tumor records increases with a greater amount of historical patient records.

This aspect of learning, where you can infer about unseen data from training data, is often termed generalization.
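Mitchell's definition can be simulated directly: fix a task T (classify tumors by size), a measure P (accuracy on held-out records), and watch what happens to P as the experience E grows. The noisy data and the nearest-class-mean learner below are assumptions invented for this sketch, not the article's method.

```python
import random

random.seed(1)

def sample(n):
    """Hypothetical 1-D tumor records: malignant tumors tend to be larger (noisy)."""
    pts = []
    for i in range(n):
        label = i % 2                                  # guarantee both classes appear
        size = random.gauss(7.0 if label else 3.0, 2.0)
        pts.append((size, label))
    return pts

def train_centroids(train):
    """Learn one mean size per class; predict the class with the nearer mean."""
    return {c: sum(s for s, y in train if y == c) / sum(1 for _, y in train if y == c)
            for c in (0, 1)}

def accuracy(means, test):
    correct = sum(min(means, key=lambda c: abs(means[c] - s)) == y for s, y in test)
    return correct / len(test)

test_set = sample(2000)                                # P is measured on UNSEEN data
accs = [accuracy(train_centroids(sample(n)), test_set) for n in (4, 40, 400)]
print([round(a, 3) for a in accs])
```

With more experience the class means settle down and held-out accuracy stabilizes near the best this simple learner can do; that improvement of P with E is what Mitchell's definition calls learning.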

3. Is it even possible to learn from data?

MATH ALERT! In this segment, let us try to understand the different components of learning and define them mathematically.

In the most simplistic sense our machine is trying to map Patient Data (X = patient records) to Cancer Results (Y=whether a person has cancer or not).

The Ideal Mapping Between Data and Results

What the ‘f’

What is the thing we are trying to learn here? We assume that there exists a function f which, when given a patient tumor record x, can perfectly detect whether the tumor is malignant or not, for all tumors. Let f output 1 if the tumor is malignant and 0 otherwise. We define f : X → Y (the ideal formula for cancer detection), where X is the input space, which consists of all possible patient tumor records, and Y is the output space, which consists of all possible outputs. Here Y = {0, 1}. This is also termed the label set.

This target function f is unknown and the objective of our learning process is to learn f. With reference to our cancer detection problem, you can imagine f to be the ultimate rule by which we can predict whether a tumor is cancerous.

Building a hypothesis

How do we approximate the ‘ideal’ f? We learn f from the training dataset D.

  • D consists of N input-output pairs or data points of the form (x1,y1), (x2,y2), …, (xN,yN), where yi = f(xi) ∀ i ∈ {1, 2, …, N}. In our case, D corresponds to the past patient tumor records.
  • We use a learning algorithm A to choose a function g : X → Y from a set of functions denoted by H. In the real world, we deal not only with functions but also with distributions. A hypothesis is a function (or distribution) that we choose from H.
  • For the cancer detection problem, suppose you are using neural networks. The set of all possible neural network architectures will form your hypothesis set H, and algorithms like gradient descent and back-propagation will be part of the algorithm A.
  • In a nutshell — we use an algorithm A that looks at the dataset D and chooses a hypothesis g from H that best approximates f. This process is called training. H and A together are referred to as a model. (When people say they have "built a model", they generally mean that they have come up with the final hypothesis.)
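The pieces f, D, H, A, and g can be written out in a few lines of Python. Everything here is a toy and an assumption: f is fixed only so the simulation can run (in reality it is unknown), and H is a small family of threshold classifiers rather than neural networks.

```python
import random

random.seed(2)

# Unknown target f: in real life we never see this; we fix one only to simulate.
def f(x):
    return int(x > 0.5)

# Training dataset D: N input-output pairs (x_i, y_i) with y_i = f(x_i).
D = [(x, f(x)) for x in (random.random() for _ in range(50))]

# Hypothesis set H: threshold classifiers h_t(x) = [x > t], one per candidate t.
H = [lambda x, t=t: int(x > t) for t in (i / 20 for i in range(21))]

# Learning algorithm A: choose the hypothesis in H with the fewest mistakes on D.
def A(H, D):
    return min(H, key=lambda h: sum(h(x) != y for x, y in D))

g = A(H, D)                                   # the final hypothesis
errors = sum(g(x) != y for x, y in D)         # g's mistakes on the training data
print(errors)
```

Swapping in a richer H (say, neural networks) and a fancier A (gradient descent) changes the ingredients but not the recipe: A searches H for the g that best matches D.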

Finding the best hypothesis

How do we find the best hypothesis g? We choose the hypothesis which performs best on the training data D with respect to some error measure e, and we hope that it will also perform well on an unseen test dataset T.

  • The performance of our hypothesis on D gives us a sense of how well our hypothesis can memorize.
  • Its performance on T tells us how well our hypothesis is able to generalize or learn.
Learning framework
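The memorize/generalize split in the two bullets above can be seen with a deliberately extreme hypothesis: a 1-nearest-neighbour rule that stores D verbatim. The noisy synthetic data is an assumption made for this sketch; the gap between the rule's score on D and on T is the point.

```python
import random

random.seed(3)

def sample(n):
    """Hypothetical noisy 1-D data: label = [x > 0.5], flipped 20% of the time."""
    pts = []
    for _ in range(n):
        x = random.random()
        y = int(x > 0.5)
        if random.random() < 0.2:
            y = 1 - y
        pts.append((x, y))
    return pts

D = sample(100)            # training data
T = sample(1000)           # unseen test data

def predict_1nn(x):
    """1-nearest-neighbour: pure memorization of D."""
    return min(D, key=lambda p: abs(p[0] - x))[1]

train_acc = sum(predict_1nn(x) == y for x, y in D) / len(D)
test_acc = sum(predict_1nn(x) == y for x, y in T) / len(T)
print(train_acc, test_acc)
```

The rule scores perfectly on D (it literally looks the answers up) but noticeably worse on T: flawless memorization, mediocre generalization.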

Is Hope a good thing?

If you were paying attention, you must have noticed that the word hope is very crucial here. Do you leave the fate of your patients to hope? When a single prediction can decide between life and death, is just having hope good enough?

You might argue that, intuitively, our current scheme makes sense: since our chosen hypothesis g agrees with f on so many data points, it must be a good approximation of f. But wait, just look at it from a different point of view.

f is completely unknown and can be any function, i.e. there are infinite choices for f, and however large our training dataset D is, it is still finite! How can we approximate an entity that has infinitely many possible forms with a finite amount of information?

One can prove that there exist infinitely many functions g which agree with f on all the data points in D, and your learning algorithm can choose any of them. So there exists the possibility that, after using machine learning, you come up with a hypothesis that perfectly classifies all the records in D but predicts all tumors as non-cancerous for unseen records.
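Here is a tiny concrete instance of that danger. Both hypotheses below fit a hypothetical four-point dataset D perfectly, yet g2 merely memorizes D and outputs "non-cancerous" (0) for everything it has not seen, so the two disagree on every point outside D we check.

```python
# Hypothetical dataset D of pairs (x, f(x)); the true rule behind it is unknown.
D = [(1, 1), (2, 0), (3, 1), (4, 0)]

def g1(x):
    return x % 2                       # "odd numbers are malignant"

def g2(x):
    # Memorize D exactly; predict "non-cancerous" (0) for everything unseen.
    lookup = dict(D)
    return lookup.get(x, 0)

agree_on_D = all(g1(x) == y and g2(x) == y for x, y in D)
disagreements_outside = [x for x in (5, 7, 9) if g1(x) != g2(x)]
print(agree_on_D, disagreements_outside)
```

Nothing in D alone can tell you whether g1 or g2 (or any of infinitely many other perfect fits) is closer to f.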

That sounds like a horror story, right? It means the predictions of g outside D are meaningless. So, should we conclude that learning from data is not possible and give up all our hope? Well not yet!!

Credits: scoopwhoop

Possibility vs Probability

There is always a possibility that tomorrow morning when you wake up, you receive a million dollars at your doorstep.

But what is the probability of that happening? There is a possibility that our learning attempt might fail miserably, i.e. learning is impossible in a deterministic sense. But with some tricks we can make sure that the probability of that happening is very small, i.e. learning is possible in a probabilistic sense, even when f is completely unknown.

Source: https://xkcd.com/795/

The Bin full of Marbles!

Before we move on to prove how learning from data is possible in a probabilistic sense, let us take a simple example of a bin full of marbles. Assume there is a bin which contains marbles of two colors, red and green. The fraction of red marbles in the bin is μ and the fraction of green marbles is 1-μ. If most of the marbles in the bin are red, then μ is close to 1, and if most of them are green, μ is close to 0. We can consider this bin as our population. Let us draw a random sample s of size N from this bin of marbles. Now that we have access to this sample, let us forget about the bin; thus μ is unknown to us. Let us denote the fraction of red marbles in the sample as ν and that of green marbles as 1-ν.

What is the probability that ν and μ are the same? Imagine the bin contains 100 marbles and 93 of them are red, and you randomly draw 4 marbles from this bin and all of them turn out to be green! This is definitely possible, but intuitively, since the bin consists mostly of red marbles, such an incident, like you getting a million dollars at your doorstep tomorrow, will be very rare (low probability). What if the sample size N were larger, like 5 or 6 or 7? The probability that all the marbles in the sample turn out to be green reduces further.

Here ν is a random variable that can take different values across different samples. μ is the expected value of ν, i.e. if we take a large number of samples and average the value of ν, it will be very close to μ. This relationship between a random variable and its expected value is quantified by Hoeffding's inequality.
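A quick simulation of the bin makes both claims tangible: a single tiny sample can be badly off, while the average of ν over many samples homes in on μ. The value μ = 0.93 mirrors the 93-red-marbles example above and is, of course, unknown in a real experiment.

```python
import random

random.seed(4)

MU = 0.93              # true fraction of red marbles in the bin (unknown in practice)

def draw_sample(n):
    """Draw n marbles with replacement; return nu, the fraction of reds."""
    return sum(random.random() < MU for _ in range(n)) / n

small_nu = draw_sample(4)                               # one tiny sample: can miss badly
avg_nu = sum(draw_sample(4) for _ in range(100_000)) / 100_000   # E[nu] ≈ mu
print(small_nu, round(avg_nu, 3))
```

Any single ν of size 4 lands on one of 0, 0.25, 0.5, 0.75, 1, yet the long-run average of ν sits right next to μ, which is exactly what "μ is the expected value of ν" means.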

Hoeffding Inequality

P(|ν - μ| > ϵ) ≤ 2e^(-2ϵ²N)

Let's take a look at what Hoeffding's inequality conveys intuitively. It states that the probability that a sample mean (ν) will differ from its expected value (μ) by more than a tolerance limit (ϵ) decreases exponentially with increasing sample size (N) and tolerance limit (ϵ). The tolerance limit must be positive (ϵ > 0). There is no restriction on the sample size N.

This is in line with what we were expecting. The larger the sample size, the more confidently we can say, for the same tolerance limit, that the sample statistic (ν) and the population parameter (μ) will not differ by more than ϵ. Similarly, if we want a small tolerance, we need a large enough sample size. If ϵ is small and the quantity P(|ν - μ| > ϵ) is small, we can conclude that ν is a good approximation of μ.
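We can check the inequality numerically by estimating P(|ν - μ| > ϵ) from repeated sampling and comparing it with the bound 2e^(-2ϵ²N). The values μ = 0.6 and ϵ = 0.1 are arbitrary choices made for this demonstration.

```python
import math
import random

random.seed(5)

MU, EPS, TRIALS = 0.6, 0.1, 20_000

def deviation_prob(n):
    """Monte-Carlo estimate of P(|nu - mu| > eps) for samples of size n."""
    bad = 0
    for _ in range(TRIALS):
        nu = sum(random.random() < MU for _ in range(n)) / n
        if abs(nu - MU) > EPS:
            bad += 1
    return bad / TRIALS

results = []
for n in (10, 50, 200):
    bound = 2 * math.exp(-2 * EPS ** 2 * n)      # Hoeffding bound for this n and eps
    results.append((n, deviation_prob(n), round(bound, 4)))
print(results)
```

For small N the bound is loose (it can even exceed 1, which is trivially true), but in every case the observed deviation probability stays below it and both shrink as N grows.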

Observations from the above example:

  1. The probabilistic bounds are valid only if the sample s is randomly drawn from the bin.
  2. The bin size is immaterial for the inequality to hold.

Now that we have some understanding of the relationship between ν and μ, we will try to incorporate some of these relationships into our learning scenario. First, let us try to map the different objects in the above example to the different components of learning.

  1. Marbles in the bin: They correspond to the input space X. Each marble can be considered as a data point.
  2. Marble colors: You can think of the marble colors as the output space. The marble colors are the labels for each marble.
  3. Sample (s): We choose the sample s randomly from the bin. If the training dataset D is randomly selected from the input space X, D will correspond to s. (In practice, x1, x2, …, xN ∈ D are drawn according to some probability distribution P. I will explain this in detail in a later article while discussing testing.)
  4. Unknown (μ): In the above scenario the value of μ is unknown, whereas in the learning scenario the target function f : X → Y is unknown. We will try to connect μ and f.

Let us start with a modified version of the experiment. Assume this time we deal with data points instead of marbles, and they are not yet colored. Suppose we come up with a hypothesis h. We label every data point x ∈ X using both h and f (the unknown target function). Let the label assigned to x by f be f(x) and the one assigned by h be h(x). In some cases our hypothesis agrees with the target function, and in some cases it does not. We color a data point red if h(x) ≠ f(x) and green if h(x) = f(x). Thus the fraction of data points colored red (μ) represents the error committed by our hypothesis h (since our hypothesis fails to predict the output of the target function for the red data points). We denote this as the out-of-sample error (Eout).

Now we sample D from the space where every data point is colored by the above scheme. In the drawn sample, some data points are red and some are green. The fraction of data points colored red (ν) represents the error committed by our hypothesis h on the data points in D. We denote this as the in-sample error (Ein).

Now that we have established the analogy, we can plug Ein in place of ν and Eout in place of μ.

Since Ein and Eout are far apart only with very low probability for large N, the performance of our hypothesis in the outside world is, most of the time, not much worse than what we see on D. It gives us some hope that reaching beyond the training dataset D is possible.
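The red/green coloring argument translates into code almost verbatim. A hypothesis h is fixed before any data is drawn, Eout is its disagreement with f over a fine grid standing in for the whole input space, and Ein is its disagreement rate on a random sample D. The specific f and h are invented so the simulation can run.

```python
import random

random.seed(6)

def f(x):
    """Unknown target (fixed here only to make the simulation possible)."""
    return int(x > 0.5)

def h(x):
    """A fixed hypothesis, chosen BEFORE seeing any data."""
    return int(x > 0.4)

# Out-of-sample error Eout: fraction of the input space where h disagrees with f
# (the "red" region), approximated on a dense grid over [0, 1).
grid = [i / 100_000 for i in range(100_000)]
E_out = sum(h(x) != f(x) for x in grid) / len(grid)

# In-sample error Ein: h's disagreement rate on a random dataset D of size N.
N = 1_000
D = [random.random() for _ in range(N)]
E_in = sum(h(x) != f(x) for x in D) / N
print(round(E_out, 3), round(E_in, 3))
```

Because h was fixed before D was drawn, Hoeffding applies with ν = Ein and μ = Eout, and indeed the two land close together; the caveat discussed next is precisely that this argument breaks when h is chosen by looking at D.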

But wait! There is a caveat. If you have noticed, in the above argument we first fix our hypothesis h and then look at the sample D, and we are dealing with a single hypothesis. In the learning framework we defined earlier, it is the other way around: we first see the training dataset D and then choose our hypothesis by looking at it, and the learning algorithm A considers multiple such hypotheses before finally arriving at our final hypothesis g. Thus, although we have started to see some light and are increasingly hopeful about the possibility of learning, we have to walk a little further to prove that learning from data is indeed possible.

Stay tuned for our next article on this topic, as the road to learning is full of adventures!

“There is no royal road to learning; no short cut to the acquirement of any art.”

Anthony Trollope

Credits

  1. Learning from data: This course is the best resource on learning theory for beginners.
  2. CS 229 Lecture Notes

Some of the examples have been taken from the book “Learning From Data” by Yaser Abu-Mostafa. Interested readers are advised to refer to it for a more detailed and rigorous treatment of the subject.

X8 aims to organize and build a community for AI that is not only open source but also looks at its ethical and political aspects. More such simplified AI concepts will follow. If you liked this, or have some feedback or follow-up questions, please comment below.

Thanks for Reading!
