# How Machines Learn To Doubt

There is a revolution underway in how we interpret the predictions coming out of machine learning models. It centers on the quality of a prediction: exactly how certain the model is about the answer it is giving.

# Software 1.0

To get an intuitive feel for it, let’s start a few steps back. In Software 1.0, that is, algorithmic software, you start with a set of inputs, say strings like “Malay”, “Joe”, etc., and ask a question about the input that has a deterministic answer. For example: is the string a palindrome? You end up with pairs of inputs and their deterministic answers, like {“Malay”, false}, {“Malayalam”, true}, {“Ana”, true}, {“Anastasia”, false}. Then you write code that, given an input, produces the expected output.
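A minimal Software 1.0 sketch of this palindrome check in Python (the function name and case handling are my own choices):

```python
def is_palindrome(s: str) -> bool:
    # Software 1.0: a deterministic rule, so the same input
    # always produces the same answer.
    s = s.lower()
    return s == s[::-1]

pairs = [("Malay", False), ("Malayalam", True),
         ("Ana", True), ("Anastasia", False)]
for name, expected in pairs:
    assert is_palindrome(name) == expected
```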

# Software 2.0

But the real world is too complicated to be tamed by such deterministic rules. For example, let’s start with strings like before, but this time ask: is the string the name of an Indian? Now, in your given pairs of inputs and outputs, you start seeing things like {“Kamala”, true}, {“Satya”, true}, {“Satya”, false}, {“Kamala”, false}. For the same input, the real world sometimes tells you the answer is true, sometimes false. How can you write code to deal with such ambiguity? In Software 2.0, that is, machine learning, you deal with the problem by outputting a probability of true or false, instead of a hard decision of definitely true or definitely false.

To make it a little more concrete, suppose you are given a dataset like {“Kamala”, true}, {“Kamala”, false}, {“Kamala”, false}, {“Deepak”, true}, {“Donald”, false}, {“Joe”, false}.

For any given input, you would learn a probability that the string is the name of someone from India and get {“Kamala”, 0.33}, {“Deepak”, 1.0}, {“Donald”, 0.0}, {“Joe”, 0.0}.
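As a sketch, with the counts behind these numbers assumed (one true and two false examples of “Kamala”, and one example each of the other names), the learned probability is just a per-name sample mean:

```python
from collections import defaultdict

# Toy dataset; the counts are assumed to match the probabilities above.
data = [("Kamala", True), ("Kamala", False), ("Kamala", False),
        ("Deepak", True), ("Donald", False), ("Joe", False)]

totals = defaultdict(lambda: [0, 0])  # name -> [num_true, num_seen]
for name, label in data:
    totals[name][0] += int(label)
    totals[name][1] += 1

probs = {name: t / n for name, (t, n) in totals.items()}
print(probs)  # {'Kamala': 0.333..., 'Deepak': 1.0, 'Donald': 0.0, 'Joe': 0.0}
```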

While this is much more useful than Software 1.0, you can immediately see that the real world is still far from what your code is telling you. The code is now telling you probabilities *based on the examples seen*, that is, your training data. Due to practical limitations, your training data is always a small fraction of what the real world truly contains. So when you take this code, which thinks “Kamala” has a 1/3 probability of being an Indian, and apply it to the real world, you’ll quickly learn it is grossly understating reality. That is because the vast majority of Kamalas are Indian, and your training data was too small.

To understand the solution, we need to first dive a little deeper into the problem. Suppose the true probability of “Kamala” being an Indian is *p*. Now, given three people named “Kamala”, if you ask each of them “are you from India?”, what is the probability that you’ll get the answers true, false, false? It’s *p*\*(1-*p*)\*(1-*p*) = *p*(1-*p*)².

This quantity, the probability of observing your given training dataset, is known as the likelihood. Note that *p*(1-*p*)² is not the probability of “Kamala” being from India; it is merely the probability of getting a true, false, false answer to that question. The underlying probability can range anywhere from 0.0 to 1.0, for which you’ll get the likelihood as shown below (x-axis is probability and y-axis is likelihood):

Note the peak of the likelihood plot is at 0.33. That is, the likelihood is maximum at the point where we had calculated our naive notion of probability (also known as sample mean). That will also be the prediction from a machine learning model which represents the data faithfully. In other words, your model ends up guessing the most likely value of the underlying probability, given the observations in the training dataset. In reality, it may or may not be the true probability.
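You can verify the location of that peak numerically; a quick sketch (the grid resolution is arbitrary):

```python
# Likelihood of observing (true, false, false) as a function of the
# underlying probability p.
def likelihood(p: float) -> float:
    return p * (1 - p) ** 2

# Scan a grid of candidate probabilities and pick the one that makes
# the observed data most likely.
grid = [i / 1000 for i in range(1001)]
best_p = max(grid, key=likelihood)
print(best_p)  # 0.333, i.e. the sample mean 1/3
```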

For the mathematically inclined: to build the model, you'll be minimizing the sigmoid cross entropy,

cross_entropy = -{z\*log(f(x)) + (1-z)\*log(1 - f(x))}

where f() is the model you are trying to learn applied to input x, and z is the expected output, with true mapped to 1 and false to 0. Note that minimizing cross_entropy is the same as maximizing its negative,

-cross_entropy = z\*log(f(x)) + (1-z)\*log(1 - f(x))

And maximizing the expression above also maximizes the exponentiation of the expression, as

e^(-cross_entropy) = e^(z\*log(f(x)) + (1-z)\*log(1 - f(x)))

= e^(z\*log(f(x))) \* e^((1-z)\*log(1 - f(x)))

= e^(log(f(x)^z)) \* e^(log((1 - f(x))^(1-z)))

= f(x)^z \* (1 - f(x))^(1-z)

This final form is exactly the expression for the likelihood, with f(x) standing in for the estimate of the underlying probability. Further, you can show that the maximum of the likelihood function occurs at f(x) = mean(z), which is why the model output and the highest point of the likelihood plot above both land at 1/3.

Typically, f(x) will be constructed as f(x) = sigmoid(logit(x)), where logit is the direct output of your model. So if f(x) = p, the maximum likelihood estimate, then logit(x) = log(p/(1-p)); the direct output of the model is not itself the probability.
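To see the f(x) = mean(z) result in action, here is a minimal sketch that fits a single logit (no input features; the learning rate and step count are arbitrary choices of mine) by gradient descent on the cross entropy. The gradient of the mean sigmoid cross entropy with respect to the logit works out to prediction minus label, averaged over the data:

```python
import math

def sigmoid(x: float) -> float:
    return 1 / (1 + math.exp(-x))

labels = [1, 0, 0]  # true, false, false for "Kamala"
logit = 0.0         # the single learnable parameter

for _ in range(5000):
    p = sigmoid(logit)
    grad = sum(p - z for z in labels) / len(labels)
    logit -= 0.5 * grad

print(sigmoid(logit))  # converges to 1/3, the maximum likelihood estimate
```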

# Software 3.0

The trouble with the model above was that the number of training examples was too small. If the model had seen 10,000 examples of “Kamala”, and only 3,333 of them were from India, then the prediction of 0.33 would have indeed made sense. But arriving at that conclusion after seeing only three examples seems a bit hasty. Note that the probability, given by (number of “Kamala” from India)/(total number of “Kamala”), is the same for 3,333/10,000 and 1/3. But in one case we are quite confident about the probability; in the other, we are not.

To model this doubt, instead of relying on a single number like the probability of 0.33, we represent the situation by a Beta distribution, capturing a whole range of probabilities by plotting *Beta(1 + number of “Kamala” from India, 1 + number of “Kamala” not from India)*. The plot below shows the cases where the number of “Kamala” from India vs not from India is 1 vs 2, 33 vs 67, 330 vs 670 and 3,300 vs 6,700. The x-axis represents the range of possible probabilities, and the y-axis the relative chances of each probability. When we have seen only 3 examples, the range of possible probabilities is very wide. But as we collect more evidence, the range narrows down significantly.
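The narrowing is easy to quantify: the standard deviation of Beta(1 + true, 1 + false) shrinks as the counts grow. A small sketch using the closed-form Beta variance:

```python
import math

def beta_std(num_true: int, num_false: int) -> float:
    # Standard deviation of Beta(a, b) with a = 1 + num_true, b = 1 + num_false:
    # sqrt(a*b / ((a + b)**2 * (a + b + 1)))
    a, b = 1 + num_true, 1 + num_false
    return math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))

for t, f in [(1, 2), (33, 67), (330, 670), (3300, 6700)]:
    print(t, f, round(beta_std(t, f), 4))
# The spread around p ≈ 1/3 shrinks roughly as 1/sqrt(number of examples).
```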

The answer to whether “Kamala” is the name of a person from India now evolves from a probability to a *distribution of probabilities*, which depends on how many examples we have seen before.

In Software 2.0, we learned how to make a guess for the most likely probability. But how do we make the model output the entire distribution of probabilities? This is where Software 3.0 and probabilistic programming come in.

One way to get the distribution of probabilities is to build a large number of models instead of just one model, each with its own guess of what the actual probability is. So given a question like whether a particular string corresponds to the name of a person from India, we build 1000 models and look at the 1000 results from those models. If the underlying true probability has a very wide distribution, then we expect these 1000 results to differ widely. On the other hand, if the true probability is within a very narrow range, then most of the 1000 estimates of it will be clustered around that point and we can be confident that the probability is somewhere very close to that cluster.
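One crude way to simulate such an ensemble (a simplification of mine, using bootstrap resampling rather than separately trained networks) is to give each "model" a resampled copy of the training data and let it output its own sample-mean estimate:

```python
import random

random.seed(0)  # for reproducibility

def ensemble_estimates(labels, n_models=1000):
    # Each "model" sees a bootstrap resample of the data and outputs
    # its own sample-mean estimate of the probability.
    estimates = []
    for _ in range(n_models):
        resample = [random.choice(labels) for _ in labels]
        estimates.append(sum(resample) / len(resample))
    return estimates

small = ensemble_estimates([1, 0, 0])                # 3 examples seen
large = ensemble_estimates([1] * 3333 + [0] * 6667)  # 10,000 examples seen

def spread(xs):
    return max(xs) - min(xs)

print(spread(small), spread(large))  # wide spread vs. a tight cluster near 1/3
```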

But training even a single machine learning model seems painful enough; how do we manage thousands of them? This is addressed by techniques like dropout and Flipout, which introduce random perturbations in a model to simulate the effect of thousands of variations of it. Other techniques, such as Bayesian neural networks, have native handling for distributions, with machinery to learn a distribution of models instead of a single model.
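As a flavor of how dropout produces these variations at prediction time (a toy sketch with made-up weights, not a real network): each forward pass randomly drops some units, so repeated passes yield a distribution of predictions instead of a single number.

```python
import math
import random

random.seed(1)

weights = [0.4, -0.9, 0.2, -0.5, 0.1]  # made-up "unit" contributions

def predict_with_dropout(drop_prob=0.5):
    # Randomly drop each unit, rescale the survivors (inverted dropout),
    # and squash the resulting logit through a sigmoid.
    kept = [w for w in weights if random.random() > drop_prob]
    logit = sum(kept) / (1 - drop_prob)
    return 1 / (1 + math.exp(-logit))

# Each call is one "model" from the implicit ensemble.
samples = [predict_with_dropout() for _ in range(1000)]
print(min(samples), max(samples))  # the spread reflects the model's doubt
```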

Summing up, the arc of software evolution is moving from definite answers to probabilities, and further still to the distribution of probabilities. And by incorporating doubt in their results, machines are getting a step closer to modeling the real world.
