Bayesian Regularization for #NeuralNetworks

If you are a science or math nerd, there is no way in hell you have not heard of Bayes’s Theorem. It is a pervasive and powerful inference model, used to understand and model anything from the growth of cancer cells, to obstacle detection in autonomous robots, to estimating the probability that an asteroid is on a collision course with Earth. The model draws its power from its simplicity. In the Artificial Intelligence community specifically, you cannot do away with ‘Bayesian Inference and Reasoning’ for optimizing your models.

In a past post titled ‘Emergence of the Artificial Neural Network’, I mentioned that ANNs are emerging prominently among all other models because of their ability to accommodate techniques and theories from other AI approaches quite well. I also mentioned that a full Bayesian model can be used for interpreting weight decay. In this post, I intend to showcase Bayesian techniques for regularizing neural networks.

This concept is also called Bayesian Regularized Artificial Neural Networks or BRANN for short.

What is Bayes’s Theorem?

(Feel free to skip this section if you already understand Bayes’s Theorem)

Bayes’s Theorem is fundamentally based on the concept of the “validity of beliefs”. Reverend Thomas Bayes was a Presbyterian minister and a mathematician who pondered much about developing a proof of the existence of God. He came up with the Theorem in the 18th century (it was later refined by Pierre-Simon Laplace) to fix or establish the validity of ‘existing’ or ‘previous’ beliefs in the face of the best available ‘new’ evidence. Think of it as an equation to correct prior beliefs based on new evidence.

One popular example used to explain Bayes’s Theorem is detecting whether or not a patient has a certain disease.

The key concepts in the Theorem are as follows:

Event: An event is a fact. The patient truly having the disease is an event. The patient truly NOT having the disease is also an event.

Test: A test is a mechanism to detect whether a patient has the disease (or a test devised to prove that a patient does not have the disease; note that these are not the same test).

Subject: A patient is a subject who may or may not have the disease. A test needs to be devised for the subject, either to detect the presence of the disease or to prove that the disease is absent.

Test Reliability: A test devised to detect the disease may not be 100% reliable. When the test fails to recognize the disease in a subject who truly has it, we call the result a false negative. When the test indicates the disease in a subject who truly does not have it, we call the result a false positive.

Test Probability: This is the probability that the test detects the event (disease) in a given subject (patient). It does not account for the Test Reliability.

Event Probability (Posterior Probability): This is the “corrected” test probability to detect the event given a subject by considering the reliability of the devised test.

Belief (Prior Probability): A belief, also called a prior probability (or prior for short), is the subjective assumption that the disease exists in a patient (based on symptoms or other subjective observations) prior to conducting the test. This is the most important concept in Bayes’s Theorem. You need to start with the priors (or beliefs) before you make corrections to that belief.

The following equation accommodates the stated concepts:

$$P(A_i \mid B) = \frac{P(B \mid A_i)\,P(A_i)}{P(B \mid A_1)\,P(A_1) + P(B \mid A_2)\,P(A_2)}$$

In the equation,

• A1 and A2 are the events. They are mutually exclusive and collectively exhaustive. Let A1 mean that the disease is present in the subject and A2 mean that the disease is absent.
• Let Ai refer to either one of the events A1 or A2.
• B is a test devised to detect the disease (alternatively, it can be a test devised to prove that the disease does not exist in the subject; again, note that these are two completely different tests).
• Let us say there is a population of people (in a random city) where there is a prior belief (based on some random observation, which may or may not be subjective) that 5% of the population “has the disease”. So, for any given subject in the population, the prior probability P(A1) “has the disease” is 5% and the prior probability P(A2) “does not have the disease” is 95%.
• Let’s say the test ‘B’, which is devised to “detect” the presence of the disease, has a reliability of 90% (in other words, it detects the disease in a patient who truly has it only 9 times out of 10). Written mathematically, the probability that the test detects the disease when the disease is truly present is P(B|A1) = 0.9.
• Unfortunately, the test ‘B’ also has a flaw: it sometimes shows that the patient has the disease even when the disease is truly not present. Let us say that 2 out of 10 patients who really do not have the disease get falsely detected as having it. Mathematically, P(B|A2) = 0.2.
• Now, if you randomly select a subject from the population and conduct the test on the subject, AND if the test result is positive (the test says the patient has the disease), can we calculate the “Event probability” (or the Posterior Probability) of the person truly having the disease?
• Mathematically, calculate P(A1|B), which can be read as: calculate the probability of A1 (presence of the disease), given B (the test result being positive).

So let’s assign the values for each probability.

• Prior Probability of person having the disease = P(A1) = 0.05
• Prior Probability of person NOT having disease = P(A2) = 0.95
• Conditional Probability that the test shows positive, given that the person truly does have a disease = P(B|A1) = 0.9
• Conditional Probability that the test shows positive, even if the person truly does NOT have a disease = P(B|A2) = 0.2
• What is the “event probability” that a randomly selected person from the population, who was tested and whose result was positive, truly has the disease? That is, what is P(truly has disease given test is positive) = P(A1|B)?

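The posterior probability can be calculated based on Bayes’s Theorem as follows:

$$P(A_1 \mid B) = \frac{P(B \mid A_1)\,P(A_1)}{P(B \mid A_1)\,P(A_1) + P(B \mid A_2)\,P(A_2)} = \frac{0.9 \times 0.05}{0.9 \times 0.05 + 0.2 \times 0.95} = \frac{0.045}{0.235} \approx 0.19$$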

So the posterior probability of the person truly having the disease, given that the test result is positive, is only 19%! Note the stark difference in the corrected probability even though the test detects the disease 90% of the time. Why do you think this is the case?
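The calculation above can be checked with a few lines of Python. This is only a minimal sketch of the two-event case; the function name `posterior` and its parameter names are made up for illustration, and the numbers are the ones from the disease example.

```python
def posterior(prior, p_pos_given_disease, p_pos_given_healthy):
    """Return P(A1|B) for two mutually exclusive, exhaustive events A1 and A2."""
    # Numerator: P(B|A1) * P(A1)
    numerator = p_pos_given_disease * prior
    # Denominator: P(B|A1) * P(A1) + P(B|A2) * P(A2), with P(A2) = 1 - P(A1)
    evidence = numerator + p_pos_given_healthy * (1.0 - prior)
    return numerator / evidence


# Values from the example: P(A1) = 0.05, P(B|A1) = 0.9, P(B|A2) = 0.2
p_disease_given_positive = posterior(0.05, 0.9, 0.2)
print(f"P(A1|B) = {p_disease_given_positive:.4f}")
```

The denominator is simply the total probability of a positive test, whether the subject has the disease or not.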

The answer lies in the ‘priors’. The “belief” that only 5% of the population may have the disease is the main reason for the 19% posterior probability. It’s easy to prove. Change your prior belief (all else being equal) from 5% to, let’s say, 30%. Then you get the following result:

$$P(A_1 \mid B) = \frac{0.9 \times 0.30}{0.9 \times 0.30 + 0.2 \times 0.70} = \frac{0.27}{0.41} \approx 0.66$$

Note that the posterior probability for the same test with a higher prior jumped significantly, to about 66%.

Hence, with all evidence and tests being equal, the result of Bayes’s Theorem is strongly influenced by the prior. If you start with a very low prior, then even in the face of strong evidence the posterior probability stays close to the (low) prior.
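To see how strongly the prior drives the result, here is a small sketch that holds the test fixed (P(B|A1) = 0.9, P(B|A2) = 0.2, as in the example) and only varies the prior. The helper name and the list of priors are arbitrary choices for illustration.

```python
def posterior(prior, sensitivity=0.9, false_positive_rate=0.2):
    """P(disease | positive test) for a given prior P(disease)."""
    numerator = sensitivity * prior
    return numerator / (numerator + false_positive_rate * (1.0 - prior))


# Same test, different prior beliefs about how common the disease is.
priors = [0.01, 0.05, 0.30, 0.50]
posteriors = [posterior(p) for p in priors]

for p, post in zip(priors, posteriors):
    print(f"prior = {p:.2f} -> posterior = {post:.3f}")
```

The posterior rises steadily with the prior even though the test itself never changes.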

A prior is not something you randomly make up. It should be based on observations even if subjective. There should be some emphasis on why someone holds on to a belief before assigning a percentage.

If you believe that God does not exist (prior), then a strong test, piece of evidence, or hypothesis that positively detects the possible existence of God moves your prior belief only a little bit, no matter how accurate the test is.

What does Bayesian Inference mean for Neural Nets?

Now that we understand Bayes’s Theorem, let’s see how it applies to regularizing Neural Networks. In the past few posts, we learnt how Neural Nets overfit data, and also techniques to regularize the network towards reducing bias and variance. (A high-variance state is a state in which the network is overfitted.)

One of the techniques to reduce variance and improve generalization is to apply weight decay and weight constraints. If we manage to trim the growing weights on a Neural Network to some meaningful degree, then we can control the variance of the network and avoid overfitting.

So let’s focus on the probability distribution of the weight vector given a set of training data. First, let’s relook at what happens in a Neural Network.

• We initialize the weight vector of a Neural Network to some optimal initial state.
• We have a set of training data that will be run through the network continuously which shall change the weight vector to meet a stated output during training.
• Every time we start with a new input (from the training data set) to train, we have a prior distribution of the weight vector and a probability of an output for the given input based on the weight vector.
• Based on the new output, a cost function calculates the error deviations.
• Back-propagation is used to fix the prior weights to reduce error.
• We then see a posterior distribution of the weight vector for the given training data.
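The steps above can be sketched in code. This is a toy illustration only: a single-weight linear model y = w * x stands in for the network, the data set is made up, and the learning rate and epoch count are arbitrary assumptions.

```python
import random

random.seed(0)

# Hypothetical toy data: targets follow t = 2x with no noise.
data = [(x, 2.0 * x) for x in [0.5, 1.0, 1.5, 2.0]]

w = random.gauss(0.0, 0.1)   # initial weight drawn from a small zero-mean Gaussian
learning_rate = 0.05         # assumed value, chosen so the updates stay stable

for epoch in range(200):
    for x, t in data:
        y = w * x                   # forward pass: output for this input
        error = t - y               # local error E = t - y
        grad = -2.0 * error * x     # gradient of (t - y)^2 with respect to w
        w -= learning_rate * grad   # move the weight to reduce the error

print(f"learned w = {w:.4f}")
```

Each pass starts from the current weight, looks at one training case, and nudges the weight towards a value that explains that case better.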

The question we ask here is twofold:

1. Can we use Bayesian Inference in such a way that the weight distribution is made optimal for learning the correct function that maps the input to the output?
2. Can we ensure that the network is NOT overfitting?

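To recap, mathematically, if ‘t’ is the expected target output and ‘y’ is the output of the Neural Net, then the local error is nothing but E = (t - y). The global error, meanwhile, can be an MSE (mean squared error) over the N training cases ‘c’, as follows:

$$E_{MSE} = \frac{1}{N} \sum_{c} (t_c - y_c)^2$$

or an ESS (error sum of squares), as follows:

$$E_{ESS} = \sum_{c} (t_c - y_c)^2$$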

• Note that the dominant part of both equations is the squared error.
• We are trying to find the weight vector that minimizes the squared errors.
• In likelihood terms, we can also state that we want to find the weight vector that maximizes the log probability density of the correct answer.
• Minimizing the squared error is the same as maximizing the log probability density of the correct answer. This is called Maximum Likelihood Estimation.

Maximum Likelihood Learning

First, let us look at the Maximum Likelihood learning before we apply Bayesian Inference. To do so, let’s assume that we are applying Gaussian Noise to the output of the Neural Network to regularize the network.

In the previous post titled “Mathematical foundation for Noise, Bias and Variance”, we used Noise as a regularizer in the input. Note that we can apply Noise even for the output.

Again, mathematically:

$$y_c = f(x_c, w)$$

In other words, let the output for a given training case y_c be some function of an input x_c and the weight vector w.

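Now, assuming that we are applying Gaussian noise with standard deviation σ_D to the output, we get:

$$p(t_c \mid y_c) = \frac{1}{\sqrt{2\pi}\,\sigma_D} \exp\left(-\frac{(t_c - y_c)^2}{2\sigma_D^2}\right)$$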

We are simply stating that the probability density of the target value given the output after applying Gaussian Noise is the Gaussian distribution centered around the output.

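Let’s use the negative log probability as the cost function, since we want to minimize the cost. So we get:

$$-\log p(t_c \mid y_c) = k + \frac{(t_c - y_c)^2}{2\sigma_D^2}$$

Here k is a constant that does not depend on the weights, and σ_D is the standard deviation of the Gaussian noise.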

When we are working on multiple training cases ‘c’ in the dataset ‘D’, we intend to maximize the product, over every training case ‘c’ in the dataset ‘D’, of the probability of the target given the output. Since the output error for every training case is NOT dependent on the previous training case, we can mathematically state this as:

$$P(D \mid w) = \prod_{c} p(t_c \mid w) = \prod_{c} p\big(t_c \mid f(x_c, w)\big)$$

In other words, the probability of observed data given a weight vector ‘w’ is the product of all probabilities of training case given the output. (Note that the output y_c is a function of inputs x_c and weight vector ‘w’).

But instead of the product of the probabilities of the target values given the outputs, we can work in the log domain, as we did when taking negative log probabilities. So we can instead work on maximizing the sum of log probabilities, as shown:

$$\log P(D \mid w) = \sum_{c} \log p(t_c \mid w)$$

The above is the log probability of observed data, given a weight vector that helps in maximizing the log probability density of the output to be closer to the target value (assuming we are adding a Gaussian noise to the output).
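A quick numerical check of the claim that maximizing this Gaussian log probability is the same as minimizing the squared error. The sketch below is an illustration under assumptions: the toy data, the one-weight model y = w * x, the noise standard deviation, and the grid of candidate weights are all made up.

```python
import math


def gaussian_nll(targets, outputs, sigma):
    """Negative log probability of the targets under N(output, sigma^2), summed over cases."""
    total = 0.0
    for t, y in zip(targets, outputs):
        constant = 0.5 * math.log(2.0 * math.pi * sigma ** 2)   # the constant k
        total += constant + (t - y) ** 2 / (2.0 * sigma ** 2)   # k + squared error term
    return total


def sum_squared_error(targets, outputs):
    """Sum of squared errors (ESS) over all training cases."""
    return sum((t - y) ** 2 for t, y in zip(targets, outputs))


# Hypothetical toy data for a one-weight model y = w * x.
xs = [0.5, 1.0, 1.5, 2.0]
ts = [1.1, 1.9, 3.2, 3.9]
sigma = 1.0

# Try candidate weights from 1.00 to 3.00 and pick the best under each criterion.
candidates = [i / 100.0 for i in range(100, 301)]
best_by_nll = min(candidates, key=lambda w: gaussian_nll(ts, [w * x for x in xs], sigma))
best_by_sse = min(candidates, key=lambda w: sum_squared_error(ts, [w * x for x in xs]))

print(f"best w by negative log probability: {best_by_nll}")
print(f"best w by squared error:            {best_by_sse}")
```

Both criteria pick the same weight, because the negative log probability is just the squared error scaled by 1/(2 * sigma^2) plus a constant.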

Bayesian Inference and Maximum A Posteriori (MAP)

We worked out an equation for Maximum Likelihood learning, but can we use Bayesian Inference to regularize the Maximum Likelihood?

Indeed, the solution seems to lie in applying Maximum A Posteriori estimation, or MAP for short. MAP tries to find the mode of the posterior distribution by employing Bayes’s Theorem. So for Neural Networks, this can be written as:

$$P(w \mid D) = \frac{P(w)\,P(D \mid w)}{P(D)}, \qquad P(D) = \int P(w)\,P(D \mid w)\,dw$$

Where,

• P(w|D) is the posterior probability of the weight vector ‘w’ given the training data set D.
• P(w) is the prior probability of the weight vector.
• P(D|w) is the probability of the observed data given weight vector ‘w’.
• And the denominator, P(D), is the integral over all possible weight vectors.

We can convert the above equation to a cost function, again by applying the negative log likelihood, as follows:

$$\text{Cost} = -\log P(w \mid D) = -\log P(w) - \log P(D \mid w) + \log P(D)$$

Here,

• P(D) is an integral over all possible weights and hence log P(D) converts to some constant.
• From the Maximum Likelihood, we already learnt the equation for log P (D|w)

Let’s look at log P(w), which is the log probability of the prior weights. This is based on how we initialize the weights. In the post titled “Is Optimizing your Neural Network a Dark Art ?” we learnt that the best way to initialize the weights is to draw them from a zero-mean Gaussian.

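So, mathematically, for a zero-mean Gaussian prior with standard deviation σ_w on each weight w_i:

$$-\log P(w) = k + \frac{1}{2\sigma_w^2} \sum_{i} w_i^2$$

where k is a constant that does not depend on the weights.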

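So, the Bayesian Inference for MAP is as follows:

$$-\log P(w \mid D) = \frac{1}{2\sigma_D^2} \sum_{c} (t_c - y_c)^2 + \frac{1}{2\sigma_w^2} \sum_{i} w_i^2 + \text{constant}$$

Here σ_D is the standard deviation of the Gaussian noise on the output and σ_w is the standard deviation of the Gaussian prior on the weights. Multiplying through by 2σ_D² gives a squared-error term plus a weight penalty scaled by σ_D²/σ_w².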

Again, notice the similarity of the loss function to L2 regularization.
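The link to L2 regularization can be made concrete with a tiny example. This is a sketch under assumptions: a one-weight linear model y = w * x, made-up data, and standard deviations (written `sigma_d` for the output noise and `sigma_w` for the weight prior) picked arbitrarily. For this model the minimizer of the MAP cost has a closed form, so no training loop is needed.

```python
def ml_weight(xs, ts):
    """Maximum likelihood weight: minimizes the sum of squared errors for y = w * x."""
    return sum(x * t for x, t in zip(xs, ts)) / sum(x * x for x in xs)


def map_weight(xs, ts, sigma_d, sigma_w):
    """MAP weight: minimizes SSE / (2 * sigma_d^2) + w^2 / (2 * sigma_w^2) for y = w * x."""
    weight_decay = sigma_d ** 2 / sigma_w ** 2   # plays the role of the L2 coefficient
    return sum(x * t for x, t in zip(xs, ts)) / (sum(x * x for x in xs) + weight_decay)


def map_cost(w, xs, ts, sigma_d, sigma_w):
    """Negative log posterior for y = w * x, dropping constants."""
    sse = sum((t - w * x) ** 2 for x, t in zip(xs, ts))
    return sse / (2.0 * sigma_d ** 2) + w ** 2 / (2.0 * sigma_w ** 2)


# Hypothetical toy data.
xs = [0.5, 1.0, 1.5, 2.0]
ts = [1.1, 1.9, 3.2, 3.9]

w_ml = ml_weight(xs, ts)
w_map = map_weight(xs, ts, sigma_d=1.0, sigma_w=1.0)

print(f"maximum likelihood weight: {w_ml:.4f}")
print(f"MAP weight:                {w_map:.4f}")
```

The MAP weight is pulled towards zero relative to the maximum likelihood weight, which is what an L2 penalty does. A narrower prior (smaller `sigma_w`) pulls harder, and a very wide prior recovers the maximum likelihood answer.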

Also note that we started with a randomly initialized zero-mean Gaussian weight vector for MAP and then worked towards fixing it to improve P(w|D). This has the same side effect as L2 regularizers, which can get stuck in local minima.

We take the MAP approach because a full Bayesian approach over all possible weights is computationally intensive and not tractable. There are tricks with MCMC which can help approximate an unbiased sample from the true posterior over the weights. I may cover this in another post.

Maybe now, you are equipped to validate the belief in God…