Mathematical foundation for Noise, Bias and Variance in #NeuralNetworks

A picture with Gaussian Noise as a filter.

Neural Nets are powerful machine learning models used for learning the behavior of high-dimensional data. In practice, data is rarely available in its purest form, the signal. Most of the data made available to machines during training or runtime prediction carries some amount of noise. During training, Neural Nets can become highly sensitive to the noise in the input and start overfitting to it.

In the previous post titled “Algorithms to Improve NeuralNetwork Accuracy”, we learnt about the overfitting problem in Neural Nets and how to break it using L1/L2 regularizers, weight penalties, decay and constraints. Also, in the post titled “Committee of Intelligent Machines”, we learnt how to stack different models to improve accuracy and generalize the prediction better.

In this post, I would like to introduce the fundamental math behind Noise, Bias and Variance during Neural Net training, and also show how Noise can be used as a regularizer to help Neural Nets generalize.

Waiter, I have Noise in my Signal

It’s important to understand noise in machine learning because this is a fundamental underlying concept that is present in all datasets.

Let’s start with a simple understanding of Noise. Noise is a distortion in data that is unwanted by the perceiver of the data. Noise is anything spurious and extraneous to the original data that was not intended to be present in the first place, but was introduced by a faulty capture process.

Noise gets into data in many ways:

  • If the data is an image, the noise can be due to poor lighting while capturing the image. It can also be due to a low-quality sensor (camera) that is not able to capture a high-resolution picture.
  • If the data is text, the noise can be due to spelling errors, typographical errors, excessive use of informal language, text that lacks semantic coherence, garbled text from text capture (OCR), etc.
  • The same applies to any other form of data capture such as video, audio, x-ray or spectral capture. Most of the noise comes either from nature (the environment is not a perfect vacuum) or from the sensor (the equipment is not of high quality).
  • Noise can be introduced during storage, transfer or input/output processing as well.

Given this, we need to always account for noise during machine learning.

Now, let’s say we have some data that we are using to train our Neural Net models. Let’s say that the input vectors {x1, x2, … xn} have outcomes {y1, y2, … yn} associated with them. Then the way to think about Noise is as follows:
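Reconstructed from the definitions that follow (y, f(x) and epsilon are the symbols this post uses), the relation is the usual additive-noise model, written here in LaTeX:

```latex
y = f(x) + \epsilon
```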

Where f(x) is some underlying function on the independent variable x (input features) and y is the outcome.

Here, epsilon is the noise in the data, which is added on top of the underlying function of the input. Let’s also assume that epsilon has zero mean and a fixed variance, sigma-square.
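To make these assumptions concrete, here is a minimal Python sketch that generates noisy outcomes. The sine function standing in for f(x), the value sigma = 0.3 and the helper name `make_noisy_data` are illustrative assumptions, not something fixed by this post.

```python
import numpy as np

def underlying_f(x):
    # Hypothetical signal f(x); any smooth function would serve the same purpose.
    return np.sin(2 * np.pi * x)

def make_noisy_data(n, sigma, rng):
    """Draw n inputs and return (x, y, epsilon) with y = f(x) + epsilon."""
    x = rng.uniform(0.0, 1.0, size=n)
    epsilon = rng.normal(loc=0.0, scale=sigma, size=n)  # zero mean, variance sigma**2
    y = underlying_f(x) + epsilon
    return x, y, epsilon

rng = np.random.default_rng(0)
x, y, epsilon = make_noisy_data(n=10_000, sigma=0.3, rng=rng)
print(f"sample mean of epsilon:     {epsilon.mean():+.4f}")
print(f"sample variance of epsilon: {epsilon.var():.4f}")
```

With enough samples, the printed mean should sit close to zero and the printed variance close to sigma squared, which is all the model above asks of the noise.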

The objective of the Neural Nets is to learn the underlying function f(x) in order to predict the outcomes. As you may guess, since the noise ‘epsilon’ sits between the underlying function f(x) and the outcome, there is a high chance that the Neural Nets learn the noise as well.

The trick of the trade is that, if the Neural Nets can treat the noise ‘epsilon’ as a separate component from the underlying function, then we can claim that the Neural Nets are balanced. Typically, Neural Net models are unable to differentiate noise from the signal and end up learning the noise as part of the function, i.e. they fit “f(x) + epsilon” instead of f(x). This is when the Neural Nets have overfitted. We call this the high-variance state. The opposite of the high-variance state is the high-bias state, where the Neural Nets are unable to come up with any learning at all (as in, the Neural Net is not able to find any relation between the input vector and the outcomes). Let’s study this further…
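The two states can be illustrated with a small experiment. The sketch below swaps the Neural Net for polynomial fits of different degrees (a stand-in chosen only because it runs in a few lines); the sine signal, the noise level and the degrees are assumptions made for illustration.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(1)

def underlying_f(x):
    # Hypothetical signal f(x), assumed for illustration.
    return np.sin(2 * np.pi * x)

sigma = 0.3  # assumed standard deviation of epsilon
x_train = np.linspace(0.0, 1.0, 15)
y_train = underlying_f(x_train) + rng.normal(0.0, sigma, x_train.size)
x_val = np.linspace(0.0, 1.0, 200)
y_val = underlying_f(x_val) + rng.normal(0.0, sigma, x_val.size)

errors = {}
for degree in (0, 3, 12):
    # Least-squares polynomial fit plays the role of the learnt function f_cap.
    f_cap = Polynomial.fit(x_train, y_train, degree)
    train_mse = np.mean((f_cap(x_train) - y_train) ** 2)
    val_mse = np.mean((f_cap(x_val) - y_val) ** 2)
    errors[degree] = (train_mse, val_mse)
    print(f"degree {degree:2d}: train MSE = {train_mse:.3f}, validation MSE = {val_mse:.3f}")
```

A degree-0 fit cannot follow the signal at all (high bias). A high-degree fit drives the training error down by chasing the noise, which typically shows up as a worse validation error (high variance).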

The economy of Bias and Variance

Let’s mathematically decompose the prediction error to understand the concept of Bias and Variance, given that there is noise in the signal.

Let’s say we train the Neural Nets to find a function f_cap(x) that approximates the underlying function f(x) while noise ‘epsilon’ is present. We can say that the Neural Network has learnt well if the mean squared error between the expected outcome ‘y’ and the output of the learnt function f_cap(x) is minimal or near zero. Note that it should be minimal not only for the training set, but also for any new data that is provided during validation.

Let’s define the concept of bias and variance now:

Bias: As we said, bias is the inability of the Neural Nets to learn any correlation between the input vector and the output state. Mathematically it can be expressed as follows:
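Reconstructing the expression from the description given here, and writing \hat{f}(x) for f_cap(x), the bias of the learnt function can be written as:

```latex
\mathrm{Bias}\big[\hat{f}(x)\big]
  = \big\langle \hat{f}(x) - f(x) \big\rangle
  = \big\langle \hat{f}(x) \big\rangle - f(x)
```

The second equality holds because f(x) is a fixed, non-random function; the expected value is assumed to run over the different noisy training sets the model could have been trained on.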

I am using the < > notation to denote the expected-value here. If the expected value of the difference between the learnt-function and the underlying function is high, then we state that the learnt function is highly-biased.

Variance: Variance is the sensitivity of the Neural Nets to small fluctuations in the training inputs. In other words, any minor noise in the input gets picked up by the model, which then fits the noise as if it were signal. This causes overfitting and produces poor accuracy during validation. Mathematically it can be expressed as follows:
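A standard way to write this, again assuming the expected value is taken over different noisy training sets, is:

```latex
\mathrm{Var}\big[\hat{f}(x)\big]
  = \Big\langle \big(\hat{f}(x) - \langle \hat{f}(x) \rangle\big)^2 \Big\rangle
```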

We can also write the variance of any given random variable X in terms of expected values as follows:
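The textbook identity, which follows from expanding the square and using the linearity of the expected value, reads:

```latex
\mathrm{Var}[X]
  = \big\langle (X - \langle X \rangle)^2 \big\rangle
  = \langle X^2 \rangle - \langle X \rangle^2
```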

We said that the mean squared error between the outcome ‘y’ and the learnt-function f_cap(x) should be minimal or near zero. In other words, we said:
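In symbols, and assuming the expected value also runs over new data points, the requirement is:

```latex
\Big\langle \big(y - \hat{f}(x)\big)^2 \Big\rangle \;\longrightarrow\; \min
```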

In order to decompose the mean-squared error into its Bias and Variance parts, let’s work on the equation as follows:
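The steps below are a reconstruction of the standard derivation, not necessarily the exact one this post originally showed. It assumes y = f + epsilon, that epsilon has zero mean and variance sigma-square, and that the noise on a new data point is independent of the learnt function \hat{f} (arguments (x) are dropped for brevity):

```latex
\begin{align*}
\big\langle (y - \hat{f})^2 \big\rangle
  &= \big\langle (f + \epsilon - \hat{f})^2 \big\rangle \\
  &= \big\langle (f - \hat{f})^2 \big\rangle
     + 2\,\langle \epsilon \rangle \big\langle f - \hat{f} \big\rangle
     + \langle \epsilon^2 \rangle \\
  &= \big\langle (f - \hat{f})^2 \big\rangle + \sigma^2 \\
  &= \big(f - \langle \hat{f} \rangle\big)^2
     + \Big\langle \big(\hat{f} - \langle \hat{f} \rangle\big)^2 \Big\rangle
     + \sigma^2 \\
  &= \mathrm{Bias}\big[\hat{f}\big]^2 + \mathrm{Var}\big[\hat{f}\big] + \sigma^2
\end{align*}
```

The fourth line adds and subtracts \langle \hat{f} \rangle inside the square; the cross term vanishes because \langle \hat{f} - \langle \hat{f} \rangle \rangle = 0.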

And hence, the above equation provides the relation between Variance and Bias for a mean-squared-error between the learnt-function and the signal.

Note that sigma-square is the variance of the noise ‘epsilon’ in the signal (the part of the error that no model can remove), ‘Var’ is the variance purely of the learnt-function, and Bias is the expected difference between the learnt-function and the underlying function of the signal.
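The decomposition can be checked numerically. The following is a rough Monte Carlo sketch, again using a polynomial fit as a stand-in for the Neural Net; the sine signal, sigma, the degree and the number of trials are all assumptions picked for illustration.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def underlying_f(x):
    # Hypothetical signal f(x), assumed for illustration.
    return np.sin(2 * np.pi * x)

sigma = 0.3                   # standard deviation of epsilon
degree = 3                    # capacity of the stand-in model
n_train, n_trials = 30, 1000
x_train = np.linspace(0.0, 1.0, n_train)
x_test = np.linspace(0.0, 1.0, 50)
f_test = underlying_f(x_test)

predictions = np.empty((n_trials, x_test.size))
squared_errors = np.empty((n_trials, x_test.size))
for t in range(n_trials):
    # Each trial: a fresh noisy training set and a fresh noisy test set.
    y_train = underlying_f(x_train) + rng.normal(0.0, sigma, n_train)
    f_cap = Polynomial.fit(x_train, y_train, degree)
    predictions[t] = f_cap(x_test)
    y_test = f_test + rng.normal(0.0, sigma, x_test.size)
    squared_errors[t] = (y_test - predictions[t]) ** 2

mean_prediction = predictions.mean(axis=0)
bias_sq = np.mean((mean_prediction - f_test) ** 2)   # Bias^2, averaged over x
variance = np.mean(predictions.var(axis=0))          # Var, averaged over x
noise = sigma ** 2                                   # irreducible sigma-square
mse = squared_errors.mean()

print(f"measured MSE:            {mse:.4f}")
print(f"noise + Var + Bias^2:    {noise + variance + bias_sq:.4f}")
```

The two printed numbers should agree up to sampling error, since the measured error is estimated from a finite number of trials.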

So, whenever we say that our Neural Net models are overfitting, we are stating that there is a high variance in the learnt-function.

One of the ways to reduce overfitting is to actually add more noise while training Neural Nets!!

Noise as a Regularizer

If a Neural Net is displaying high variance, it is possibly because it is fitting “f(x) + epsilon” of the training-set too closely. So when a validation-set is provided during test, the model does not know how to predict accurately.

One of the ways to break this is actually to add noise to the input x so that the model is not able to correlate the input vector {x1…xn} too tightly to its output class {y1…yn}. In other words, we are trying to reduce the variance of the system by breaking its ability to tightly fit f_cap(x) to “f(x) + epsilon”.

As an analogy, this is like saying: if the system is able to recognize the background ‘ticks’ too well, then we shall add more controlled ‘hums’ into the background in such a way that the ticks are not noticed (or are nullified). It’s like adding a homogeneous normal ‘distortion’ to eliminate nuanced ticks/noise.

Note: For people who have used ensemble methods, this is NOT the same as bootstrapping in ensemble methods, where multiple subsets of data are created from the same data-set by random resampling. Instead, this concept is similar to Additive White Gaussian Noise, or AWGN.

A controlled noise like Gaussian is good to modulate any noise that exists in the input data.
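One common way to write the noisy input is shown below; the symbol \tilde{x} is introduced here purely for illustration:

```latex
\tilde{x} = x + N, \qquad N \sim \mathcal{N}(0, \sigma^2)
```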

Here, ‘N’ represents the Gaussian noise with zero mean and variance sigma-square.
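In code, noise injection usually means drawing fresh Gaussian noise for the inputs at every training step. The sketch below shows only that part; the helper name `add_gaussian_noise`, the batch shape and sigma = 0.1 are hypothetical, and the model update itself is left out.

```python
import numpy as np

def add_gaussian_noise(inputs, sigma, rng):
    """Return a noisy copy of `inputs`; the original array is left untouched."""
    noise = rng.normal(loc=0.0, scale=sigma, size=inputs.shape)
    return inputs + noise

rng = np.random.default_rng(42)
x_batch = rng.uniform(-1.0, 1.0, size=(256, 8))  # hypothetical batch: 256 samples, 8 features

for epoch in range(3):
    # Fresh noise every epoch, so the model never sees the same perturbed input twice.
    x_noisy = add_gaussian_noise(x_batch, sigma=0.1, rng=rng)
    # A real training step would consume x_noisy here (model update omitted).
    print(f"epoch {epoch}: std of injected noise = {np.std(x_noisy - x_batch):.4f}")
```

The noise is applied only during training; at validation or prediction time the clean inputs are used.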

Mathematically, the effect can be understood as follows:
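The derivation is easiest to follow for a linear model, so the sketch below assumes a prediction of the form w^T x (an assumption made here for illustration; for a non-linear network the same result holds only approximately, for small sigma). With noise N drawn from \mathcal{N}(0, \sigma^2 I), so that \langle N \rangle = 0 and \langle N N^{\top} \rangle = \sigma^2 I, the expected squared error over the noise is:

```latex
\begin{align*}
\Big\langle \big(y - w^{\top}(x + N)\big)^2 \Big\rangle_N
  &= (y - w^{\top}x)^2
     - 2\,(y - w^{\top}x)\,w^{\top}\langle N \rangle
     + w^{\top}\big\langle N N^{\top} \big\rangle w \\
  &= (y - w^{\top}x)^2 + \sigma^2 \lVert w \rVert^2
\end{align*}
```

Here sigma-square is the variance of the injected noise N, not the variance of the noise epsilon already present in the data.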

Note that the sigma-square term turns out to be a Gaussian penalty on the cost function, and is very similar to an L2-penalty, which you learnt about in the post titled “Algorithms to improve Neural Network Accuracy”.
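This similarity can be checked numerically for a linear model. The sketch below is only an illustration under assumed values: the data, the fixed weight vector `w` and sigma = 0.2 are all made up, and the weights are not trained.

```python
import numpy as np

rng = np.random.default_rng(7)
n_samples, n_features, sigma = 200, 5, 0.2

X = rng.normal(size=(n_samples, n_features))
w = np.array([0.5, -1.0, 2.0, 0.0, 1.5])   # arbitrary fixed weights (assumed, not trained)
y = X @ rng.normal(size=n_features) + rng.normal(0.0, 0.1, size=n_samples)

# Loss on clean inputs, and the same loss with an explicit L2-style penalty.
clean_loss = np.mean((y - X @ w) ** 2)
l2_form = clean_loss + sigma ** 2 * np.sum(w ** 2)

# Average loss when fresh Gaussian noise is added to the inputs on every draw.
n_draws = 2000
noisy_losses = np.empty(n_draws)
for i in range(n_draws):
    X_noisy = X + rng.normal(0.0, sigma, size=X.shape)
    noisy_losses[i] = np.mean((y - X_noisy @ w) ** 2)
noisy_loss = noisy_losses.mean()

print(f"clean loss:                     {clean_loss:.4f}")
print(f"clean loss + sigma^2 * ||w||^2: {l2_form:.4f}")
print(f"average loss with input noise:  {noisy_loss:.4f}")
```

The last two printed values should nearly coincide, which is the sense in which input noise acts like an L2 penalty of strength sigma-square on the weights.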

In conclusion, by adding Gaussian noise to the input signal, we can regularize the Models more efficiently and prevent them from overfitting.

This is somewhat similar to:

Fight fire, with fire…