Neural Networks as Universal Approximators

An intuitive way of understanding the Universal Approximation Theorem

NNs can approximate any n-dimensional continuous function

The Universal Approximation Theorem states that NNs can approximate any continuous real-valued function, in any number of dimensions, to arbitrary accuracy. At first this might seem like a crazy claim, or maybe one that you immediately accept as fact. Either way, it’s a useful concept to understand because it highlights the power of NNs.

This article is intended to develop a quick intuition for the Universal Approximation Theorem; the true proofs get more complicated, involving the Hahn-Banach theorem and the Riesz Representation theorem. I am told the result is simple for mathematicians to grasp using vector spaces and Fourier analysis, but for the world of CS, I feel developing the intuition behind the design is more important than the nitty-gritty details of the proof.

The rigorous mathematical proofs can be found in [2]. But let’s focus on a very cool intuitive understanding that I first learned from the work of a goat researcher, Michael Nielsen [1].

An Intuitive Proof

The claim here is that a NN, which we will call g(x), can approximate a target function f(x) to within any desired accuracy πœ– > 0.

This means the NN satisfies |g(x) - f(x)| < πœ–, βˆ€x in the domain

One quick disclaimer is that the function must be continuous for the error bound to be made arbitrarily small. It’s still possible to approximate a discontinuous function with a continuous one, but the error near the discontinuities cannot be driven below a fixed amount.

2-Layer Neural Network

Let’s dive into the proof using a 2-layer NN with an arbitrary number z of hidden neurons and sigmoid activation functions. If you don’t have a sense of how weights (w) and biases (b) affect a hidden layer, think about plugging wx + b into any activation function. For a sigmoid, w controls the steepness of the activation and b shifts it horizontally. Now think about a very high value of w: the activation becomes almost instant, basically a step function.

I encourage you to play around with these functions yourself to confirm this. You will see that the step sits where wx + b = 0, so its position depends on both parameters. To simplify, let’s combine them into a single variable: the ratio s = -b/w is the step’s position, and changing s shifts the step function horizontally.

A high w results in a neuron output converging to a step function (Michael Nielsen)
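To make this concrete, here is a minimal sketch (assuming NumPy) of a single sigmoid neuron Οƒ(wx + b): as w grows with the ratio s = -b/w held fixed, the output snaps into a step located at x = s.

```python
# Minimal sketch: a single sigmoid neuron sigma(w*x + b) approaches a step
# function located at s = -b/w as w grows.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.linspace(-1, 1, 5)    # a few sample inputs
s = 0.3                      # desired step position
for w in (1, 10, 100):
    b = -w * s               # choose b so that -b/w = s
    print(f"w={w:3d}:", np.round(sigmoid(w * x + b), 3))
# As w increases, outputs snap to ~0 for x < 0.3 and ~1 for x > 0.3.
```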

All of this is in relation to a single neuron in the hidden layer. The second weight, connecting that hidden neuron to the output, scales its contribution vertically. Adding more hidden-layer neurons essentially just adds or subtracts more step functions, so with many hidden neurons we can create scaled and shifted rect functions, or any staircase-like response.

There are various ways to group neurons together to create controlled functions, but to build intuition, it’s enough to see that two neurons can create a rect. Using rectangles of varying widths and heights, we can approximate any continuous function (think of how rectangles approximate area in Riemann sums for integration).
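As a sketch of that idea (assuming NumPy; step_neuron and bump are helper names invented for this demo), two steep-sigmoid neurons with opposite output weights carve out a rectangle of chosen position, width, and height:

```python
# Minimal sketch: two steep-sigmoid "step" neurons with output weights +h and -h
# form a rectangle (a "bump") of height h on the interval [s1, s2].
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def step_neuron(x, s, w=100.0):
    # Approximate step at position s: sigma(w*x + b) with b = -w*s
    return sigmoid(w * (x - s))

def bump(x, s1, s2, h):
    # Rectangle of height h between s1 and s2, built from two step neurons
    return h * (step_neuron(x, s1) - step_neuron(x, s2))

x = np.linspace(0, 1, 11)
print(np.round(bump(x, 0.3, 0.6, 2.0), 2))
# ~2.0 strictly inside (0.3, 0.6), ~0.0 well outside, halfway at the edges.
```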

In more mathematical terms, to map f(x) using a NN g(x), we need to design g(x) to be

g(x) = Ξ£_j v_j Οƒ(w_j x + b_j), with the weights chosen so that |g(x) - f(x)| < πœ–, βˆ€x

where Οƒ is the sigmoid, each pair (w_j, b_j) sets the sharpness and position of a step, and each output weight v_j sets its height.
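As a rough numerical illustration of that formula (assuming NumPy; the target f(x) = sin(2Ο€x), the interval [0, 1], and the helper build_g are all choices made just for this demo), we can tile the interval with narrow rectangles whose heights track f and watch the maximum error shrink as the number of hidden neurons grows. This is a hand-built construction, not a trained network.

```python
# Minimal sketch: g(x) = sum_j v_j * sigma(w_j*x + b_j), built from pairs of steep
# sigmoid steps, i.e. rectangles whose heights track f at each rectangle's center.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def f(x):
    return np.sin(2 * np.pi * x)

def build_g(n_bumps, steepness=200.0):
    edges = np.linspace(0, 1, n_bumps + 1)
    centers = (edges[:-1] + edges[1:]) / 2
    # Hidden layer: one steep step per rectangle edge (2 per rectangle)
    w = np.full(2 * n_bumps, steepness)
    b = -steepness * np.concatenate([edges[:-1], edges[1:]])
    # Output weights: +f(center) on the left-edge step, -f(center) on the right-edge step
    v = np.concatenate([f(centers), -f(centers)])
    return lambda x: sigmoid(w * x[:, None] + b) @ v

x = np.linspace(0, 1, 1001)
for n in (5, 20, 80):
    g = build_g(n)
    print(f"{n:3d} rectangles -> max|g - f| = {np.max(np.abs(g(x) - f(x))):.3f}")
# More hidden neurons (narrower rectangles) drive the maximum error down.
```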

This is the basic intuition for a sigmoid activation function. If the input variables increase in dimensionality, the intuition still holds but is harder to visualize. For x, y inputs the construction can still be done with planes, but past that, more rigorous proofs are needed to show that the functions such networks represent are dense in the relevant function space.

I previously discussed different activation functions, but I only showed this intuition for the sigmoid. Since tanh is just a scaled and shifted sigmoid (tanh(x) = 2Οƒ(2x) - 1), the same logic applies.
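A one-line check of that identity (assuming NumPy):

```python
# tanh is a scaled and shifted sigmoid: tanh(x) = 2*sigma(2x) - 1,
# so its "steps" simply run from -1 to 1 instead of 0 to 1.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.linspace(-3, 3, 7)
print(np.allclose(np.tanh(x), 2 * sigmoid(2 * x) - 1))  # True
```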

ReLU admits a very similar logical argument, and it works for any variation of the ReLU. Instead of a steep sigmoid acting as a step, consider adding or subtracting two ReLU functions. I graphed it quickly for a visual demo:

Adding Two ReLU Functions

From this example, it is easy to see how combining many scaled and shifted ReLUs can map different functions by piecing together line segments of different slopes (these can even cancel to a slope of 0, creating rect-like plateaus as with the sigmoid). This can be generalized to n dimensions as well.
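Here is a small sketch of the ReLU version (assuming NumPy; ramp and trapezoid_bump are names invented for this demo): the difference of two shifted ReLUs has slope 1 on an interval and slope 0 outside it, and scaling and combining a few of them yields a flat-topped bump, the ReLU analogue of the sigmoid rect.

```python
# Minimal sketch: relu(x - a) - relu(x - b) rises with slope 1 on [a, b] and is
# flat outside it; four scaled ReLUs form a trapezoidal (flat-topped) bump.
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def ramp(x, a, b):
    # Slope 1 on [a, b], constant (slope 0) elsewhere
    return relu(x - a) - relu(x - b)

def trapezoid_bump(x, a, b, c, d, h):
    # Rises 0 -> h on [a, b], stays at h on [b, c], falls h -> 0 on [c, d]
    return (h / (b - a)) * ramp(x, a, b) - (h / (d - c)) * ramp(x, c, d)

x = np.linspace(0, 1, 11)
print(np.round(trapezoid_bump(x, 0.2, 0.4, 0.6, 0.8, 1.0), 2))
# 0 outside [0.2, 0.8], flat at 1.0 on [0.4, 0.6], linear ramps in between.
```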

Unforeseen Drawbacks

Although the beauty of the theorem makes it seem as if neural networks are the ultimate solution for everything data-related, there are some drawbacks.

  1. Theory vs. Practice- In theory, we have shown that we can approximate any continuous function, but the theorem is only an existence result: the math gives no guidance on how many neurons are needed or how to find the right weights. In practice, it is difficult to choose layer sizes and capacity that actually achieve the desired error bound.
  2. Overfitting- Trying to fit f(x) perfectly with the NN g(x) can easily lead to overfitting the training data, meaning g(x) may not generalize to the true f(x) (see the sketch after this list).
  3. Data Distribution- In practice, the measurements of f(x) carry varying levels of noise and error inherent to the data collection. The NN can learn to approximate the given data, but that approximation may not generalize to the true underlying signal, making it difficult to draw conclusions from the data.
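To illustrate drawbacks 2 and 3 concretely, here is a conceptual sketch (assuming NumPy; a high-degree polynomial stands in for an over-flexible NN, since the failure mode is the same): the model drives its error on the noisy training points to essentially zero while its error against the true underlying signal stays large.

```python
# Conceptual sketch: perfectly fitting noisy samples is not the same as fitting
# the true signal. A very flexible model interpolates the noise and generalizes poorly.
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(-1, 1, 12)
true_signal = lambda x: np.sin(np.pi * x)
y_noisy = true_signal(x_train) + rng.normal(0.0, 0.2, size=x_train.shape)

# Degree-11 polynomial through 12 points: enough capacity to interpolate the noise
coeffs = np.polyfit(x_train, y_noisy, deg=11)

x_dense = np.linspace(-1, 1, 500)
train_err = np.max(np.abs(np.polyval(coeffs, x_train) - y_noisy))
true_err = np.max(np.abs(np.polyval(coeffs, x_dense) - true_signal(x_dense)))
print(f"max error on noisy training data: {train_err:.2e}")  # ~0 (fits the noise)
print(f"max error against the true signal: {true_err:.2f}")  # does not go to 0
```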

References

[1] http://neuralnetworksanddeeplearning.com/

[2] https://numerics.ovgu.de/teaching/psnn/2122/handout_ahmet.pdf


Ryan D'Cunha
π€πˆ 𝐦𝐨𝐧𝐀𝐬.𝐒𝐨

Biomedical Engineer and Computer Scientist. Interested in deep learning and its applications to medical devices and medical imaging algorithms