Neural Networks as Universal Approximators
An intuitive way of understanding the Universal Approximation Theorem
The Universal Approximation Theorem states that NNs can approximate any real-valued function in any number of dimensions. At first, this might seem like a crazy claim, or maybe one you immediately accept as fact. Either way, it's a useful concept to understand that highlights the power of NNs.
This is intended to develop a quick intuition behind the Universal Approximation Theorem; the true proofs get more complicated, involving the Hahn-Banach theorem and the Riesz Representation theorem. I am told it is simple for mathematicians to grasp using vectors and Fourier analysis, but for the world of CS, I feel developing the intuition behind the design is more important than the nitty-gritty details of the proof.
The rigorous mathematical proofs can be found in [2]. But let's focus on a very cool intuitive understanding that I first learned from the work of Michael Nielsen [1].
An Intuitive Proof
The claim here is that the NN, which we will arbitrarily call g(x), can approximate a function f(x) within some desired accuracy ε > 0.
This satisfies |g(x) - f(x)| < ε, ∀x
One quick disclaimer: the function must be continuous to achieve an arbitrarily low error bound. It is possible to approximate a discontinuous function with a continuous one, but the error near the discontinuities will be higher.
Let's dive into the proof using a 2-layer NN with an arbitrary number z of neurons using sigmoid activation functions. If you don't have a sense of how weights (w) and biases (b) affect a hidden layer, think about plugging wx + b into any activation function. For a sigmoid, w controls the steepness of the activation and b shifts it horizontally. Now think about a very high value of w: the activation becomes almost instant (basically a step function).
I encourage you to play around with these functions yourself to confirm what I'm saying. You will see that the step sits at s = -b/w: it moves opposite to b and scales inversely with w. To simplify, let's combine the two parameters into this single variable s; changing s horizontally shifts the step function.
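As a quick way to play with this yourself, here is a minimal sketch (the variable values are my own, chosen only for illustration) showing that a large w turns the sigmoid into a near-step at s = -b/w:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# A large weight w turns the sigmoid into a near-step at s = -b/w.
w, b = 1000.0, -500.0          # step sits at s = -b/w = 0.5
print(sigmoid(w * 0.4 + b))    # left of the step: essentially 0
print(sigmoid(w * 0.6 + b))    # right of the step: essentially 1
```

Try shrinking w back toward 1 and you will see the smooth sigmoid reappear.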
All of this is in relation to a single neuron in the hidden layer. The second weight, connecting the hidden layer to the output, vertically scales the output. Adding more hidden-layer neurons is essentially just adding or subtracting step functions. It's easy to see that with many hidden-layer neurons, we can create scaled/shifted rect functions or any step response.
There are various ways to group neurons together to create controlled functions, but for intuition, it's enough to see that two neurons can create a rect. Using rectangles of varying widths, we can approximate any function (think of how rectangles are used in Riemann sums to approximate the area under a curve).
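The two-neuron rect can be sketched directly: one steep sigmoid steps up at the left edge, a second steps down at the right edge, and the output weight sets the height (the `rect` helper and its parameter values are my own, for illustration):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rect(x, left, right, height, w=1000.0):
    # step up at `left`, step back down at `right`; height set by the output weight
    return height * (sigmoid(w * (x - left)) - sigmoid(w * (x - right)))

print(rect(0.5, left=0.3, right=0.7, height=2.0))  # inside the rect: ~2
print(rect(0.9, left=0.3, right=0.7, height=2.0))  # outside: ~0
```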
In more mathematical terms, to map f(x) using a NN g(x), we need to design g(x) as a weighted sum of shifted sigmoids:

g(x) = Σᵢ vᵢ σ(wᵢx + bᵢ) ≈ f(x), for i = 1, …, z

where σ is the sigmoid, each (wᵢ, bᵢ) pair places a step, and each output weight vᵢ scales it.
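Putting the pieces together, here is a self-contained sketch of the whole construction (the `approximate` helper, bin count, and target function are my own choices, not from the original): one sigmoid rect per bin, with the bin's height read off f at its midpoint, stacked into g(x):

```python
import numpy as np

def sigmoid(x):
    # clip to avoid overflow in exp for very steep steps
    return 1.0 / (1.0 + np.exp(-np.clip(x, -500.0, 500.0)))

def approximate(f, a, b, n_bins, w=10000.0):
    """Build g(x) as a sum of sigmoid 'rects', one per bin,
    matching f at each bin's midpoint (a Riemann-sum-style fit)."""
    edges = np.linspace(a, b, n_bins + 1)
    mids = (edges[:-1] + edges[1:]) / 2.0
    heights = f(mids)

    def g(x):
        x = np.asarray(x, dtype=float)
        out = np.zeros_like(x)
        for lo, hi, h in zip(edges[:-1], edges[1:], heights):
            out += h * (sigmoid(w * (x - lo)) - sigmoid(w * (x - hi)))
        return out

    return g

# Approximate sin on [0, pi] with 200 rects and measure the worst-case error.
g = approximate(np.sin, 0.0, np.pi, 200)
xs = np.linspace(0.05, np.pi - 0.05, 500)
max_err = float(np.max(np.abs(g(xs) - np.sin(xs))))
print(max_err)  # small; shrinks further as n_bins grows
```

Doubling `n_bins` roughly halves the error, which is exactly the ε-can-be-made-arbitrarily-small behavior the theorem promises.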
This is the basic intuition for a sigmoid activation function. If the input increases in dimensionality, the intuition still holds but is harder to visualize: for x, y inputs the steps become planes that combine into towers, but past that, more rigorous proofs are needed to show that such networks are dense in the space of continuous functions.
I previously discussed different activation functions, but I only showed this intuition for sigmoid. Since tanh is just a scaled and shifted sigmoid (tanh(x) = 2σ(2x) - 1), the same logic applies.
ReLU admits a very similar argument, and it works for any variation of the ReLU. Instead of a sigmoid acting as a step, consider adding or subtracting two ReLU functions. I graphed it quickly for a visual demo.
From this example, it is easy to see how combining many scaled and shifted ReLUs can map different functions by piecing together line segments of different slopes (including a slope of 0, to create rect functions as with the sigmoid). This generalizes to n dimensions as well.
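The subtraction trick can be sketched in a few lines (the `ramp` helper and its breakpoints are my own, for illustration): the difference of two shifted ReLUs is zero, then a sloped segment, then a flat plateau:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

# Subtracting two shifted ReLUs gives a ramp: 0 before a,
# slope 1 between a and b, then a flat plateau of height b - a.
def ramp(x, a, b):
    return relu(x - a) - relu(x - b)

print(ramp(-1.0, 0.0, 1.0))  # before the ramp: 0.0
print(ramp(0.5, 0.0, 1.0))   # on the ramp: 0.5
print(ramp(3.0, 0.0, 1.0))   # plateau: 1.0
```

Subtracting a second, offset ramp then brings the plateau back down to 0, giving the ReLU version of the rect.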
Unforeseen Drawbacks
Although the beauty of the theorem makes it seem as if neural networks are the ultimate solution for everything data-related, there are some drawbacks.
- Theory vs. Practice - In theory, we have shown we can approximate any function, but the theorem is non-constructive: it guarantees a network exists without telling us how to find its weights. In practice, it is difficult to choose layer sizes and capacity that achieve the given error bound.
- Overfitting - Trying to perfectly fit f(x) with the NN g(x) can easily lead to overfitting the training data, meaning g(x) may not generalize to the true f(x).
- Data Distribution - In practice, f(x) will carry varying levels of noise and error inherent to the signal, depending on the data collection. While the NN can learn to approximate the given data, it may not generalize to the true underlying signal, making it difficult to draw conclusions from the data.
References
[1] M. Nielsen, Neural Networks and Deep Learning, http://neuralnetworksanddeeplearning.com/
[2] https://numerics.ovgu.de/teaching/psnn/2122/handout_ahmet.pdf