DEEP LEARNING
Universal Approximation Theorem
A Visual Proof
Humans have come a long way, from the wheel to the rocket, creating a marvelous empire of inventions. But it might come as a surprise that most of our designs are not original or unique. Most of them already exist in nature, and sometimes these natural versions are even more sophisticated than their man-made counterparts. It is therefore no surprise that, to make breakthrough inventions, we have often turned to nature; after all, who better to learn from than a system that has been optimizing itself for ages?
Neural networks are an attempt to mimic one such wonder of nature: the network of biological neurons in the brain.
What do NNs do?
Neural networks are an example of function approximation algorithms that seek to approximate a function represented by our data.
A question that arises here is:
Given that the space of possible functions is so large, can any finite computational system do a good job of approximating such functions?
Hence, neural networks can't be seen as just any ordinary functions; rather, they have to be seen as universal functions with the capacity to perform such a task.
Escaping Linearity
The neural network’s basic structure consists of layers of neurons, where each neuron computes a weighted sum of the inputs from the previous layer. The sum calculated by every neuron differs from that of its peers because of different weights and biases. Each succeeding layer in the network captures more and more complex patterns in the data.
After a neuron computes the weighted sum of its inputs, this sum is passed through an activation function. Activation functions play a vital role in the functioning of neural networks. Let us understand how.
A neural network without an activation function is essentially just a linear regression model. — Andrew Ng
Here's how the output of a neural network with 'n' layers and no activation functions reduces to a linear function of the input.
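For example, take just two layers, with weight matrices W₁, W₂ and biases b₁, b₂ and no activation in between:

output = W₂(W₁x + b₁) + b₂ = (W₂W₁)x + (W₂b₁ + b₂) = Wx + b

The composition of affine (linear plus bias) maps is itself an affine map, so no matter how many layers we stack, the whole network collapses to a single expression of the form Wx + b.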
This means that every neuron only performs a linear transformation on its inputs using the weights and biases. Though this makes the neural network simpler, such a network is far less powerful and cannot learn the intricate patterns in the data.
This type of neural network has a limited “capacity” to perform the task.
Admittedly, this is not a desirable property for a universal function. Hence, to incorporate flexibility, we introduce nonlinearity in the form of activation functions (Sigmoid, Tanh, ReLU, etc.).
As an analogy, assume that we want to trace out a curve. Using activation functions is like tracing this curve with a French curve (an engineering drawing tool). Without activations, it is like trying to do the same task with a single straight ruler.
Hence in order to capture complex data patterns and features that escape linear models, the outputs from intermediate neurons are passed through activation functions.
How does an Activation Function work?
We will see how activation functions incorporate non-linearity using the example of the ReLU function (in my opinion, of all the common activation functions, it is the least intuitive as to why it should work).
Here’s a way to see how ReLU helps with the curve tracing analogy.
The above animation shows how ReLUs approximate functions by breaking them up into linear fragments instead of curves. Perhaps surprisingly, ReLU is a popular choice of activation function for computer vision tasks, as it results in better convergence and is computationally efficient.
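To make this concrete, here is a minimal NumPy sketch of the same idea (the target curve sin(x), the knot positions, and the least-squares fit are my own illustrative choices, not what the animation uses; a trained network would learn comparable weights by gradient descent):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

# Target curve to trace (chosen purely for illustration)
x = np.linspace(0.0, 2.0 * np.pi, 200)
f = np.sin(x)

# Knots: points where the piecewise-linear approximation is allowed to change slope
knots = np.linspace(0.0, 2.0 * np.pi, 10)

# A function of the form sum_i w_i * relu(x - k_i) + c is piecewise linear.
# Least squares is used here only to pick sensible weights for the demo.
basis = np.column_stack([relu(x - k) for k in knots] + [np.ones_like(x)])
coeffs, *_ = np.linalg.lstsq(basis, f, rcond=None)

approx = basis @ coeffs
print("max abs error:", np.abs(approx - f).max())  # shrinks as more knots are added
```

Each ReLU switches on at its knot and bends the slope from that point onward, so the sum is exactly the kind of piecewise-linear curve shown in the animation.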
Moving on… Let us discuss the main topic of this article, the Universal Approximation Theorem (sounds powerful, doesn't it?), which is a way of understanding the representation power of a neural network.
Universal Approximation Theorem
The Universal Approximation Theorem, in its loose form, states that a feed-forward network with a single hidden layer containing a finite number of neurons can approximate any continuous function.
Whoa! That's quite a statement to make. So let's try to get a better understanding here. But first, notice the words 'approximate' and 'continuous function' in the theorem.
By approximate, we mean that by using enough hidden neurons, we can always find a neural network whose output g(x) satisfies |g(x)−f(x)|<ϵ for every input x, where f is the function we want to approximate and ϵ is the desired accuracy. In other words, the approximation will be good to within the desired accuracy for every possible input.
And if a function is discontinuous, i.e., makes sudden, sharp jumps, then it won't, in general, be possible to approximate it using a neural net. Keeping this in mind, let us build an intuitive proof of the theorem.
One-dimensional case
Let us start with the case where we have only one input feature to the network. Imagine that the function (curve) to be approximated is built up from 'n' rectangles. We call each of these rectangles a tower function. Our task is then reduced to making these tower functions. But how do we make one?
We see that a sigmoid neuron with a high value of 'w' behaves like a thresholding (step) function. Subtracting two such sigmoid neurons, shifted along the axis relative to each other, results in a tower function.
We can then scale and shift this tower as required.
By producing many such towers and adding them all up, we can approximate any function to an arbitrary degree of accuracy, using the right set of parameters, which can be learned from the training data.
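Here is a minimal NumPy sketch of this construction (the target function, the number of towers, and the steepness w = 100 are arbitrary illustrative choices; a real network would learn equivalent weights and biases from data):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tower(x, left, right, height, w=100.0):
    # One "rectangle": a steep up-step at `left` minus a steep down-step at `right`,
    # scaled to the desired height. Two sigmoid neurons per tower.
    return height * (sigmoid(w * (x - left)) - sigmoid(w * (x - right)))

def target(x):
    return np.sin(x) + 0.5 * x   # any continuous function would do

x = np.linspace(0.0, 2.0 * np.pi, 1000)
edges = np.linspace(0.0, 2.0 * np.pi, 41)        # 40 towers -> 80 sigmoid neurons

approx = sum(
    tower(x, lo, hi, target((lo + hi) / 2))      # tower height = value at the bin midpoint
    for lo, hi in zip(edges[:-1], edges[1:])
)
print("max abs error:", np.abs(approx - target(x)).max())  # improves with more towers
```

Doubling the number of towers roughly halves the error, at the cost of twice as many hidden neurons.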
Note that the network doesn't have to build such tower functions in reality; they are just for illustrative purposes. Usually, we don't allow networks to have such high weights; instead, we tend to keep them small through a process called regularization (weight decay), which helps our network generalize better and avoid overfitting.
Moving ahead, note that forming a single tower function required 2 sigmoid neurons, so forming n towers needs 2n neurons. By using more and more neurons, more such towers are created and we can approximate the function better. (Alert! Using too many neurons may lead to what is known as overfitting.)
Higher-dimensional towers
For high-dimensional features (more than one input feature), we combine all the input features in a vector X. Similarly, we collect all weights in a matrix W and biases in a vector b. The above proof can then be generalized along similar lines to higher dimensions by using the vectorized notation instead of the scalar version.
To get a better understanding, here we look at the case where we have two inputs. We can break down the process of producing a tower in the two-dimensional case in the following manner.
For simplicity, we assume some weights are 0, indicating that the corresponding feature makes no contribution to the computation carried out by that hidden neuron.
We see that in the two-dimensional case, adding neurons shifted along a particular dimension doesn't help much on its own, since the tower thus formed is open along the other dimension. So we create another half-open tower along the other dimension and add the two half-open towers, resulting in a structure similar to a tower.
But still, as you can see, the resulting structure has some plateaus of height 1 around the central region (which, being the sum of two half-open towers, has height 2). We remove them by preserving the areas where the value is above 1 and flattening the areas where the value is 1 or less. In this way, only the central area remains, and we obtain our tower. We perform this thresholding by again passing the structure through a sigmoid with a high 'w'.
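A minimal NumPy sketch of this two-input construction (the strip boundaries, the steepness w = 50, and the final threshold of 1.5 are illustrative choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = 50.0
x1, x2 = np.meshgrid(np.linspace(-2, 2, 200), np.linspace(-2, 2, 200))

# Half-open tower along x1: roughly 1 inside the strip -0.5 < x1 < 0.5, open along x2
strip_x1 = sigmoid(w * (x1 + 0.5)) - sigmoid(w * (x1 - 0.5))
# Half-open tower along x2
strip_x2 = sigmoid(w * (x2 + 0.5)) - sigmoid(w * (x2 - 0.5))

# Their sum is about 2 on the central square, 1 on the surrounding ridges, 0 elsewhere
ridges = strip_x1 + strip_x2

# Threshold between 1 and 2 with another steep sigmoid to keep only the centre
tower_2d = sigmoid(w * (ridges - 1.5))
```

The final steep sigmoid plays exactly the role described above: everything at height 1 or below is squashed towards 0, while the central region above 1 is pushed towards 1, leaving a clean two-dimensional tower.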
Conclusion
We can see how our universal function, the neural network, can approximate functions quite well. The approximation power of a neural network comes from the activation functions in its hidden layers; hence a single hidden layer (with a nonlinear activation) is sufficient.
Although the theorem speaks of a single hidden layer, in practice more than one hidden layer is used (mostly because a single hidden layer would need an enormous number of neurons, which makes training harder and overfitting more likely; deeper, narrower networks are far more efficient in practice).
Seeing how neural networks can perform such complex tasks, it might be a little underwhelming to find that the mathematics behind them is not all that complex.
The reason is that NNs mimic nature, and nature tends to produce the most effective and least costly systems!
But even then, a neural network is only as good as the quality of the data we give it. In recent years, various NN architectures have been proposed that further enhance their abilities on specific tasks (CNNs for computer vision; LSTMs and Transformers for natural language processing). Research continues both on neural networks themselves and in neuroscience, to understand the functioning of the brain, which makes this one of the most fascinating fields today.
That's it for this article! Congrats on making it to the end. Hope you liked it!
A quick note: a formal proof is beyond the scope of this article. This article is written as an ELI5, and thus various simplifications, analogies, and assumptions were used.
The animations were created using Manim. You can find the code here: https://github.com/ohumkar/Manim
For a more math-heavy treatment, you can refer to this great article: https://mcneela.github.io/machine_learning/2017/03/21/Universal-Approximation-Theorem.html
Resources:
- Neural Networks & Deep Learning, a great book by Michael Nielsen: http://neuralnetworksanddeeplearning.com/
- Awesome videos from IITM by Prof. Mitesh Khapra: https://www.youtube.com/watch?v=aPfkYu_qiF4&list=PLyqSpQzTE6M9gCgajvQbc68Hk_JKGBAYT