# A Little Bit of Activation Function Metamorphosis — ReLU network approximation with Sigmoid network

## A taste of Deep Learning Theory.

So it is February 2020 and if there is something I smell in the air these days, it is a mixture of *rain*, *snow*.. and a bit of *Deep Learning* (DL). I’ve been passively observing some of the academic and industry trends of AI and Machine Learning (ML) for a bit, and if there is one thing that gets people of all kinds worked up with excitement, it would be **Deep Learning**. Now I am not really a practitioner of DL these days, but I do find the mathematics and theory behind it to be quite fun. Fun, you might wonder? Yes, **fun**. Maybe I do not make sense, but I guess that has never been the objective function I use to model my life.. (lame ML joke, idc). But if I were a practitioner right now, this meme might be very representative of my life.

Nonetheless, Deep Learning has proven time and time again to be a fantastic tool for tackling a variety of challenging real-world problems. The downside, however, is that the complexity of Deep Learning has made it challenging to develop theoretical foundations that can explain *why* it has been so successful, or that can tell us the best way to, say, architect a neural network for a given problem.

Of course, there are some ideas as to why people claim it works so well, like the fact it can be shown to be a Universal Approximator. Now there are certainly plenty of other function spaces that are universal approximators, yet most have not shown the practical success that Deep Learning models have. From a theoretical standpoint, this indicates that there is more to learn that the brilliant researchers working on these problems will hopefully discover at some point.

Now I have often found that plenty of people who make use of Deep Learning tend to be more on the applied side and do not work out any of the more rigorous results that are foundational to Deep Learning theory. For this reason, it seems appropriate to start off with a simple DL theory goal: **show that a shallow network comprised of ReLU activations can be approximated by a shallow network of sigmoid activations**. This is going to be technical but definitely a bit of fun!

# Useful Definitions

So it is always fun and games once we can start working on the analysis, but we need to show a little patience and define a few things. First, notice the following mathematical definitions for the ReLU and sigmoid functions:

*ReLU(x) = max(0, x)* and *σ(x) = 1 / (1 + e^{−x})*

Our ultimate goal here will be to construct an *ε*-approximation to a ReLU function using a collection of sigmoid functions and then use this construction to replace each ReLU in a shallow network with the sigmoid approximation. To see how well our approximation is doing, we need a metric with respect to which we can measure the approximation error. For our purposes, we will use something referred to as the *uniform metric*, defined for any two functions *f(x)* and *g(x)* as

*d(f, g) = sup_{x} ‖f(x) − g(x)‖_{2}*

where we are using the 2-norm inside the supremum. The exact domain for *x* that we choose is not very relevant because we can always rescale the inputs to *f* and *g* to bring the domain to be effectively between *[-1,1]*, so we will just stick to *x* being within *[-1,1]* from the start. From here, time to start having some fun constructing an approximation with some math..
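To make these objects concrete, here is a small Python sketch of the two activations and a numerical stand-in for the uniform metric; the names and the grid resolution are my own choices, and the supremum is estimated by a maximum over a dense grid:

```python
import math

def relu(x):
    # ReLU(x) = max(0, x)
    return max(0.0, x)

def sigmoid(x):
    # sigma(x) = 1 / (1 + exp(-x)); the exponent is clamped to avoid overflow
    return 1.0 / (1.0 + math.exp(-max(-60.0, min(60.0, x))))

def uniform_metric(f, g, lo=-1.0, hi=1.0, num=10_001):
    # numerical stand-in for sup_x |f(x) - g(x)|: take the max over a dense grid
    step = (hi - lo) / (num - 1)
    return max(abs(f(lo + i * step) - g(lo + i * step)) for i in range(num))
```

Since the functions we compare here are scalar-valued, the 2-norm inside the supremum reduces to an absolute value.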

**Provably approximating a ReLU with a few lumps**

So for anyone who has never plotted the ReLU function before: on the interval *[-1,1]*, it is identically zero for *x ≤ 0* and climbs linearly as *x* for *x ≥ 0*.

Looking at the geometry of this function, it actually looks pretty simple! So one question is, how can we effectively approximate this with a lot of sigmoid functions? When I first thought about this problem, I thought about the limit function of the sigmoid.. the step function *step(x)*, which jumps from *0* to *1* at *x = 0*.

If we first thought about approximating the ReLU with step functions, what would we do? In my mind, the clear idea would be to create a staircase of step functions starting at some positive value of *x*, with each step climbing by a fixed height.

The goal then would be to make each step have smaller and smaller width, making the steps approximate the linear region to any chosen accuracy. If we discretize the interval *[0,1]* into *n* points spaced by *δ = 1/(n-1)*, then within each subinterval the staircase holds its value while the ReLU climbs by *δ*, so this construction has a worst-case error, with respect to the uniform metric, of *δ*. Since the error is equal to *1/(n-1)*, we can see that for some fixed error tolerance *ε*, we will need *O(1/ε)* step functions to achieve the desired tolerance.
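As a sanity check, we can build this staircase numerically. The indexing below — a step of height *δ* placed at each grid point *jδ* — is my reading of the construction:

```python
def relu(x):
    return max(0.0, x)

def staircase(x, n):
    # n grid points on [0, 1], spacing delta = 1/(n - 1); a unit step of
    # height delta sits at each grid point j*delta, j = 1..n-1
    # (the exact indexing here is my reading of the construction)
    delta = 1.0 / (n - 1)
    return delta * sum(1 for j in range(1, n) if x >= j * delta)

n = 101
delta = 1.0 / (n - 1)
xs = [-1.0 + i / 2_000 for i in range(4_001)]
err = max(abs(relu(x) - staircase(x, n)) for x in xs)
# the worst-case gap approaches, but does not exceed, delta
assert 0.9 * delta <= err <= delta + 1e-12
```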

Simple enough for a step function, so what about a sigmoid? Well as mentioned before, a sigmoid can be brought arbitrarily close to a step function in a limit sense, so this must make analysis of the sigmoid simple, right? Well…

Yeah, it is not quite that simple. Let us follow a similar construction to the one using step functions, but replace each unit step at *jδ* with a sigmoid of a particular form, *σ( C(x − jδ) )*. The approximation for the ReLU then becomes

*g(x) = δ · Σ_{j=1}^{n−1} σ( C(x − jδ) )*

where *C = (k/δ)* for any *k > 0*.

The interesting thing about this construction is that due to the smooth growth and decay of an individual sigmoid, the approximation delays getting some of its “mass” piled on as we sweep from left to right, making the above construction fall below the ReLU on the interval *[x\*, ∞)* for some *x\* > 0*.
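We can watch both effects numerically. In the sketch below, the explicit form *g(x) = δ · Σ σ(C(x − jδ))* is my reading of the construction; the check confirms that the sup error stays within the step-construction error *δ*, and that the curve undershoots the ReLU once we are away from the kink:

```python
import math

def relu(x):
    return max(0.0, x)

def sigmoid(x):
    # clamp the exponent to avoid overflow for very negative inputs
    return 1.0 / (1.0 + math.exp(-max(-60.0, min(60.0, x))))

def sigmoid_approx(x, n, k):
    # g(x) = delta * sum_{j=1}^{n-1} sigmoid(C * (x - j*delta)), C = k/delta
    # (this explicit form is my reading of the construction in the text)
    delta = 1.0 / (n - 1)
    C = k / delta
    return delta * sum(sigmoid(C * (x - j * delta)) for j in range(1, n))

n, k = 101, 4.0
delta = 1.0 / (n - 1)
xs = [-1.0 + i / 2_000 for i in range(4_001)]
err = max(abs(relu(x) - sigmoid_approx(x, n, k)) for x in xs)

# empirically the sup error stays within the step-construction error delta...
assert err <= delta
# ...and the curve sits strictly below the ReLU on the linear region,
# away from the kink: the undershoot described above
assert all(sigmoid_approx(x, n, k) < relu(x) for x in xs if x >= 0.5)
```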

This is an interesting result, and because both the ReLU and the sigmoid approximation are monotonically nondecreasing, it lets us bound the uniform error by checking the gap between the two functions at the grid points, leaving us with a sum of sigmoid terms to control:

Monotonicity for the win, amiright? Sadly, the resulting expression is not particularly nice, especially if we would like to invert it to find some value of *n* that achieves a desired error tolerance *ε*. So what can we do? Well, one approach is to reach into our bag of tricks and find a lower bound for that sum using an *integral*! Using the standard lower-bound inequalities relating sums to integrals, we find that
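For reference, the standard sum–integral comparison being invoked here can be written as follows (which of the two directions the original derivation uses depends on the monotonicity of the summand, which is not reproduced here):

```latex
% Each term f(j) dominates the integral of f over an adjacent unit
% interval, giving a lower bound on the sum.
f \text{ nonincreasing: } \quad \sum_{j=a}^{b} f(j) \;\ge\; \int_{a}^{b+1} f(t)\,dt
\qquad\qquad
f \text{ nondecreasing: } \quad \sum_{j=a}^{b} f(j) \;\ge\; \int_{a-1}^{b} f(t)\,dt
```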

If we toss this inequality into our earlier result for the uniform metric error, we can recall that *C = (k/δ)* and simplify to obtain

*d(ReLU, g) ≤ δ ( 1 + 1/(k e^{k}) )*

To summarize, we found a convenient upper bound on our approximation error! Really cool. Now we have a result that is simple enough that we can find a value for *n* that can ensure our construction achieves a desired error *ε*.

Now this approximation error bound is actually really interesting at first glance. In particular, as *k* gets large, our sigmoid approximation error quickly approaches the error expected from the step-function approximation. In other words, the step-function construction had an upper bound on the error of *δ*, and the result above tells us that our sigmoid approximation error, as *k* approaches infinity, becomes exactly the step-function error.. as expected!

Now to wrap this up, given some error tolerance *ε* that we wish to achieve, we need *O( (1 + 1/(k exp(k))) (1/ε) ) = O(1/ε)* sigmoids to approximate a ReLU under the uniform metric. This is readily found by setting the upper bound equal to *ε* and solving for *n*. How fun, amiright?
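Putting the count to work: a small sketch that picks *n* for a requested tolerance and verifies it numerically. The exact rounding in `sigmoids_needed` is my own choice; only the *O(1/ε)* scaling comes from the argument above:

```python
import math

def relu(x):
    return max(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-max(-60.0, min(60.0, x))))

def sigmoid_approx(x, n, k):
    # the construction from earlier: delta * sum_j sigmoid((k/delta)(x - j*delta))
    delta = 1.0 / (n - 1)
    C = k / delta
    return delta * sum(sigmoid(C * (x - j * delta)) for j in range(1, n))

def sigmoids_needed(eps, k):
    # invert delta = 1/(n - 1) against the tolerance; the exact rounding is
    # my choice -- only the O((1 + 1/(k e^k)) / eps) scaling is from the text
    return math.ceil((1.0 + 1.0 / (k * math.exp(k))) / eps) + 1

eps, k = 0.05, 4.0
n = sigmoids_needed(eps, k)
xs = [-1.0 + i / 2_000 for i in range(4_001)]
err = max(abs(relu(x) - sigmoid_approx(x, n, k)) for x in xs)
assert err <= eps  # the construction meets the requested tolerance
```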

# Approximating neural networks.. say what?!

So yeah, we have this cool approximation that allows us to replace a ReLU with a collection of sigmoids that satisfy a provable approximation error. So what can we actually do with this?

Well suppose we want to construct a shallow sigmoid neural network that approximates a shallow ReLU one.. to some precise accuracy.. and without re-running gradient descent or whatever hip optimizer our friends have told us about. Luckily for you, the above construction gives us just what we need! So for our sake, let us assume our inputs to the ReLU network are *n*-dimensional vectors on the unit ball, i.e., the 2-norm of these input vectors is equal to 1. Then, let us define our given ReLU shallow network as

*g_{1}(x) = Σ_{j=1}^{m} w_{j} · ReLU( ⟨a^{(j)}, x⟩ )*

where all the weights are fixed at this point. Our goal might be to choose some sigmoid approximation *g_{2}(x)* such that we have the precise error guarantee

*d(g_{1}, g_{2}) ≤ ε*

So the question then becomes, how might we accomplish this? Well, first recognize that the approximation we found earlier was with respect to the uniform metric, which assumes the inputs lie in the interval *[-1,1]*. Since we allow the *a^{(j)}* vectors to be arbitrary weights, the inputs to each ReLU are not actually contained within the interval *[-1,1]* but may have values larger than that. To correct this, notice that the ReLU has a convenient homogeneity property for *a > 0* of the form

*ReLU(a · x) = a · ReLU(x)*

With this, we have the following convenient, equivalent form for *g_{1}(x)*:

*g_{1}(x) = Σ_{j=1}^{m} w_{j} ‖a^{(j)}‖ · ReLU( ⟨a^{(j)} / ‖a^{(j)}‖, x⟩ )*

This works for our goals because the inner product within each ReLU is now going to compute a value within *[-1,1]*, since both vectors are normalized. Now, it is clear that our form for *g_{2}(x)* should follow a similar form. This implies that the following construction, with *f_{2}* denoting our sigmoid approximation to the ReLU, should be just what we need:

*g_{2}(x) = Σ_{j=1}^{m} w_{j} ‖a^{(j)}‖ · f_{2}( ⟨a^{(j)} / ‖a^{(j)}‖, x⟩ )*

Notice that this construction is equivalent to a shallow sigmoid network because if we expand the definition of *f_{2}(x)*, it will just turn *g_{2}(x)* into a big sum of sigmoids with the appropriate weights. From here, all we need to do now is prove that this construction gives us the desired error that we want! The key step is the triangle inequality: the gap *|g_{1}(x) − g_{2}(x)|* is at most *Σ_{j} |w_{j}| ‖a^{(j)}‖* times the worst-case gap between a single ReLU and *f_{2}*, so if we build *f_{2}* to accuracy *ε / ( Σ_{j} |w_{j}| ‖a^{(j)}‖ )*, the whole network achieves accuracy *ε*.
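The whole pipeline can be sketched end to end in Python. Everything below — the toy network sizes, the random weights, the error budgeting — is hypothetical scaffolding around the construction described above:

```python
import math
import random

random.seed(0)

def relu(x):
    return max(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-max(-60.0, min(60.0, x))))

def f2(x, n, k):
    # the one-dimensional sigmoid approximation to the ReLU from earlier
    delta = 1.0 / (n - 1)
    C = k / delta
    return delta * sum(sigmoid(C * (x - j * delta)) for j in range(1, n))

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def norm(v):
    return math.sqrt(dot(v, v))

# a toy shallow ReLU network with hypothetical fixed weights:
#   g1(x) = sum_j w_j * ReLU(<a_j, x>)
m, d = 5, 3
w = [random.gauss(0, 1) for _ in range(m)]
A = [[random.gauss(0, 1) for _ in range(d)] for _ in range(m)]

def g1(x):
    return sum(wj * relu(dot(aj, x)) for wj, aj in zip(w, A))

# pull ||a_j|| out of each ReLU via homogeneity so the inner products lie in
# [-1, 1], then swap each ReLU for f2; a per-unit error of eps0 inflates by
# sum_j |w_j|*||a_j||, so we budget eps0 = eps / sum_j |w_j|*||a_j||
eps, k = 0.05, 4.0
budget = sum(abs(wj) * norm(aj) for wj, aj in zip(w, A))
eps0 = eps / budget
n = math.ceil(1.0 / eps0) + 1  # makes delta = 1/(n-1) <= eps0

def g2(x):
    total = 0.0
    for wj, aj in zip(w, A):
        r = norm(aj)
        unit = [c / r for c in aj]
        total += wj * r * f2(dot(unit, x), n, k)
    return total

# compare the two networks on random inputs from the unit sphere
gap = 0.0
for _ in range(100):
    v = [random.gauss(0, 1) for _ in range(d)]
    r = norm(v)
    x = [c / r for c in v]
    gap = max(gap, abs(g1(x) - g2(x)))
assert gap <= eps
```

Note how the per-unit tolerance divides by *Σ_{j} |w_{j}| ‖a^{(j)}‖* — this is exactly the triangle-inequality step from the argument above.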

And with that, we complete our construction of a shallow sigmoid network that achieves a precise approximation to the input ReLU shallow network! No need to rent out a GPU cluster or wait a few hours for a sigmoid shallow network to be trained by gradient descent from scratch. No, we can efficiently construct the sigmoid network and achieve arbitrarily close precision to our ReLU network! All it took was a little bit of math. I’m sure after this, all of us can vibe with the meme below..

# Summary

So after a good amount of math and some inequality manipulation, what exactly did we *actually* do? Well for starters, we found one way to provably approximate ReLU activations using a combination of sigmoid activations! Using this construction, we said no to re-training a sigmoid network and instead showed how we can use the approximation to more generally approximate a ReLU shallow network with a sigmoid shallow network. If this is not supposed to be fun, *please* do not tell me..