Learnings from Google’s comprehensive research into activation functions

This is a field that is heating up. Keep an eye on it

Devansh
CodeX
8 min read · Mar 22, 2022


Deep Learning is a complex and ever-evolving field. As more resources are poured into research and development, new network structures, architectures, and model training protocols are created. These are extremely exciting since they unlock new capabilities and impact new domains (to learn about what happened in Feb 2022, check this video).

Some of the most common activation functions that are used in Deep Learning. People have found that nonlinear activation functions work better in practice

However, Machine Learning research goes far beyond this. There is tons of research into different data imputation policies, learning rates, and batch sizes (all of which I have covered on Medium) to maximize performance. One of the most important aspects of a neural network is its activation function. In Machine Learning, the activation function of a node defines the output of that node given an input or set of inputs, which makes it a critical part of your model pipeline. Naturally, there is a lot of research trying to discover good activation functions. In this article, we will be covering “Searching for Activation Functions”, a paper by researchers at Google that compares activation functions.
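To make the definition concrete, here is a minimal NumPy sketch (my own illustration, not code from the paper) of a single node applying an activation function to a weighted sum of its inputs:

```python
import numpy as np

def relu(z):
    # ReLU: max(0, z), the de-facto default activation discussed later in this article
    return np.maximum(0.0, z)

def node_output(inputs, weights, bias, activation=relu):
    # A single node: weighted sum of the inputs plus a bias, then a nonlinearity
    return activation(np.dot(weights, inputs) + bias)

x = np.array([0.5, -1.2, 3.0])
w = np.array([0.8, 0.1, -0.4])
print(node_output(x, w, bias=0.2))  # 0.0, since the pre-activation sum is negative
```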


I will go over the paper, some interesting learnings, and the activation function that the researchers recommend after comparing all the functions discovered. Let’s get right into it.

Are activation functions really that important?

This is without a doubt a question on your mind. And this is the correct question to have. After all, there are a lot of different moving parts in a machine learning pipeline. Choosing to research and optimize one of them diverts potential resources from another aspect. Keeping this opportunity cost in mind is extremely important when we make decisions. So let’s see if activation functions can truly impact performance. Take a look at the following quote from the paper:

On ImageNet, replacing ReLUs with Swish units improves top-1 classification accuracy by 0.9% on Mobile NASNet-A (Zoph et al., 2017) and 0.6% on Inception-ResNet-v2 (Szegedy et al., 2017). These accuracy gains are significant given that one year of architectural tuning and enlarging yielded 1.3% accuracy improvement going from Inception V3 (Szegedy et al., 2016) to Inception-ResNet-v2 (Szegedy et al., 2017).

In case you’re not convinced, reread that. Just swapping the activation function from ReLU (the standard these days) to Swish (which we’ll cover) improved performance almost as much as a full year of architectural tuning and enlarging of a giant model. Research into various activation functions and how they interact with different kinds of inputs is absolutely worth it.

How to find the best activation functions

Okay, now that we have established how important activation functions are, let’s talk about how the researchers found them. While activation functions have traditionally been designed by hand (to match certain criteria), this process can get expensive (and slow) very quickly. So instead, the team used a combination of exhaustive search (for smaller search spaces) and an RNN-based controller (for larger search spaces). This opens up a new problem, though: how do we define the search space for our function?

The full list of functions that their search algorithm looked over

Next comes the question of how we combine these functions to create our search blocks. This is not a trivial problem, because different functions produce very different outputs. For that, we can refer to the relevant part of the paper:

The practice of using “units” in Machine Learning seems to be picking up a lot of steam. Keep an eye out for it.

To recap, we build core units out of a combination of unary and binary functions. These core units are then composed with other units to form candidate activation functions, with an RNN-based controller predicting the components of each candidate.

We use the validation accuracy as a reward to ensure that the RNN predicts the most performant function
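To make the search procedure more tangible, here is a toy Python sketch of the idea (my own simplification, not the paper’s code): a tiny set of unary and binary building blocks, a core unit that combines them, and an exhaustive search scored by a stand-in for validation accuracy. The real building-block lists, the larger search spaces, and the RL-trained RNN controller are described in the paper.

```python
import itertools
import numpy as np

# Toy stand-ins for the paper's unary/binary building blocks
# (the full lists are in the paper; this tiny subset is only illustrative).
UNARY = {
    "identity": lambda x: x,
    "sigmoid": lambda x: 1.0 / (1.0 + np.exp(-x)),
    "tanh": np.tanh,
}
BINARY = {
    "add": lambda a, b: a + b,
    "mul": lambda a, b: a * b,
    "max": np.maximum,
}

def core_unit(x, u1, u2, b):
    # One core unit: binary(unary1(x), unary2(x))
    return BINARY[b](UNARY[u1](x), UNARY[u2](x))

def validation_accuracy(u1, u2, b):
    # In the paper, this is the accuracy of a small "child" network trained
    # with the candidate activation; a random proxy keeps this demo runnable.
    rng = np.random.default_rng(abs(hash((u1, u2, b))) % (2**32))
    return rng.random()

# Exhaustive search over the tiny space. The paper uses exhaustive search for
# small spaces and an RNN controller (rewarded with validation accuracy) for
# larger ones.
best = max(itertools.product(UNARY, UNARY, BINARY),
           key=lambda c: validation_accuracy(*c))
print("best candidate:", best)  # note: ('identity', 'sigmoid', 'mul') would be Swish-1
print(core_unit(np.linspace(-2.0, 2.0, 5), *best))
```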

That is the gist of the search process. It is quite complicated so feel free to take a break and reread it to make sure you really understand what happened. Just this setup is extremely impressive. For more details, I would recommend reading the paper. If you want my detailed breakdown of the paper, feel free to reach out to me for it.

What is Swish? Is Swish the best activation function?

So now that we have covered the search procedure, let’s go over the results.

Followers of my content know that I got hyped when I read the part I’ve highlighted. To learn about my conjecture on why we notice this trend, reach out about my annotated papers.

We can clearly see that several functions had compelling performances. The performance of periodic functions is pretty interesting and makes a compelling case for further exploration. When these activation functions were then compared against ReLU on multiple larger network structures, they fared fairly well.

RN: ResNet, WRN: Wide ResNet, DN: DenseNet

Swish is what the researchers called the function f(x) = x·σ(βx). The researchers chose Swish over the other “best” activation function because it generalized better.
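In code, that definition is a one-liner. Here is a minimal NumPy sketch (most frameworks also ship it built in; for example, PyTorch’s nn.SiLU corresponds to β = 1):

```python
import numpy as np

def swish(x, beta=1.0):
    # Swish as defined in the paper: f(x) = x * sigmoid(beta * x)
    return x * (1.0 / (1.0 + np.exp(-beta * x)))

x = np.linspace(-5.0, 5.0, 5)
print(swish(x))            # beta = 1, the fixed "Swish-1" variant
print(swish(x, beta=2.0))  # beta can also be treated as a trainable parameter
```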

Making decisions on incomplete info is very common in Machine Learning

Swish was able to combine generalization with high performance, at least on the computer vision tasks. However, to be truly useful as a ReLU replacement, Swish has to hold up across a variety of tasks and against strong baselines. But first, let’s understand Swish on a fundamental level.

Understanding Swish

Swish has some very interesting mathematical properties. Its simplicity and similarity to ReLU mean that it can be substituted in places where ReLU is being used. At certain values of β, it starts to mimic other activation functions (which might be part of why it performs so well). Take a look at the following for more details:

The fact that Swish can morph into other activation functions is pretty cool.
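A quick numerical check of that morphing behaviour (my own sketch): with β = 0 Swish collapses to the scaled linear function x/2, and as β grows it approaches ReLU.

```python
import numpy as np

def swish(x, beta):
    return x * (1.0 / (1.0 + np.exp(-beta * x)))

x = np.linspace(-4.0, 4.0, 9)

# beta = 0: sigmoid(0) = 0.5 everywhere, so Swish becomes the linear function x/2
print(np.allclose(swish(x, 0.0), x / 2.0))                         # True

# large beta: the sigmoid acts like a 0/1 gate, so Swish approaches ReLU
print(np.allclose(swish(x, 50.0), np.maximum(x, 0.0), atol=1e-3))  # True
```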

Some of you non-mathematical people might be wondering why we care about the first derivative of Swish. 1) Go learn the ML basics. 2) The first derivative tells you how quickly the output grows/changes relative to the input. A large derivative means that small changes in input produce large changes in output (and vice-versa). Thus it is important to study the derivative of an activation function (some learning rate schedules also use derivatives extensively). Analysis of the first derivative of Swish shows the following:

The last sentence is very interesting and deserves investigation on its own.
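If you want to verify the derivative yourself, the product and chain rules give f'(x) = σ(βx) + βx·σ(βx)(1 − σ(βx)), which can be rewritten as βf(x) + σ(βx)(1 − βf(x)). A quick finite-difference sanity check (my own sketch):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def swish(x, beta=1.0):
    return x * sigmoid(beta * x)

def swish_grad(x, beta=1.0):
    # f'(x) = beta*f(x) + sigmoid(beta*x) * (1 - beta*f(x))
    s = sigmoid(beta * x)
    f = x * s
    return beta * f + s * (1.0 - beta * f)

# Compare the closed form against a central finite difference
x = np.linspace(-5.0, 5.0, 11)
eps = 1e-5
numeric = (swish(x + eps) - swish(x - eps)) / (2.0 * eps)
print(np.allclose(swish_grad(x), numeric, atol=1e-6))  # True
```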

There are some more things I can cover but they would make the article too long. Let me know if you want me to cover any specific things that I’m leaving out. Now let’s get into the results of using Swish.

Results of Swish on Benchmarks

Section 5 of the paper starts with the following quote:

We benchmark Swish against ReLU and a number of recently proposed activation functions on challenging datasets and find that Swish matches or exceeds the baselines on nearly all tasks.

This is literally the first sentence. This is obviously very exciting, but let’s get into the details to see if the improvements were substantial. The high-level overview is certainly promising, with strong performance across tasks.

This performance is impressive. This might just become the new standard activation function.

ImageNet is considered to be one of the most important image classification benchmarks in Machine Learning, which is why most Computer Vision papers refer to it in their evaluations and testing. When compared to ReLU on ImageNet, Swish thoroughly outperforms it.

Swish is competitive with the other more specialized functions. This is very impressive.

We see similar performance on the CIFAR-10 and CIFAR-100 datasets. Therefore, for computer vision, Swish seems to be clearly ahead of ReLU (and competitive with other, more refined variants). What about Natural Language Processing (NLP) tasks? For that, the researchers use the standard WMT English→German translation dataset. The results here speak for themselves:

Impressive results once again. Swish seems to be going strong.
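If you want to try this in your own models, the swap is essentially a one-liner. Here is a minimal PyTorch sketch (my own example, not the paper’s exact Inception/ResNet/Transformer setups); the only difference between the two networks is the activation, and PyTorch ships Swish with β = 1 as nn.SiLU:

```python
import torch.nn as nn

def make_cnn(activation: nn.Module) -> nn.Sequential:
    # A small CNN for 32x32 RGB inputs (e.g. CIFAR-10); only the activation varies.
    return nn.Sequential(
        nn.Conv2d(3, 32, kernel_size=3, padding=1), activation,
        nn.Conv2d(32, 64, kernel_size=3, padding=1), activation,
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(64, 10),
    )

relu_model = make_cnn(nn.ReLU())
swish_model = make_cnn(nn.SiLU())  # identical architecture, Swish instead of ReLU
```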

Closing

It’s hard to argue against the results that the authors of the paper have presented so far. Swish has notably better performance on multiple standard tasks. Its simplicity and ability to morph into other activation functions are definitely compelling and warrant further exploration.

However, what stands out to me most are the possible extensions this paper provides. Evolutionary Algorithms have already proven very useful in search problems, while the paper makes a compelling argument to explore periodic functions as bases for activation functions. And fine-tuning Swish might have great results too. Therefore, more than just Swish, this paper contributes a lot to the Machine Learning research discourse.

To truly get good at Machine Learning, a base in Software Engineering is crucial. It will help you conceptualize, build, and optimize your ML systems. My daily newsletter, Coding Interviews Made Simple, covers topics in Algorithm Design, Math, Recent Events in Tech, Software Engineering, and much more to make you a better developer. I am currently running a 20% discount for a WHOLE YEAR, so make sure to check it out.

Think of the ROI that this student made by subscribing.

I created Coding Interviews Made Simple using new techniques discovered through tutoring multiple people into top tech firms. The newsletter is designed to help you succeed, saving you from hours wasted on the Leetcode grind.

To help me write better articles and understand you, fill out this survey (anonymous). It will take 3 minutes at most and allow me to improve the quality of my work.

Feel free to reach out if you have any interesting jobs/projects/ideas for me as well. Always happy to hear you out.

For monetary support of my work, the following are my Venmo and PayPal. Any amount is appreciated and helps a lot. Donations unlock exclusive content such as paper analysis, special code, consultations, and specific coaching:

Venmo: https://account.venmo.com/u/FNU-Devansh

Paypal: paypal.me/ISeeThings

Reach out to me

Use the links below to check out my other content, learn more about tutoring, or just to say hi. Also, check out the free Robinhood referral link. We both get a free stock (you don’t have to put in any money), and there is no risk to you, so not using it is just losing free money.

Check out my other articles on Medium: https://rb.gy/zn1aiu

My YouTube: https://rb.gy/88iwdd

Reach out to me on LinkedIn. Let’s connect: https://rb.gy/m5ok2y

My Instagram: https://rb.gy/gmvuy9

My Twitter: https://twitter.com/Machine01776819

If you’re preparing for coding/technical interviews: https://codinginterviewsmadesimple.substack.com/

Get a free stock on Robinhood: https://join.robinhood.com/fnud75

