Understanding Activation Functions | Data Science for the Rest of Us
Can I be honest for a second? Neural Networks (NNs) are scary. There — I said it. Don’t get me wrong, I’ve been researching NNs since the beginning of my freshman year at business school. But just because I’ve stared at pages and pages of Data Science jargon doesn’t mean I always understand it. So let’s learn together.
Neural Network Basics
Before we dive in, we should probably get an idea of where activation functions fit into the rat’s nest — uh… neural network. At their most basic, NNs are just giant math equations. You’ve probably seen a diagram of a basic network before, but here’s another one just in case.
Those yellow circles are called the ‘input layer.’ It’s just another way of talking about the data you feed your network, hence ‘input.’ The green circles are the ‘hidden layer.’ They’re where the magic happens (and also where most of the activation functions will show up). The orange circle is the ‘output layer,’ in this case, just one node. In these sorts of networks, the output node is usually responsible for spitting out a probability (a decimal between 0 and 1).
To sum it all up, you feed a neural network data, the network chews on that data and transforms it, and then spits out a nice, simple output for you. Obviously, that’s quite the simplification, but it should work for our purposes.
Intuition of Activation Functions
So where do the activation functions come in? Well, activation functions are one of the core components of each of those circles (which the people who like precise terminology will call artificial neurons). But not all neurons are the same. Input neurons, for example, don’t have activation functions — their only job is to remember one itty-bitty data point and spit it out whenever it’s asked (not unlike a kid in a kindergarten play).
The green and orange circles (neurons), though? Those guys got the harder jobs. While the input neurons are sitting around lazily spouting the one number they’re there to remember, the hidden layer and output layer are working hard crunching numbers. See, in a basic NN, those layers are getting fed a bunch of data, and it’s their job to figure out what to do with it all, in time to deliver it to the next guy down the line, sort of like a poor accounting intern sitting in a cubicle somewhere.
So how do these sad neurons decide what to do with all this data? Well, that’s where the activation function comes in (finally…). The activation function tells each neuron how to transform the data it’s been fed before giving it to the next neuron in line. This allows them to make models that are more complex than just a straight line (which you would get in something like linear regression). Activation functions are what make neural networks special — well, not the only thing, but one of the big ones.
What those functions do can have a big impact on how your neural network performs and what tasks it’s suited to complete for you. For the sake of our learning, let’s take Michael Nielsen’s example problem from his free book, Neural Networks and Deep Learning. He simplifies the idea of a neural network into a machine that helps you make a decision, in his case whether you should go to a cheese festival this weekend. We’ll do something a little less… odd — let’s say you want a neural network that helps you decide if you should go catch a new movie that’s coming out.
Author’s Note: The following section’s example very closely resembles Michael Nielsen’s. The intention behind this is to use his stellar explanation, but simplify it, reducing much of the mathematical notation while maintaining the intuition. I’m not trying to steal his work by any means, so I’d encourage you to check out his explanation here if you want a more thorough, mathematical approach.
Perceptrons and Step Functions
In order to help the neural network model your decision-making, you might decide on a few factors (or variables) that will impact your decision. Let’s say you identify these as your top 3 factors that go through your head when deciding if you’ll see a movie:
- Is the movie a Marvel movie?
- Is the movie showing at your favorite theater?
- Is anyone else going to go with you?
Now let’s say that you’re a huge Marvel fan, and you’ll go see the latest release no matter what. It doesn’t matter to you where the movie is showing or if your friends/S.O. will join you — you’re going. But what if it’s not a Marvel movie? Maybe then you’ll only watch it if your friends go and the movie is showing at your favorite theater.
How could we use a Neural Network to model this decision making process? Well, each of your decision factors will serve as inputs to the decision. Now we need a neuron that can take that information, apply your style of thinking to it, and output the best decision. How do we do this? With weights.
Not all information is equally important, in model building and in life. So how are we going to represent the significant importance of Marvel movies to you, and the comparatively low importance of the other two factors? Well, we only want the neuron to fire when variable 1 is true (regardless of the status of (2) and (3)) or when (2) and (3) are true. So, let’s set a cutoff value for our neuron. If the weighted total of the inputs is less than or equal to that cutoff (or threshold), the neuron won’t fire; if it’s greater, it will.
Imagine the cutoff value is 3. If we make the weight of Marvel movies 4 or more, the neuron will always fire when the movie is from Marvel. If we make the weights of the theater and your friends both 2, then the neuron won’t fire when only one is true (because 2 is less than 3), but it will fire if both of them are true.
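If you’d rather see that logic in code, here’s a minimal Python sketch of the movie decision (the function name and the 1/0 yes/no encoding are my own illustration; the weights and cutoff come straight from the example above):

```python
# A tiny perceptron for the movie decision above.
# Inputs are 1 (yes) or 0 (no); weights and threshold come from the example.
def movie_perceptron(is_marvel, fav_theater, has_company):
    weights = [4, 2, 2]      # w1 (Marvel), w2 (theater), w3 (company)
    threshold = 3
    inputs = [is_marvel, fav_theater, has_company]
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total > threshold else 0   # 1 = go, 0 = stay home

print(movie_perceptron(1, 0, 0))  # 1: Marvel alone clears the cutoff (4 > 3)
print(movie_perceptron(0, 1, 0))  # 0: theater alone doesn't (2 <= 3)
print(movie_perceptron(0, 1, 1))  # 1: theater + company does (2 + 2 > 3)
```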
That right there is the basic intuition behind perceptrons, one of the older types of artificial neurons out there. But what activation function does a perceptron use? We call it a step function. In order to see why, let’s translate the intuition into math. Don’t worry! You just need to understand basic addition and multiplication for this.
First, let’s call all our inputs x. We know which input is which with a subscript:
x₁ = Is the movie a Marvel movie? (1=yes, 0=no)
x₂ = Is the movie showing at your favorite theater? (1=yes, 0=no)
x₃ = Is anyone else going to go with you? (1=yes, 0=no)
And now, let’s define our weights. A weight applies to the x variable that shares its subscript.
w₁ = 4
w₂ = 2
w₃ = 2
Now let’s define our activation function, or the equation that will decide whether our neuron fires or not. I’ve avoided sigma notation because I find it annoying and confusing for intuition unless you’re already familiar with it:

output = 0 if w₁x₁ + w₂x₂ + w₃x₃ ≤ threshold
output = 1 if w₁x₁ + w₂x₂ + w₃x₃ > threshold

We can simplify the multiplication and addition with dot notation:

output = 0 if w · x ≤ threshold
output = 1 if w · x > threshold

The dot just tells you to multiply all the weights by their corresponding inputs and then add all of them up. If this confuses you, don’t worry about it. It’s saying exactly the same thing as the first pair of inequalities.
We’re not quite done yet. What if we don’t know what threshold we should use? Let’s redefine our equations (inequalities, technically) so the threshold is on the left side. We just need to subtract our threshold from both sides.
We’ll define a new term, called the bias term (b), and set it equal to -threshold. If the threshold is a number representing how hard it is for our neuron to ‘fire,’ then the bias term is just the opposite, a number representing how easy it is for our neuron to fire. A little algebra magic and we get this new function:

output = 0 if w · x + b ≤ 0
output = 1 if w · x + b > 0

That function right there is one of the earliest activation functions used in a neural network. It’s just a special step function that only outputs a 1 or a 0. Graphically, it’s a flat line at 0 that jumps straight up to 1 the moment w · x + b crosses zero.
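The bias-term version of the step function can be sketched in Python like this (the function names are my own; the math is just the perceptron rule with b = -threshold):

```python
# Step-function activation in its bias form: fire when w.x + b > 0.
def step_activation(weights, inputs, bias):
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if z > 0 else 0

# The movie example again, with b = -3 standing in for threshold = 3:
print(step_activation([4, 2, 2], [1, 0, 0], -3))  # 1
print(step_activation([4, 2, 2], [0, 1, 0], -3))  # 0
```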
Perceptrons have a problem though — they don’t learn very well. See, Neural Networks start stupid and get better as they train. We won’t get into the weeds of how that happens, but the important thing to know is that we make small changes in a weight (or bias) and look for small changes in the output, to see if our network got better or worse. We do that over and over again until our network is (hopefully) much better than it was at first.
But here’s the thing, when your only output is 1 or 0, your small change in a weight won’t make a small change in the output. It will either make no change at all, or completely flip your prediction from a 0 to a 1 or vice versa. That makes training our network very difficult, because it isn’t obvious how those small changes will affect the outcome of the network.
The Sigmoid/Logistic Function
But what if we could make those perceptrons less jumpy? What if we could smooth out their activation function, turning that hard step into a gentle S-shaped curve?

That’s the idea behind the sigmoid (or logistic) activation function. Instead of making your neurons jump abruptly between 0 and 1, the sigmoid function smooths everything out, so your neuron can spit out any decimal between zero and one.
Sigmoid activation functions make life so much easier for classification models. Now, a small change in a weight will make a small change in our prediction, making it much easier for our hard-working network to learn. This does make life a little more interesting for us humans, though, because now the network will output a whole range of numbers between 0 and 1. That means we’ll have to pick a cutoff value and classify everything above it as 1 and everything below it as zero (You might think you’d always just use 0.5 as your cutoff, but that’s not always the best option — but we’ll save that for another article).
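For concreteness, here’s a small Python sketch of the sigmoid (this is the standard formula 1 / (1 + e^(−z)); the variable names are mine):

```python
import math

# Sigmoid squashes any real number into the open interval (0, 1).
def sigmoid(z):
    return 1 / (1 + math.exp(-z))

print(sigmoid(0))    # 0.5: right on the fence
print(sigmoid(4))    # ~0.982: a confident 'yes'
print(sigmoid(-4))   # ~0.018: a confident 'no'

# Unlike the step function, a small nudge in z makes a small change in output:
print(sigmoid(0.1) - sigmoid(0.0))  # ~0.025, not a jump from 0 to 1
```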
Tanh, or the hyperbolic tangent function, is a close cousin of the sigmoid activation function. They’re mathematically related, but instead of compressing data between 0 and 1 like the sigmoid, tanh compresses its inputs between -1 and 1. This gives it some nice mathematical properties, like a mean of zero, which helps center the data and makes training future layers a bit easier. It can also be helpful if your data tends to fall into groups of highly negative, near-zero, and highly positive values. Otherwise, it’s the same intuition as sigmoid.
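Python’s standard library already has tanh, so a sketch is short (the identity in the last line is the standard relationship between the two functions, not something specific to this article):

```python
import math

# tanh squashes inputs into (-1, 1) and is centered on zero.
for z in (-2.0, 0.0, 2.0):
    print(z, math.tanh(z))   # roughly -0.964, 0.0, 0.964

# tanh is just a rescaled sigmoid: tanh(z) = 2 * sigmoid(2z) - 1
sigmoid = lambda t: 1 / (1 + math.exp(-t))
print(abs(math.tanh(1.3) - (2 * sigmoid(2 * 1.3) - 1)) < 1e-12)  # True
```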
ReLU, or the Rectified Linear Unit, may remind you a little bit of the step function we saw earlier. Both are piecewise functions, but in this case, inputs greater than 0 pass through unchanged. If they’re less than or equal to 0, though, we just output 0. Why would we do this?
Well, ReLU has some very nice mathematical properties for training neural networks. Multiple layers of ReLU neurons can approximate most any nonlinear function (very convenient for deep learning). They’re very computationally efficient when compared to functions like sigmoid or tanh. And, because all negative values output zero, only roughly 50% of the neurons get activated when you randomly initialize your model (that’s sparse activation if you want to look it up).
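ReLU itself is nearly a one-liner; here’s a quick Python sketch:

```python
# ReLU: pass positive inputs through unchanged, zero out everything else.
def relu(z):
    return max(0.0, z)

print([relu(z) for z in (-2.0, -0.5, 0.0, 0.5, 2.0)])
# [0.0, 0.0, 0.0, 0.5, 2.0]
```

Notice how cheap this is compared to sigmoid or tanh: no exponentials, just a single comparison.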
Overall, that makes ReLU pretty darn attractive for deep learning in particular. But ReLU has some very significant problems too. The Achilles heel of ReLU is called the Dying ReLU problem. As your model trains, the weights feeding some neurons can shift so that their weighted input is almost always negative. Since ReLU outputs 0 for anything negative, those neurons stop firing (and, because their gradient is 0 too, they stop learning). That means, for all intents and purposes, the neuron is dead — it isn’t doing anything useful for the model but eating processing power.

Now imagine a network with many of these dead neurons. Eventually, the model won’t be able to train any longer because most of its neurons are dead. And that’s the Dying ReLU problem. There are also some other inconveniences related to calculus and the fact that ReLU doesn’t have a mean of 0, but those problems are beyond our scope right now.
Key takeaway: ReLU can be great for deep networks, but may reach a point where it can’t train anymore because too many neurons have ‘died’ off.
Leaky ReLU is an attempt to solve the Dying ReLU problem. Instead of setting all negative values to 0, we multiply negative inputs by some small decimal (often something like 0.01), making them much smaller than they otherwise would be. This creates a sort of ‘leaky’ tail on the ReLU function that lets us retain many of ReLU’s benefits while keeping more information and, hopefully, mitigating dying ReLU. Beyond that, there appears to be some evidence that leaky ReLU can even speed up training a little more.
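And the ‘leak’ is just one extra multiplication. A sketch, using the common (but by no means mandatory) choice of 0.01 for the negative slope:

```python
# Leaky ReLU: negative inputs are scaled down instead of zeroed out,
# so a 'dead' neuron can still pass a little signal (and gradient).
def leaky_relu(z, alpha=0.01):
    return z if z > 0 else alpha * z

print([leaky_relu(z) for z in (-2.0, -0.5, 0.0, 0.5, 2.0)])
# [-0.02, -0.005, 0.0, 0.5, 2.0]
```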
Where to Go From Here
That concludes our survey of some of the most common activation functions in Neural Network design. You should know, though, that there are many more out there — some related to the ones above (attempts to shore up their weaknesses and play to their strengths) and some altogether different. There are plenty of more technical tutorials on topics like the mathematical properties behind these functions and how to implement them in code. My hope is that this article gave you a little more confidence to approach that literature.