Network sparsity and rectifiers

Where we learn why, despite being amazing on their own, linear and sigmoid neurons are not the self-evident choice for neural networks

Cédric Bellet
Biffures
6 min read · Mar 7, 2018


We left the last article reflecting on how powerful single neurons are, particularly neurons with Identity or Sigmoid activations, which we now know can solve linear and logistic regressions. We also wondered why, despite their qualities, those types of neurons are not the self-evident choice for neural networks.

We use Deep Sparse Rectifier Neural Networks (Glorot et al., 2011) to find answers, with the key findings summarized in the two points below:

  • Rectifiers do not do much as single neurons but shine when used in networks.
  • Rectifiers are at the same time biologically plausible and performant, and they create sparsity in a neural network (a high number of neurons with an output of exactly 0).

Activation functions in networks: the case for rectifiers

In the rest of this article, we examine why rectifiers stand out as a choice of activation function. We start by ruling out linear neurons as candidates for neural networks, then explore the advantages of rectifiers over comparable non-linear neurons, such as the sigmoid (logistic) ones.

  • Linear activation neurons do not stack. N layers of linear neurons are equivalent to a single layer of linear neurons, given how information propagates in neural networks. This defeats the purpose of building neural networks, and we make a mental note that any good activation function candidate for a multi-layer network will need to be non-linear (a short numerical check follows the quote below).
  • Sigmoid activation neurons, which we know can perform logistic regressions on their own, work fine in deep neural networks too. We simply note that in a network of such neurons, every neuron is always active and firing some signal.
  • Rectifying neurons are neurons using x → max(0, x) as their activation function. They have no clear use as single neurons but actually do well in deep neural networks. In Glorot et al., we learn that rectifying neurons have three interesting properties:

“Rectifying neurons are an even better model of biological neurons and yield equal or better performance than hyperbolic tangent [and sigmoid] networks in spite of the hard non-linearity and non-differentiability at zero, creating sparse representations with true zeros, which seem remarkably suitable for naturally sparse data.” (Glorot et al., 2011)
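Before unpacking those three properties, here is a quick numpy check of the first bullet above. It is a minimal sketch with arbitrary shapes and values (not taken from the paper), showing that two stacked linear layers compute exactly the same function as one combined linear layer, and that inserting a rectifier between them breaks the equivalence.

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary illustrative shapes: a 4-dimensional input, then layers of 3 and 2 units.
x = rng.normal(size=4)
W1, b1 = rng.normal(size=(3, 4)), rng.normal(size=3)
W2, b2 = rng.normal(size=(2, 3)), rng.normal(size=2)

# Two stacked linear layers...
two_linear_layers = W2 @ (W1 @ x + b1) + b2

# ...collapse into one linear layer with combined weights and bias.
W, b = W2 @ W1, W2 @ b1 + b2
print(np.allclose(two_linear_layers, W @ x + b))   # True

# With a rectifier in between, the network is no longer a single linear map.
relu = lambda z: np.maximum(0.0, z)
rectified_stack = W2 @ relu(W1 @ x + b1) + b2
print(np.allclose(rectified_stack, W @ x + b))     # False in general
```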

What do those properties mean?

1. Biological plausibility

Biological plausibility is the quality of a model being close to biological reality, with the hope that learnings from the model could inform our understanding of biology, or, conversely, that the model could inherit some of the properties of the reality it drew inspiration from. Here, biological plausibility is about two things: (a) plausibility of the neuron’s activation function, and (b) plausibility of the neural net’s overall behavior.

(a) Biological neurons are units that receive an electrical current as input and, if that current is strong enough, fire output electrical signals characterized by their frequency. In the absence of input current, biological neurons are not active, and as the current increases, their firing rate increases.

Left: an activation function motivated by biological observation (leaky integrate-and-fire model or LIF). Center and right: classic machine learning activation functions

The biological reference is the leaky integrate-and-fire (LIF) model on the left; next to it are four activation functions used in machine learning. Tanh is considered the worst from a biological plausibility perspective because of its antisymmetry around 0. Sigmoid and softplus are better, but they keep firing across the entire input range. The rectifier is considered best because it is inactive for input values below 0.
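To put numbers on that comparison, here is a small sketch of my own (the sample points are arbitrary) evaluating the four functions: only the rectifier returns exact zeros for negative inputs, while sigmoid and softplus keep emitting small positive values everywhere.

```python
import numpy as np

activations = {
    "tanh":      np.tanh,
    "sigmoid":   lambda x: 1.0 / (1.0 + np.exp(-x)),
    "softplus":  lambda x: np.log1p(np.exp(x)),
    "rectifier": lambda x: np.maximum(0.0, x),
}

x = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
for name, f in activations.items():
    print(f"{name:9s} {np.round(f(x), 3)}")

# Rounded output:
# tanh      [-0.995 -0.762  0.     0.762  0.995]
# sigmoid   [0.047 0.269 0.5   0.731 0.953]
# softplus  [0.049 0.313 0.693 1.313 3.049]
# rectifier [0. 0. 0. 1. 3.]
```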

(b) Because rectifiers output 0 for negative inputs, they cause networks to have a large number of inactive neurons during forward passes (all neurons with a negative weighted input end up producing no signal). This behavior matches observations from “real” neural networks:

“Studies on brain energy expense suggest that neurons encode information in a sparse and distributed way (Attwell and Laughlin, 2001), estimating the percentage of neurons active at the same time to be between 1 and 4% (Lennie, 2003). This corresponds to a trade-off between richness of representation and small action potential energy expenditure.” (Glorot et al., 2011)

Of the four activation functions listed, only the rectifier produces that behavior, winning it the title of most biologically plausible function.

We note, however, that rectifiers are not perfect: their output grows without bound, for example. A question I ask myself is: if biological plausibility is so important, why not use the LIF model directly, or x → max(0, tanh(x)), or a step function?

  • First off, step functions are not friendly to the back-propagation algorithm, which requires non-zero derivatives (a short numeric illustration closes this section):
An excerpt from https://sefiks.com/2017/05/15/step-function-as-a-neural-network-activation-function/
  • Second, smoothed LIF activations have in fact been used: Hunsberger and Eliasmith, for example, train deep networks with a “soft” LIF activation and report results on MNIST alongside other models, at the cost of a more involved activation function than the plain rectifier.
Left: soft LIF activation used by Hunsberger and Eliasmith. Right: results obtained on MNIST by different models, including LIF-based ones
  • Finally, and more generally, the field seems to be moving Towards Biologically Plausible Deep Learning (Bengio et al., 2016), a trend that invites more plausibility into artificial neural networks and challenges even one of the main algorithms in deep learning, the back-propagation algorithm:

“Whereas back-propagation offers a machine learning answer, it is not biologically plausible, as discussed in the next paragraph. Finding a biologically plausible machine learning approach for credit assignment in deep networks is the main long-term question to which this paper contributes.” (Bengio et al., 2016)

In the context of this article, however, we stick to a back-propagation-based approach to neural networks, a setting in which rectifiers offer a good compromise between performance, cost, and biological plausibility.
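To close this first point, here is a small numeric illustration of the step-function issue raised in the first bullet above. It is a sketch using arbitrary sample points and finite differences rather than any particular library's autodiff: the step function has a zero derivative everywhere away from the origin, so back-propagation gets no signal through it, whereas the rectifier passes a gradient of 1 wherever the unit is active.

```python
import numpy as np

step = lambda z: (z > 0).astype(float)       # hard threshold
relu = lambda z: np.maximum(0.0, z)          # rectifier

def numerical_grad(f, z, eps=1e-6):
    # Central finite difference, evaluated away from the kink at 0.
    return (f(z + eps) - f(z - eps)) / (2 * eps)

z = np.array([-2.0, -0.5, 0.5, 2.0])
print(numerical_grad(step, z))   # [0. 0. 0. 0.] -> no gradient signal anywhere
print(numerical_grad(relu, z))   # [0. 0. 1. 1.] -> gradient flows through active units
```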

2. Performance

Rectifiers offer performance comparable to or better than that of tanh and softplus neurons:

Rectifier networks obtained better or comparable results (Glorot et al., 2011)

3. Sparsity

Finally, we read that rectifier networks create “sparse representations”. In the MNIST case above, this translated into networks with up to 85% of their neurons producing true zeros, without hurting the networks’ test error.

“After uniform initialization of the weights, around 50% of hidden units continuous output values are real zeros, and this fraction can easily increase with sparsity-inducing regularization.” (Glorot et al., 2011)
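As a minimal sketch of that last point, under assumed layer sizes and a zero-centred uniform initialization (the paper's exact setup may differ): with symmetric weights and roughly zero-mean inputs, about half of the pre-activations fall below zero, so about half of the rectifier outputs are exact zeros before any training or sparsity-inducing regularization.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed sizes: 784-dimensional inputs (MNIST-like), 1000 rectifier units, 256 samples.
n_inputs, n_hidden, n_samples = 784, 1000, 256
W = rng.uniform(-0.05, 0.05, size=(n_hidden, n_inputs))   # zero-centred uniform init
b = np.zeros(n_hidden)
X = rng.normal(size=(n_samples, n_inputs))                # stand-in for real data

H = np.maximum(0.0, X @ W.T + b)     # forward pass through the rectifier layer
print(f"{np.mean(H == 0.0):.0%} of hidden activations are exact zeros")  # ~50%
```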

Glorot et al. claim that sparsity has mathematical advantages on top of being biologically plausible. I reproduce below, without demonstration, some of the key arguments the paper advances in favor of sparsity:

  • Information disentangling: sparsity reduces the complexity of connections between layers of neurons, isolating key non-zero neurons from irrelevant variations, therefore creating robust information paths.
  • Variable-size representation: each input activates an information path of its own, of a dimensionality that can ultimately be commensurate with complexity of the information received.
  • Computational efficiency: sparse representations are cheaper to compute with than dense ones, even though dense representations are the richest.

With the case for neural networks using rectifiers now established, we will focus in the next article on building real rectifier networks with TensorFlow, and will try to answer why (not just whether) rectifier networks work.
