Hidden Units in Neural Networks

What are the hidden layers in deep neural networks? How are they constructed?

Jake Batsuuri
Computronium Blog
12 min read · May 5, 2020


Here are the topics of study in this article:

  • A quick overview of Perceptrons and neural networks
  • Anatomy of a machine learning algorithm
  • The main functionality of hidden units
  • Criteria for best performance
  • Paper on activation functions

Overview of neural networks

If you just take the neural network as the object of study and forget everything else surrounding it, it consists of an input layer, a bunch of hidden layers and then an output layer. That’s it. This neural network can be called a Perceptron. We saw before that the output layer gives you:

The predicted value of the Perceptron given the training input x.

Then between the input and the output are the hidden layer(s). These layers are responsible for the heavy lifting: finding the small features that eventually lead to the overall prediction result.

Anatomy of a machine learning algorithm

In a way, you can think of Perceptrons as gates, like logic gates. Logic gates are operators on inputs, so a Perceptron as a black box is an operator as well. But if you open up the black box, this operator itself is made up of tinier operators.

In Keras, a layer instance looks like this:

keras.layers.Dense(512, activation='relu')

Programmatically you can think of this layer as having this form:

output = relu(dot(W, x) + b)

where relu is the mathematical function max(z, 0), and its argument z is built from:

  • dot is the dot product
  • W is a 2D tensor, the weight matrix
  • x is a tensor, the training data point if you will
  • b is a vector, the bias vector

Now in mathematical terms, our z is equal to:

z = Wx + b

and the output of the hidden layer, not to be confused with the output unit, is:

h = g(z) = max(0, Wx + b)

This output can be the output unit in rare cases. Generally, it’s just the output of the hidden unit. If the output unit spits out the predicted y, the hidden unit spits out the h, which is the input to the output unit.
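To make that concrete, here is a minimal NumPy sketch of the hidden-layer computation; the shapes (4 inputs, 3 hidden units) and the values are illustrative assumptions, not anything fixed by the Keras snippet above:

import numpy as np

def relu(z):
    # Element-wise max(z, 0)
    return np.maximum(z, 0.0)

# Illustrative shapes: 4 input features, 3 hidden units
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))   # weight matrix, a 2D tensor
b = np.full(3, 0.1)           # bias vector, small positive init
x = rng.normal(size=4)        # one training data point

z = W @ x + b                 # affine transformation: dot(W, x) + b
h = relu(z)                   # hidden-unit output, the input to the output unit
print(h)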

Putting it together, the Perceptron as a whole computes something like f(x; θ) = O(h(x; θ₁); θ₂). Here, x is the input, the thetas are the parameters, h() is the hidden unit, O() is the output unit and the overall f() is the Perceptron as a function.

The layers contain the knowledge “learned” by the optimizer, stored in the form of the weights W. The inputs, usually one or two tensors, pass through a layer, and the weights determine its output, which is also usually one or two tensors.

Different layer structures are appropriate for different data. For example, simple vector data that can be stored in a 2D tensor of shape (samples, features) is often processed by densely connected layers, sometimes called fully connected layers. This is the classic feedforward neural network, and it is what Dense refers to in the code snippet above:

keras.layers.Dense(512, activation='relu')
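For instance, a tiny fully connected network for (samples, features) data could be sketched as follows, assuming TensorFlow’s bundled Keras; the layer sizes and the 784-feature input are illustrative assumptions:

from tensorflow import keras

# A minimal feedforward (densely connected) network sketch
model = keras.Sequential([
    keras.layers.Dense(512, activation='relu', input_shape=(784,)),  # hidden layer
    keras.layers.Dense(10, activation='softmax')                     # output layer
])
model.summary()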

Let’s talk a little bit about the activation functions…

The main functionality of hidden units

A lot of the objects we have studied so far appear in both machine learning and deep learning, but hidden units and output units are objects largely specific to deep learning. These hidden units come in many types.

Since this is an area of active research, many more are being studied and some have probably yet to be discovered. And because the field is still young, the principles and definitions are not set in stone. The closest thing to a formal definition is this: a hidden unit takes in a vector/tensor, computes an affine transformation z and then applies an element-wise nonlinear function g(z), where z is:

z = Wx + b

The way hidden units are differentiated from each other is based on their activation function, g(z):

  • ReLU
  • ELU
  • GELU
  • Maxout
  • PReLU
  • Absolute value rectification
  • LeakyReLU
  • Logistic Sigmoid
  • Hyperbolic Tangent
  • Hard Hyperbolic Tangent
  • Identity
  • Softplus
  • Softmax
  • RBF
  • etc

Here we explore the different types of hidden units so that when it’s time to choose one for an application you’re developing, you have some intuition about which one to use. When you’re in the initial stages of development, don’t be afraid to experiment through trial and error.

What’s ReLU?

ReLU stands for Rectified Linear Unit. Rectified Linear Units are pretty much the standard that everyone defaults to, but they are only one of many options. The activation function looks like:

g(z) = max(0, z)

Like I just mentioned, this max activation function sits on top of the affine transformation z. When plotted, it is flat at 0 for all negative inputs and rises along the identity line for positive inputs, with a sharp corner at 0.

Why might these properties be important you ask?

This function is rectified in the sense that what would normally be a fully linear unit is made 0 on half of its domain. The ReLU is not differentiable at 0, since it has a sharp corner there. Ideally we would want a fully differentiable function without any such points, but it turns out gradient descent still performs quite well despite it, because we don’t expect training to land exactly on that point anyway. I guess this is one of the reasons I really like deep learning and machine learning: at some point you can just relax the mathematical rigour and go with something that works. It’s applied math.

However, to help the unit start out away from the 0 point, we initialize the b in the affine transformation to a small positive value like 0.1. One drawback remains: when the activation is exactly zero, the gradient is zero too, so the unit cannot learn from those examples. A few variants of the ReLU try to address this issue: one is called Absolute Value Rectification, another Leaky ReLU, and another PReLU, or Parametric ReLU.
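As a rough sketch of how these variants differ (the 0.01 slope for Leaky ReLU is just a common illustrative choice, and prelu’s alpha stands in for a learned parameter):

import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def absolute_value_rectification(z):
    # g(z) = |z|: the negative half is reflected instead of zeroed
    return np.abs(z)

def leaky_relu(z, alpha=0.01):
    # A small fixed slope alpha lets gradient flow when z < 0
    return np.where(z > 0, z, alpha * z)

def prelu(z, alpha):
    # Same shape as Leaky ReLU, but alpha is a learned parameter
    return np.where(z > 0, z, alpha * z)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(z), absolute_value_rectification(z), leaky_relu(z), sep="\n")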

What’s Maxout?

Maxout is a flavour of the ReLU, which itself is a subset of activation functions, which is a component of a hidden unit. As such, we know that a hidden unit will apply an affine transformation to a vector and then apply a nonlinear element-wise activation function. Since Maxout is a flavour of the ReLU, you might expect it to use max(0, z); instead of comparing each element against 0, however, it divides the elements of z into groups of k and selects the maximum of each group. Taking the max over a group, rather than element-wise, makes it much less likely that training ends up sitting on a sharp point.

The Maxout unit is then the maximum element of one of these groups:

g(z)_i = max_{j ∈ G(i)} z_j

where G(i) = {(i − 1)k + 1, …, ik} is the set of indices of the inputs of group i.

With large enough k, a Maxout unit can learn to approximate any convex function with arbitrary fidelity. In particular, a Maxout layer with two pieces can learn to implement the same function of its inputs as ReLU, PReLU, absolute value rectification or Leaky ReLU.

The caveat here is that a Maxout unit is parametrized by k weight vectors instead of one, so it requires more regularization unless the training set is large enough. In general, although there is no limit on k, lower is better as it requires less regularization.
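Here is a minimal sketch of the grouping idea, assuming the affine outputs z have already been computed and that the number of elements is divisible by k:

import numpy as np

def maxout(z, k):
    # Split z into groups of size k and keep the max of each group.
    # Assumes len(z) is divisible by k.
    return z.reshape(-1, k).max(axis=1)

z = np.array([0.3, -1.2, 2.5, 0.1, -0.4, 1.7])
print(maxout(z, k=3))  # -> [2.5 1.7]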

What’s Logistic Sigmoid?

If the ReLU is the reigning queen of activation functions, then the logistic sigmoid is the former queen, denoted:

σ(z) = 1 / (1 + e^(−z))

What’s Hyperbolic Tangent?

A close relative of the logistic sigmoid is the hyperbolic tangent, related to the logistic sigmoid by:

tanh(z) = 2σ(2z) − 1

See the relation? They both saturate very extreme values to a small constant value (more on this later). The difference between them is that the sigmoid is 1/2 at 0, whereas tanh is 0 at 0. In that sense, tanh is more like the identity function, at least around 0.
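A quick numerical check of that relation, as a minimal NumPy sketch:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-3, 3, 7)
lhs = np.tanh(z)
rhs = 2 * sigmoid(2 * z) - 1
print(np.allclose(lhs, rhs))  # True: tanh(z) = 2*sigmoid(2z) - 1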

Training a deep neural network with tanh activations,

ŷ = wᵀ tanh(U tanh(Vx)),

is similar to training a linear model,

ŷ = wᵀx,

as long as the activations can be kept small, since tanh behaves like the identity near 0.
Sigmoidal activation functions are more useful in RNNs, probabilistic models and autoencoders, which have additional requirements that rule out piecewise linear activation functions. In general, functions like these that flatten out toward horizontal asymptotes saturate, which gives gradient descent a difficult time.

What’s RBF?

This function, the Radial Basis Function, becomes more active as x approaches a template vector; it saturates to 0 almost everywhere else, which can be annoying for gradient descent:

h_i = exp(−‖w_i − x‖² / σ_i²)
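A rough sketch of a single RBF hidden unit, where the template vector w and the bandwidth sigma are illustrative assumptions:

import numpy as np

def rbf_unit(x, w, sigma=1.0):
    # Most active when x is close to the template w; saturates to 0 far away
    return np.exp(-np.sum((w - x) ** 2) / sigma ** 2)

w = np.array([1.0, 2.0])
print(rbf_unit(np.array([1.0, 2.0]), w))   # 1.0, right on the template
print(rbf_unit(np.array([5.0, -3.0]), w))  # ~0, far from the template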

What’s Softplus?

This one, softplus(z) = log(1 + e^z), is discouraged on empirical grounds, which is counter-intuitive: it is meant to be an improvement on the ReLU, smoothing it out so that it is differentiable everywhere, but in practice it does worse.
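A minimal sketch comparing softplus to the ReLU; np.logaddexp(0, z) is used here as a numerically stable way to compute log(1 + e^z):

import numpy as np

def softplus(z):
    # log(1 + exp(z)), computed stably as logaddexp(0, z)
    return np.logaddexp(0.0, z)

z = np.array([-5.0, 0.0, 5.0])
print(softplus(z))         # smooth everywhere, approaches ReLU for large |z|
print(np.maximum(z, 0.0))  # ReLU, for comparison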

What’s the hard hyperbolic tangent, or hard tanh?

It looks like tanh or the rectifier, but unlike the rectifier it is bounded, and it is computationally cheaper than many of the alternatives. It is simply −1 for inputs below −1, the identity line in between, and 1 for inputs above 1: g(a) = max(−1, min(1, a)).
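In code this is just a clip; a one-line sketch:

import numpy as np

def hard_tanh(a):
    # g(a) = max(-1, min(1, a))
    return np.clip(a, -1.0, 1.0)

print(hard_tanh(np.array([-3.0, -0.5, 0.5, 3.0])))  # [-1.  -0.5  0.5  1. ]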

What’s Identity?

Having an identity function as the activation function is exactly like having no activation function. A linear unit can be a useful output unit, but it can also be a decent hidden unit.

If every layer of the network is a linear transformation, the whole network is also a linear transformation, since the composition of linear transformations is linear.

Generally, multiplying by matrices and adding vectors acts as a transformation that stretches, rotates, compresses and combines the input vector or matrix.

We just learned that neural networks consist entirely of tensor operations, and all of these tensor operations are just geometric transformations of the input data.

It follows, then, that neural networks are just geometric transformations of the input data.

Remember that a hidden unit is:

h = g(Wx + b)

Our network layer has n inputs and p outputs, so W holds np parameters. With this approach we replace that single weight matrix with two linear layers:

h = g(VUx + b)

The first layer uses the weight matrix U and the second uses the weight matrix V. If the first layer, U, produces q outputs, together these layers use only (n + p)q parameters, whereas W alone would use np. Linear hidden units therefore offer an effective way to reduce the number of parameters in a network.
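A quick parameter count under illustrative sizes (n = 1000 inputs, p = 1000 outputs and q = 100 are assumptions for the example):

n, p, q = 1000, 1000, 100

dense_params = n * p            # single weight matrix W: n x p
factored_params = (n + p) * q   # U: q x n plus V: p x q

print(dense_params)     # 1000000
print(factored_params)  # 200000, a 5x reduction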

What’s Softmax?

As hidden units, these are often used in architectures whose goal is to learn to manipulate memory. As output units, they are the ones to use when you have a classification problem and need to pick one of multiple categories, since softmax boosts the largest category and drags the other categories down. This will be studied later.
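For reference, a minimal, numerically stable sketch of the softmax function itself:

import numpy as np

def softmax(z):
    # Subtracting the max keeps exp() from overflowing; the result is unchanged
    e = np.exp(z - np.max(z))
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))  # sums to 1, largest entry boosted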

Criteria for best performance

The earliest gates were discrete binary gates.

Then there were sigmoidal gates, which allowed for differentiation and backpropagation.

As networks got deeper, these sigmoidal units proved ineffective, so the ReLU, which makes a hard decision based on the input’s sign, was adopted into deep neural nets around 2010.

Building on the lessons of the ReLU, the ELU has been adopted since around 2016; the ELU allows negative values to pass through, which sometimes increases training speed.

The final word on these is that, in general, many differentiable functions work just as well as the traditional activation functions. You will hear about a novel function only if it introduces a significant improvement consistently. Otherwise, in many situations, a lot of functions will work equally well.

Since many functions work quite well and the results are sometimes counter-intuitive, the best way to find a high-performing activation function is to experiment. Many activation-function papers do an empirical evaluation of the proposed function against the standard activation functions on computer vision, natural language processing and speech tasks.

Paper on Activation Functions

What’s GELU?

GELU stands for Gaussian Error Linear Unit, and it is a proposed activation function, meant to be an improvement on ReLU and its cousins.

Where the ReLU gates inputs by their sign, the GELU gates inputs by their magnitude. The paper does an empirical evaluation of GELU against the ReLU and ELU activation functions on MNIST, Tweet processing and other tasks, and the authors found that it performed better.

An overly eager practitioner can apparently use the CDF of a Normal distribution with learnable mean and standard deviation, making mu and sigma parameters to be tuned, but in most cases mu = 0 and sigma = 1 will already outperform the ReLU. It avoids the vanishing gradient problem like its relatives in the ReLU class of activation functions, so it seems like an incremental upgrade to the ReLU.
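A minimal sketch of the standard (mu = 0, sigma = 1) GELU using SciPy’s normal CDF; this follows the exact form x · Φ(x) rather than any faster approximation:

import numpy as np
from scipy.stats import norm

def gelu(x):
    # Gate each input by the standard normal CDF of its own value: x * Phi(x)
    return x * norm.cdf(x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(gelu(x))  # small negatives are damped rather than zeroed out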

Other Articles

This post is part of a series of stories that explores the fundamentals of deep learning:

1. Linear Algebra Data Structures and Operations
Objects and Operations
2. Computationally Efficient Matrices and Matrix Decompositions
Inverses, Linear Dependence, Eigendecompositions, SVD
3. Probability Theory Ideas and Concepts
Definitions, Expectation, Variance
4. Useful Probability Distributions and Structured Probabilistic Models
Activation Functions, Measure and Information Theory
5. Numerical Method Considerations for Machine Learning
Overflow, Underflow, Gradients and Gradient Based Optimizations
6. Gradient Based Optimizations
Taylor Series, Constrained Optimization, Linear Least Squares
7. Machine Learning Background Necessary for Deep Learning I
Generalization, MLE, Kullback Leibler Divergence
8. Machine Learning Background Necessary for Deep Learning II
Regularization, Capacity, Parameters, Hyperparameters
9. Principal Component Analysis Breakdown
Motivation, Derivation
10. Feedforward Neural Networks
Layers, definitions, Kernel Trick
11. Gradient Based Optimizations Under The Deep Learning Lens
Stochastic Gradient Descent, Cost Function, Maximum Likelihood
12. Output Units For Deep Learning
Stochastic Gradient Descent, Cost Function, Maximum Likelihood
13. Hidden Units For Deep Learning
Activation Functions, Performance, Architecture

Up Next…

Coming up next is the architectural design of neural networks. If you would like me to write another article explaining a topic in-depth, please leave a comment.

For the table of contents and more content click here.
