A Math Deep Dive on Gating in Neural Architectures

Freedom Preetham · Published in Autonomous Agents · Feb 8, 2024

Gating mechanisms in neural networks, particularly in recurrent neural networks (RNNs) such as Long Short-Term Memory (LSTM) units and Gated Recurrent Units (GRUs), have been foundational in advancing sequence modeling. This deep dive explores the mathematics underlying these mechanisms and the transformative role they play in controlling information flow across network layers and through time.

LSTM Gating Mechanism

LSTM units, introduced by Hochreiter and Schmidhuber in 1997, incorporate a complex gating mechanism designed to regulate the flow of information, thereby mitigating issues like vanishing and exploding gradients. The LSTM gating mechanism includes three distinct gates, each with a specific function:

Input Gate (it​)
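
Using the terms defined below, and with the cell-state (peephole) contribution applied element-wise, the input gate can be written as:

it = σ(Wxi xt + Whi ht−1 + Wci ∘ ct−1 + bi)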

This formula is crucial in the LSTM architecture, allowing the model to decide how much of the new information should be added to the cell state.

where,

  • it​ represents the input gate activation at time t.
  • σ denotes the sigmoid activation function, ensuring the output ranges between 0 and 1.
  • Wxi​, Whi​, and Wci​ are weight matrices for the input xt​, the previous hidden state ht−1​, and the previous cell state ct−1​, respectively.
  • bi​ is the bias term for the input gate.
  • ∘ denotes element-wise multiplication (Hadamard product).

Forget Gate (ft​)
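
Mirroring the input gate, the forget gate takes the form:

ft = σ(Wxf xt + Whf ht−1 + Wcf ∘ ct−1 + bf)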

This equation plays a critical role in the LSTM architecture by determining the extent to which the previous cell state ct−1​ should be retained or forgotten as the network processes a sequence.

where,

  • ft​ represents the forget gate activation at time t.
  • Wxf​, Whf​, and Wcf​ are weight matrices associated with the input xt​, the previous hidden state ht−1​, and the previous cell state ct−1​, respectively.
  • bf​ is the bias term for the forget gate.

Output Gate (ot)
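
With the same structure, but peeking at the current rather than the previous cell state (a bias term bo, analogous to bi and bf, is assumed here):

ot = σ(Wxo xt + Who ht−1 + Wco ∘ ct + bo)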

This formula determines how much of the tanh-transformed cell state (ct) is exposed as the hidden state (ht), thereby influencing both the LSTM output at time t and the information passed to the next step.

where,

  • ot​ denotes the output gate activation at time t.
  • Wxo​, Who​, and Wco​ are weight matrices for the input xt​, the previous hidden state ht−1​, and the current cell state ct​, respectively.

Cell State Update
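
The forget and input gates combine the old cell state with a tanh-transformed candidate (a bias term bc for the candidate is assumed here):

ct = ft ∘ ct−1 + it ∘ tanh(Wxc xt + Whc ht−1 + bc)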

This equation is critical for the LSTM’s ability to learn and remember information over long sequences, allowing it to dynamically forget irrelevant information and add relevant new information as it processes each time step.

where,

  • ct​ represents the updated cell state at time t.
  • ft​ denotes the activation of the forget gate, determining how much of the previous cell state ct−1​ is retained.
  • it​ is the activation of the input gate, controlling the addition of new information to the cell state.
  • tanh is the hyperbolic tangent function, providing a non-linear transformation of the input data and previous hidden state.
  • Wxc​ and Whc​ are weight matrices for the input xt​ and the previous hidden state ht−1​, respectively.

Hidden State Update
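
The hidden state is the output-gated, tanh-squashed cell state:

ht = ot ∘ tanh(ct)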

This formula is crucial for modulating the amount of information carried from the cell state to the hidden state, which impacts the LSTM cell’s output and the information passed to subsequent time steps or layers in the network.

where,

  • ht​ represents the hidden state of the LSTM cell at time step t.
  • ot​ denotes the activation of the output gate, determining which parts of the cell state will influence the hidden state and, consequently, the output of the cell at time t.

In these equations, σ represents the sigmoid activation function, ensuring gate outputs range between 0 (fully closed) and 1 (fully open), and ∘ denotes element-wise multiplication. The input, forget, and output gates collaboratively decide which information is retained or discarded as the sequence is processed, enabling effective long-term dependency modeling.
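
To make the interplay of the three gates concrete, the following is a minimal NumPy sketch of a single LSTM step using the peephole equations above. The shapes, random initialization, and parameter names are illustrative assumptions, not a tuned implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, p):
    """One peephole-LSTM step following the equations above.
    p holds Wxi, Whi, Wci, bi, Wxf, Whf, Wcf, bf, Wxo, Who, Wco, bo, Wxc, Whc, bc.
    """
    # Input gate: how much new information enters the cell state.
    i_t = sigmoid(p["Wxi"] @ x_t + p["Whi"] @ h_prev + p["Wci"] * c_prev + p["bi"])
    # Forget gate: how much of the previous cell state is retained.
    f_t = sigmoid(p["Wxf"] @ x_t + p["Whf"] @ h_prev + p["Wcf"] * c_prev + p["bf"])
    # Cell state update: forget old content, add gated candidate content.
    c_t = f_t * c_prev + i_t * np.tanh(p["Wxc"] @ x_t + p["Whc"] @ h_prev + p["bc"])
    # Output gate: how much of the (squashed) cell state is exposed.
    o_t = sigmoid(p["Wxo"] @ x_t + p["Who"] @ h_prev + p["Wco"] * c_t + p["bo"])
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

# Illustrative shapes: input size 4, hidden size 3.
rng = np.random.default_rng(0)
n_in, n_h = 4, 3
p = {name: rng.standard_normal((n_h, n_in)) for name in ["Wxi", "Wxf", "Wxo", "Wxc"]}
p.update({name: rng.standard_normal((n_h, n_h)) for name in ["Whi", "Whf", "Who", "Whc"]})
p.update({name: rng.standard_normal(n_h) for name in ["Wci", "Wcf", "Wco"]})  # peephole weights act element-wise
p.update({name: np.zeros(n_h) for name in ["bi", "bf", "bo", "bc"]})
h, c = np.zeros(n_h), np.zeros(n_h)
h, c = lstm_step(rng.standard_normal(n_in), h, c, p)
```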

Gated Recurrent Unit (GRU)

Introduced by Cho et al. in 2014 and analyzed empirically by Chung et al. (2014), GRUs simplify the LSTM architecture by merging the cell state and hidden state and reducing the number of gates:

Update Gate (zt​)
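
In terms of the quantities listed below (bias terms are omitted, as in the original GRU formulation), the update gate is:

zt = σ(Wz xt + Uz ht−1)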

This formula is critical in a GRU for determining how much of the past information should be carried over to the next time step, thereby balancing between retaining previous memory and incorporating new information.

where,

  • zt​ represents the update gate activation at time t.
  • σ denotes the sigmoid activation function, ensuring the output values are between 0 and 1.
  • Wz​ and Uz​ are weight matrices for the input xt​ and the previous hidden state ht−1​, respectively.
  • xt​ is the input at time t.
  • ht−1​ is the hidden state from the previous time step t−1.

Reset Gate (rt​)
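
The reset gate has the same form, with its own weights:

rt = σ(Wr xt + Ur ht−1)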

This reset gate function is critical within the GRU framework, as it determines the degree to which the previous hidden state should influence the current state’s memory content, effectively allowing the model to discard irrelevant information from the past.

where,

  • rt​ denotes the reset gate activation at time t.
  • Wr​ and Ur​ are the weight matrices associated with the input xt​ and the previous hidden state ht−1​, respectively.

Candidate Hidden State (h̃t)
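
The reset gate modulates the previous hidden state before it enters the candidate computation:

h̃t = tanh(Wh xt + Uh (rt ∘ ht−1))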

This formula plays a crucial role in the GRU’s operation by generating a new candidate hidden state that the network might transition to, based on the current input and the selectively modified previous hidden state.

where,

  • h̃t is the candidate hidden state at time t, which is a potential new state that the GRU might adopt.
  • Wh​ and Uh​ are the weight matrices for the input xt​ and the gated previous hidden state, respectively.
  • rt​ represents the activation of the reset gate at time t, determining how much of the past information (previous hidden state ht−1​) should be combined with the current input.

Final Hidden State Update
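
Following the convention of Chung et al. (2014), the new hidden state is a convex combination of the previous state and the candidate:

ht = (1 − zt) ∘ ht−1 + zt ∘ h̃t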

This formula encapsulates the GRU’s ability to update its hidden state by blending the old state with the new information contained in h̃t, based on the current input’s relevance as determined by the update gate zt. The update gate effectively decides how much past information to keep versus how much of the new candidate to incorporate, enabling efficient learning of dependencies across varying time scales.

where,

  • zt​ is the activation of the update gate at time t, controlling the blend between the previous hidden state ht−1​ and the candidate hidden state h̃t.

GRUs streamline the gating process, maintaining competitive performance with LSTMs on various tasks while offering computational efficiency and simplicity.
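
As with the LSTM above, a minimal NumPy sketch of one GRU step makes the two-gate structure concrete; shapes and initialization are again illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, p):
    """One GRU step following the equations above.
    p holds Wz, Uz, Wr, Ur, Wh, Uh (biases omitted, as in the text).
    """
    # Update gate: blend between the previous state and the candidate.
    z_t = sigmoid(p["Wz"] @ x_t + p["Uz"] @ h_prev)
    # Reset gate: how much of the previous state feeds the candidate.
    r_t = sigmoid(p["Wr"] @ x_t + p["Ur"] @ h_prev)
    # Candidate hidden state built from the input and the reset-gated past.
    h_tilde = np.tanh(p["Wh"] @ x_t + p["Uh"] @ (r_t * h_prev))
    # Final blend (Chung et al., 2014 convention).
    return (1.0 - z_t) * h_prev + z_t * h_tilde

# Illustrative shapes: input size 4, hidden size 3.
rng = np.random.default_rng(0)
n_in, n_h = 4, 3
p = {name: rng.standard_normal((n_h, n_in)) for name in ["Wz", "Wr", "Wh"]}
p.update({name: rng.standard_normal((n_h, n_h)) for name in ["Uz", "Ur", "Uh"]})
h = gru_step(rng.standard_normal(n_in), np.zeros(n_h), p)
```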

Evolution of Gating

The original gating concept has evolved beyond its RNN roots and now covers any multiplicative interaction within a network, often in conjunction with an activation function. This broadened interpretation includes, but is not limited to, mechanisms that have no direct connection to processing across time steps.

Generalized Gating in Neural Architectures

Modern neural networks incorporate gating in various forms, such as attention mechanisms in Transformers, where gating is implicit in the calculation of attention weights.

I have written about this in detail in Comprehensive Deconstruction of LLMS.
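
For scaled dot-product attention (Vaswani et al., 2017):

Attention(Q, K, V) = softmax(QKᵀ / √dk) V

where dk is the dimensionality of the keys.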

Here, gating emerges through the softmax operation, dynamically weighting the importance of each value (V) based on the query (Q) and key (K) compatibility.

Elementwise Multiplicative Gating

Elementwise multiplicative gating, often found in convolutional neural networks (CNNs) and newer architectures, modulates features or attention without any direct tie to sequence length:
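
One common form, in the spirit of gated convolutions and gated linear units and consistent with the terms defined below (bias terms omitted for brevity), is:

G = (Wf ∗ X) ∘ σ(Wg ∗ X)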

This mechanism allows for the dynamic modulation of features within the network, enabling complex interactions and enhancing the network’s ability to capture relevant patterns in the data.

where,

  • G represents the gated output.
  • σ denotes an activation function, typically a sigmoid, which outputs values between 0 and 1, enabling the gate to control the flow of information.
  • Wg​ and Wf​ are the weight matrices for the gating and filtering operations, respectively.
  • ∗ indicates the convolution operation applied to the input X.
  • X is the input to the gating mechanism.
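
A toy single-channel, 1-D NumPy sketch of this gated-convolution pattern (the kernels, sizes, and sigmoid gate are illustrative assumptions):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# G = (Wf * X) ∘ σ(Wg * X): one kernel produces features, the other per-position gates.
rng = np.random.default_rng(0)
X = rng.standard_normal(32)   # input signal
Wf = rng.standard_normal(5)   # "filter" kernel
Wg = rng.standard_normal(5)   # "gate" kernel

features = np.convolve(X, Wf, mode="same")        # Wf * X (convolution)
gate = sigmoid(np.convolve(X, Wg, mode="same"))   # σ(Wg * X), values in (0, 1)
G = features * gate                               # element-wise gating of the features
```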

The Semantic Divergence of Gating

The transition from the specific gating mechanisms of RNNs to the generalized concept of gating in contemporary architectures illustrates a significant semantic shift. Originally designed to manage sequence information and long-term dependencies, gating now encompasses a wider range of multiplicative interactions across neural network designs, reflecting the continuous innovation and expanding capabilities in the field of deep learning. This evolution emphasizes the versatility and critical importance of gating mechanisms in enhancing neural network functionality, offering nuanced control over information processing and flow within various architectures.
