Exploring Residual Connections In Transformers

A Key to Achieving State-of-the-Art Results

Ryan Partridge
Jun 25, 2023 · 5 min read

This article is part of a series about the Transformer architecture. If you haven’t read the others, refer to the introductory article here.

Algorithms powered by Neural Networks (NNs) are now ubiquitous across industries and have helped solve complex problems, such as predicting protein structures (Jumper et al., 2021). Yet, for all their capability, they become notoriously difficult to train as their depth and complexity grow. Over the past decade, numerous techniques have been developed to ease training, but one in particular has proved monumental to the Transformer’s success: Residual Connections.

In this article, we build an intuitive understanding of Residual Connections and explore why they are important in Deep NNs and the Transformer architecture.

Residual Connections

Residual Connections were first introduced in the paper “Deep Residual Learning for Image Recognition” (He et al., 2015) as part of the ResNet architecture, a deep Convolutional Neural Network (CNN) designed for image classification. While investigating the effect of adding more layers to CNNs, the authors came across a counterintuitive phenomenon: beyond a certain depth, accuracy stops improving and then degrades, even on the training set.
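To make the idea concrete before digging deeper, here is a minimal sketch of a residual (skip) connection written in PyTorch. The block computes y = F(x) + x, where F is a small sub-network; the layer sizes, module names, and stacking depth below are illustrative assumptions, not details taken from the ResNet paper or the rest of this series.

```python
import torch
import torch.nn as nn


class ResidualBlock(nn.Module):
    """A minimal residual block: output = F(x) + x."""

    def __init__(self, dim: int):
        super().__init__()
        # F(x): a small sub-network whose output has the same shape as its
        # input, so the element-wise addition with x is valid.
        self.block = nn.Sequential(
            nn.Linear(dim, dim),
            nn.ReLU(),
            nn.Linear(dim, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The skip connection adds the unchanged input back onto the
        # sub-network's output, giving gradients a direct path to earlier layers.
        return self.block(x) + x


# Usage: stacking residual blocks keeps a deep network trainable.
x = torch.randn(8, 64)  # batch of 8 vectors, dimension 64 (illustrative)
deep_net = nn.Sequential(*[ResidualBlock(64) for _ in range(16)])
y = deep_net(x)  # same shape as x: torch.Size([8, 64])
```

The key design choice is that F only needs to learn the residual (the change to apply to x) rather than the full transformation, and the identity path means each block can fall back to passing its input through unchanged.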

Usually, adding extra layers increases a model’s accuracy. However, we…

