Inspecting Layer Normalization In Transformers

A Simple Trick For Improving Model Performance

Ryan Partridge
6 min read · Jul 1, 2023

This article is part of a series about the Transformer architecture. If you haven’t read the others, refer to the introductory article here.

Without a shadow of a doubt, the success of Deep Learning (DL) models heavily depends on breakthroughs in training techniques, as demonstrated by the ReLU activation function and the Adam optimizer (Nair and Hinton, 2010; Kingma and Ba, 2017). Typically, training techniques involve tweaking a model's operations or components, but one approach in particular drastically improves training efficiency without touching those operations at all. Instead, it modifies the data passed between layers. Welcome to the world of Normalization.

Before diving into Layer Normalization (LN; Ba et al., 2016), let's first discuss what Normalization is and then consider a milestone in the space: Batch Normalization (BN; Ioffe and Szegedy, 2015), which forms the basis of LN.

Normalization, together with Standardization, forms its own research area within Machine Learning (ML). At a basic level, Normalization typically refers to rescaling values to the range [0, 1], while Standardization rescales data to follow a standard normal (Gaussian) distribution with mean μ = 0 and variance σ² = 1. Over the past decade, the two terms have blended into one, with Normalization serving as a generic label for both. Additionally, Normalization in ML (or…
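To make the distinction concrete, here is a minimal NumPy sketch (my own illustration; the example values and variable names are assumptions, not from the article):

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])  # hypothetical example data

# Normalization (min-max scaling): squeeze values into the range [0, 1]
x_norm = (x - x.min()) / (x.max() - x.min())
print(x_norm)  # [0.   0.25 0.5  0.75 1.  ]

# Standardization (z-score): shift to mean 0 and scale to variance 1
x_std = (x - x.mean()) / x.std()
print(x_std.mean(), x_std.var())  # ≈ 0.0, ≈ 1.0
```

Both operations preserve the relative ordering of the values; they differ only in the statistics used to rescale them.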

