How Self-Attention with Relative Position Representations works

8 min read · Feb 1, 2019

Introduction

This article is based on the paper titled Self-Attention with Relative Position Representations by Shaw et al. The paper introduced an alternative means of encoding positional information about an input sequence inside a Transformer. In particular, it modified the Transformer’s self-attention mechanism to efficiently consider the relative distances between sequence elements.

My goal is to explain the salient aspects of this paper in such a way that people unaccustomed to reading academic papers can understand them. I assume the reader has basic familiarity with Recurrent Neural Networks (RNNs) and the multi-head self-attention mechanism in Transformers.

Motivation

The architecture of an RNN allows it to implicitly encode sequential information using its hidden state. For example, the diagram below depicts an RNN that outputs a representation for each word in an input sequence where the input sequence is “I think therefore I am”:

The output representation for the second “I” is not the same as the output representation for the first “I” because the hidden states fed into these two positions are not the same: for the second “I”, the hidden state has already passed through the words “I think therefore”, while for the first “I” the hidden state has not passed through any words yet.
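To make this concrete, here is a minimal PyTorch sketch (not from the paper) showing that an off-the-shelf RNN produces different output representations for the two occurrences of “I”, purely because of the hidden-state history each one receives. The toy vocabulary, dimensions, and variable names are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Toy vocabulary and token sequence for "I think therefore I am" (illustrative).
vocab = {"I": 0, "think": 1, "therefore": 2, "am": 3}
tokens = ["I", "think", "therefore", "I", "am"]
ids = torch.tensor([[vocab[t] for t in tokens]])            # shape: (1, 5)

embed = nn.Embedding(num_embeddings=len(vocab), embedding_dim=8)
rnn = nn.RNN(input_size=8, hidden_size=8, batch_first=True)

# The RNN reads the sequence left to right, threading a hidden state
# through every step, so each output depends on all preceding words.
outputs, _ = rnn(embed(ids))                                # shape: (1, 5, 8)

first_I, second_I = outputs[0, 0], outputs[0, 3]
# Same word, same embedding, but different hidden-state histories,
# hence different output representations.
print(torch.allclose(first_I, second_I))                    # almost certainly False
```

The two “I” tokens share the same embedding, yet their outputs differ because position (via the hidden state) is baked into the computation. This is exactly the implicit positional information that vanilla self-attention lacks and that the paper sets out to restore.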
