Lecture Notes for MIT 6.S191: Recurrent Neural Networks, Transformers, and Attention

Proud Jiao
3 min read · Jul 12, 2023


These lecture notes track the key concepts covered in Lecture 2 of the MIT 2023 deep learning course. For each concept, I go over what it is and why it is important.

Sequential Models

WHAT: a model for sequential data, where the output at each time step depends on the inputs that came before it.
WHY: We encounter sequential data in everyday life. Given “I love to write Medium [articles]”, a sequential model allows us to predict the last word, “articles”. Similarly, audio, finance, medical data, and any other time-dependent data can be handled with sequential modeling.
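
To make this concrete, here is a minimal sketch (my own illustration, not from the lecture) of how a next-word prediction task is framed: each training example pairs a prefix of the sentence with the word that follows it.

```python
# Frame next-word prediction as (prefix, target) pairs.
sentence = ["I", "love", "to", "write", "Medium", "articles"]

pairs = [(sentence[:i], sentence[i]) for i in range(1, len(sentence))]
for prefix, target in pairs:
    print(prefix, "->", target)
# ['I'] -> love
# ['I', 'love'] -> to
# ...
# ['I', 'love', 'to', 'write', 'Medium'] -> articles
```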

Recurrent Neural Networks (RNN)

WHAT: a neural network that, at each time step, produces a memory (hidden) state in addition to the output. The memory state is fed into the input of the next time step, so what the network has seen so far affects its future outputs.
WHY: the most basic and intuitive neural-network sequential model. A minimal RNN cell is sketched below.
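
The following is a minimal vanilla RNN forward pass in NumPy (my own sketch; the course itself uses TensorFlow, and the dimensions and random weights here are purely illustrative). It shows the memory state being carried from one time step to the next.

```python
import numpy as np

input_dim, hidden_dim, output_dim = 8, 16, 8
rng = np.random.default_rng(0)
W_xh = rng.normal(scale=0.1, size=(hidden_dim, input_dim))   # input -> hidden
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))  # hidden -> hidden (the "memory")
W_hy = rng.normal(scale=0.1, size=(output_dim, hidden_dim))  # hidden -> output

def rnn_forward(xs):
    """Run the RNN over a list of input vectors, carrying the hidden state along."""
    h = np.zeros(hidden_dim)                 # memory state starts empty
    outputs = []
    for x in xs:                             # one iteration per time step
        h = np.tanh(W_hh @ h + W_xh @ x)     # update memory from previous memory + current input
        outputs.append(W_hy @ h)             # output at this step depends on the memory
    return outputs, h

xs = [rng.normal(size=input_dim) for _ in range(5)]   # a toy 5-step input sequence
ys, final_h = rnn_forward(xs)
```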

Embeddings

WHAT: numerical representations of words (or other tokens) that a network can take as input.
WHY: Say we want to feed “I love to write Medium” into our network and predict “articles” as the next word. We can’t do this directly because a neural network doesn’t understand text; we first need to encode the text into the network’s language, numbers. The simplest way is to assign an index to each word, say “a” is 1 and “cat” is 2, and then use one-hot encoding to generate a vector of length N, where N is the size of the vocabulary: [0, 1, 0, …] would represent “cat” and [1, 0, …] would represent “a”. A more sophisticated way uses a learned neural network that maps each word into an embedding space, where each word sits close to words with similar semantic meaning.
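
Below is a minimal sketch (my own illustration, with a tiny made-up vocabulary) of both encodings: the simple one-hot vector and a learned embedding lookup. The embedding matrix here is random; in a real model it is trained so that semantically similar words end up close together.

```python
import numpy as np

vocab = ["a", "cat", "i", "love", "to", "write", "medium", "articles"]
word_to_index = {w: i for i, w in enumerate(vocab)}    # simplest encoding: an index per word

def one_hot(word):
    """Length-N vector with a 1 at the word's index and 0 elsewhere."""
    vec = np.zeros(len(vocab))
    vec[word_to_index[word]] = 1.0
    return vec

print(one_hot("cat"))     # [0. 1. 0. 0. 0. 0. 0. 0.]

# A learned embedding is just a trainable N x d matrix; looking up a word's row
# gives its d-dimensional vector (random here, learned in practice).
embedding_dim = 4
embedding_matrix = np.random.default_rng(0).normal(size=(len(vocab), embedding_dim))
cat_vector = embedding_matrix[word_to_index["cat"]]
```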

Backpropagation Through Time (BPTT)

WHAT: similar to the backpropagation algorithm in a standard NN, backpropagation in an RNN requires gradients to be computed at every time step and then backpropagated from the most recent time step back to the earliest one to adjust the weights. Common problems include exploding gradients and vanishing gradients. Exploding gradients can be addressed with gradient clipping; vanishing gradients can be addressed by initializing the weights to the identity matrix, choosing better activation functions such as ReLU, and using gated architectures such as LSTMs.
WHY: BPTT is essential to training RNNs. A minimal gradient-clipping sketch is shown below.
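
Here is a minimal sketch (my own illustration) of gradient clipping, the fix mentioned above for exploding gradients: if the gradient's norm exceeds a threshold, rescale it so its norm equals the threshold while keeping its direction.

```python
import numpy as np

def clip_gradient(grad, max_norm=1.0):
    """Rescale the gradient if its norm exceeds max_norm; otherwise leave it alone."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)   # same direction, smaller magnitude
    return grad

g = np.array([3.0, 4.0])       # norm 5.0, an "exploding" gradient
print(clip_gradient(g))        # [0.6 0.8], norm 1.0
```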

Limitations of RNN

WHAT: 1. encoding bottleneck: the entire input sequence must be squeezed into a single fixed-size state; 2. no parallelization: time steps must be processed one after another; 3. no long-term memory: information from early time steps fades by the time later ones are processed.
WHY: the realization of these setbacks prompted better models to be designed.

Self-Attention

WHAT: a mechanism used in the Transformer model that identifies which parts of the input are most important to attend to when producing each output.
WHY: introduced for the Transformer in the 2017 paper Attention Is All You Need, the mechanism eliminates the lack of parallelization in RNN models, since every position in the sequence can be processed at once. A minimal sketch is shown below.
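
The following is a minimal sketch (my own illustration) of scaled dot-product self-attention, the core operation of the Transformer. The projection matrices are random here; in a real model they are learned. Note that all positions are handled in a single batch of matrix multiplications, which is what makes parallelization possible.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """X: (sequence_length, model_dim). Returns one attended vector per position."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v            # queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])        # how relevant each position is to each query
    weights = softmax(scores, axis=-1)             # attention weights sum to 1 for each query
    return weights @ V                             # weighted sum of values = "attended" features

rng = np.random.default_rng(0)
seq_len, d_model = 5, 8
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
out = self_attention(X, W_q, W_k, W_v)             # shape (5, 8): all positions at once
```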
