Visualizations of Recurrent Neural Networks
Recurrent neural networks (RNNs) have been shown to be effective in modeling sequences. They have wide applicability in many text domains: chatbots, machine translation, language modeling, etc.
Neural networks are considered black boxes, and deservedly so. But they can show a surprising amount of information once you unwrap them. Since RNNs are fairly complicated models, there is a lot to unpack.
Some of the visuals are implemented in a Gist. Much of it is adapted from the IMDB sentiment classifier example in Keras, and the Gist is also embedded at the bottom.
Model
Simple RNN
The Simple RNN (a.k.a. Elman RNN) is the most basic form of RNN and it’s composed of three parts.
- Input, hidden, output vectors at time t: x(t), h(t), y(t)
- Weight matrices: W1, W2, W3
- Activation function: f
The input and output vectors are drawn from the data and they are either word / character embeddings or output labels. The hidden vectors recurrently compose all the information from the past.
h(t) = f(W1 · x(t) + W2 · h(t-1))
The model is largely linear algebra (matrix-vector products that weight each neuron's inputs and sum them up) until the non-linear activation function. The output function can either be applied at each time step (e.g. language modeling),
y(t) = f(W3 · h(t))
or at the end of the sequence at time T (e.g. classification).
y = f(W3 · h(T))
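The three equations above can be sketched in a few lines of numpy. This is a minimal illustration, not the Gist's Keras model: the dimensions, random weights, and the choice of tanh/softmax as the activation functions are all assumptions for the sake of a runnable example.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h, d_out, T = 4, 3, 2, 5   # illustrative sizes
W1 = rng.normal(size=(d_h, d_in))  # input -> hidden
W2 = rng.normal(size=(d_h, d_h))   # hidden -> hidden
W3 = rng.normal(size=(d_out, d_h)) # hidden -> output

def step(x_t, h_prev):
    # h(t) = f(W1 · x(t) + W2 · h(t-1)), with f = tanh
    return np.tanh(W1 @ x_t + W2 @ h_prev)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

xs = rng.normal(size=(T, d_in))    # a toy sequence of 5 "word embeddings"
h = np.zeros(d_h)
hiddens = []
for x_t in xs:
    h = step(x_t, h)
    hiddens.append(h)

# classification at the end of the sequence: y = f(W3 · h(T))
y = softmax(W3 @ h)
```

For per-step outputs (the language-modeling case), you would instead apply `softmax(W3 @ h_t)` to every element of `hiddens`.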
Visualizations
Hidden Layer — Activations
The most straightforward visualization of an RNN is to simply show the values of the layers. The vector values of h(t) represent the activations of each neuron. [gist line 88]
Below is our visualization of the hidden states in an RNN classifier for politeness. The hidden states are normalized to follow a standard normal distribution.
Even if individual neurons are not interpretable, there is a distinct pattern in the hidden vectors. The word “crap” is clearly incongruent with the rest of the sentence.
The overall score of the sentence (0.839 in this case) is calculated by passing the last hidden vector through the output function. [gist line 87]
The score for each word can be calculated by passing all the hidden vectors through the output function. [gist line 153]
For multi-level RNNs, the hidden vectors would be more abstract. The hidden vector values at the highest level should be smoother than the lower levels across time steps.
Hidden Layer — Gradients
It’s hard to see which words are important from the raw values. But notice that going from the word “the” to “crap” there is a large change. By observing how much change each word causes, we can determine its salience.
Gradients are perfect for this task. By taking the gradient of the loss function with respect to the RNN hidden layer, it is easy to tell which words are salient. [gist line 91]
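In the Gist this is done with the framework's autodiff; as a framework-free illustration, the sketch below approximates the gradient of the final score with respect to each hidden state by finite differences (perturb h(t), re-run the remaining steps, and measure the change). All weights and dimensions are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(2)
T, d_in, d_h = 5, 4, 3
W1 = rng.normal(size=(d_h, d_in))
W2 = rng.normal(size=(d_h, d_h))
w_out = rng.normal(size=d_h)       # linear output head for simplicity
xs = rng.normal(size=(T, d_in))

def forward(xs):
    h, hs = np.zeros(d_h), []
    for x_t in xs:
        h = np.tanh(W1 @ x_t + W2 @ h)
        hs.append(h)
    return np.array(hs)

def score_from(h_t, t):
    # continue the forward pass from a (possibly perturbed) hidden state at step t
    h = h_t
    for x_t in xs[t + 1:]:
        h = np.tanh(W1 @ x_t + W2 @ h)
    return w_out @ h

hs = forward(xs)
eps = 1e-5
saliency = np.zeros(T)
for t in range(T):
    grad = np.zeros(d_h)
    for j in range(d_h):
        hp = hs[t].copy(); hp[j] += eps
        hm = hs[t].copy(); hm[j] -= eps
        grad[j] = (score_from(hp, t) - score_from(hm, t)) / (2 * eps)
    saliency[t] = np.linalg.norm(grad)  # magnitude of d(score)/d h(t)
```

A word whose hidden state has a large gradient norm moves the final score a lot, which is exactly the salience signal described above.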
Embedding Drift
Word embeddings have been covered extensively before. Essentially, word vectors are clustered together by their semantic similarity. When training RNN classifiers, we usually include pre-trained word embeddings as good priors (initial values).
As the word embeddings get tuned to the task, their meanings drift. [gist line 102]
The word “why” is usually at the start of the sentence and included alongside a question. In politeness research, questions containing “why” have a high chance of becoming impolite, unlike “what”.
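One simple way to quantify drift, sketched below, is the cosine distance between each word's pre-trained vector and its fine-tuned vector. The vocabulary, dimensions, and the synthetic "tuned" matrix are illustrative; in practice you would compare the embedding layer's weights before and after training.

```python
import numpy as np

rng = np.random.default_rng(3)
vocab = ["why", "what", "please", "the"]
pretrained = rng.normal(size=(len(vocab), 8))
# stand-in for the embeddings after task training
tuned = pretrained + 0.3 * rng.normal(size=pretrained.shape)

def cosine(a, b):
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

# drift = 1 - cosine similarity; 0 means unchanged, larger means more drift
drift = {w: 1.0 - cosine(pretrained[i], tuned[i]) for i, w in enumerate(vocab)}
most_drifted = max(drift, key=drift.get)
```

Words whose task meaning diverges from their distributional meaning (like “why” in politeness data) should show the largest drift.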
Memory — LSTM Cells
Attention and memory in neural networks are hot topics. Simple RNNs theoretically remember long-range dependencies, but in practice they forget within a few words. A memory mechanism can preserve some of the information in the history.
The Long Short-Term Memory (LSTM) model is one such extension to the RNN. It adds a memory vector, c(t), along with a learned proportion, p(t), that decides how much of each neuron's memory passes through. The latter is called a gate, and it modulates between old and new information using another neural network layer.
p(t) = f(W4 · x(t) + W5 · h(t-1))
c(t) = p(t) ⊙ c(t-1) + other stuff
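The gated memory update can be sketched like this. The "other stuff" is collapsed into a single candidate term gated by (1 − p) for brevity (a full LSTM has separate input and output gates); the weight names Wc_x and Wc_h for the candidate are my own placeholders.

```python
import numpy as np

rng = np.random.default_rng(4)
d_in, d_h = 4, 3
W4 = rng.normal(size=(d_h, d_in))
W5 = rng.normal(size=(d_h, d_h))
Wc_x = rng.normal(size=(d_h, d_in))  # candidate weights (placeholder names)
Wc_h = rng.normal(size=(d_h, d_h))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def memory_step(x_t, h_prev, c_prev):
    # p(t) = f(W4 · x(t) + W5 · h(t-1)); sigmoid keeps it in (0, 1)
    p = sigmoid(W4 @ x_t + W5 @ h_prev)
    candidate = np.tanh(Wc_x @ x_t + Wc_h @ h_prev)  # new information
    # c(t) = p(t) ⊙ c(t-1) + "other stuff"
    c = p * c_prev + (1 - p) * candidate
    return p, c

x = rng.normal(size=d_in)
p, c_new = memory_step(x, np.zeros(d_h), np.zeros(d_h))
```

Because each gate value sits strictly between 0 and 1, the gate decides per neuron how much old memory survives versus how much new information enters.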
Yes, these models have a litany of redundant calculations. They take a scattershot approach to modeling text, since they eschew traditional featurization. The downside is that you need a large amount of data for stable optimization.
We can visualize these gates as well. [gist line 89 for weight matrices but I wasn’t sure how to get the gate outputs themselves]
For the graph of the input gate, i(t), we notice that the words “accommodation”, “discount” and “reservation” are unimportant.
Memory — Attention
Attention is another method of persisting memory. After the RNN layer, it adds (you guessed it!) another neural network layer.
u(t) = f(W6 · h(t))
This is projected to a scalar and normalized so the weights add up to 1.
a(t) = softmax(w7 · u(t))
Attention allows us to get interpretable weights (salience) for each word.
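The two attention equations can be sketched end to end: score each hidden state, softmax the scores into weights, then use those weights as per-word salience. The hidden states here are random stand-ins, and tanh is an assumed choice for f.

```python
import numpy as np

rng = np.random.default_rng(5)
T, d_h, d_a = 6, 3, 4
hiddens = rng.normal(size=(T, d_h))  # stand-in for the RNN hidden states
W6 = rng.normal(size=(d_a, d_h))
w7 = rng.normal(size=d_a)

# u(t) = f(W6 · h(t)), computed for every time step at once
u = np.tanh(hiddens @ W6.T)

# a(t) = softmax(w7 · u(t)): one scalar per word, normalized to sum to 1
scores = u @ w7
a = np.exp(scores - scores.max())
a /= a.sum()

# the attention-weighted summary that feeds the classifier
context = a @ hiddens
```

The vector `a` is exactly the interpretable per-word weighting: plot it over the tokens and you get a salience map for free.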
Composition
RNNs can combine multiple words into a single representation, and that composition should show up in the hidden vector dynamics.
Below is a hidden vector visualization of negation.
We see some of the neurons flip as the negation words are added.
Activation Clustering
The last one is simple. We can manually look at each neuron and see which sentences have the highest (or lowest) values. Then we can hypothesize what the neuron signifies.
For example, Karpathy showed, for a character-level RNN language model, one of the neurons detected the end of the line.
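The procedure is just a sort per neuron. In this sketch the sentences and activation values are made up; in practice the activation matrix would be the final (or per-character) hidden states from the trained model.

```python
import numpy as np

rng = np.random.default_rng(6)
sentences = ["could you please help", "this is crap", "thanks so much",
             "why would you do that", "have a nice day"]
# stand-in for the final hidden state of each sentence (d_h = 3)
activations = rng.normal(size=(len(sentences), 3))

neuron = 0
# rank sentences by this neuron's activation, highest first
order = np.argsort(activations[:, neuron])[::-1]
top = [sentences[i] for i in order[:2]]
bottom = [sentences[i] for i in order[-2:]]
# Reading the top and bottom sentences suggests a hypothesis
# for what the neuron signifies.
```

Repeating this for each neuron index is how you would hunt for interpretable units like Karpathy's end-of-line detector.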