Visualizations of Recurrent Neural Networks
Recurrent neural networks (RNNs) have been shown to be effective in modeling sequences. They have wide applicability in many text domains: chatbots, machine translation, language modeling, etc.
Neural networks are considered black boxes, and deservedly so. But they can show a surprising amount of information once you unwrap them. Since RNNs are fairly complicated models, there is a lot to unpack.
Some of the visuals are implemented in a Gist. Much of it is adapted from the IMDB sentiment classifier example in Keras, and the Gist is also embedded at the bottom.
Model
Simple RNN
The Simple RNN (a.k.a. Elman RNN) is the most basic form of RNN and it’s composed of three parts.
- Input, hidden, output vectors at time t: x(t), h(t), y(t)
- Weight matrices: W1, W2, W3
- Activation function: f
The input and output vectors are drawn from the data and they are either word / character embeddings or output labels. The hidden vectors recurrently compose all the information from the past.
h(t) = f(W1 · x(t) + W2 · h(t-1))
The model is largely linear algebra (matrix-vector products that weight each neuron's inputs and sum them up) until the non-linear activation function. The output function can either be applied at each time step (e.g. language modeling),
y(t) = f(W3 · h(t))
or at the end of the sequence at time T (e.g. classification).
y = f(W3 · h(T))
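The three equations above can be sketched in a few lines of numpy. This is a minimal illustration, not the Gist's Keras model: the dimensions, random weights, and the choice of tanh/softmax as the activation functions are all assumptions for the sake of a runnable example.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h, d_out, T = 4, 3, 2, 5   # illustrative sizes
W1 = rng.normal(size=(d_h, d_in))  # input -> hidden
W2 = rng.normal(size=(d_h, d_h))   # hidden -> hidden
W3 = rng.normal(size=(d_out, d_h)) # hidden -> output

def step(x_t, h_prev):
    # h(t) = f(W1 · x(t) + W2 · h(t-1)), with f = tanh
    return np.tanh(W1 @ x_t + W2 @ h_prev)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

xs = rng.normal(size=(T, d_in))    # a toy sequence of 5 "word embeddings"
h = np.zeros(d_h)
hiddens = []
for x_t in xs:
    h = step(x_t, h)
    hiddens.append(h)

# classification at the end of the sequence: y = f(W3 · h(T))
y = softmax(W3 @ h)
```

For per-step outputs (the language-modeling case), you would instead apply `softmax(W3 @ h_t)` to every element of `hiddens`.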
Visualizations
Hidden Layer — Activations
The most straightforward visualization of an RNN is to simply show the values of the layers. The vector values of h(t) represent the activations of each neuron. [gist line 88]
Below is our visualization of the hidden states in an RNN classifier for politeness. The hidden states are normalized to follow a standard normal distribution.
Even if individual neurons are not interpretable, there is a distinct pattern in the hidden vectors. The word “crap” is clearly incongruent with the rest of the sentence.
The overall score of the sentence (0.839 in this case) is calculated by passing the last hidden vector through the output function. [gist line 87]
The score for each word can be calculated by passing all the hidden vectors through the output function. [gist line 153]
For multi-level RNNs, the hidden vectors would be more abstract. The hidden vector values at the highest level should be smoother than the lower levels across time steps.
Hidden Layer — Gradients
It’s hard to see which words are important from the raw values. But notice that going from the word “the” to “crap” there is a large change. By observing how much change each word causes, we can determine its salience.
Gradients are perfect for this task. By taking the gradient of the loss function with respect to the RNN hidden layer, it is easy to tell which words are salient. [gist line 91]
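In the Gist this is done with the framework's autodiff; as a framework-free illustration, the sketch below approximates the gradient of the final score with respect to each hidden state by finite differences (perturb h(t), re-run the remaining steps, and measure the change). All weights and dimensions are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(2)
T, d_in, d_h = 5, 4, 3
W1 = rng.normal(size=(d_h, d_in))
W2 = rng.normal(size=(d_h, d_h))
w_out = rng.normal(size=d_h)       # linear output head for simplicity
xs = rng.normal(size=(T, d_in))

def forward(xs):
    h, hs = np.zeros(d_h), []
    for x_t in xs:
        h = np.tanh(W1 @ x_t + W2 @ h)
        hs.append(h)
    return np.array(hs)

def score_from(h_t, t):
    # continue the forward pass from a (possibly perturbed) hidden state at step t
    h = h_t
    for x_t in xs[t + 1:]:
        h = np.tanh(W1 @ x_t + W2 @ h)
    return w_out @ h

hs = forward(xs)
eps = 1e-5
saliency = np.zeros(T)
for t in range(T):
    grad = np.zeros(d_h)
    for j in range(d_h):
        hp = hs[t].copy(); hp[j] += eps
        hm = hs[t].copy(); hm[j] -= eps
        grad[j] = (score_from(hp, t) - score_from(hm, t)) / (2 * eps)
    saliency[t] = np.linalg.norm(grad)  # magnitude of d(score)/d h(t)
```

A word whose hidden state has a large gradient norm moves the final score a lot, which is exactly the salience signal described above.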
Embedding Drift
Word embeddings have been covered extensively before. Essentially, word vectors are clustered together by their semantic similarity. When training RNN classifiers, we usually include pre-trained word embeddings as good priors (initial values).
As the word embeddings get tuned to the task, their meanings drift. [gist line 102]
The word “why” is usually at the start of the sentence and included alongside a question. In politeness research, questions containing “why” have a high chance of becoming impolite, unlike “what”.
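One simple way to quantify drift, sketched below, is the cosine distance between each word's pre-trained vector and its fine-tuned vector. The vocabulary, dimensions, and the synthetic "tuned" matrix are illustrative; in practice you would compare the embedding layer's weights before and after training.

```python
import numpy as np

rng = np.random.default_rng(3)
vocab = ["why", "what", "please", "the"]
pretrained = rng.normal(size=(len(vocab), 8))
# stand-in for the embeddings after task training
tuned = pretrained + 0.3 * rng.normal(size=pretrained.shape)

def cosine(a, b):
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

# drift = 1 - cosine similarity; 0 means unchanged, larger means more drift
drift = {w: 1.0 - cosine(pretrained[i], tuned[i]) for i, w in enumerate(vocab)}
most_drifted = max(drift, key=drift.get)
```

Words whose task meaning diverges from their distributional meaning (like “why” in politeness data) should show the largest drift.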
Memory — LSTM Cells
Attention and memory in neural networks are hot topics. Simple RNNs theoretically remember long-range dependencies, but in practice they forget within a few words. A memory mechanism can preserve some of the information in the history.
The Long Short-Term Memory (LSTM) model is one such extension to the RNN. It adds a memory vector, c(t), along with a learned proportion, p(t), that decides how much of each neuron's memory passes through. The latter is called a gate, and it modulates between old and new information using another neural network layer.
p(t) = f(W4 · x(t) + W5 · h(t-1))
c(t) = p(t) ⊙ c(t-1) + other stuff
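The gated memory update can be sketched like this. The "other stuff" is collapsed into a single candidate term gated by (1 − p) for brevity (a full LSTM has separate input and output gates); the weight names Wc_x and Wc_h for the candidate are my own placeholders.

```python
import numpy as np

rng = np.random.default_rng(4)
d_in, d_h = 4, 3
W4 = rng.normal(size=(d_h, d_in))
W5 = rng.normal(size=(d_h, d_h))
Wc_x = rng.normal(size=(d_h, d_in))  # candidate weights (placeholder names)
Wc_h = rng.normal(size=(d_h, d_h))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def memory_step(x_t, h_prev, c_prev):
    # p(t) = f(W4 · x(t) + W5 · h(t-1)); sigmoid keeps it in (0, 1)
    p = sigmoid(W4 @ x_t + W5 @ h_prev)
    candidate = np.tanh(Wc_x @ x_t + Wc_h @ h_prev)  # new information
    # c(t) = p(t) ⊙ c(t-1) + "other stuff"
    c = p * c_prev + (1 - p) * candidate
    return p, c

x = rng.normal(size=d_in)
p, c_new = memory_step(x, np.zeros(d_h), np.zeros(d_h))
```

Because each gate value sits strictly between 0 and 1, the gate decides per neuron how much old memory survives versus how much new information enters.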
Yes, these models have a litany of redundant calculations. They take a scattershot approach to modeling text, since they eschew traditional featurization. The downside is that you need a large amount of data for stable optimization.
We can visualize these gates as well. [gist line 89 for weight matrices but I wasn’t sure how to get the gate outputs themselves]
For the graph of the input gate, i(t), we notice that the words “accommodation”, “discount” and “reservation” are unimportant.
Memory — Attention
Attention is another method of persisting memory. After the RNN layer, it adds (you guessed it!) another neural network layer.
u(t) = f(W6 · h(t))
This is projected to a scalar and normalized so the weights add up to 1.
a(t) = softmax(w7 · u(t))
Attention allows us to get interpretable weights (salience) for each word.
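The two attention equations can be sketched end to end: score each hidden state, softmax the scores into weights, then use those weights as per-word salience. The hidden states here are random stand-ins, and tanh is an assumed choice for f.

```python
import numpy as np

rng = np.random.default_rng(5)
T, d_h, d_a = 6, 3, 4
hiddens = rng.normal(size=(T, d_h))  # stand-in for the RNN hidden states
W6 = rng.normal(size=(d_a, d_h))
w7 = rng.normal(size=d_a)

# u(t) = f(W6 · h(t)), computed for every time step at once
u = np.tanh(hiddens @ W6.T)

# a(t) = softmax(w7 · u(t)): one scalar per word, normalized to sum to 1
scores = u @ w7
a = np.exp(scores - scores.max())
a /= a.sum()

# the attention-weighted summary that feeds the classifier
context = a @ hiddens
```

The vector `a` is exactly the interpretable per-word weighting: plot it over the tokens and you get a salience map for free.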
Composition
RNNs can combine multiple words into a single representation, and that composition should show up in the hidden vector dynamics.
Below is a hidden vector visualization of negation.
We see some of the neurons flip as the negation words are added.
Activation Clustering
The last one is simple. We can manually look at each neuron and see which sentences have the highest (or lowest) values. Then we can hypothesize what the neuron signifies.
For example, Karpathy showed, for a character-level RNN language model, one of the neurons detected the end of the line.
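The procedure is just a sort per neuron. In this sketch the sentences and activation values are made up; in practice the activation matrix would be the final (or per-character) hidden states from the trained model.

```python
import numpy as np

rng = np.random.default_rng(6)
sentences = ["could you please help", "this is crap", "thanks so much",
             "why would you do that", "have a nice day"]
# stand-in for the final hidden state of each sentence (d_h = 3)
activations = rng.normal(size=(len(sentences), 3))

neuron = 0
# rank sentences by this neuron's activation, highest first
order = np.argsort(activations[:, neuron])[::-1]
top = [sentences[i] for i in order[:2]]
bottom = [sentences[i] for i in order[-2:]]
# Reading the top and bottom sentences suggests a hypothesis
# for what the neuron signifies.
```

Repeating this for each neuron index is how you would hunt for interpretable units like Karpathy's end-of-line detector.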