Visualizing LSTM Networks. Part I.
Australian sign language model visualization.
Long Short-Term Memory (LSTM) networks are state-of-the-art tools for long sequence modeling. However, it is hard to understand what they have learned and to investigate why they make particular mistakes. Many articles and papers address this for convolutional neural networks, but for LSTMs there are few tools for looking inside and debugging them.
In this article we try to partially fill this gap. We visualize LSTM activations from a model that classifies Australian Sign Language (Auslan) signs. We do this by training a denoising autoencoder on the activations of the LSTM layer. We use dense autoencoders to project the 100-dimensional vectors of LSTM activations to 2 and 3 dimensions. This lets us explore the activation space visually, at least to some extent. We analyze this low-dimensional space and try to find out how this dimensionality reduction can help in finding relations between examples in the dataset.
Auslan sign classifier
This article is an extension of Miroslav Bartold’s engineering thesis (Bartołd, 2017). The dataset used in the thesis comes from (Kadous, 2002). The dataset consists of 95 Auslan signs, captured using a glove with high-quality position trackers. However, because there was a problem with the data files for one of the signs, we were left with 94 classes. Each sign was performed 27 times by a native signer, and each time step was encoded using 22 numbers (11 per hand). The longest sequence in the dataset had a length of 137, but because long sequences were rare, we kept only those with a length of up to 90 and padded the shorter ones with zeros at the beginning. The dataset and its detailed description can be found here.
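The length filtering and padding described above can be sketched as follows. This is a minimal illustration on synthetic data, not the preprocessing code from the thesis; the helper names pad_at_beginning and build_batch are hypothetical.

```python
import numpy as np

MAX_LEN = 90      # longest sequence kept, as described above
N_FEATURES = 22   # 11 tracker values per hand


def pad_at_beginning(sequence, max_len=MAX_LEN):
    """Left-pad a (time, features) array with zeros up to max_len steps."""
    sequence = np.asarray(sequence, dtype=np.float32)
    n_steps, n_features = sequence.shape
    if n_steps > max_len:
        raise ValueError("sequence is longer than max_len")
    padded = np.zeros((max_len, n_features), dtype=np.float32)
    # The real samples go at the end, so the zeros end up at the beginning.
    padded[max_len - n_steps:] = sequence
    return padded


def build_batch(sequences, max_len=MAX_LEN):
    """Drop sequences longer than max_len and stack the padded rest."""
    kept = [s for s in sequences if len(s) <= max_len]
    return np.stack([pad_at_beginning(s, max_len) for s in kept])


# Synthetic stand-ins for three recordings of different lengths.
rng = np.random.default_rng(0)
raw = [rng.normal(size=(n, N_FEATURES)) for n in (57, 90, 137)]
batch = build_batch(raw)
print(batch.shape)
```

With padding at the beginning, the last time step of every padded sequence is a real measurement, which matters for a model that classifies from the last LSTM activation.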
In his thesis, Miroslav tested several classifiers, all based on the LSTM architecture. Classification accuracy was around 96%. For people unfamiliar with the subject, there is a very good explanation of LSTM networks on Christopher Olah’s blog.
In this research we focused on a single architecture with one hidden layer of 100 LSTM units. The final classification layer had 94 neurons. The inputs were 22-dimensional sequences of 90 time steps. We used the Keras functional API, and the network's architecture is presented in Figure 1.
The Lambda element shown in Figure 1 was used to extract the last activation from the full sequence of activations (since we passed return_sequences=True to the LSTM). For implementation details we refer the Reader to our repository.
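A minimal sketch of such an architecture in the Keras functional API is shown below. It assumes the tf.keras import path and a softmax output, neither of which is stated in the text, and it is not the code from the repository; the layer names and the second model, activation_model, are hypothetical additions for illustration.

```python
from tensorflow import keras
from tensorflow.keras import layers

N_STEPS, N_FEATURES = 90, 22   # input: 90 time steps of 22 tracker values
N_UNITS, N_CLASSES = 100, 94   # 100 LSTM units, 94 sign classes

inputs = keras.Input(shape=(N_STEPS, N_FEATURES))

# return_sequences=True exposes the activations at every time step,
# which is what the visualizations later in the article are built on.
activations = layers.LSTM(N_UNITS, return_sequences=True, name="lstm")(inputs)

# The Lambda layer keeps only the activation vector of the last time step.
last_activation = layers.Lambda(lambda x: x[:, -1, :], name="last_step")(activations)

# Assumption: a softmax classification layer over the 94 signs.
probabilities = layers.Dense(N_CLASSES, activation="softmax", name="classifier")(last_activation)

classifier = keras.Model(inputs, probabilities)

# A second model sharing the same layers returns the full activation sequence.
activation_model = keras.Model(inputs, activations)
```

Because both models share the LSTM layer, training classifier also determines what activation_model returns, so the activation sequences can be read out after training without changing the classifier.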
First attempt to understand the internals of the LSTM network
Inspired by (Karpathy, 2015), we wanted to locate neurons responsible for sub-gestures that are easily recognizable by humans (and shared between different signs), like making a fist or drawing a circle with a hand. This approach failed for five main reasons:
- The signal from position trackers is insufficient to fully reconstruct the motion of hands,
- The representation of a gesture is very different in the space of trackers and in reality,
- We have only videos of gestures from http://www.auslan.org.au, and not videos of the actual executions of the signs in the dataset,
- Words in the dataset and in the videos on http://www.auslan.org.au may originate from different dialects, so the comparison could be like comparing the words “underground” and “subway”,
- 100 neurons and 94 signs form a very large space for a person to comprehend.
Therefore, we focused only on visualization techniques, in the hope that they would help us reveal some mysteries of the LSTM cell and the dataset.
In order to visualize the LSTM output activation sequences for all gestures, we project the 100-dimensional vectors representing the activations at each time step to 2- or 3-dimensional vectors using denoising autoencoders. Our autoencoders are composed of 5 fully connected layers, with the 3rd layer as a bottleneck with a linear activation function. If you are unfamiliar with the topic you can read more about autoencoders here.
The linear activation function turned out to be the best bottleneck activation for producing legible plots. For all tested activation functions, all example paths (the term will be explained in the next section) start near the (0,0) point on the plot. For non-antisymmetric functions (ReLU and sigmoid), all example paths were in the upper right quarter of the coordinate system. For antisymmetric (odd) functions, like tanh and linear (the identity function), the paths were more or less evenly distributed over all quarters. However, the tanh function squashed some paths near -1 and 1 (which made the plot too fuzzy), whereas the linear function did not. If you are interested in visualizations for other activation functions, you can find the code in the repository.
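One way to read the observation above: ReLU and sigmoid never output negative values, so a bottleneck using them can only produce points in the upper right quarter, while tanh and the identity are odd functions and can produce both signs. The sketch below checks this on synthetic pre-activations; the random inputs are an assumption for illustration and are not activations from the trained autoencoder.

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


# Synthetic 2D pre-activations standing in for the bottleneck inputs.
rng = np.random.default_rng(0)
pre_activations = rng.normal(scale=3.0, size=(1000, 2))

bottleneck = {
    "relu": relu(pre_activations),
    "sigmoid": sigmoid(pre_activations),
    "tanh": np.tanh(pre_activations),
    "linear": pre_activations,
}

for name, points in bottleneck.items():
    # Fraction of points with both coordinates non-negative.
    upper_right = np.all(points >= 0, axis=1).mean()
    largest = np.abs(points).max()
    print(f"{name:8s} upper-right fraction: {upper_right:.2f}  max |value|: {largest:.2f}")
```

The tanh outputs are also bounded by 1 in absolute value, which is consistent with paths being squashed near -1 and 1, while the linear bottleneck has no such bound.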
In Figure 2 we present the architecture of the 2D autoencoder. The 3D autoencoder was almost identical except it had 3 neurons in the 3rd dense layer.
Autoencoders were trained on vectors of LSTM cell output activations for every single time step of each gesture realization. These activation vectors were shuffled and some redundant ones were removed. By redundant activation vectors we mean those from the beginning and the end of each gesture, where the activations remained approximately constant.
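The text does not say how "approximately constant" was decided, so the following is only one possible sketch. It assumes a hypothetical tolerance on the largest change between consecutive time steps, and the helper names trim_constant_ends and build_training_vectors are made up for illustration.

```python
import numpy as np


def trim_constant_ends(activations, tolerance=1e-3):
    """Drop leading and trailing steps where the activations barely change.

    activations: array of shape (time, units) for one gesture.
    tolerance: hypothetical threshold on the largest per-unit change
               between two consecutive time steps.
    """
    activations = np.asarray(activations)
    # step_change[i] is the largest change between step i and step i + 1.
    step_change = np.abs(np.diff(activations, axis=0)).max(axis=1)
    moving = np.flatnonzero(step_change > tolerance)
    if moving.size == 0:
        return activations[:1]  # fully constant sequence: keep one vector
    first, last = moving[0], moving[-1] + 1
    return activations[first:last + 1]


def build_training_vectors(sequences, tolerance=1e-3, seed=0):
    """Trim every gesture, pool all remaining vectors and shuffle them."""
    vectors = np.concatenate([trim_constant_ends(s, tolerance) for s in sequences])
    rng = np.random.default_rng(seed)
    rng.shuffle(vectors)  # shuffles rows in place
    return vectors


# Synthetic gesture: 10 constant steps, a 20-step ramp, 5 constant steps.
flat_start = np.zeros((10, 100))
ramp = np.linspace(0.0, 1.0, 20)[:, None] * np.ones((1, 100))
flat_end = np.ones((5, 100))
gesture = np.concatenate([flat_start, ramp, flat_end])

trimmed = trim_constant_ends(gesture)
vectors = build_training_vectors([gesture, gesture])
print(trimmed.shape, vectors.shape)
```

After shuffling, the autoencoder sees individual activation vectors with no information about which gesture or time step they came from.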
The noise was added to the autoencoder's input vectors and was drawn from a normal distribution with mean 0 and standard deviation 0.1. The network was trained with the Adam optimizer to minimize the mean squared error.
By feeding the sequence of LSTM cell activations corresponding to a single gesture to the autoencoder, we obtain a sequence of activations on the bottleneck. We refer to this lower-dimensional sequence of bottleneck activations as an example path.
Near the last step of some example paths we show the name of the sign they represent.
In Figure 3 we present a visualization of example paths for the training set.
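A denoising autoencoder matching this description could look roughly like the sketch below. The five dense layers, the linear bottleneck, the noise level, the Adam optimizer and the mean squared error come from the text; the hidden layer widths, the ReLU hidden activations, the linear output layer and the tf.keras import path are assumptions, and the actual architecture is the one shown in Figure 2.

```python
from tensorflow import keras
from tensorflow.keras import layers

N_UNITS = 100      # size of one LSTM activation vector
BOTTLENECK = 2     # use 3 for the 3D variant
NOISE_STD = 0.1    # standard deviation of the input noise
HIDDEN_1 = 64      # hypothetical width of the 1st dense layer
HIDDEN_2 = 32      # hypothetical width of the 2nd and 4th dense layers

inputs = keras.Input(shape=(N_UNITS,))

# Zero-mean Gaussian noise, applied to the inputs only during training.
noisy = layers.GaussianNoise(NOISE_STD)(inputs)

x = layers.Dense(HIDDEN_1, activation="relu")(noisy)
x = layers.Dense(HIDDEN_2, activation="relu")(x)
# 3rd dense layer: the bottleneck with a linear activation.
code = layers.Dense(BOTTLENECK, activation="linear", name="bottleneck")(x)
x = layers.Dense(HIDDEN_2, activation="relu")(code)
reconstruction = layers.Dense(N_UNITS, activation="linear")(x)

autoencoder = keras.Model(inputs, reconstruction)
autoencoder.compile(optimizer="adam", loss="mse")

# The encoder shares weights with the autoencoder and returns the
# low-dimensional bottleneck activations used for plotting.
encoder = keras.Model(inputs, code)
```

Training would then use the clean activation vectors as both input and target, for example autoencoder.fit(vectors, vectors), and calling encoder.predict on the (time, 100) activation sequence of one gesture would return its bottleneck sequence of shape (time, 2).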
Each point in the visualization represents the 2D activation from the autoencoder for a single time step of one example. The color scale represents the time step (from 0 to 90) in each sign execution, and the black lines connect points belonging to a single example path. Before visualization, each point was transformed by the function lambda x: numpy.sign(x) * numpy.log1p(numpy.abs(x)). This transformation allowed us to look more closely at the beginning of each path.
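The effect of this transformation can be checked on a few values. It keeps the sign, stays close to the identity near zero and compresses large magnitudes logarithmically, so the crowded region around the origin takes up a larger share of the plot. The function name signed_log1p and the sample values are illustrative only.

```python
import numpy as np


def signed_log1p(x):
    """Sign-preserving log transform, as used before plotting."""
    return np.sign(x) * np.log1p(np.abs(x))


points = np.array([-100.0, -1.0, -0.01, 0.0, 0.01, 1.0, 100.0])
transformed = signed_log1p(points)

for before, after in zip(points, transformed):
    print(f"{before:8.2f} -> {after:7.4f}")
```

The function is odd and strictly increasing, so the ordering of points along each axis and the quarter they fall into are unchanged.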
In Figure 4 we present the activations at the last step of each training example. This is the 2D projection of the input to the classification layer.
It is quite surprising that all paths look very smooth and stay localized in their own parts of the space, because the activations for all time steps and examples were shuffled before training the autoencoder. The spatial structure in Figure 4 explains why our final classification layer reaches good accuracy on such a small training set (about 2000 examples).
For those who are interested in exploring this 2D space, we have rendered a much bigger version of Figure 3 here.
In Figure 5 we present a visualization of the LSTM activations in 3D. For the sake of clarity we show only the points. For data analysis purposes we focus only on 2D visualizations in the second part of this article.
The visualizations look really nice, but do they carry any deeper meaning? Does the closeness of some paths mean that these signs are more similar?
Let us look at this space when the signs are partitioned into right-handed and two-handed ones (we did not observe any signs performed with the left hand only). This partition was based on statistics of the variability of each hand's tracker signals. More details can be found in our repository.
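The exact statistics are in the repository, so the following is only a hedged sketch of the idea. It assumes a hypothetical channel layout (first 11 channels for the right hand, last 11 for the left), a hypothetical threshold, and unpadded input sequences; the helper names are made up.

```python
import numpy as np

# Hypothetical channel layout: 11 values per hand.
RIGHT = slice(0, 11)
LEFT = slice(11, 22)


def hand_variability(sequence, channels):
    """Mean over channels of the standard deviation over time."""
    return float(np.std(sequence[:, channels], axis=0).mean())


def classify_handedness(sequence, ratio_threshold=0.2):
    """Label an unpadded (time, 22) recording as right- or two-handed.

    ratio_threshold is a hypothetical cut-off: the left hand counts as
    active when its variability exceeds this fraction of the right hand's.
    """
    right = hand_variability(sequence, RIGHT)
    left = hand_variability(sequence, LEFT)
    return "two-handed" if left > ratio_threshold * right else "right-handed"


# Synthetic recordings: one with a still left hand, one with both hands moving.
rng = np.random.default_rng(0)
one_handed = np.full((60, 22), 0.5)
one_handed[:, RIGHT] = rng.normal(size=(60, 11))
two_handed = rng.normal(size=(60, 22))

print(classify_handedness(one_handed), classify_handedness(two_handed))
```

Since no left-hand-only signs were observed, the sketch only has to decide whether the left hand moves noticeably relative to the right one.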
For the sake of clarity, in Figure 6 we plot only the paths, without points. Right-handed signs are marked in cyan and two-handed signs in magenta. We can clearly see that the two types of signs occupy different parts of the space and do not mix with each other very often.
Now let us look at the pair drink-danger (the names of the signs link to videos on Auslan Signbank). These are the two cyan gestures occupying the middle-right, mostly magenta part of Figure 6. In our data these two gestures are one-handed, but in the video from Auslan Signbank danger is clearly two-handed.
This might be caused by mislabeling. Notice that the sign dangerous is indeed one-handed, and also similar to drink (at least in the first part of the gesture). We therefore concluded that the label danger should actually be dangerous. We plot these two gestures in Figure 7.
The who and soon signs look similar in Figure 8. The glove has only one bend tracker and the finger bend measurements are not very exact (as noted in the data description). This is why these two gestures can look more similar in Figure 8 than in the videos.
The example paths of the crazy and think signs occupy the same space in Figure 9. However, think seems to be the main part of the slightly longer crazy gesture. When we look at the videos on Auslan Signbank, we see that this relation holds: the crazy sign looks like the think sign followed by a palm spread.
However, when we look at the you sign in Figure 10, we can see that it runs perpendicular to other gestures like crazy, think and sorry (and many others not shown here). When we look at the videos on Signbank, we cannot see any similarity between these signs and you.
We have to remember that each LSTM cell state has its own memory of the past and is fed by the input sequence at each time step, and that paths may occupy the same space at different moments in time. Therefore, more variables determine the shape of a path than we take into account in this analysis. This is probably why we can observe some example paths crossing without any visual similarity between the signs.
Some of the close relations suggested by this visualization turn out to be false. Some of them change when the autoencoder or the LSTM model is retrained. Others do not, and those are more likely to represent real similarities. For example, God and science sometimes share similar paths in the 2D space and sometimes are far away from each other.
Finally, let us look at the misclassified examples. In Figures 11, 12 and 13 we visualize them for the training, validation and test sets respectively. The blue label above a misclassified example is its true class. The label chosen by the model is shown below it, in red.
As expected, there are more misclassified examples in the validation and test sets, but the mistakes are more often made between gestures that are close to each other in the projected space.
Lastly, we generated a video visualizing how the activations develop during prediction.
We projected 100-dimensional vectors of activations to a low-dimensional space. The projection looks interesting and seems to preserve many, but not all, relations between signs. These relations seem to be similar to the relations we perceive while watching the gestures in real life, but without the actual videos matching the recorded gestures, we could not determine this beyond any doubt.
These tools can be used, to some extent, to look into the structure of the LSTM representation, and they can be a better way of finding relations between examples than working with the raw input.
- Kadous, M. W., “Temporal Classification: Extending the Classification Paradigm to Multivariate Time Series”, PhD Thesis (draft), School of Computer Science and Engineering, University of New South Wales, 2002
- Karpathy, A., Johnson, J. and Fei-Fei, L., “Visualizing and understanding recurrent networks”, arXiv preprint arXiv:1506.02078, 2015
- Bartołd, M., “Wykorzystanie sieci LSTM do rozpoznania znaków języka migowego”, Engineering Thesis, Polish-Japanese Institute of Information Technology, 2017