Photo by JJ Ying on Unsplash

Stop Thinking with your Head — Start Thinking with…

I guess most of you know the feeling of struggling through a very hard scientific paper: reading every sentence three times and every paragraph twice, and in the end you have not caught much more than the abstract and conclusion together would have told you anyway.

Well, I want to tell you about that one paper I read, which is different. Very different!

I know the codebase well enough that the cruft doesn’t slow me down. Like Han in the Millenium Falcon, it’s a hunk of junk, but it’s a hunk of junk that I know how to fly.

That is one example sentence from Stephen Merity’s paper “Single Headed Attention RNN: Stop Thinking with your Head”. In this article I want to describe the gist of this paper. You don’t need much prior knowledge beyond some basics of neural networks and NLP (Natural Language Processing); everything else I will try to explain as simply as I can.

Figure 1: LSTM, Tamagotchi, Click Wheel iPod, Car Phone

First, I want to take you on a quick journey. What do these things have in common: an LSTM (Long Short-Term Memory) model, a Tamagotchi, a click wheel iPod and a car phone? According to most researchers, all four are outdated. But now imagine that the one paper everybody in the NLP community knows so well, “Attention is all you need”, had never been published. What would have happened?

Do we expect research progress to stall until multi-head attention is eventually reconstructed in a secret underground bunker built to protect the last few surviving bright eyed researchers from the unwarranted attacks of Reviewer #2?

No, I don’t think so. Most likely progress would have been made in a different research area, e.g. LSTMs. So let’s dive into the paper and understand what Stephen Merity did here:

The Task

Figure 2: Language Model Task

The task of the language model is the usual one in NLP: predicting the (n+1)th token given the n previous tokens, as you can see in Figure 2. Or, to say it in Stephen’s words:

That awkward date where they’re trying desperately to finish your sentences to show you how smart they are - even if they’re getting every guess wrong.

A lone GPU

Stephen Merity decided to train the entire language model on a single GPU. He did not want to lose a lot of money buying compute, especially as that compute would still be waiting for him in case he really needed it. In my opinion, he also takes a stand for sustainable compute, which is much more of a “hot topic” nowadays than it was back then.

Irrational as it seems I didn’t want to use a cluster in the cloud somewhere, watching the dollars leave my bank account as I run various experiments.


Figure 3: The SHA-LSTM Architecture

You can see the model in Figure 3. It consists of an LSTM, a single attention head and a Boom layer. I want to elaborate on each of these:


The LSTM is a solution to the short-term memory problem of plain RNNs: it consists of gates that learn which information to keep and which to forget. I don’t want to dive deeper into LSTMs because they were already so beautifully explained by Michael Phi in this article. I totally recommend checking it out!

Stephen Merity, however, did not use a normal LSTM. He used an AWD-LSTM. I know this looks like somebody accidentally sat on my keyboard, so what does it mean?

AWD-LSTM stands for “ASGD Weight-Dropped LSTM”. I guess that still sounds kind of confusing, so let’s break it down:

Figure 4: ASGD (Averaged Stochastic Gradient Descent) Formula

ASGD (Averaged Stochastic Gradient Descent) is a variant of SGD, which I assume you are familiar with. ASGD uses the same update step as SGD, but instead of returning the last iterate it returns an average over the later iterates. You can see the equation in Figure 4: K is the total number of iterations, and T<K is set by the user to control the iteration at which the averaging starts. Why, you ask? Because averaging smooths out the noise of the individual SGD steps, which gives better convergence and therefore an improved training process.
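To make the averaging idea concrete, here is a minimal sketch in plain Python. It assumes the usual form of the ASGD output, the mean of the iterates w_T, ..., w_K, since the formula in Figure 4 is not reproduced in the text. The toy objective (w - 3)^2, the noise level and all names (`asgd`, `noisy_grad`) are made up for illustration and are not from the paper.

```python
import random

def asgd(grad_fn, w0, lr=0.1, K=200, T=100, seed=0):
    """Run K plain SGD steps on a scalar weight.

    Returns the last iterate and the average of the iterates T..K.
    """
    rng = random.Random(seed)
    w = w0
    total = 0.0
    for i in range(1, K + 1):
        w = w - lr * grad_fn(w, rng)  # the ordinary SGD update step
        if i >= T:                    # averaging only starts at iteration T
            total += w
    w_avg = total / (K - T + 1)       # mean over the K - T + 1 kept iterates
    return w, w_avg

def noisy_grad(w, rng):
    # Gradient of the toy objective (w - 3)^2 plus Gaussian noise,
    # standing in for the noise of a mini-batch gradient.
    return 2.0 * (w - 3.0) + rng.gauss(0.0, 1.0)

w_last, w_avg = asgd(noisy_grad, w0=0.0)
print("last iterate:", w_last)
print("averaged    :", w_avg)
```

The last iterate keeps jittering around the optimum at 3, while the average over the second half of training sits much more steadily near it. This is the effect the averaging is meant to buy.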

Figure 5: Drop-Out vs Drop-Connect

Weight-Dropped refers to Drop-Connect, a variant of Drop-Out. Instead of setting activations to zero, as is common in Drop-Out, individual weights are set to zero. You can check out the difference in Figure 5. Drop-Connect can be seen as a generalization of Drop-Out because you can represent more possible models by disabling single weights rather than entire neurons.
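The difference is easy to see in a few lines of code. The following is a toy sketch for a single dense layer, not the AWD-LSTM implementation: the function names are hypothetical, and the rescaling by 1/(1-p) that real implementations usually apply is left out to keep the comparison bare.

```python
import random

def dropout_forward(W, x, p, rng):
    """Drop-Out: compute every activation, then zero whole activations."""
    out = []
    for row in W:
        act = sum(w * xi for w, xi in zip(row, x))
        keep = rng.random() >= p          # one coin flip per output neuron
        out.append(act if keep else 0.0)
    return out

def dropconnect_forward(W, x, p, rng):
    """Drop-Connect: zero individual weights before computing activations."""
    out = []
    for row in W:
        # one coin flip per weight, so a neuron can lose only some inputs
        masked = [w if rng.random() >= p else 0.0 for w in row]
        out.append(sum(w * xi for w, xi in zip(masked, x)))
    return out

W = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]
x = [1.0, 1.0, 1.0]
rng = random.Random(0)
print("dropout    :", dropout_forward(W, x, 0.5, rng))
print("dropconnect:", dropconnect_forward(W, x, 0.5, rng))
```

With Drop-Out each output is either the full sum or zero. With Drop-Connect an output can be any partial sum of its inputs, which is why it covers more possible sub-models.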

Let’s continue with the Model from Figure 3.

Single Head of Attention

First, does anyone have any freaking clue how many attention heads we need? Second, why are we putting so much work into the attention layer? Are we sure about the benefit that this swarm of attention heads, each involving substantial compute, bring?

Figure 6: A Single Attention Head

These are the reasons Stephen Merity names for using only one attention head. And of course there is the fact that he stopped running out of memory when using just one attention head, because you don’t need to store the large key and value tensors.

As you can see in Figure 6, the entire architecture is built to be computationally efficient. The only matrix multiplication is the one applied to the query; apart from that, only vector-vector operations are used, which need little compute. “A”, for example, is only a scaled dot product.
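Here is a simplified sketch of such a head in plain Python, to show where the compute goes. It is my reading of the description above, not the paper’s code: the elementwise gate vectors `gate_k` and `gate_v` stand in for the vector-vector operations on keys and values, normalization layers are left out, and all names are hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)                              # subtract max for stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def single_head_attention(h, memory, Wq, gate_k, gate_v):
    """One attention head over `memory` (a list of past hidden states)."""
    d = len(h)
    # The only learned matrix multiplication: projecting the query.
    q = [sum(Wq[i][j] * h[j] for j in range(d)) for i in range(d)]
    # Keys and values are just the memory, rescaled elementwise by a vector.
    keys = [[g * m for g, m in zip(gate_k, row)] for row in memory]
    values = [[g * m for g, m in zip(gate_v, row)] for row in memory]
    # "A": scaled dot product between the query and every key, then softmax.
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
    A = softmax(scores)
    # Output: attention-weighted sum of the values.
    out = [sum(a * v[i] for a, v in zip(A, values)) for i in range(d)]
    return A, out

identity = [[1.0, 0.0], [0.0, 1.0]]
A, out = single_head_attention(
    h=[1.0, 0.0],
    memory=[[1.0, 0.0], [0.0, 1.0]],
    Wq=identity,
    gate_k=[1.0, 1.0],
    gate_v=[1.0, 1.0],
)
print("attention weights:", A)
print("output           :", out)
```

The query matches the first memory entry, so that entry gets the larger attention weight. Note that keys and values need no weight matrices of their own here, which is where the memory savings come from.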

The BOOM layer

First of all, why is it called a Boom layer? Because we take a vector which is small in the beginning (1024), make it big (4096) and then make it small again (1024).

It’s really not that hard to visualize — use your hands if you need to whilst shouting “boooOOOOmmm”

The Boom layer is related to the feed-forward layer in a Transformer model. However, there is no down-projection layer, as this would again require a lot of compute. Instead, the large vector is split into N smaller vectors, which are then summed together. This does roughly the same job without the extra parameters. According to Stephen Merity, it even results in better convergence.

Figure 7: GELU Activation

The Boom layer also uses a GELU activation, which you can see in Figure 7. The main benefit of that activation is that it helps against the vanishing gradient problem, where the gradient gets too small to actually have an impact on the weights and learning therefore stops.
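The Boom layer described above, including its GELU activation, can be sketched in a few lines. This is a toy version under stated assumptions: the sizes are shrunk from 1024/4096 to 2/4, biases are omitted, GELU uses the common tanh approximation, and the names `gelu`, `boom` and `W_up` are mine, not the paper’s.

```python
import math

def gelu(x):
    # Widely used tanh approximation of the GELU activation.
    return 0.5 * x * (1.0 + math.tanh(
        math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

def boom(x, W_up, n_chunks):
    """Small -> big -> small, without a down-projection matrix."""
    d = len(x)
    # Up-projection from d to n_chunks * d values, followed by GELU.
    big = [gelu(sum(w * xi for w, xi in zip(row, x))) for row in W_up]
    # Split the big vector into n_chunks vectors of size d and add them up.
    return [sum(big[c * d + i] for c in range(n_chunks)) for i in range(d)]

# Toy sizes: d = 2, blown up to 4, then folded back to 2.
W_up = [[1.0, 0.0],
        [0.0, 1.0],
        [1.0, 0.0],
        [0.0, 1.0]]
y = boom([1.0, 2.0], W_up, n_chunks=2)
print(y)
```

The only parameters live in `W_up`. The way back down is a parameter-free sum over chunks, which is exactly the saving compared to a Transformer feed-forward layer with its second weight matrix.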

Model Architecture

Figure 8: Language Model Architecture

Let’s take a look at the entire model architecture shown in Figure 8. We start with an embedding layer. The embedding layer gives us a dense representation of the words that also captures their meaning. It can be trained on text data and then be reused for many projects.
After the embedding layer we have the SHA-LSTM, which I just explained, and afterwards a softmax classifier, which lets us decide which token should come next in our sequence.

There is also one special thing here, called tied weights. That means that the embedding layer and the softmax classifier use the same weights. This reduces the number of parameters and therefore the amount of memory and compute (you can see a pattern here?) and even improves the performance according to the author.
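Weight tying is easier to grasp in code than in words. In this minimal sketch the matrix `E` is used twice: once as a lookup table for the input token and once as the output layer in front of the softmax. The three-token vocabulary, the numbers and the function names are invented for illustration.

```python
import math

# One shared matrix: vocabulary size 3, embedding size 2 (toy values).
E = [
    [0.1, 0.3],
    [0.7, -0.2],
    [-0.4, 0.5],
]

def embed(token_id):
    # Input side: look up the row that belongs to the token.
    return E[token_id]

def next_token_probs(hidden):
    # Output side: score every token with the SAME rows of E, then softmax.
    logits = [sum(e * h for e, h in zip(row, hidden)) for row in E]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

# Pretend the network in between is the identity, just to run the example.
probs = next_token_probs(embed(1))
print(probs)
```

Without tying, the model would need a second vocabulary-sized matrix for the classifier. With tying, that matrix simply does not exist, so the parameter count of these two layers is halved.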


Figure 9: EnWik8 dataset on the left and WikiText-2 dataset on the right.

Stephen Merity used two datasets, EnWik8 and WikiText-2. You can see an example of both in Figure 9.

The EnWik8 dataset is a byte-level dataset which consists of the first 100 million bytes of a Wikipedia XML dump. As information is spread quite broadly across the XML files, the dataset is usually used to measure a model’s ability to compress data. There is even a prize for that competition, called the Hutter Prize.

The WikiText dataset is slightly more preprocessed than EnWik8. It also contains Wikipedia articles and has a closed vocabulary. In contrast to other datasets such as PTB (Penn Treebank), it retains the original case, numbers and punctuation of the articles. It is usually used to check whether a model is able to take advantage of long-term dependencies, as Wikipedia articles are usually quite long.


Figure 10: Results

In Figure 10 you see a table with the results Stephen Merity achieved with his language model. The values in the test column are bits per character (bpc), the average number of bits needed to encode one character. ASCII, for example, uses 7 or 8 bpc (standard vs. extended version). Bpc is a measure of entropy. So what is entropy?


Entropy is the average amount of information conveyed in a message. It measures how much information one letter in a text can “carry”. In 8-bit ASCII each letter is composed of 8 bits. However, this is not the most efficient way to represent the letters. One could exploit the fact that some letters are far more common (e.g. “e”) and should therefore be represented by fewer bits, whereas rare letters (e.g. “q”) can use some more bits, as we do not need to encode them that often. This entropy can, for example, be expressed in bpc, which is frequently used for language models.
Why do we use entropy to compare language models? The better a language model predicts the next character, the fewer bits are needed on average to encode the text with the help of its predictions. A model that assigns a high probability to what actually comes next gets a low bpc; a model that keeps guessing wrong gets a high one. That’s why entropy is a good measurement to compare language models.
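A few lines of Python make both ideas tangible. The first function computes the entropy of a text from its letter frequencies alone. The second computes bpc the way it is used for language models, as the average of -log2(p) over the probabilities a model assigned to the characters that really came next. Both function names and the example probabilities are made up for this sketch.

```python
import math
from collections import Counter

def entropy_bits_per_char(text):
    """Entropy of the character frequencies in `text`, in bits per character."""
    counts = Counter(text)
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def bits_per_char(probs_of_true_chars):
    """Average -log2(p) over the probabilities a model gave the true characters."""
    return -sum(math.log2(p) for p in probs_of_true_chars) / len(probs_of_true_chars)

# Four equally likely letters need 2 bits each; two need only 1 bit.
print(entropy_bits_per_char("abcd"))
print(entropy_bits_per_char("abab"))

# A model that spreads its belief evenly over 256 byte values pays 8 bits
# per character, just like extended ASCII.
print(bits_per_char([1 / 256, 1 / 256, 1 / 256]))

# A model that is fairly sure about what comes next pays far less.
print(bits_per_char([0.5, 0.5, 0.5]))
```

Seen this way, the scores around 1 bpc in the results table say that the model needs roughly one bit per character where plain extended ASCII needs eight.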

Back to the results.

Sadly I didn’t hit state of the art results, at least as they stand today with minimal hyper-parameter exploration.

In Figure 10 you can see that the earlier LSTM models have a higher bpc than the Transformer models, and furthermore that the SHA-LSTMs perform worse than the LSTMs and the Transformer models. However, Stephen Merity says:

Directly comparing the head count for LSTM models and Transformer models obviously doesn’t make sense but neither does comparing zero-headed LSTMs against bajillion headed models and then declaring an entire species dead.

And there is one really interesting thing to find in the results: the night before Stephen Merity submitted the paper, he ran the experiment with an SHA-LSTM that really had only one attention head instead of one attention head per layer (= 4 heads). The model with only a single head performed almost as well as the one with an attention head per layer: 1.076 bpc vs 1.068 bpc. The single-headed one took only 30 minutes of training time per epoch compared to 67 minutes for the 4-headed model, which is more than double the time. This is quite a fascinating finding!

Now let’s go back to our thought experiment from the beginning, where we imagined that “Attention is all you need” had never been published. What if this single-headed attention experiment had been published instead? Maybe we would have concentrated all our research power on LSTMs then…


So is Stephen Merity saying that Transformers are useless? No, of course not! I found these five take-away messages:

  1. There should still be competition and variety in the types of models that we apply to our tasks. We can never be sure that the direction we are currently going in is the correct one.
  2. Deep learning is the ultimate spaghetti code.
    If you make the smallest mistake in the implementation, the model will always find it and then pretend it is working, although it is doing some crazy background logic which you do not even understand yourself.
  3. Mistakes help to understand.
    Stephen Merity admits to having made some mistakes himself that helped him understand quite a lot. He has an entire paragraph where he rewrites the mistakes as if they were made on purpose, as this is what is expected from a good paper. However, let’s face it: we all make mistakes and learn from them. Why not admit it in the first place, help other people avoid the same mistakes, and not be ashamed of them?
  4. Sustainability in compute.
    Nowadays this is much more of a topic than back then, but why use compute just because it is easily accessible? More parameters do not always result in a better model! Choose compute wisely!
  5. Normally written papers are just so fun to work with!
    Check out the paper! It is worth your time and fun to read!

One final quote from Stephen Merity:

Perhaps we were too quick to throw away the past era of models simply due to a new flurry of progress.

And one final quote from me:

Stop thinking with your head, start thinking with the swarm intelligence we have when a lot of researchers are involved in different directions.


