Last month (on December 8th 2021) DeepMind published a paper presenting a clever trick that makes transformer models perform better on language modelling tasks without needing as many parameters as recent models have used. This is very interesting because a lot of progress in this domain has come from simply making models bigger and throwing more computing power at them. That works, but it means groups who don’t have millions of dollars to spend on training models can’t compete with the big industry research labs that do.
Brief history of language modelling in the last 5 years
Language modelling used to be done with RNNs. RNNs suffer from what’s called the vanishing gradient problem: they effectively “forget” older information in the sequence of data you feed them. This constrained their performance on language tasks (imagine having to answer someone’s question when you can never remember the first half of what they said).
Then transformer models came out in 2017. In the original paper (“Attention Is All You Need”) the authors showed that you could get away without any of the recurrent stuff and just use attention mechanisms.
Very roughly, attention mechanisms are about teaching the model to give more weight to the more relevant bits of information in the input sequence, rather than weighting them just based on how old or recent they are. Because information no longer has to be passed along step by step through the whole sequence, this sidesteps the vanishing gradient problem.
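To make that a bit more concrete, here is a minimal sketch of scaled dot-product attention, the core operation inside a transformer. It’s not the full multi-head version from the paper, and the function and variable names are just illustrative, but it shows the key idea: every position gets a relevance-weighted mix of every other position, no matter how far apart they are.

```python
# A minimal sketch of scaled dot-product attention using NumPy.
# Names and shapes are illustrative, not taken from any particular library.
import numpy as np

def scaled_dot_product_attention(queries, keys, values):
    """queries, keys, values: arrays of shape (seq_len, d_model)."""
    d_model = queries.shape[-1]
    # How relevant is each position to each other position?
    scores = queries @ keys.T / np.sqrt(d_model)      # (seq_len, seq_len)
    # Softmax turns the scores into weights that sum to 1 per row.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output position is a relevance-weighted mix of all the values,
    # regardless of how old or recent each position is in the sequence.
    return weights @ values

# Toy example: 4 tokens, 8-dimensional representations (self-attention).
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(x, x, x)
print(out.shape)  # (4, 8)
```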