Rethinking Attention with Performers: Towards New Transformers’ Revolution
For the 10th edition of the LightOn AI Meetup we had Krzysztof Choromanski, Research Scientist at Google Robotics NYC and Adjunct Assistant Professor at Columbia University, presenting his work Rethinking Attention with Performers, which will appear as an oral presentation at ICLR 2021. Congratulations to Krzysztof and his co-authors!
The 📺 recording of the meetup is on LightOn’s YouTube channel. Subscribe to the channel and to our Meetup to get notified of upcoming videos and events!
The leading approaches in language modeling are all obsessed with TV shows of my youth — namely Transformers and Sesame Street. Transformers this, Transformers that, and over here a bonfire worth of GPU-TPU-neuromorphic wafer scale silicon.
From the abstract of Single Headed Attention RNN: Stop Thinking With Your Head, by Stephen Merity
Indeed, it is hard to ignore the surge of Transformer architectures across every machine learning benchmark in the past few years. It really looks like “Attention is All You Need”.
Moreover, attention mechanisms lend themselves well to parallelization ⛓️ and avoid catastrophic forgetting 😶🌫️. However, they do not scale well 📈: the memory cost grows quadratically with the sequence length. This is a roadblock 🚧 for applications in robotics or bioinformatics, for example.
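To make the quadratic cost concrete, here is a minimal NumPy sketch of standard attention (shapes and variable names are illustrative, not from the paper): the score matrix alone has L × L entries, so doubling the sequence length quadruples the memory.

```python
import numpy as np

L, d = 1024, 64                      # sequence length, head dimension
rng = np.random.default_rng(0)
Q = rng.standard_normal((L, d))      # queries
K = rng.standard_normal((L, d))      # keys
V = rng.standard_normal((L, d))      # values

scores = Q @ K.T / np.sqrt(d)        # (L, L): memory grows as L**2
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)   # row-wise softmax
out = weights @ V                    # (L, d) output

print(scores.shape)                  # (1024, 1024)
```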
Many works have proposed solutions based on priors, such as local or sparse attention. While this is good enough for some applications, lifelong learning in robotics requires long-range context with no attention priors.
It turns out that we can look at the attention mechanism as a kernel, and where there is a kernel… There are 🎰 random feature maps!
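As a reminder of the random-features trick, here is a sketch for the Gaussian kernel using the classic random Fourier features of Rahimi and Recht (not the Performer's own feature map, which comes later): a kernel K(x, y) is approximated by an inner product of finite random feature maps, K(x, y) ≈ φ(x)ᵀφ(y).

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 4, 5000                       # input dim, number of random features

W = rng.standard_normal((m, d))      # w_i ~ N(0, I)
b = rng.uniform(0, 2 * np.pi, m)     # b_i ~ Unif[0, 2*pi]

def phi(x):
    """Random Fourier features for the Gaussian kernel exp(-||x-y||^2 / 2)."""
    return np.sqrt(2.0 / m) * np.cos(W @ x + b)

x = rng.standard_normal(d) * 0.3
y = rng.standard_normal(d) * 0.3

exact = np.exp(-np.sum((x - y) ** 2) / 2)   # true kernel value
approx = phi(x) @ phi(y)                    # random-features estimate
print(exact, approx)                        # the two values are close
```

The estimate is unbiased, and its variance shrinks as the number of random features m grows.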
Why is this formulation helpful? With random feature maps, the associativity of matrix multiplication lets us perform the products in a different order, reducing the memory cost from quadratic to linear in the sequence length 📉
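A minimal sketch of the reordering (φ here stands for any feature map; the Performer's actual random features are defined in the paper): instead of materializing the L × L matrix φ(Q)φ(K)ᵀ, we first contract φ(K)ᵀ with V, so the only intermediate is a small m × d matrix.

```python
import numpy as np

L, m, d = 1024, 256, 64              # sequence length, feature dim, head dim
rng = np.random.default_rng(0)
phiQ = rng.random((L, m))            # φ(Q): non-negative feature maps
phiK = rng.random((L, m))            # φ(K)
V = rng.standard_normal((L, d))      # values

# Quadratic order: (φ(Q) φ(K)^T) V  -> materializes an (L, L) matrix
quadratic = (phiQ @ phiK.T) @ V

# Linear order:    φ(Q) (φ(K)^T V)  -> only an (m, d) intermediate
linear = phiQ @ (phiK.T @ V)

print(np.allclose(quadratic, linear))   # same result, different memory cost
```

(The full mechanism also normalizes the rows, which can be reordered the same way.)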
Krzysztof then showed that there are two possible random feature formulations for the softmax kernel, and that the so-called positive random features are far superior to the trigonometric ones: the former are much more reliable, yielding much more stable training curves.
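A sketch of the positive random features for the softmax kernel SM(x, y) = exp(xᵀy), following the construction described in the paper (dimensions and test vectors here are illustrative). Since every feature is strictly positive, the estimator avoids the sign cancellation that makes the trigonometric variant unstable near small kernel values.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 4, 10000                      # input dim, number of random features

W = rng.standard_normal((m, d))      # w_i ~ N(0, I)

def phi_pos(x):
    """Positive random features: E[phi(x) . phi(y)] = exp(x . y)."""
    # exp(w^T x - ||x||^2 / 2) is always positive -> no sign cancellation
    return np.exp(W @ x - x @ x / 2) / np.sqrt(m)

x = rng.standard_normal(d) * 0.3
y = rng.standard_normal(d) * 0.3

exact = np.exp(x @ y)                # softmax kernel value
approx = phi_pos(x) @ phi_pos(y)     # positive random-features estimate
print(exact, approx)                 # the estimate concentrates around exp(x.y)
```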
Besides interesting results on protein sequences 🧫 that you can find in the paper, here is how the Performer does on the Long Range Arena ⚔️ benchmark:
At LightOn we build hardware for machine learning that you can use to compute random feature maps at scale. If you want to try out your latest idea, you can register to the LightOn Cloud for a Free Trial or apply to the LightOn Cloud for Research Program!
About Us
LightOn is a hardware company that develops new optical processors to considerably speed up Machine Learning computations. LightOn’s processors open new horizons in computing and engineering fields that are facing computational limits. Interested in speeding up your computations? Try out our solution on LightOn Cloud! 🌈
Follow us on Twitter at @LightOnIO, subscribe to our newsletter, and/or register for our workshop series. We live stream, so you can join from anywhere. 🌍
The author
Iacopo Poli, Lead Machine Learning Engineer at LightOn AI Research.