Rethinking Attention with Performers: Towards a New Transformer Revolution

For the 10th edition of the LightOn AI Meetup we hosted Krzysztof Choromanski, Research Scientist at Google Robotics NYC and Adjunct Assistant Professor at Columbia University, who presented his work on Rethinking Attention with Performers, which will appear as an oral presentation at ICLR 2021. Congratulations Krzysztof and co-authors!

The 📺 recording of the meetup is on LightOn’s YouTube channel. Subscribe to the channel and to our Meetup group to get notified of future videos and events!

The leading approaches in language modeling are all obsessed with TV shows of my youth — namely Transformers and Sesame Street. Transformers this, Transformers that, and over here a bonfire worth of GPU-TPU-neuromorphic wafer scale silicon.

From the abstract of Single Headed Attention RNN: Stop Thinking With Your Head, by Stephen Merity

Indeed, it is hard to ignore the surge of Transformer architectures atop every machine learning benchmark in the past few years. It really looks like “Attention is All You Need”.

Attention mechanisms lend themselves well to parallelization ⛓️ and avoid catastrophic forgetting 😶‍🌫️. However, they do not scale well 📈: in particular, the memory cost grows quadratically with the sequence length. This is a roadblock 🚧 for applications in robotics or bioinformatics, for example.
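To see where the quadratic cost comes from, here is a minimal sketch (my own, not from the talk) of standard single-head softmax attention: the intermediate attention matrix has shape L × L, so memory grows quadratically with the sequence length L.

```python
import numpy as np

def softmax_attention(Q, K, V):
    # Standard softmax attention: materializes an (L, L) matrix.
    L, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)                 # (L, L): quadratic in L
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    A = np.exp(scores)
    A /= A.sum(axis=1, keepdims=True)             # row-wise softmax
    return A @ V                                  # (L, d)

rng = np.random.default_rng(0)
L, d = 1024, 64
Q, K, V = (rng.standard_normal((L, d)) for _ in range(3))
out = softmax_attention(Q, K, V)
print(out.shape)  # the output is (L, d), but the intermediate A was (L, L)
```

Doubling L quadruples the size of `A`, which is exactly the scaling the Performer sets out to avoid.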

Many works have proposed solutions based on priors such as local attention or sparse attention. While these priors are good enough for some applications, lifelong-learning robotics requires long-range context with no attention priors at all.

Attention matrices using a “local” or “graph” prior.

It turns out that we can look at the attention mechanism as a kernel, and where there is a kernel… There are 🎰 random feature maps!
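The random-feature idea predates Performers; a classic instance is the random Fourier features of Rahimi and Recht for the Gaussian kernel. As a hedged illustration (not the Performer's own feature map), the inner product of random feature vectors approximates the kernel value:

```python
import numpy as np

def random_fourier_features(X, W, b):
    # Random Fourier features for the Gaussian kernel exp(-||x - y||^2 / 2):
    # W has i.i.d. N(0, 1) entries, b is uniform in [0, 2*pi).
    m = W.shape[1]
    return np.sqrt(2.0 / m) * np.cos(X @ W + b)

rng = np.random.default_rng(0)
d, m = 8, 10_000
x, y = rng.standard_normal(d), rng.standard_normal(d)
W = rng.standard_normal((d, m))
b = rng.uniform(0, 2 * np.pi, m)

exact = np.exp(-np.sum((x - y) ** 2) / 2)                 # true kernel value
approx = (random_fourier_features(x[None, :], W, b)
          @ random_fourier_features(y[None, :], W, b).T)  # feature inner product
print(exact, approx[0, 0])
```

The approximation error shrinks as 1/√m, so more random features buy a better kernel estimate.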

Why is this formulation helpful? Using random feature maps and thanks to the associativity property, we can perform matrix products in a different order, and thereby reduce the memory cost from quadratic to linear with respect to sequence length 📉

Left and right expressions are equivalent, however by multiplying the green and yellow matrices first on the right, we can reduce the memory cost from quadratic to linear in L.
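The reordering in the figure can be sketched in a few lines. Assuming feature maps have already been applied, so `Qp` and `Kp` are L × m matrices of random features, both multiplication orders give the same result, but only one materializes an L × L matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
L, m, d = 512, 32, 64
Qp = rng.random((L, m))        # phi(Q): random features of the queries
Kp = rng.random((L, m))        # phi(K): random features of the keys
V = rng.standard_normal((L, d))

quadratic = (Qp @ Kp.T) @ V    # intermediate of shape (L, L): quadratic memory
linear = Qp @ (Kp.T @ V)       # intermediate of shape (m, d): linear in L
print(np.allclose(quadratic, linear))
```

Since m is a fixed number of random features, the (m, d) intermediate on the right no longer depends on the sequence length.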

Krzysztof then showed that there are two possible random feature formulations for the softmax kernel, and that the so-called positive random features are far superior to the trigonometric random features: the former are much more reliable, with much more stable training curves.

The Performer with positive random features outperforms the one with trigonometric random features, and it is also much more stable in training. The performance at convergence is very close to that of the usual Transformer implementation. Picture from the Performer paper.
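The two estimators can be sketched as follows; this is my own reconstruction of the feature maps, not the paper's code. Both estimate the softmax kernel SM(x, y) = exp(x · y), but the positive features are nonnegative by construction, while the trigonometric ones can produce negative estimates of a strictly positive quantity:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 16, 256
x = 0.3 * rng.standard_normal(d)
y = 0.3 * rng.standard_normal(d)
W = rng.standard_normal((m, d))          # rows w_i ~ N(0, I_d)
b = rng.uniform(0, 2 * np.pi, m)

def phi_positive(v):
    # Positive features: exp(w @ v - ||v||^2 / 2) / sqrt(m), always >= 0.
    return np.exp(W @ v - v @ v / 2) / np.sqrt(m)

def phi_trig(v):
    # Trigonometric features: exp(||v||^2 / 2) * sqrt(2/m) * cos(w @ v + b),
    # which can take negative values.
    return np.exp(v @ v / 2) * np.sqrt(2.0 / m) * np.cos(W @ v + b)

exact = np.exp(x @ y)                    # true softmax kernel value
pos_est = phi_positive(x) @ phi_positive(y)
trig_est = phi_trig(x) @ phi_trig(y)
print(exact, pos_est, trig_est)
```

When attention scores are near zero, negative trigonometric estimates can make the softmax normalizer blow up, which is one intuition for the instability Krzysztof described.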

Besides interesting results on protein sequences 🧫 that you can find in the paper, here is how the Performer does on the Long Range Arena ⚔️ benchmark:

The Performer is the fastest attention-based architecture while retaining most of the performance of a transformer, and reducing the memory cost significantly.

At LightOn we build hardware for machine learning that you can use to compute random feature maps at scale. If you want to try out your latest idea, you can register to the LightOn Cloud for a Free Trial or apply to the LightOn Cloud for Research Program!

About Us

LightOn is a hardware company that develops new optical processors that considerably speed up Machine Learning computation. LightOn’s processors open new horizons in computing and engineering fields that are facing computational limits. Interested in speeding your computations up? Try out our solution on LightOn Cloud! 🌈

Follow us on Twitter at @LightOnIO, subscribe to our newsletter, and/or register for our workshop series. We live stream, so you can join from anywhere. 🌍

The author

Iacopo Poli, Lead Machine Learning Engineer at LightOn AI Research.

We are a technology company developing Optical Computing for Machine Learning. Our tech harvests Computation from Nature.