I asked ChatGPT to code the Attention in Transformers and this happened!

Freedom Preetham · Published in Autonomous Agents · 2 min read · Feb 22


Alright, I am not easily impressed when it comes to LLMs; I have my qualms. They do a poor job on many logic tasks: Winograd schemas, cloze tests, word-sense disambiguation across long contexts, and so on.

But one area where ChatGPT has excelled recently is the world of “constrained grammar” with finite stochastic actions and infinite outcomes, à la coding.

Here I test-drive the model by asking ChatGPT to write me an attention mechanism for transformers. First it gives me a lazy “self-attention” with a Luong score.
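We don't see the exact code ChatGPT produced, but a minimal numpy sketch of self-attention with a Luong-style multiplicative score (the "general" variant, `score = H W Hᵀ`, falling back to a plain dot score when no weight matrix is given) might look like this. Function and variable names here are illustrative, not taken from the session:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def luong_self_attention(H, W=None):
    """Self-attention over hidden states H of shape (seq_len, d_model).

    Uses Luong's multiplicative score: score = H @ W @ H.T (the
    'general' variant); with W=None it degrades to the plain dot score.
    Returns the context vectors and the attention weights.
    """
    scores = H @ H.T if W is None else H @ W @ H.T   # (seq, seq)
    weights = softmax(scores, axis=-1)               # one distribution per query
    context = weights @ H                            # (seq, d_model)
    return context, weights
```

Each row of `weights` is a probability distribution over all positions, so every position's context vector is a convex combination of the hidden states.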

I am not impressed and ask for multi-head attention. It (they/them?) understands the context and upgrades the self-attention to a multi-head mechanism.
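Again, this is my own hedged reconstruction rather than the model's actual output: a bare-bones multi-head variant splits the projected queries, keys, and values into heads, attends within each head, then concatenates and projects back. (No scaling yet; that comes in the next request.)

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, num_heads):
    """Multi-head self-attention for X of shape (seq_len, d_model).

    Wq, Wk, Wv, Wo are (d_model, d_model) projection matrices; d_model
    must be divisible by num_heads. Illustrative sketch, unscaled scores.
    """
    seq, d_model = X.shape
    d_head = d_model // num_heads
    Q, K, V = X @ Wq, X @ Wk, X @ Wv

    def split(M):
        # (seq, d_model) -> (num_heads, seq, d_head)
        return M.reshape(seq, num_heads, d_head).transpose(1, 0, 2)

    Qh, Kh, Vh = split(Q), split(K), split(V)
    scores = Qh @ Kh.transpose(0, 2, 1)        # (heads, seq, seq)
    weights = softmax(scores, axis=-1)
    heads = weights @ Vh                       # (heads, seq, d_head)
    concat = heads.transpose(1, 0, 2).reshape(seq, d_model)
    return concat @ Wo                         # final output projection
```

The key bookkeeping, and the part ChatGPT got right in the dialogue, is that the head split and the concatenation must be exact inverses of each other.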

I push the boundary and ask for regularisation and a scaled dot product. And VOILA, I get dropout added with post-layer normalisation, plus a scaled dot product. The scaling factor, to my surprise, is the “sqrt of the head dimension of the key vector”. The fact that it could keep track that the code is multi-headed, and that the scaling should happen per head, is priceless. BOOM! Flawless execution.
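The two additions the post describes can be sketched as follows; again a minimal illustration under my own naming, not the session's code. Scaling divides the scores by the square root of the per-head key dimension, dropout is applied to the attention weights, and layer norm is applied after the residual (post-LN):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_attention(Q, K, V, dropout_p=0.1, train=True, rng=None):
    """Scaled dot-product attention with (inverted) dropout on the weights.

    Q, K, V have a trailing per-head dimension d_k; scaling by sqrt(d_k)
    keeps the score variance roughly constant as d_k grows.
    """
    d_k = K.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)   # scale by sqrt(d_k)
    weights = softmax(scores, axis=-1)
    if train and dropout_p > 0:
        rng = rng or np.random.default_rng()
        keep = rng.random(weights.shape) >= dropout_p
        weights = weights * keep / (1.0 - dropout_p)  # inverted dropout
    return weights @ V

def layer_norm(x, eps=1e-5):
    # Normalise over the feature dimension
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

# Post-layer-norm residual pattern: out = layer_norm(x + sublayer(x))
```

Without the 1/sqrt(d_k) factor, dot products grow with dimension and push the softmax into near-one-hot, vanishing-gradient territory, which is exactly why the Transformer paper scales this way.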

Yes, one can argue that this code could easily be replicated by ChatGPT from an existing transformer implementation.

That is NOT the main point of this exercise. The point is that I could hold a dialogue with it, altering the code within the same session and across execution boundaries, and I got exactly what I asked for. It's NOT the understanding of my prompt that is impressive; that has been around for 10 years. It's the delivery of the most accurate code that is impressive.

Ahem, the only other person who can write code this fast from thought to execution is me ;) (just kidding).

On a lighter note, I think I found my founding engineering team!

Honestly, MIND BLOWN! 🤯