Attention Mechanisms in Deep Learning — Not So Special

Dr Stephen Odaibo
The Blog of RETINA-AI Health, Inc.
8 min read · Jan 22, 2020
A Simple Attention Mechanism

Attention mechanisms are very important in the sense that they are ubiquitous and are a necessary component of neural machine learning systems. However, for that very same reason, they are not so special. Contrary to what their newfound recognition would suggest, they are not some exotic new device only present in certain “next generation” models. Instead, they are a component of essentially all deep learning models. Giving them an explicit name and identity is certainly useful, but has misrepresented them as more special than they are. Attention is inherent to neural networks, and is furthermore a defining characteristic of such systems. To learn is to pay attention.

Before proceeding, I must state that I very much do like the paper Attention Is All You Need. I in fact placed it on my list of most important papers of the decade (2010–2019). Similarly, the paper Neural Machine Translation by Jointly Learning to Align and Translate is insightful. Yet the more I think about attention, the more I realize that, contrary to how the community has portrayed it, it is far from a radical shift from existing models and paradigms. Furthermore, it is not new at all. It has always been an integral part of neural systems. In spite of my newfound insights, Attention Is All You Need remains on my list of top ML papers of the decade.

What is Attention in Deep Learning, Really?

Attention mechanisms are essentially a way to non-uniformly weight the contributions of input feature vectors so as to optimize the process of learning some target. It is of course possible that some target requires equal attention to be paid to all inputs (i.e., a uniform distribution of weights), but the term attention connotes a selection process in which certain inputs matter more than others.
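To make that weighting concrete, here is a minimal NumPy sketch of my own (the relevance scores here are made up for illustration, not learned): a softmax turns scores into a distribution, and that distribution weights a combination of the input vectors.

```python
import numpy as np

np.random.seed(0)
inputs = np.random.randn(4, 8)              # 4 input feature vectors of dimension 8
scores = np.array([2.0, 0.1, -1.0, 0.5])    # hypothetical relevance scores

weights = np.exp(scores) / np.exp(scores).sum()   # softmax: a non-uniform distribution
attended = weights @ inputs                       # convex combination of the inputs

print(weights)          # the first vector dominates the combination
print(attended.shape)   # (8,)
```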

At this point you may ask: “Isn’t weight selection something that all neural networks do? Isn’t that the whole purpose and end result of learning via stochastic gradient descent?” To both questions, I would answer: yes and yes. The neural network, however constructed, learns on its own to place higher weights on the connections from certain parts of the input data and lower weights on the connections from other parts.

One of Many Possible Attention Formalisms

As I stated above, attention mechanisms are a way to non-uniformly weight the contributions of various input features so as to optimize the learning of some target. One way to do this is to linearly transform all the inputs at a given level. The transformed inputs are called keys. The current input which we are looking to interpret is also transformed, into something called a query. The linear transformation maps into a space in which the similarity between keys and queries carries information about the relevance of those keys to the translation of a given query. We compute the similarity between keys and queries and use those similarities as the weights of a convex combination of values, where the values are the features at the level of the query (including the query itself). This weighted sum is called a context vector and is fed into the corresponding cell in the next layer of the model. Below is a schematic of the attention mechanism from Neural Machine Translation by Jointly Learning to Align and Translate by Bahdanau et al.:

Attention Mechanism from Bahdanau et al.

In the above schematic and equations, we see how the similarity weights are generated in Equation (2) and the context vector in Equation (3). The authors use an “alignment model,” Equation (1), to determine the similarity between the query and the keys. The alignment model is itself a neural network. The scores it produces are passed through a softmax to generate a distribution, and this distribution is in turn used for the weighted summation of hidden states that yields the context vector in Equation (3).
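To make Equations (1)–(3) concrete, below is a rough NumPy sketch of this additive attention step. The weight matrices here are random stand-ins for parameters that, in the actual model, are learned jointly with the rest of the network; the dimensions are chosen arbitrarily for illustration.

```python
import numpy as np

np.random.seed(0)
d_h, d_s, d_a, T = 16, 16, 32, 5     # encoder/decoder state sizes, alignment size, source length

h = np.random.randn(T, d_h)          # encoder hidden states (play the role of keys and values)
s_prev = np.random.randn(d_s)        # previous decoder state (plays the role of the query)

# Alignment model parameters: random stand-ins for learned weights
W_a = np.random.randn(d_a, d_s)
U_a = np.random.randn(d_a, d_h)
v_a = np.random.randn(d_a)

# Equation (1): alignment scores e_j = v_a^T tanh(W_a s_{i-1} + U_a h_j)
e = np.tanh(s_prev @ W_a.T + h @ U_a.T) @ v_a    # shape (T,)

# Equation (2): softmax over source positions gives the attention distribution
alpha = np.exp(e) / np.exp(e).sum()

# Equation (3): the context vector is the weighted sum of encoder states
c = alpha @ h                                    # shape (d_h,)
```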

An Overview of the Transformer Paper

In standard recurrent neural networks used for sequence-to-sequence translation, the recurrence imposes sequential structure and inhibits parallelization. This was the primary issue the Transformer paper addressed, and it did so by positionally encoding the entire sequence, allowing it to be consumed and processed in toto. The paper’s other amendments are important optimizations, but in my view they do not distinguish the Transformer from the general class of LSTMs.

Is Attention Really All You Need?

That Q and K are learnable suggests there is more to the Transformer than meets the eye, and that it is perhaps not as different from a vanilla LSTM as it initially seems. It simply spells out its attention mechanism more explicitly than the LSTM does. In the LSTM, attention is essentially black-boxed in the input, output, forget, and cell state gates, though not entirely: the forget gate, for instance, explicitly chooses what not to pay attention to. In both cases, LSTM and Transformer, we have increased expressivity in the form of network depth, however nested. Most of the Transformer’s benefit over RNNs comes primarily from parallelization, which enables more training on longer sequences.
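For comparison, here is a minimal sketch of the Transformer’s scaled dot-product attention, with random matrices standing in for the learned projections that produce Q, K, and V. Every position attends to every other position in one batched matrix multiply, which is precisely what makes the computation parallelizable.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

np.random.seed(0)
T, d_model, d_k = 6, 32, 8
X = np.random.randn(T, d_model)        # one input sequence of length T

# Learnable projections (random stand-ins here)
W_Q = np.random.randn(d_model, d_k)
W_K = np.random.randn(d_model, d_k)
W_V = np.random.randn(d_model, d_k)

Q, K, V = X @ W_Q, X @ W_K, X @ W_V
scores = Q @ K.T / np.sqrt(d_k)        # all pairwise query-key similarities: O(T^2)
A = softmax(scores, axis=-1)           # one attention distribution per query position
out = A @ V                            # context vectors for every position at once
```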

The real game changer in the Transformer is the part that is often brushed off as fungible and inconsequential: the positional encoding. Fungible it may be; inconsequential it is not. The positional encoding is where most of the action actually happens in terms of distinguishing the Transformer from the LSTM.
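The sinusoidal positional encoding from Vaswani et al. is easy to write down. Here is a short sketch, with dimensions chosen arbitrarily for illustration:

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding from Vaswani et al. (2017)."""
    pos = np.arange(seq_len)[:, None]                 # positions 0..seq_len-1
    i = np.arange(d_model // 2)[None, :]              # dimension pair index
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                      # even dimensions get sin
    pe[:, 1::2] = np.cos(angles)                      # odd dimensions get cos
    return pe

# The whole sequence is consumed at once; position is injected additively.
embeddings = np.random.randn(6, 32)
encoded = embeddings + positional_encoding(6, 32)
```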

Multi-head attention from Vaswani et al

And while parallelizability is the primary novelty of the Transformer, and is made possible by positional encoding, one must keep an eye out for where this parallelization does not distinguish it from standard RNNs. One such place is the multi-headed attention module. Multi-headedness is not unique to the Transformer per se, as it is equally applicable to RNNs. It serves in effect to increase the depth and “expressivity” of the network by increasing the number of parameters. Multi-head attention can be viewed as a form of ensembling.
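Here is a toy sketch of that ensembling view: each head runs the same attention computation in its own lower-dimensional subspace (with its own projections, random stand-ins here), and the heads are concatenated and mixed by an output matrix.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

np.random.seed(0)
T, d_model, n_heads = 6, 32, 4
d_head = d_model // n_heads
X = np.random.randn(T, d_model)

heads = []
for _ in range(n_heads):
    # Each head has its own (here random, in practice learned) projections
    W_Q, W_K, W_V = (np.random.randn(d_model, d_head) for _ in range(3))
    heads.append(attention(X @ W_Q, X @ W_K, X @ W_V))

W_O = np.random.randn(d_model, d_model)       # output mixing matrix
out = np.concatenate(heads, axis=-1) @ W_O    # (T, d_model)
```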

An Overview of the Reformer Paper

In the paper Reformer: The Efficient Transformer, the authors improve the time complexity of the Transformer from O(N²) to O(N log N), where N is the sequence length, by approximating the attention step instead of fully computing it. The self-attention step amounts to a mutual interaction among all positions, i.e., the matrix-matrix multiply QKᵀ, hence the O(N²). The authors instead employ a locality-sensitive hashing (LSH) mechanism to compute the interaction only with a select few neighbors, namely those keys closest to any given query q_i. This drops the time complexity to O(N log N). It works because most values in the resulting weight distribution are close to zero: the exponential in the softmax exaggerates the larger scores and suppresses the smaller ones, so one can ignore the small ones and focus on the keys k_j closest to q_i. The locality-sensitive hashing scheme assigns hashes to vectors randomly within a space such that vectors close together in that space have a higher likelihood of receiving the same hash. All in all, it is a welcome approximate algorithm that improves on the time complexity of the Transformer without overly sacrificing learning performance.
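As a toy illustration of the locality-sensitive hashing idea, here is a random-hyperplane sketch (not the exact angular LSH scheme the Reformer uses): vectors pointing in similar directions tend to fall on the same side of each random hyperplane and therefore share a hash bucket, so a query need only be compared against the keys in its own bucket.

```python
import numpy as np

np.random.seed(0)
d, n_bits = 16, 8
planes = np.random.randn(n_bits, d)        # random hyperplanes through the origin

def lsh_hash(v):
    """Sign pattern of v against the random hyperplanes, packed into an integer bucket id."""
    bits = (planes @ v > 0).astype(int)
    return int("".join(map(str, bits)), 2)

q = np.random.randn(d)
near = q + 0.05 * np.random.randn(d)       # a nearby vector (small perturbation of q)
far = np.random.randn(d)                   # an unrelated vector

print(lsh_hash(q) == lsh_hash(near))       # usually True: same bucket
print(lsh_hash(q) == lsh_hash(far))        # usually False: different bucket
```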

From Reformer: The Efficient Transformer paper

Conclusion

The recent surge in interest in devices such as explicitly modeled attention mechanisms falls into the realm of explainability of neural networks. Claims about explainability and about the role of specific submodules of an architecture require proof, and such proof is often not provided in the papers where these claims are made. One danger this direction poses is that of assuming more uniqueness, causality, and explainability than is actually present. The neural network already encodes attention, irrespective of our attempts to model it explicitly. And during the course of training, the neural network likely follows its own course to a greater extent than is apparent. To truly prevent a neural network from propagating and updating weights naturally and dynamically as driven by gradient descent, one would need a highly restrictive architecture, which would mean forfeiting the power of deep learning. It may not be possible to attain complete explainability while optimizing learning capacity. This idea is analogous to the Heisenberg Uncertainty Principle, which postulates the impossibility of simultaneously knowing the momentum and position of a quantum mechanical particle. We conjecture an Explainability-Neural Learning Uncertainty Principle.

Attention is an inherent component of all deep learning systems, often as a black box, or more precisely a grey box. Recent attempts to explicitly design or identify the attention mechanism require further study and proof of causality. Additionally, further investigation is needed into the relationship between explainability and learning capacity.

BIO

Dr. Stephen G. Odaibo is CEO & Founder of RETINA-AI Health, Inc., and is on the Faculty of the MD Anderson Cancer Center. He is a Physician, Retina Specialist, Mathematician, Computer Scientist, and Full Stack AI Engineer. In 2017 he received the UAB College of Arts & Sciences’ highest honor, the Distinguished Alumni Achievement Award. And in 2005 he won the Barrie Hurwitz Award for Excellence in Neurology at Duke University School of Medicine, where he topped the class in Neurology and in Pediatrics. He is author of the books “Quantum Mechanics & The MRI Machine” and “The Form of Finite Groups: A Course on Finite Group Theory.” Dr. Odaibo chaired the “Artificial Intelligence & Tech in Medicine Symposium” at the 2019 National Medical Association Meeting. Through RETINA-AI, he and his team are building AI solutions to address the world’s most pressing healthcare problems. He resides in Houston, Texas with his family.

www.retina-ai.com

REFERENCES:
Attention Is All You Need. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin.
Reformer: The Efficient Transformer. Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya.
Neural Machine Translation by Jointly Learning to Align and Translate. Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio.
Auto-Encoding Variational Bayes. Diederik P. Kingma, Max Welling.
Generative Adversarial Networks. Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, Yoshua Bengio.
ImageNet Classification with Deep Convolutional Neural Networks. Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton.
