Summary: Structured Attention Networks

Robert L. Logan IV
UCI NLP
Oct 13, 2018

Authors: Yoon Kim, Carl Denton, Luong Hoang, Alexander M. Rush

Since its inception, attention has become an essential component of neural architectures that perform sequence transduction. Originally designed as a differentiable analog of the alignment matrix used in phrase-based machine translation [1], attention has also proven useful for tasks such as image captioning [2] and constituency parsing [3]. In fact, some have argued that it is all you need to build a state-of-the-art sequence transduction model [4].

In Structured Attention Networks [5], Kim et al. formulate attention as a graphical model that simulates selection from a set of inputs. Using this framework, they then demonstrate how richer graphical models can be used to incorporate structural biases into a model while still allowing it to be trained end-to-end.

Concretely, given a sequence of input vectors x=[x⁽¹⁾,…,x⁽ⁿ⁾] and a query vector q, the context vector c produced by a typical attention mechanism can be viewed as the expected value of the function f(x,z)=x⁽ᶻ⁾ (called the annotation function) over a latent attention variable z∼p(z|x,q) (usually given by softmax(W[x;q] + b)). This corresponds to the graphical model in Figure 1.a below:

Figure 1: From the original paper (link: https://arxiv.org/abs/1702.00887)
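To make the expectation view concrete, here is a minimal sketch in PyTorch of the standard mechanism described above. The parameter names W and b and the exact scoring function are illustrative assumptions rather than the paper's specification; the point is just that c is the softmax-weighted average of the inputs.

```python
import torch

def simple_attention(x, q, W, b):
    """x: (n, d) inputs; q: (d,) query; W: (1, 2d) and b: (1,) are hypothetical scoring parameters."""
    n, d = x.shape
    xq = torch.cat([x, q.unsqueeze(0).expand(n, -1)], dim=1)  # each row is [x_i; q], shape (n, 2d)
    scores = (xq @ W.t() + b).squeeze(-1)                     # unnormalized potentials, shape (n,)
    p = torch.softmax(scores, dim=0)                          # p(z = i | x, q): categorical over positions
    c = p @ x                                                 # E_z[f(x, z)] = sum_i p(z = i) * x_i
    return c, p
```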

The key idea behind the structured attention network is to substitute this graphical model with one that has richer structural dependencies better suited for a given application. For instance, instead of modeling the selection of a single input, it may be more appropriate to model the selection of multiple contiguous sub-sequences of inputs. In this case, the attention mechanism requires n binary latent variables z⁽¹⁾,…,z⁽ⁿ⁾ to represent whether a given input element is included in one of the selected sub-sequences. Additionally, further structural assumptions can be made such as whether to treat each z⁽ⁱ⁾ independently (as is depicted in Figure 1.b), or to impose dependency between the z⁽ⁱ⁾’s using something like a linear chain CRF (as is depicted in Figure 1.c).
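For the independent-selection case (Figure 1.b), each z⁽ⁱ⁾ is a Bernoulli variable, so its marginal is simply a sigmoid of its score. A minimal sketch, assuming the per-position scores have already been computed from x and q by some scoring network (not shown):

```python
import torch

def independent_binary_attention(x, scores):
    """x: (n, d) inputs; scores: (n,) log-odds for each z_i, assumed to come from x and q."""
    p = torch.sigmoid(scores)   # p(z_i = 1 | x, q): each z_i is an independent Bernoulli
    c = p @ x                   # E[sum_i z_i * x_i] = sum_i p(z_i = 1) * x_i
    return c, p
```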

With the graphical model structure defined, the output of the structured attention mechanism is computed using the exact same procedure described above; the context vector c is computed as the expected value of the annotation function f(x, z) over the attention distribution p(z|x, q). (Note: The annotation function needs to be slightly modified now that there are multiple z⁽ⁱ⁾’s; see Section 3.1 in the paper for more details). Since p(z|x,q) is given by a graphical model, one of the many well-studied techniques for inference on graphical models can be used to compute the expectation (the forward-backward algorithm in this case). Importantly, since these inference techniques typically boil down to a sequence of summation and multiplication operations, the context vector is fully differentiable!
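As a rough sketch (not the authors' implementation) of the linear-chain CRF case: the forward-backward algorithm yields the marginals p(z⁽ⁱ⁾ = 1 | x, q), and the context vector is the marginal-weighted sum of the inputs. The unary and transition log-potentials are assumed to be produced from x and q elsewhere.

```python
import torch

def crf_structured_attention(x, unary, trans):
    """
    Linear-chain CRF attention over binary z_1, ..., z_n (Figure 1.c).
    x:     (n, d) input vectors
    unary: (n, 2) unary log-potentials for z_i in {0, 1} (assumed computed from x and q)
    trans: (2, 2) transition log-potentials between neighbouring z's
    """
    n = unary.size(0)
    # Forward recursion, carried out in log-space for numerical stability.
    alpha = [unary[0]]
    for i in range(1, n):
        alpha.append(unary[i] + torch.logsumexp(alpha[-1].unsqueeze(1) + trans, dim=0))
    # Backward recursion.
    beta = [torch.zeros(2) for _ in range(n)]
    for i in range(n - 2, -1, -1):
        beta[i] = torch.logsumexp(trans + unary[i + 1] + beta[i + 1], dim=1)
    log_Z = torch.logsumexp(alpha[-1], dim=0)
    # Marginals p(z_i = 1 | x, q).
    log_marginals = torch.stack([alpha[i] + beta[i] for i in range(n)]) - log_Z
    p = log_marginals[:, 1].exp()
    c = p @ x                  # E[f(x, z)] = sum_i p(z_i = 1) * x_i
    return c, p
```

Because the recursions use only differentiable operations (sums, products, and logsumexp), an autodiff framework can backpropagate through the inference procedure directly, which is the property the paper relies on.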

The sub-sequence selection-based model provides a simple graphical model to illustrate how structured attention can be used. In the paper, the authors provide an additional example of a dependency tree-based model which is considerably more complex (see Section 3.2 for more details). In my opinion, the coolest aspect of this work is how general the formulation of structured attention is; the allowance for any graphical model to be used to describe the distribution of z provides a huge class of toys to experiment with.

Although the underlying idea behind structured attention is promising, the authors bring up a couple of reasons why it might be difficult to use in practice. First, from a computational standpoint, numerical stability tends to be an issue when working with graphical models. Because of this, computations are better carried out in log-space, which can add considerable complexity to the model code. Another issue raised by the authors is that naively applying current tools for automatic differentiation tends to be inefficient; to make structured attention tractable for large problems, gradient computations typically need to be written by hand.
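A toy illustration of the stability issue, assuming PyTorch: summing exponentiated potentials directly overflows, while the equivalent log-space computation does not.

```python
import torch

scores = torch.tensor([1000.0, 1001.0, 999.0])    # large potentials are routine in CRFs
naive  = torch.log(torch.exp(scores).sum())       # exp() overflows, so this is inf
stable = torch.logsumexp(scores, dim=0)           # ~1001.41, computed entirely in log-space
```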

From a performance standpoint, the improvements conferred by structured attention are modest. The experiments section applies the model to four different tasks:

  1. Translating mathematical formulas given in prefix notation to infix notation (using the dependency tree-based model).
  2. Character and word-based Japanese-to-English machine translation (using the linear-chain CRF-based model).
  3. Question answering on the bAbI dataset (using the linear-chain CRF-based model).
  4. Natural language inference on the SNLI dataset (using the dependency tree-based model).

Of these experiments, structured attention attains a substantial improvement over standard attention only on the first task (~2–3x improvement in average length to failure) and on character-based machine translation (14.6 vs. 12.6 BLEU). On the remaining tasks, structured and simple attention perform almost identically. These are by no means negative results, but relative to the substantial gains observed when attention was originally introduced, they seem somewhat underwhelming, especially given the additional complexity required to implement and train these models.

That said, I think the experimental results are secondary to the true contribution of this paper, which is the introduction of a method for incorporating structured prediction into neural networks. This opens the door to a wide variety of follow-up research on using structured prediction tasks as intermediate stages in a neural pipeline, which is something I hope to cover in future summaries.

For more insight from author Yoon Kim, I highly recommend checking out his interview on the (awesome) NLP Highlights podcast.

References

[1] Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. “Neural machine translation by jointly learning to align and translate.” arXiv preprint arXiv:1409.0473 (2014).

[2] Xu, Kelvin, et al. “Show, attend and tell: Neural image caption generation with visual attention.” International conference on machine learning. 2015.

[3] Vinyals, Oriol, et al. “Grammar as a foreign language.” Advances in Neural Information Processing Systems. 2015.

[4] Vaswani, Ashish, et al. “Attention is all you need.” Advances in Neural Information Processing Systems. 2017.

[5] Kim, Yoon, et al. “Structured attention networks.” arXiv preprint arXiv:1702.00887 (2017).
