A Paper A Day: #11 Pointer Networks

Amr Sharaf
3 min readJun 3, 2017

Today we discuss the pointer networks paper by Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly.

Pointer networks are sequence-to-sequence models where the output is discrete tokens corresponding to positions in an input sequence. The main difference between pointer networks and standard seq2seq models is two-fold:

  1. The output of pointer networks is discrete and correspond to positions in the input sequence
  2. the number of target classes in each step of the output depends on the length of the input, which is variable.

Pointer networks are suitable for problems like sorting, word ordering, or computational linguistic problems such as convex hulls and traveling sales person problems. One common characteristic for all these problems is that the size of the target dictionary varies depending on the input length.

pointer network solves the problem of variable size output dictionaries using a mechanism of neural attention. It differs from the previous attention attempts in that, instead of using attention to blend hidden units of an encoder to a context vector at each decoder step, it uses attention as a pointer to select a member of the input sequence as the output.

Main Contributions

The authors summarize the main contributions in this paper as follows:

We propose a new architecture, that we call Pointer Net, which is simple and effective. It deals with the fundamental problem of representing variable length dictionaries by using a softmax probability distribution as a “pointer”.

We apply the Pointer Net model to three distinct non-trivial algorithmic problems involving geometry. We show that the learned model generalizes to test problems with more points than the training problems.

Our Pointer Net model learns a competitive small scale (n ≤ 50) TSP approximate solver. Our results demonstrate that a purely data driven approach can learn approximate solutions to problems that are computationally intractable.

Pointer Networks

Ptr-Net — An encoding RNN converts the input sequence to a code (blue) that is fed to the generating network (purple). At each step, the generating network produces a vector that modulates a content-based attention mechanism over inputs. The output of the attention mechanism is a softmax distribution with dictionary size equal to the length of the input. (Credit: Vinyals, Oriol et al. NIPS 2015)

The figure above shows the pointer networks architecture. A traditional sequence-to-sequence model uses a softmax distribution over a fixed sized output dictionary to compute p(Ci|C1, . . . , Ci−1, P). Thus it cannot be used for problems where the size of the output dictionary is equal to the length of the input sequence. To solve this problem pointer networks model p(Ci|C1, . . . , Ci−1, P) using an attention mechanism as follows:

where softmax normalizes the vector ui (of length n) to be an output distribution over the dictionary of inputs, and v, W1, and W2 are learnable parameters of the output model. In pointer networks, we do not blend the encoder state ej to propagate extra information to the decoder, but instead, use uij as pointers to the input elements.


Experiments show that Ptr-Nets can be used to learn solutions to three different combinatorial optimization problems. Ptr-nets work on variable sized inputs (yielding variable sized output dictionaries), something the baseline models (sequence-to-sequence with or without attention) cannot do directly. Even more impressively, they outperform the baselines on fixed input size problems — to which both models can be applied.