From the Archives: Concise Integer Linear Programming Formulations for Dependency Parsing (ACL 2009 Best Paper)

Dheeru Dua · UCI NLP · Oct 25, 2018 · 3 min read

Authors: André F. T. Martins, Noah A. Smith, Eric P. Xing

The main idea behind dependency parsing is to create links (arcs) between the words of a sentence, where each arc attaches a modifier word to its head, thus establishing a relation between them. This paper formulates the problem as an integer linear program. Linear programming is a convenient way to impose global and local constraints that influence which words should be attached to one another.

Notation:

  • Let x = <w⁰, w¹, …, wⁿ> be a sentence, where the w's are its words (w⁰ is an artificial root symbol), and let y ∈ Y(x) be a legal dependency tree for x; an arc a = <i,j> ∈ y links a head word wⁱ to a modifier wʲ.
  • The incidence vector, z, indicates which of the O(n²) possible arcs are present in a given tree (see the small example after this list).
  • Let δ⁻(v) be the set of all incoming arcs to a vertex v and δ⁺(v) the set of outgoing arcs.
  • Let D = <V, A> be the complete directed graph over the words, and y = <V, B> a legal dependency parse tree, with B ⊆ A.
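For concreteness, here is a minimal sketch of this notation in Python; the three-word sentence and the chosen tree are invented for illustration:

```python
# Toy illustration of the notation: candidate arcs and the incidence vector z.
words = ["<root>", "John", "saw", "Mary"]          # w0 is the artificial root
n = len(words) - 1

# All O(n^2) candidate arcs <head, modifier> (no self-loops, root takes no head).
arcs = [(h, m) for h in range(n + 1) for m in range(1, n + 1) if h != m]

# One legal dependency tree: root -> saw, saw -> John, saw -> Mary.
tree = {(0, 2), (2, 1), (2, 3)}

# Incidence vector z: 1 if the arc is in the tree, 0 otherwise.
z = {a: int(a in tree) for a in arcs}
print(sum(z.values()))  # a tree over n words uses exactly n arcs -> 3
```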

A standard ILP can be formulated as below, where F(x) is the feature vector, s is the corresponding vector of scores over the arcs, and A is the constraint matrix, which is of size O(n²). By dropping the integrality constraint, the problem can be solved as an LP relaxation, which is much faster.

LP formulation for dependency parsing (from the paper)
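Since the figure is not reproduced here, a rough sketch of that formulation in the notation above (the exact form in the paper may differ slightly): the score of a tree decomposes over its arcs, and we maximize it subject to linear constraints.

```latex
% Arc-factored (I)LP sketch: z is the arc-incidence vector, s the arc scores.
\max_{z}\; \mathbf{s}^{\top}\mathbf{z}
\quad \text{s.t.} \quad
\mathbf{A}\mathbf{z} \le \mathbf{b}, \qquad
z_a \in \{0,1\} \;\; \forall a \in A .
% LP relaxation: replace the integrality constraint by 0 <= z_a <= 1.
```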

The key idea proposed in the paper is how to formulate these constraints so that the global and local compatibility is not compromised. The constraints can be formulated as:

  • Each vertex other than the root must have exactly one incoming arc.
  • The root (node 0) has no incoming arcs.
  • B contains no cycles; together with the single-head constraint, this is equivalent to requiring that B be connected. The authors encode connectivity as a single-commodity network flow problem, as in classic formulations of minimum spanning trees. The resulting constraints, where ϕ is the commodity that flows through each arc, are sketched after this list.
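A sketch of those constraints in the notation above (indices and exact form may differ slightly from the paper): the root sends n units of flow, every word consumes one unit, and flow may only travel along selected arcs.

```latex
% Single-head and root constraints on the arc-incidence variables z:
\sum_{a \in \delta^-(j)} z_a = 1 \quad (j = 1,\dots,n), \qquad
\sum_{a \in \delta^-(0)} z_a = 0 .
% Single-commodity flow for connectivity (phi_a = flow on arc a):
\sum_{a \in \delta^+(0)} \phi_a = n, \qquad
\sum_{a \in \delta^-(j)} \phi_a - \sum_{a \in \delta^+(j)} \phi_a = 1 \quad (j = 1,\dots,n), \qquad
0 \le \phi_a \le n\, z_a \quad \forall a \in A .
```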

For higher-order features (such as sibling or grandparent arcs), the authors use a linearization trick, defining an auxiliary variable for each pair of arcs whose joint presence is scored.
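A sketch of the standard product linearization for two arcs a₁ and a₂ (this is the usual logical-AND encoding; the paper's exact indexing may differ): the auxiliary variable equals 1 exactly when both arcs are selected.

```latex
% z_{a_1 a_2} should equal z_{a_1} * z_{a_2}; with 0/1 variables this
% product can be enforced by three linear inequalities:
z_{a_1 a_2} \le z_{a_1}, \qquad
z_{a_1 a_2} \le z_{a_2}, \qquad
z_{a_1 a_2} \ge z_{a_1} + z_{a_2} - 1 .
```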

Words have preferences over the number and type of arguments they accept, which can be learned from data. The authors add valency indicator variables that expose, for each word, how many modifiers it takes, so the model can score these configurations.
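One plausible way to encode such valency indicators (a sketch only; the paper's exact construction may differ): let v_{ik} indicate that word i has exactly k modifiers, and tie it to the outgoing arcs of i.

```latex
% v_{ik} = 1 iff word i has exactly k outgoing (modifier) arcs:
\sum_{k=0}^{n} v_{ik} = 1, \qquad
\sum_{k=0}^{n} k \, v_{ik} = \sum_{a \in \delta^+(i)} z_a, \qquad
v_{ik} \in \{0,1\}.
```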

The model can also be extended to favor “nearly” projective parse trees by using a multicommodity directed flow formulation.
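To make the whole construction concrete, here is a minimal end-to-end sketch in Python of the basic (arc-factored, single-commodity flow) LP relaxation. It assumes the PuLP library; the sentence, the random arc scores, and all variable names are invented for illustration and are not the paper's setup.

```python
# Toy LP relaxation of the arc-factored parser with single-commodity flow
# constraints. Requires: pip install pulp
import random
import pulp

words = ["<root>", "John", "saw", "Mary"]
n = len(words) - 1
arcs = [(h, m) for h in range(n + 1) for m in range(1, n + 1) if h != m]
random.seed(0)
score = {a: random.uniform(-1.0, 1.0) for a in arcs}  # stand-in for learned arc scores

prob = pulp.LpProblem("dependency_lp_relaxation", pulp.LpMaximize)

# Relaxed arc indicators (0 <= z <= 1) and nonnegative flows.
z = {a: pulp.LpVariable(f"z_{a[0]}_{a[1]}", lowBound=0, upBound=1) for a in arcs}
phi = {a: pulp.LpVariable(f"phi_{a[0]}_{a[1]}", lowBound=0) for a in arcs}

# Objective: total score of the selected arcs.
prob += pulp.lpSum(score[a] * z[a] for a in arcs)

# Every word has exactly one head (no arc enters node 0 by construction).
for m in range(1, n + 1):
    prob += pulp.lpSum(z[(h, m)] for h in range(n + 1) if h != m) == 1

# Connectivity via single-commodity flow: the root sends n units,
# every word consumes one unit, and flow only travels on selected arcs.
prob += pulp.lpSum(phi[(0, m)] for m in range(1, n + 1)) == n
for j in range(1, n + 1):
    inflow = pulp.lpSum(phi[a] for a in arcs if a[1] == j)
    outflow = pulp.lpSum(phi[a] for a in arcs if a[0] == j)
    prob += inflow - outflow == 1
for a in arcs:
    prob += phi[a] <= n * z[a]

prob.solve(pulp.PULP_CBC_CMD(msg=False))
tree = [a for a in arcs if pulp.value(z[a]) > 0.5]
print(tree)  # arcs selected by the relaxation (often already integral)
```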

Results and Conclusion:

  • Experiments on seven languages showed accuracy similar to formulations with exponentially many constraints, with a considerable speed-up.
  • The authors introduce an ILP formulation whose number of constraints is polynomial in the sentence length, making it faster than earlier attempts based on cutting-plane algorithms over exponentially many constraints.
  • LP relaxations speed up online discriminative training.
  • Soft constraints can be used to incorporate higher-order features and to model word valencies from the data.
  • The formulation offers a way to incorporate non-local constraints while keeping the number of constraints polynomial.
