A Review of “Compound Probabilistic Context-Free Grammars for Grammar Induction”

Ryan Cotterell
7 min read · Jun 29, 2019

--

Kim et al. (2019) is one of the most exciting papers I’ve had the pleasure to read in a while. Why? It challenges long-established wisdom in the field. I’ll explain in what follows.

To start with the basics, there is overwhelming evidence that human language is structured hierarchically. I once had the pleasure of hearing Chris Dyer, at a Dagstuhl workshop, liken those who deny hierarchical structure in language to climate change deniers. Indeed, if you doubt the existence of constituency structure, just check out an elementary syntax textbook from your local library. Much of such a textbook will focus on tests that you can run yourself that reveal hierarchical structure. Just see the Wikipedia page for a long list. Independent of their chosen theoretical toolkit, most syntacticians focus on mathematically modeling the apparent hierarchical nesting of phrases within one another.

To give a classic example, syntacticians are interested in explaining the phenomenon of clefting. Consider the sentence

(1) The engineer loves [training [the LSTM]]

Clefting is a sentence-to-sentence transformation in English whereby one takes part of the original sentence and fronts it after "It is…". Consider the following three sentences:

(2) It is the LSTM that the engineer loves training

(3) It is training the LSTM that the engineer loves

(4) *It is training the that the engineer loves LSTM

Why are (2) and (3) acceptable while (4) is not? The common explanation is that "the" and "LSTM" form a constituent, that is, "the" and "LSTM" bind tightly in the hierarchical structure and thus move as one unit. Likewise, "training the LSTM" forms a constituent, so it, too, can move. However, "training" and "the" do not form a constituent and thus cannot move as a unit, so they cannot be clefted the way the previous two strings could. Below is a coarse-grained parse of (1) with a structure that accounts for these facts.

One possible analysis of (1). (I tried to choose the least polemical non-terminal labelings I could.)
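In bracketed form, one such analysis looks as follows. The labels are my own and purely illustrative; all that matters is which substrings form constituents. A minimal sketch using NLTK:

```python
from nltk import Tree

# A coarse, hypothetical bracketing of (1); only the grouping matters here.
t = Tree.fromstring("(S (NP the engineer) (VP loves (VP training (NP the LSTM))))")
print(t[1])        # (VP loves (VP training (NP the LSTM))) -- a constituent
print(t[1][1])     # (VP training (NP the LSTM)) -- cleftable, cf. (3)
print(t[1][1][1])  # (NP the LSTM) -- cleftable, cf. (2)
```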

Indeed, parse trees are nice because they offer a very simple account of constituency in sentences.

As an aside, the acceptability judgements of sentences (2), (3), and (4) are effectively facts. A context-free grammar (CFG) is one model, albeit a non-probabilistic one, that allows the scientist to explain these facts. Of course, a CFG assumes that acceptability judgements are binary: sentences are either acceptable or not acceptable. Syntax as a field does not rest on the existence of trees; rather, models of sets of trees, i.e. grammars, are an incredibly useful mathematical tool for explaining the facts on the ground. For fun, the next time someone in NLP rejects the utility of trees in linguistics, ask them for an alternative explanatory theory of the data. Odds are they won't have one.
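To make the "binary acceptability" point concrete, here is a toy CFG written with NLTK. The grammar is my own and wildly incomplete, but it illustrates the modeling assumption: a string is acceptable exactly when it has at least one derivation.

```python
from nltk import CFG
from nltk.parse import ChartParser

# A tiny, purely illustrative grammar; real grammars are far larger.
grammar = CFG.fromstring("""
  S  -> NP VP
  NP -> 'the' N
  VP -> V NP | V VP
  N  -> 'engineer' | 'LSTM'
  V  -> 'loves' | 'training'
""")
parser = ChartParser(grammar)

def acceptable(sentence):
    """A CFG's notion of acceptability: does any derivation exist?"""
    return any(True for _ in parser.parse(sentence.split()))

print(acceptable("the engineer loves training the LSTM"))  # True
print(acceptable("the engineer loves training the"))       # False
```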

Now, to the task of grammar induction. A natural question that has busied the minds of computational linguists and NLPers for decades is how to give a computational model of language a notion of hierarchical constituency. For anyone trained in the art of probabilistic modeling, this is naturally viewed as a latent-variable problem: since we never explicitly observe parse trees, the most natural setting is to treat the trees as a latent variable. Notationally, let Σ be the set of words in a language. Then, let X be a Σ*-valued random variable, i.e. a sentence-valued random variable, and let T be a tree-valued random variable. We will construct a joint distribution p(X=x, T=t) over sentences and their parse trees and then estimate the parameters of p from corpus data. Since we only ever observe unannotated sentences, i.e. sentences without labeled trees, we will maximize the marginal likelihood of the sentences in our corpus $\{x^{(i)}\}_{i=1}^n$, marginalizing out the unobserved trees. That is, we want to maximize the following objective:

$$\sum_{i=1}^{n} \log p\big(X = x^{(i)}\big) \;=\; \sum_{i=1}^{n} \log \sum_{t \in {\cal T}(x^{(i)})} p\big(X = x^{(i)}, T = t\big)$$

The marginal likelihood of the training corpus.

where ${\cal T}(x^{(i)})$ is the set of trees whose yield is $x^{(i)}$. A probabilistic context-free grammar (PCFG) is one model that gives such a joint distribution over strings x and context-free derivations t. Moreover, in a PCFG there is an efficient algorithm for marginalizing out the unobserved trees, so we may compute the marginal likelihood easily. Michael Collins has a nice tutorial on the formalism here, if you are unfamiliar. In a classic paper, Baker (1979) worked out the inside-outside algorithm for performing maximum likelihood estimation in PCFGs. This is a generalization of the Baum-Welch algorithm, which is just the name we give expectation-maximization (EM) when we apply it to hidden Markov models. What makes these algorithms more difficult to master than EM in, say, a Gaussian mixture model is that they involve a dynamic program. In the case of inside-outside, the inside pass runs in O(|x|³), i.e. it is cubic in the length of the sentence to be parsed.
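To make the marginalization concrete, here is a minimal sketch of the inside recursion for a PCFG in Chomsky normal form. The data structures and names are mine, not anyone's actual implementation; it is only meant to show where the cubic dynamic program comes from.

```python
import math
from collections import defaultdict

def inside_log_prob(sent, lexical, binary, root="S"):
    """Log marginal likelihood log p(x) = log sum_t p(x, t) for a PCFG in
    Chomsky normal form. `lexical[A][w]` holds p(A -> w) and
    `binary[A][(B, C)]` holds p(A -> B C). Runs in O(|x|^3) in the
    sentence length (times a grammar-size factor)."""
    n = len(sent)
    # beta[i][j][A]: total probability of A deriving the span sent[i:j]
    beta = [[defaultdict(float) for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(sent):                    # width-1 spans
        for A, rules in lexical.items():
            if w in rules:
                beta[i][i + 1][A] += rules[w]
    for width in range(2, n + 1):                   # wider spans, bottom up
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):               # split point
                for A, rules in binary.items():
                    for (B, C), p in rules.items():
                        beta[i][j][A] += p * beta[i][k][B] * beta[k][j][C]
    z = beta[0][n][root]
    return math.log(z) if z > 0 else float("-inf")

# A toy CNF grammar that covers sentence (1); the probabilities are made up.
lexical = {"D": {"the": 1.0},
           "N": {"engineer": 0.5, "LSTM": 0.5},
           "V": {"loves": 0.5, "training": 0.5}}
binary = {"NP": {("D", "N"): 1.0},
          "VP": {("V", "NP"): 0.5, ("V", "VP"): 0.5},
          "S":  {("NP", "VP"): 1.0}}
print(inside_log_prob("the engineer loves training the LSTM".split(),
                      lexical, binary))  # log(1/64): the sentence has one parse
```

Backpropagating through this recursion (or running the companion outside pass) yields the expected rule counts needed for EM, at the same cubic cost.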

PCFGs are an elegant and simple model for grammar induction. However, there is one problem: the field had accumulated massive empirical evidence that they did not work very well, in the sense that the latent trees they discovered did not match those annotated by humans. In fact, Klein and Manning (2002) present the surprising result that always choosing the right-branching tree for English does better than any set of PCFG parameters that EM had learned at that time. To be clear, this is not a theoretical property of EM or of PCFGs, but rather an empirical one about training a PCFG on corpus data with EM.

Figure 4 (a) from Klein and Manning (2002). Only compare RBRANCH and DEP-PCFG.
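For reference, the right-branching baseline is trivial to construct. A sketch of its unlabeled constituent spans (my own helper, purely illustrative):

```python
def right_branching_spans(n):
    """Spans (i, j) of a strictly right-branching binary tree over a
    length-n sentence: (0, n), (1, n), ..., (n - 2, n)."""
    return [(i, n) for i in range(n - 1)]

print(right_branching_spans(4))  # [(0, 4), (1, 4), (2, 4)]
```

That such a trivial heuristic outperformed EM-trained PCFGs is exactly what made the result so damning.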

The common wisdom in the literature was that either the training objective (marginal likelihood) or the optimization algorithm (EM or gradient descent) was to blame. Or maybe it was both! As regards the training objective, Smith and Eisner (2005) write "[t]o recover traditional or useful syntactic structure, it is not enough to maximize training data likelihood [Carroll and Charniak, 1992, inter alia], and EM is notorious for mediocre results." In other words, we need to do more than maximize likelihood. As for the problem of local maxima, Spitkovsky et al. (2013) and Gormley and Eisner (2013) attempt to construct global optimization techniques to keep EM from getting stuck in a bad local maximum of the likelihood. Neither line of work led to much success. This was such common knowledge that I was taught in an undergraduate NLP class that maximizing the likelihood of a PCFG was an inherently broken idea!

Now enter Kim et al. (2019). What makes this paper so exciting is that they show that training a PCFG with EM actually works! (Strictly speaking, they use SGD, but there is a tight relationship between EM and SGD; see Eisner (2016) for a discussion.) The only substantial change that Kim, Dyer and Rush make to the PCFG is to the parameterization of the model. In appendix A.1 of their paper, the authors show that a very simple, relatively shallow neural parameterization of a PCFG, which they term a "neural PCFG" in the table below, works very well. In this model, it is still tractable to compute gradients and to decode, i.e. to get the argmax of the distribution p(t | x) over trees, as nothing has structurally changed: the model is just smoothed better. (Note that I am ignoring the additional improvements from the compound PCFG model in this post, but the compound model is also interesting!) The strong performance of the neural PCFG shows that neither the model nor the optimization algorithm was inherently broken! Instead, we just had the wrong parameterization! If we simply share parameters in a distributed fashion, performance goes from abysmal (worse than right-branching) to amazing (close to state of the art for the task). It is of historical interest that Berg-Kirkpatrick et al. (2010) presented more or less the same model, but used hand-engineered features rather than learned ones.

Table 1 from Kim et al. (2019)

A note on the evaluation: the numbers from Klein and Manning (2002) are not directly comparable to those of Kim et al. (2019). What matters is the relative position of the right-branching baseline and the PCFG model in each.
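To give a flavor of what changing the parameterization means, here is a minimal sketch, in PyTorch, of a neurally parameterized PCFG: rule probabilities are softmaxes over scores computed from learned symbol embeddings, so rules share parameters through those embeddings rather than each getting its own free parameter. The layer sizes and names below are my own illustrative choices, not the exact architecture of Kim et al. (2019).

```python
import torch
import torch.nn as nn

class NeuralPCFGSketch(nn.Module):
    """A sketch of a neurally parameterized PCFG: every rule probability is a
    softmax over scores computed from learned symbol embeddings, so rules
    share parameters through those embeddings."""

    def __init__(self, n_nt, n_pt, vocab_size, dim=256):
        super().__init__()
        self.nt_emb = nn.Parameter(torch.randn(n_nt, dim))    # nonterminal embeddings
        self.pt_emb = nn.Parameter(torch.randn(n_pt, dim))    # preterminal embeddings
        self.root_emb = nn.Parameter(torch.randn(dim))        # start-symbol embedding
        n_children = n_nt + n_pt
        self.root_scorer = nn.Linear(dim, n_nt)               # scores for S -> A
        self.binary_scorer = nn.Linear(dim, n_children ** 2)  # scores for A -> B C
        self.terminal_scorer = nn.Linear(dim, vocab_size)     # scores for T -> w

    def log_rule_probs(self):
        # Each row is normalized with a softmax over all expansions of its
        # left-hand side, exactly as in an ordinary PCFG.
        root = self.root_scorer(self.root_emb).log_softmax(-1)
        binary = self.binary_scorer(self.nt_emb).log_softmax(-1)
        terminal = self.terminal_scorer(self.pt_emb).log_softmax(-1)
        return root, binary, terminal
```

These log-probabilities then feed into the same inside recursion sketched earlier, and the whole thing is trained end-to-end with SGD on the log marginal likelihood; nothing about inference changes.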

In short, I think this paper is an amazing breakthrough. The authors were brave enough to challenge an assumption that was considered canon. I look forward to fiddling around with their code and extending it! There are so many questions I have. For instance, can we show similar improvements for dependency-based grammar induction models? More broadly, PCFGs that do not explicitly account for movement are not great models of language in general. Indeed, there is no way the model can handle questions in English, e.g. "What does the engineer love to train?", where "train what" forms a constituent. Likewise, the model of Kim et al. (2019) will fail to model Irish (or any primarily VSO language) at all. To really make progress on grammar induction, we will need a model that handles all sorts of different languages.

Acknowledgements

Thanks to Sebastian Mielke, Jason Eisner, Adina Williams and Hagen Blix for providing feedback!
