Compound Probabilistic Context-Free Grammars for Grammar Induction: Where to go from here


Ryan Cotterell and Adina Williams


After my (Ryan’s) recent blog post reviewing Kim et al. (2019), I (Ryan) received a request to comment on the restrictions of their neural PCFG model, both in its application to English and, more broadly, in its application to other languages whose syntax is typologically different from that of English. Of course, the natural thing to do when asked to comment on something that is not 100% in your area of expertise, in this case theoretical syntax, is to contact a domain expert, who is played by Adina here. This blog post serves as a tutorial on a theoretical syntactic construct known as movement and how it relates to grammar induction; along the way, it discusses several open problems in the area. In my previous post, I (Ryan) explained why linguists think hierarchical constituency structure exists and how previous PCFG-based models had failed, for more than 30 years, to reliably induce it from raw text. While Kim et al. (2019)’s model does not “solve” grammar induction in any sense of the word “solve”, it is quite an exciting development, and much more work needs to be done on the topic. This post, however, deals with a slightly different issue: Kim et al. (2019)’s model is not expressive enough to handle the full scope of the task, which is much more involved than the restricted version of grammar induction that Kim et al. treat. We (Ryan and Adina) will elucidate this point below through examples and discuss potential ways forward.

Disclaimer: The syntactic analysis here is done using classical techniques developed during the 1960s. Many, many developments have been made since, in a myriad of theoretical paradigms.

The Problem: Why We Grammar-Induce

Cognitive Relevance

An NLP “Task”

Another relevant point is that grammar induction has been leaderboarded, as most NLP tasks have, in a way that discourages many linguists from working on the task. By leaderboarded, we mean that it is often very difficult to publish a paper on grammar induction at an NLP conference unless it achieves state-of-the-art results, even if the paper might have genuine scientific impact! Many linguists find the idea of leaderboarding inherently unscientific, as it relegates understanding to secondary importance behind the empirical performance of a system.

The Linguistics: Discontinuous Constituents

(1) The engineer trains the LSTM.

(2) What does the engineer train?

In (1), [trains [the LSTM]] forms a constituent. I (Ryan) discussed this in detail in my previous post. Importantly, this is a continuous constituent in the sense that the words comprising the constituent form a contiguous span of the sentence. (If you are familiar with the formal properties of context-free grammars (CFGs), a constituent under that model is the yield of a given non-terminal. In the case of a CFG, this yield will always be a contiguous span of the sentence.) However, when analyzing sentences like (2), many linguists have argued that [train what] also forms a constituent (Chomsky 1975), by analogy to [trains [the LSTM]]. As should be obvious, this constituent is discontinuous, i.e., the words that make up the constituent do not form a contiguous string. What may not be obvious, however, is that this creates a problem for many (most?) grammar inducers in the literature. First, let’s get into why we might want it to be the case that [train what] forms a constituent, and then we’ll explain one theoretical device, known as movement, that is common in generative syntax for modeling this. Afterwards, we’ll get into some potential extensions to Kim et al. (2019).
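To make the “yield” point concrete, here is a minimal sketch using NLTK with a toy grammar we wrote for sentence (1); the grammar and its non-terminal labels are ours, purely for illustration.

```python
import nltk

# A toy CFG for sentence (1); the rules and labels are ours, for illustration only.
grammar = nltk.CFG.fromstring("""
    S  -> NP VP
    NP -> Det N
    VP -> V NP
    Det -> 'the' | 'The'
    N  -> 'engineer' | 'LSTM'
    V  -> 'trains'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("The engineer trains the LSTM".split()):
    tree.pretty_print()
    # Every constituent is the yield of some non-terminal, and that yield is
    # always a contiguous span of the sentence; e.g. the VP node yields the
    # contiguous string "trains the LSTM".
    for subtree in tree.subtrees(lambda t: t.label() == "VP"):
        print(subtree.leaves())   # ['trains', 'the', 'LSTM']
```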

The primary reason for considering [train what] as a single constituent is that “what” is the direct object of the verb “train” in (2), and we might desire that a verb and its direct object always form a constituent, as they do in (1). If you need convincing that it is reasonable to analyze [train what] as a constituent, consider this (colloquial) English sentence:

(3) The engineer trains what?

Although you might only utter (3) in contexts where a loud noise occurred while a friend was talking and obscured the object of their sentence, it is basically equivalent in meaning to (2). Also, many languages, e.g. Chinese, do not exhibit wh-fronting at all. Consider the following two Mandarin sentences:

(4) 你 (nǐ; you) 喜欢 (xǐhuān; like) 纽约 (niǔyuē; New York).

(5) 你 (nǐ; you) 喜欢 (xǐhuān; like) 什么 (shénme; what)?

We see that (4) and (5) pattern alike: [xǐhuān [niǔyuē]] and [xǐhuān [shénme]] are both contiguous, so the constituency structure is more in line with the linear word order of the language than in English. This phenomenon is known as wh-in-situ. The linguistics community has put forth a large number of arguments that [train what] should in fact be a constituent; we don’t have time to rehash them here and instead refer the reader to Adger (2003), Carnie (2006), or another syntax textbook for a summary of these arguments.

The motivation for movement is the following wish: Wouldn’t it be great if it were true that all verb phrases had similar structure? In other words, that the hierarchical structure in (1) and (2) reflected the fact that the grammatical relations between “trains” and “the LSTM”, in (1), and “train” and “what”, in (2), were the same? Indeed, we might want to make a much broader typological statement, perhaps even a “language universal”, that verbs and their direct objects form a constituent in all languages. Noam Chomsky thought this would be great, too, and, in fact, he introduced the idea of transformational grammar to solve problems just like this one (Chomsky 1965). The idea of transformational grammar is really simple: Sentences have two trees: an underlying tree and a surface tree. In the underlying tree, the position of the direct object in (1) and (2) is the same, as seen in their parse trees displayed below. (Recall: Grammars are nothing more than mathematical models of language. If you go high enough up in the ontology of mathematical modeling, grammars are fundamentally no different in purpose from neural networks, in that both are nothing more than precise mathematical expressions that (attempt to) describe the dynamics of natural language.)

Parse tree for (1)
Parse tree for (2) before movement

So, if we grant that the above parse trees are good underlying representations of the sentences (1) and (2), the question becomes, how do we get the word ordering right? Here enters movement. Movement is a transformation that takes an underlying tree, like the one for (2), and transforms it into the surface tree, shown below.

Parse tree for (2) after movement

Note: We have avoided talking about the phenomenon in English known as do-insertion, whereby the verb “to do” makes a surprise appearance in (2).

Traditionally, arrows are shown indicating the original position of the words and their new position. The element t is called a trace and is left unpronounced — it only serves to mark the original position of “what.”
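For concreteness, here is a toy sketch of such a transformation on NLTK trees. The tree shapes and labels are simplified and ours (do-insertion is ignored, as above); this is only meant to show the mechanics of fronting a wh-phrase and leaving a trace.

```python
from nltk import Tree

# A simplified underlying tree for (2); labels and structure are ours.
underlying = Tree.fromstring(
    "(S (NP (Det the) (N engineer)) (VP (V train) (NP (WH what))))")

def front_wh(tree):
    """Toy movement: detach the wh-phrase from object position, leave behind
    an unpronounced trace t, and re-attach the wh-phrase at the front."""
    moved = tree.copy(deep=True)
    vp = moved[1]                      # the VP node
    wh_phrase = vp[1]                  # the wh-object
    vp[1] = Tree("NP", ["t"])          # the trace marks the original position
    return Tree("S'", [wh_phrase, moved])

surface = front_wh(underlying)
print(" ".join(surface.leaves()))      # what the engineer train t
```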

If you’re a computer scientist (to whom this blog post is directed), you might ask how to compute such a transformation. The answer, although very much an aside, lies in tree automata. In fact, such tree-to-tree transductions were one of the first use cases of tree automata. If you’re an NLPer, you can find uses of tree automata for various NLP applications in Graehl, Knight, and May (2008).

Of course, you might say that questions aren’t that common in English — to the consternation of many an NLPer working on question answering — and that maybe treating them in some other way is good enough. But wait: there are entire languages that require an in-depth movement analysis if we want verbs and their direct objects to form constituents! Let’s consider the case of Modern Irish, whose default word order is verb-subject-object. Here is a sentence in Modern Irish, taken from Carnie (2006, p. 150):

(6) Phóg (kissed) an fear (the man) an mhuc (the pig)

Again, if we want it to be a language universal that verbs and their direct objects form constituents, we require [phóg an mhuc] (kissed the pig) to be a single constituent, and it is once more a discontinuous one. For the same reasons as with English questions, a PCFG-based model of grammar induction cannot easily handle this case. One traditional analysis of (6) involves movement, as the tree below shows:

Parse tree for (6) after movement

Another likely suspect for syntactic movement is relative clauses (i.e., DPs embedding a larger clause):

(7) I like the man who the cat loves ___.

Here too, it seems we need to move the wh-element from the object position to the front of the DP, leaving a trace behind, to get the correct word order. Of the topics studied in linguistics, syntactic movement is among the most heavily studied, and we can only give the coarsest of examples here (i.e., we leave out the differences between phrasal movement and head movement, and many other important distinctions). But we feel that, given these examples, syntactic movement, or something that accomplishes the same goals, is completely indispensable. If we want an accurate description of the known linguistic facts, so that we can make good predictions about previously unstudied languages, we need to acknowledge the existence of movement phenomena.

Before we leave this topic, it is worth mentioning two things: (1) movement certainly isn’t the only source of so-called discontinuous constituents (for the curious, other relevant phenomena might include control and binding, which doubtless have implications for NLP tasks such as coreference resolution), and (2) linguistics research into these topics has advanced considerably since 1965 (e.g., things as arcane as multidominance have been proposed); even though our understanding of transformations has advanced, we use them here largely for expository purposes.

The Limitations of Kim et al. (2019)

PTB-style parse tree for (2)

Note that we made use of a different non-terminal labeling scheme, but grammar induction is typically not evaluated using the non-terminal labels, so this does not matter.

With this (insufficient) gold annotation, the model could actually get full credit for having discovered the syntactic structure of the question without having figured out that [train what] forms a constituent. No bueno. In other words, Kim et al. treat only a toy-ish version of the full task of grammar induction. Despite being toy, it is still incredibly difficult, as 30 years of research have shown, to create a good grammar inducer. So, how do we get around this limitation and start modeling discontinuous constituents? One way would be to explicitly learn a model of movement. The inimitable Jason Eisner has a great way of doing this that we haven’t fully wrapped our heads around yet, but he’s easy to buttonhole at conferences if you want his (hot) take on it. Another possible solution lies in work in formal language theory (including but not limited to work by T. Graf, S. Laszakovits, and colleagues) on grammar formalisms that allow for discontinuous constituents. One such formalism is tree-adjoining grammar (TAG). Below is an example of a TAG tree with a discontinuous constituent.

TAG analysis of “What does John like”, taken from Fei Xia’s thesis.

As shown by Vijay-Shanker and Weir (1994), TAG admits an O(|x|⁶) algorithm to sum over all trees. It would be relatively straightforward to adapt Kim et al. (2019)’s model to TAG, but it might be computationally infeasible, since O(|x|⁶), while polynomial, is still slow.

In the previous post, we argued for constituents pretty heavily, but it’s often the hierarchical relationships between words that matter more to linguists than precisely capturing all the phrasal constituents. You might wonder, then, why we talked about constituents at all. Well, recovering constituent boundaries is specifically the task that Kim et al. (2019) worked on. Indeed, it is evaluated by unlabeled constituency F1, as in Table 1 of Kim et al. (2019). You might also ask why the oracle parse in their Table 1 is not 100%. The reason is that PTB trees are not binary, while the induced trees are: any binary tree that covers a flatter gold tree necessarily contains spans that are not gold constituents, so even an oracle binary tree cannot reach 100% F1.
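To see why binarization alone caps the oracle, here is a rough sketch using NLTK. The ditransitive example sentence and the span-extraction helper are ours; the point is just that binarizing a flat gold tree introduces spans that are not gold constituents, so precision, and hence F1, stays below 100%.

```python
from nltk import Tree

def spans(tree, start=0):
    """Return the set of (start, end) spans of all constituents in an nltk Tree."""
    result, i = set(), start
    for child in tree:
        if isinstance(child, Tree):
            result |= spans(child, i)
            i += len(child.leaves())
        else:
            i += 1                     # a single word
    result.add((start, i))
    return result

# A flat (ternary-branching) gold VP, as PTB-style trees often have.
gold = Tree.fromstring(
    "(S (NP (D the) (N engineer)) "
    "(VP (V gave) (NP (D the) (N model)) (NP (D a) (N gradient))))")

binary = gold.copy(deep=True)
binary.chomsky_normal_form()           # binarize the gold tree

g, b = spans(gold), spans(binary)
precision = len(g & b) / len(b)        # < 1: binarization added a span
recall = len(g & b) / len(g)           # = 1: every gold span survives
print(precision, recall)
```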

Building on this deemphasis of constituency, we move to dependency grammar. If you are not tied to the idea of recovering contiguous constituents in the first place, then the issues posed by discontinuous constituents go away. Indeed, ignoring constituency is effectively how dependency grammar (Tesnière 1959) handles the problem of discontinuous constituents. As an example, the dependency parses of (1) and (2) are shown below:

Dependency parse of (1)
Dependency parse of (2)

In the case of (1), we have that [the LSTM] has “trains” as its head and, in the case of (2), we have that [what] has “train” as its head; this is exactly what we wanted. Here, we do not need an underlying structure for (2) in which [train what] is a contiguous constituent. In fact, the way we show that “what” has the same relationship to “train” as “the LSTM” has to “trains” is with an arc labeled with the syntactic relation (here, ‘DOBJ’). Indeed, in our assessment, one of the reasons that the Universal Dependencies project annotates dependency trees, rather than constituency trees, is that dependency grammars dispense with the need for both an underlying and a surface tree, which would require better-trained annotators to find the underlying positions of many words.
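As a concrete representation, here is a minimal sketch of the two dependency analyses above as (index, word, head, label) tuples; the label names are illustrative rather than drawn from any particular treebank.

```python
# Dependency analyses of (1) and (2) as (index, word, head_index, label) tuples,
# with tokens numbered from 1 and 0 standing for the artificial root.
sentence_1 = [  # "The engineer trains the LSTM."
    (1, "The",      2, "det"),
    (2, "engineer", 3, "nsubj"),
    (3, "trains",   0, "root"),
    (4, "the",      5, "det"),
    (5, "LSTM",     3, "dobj"),    # direct object of "trains"
]

sentence_2 = [  # "What does the engineer train?"
    (1, "What",     5, "dobj"),    # same relation to the verb as "the LSTM" above
    (2, "does",     5, "aux"),
    (3, "the",      4, "det"),
    (4, "engineer", 5, "nsubj"),
    (5, "train",    0, "root"),
]
```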

As a technical note, both of the dependency parses above are projective, i.e., the arcs do not cross when drawn above the sentence. Using a construction given by Johnson (2007), you can transform a projective dependency tree into a context-free parse. However, such a transformation will result in a CFG with spurious ambiguity, i.e., there is more than one derivation tree in the grammar that corresponds to a single derived dependency parse. This is problematic because the different derivation trees exhibit different constituency structure, which means that dependency grammar can be seen as underspecified with respect to constituency structure. Also note that, in the formalization of Kim et al. (2019), a grammar inducer is not required to recover syntactic heads, which are also specified in dependency grammar. So, a dependency-based grammar inducer has to discover head relationships (more information than Kim et al.’s model discovers), but does not have to discover phrasal constituency boundaries (less information than Kim et al.’s model discovers).
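Since projectivity comes up again below, here is a small sketch of how one might check it, given the head indices from the previous snippet; the crossing test and the simplified treatment of the root arc are ours.

```python
def is_projective(heads):
    """Check that no two dependency arcs cross, where `heads` gives the
    1-indexed head of each token and 0 marks the root (whose arc we
    ignore here for simplicity)."""
    arcs = [(min(i, h), max(i, h))
            for i, h in enumerate(heads, start=1) if h != 0]
    for a, b in arcs:
        for c, d in arcs:
            # two arcs cross iff exactly one endpoint of one arc
            # lies strictly inside the other arc
            if a < c < b < d:
                return False
    return True

# Heads for "What does the engineer train?" from the snippet above.
print(is_projective([5, 5, 4, 5, 0]))   # True: the parse shown is projective
```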

Where to go from here?

Table 1 from Pitler et al. (2013)

While English syntax is covered relatively well by projective trees, i.e., > 95% of sentences, Table 1 above shows that languages like Dutch are not well covered by such a restriction. In other words, a projective model like the DMV (Dependency Model with Valence), as applied to Dutch, could get the tree exactly right for at most 63.6% of sentences, which is not great. There are natural ways to extend dependency-based grammar inducers to handle non-projective trees, e.g. Cohen et al. (2011) give a model that admits exact inference for non-projective trees, albeit in O(|x|⁸) time. While Cohen et al.’s algorithm for exact marginal inference is quite slow, Carlos Gómez-Rodríguez, one of the more innovative researchers in NLP, has a series of recent papers that provide probabilistic models that can be normalized over a set of non-projective dependency trees in O(|x|⁴) time. We hope to find some time to meld that line of work with the recent proposal by Kim et al. in the near future.