After my (Ryan’s) recent blog post reviewing Kim et al. (2019), I (Ryan) received a request to comment on the restrictions of their neural PCFG model — both in its application to English and, more broadly, in its application to other languages whose syntax is typologically different from that of English. Of course, the natural thing to do when asked to comment on something that is not 100% in your area of expertise, in this case theoretical syntax, is to contact a domain expert, who is played by Adina here. This blog post serves as a tutorial on a theoretical syntactic construct known as movement, explains how it relates to grammar induction and, thereby, discusses several open problems in the area. In my previous post, I (Ryan) explained why linguists think hierarchical constituency structure exists and how previous PCFG-based models had failed to reliably induce it from raw text for 30-plus years. While Kim et al. (2019)’s model does not “solve” grammar induction in any sense of the word “solve,” it is quite an exciting development, though much more work needs to be done on the topic. This post, however, deals with a slightly different issue: Kim et al. (2019)’s model is not expressive enough to handle the full scope of the task, which is much more involved than the restricted version of grammar induction that Kim et al. treat. We (Ryan and Adina) will elucidate this point below through examples and discuss potential ways forward.
Disclaimer: The syntactic analysis here is done using classical techniques developed during the 1960s. Many, many developments have been made since then, in a myriad of theoretical paradigms.
The Problem: Why We Grammar-Induce
Grammar induction is a weird task in NLP. We’ll explain how we feel about it and how we see the task as having evolved. The weirdness, as perceived by us, stems from the fact that grammar induction is a cognitively relevant task for the linguistics community, but, almost contradictorily, it is primarily investigated by people who are not always trained in linguistics.
The reason the NLP community works on grammar inducers is that grammar induction is cognitively relevant. Children manage to learn the syntax of their language based only on positive examples. So, shouldn’t computers be able to do the same? Unlike many tasks in NLP, e.g. question answering and machine translation, it is not clear grammar induction has any practical engineering or commercial purposes. Indeed, neither of us is one of those researchers who thinks that machine translators or chat bots, for example, need a notion of syntax to serve their commercial purpose. (However, for an alternative perspective, see Naradowsky, Riedel and Smith (2012) and Gormley, Mitchell, Van Durme and Dredze (2013) for the successful incorporation of grammar induction into semantic-role labeling and, potentially, other downstream tasks.) Thus, barring some revelation that we need latent syntax for state-of-the-art NLP models, the only reason for continued interest in grammar induction is cognitive. Personally, we find this motivation more than satisfactory and believe grammar induction to be one of the most exciting use cases for machine learning as applied to the scientific study of language (that’s just another way to say “linguistics,” by the way).
An NLP “Task”
On the other hand, grammar induction researchers are not as tuned into the cognitive science community as they could (or dare we say, should?) be. Most of those who work on grammar induction aren’t trained in cognitive science or linguistics and don’t stay up to date with that literature. (Note: We wrote most, not all.) Even though grammar induction is an inherently interdisciplinary task (linguistics + computer modeling), it is, unfortunately, worked on almost exclusively by one community: computer science. We’ve always found that this contradiction makes the task weird. And, sadly, it means that the larger NLP community that hosts the research can’t always appreciate the importance of grammar induction to science.
Another relevant point is that grammar induction has been leaderboarded, as most NLP tasks have, in a way that discourages many linguists from working on the task. By leaderboarded, we mean that it is often very difficult to publish a paper in an NLP conference on grammar induction that does not achieve state-of-the-art results, even if the paper might have genuine scientific impact! Many linguists find the idea of leaderboarding inherently unscientific, as it moves understanding to secondary importance behind the empirical performance of a system.
The Linguistics: Discontinuous Constituents
We’ll start with some linguistic observations that make grammar induction hard. Let’s consider the following two sentences:
(1) The engineer trains the LSTM.
(2) What does the engineer train?
In (1), [trains [the LSTM]] forms a constituent. I (Ryan) discussed this in detail in my previous post. Importantly, this is a continuous constituent in the sense that the words comprising the constituent form a contiguous span of the sentence. (If you are familiar with the formal properties of context-free grammars (CFGs), a constituent under that model is the yield of a given non-terminal. In the case of a CFG, this yield will always be a contiguous span of the sentence.) However, when analyzing sentences like (2), many linguists have argued that [train what] also forms a constituent (Chomsky 1975) by analogy to [trains [the LSTM]]. As should be obvious, this constituent is discontinuous, i.e., the words that make up the constituent do not form a contiguous string. What may not be obvious, however, is that this creates a problem for many (most?) grammar-inducers in the literature. First, let’s get into why we might want it to be the case that [train what] forms a constituent, and then we’ll explain one theoretical device, known as movement, common in generative syntax to model this. Afterwards, we’ll get into some potential extensions to Kim et al. (2019).
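To make the contiguity point concrete, here is a minimal sketch in Python. The tree encoding and the labels are our own illustrative choices (not Kim et al.’s grammar): in a CFG-style tree, every constituent’s yield is necessarily a contiguous span of the sentence.

```python
# A toy parse of sentence (1) as nested tuples: (label, children...).
# The structure and labels here are illustrative, not any paper's grammar.
tree = ("S",
        ("NP", ("D", "The"), ("N", "engineer")),
        ("VP", ("V", "trains"),
               ("NP", ("D", "the"), ("N", "LSTM"))))

def yields(node, out=None):
    """Collect the yield (left-to-right leaves) of every phrasal constituent."""
    if out is None:
        out = []
    label, *children = node
    if len(children) == 1 and isinstance(children[0], str):
        return [children[0]], out  # preterminal: its yield is just its word
    leaves = []
    for child in children:
        sub, _ = yields(child, out)
        leaves.extend(sub)
    out.append((label, tuple(leaves)))
    return leaves, out

_, constituents = yields(tree)
for label, span in constituents:
    print(label, span)
# Every yield printed here is a contiguous substring of the sentence; a
# CFG can never produce a discontinuous yield like ("train", "what").
```

Running this on any tree built from nested children will only ever produce contiguous spans, which is exactly why a PCFG backbone cannot represent [train what].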
The primary reason for considering [train what] as a single constituent is that “what” is the direct object of the verb “train” in (2), and we might desire that a verb and its direct object always form a constituent, as they do in (1). If you need convincing that it is reasonable to analyze [train what] as a constituent, consider this (colloquial) English sentence:
(3) The engineer trains what?
Although you might only utter (3) in contexts when a loud noise occurred while a friend was talking and obscured the object of their sentence, it is basically equivalent in meaning to (2). Also, consider that many languages, e.g. Chinese, do not exhibit wh-fronting. Consider the two Mandarin sentences
(4) 你 (nǐ; you) 喜欢 (xǐhuān; like) 纽约 (niǔyuē; New York).
(5) 你 (nǐ; you) 喜欢 (xǐhuān; like) 什么 (shénme; what)?
We see that (4) and (5) pattern similarly: the constituency structures of [xǐhuān [niǔyuē]] and [xǐhuān [shénme]] are in line with the linear word order of the language. This phenomenon is known as wh-in-situ. The linguistics community has put forth a large number of arguments that [train what] should in fact be a constituent; we don’t have time to rehash them here, and instead refer the reader to Adger (2003), Carnie (2006), or another syntax textbook for a summary of these arguments.
The motivation for movement is the following wish: Wouldn’t it be great if it were true that all verb phrases had similar structure? In other words, that the hierarchical structure in (1) and (2) reflected the fact that the grammatical relations between “trains” and “the LSTM”, in (1), and “train” and “what”, in (2), were the same? Indeed, we might want to make a much broader typological statement, perhaps even a “language universal” that verbs and their direct objects form a constituent in all languages. Noam Chomsky thought this would be great, too, and, in fact, he introduced the idea of transformational grammar to solve problems just like this one (Chomsky 1965). The idea of transformational grammar is really simple: Sentences have two trees: an underlying tree and a surface tree. In the underlying tree, the position of the direct object in (1) and (2) is the same, as seen in their parse trees displayed below. (Recall: Grammars are nothing more than mathematical models of language. If you go high enough up in the ontology of mathematical modeling, grammars are fundamentally no different in purpose than neural networks in that both grammars and neural networks are nothing more than precise mathematical expressions that (attempt to) describe the dynamics of natural language.)
So, if we grant that the above parse trees are good underlying representations of the sentences (1) and (2), the question becomes, how do we get the word ordering right? Here enters movement. Movement is a transformation that takes an underlying tree, like the one for (2), and transforms it into the surface tree, shown below.
Note: We have avoided talking about the phenomenon in English known as do-insertion, whereby the verb “to do” makes a surprise appearance in (2).
Traditionally, arrows are shown indicating the original position of the words and their new position. The element t is called a trace and is left unpronounced — it only serves to mark the original position of “what.”
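For the computationally minded, a transformation like wh-movement can be sketched as a small tree-to-tree function. This is a toy illustration under our own assumptions: trees are nested tuples, and labels such as CP and WH are simplifications of ours, not a faithful rendering of any particular syntactic theory.

```python
# Underlying tree for (2), ignoring do-insertion as noted in the text.
# The encoding (label, children...) and the labels are our own toy choices.
underlying = ("S",
              ("NP", ("D", "the"), ("N", "engineer")),
              ("VP", ("V", "train"), ("WH", "what")))

def wh_movement(tree):
    """Front the wh-element, leaving an unpronounced trace 't' behind."""
    def extract(node):
        label, *children = node
        if label == "WH":
            return node, ("WH", "t")   # swap the moved word for a trace
        if len(children) == 1 and isinstance(children[0], str):
            return None, node          # preterminal: nothing to move here
        moved, rebuilt = None, []
        for child in children:
            m, c = extract(child)
            moved = moved or m
            rebuilt.append(c)
        return moved, (label, *rebuilt)
    wh, remainder = extract(tree)
    return ("CP", wh, remainder)       # adjoin the wh-word at the front

surface = wh_movement(underlying)
# The surface tree fronts "what" and leaves a trace in object position.
print(surface)
```

Real transformational analyses are, of course, far richer, but this captures the mechanics: one tree in, one tree out, with the trace marking the original position.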
If you’re a computer scientist (to whom this blog post is directed), you might ask about how to compute such a transformation. The answer, although very much an aside, lies in tree automata. In fact, such tree-to-tree transductions were one of the first use cases of tree automata. If you’re an NLPer, you can find uses of tree-automata for various NLP applications in Graehl, Knight and May (2008).
Of course, you might say that questions aren’t that common in English — to the consternation of many an NLPer working on question answering — and maybe treating them in another way is good enough. But, wait, there are entire languages that require an in-depth movement analysis if we want verbs and their direct objects to form constituents! Let’s consider the case of Modern Irish, whose default word order is verb-subject-object. Here is a sentence in Modern Irish taken from Carnie (2006; pg 150):
(6) Phóg (kissed) an fear (the man) an mhuc (the pig)
Again, if we want it to be a language universal that verbs and their direct objects form constituents, we require [phóg an mhuc] (kissed the pig) to be a single constituent. This is again a discontinuous constituent. Of course, for the same reasons as English questions, a PCFG-based model of grammar induction cannot easily model this case. One traditional analysis of (6) involves movement as the tree below shows:
One other likely suspect for syntactic movement is relative clauses (i.e., DPs embedding a larger clause):
(7) I like the man who the cat loves ___.
Here too, it seems we need to have a trace of the wh-element in the object position and move it to the front of the DP, to get the correct word order. Of the topics studied in linguistics, syntactic movement is by far one of the most popular, and we can only give the coarsest of examples here (i.e., we leave out the differences between phrasal movement and head movement, and many other important distinctions). But, we feel that given these examples, syntactic movement, or something that accomplishes the same goals, is completely indispensable. If we want an accurate description of the known linguistic facts, so that we can make good predictions about previously unstudied languages, we need to acknowledge the existence of movement phenomena.
Before we leave this topic, it is worth mentioning two things: (1) movement certainly isn’t the only example of so-called discontinuous constituents (for the curious, other examples of discontinuous constituents might include control and binding, which doubtless have implications for NLP tasks such as coreference resolution), and (2) linguistics research into these topics has advanced considerably since 1965 (e.g., things as arcane as multidominance have been proposed), and even if our understanding of transformations has advanced, they are largely used here for explication purposes.
The Limitations of Kim et al. (2019)
Now, building on the intuition given by the examples above, let’s dive into the limitations of Kim et al. (2019)’s grammar-inducer. The point of a constituency-based grammar-inducer is to recover phrasal constituent boundaries without direct supervision. That is, the model takes a raw text corpus as input and returns some representation, perhaps a tree from a context-free grammar, as the output. The model, as Kim and company describe, cannot handle discontinuous constituents, because it has a PCFG as a backbone. Thus, there is no way the model can return a representation in which [train what] forms a constituent in (2), or in which [phóg [an mhuc]] forms a constituent in (6). The evaluation in this paper, and in most other papers on grammar induction, is a hack that forces all trees to have continuous constituents, even though this is linguistically insufficient. To see this, look at the following Penn Treebank-style parse of a question:
Note that we made use of a different non-terminal labeling scheme, but grammar induction is typically not evaluated using the non-terminal labels, so this does not matter.
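For reference, grammar inducers are typically scored with unlabeled F1 over constituent spans, which ignores the non-terminal labels entirely. A minimal sketch of that metric (the span indices below are hypothetical, not drawn from any treebank):

```python
# A sketch of unlabeled-F1 evaluation over constituent spans, as used in
# grammar-induction papers; the specific spans here are illustrative.
def unlabeled_f1(pred, gold):
    """F1 over unlabeled (start, end) constituent spans."""
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    p, r = tp / len(pred), tp / len(gold)
    return 2 * p * r / (p + r)

# "What does the engineer train ?" -- token indices 0..5. A gold tree
# restricted to continuous constituents has no span for [train what],
# so a model that never posits discontinuous constituents can still
# score a perfect 1.0.
gold = {(0, 6), (2, 4), (1, 5)}   # hypothetical continuous bracketing
pred = {(0, 6), (2, 4), (1, 5)}
print(unlabeled_f1(pred, gold))   # -> 1.0
```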
With this (insufficient) gold annotation, the model could actually get full credit for having discovered the syntactic structure of the question without having figured out that [train what] forms a constituent. No bueno. In other words, Kim et al. treat only a toy-ish version of the full task of grammar induction. Despite being toy, it is still incredibly difficult, as 30 years of research have shown, to create a good grammar-inducer. So, how do we get around this limitation and start modeling discontinuous constituents? One way would be to explicitly learn a model of movement. The inimitable Jason Eisner has a great way of doing this that we haven’t fully wrapped our heads around yet, but he’s easy to buttonhole at conferences if you want his (hot) take on it. Another possible solution lies in formal language theory (including but not limited to work by T. Graf, S. Laszakovits and colleagues), which has produced grammar formalisms that allow for discontinuous constituents. One such formalism is tree-adjoining grammar (TAG). Below is an example of a TAG tree with a discontinuous constituent.
As was shown by Vijay-Shanker and Weir (1994), TAG admits an O(|x|⁶) algorithm to sum over all trees. It would be relatively straightforward to adapt Kim et al. (2019)’s model to TAG, but it might be computationally infeasible, since O(|x|⁶), while polynomial, is still slow.
In the previous post, we argued for constituents pretty heavily, but it’s often the hierarchical relationships between words that matter to linguists more than precisely capturing all the phrasal constituents. You might wonder why we talked about constituents at all, then? Well, recovering constituent boundaries is specifically the task that Kim et al. (2019) worked on. Indeed, it is evaluated by unlabeled constituency F1 as in Table 1 of Kim et al. (2019). You might also ask why the oracle parse in their Table 1 is not 100%. The reason is that PTB trees are non-binary and these non-binary trees are being approximated with binary trees, so the induced binary trees will always fall short of the gold constituent boundaries from the gold non-binary trees.
Building on this deemphasis of constituency, we move to dependency grammar. If you are not tied to the idea that constituents are underlyingly continuous, then the issue posed by discontinuous constituents goes away. For example, ignoring constituency is effectively how dependency grammar (Tesnière 1959) handles the problem of discontinuous constituents. As an example, the dependency parses of (1) and (2) are shown below:
In the case of (1), we have that [the LSTM] has “trains” as a head and, in the case of (2), we have that [what] has “train” as a head — this is exactly what we wanted. Here, we do not have an underlying structure in (2) where [train what] is a contiguous constituent. In fact, the way we show that “what” has the same relationship to “train” as “the LSTM” does to “trains” is with an arc labeled with the syntactic relation (here, ‘DOBJ’). Indeed, in our assessment, one of the reasons that the Universal Dependencies project annotates dependency trees, rather than constituency trees, is that dependency grammars dispense with needing both an underlying and a surface tree, which would require better-trained annotators to find the underlying positions of many words.
As a technical note, both of the dependency parses above are projective, i.e., the arcs do not cross when drawn above the sentence. Using a construction given by Johnson (2007), you can transform a projective dependency tree into a context-free parse. However, such a transformation will result in a CFG with spurious ambiguity, i.e., there is more than one derivation tree in the grammar that corresponds to a single derived dependency parse. This is problematic, because the different derivation trees exhibit different constituency structure. This means that dependency grammar can be seen as underspecified with respect to constituency structure. Also note that, in the formalization of Kim et al. (2019), a grammar inducer is not required to recover the syntactic heads, which are also specified in dependency grammar. So, a dependency-based grammar inducer has to discover head relationships (more information than Kim et al.’s model discovers), but does not have to discover phrasal constituency boundaries (less information than Kim et al.’s model discovers).
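The projectivity condition is easy to check mechanically. Here is a small sketch, with toy head arrays of our own devising (the head indices are illustrative, not drawn from any treebank):

```python
# A sketch of the projectivity check: a dependency tree is projective
# iff no two arcs cross when drawn above the sentence.
def is_projective(heads):
    """heads[i] is the head index of token i (the root's head is -1)."""
    arcs = [(min(i, h), max(i, h)) for i, h in enumerate(heads) if h >= 0]
    for (a, b) in arcs:
        for (c, d) in arcs:
            # Arcs (a, b) and (c, d) cross iff one endpoint of (c, d)
            # lies strictly inside (a, b) and the other strictly outside.
            if a < c < b < d:
                return False
    return True

# "The engineer trains the LSTM" -- toy heads matching the parse in the
# text: "trains" is the root, "LSTM" depends on "trains", determiners on
# their nouns.
print(is_projective([1, 2, -1, 4, 2]))   # -> True
# A toy configuration with crossing arcs is non-projective:
print(is_projective([2, 3, -1, 2]))      # -> False
```

This quadratic-time check is fine for illustration; the point for grammar induction is that the DMV's derivations only ever generate head arrays for which this function returns True.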
Where to go from here?
A natural next step for extending Kim et al. (2019) is to use their parameterization for the Dependency Model with Valence (DMV) of Klein and Manning (2004). However, models for grammar induction need to handle all of the world’s languages. After all, a child can learn the syntax of any language that it is exposed to at a young age. The major restriction here is that the DMV model only handles projective dependency trees.
While English syntax is covered by projective trees relatively well, i.e. > 95%, Table 1 above shows that languages like Dutch are not well covered by such a restriction. In other words, the DMV, as applied to Dutch, could not get the syntax right for more than 63.6% of sentences, which is not great. There are natural ways to extend dependency-based grammar inducers to handle non-projective trees, e.g. Cohen et al. (2011) give a model that admits exact inference for non-projective trees, albeit in O(|x|⁸) time. While Cohen et al.’s algorithm for exact marginal inference is quite slow, Carlos Gómez-Rodríguez, one of the more innovative researchers in NLP, has a series of recent papers that provide probabilistic models that can be normalized over a set of non-projective dependency trees in O(|x|⁴) time. We hope to find some time to meld that line of work with the recent proposal by Kim et al. in the near future.