Bengio v Marcus, and the Past, Present and Future of Neural Network Models of Language

The Past

A lot of researchers have worried for a long time about whether neural networks could generalize effectively enough to capture the richness of language. It’s been a major theme of my work since the 1990s, and before me Fodor and Pylyshyn, and Pinker and Prince, made closely related points in Cognition in 1988. Brenden Lake and his collaborators made similar points earlier this year.

To take but one example, here’s something I wrote on the topic in January:

Deep learning systems work less well when there are limited amounts of training data available, or when the test set differs importantly from the training set, or when the space of examples is broad and filled with novelty. And some problems cannot, given real-world limitations, be thought of as classification problems at all. Open-ended natural language understanding, for example, should not be thought of as a classifier mapping between a large finite set of sentences and large, finite set of sentences, but rather a mapping between a potentially infinite range of input sentences and an equally vast array of meanings, many never previously encountered.

The Present

Yoshua Bengio and his lab just wrote a paper about this, confirming (from inside the neural network community) what a bunch of outsiders from the cognitive science community, myself included, have said for a long time: current deep learning techniques can’t in fact cope with the complexity of language.

Here’s a quote from their abstract:

We put forward strong evidence that current deep learning methods are not yet sufficiently sample efficient when it comes to learning a language with compositional properties.

As an important aside that gets at an all-too-common issue in the current machine learning literature, there was ZERO discussion of the prior literature. That’s not good; we used to have a word for it, “unscholarly”, which meant that you followed in a direction blazed by earlier pioneers and pretended you were original. It’s not a very nice word. It applies.

In any event, I was thrilled that the Bengio lab had converged on the same things I had long been trying to say, and I posted this on Twitter:

Major news re #deeplearning & its limits: Yoshua Bengio’s lab confirms a key conclusion of Marcus 2001 and 2018: deep learning isn’t efficient enough with data to cope with the compositional nature of language. export.arxiv.org/abs/1810.08272

As is usually the case, my remarks drew a lot of antipathy from the deep learning community. Bengio responded to my tweet in a Facebook post the next day, which was brought to my attention:

Looks like a confusion of conclusions here. We said that current DL+RL does not seem yet satisfactory in terms of sample complexity to learn to understand compositional languages, based on our experiments. But that is very different a conclusion from Gary’s, in that we believe that we can continue making progress and expand on the current scientific foundation of deep learning and reinforcement learning. Gary was claiming a definitive negative “deep learning isn’t efficient enough with data to cope with the compositional nature of language”, whereas we think that current DL techniques can be augmented to better cope with the kind of compositionality which we need for systematic generalization (to new distributions with the same underlying causal mechanisms). This is precisely the kind of research we are undertaking, e.g., see my consciousness prior proposal (in arXiv).

In essence, he was saying we are not there — yet.

Well maybe. Then again, maybe not. Maybe deep learning — alone — will never get us there. We at least need to consider that possibility.

I first made that argument twenty years ago, in a principled way, based on how backpropagation works. Promises of unknown mechanisms and future success were issued immediately.

Those promises still haven’t been paid off: two decades (and billions of dollars of research) later, deep learning hasn’t made any notable progress on compositionality.

The only thing that has really changed in the last two decades is that the neural network community has finally started noticing the problem.

The Future

Bengio and I actually agree on a lot. We both agree current models won’t cut it. And we both agree that deep learning has to be (in his words) augmented.

The real question is what augmented means.

Bengio is free to spell out his view.

Here is what I think, which is the same prediction I have made for the last 20 years: deep learning must be augmented with some operations borrowed from classical symbolic systems. That is to say, we need hybrid models that take the best of classical AI (which allows for explicit representation of hierarchical structure and abstract rules) and combine it with the strengths of deep learning.

Many (not all) neural network advocates have tried to avoid adding such things to their networks. Adding them is not impossible; avoiding them is a matter of orthodoxy. Certainly deep learning alone has not thus far solved the problem. Maybe it is time to try something else.

Contra repeated misrepresentations from folks like Yann LeCun, I am not saying that deep learning won’t play a role in natural language understanding, only that deep learning can’t succeed on its own.

My prediction remains: without innate tools for compositionality that represent rules and structured representations (per what I argued in the 2001 book The Algebraic Mind), we will see little progress in neural network models of language comprehension.

Once the deep learning community stops needlessly defining itself in opposition to classical AI, we will begin to see progress.