In defense of skepticism about deep learning

“All truth passes through three stages: First, it is ridiculed. Second, it is violently opposed. Third, it is accepted as self-evident.”
— Often attributed to Schopenhauer

In a recent appraisal of deep learning (Marcus, 2018) I outlined ten challenges for deep learning, and suggested that deep learning by itself, although useful, was unlikely to lead on its own to artificial general intelligence. I suggested instead the deep learning be viewed “not as a universal solvent, but simply as one tool among many.”

In place of pure deep learning, I called for hybrid models, that would incorporate not just supervised forms of deep learning, but also other techniques as well, such as symbol-manipulation, and unsupervised learning (itself possibly reconceptualized). I also urged the community to consider incorporating more innate structure into AI systems.

Within a few days, thousands of people had weighed in over Twitter, some enthusiastic (“e.g, the best discussion of #DeepLearning and #AI I’ve read in many years”), some not (“Thoughtful… But mostly wrong nevertheless”).

Because I think clarity around these issues is so important, I’ve compiled a list of fourteen commonly-asked queries. Where does unsupervised learning fit in? Why didn’t I say more nice things about deep learning? What gives me the right to talk about this stuff in the first place? What’s up with asking a neural network to generalize from even numbers to odd numbers? (Hint: that’s the most important one). And lots more. I haven’t addressed literally every question I have seen, but I have tried to be representative.

1. What is general intelligence?

Thomas Dietterich, an eminent professor of machine learning, and my most thorough and explicit critic thus far, gave a nice answer that I am very comfortable with:

“General intelligence” is a system that can behave intelligently across a wide range of goals and environments. See, for example, Russell and Norvig’s textbook and their definition of Intelligence as “Acting Rationally”.

2. Marcus wasn’t very nice to deep learning. He should have said more nice things about all of its vast accomplishments. And he minimizes others.

Dietterich, mentioned above, made both of these points, writing:

Disappointing article by @GaryMarcus. He barely addresses the accomplishments of deep learning (eg NL translation) and minimizes others (eg ImageNet with 1000 categories is small (“very finite”)

On the first part of that, true, I could have said more positive things. But it’s not like I didn’t say any. Or even like I forgot to mention Dietterich’s best example; I mentioned it on the first page:

Deep learning has since yielded numerous state of the art results, in domains such as speech recognition, image recognition, and language translation and plays a role in a wide swath of current AI applications.

More generally, later in the article I cited a couple of great texts and excellent blogs that have pointers to numerous examples. A lot of them though, would not really count as AGI, which was the main focus of my paper. (Google Translate, for example, is extremely impressive, but it’s not general; it can’t, for example, answer questions about what it has translated, the way a human translator could.)

The second part is more substantive. Is 1,000 categories really very finite? Well, yes, compared to the flexibility of cognition. Cognitive scientists generally place the number of atomic concepts known by an individual as being on the order of 50,000, and we can easily compose those into a vastly greater number of complex thoughts. Pets and fish are probably counted in those 50,000; pet fish, which is something different, probably isn’t counted. And I can easily entertain the concept of “a pet fish that is suffering from Ick”, or note that “it is always disappointing to buy a pet fish only to discover that it was infected with Ick” (an experience that I had as a child and evidently still resent). How many ideas like that I can express? It’s a lot more than 1,000.

I am not precisely sure how many visual categories a person can recognize, but suspect the math is roughly similar. Try google images on “pet fish”, and you do ok; try it on “pet fish wearing goggles” and you mostly find dogs wearing goggles, with a false alarm rate of over 80%.

Machines win over nonexpert humans on distinguishing similar dog breeds, but people win, by a wide margin, on interpreting complex scenes, like what would happen to a skydiver who was wearing a backpack rather than a parachute.

In focusing on 1,000 category chunks the machine learning field is, in my view, doing itself a disservice, trading a short-term feeling of success for a denial of harder, more open-ended problems (like scene and sentence comprehension) that must eventually be addressed. Compared to the essentially infinite range of sentences and scenes we can see and comprehend, 1000 of anything really is small. [See also Note 2 at bottom]

3. Marcus says deep learning is useless, but it’s great for many things

Of course it is useful; I never said otherwise, only that (a) in its current supervised form, deep learning might be approaching its limits and (b) that those limits would stop short from full artificial general intelligence — unless, maybe, we started incorporating a bunch of other stuff like symbol-manipulation and innateness.

The core of my conclusion was this:

Despite all of the problems I have sketched, I don’t think that we need to abandon deep learning.
Rather, we need to reconceptualize it: not as a universal solvent, but simply as one tool among many, a power screwdriver in a world in which we also need hammers, wrenches, and pliers, not to mention chisels and drills, voltmeters, logic probes, and oscilloscopes.

4. “One thing that I don’t understand. — @GaryMarcus says that DL is not good for hierarchical structures. But in @ylecun nature review paper [says that] that DL is particularly suited for exploiting such hierarchies.”

This is an astute question, from Ram Shankar, and I should have been a LOT clearer about the answer: there are many different types of hierarchy one could think about. Deep learning is really good, probably the best ever, at the sort of feature-wise hierarchy LeCun talked about, which I typically refer to as hierarchical feature detection; you build lines out of pixels, letters out of lines, words out of letters and so forth. Kurzweil and Hawkins have emphasized this sort of thing, too, and it really goes back to Hubel and Wiesel (1959)in neuroscience experiments and to Fukushima. (Fukushima, Miyake, & Ito, 1983) in AI. Fukushima, in his Neocognitron model, hand-wired his hierarchy of successively more abstract features; LeCun and many others after showed that (at least in some cases) you don’t have to hand engineer them.

But you don’t have to keep track of the subcomponents you encounter along the way; the top-level system need not explicitly encode the structure of the overall output in terms of which parts were seen along the way; this is part of why a deep learning system can be fooled into thinking a pattern of a black and yellow stripes is a school bus. (Nguyen, Yosinski, & Clune, 2014). That stripe pattern is strongly correlated with activation of the school bus output units, which is in turn correlated with a bunch of lower-level features, but in a typical image-recognition deep network, there is no fully-realized representation of a school bus as being made up of wheels, a chassis, windows, etc. Virtually the whole spoofing literature can be thought of in these terms. [Note 3]

The structural sense of hierarchy which I was discussing was different, and focused around systems that can make explicit reference to the parts of larger wholes. The classic illustration would be Chomsky’s sense of hierarchy, in which a sentence is composed of increasingly complex grammatical units (e.g., using a novel phrase like the man who mistook his hamburger for a hot dog with a larger sentence like The actress insisted that she would not be outdone by the man who mistook his hamburger for a hot dog). I don’t think deep learning does well here (e.g., in discerning the relation between the actress, the man, and the misidentified hot dog), though attempts have certainly been made.

Even in vision, the problem is not entirely licked; Hinton’s recent capsule work (Sabour, Frosst, & Hinton, 2017), for example, is an attempt to build in more robust part-whole directions for image recognition, by using more structured networks. I see this as a good trend, and one potential way to begin to address the spoofing problem, but also as a reflection of trouble with the standard deep learning approach.

5. “It’s weird to discuss deep learning in [the] context of general AI. General AI is not the goal of deep learning!”

Best twitter response to this came from University of Quebec professor Daniel Lemire: “Oh! Come on! Hinton, Bengio… are openly going for a model of human intelligence.”

Second prize goes to a math PhD at Google, Jeremy Kun, who countered the dubious claim that “General AI is not the goal of deep learning” with “If that’s true, then deep learning experts sure let everyone believe it is without correcting them.”

Andrew Ng’s recent Harvard Business Review article, which I cited, implies that deep learning can do anything a person can do in a second. Thomas Dietterich’s tweet that said in part “it is hard to argue that there are limits to DL”. Jeremy Howard worried that the idea that deep learning is overhyped might itself be overhyped, and then suggested that every known limit had been countered.

DeepMind’s recent AlphaGo paper [See Note 4] is positioned somewhat similarly, with Silver et al (Silver et al., 2017) enthusiastically reporting that:

Our results comprehensively demonstrate that a pure [deep] reinforcement learning approach is fully feasible, even in the most challenging of domains”

In that paper’s concluding discussion, not one of the 10 challenges to deep learning that I reviewed was mentioned. (As I will discuss in a paper coming out soon, it’s not actually a pure deep learning system, but that’s a story for another day.)

The main reason people keep benchmarking their AI systems against humans is precisely because AGI is the goal.

6. What Marcus said is a problem with supervised learning, not deep learning.

Yann LeCun presented a version of this, in a comment on my Facebook page:

I’ve had no time for a proper response, but in short: (1) I think it’s mostly wrong, and it would be considerably less wrong if all instances of “deep learning” were replaced by “supervised learning” in the paper. (2) finding ways to extend deep learning concepts for unsupervised learning and reasoning is exactly what I’ve been advocating in all my talk of the last 2.5 years. I haven’t just been advocating for it, I’ve actually been working on it … you are well aware of this, but it doesn’t transpire [sic] in your paper.”

The part about my allegedly not recognizing LeCun’s recent work is, well, odd. It’s true that I couldn’t find a good summary article to cite (when I asked LeCun, he told me by email that there wasn’t one yet) but I did mention his interest explicitly:

deep learning pioneers Geoff Hinton and Yann LeCun have both recently pointed to unsupervised learning as one key way in which to go beyond supervised, data-hungry versions of deep learning.

I also noted that:

To be clear, deep learning and unsupervised learning are not in logical opposition. Deep learning has mostly been used in a supervised context with labeled data, but there are ways of using deep learning in an unsupervised fashion.

My conclusion was positive, too. Although I expressed reservations about current approaches to building unsupervised systems, I ended optimistically:

If we could build [unsupervised] systems that could set their own goals and do reasoning and problem-solving at this more abstract level, major progress might quickly follow.

What LeCun’s remark does get right is that many of the problems I addressed are a general problem with supervised learning, not something unique to deep learning; I could have been more clear about this. Many other supervised learning techniques face similar challenges, such as problems in generalization and dependence on massive data sets; relatively little of what I said is unique to deep learning. In my focus on assessing deep learning at the five year resurgence mark, I neglected to say that.

But it doesn’t really help deep learning that other supervised learning techniques are in the same boat. If someone could come up with a truly impressive way of using deep learning in an unsupervised way, a reassessment might be required. But I don’t see that unsupervised learning, at least as it currently pursued, particularly remedies the challenges I raised, e.g., with respect to reasoning, hierarchical representations, transfer, robustness, and interpretability. It’s simply a promissory note. [Note 5]

As Portland State and Santa Fe Institute Professor Melanie Mitchell’s put it in a thus far unanswered tweet:

… @ylecunn says GM essay is “all wrong”, but “less wrong” if restricted to SL. I’d love to hear examples of (existing) non-SL projects that show GM’s args to be wrong.”

I would, too.

In the meantime, I see no principled reason to believe that unsupervised learning can solve the problems I raise, unless we add in more abstract, symbolic representations, first.

7. Deep learning is not just convolutional networks [of the sort Marcus critiqued], it’s “essentially a new style of programming — ”differentiable programming” — and the field is trying to work out the reusable constructs in this style. We have some: convolution, pooling, LSTM, GAN, VAE, memory units, routing units, etc” — Tom Dietterich

This seemed (in the context of Dietterich’s longer series of tweets) to have been proposed as a criticism, but I am puzzled by that, as I am a fan of differentiable programming and said so. Perhaps the point was that deep learning can be taken in a broader way.

In any event, I would not equate deep learning and differentiable programming (e.g., approaches that I cited like neural Turing machines and neural programming). Deep learning is a component of many differentiable systems. But such systems also build in exactly the sort of elements drawn from symbol-manipulation that I am and have been urging the field to integrate (Marcus, 2001; Marcus, Marblestone, & Dean, 2014a; Marcus, Marblestone, & Dean, 2014b), including memory units and operations over variables, and other systems like routing units stressed in the more recent two essays. If integrating all this stuff into deep learning is what gets us to AGI, my conclusion, quoted below, will have turned out to be dead on:

To the extent that the brain might be seen as consisting of “a broad. array of reusable computational primitives — elementary units of. processing akin to sets of basic. instructions in a microprocessor — perhaps wired together in parallel, as in the reconfigurable integrated circuit type known as the field-programmable gate array”, as I have argued elsewhere (Marcus, Marblestone, & Dean, 2014), steps towards enriching the instruction set out of which our computational systems are built can only be a good thing.

8. Now vs the future. Maybe deep learning doesn’t work now, but it’s offspring will get us to AGI.

Possibly. I do think that deep learning might play an important role in getting us to AGI, if some key things (many not yet discovered) are added in first.

But what we add matters, and whether it is reasonable to call some future system an instance of deep learning per se, or more sensible to call the ultimate system “a such-and-such that uses deep learning”, depends on where deep learning fits into the ultimate solution. Maybe, for example, in truly adequate natural language understanding systems, symbol-manipulation will play an equally large role as deep learning, or an even larger one.

Part of the issue here is of course terminological. A very good friend recently asked me, why can’t we just call anything that includes deep learning, deep learning, even if it includes symbol-manipulation? Some enhancement to deep learning ought to work. To which I respond: why not call anything that includes symbol-manipulation, symbol-manipulation, even if it includes deep learning?

Gradient-based optimization should get its due, but so should symbol-manipulation, which as yet is the only known tool for systematically representing and achieving high-level abstraction, bedrock to virtually all of the world’s complex computer systems, from spreadsheets to programming environments to operating systems.

Eventually, I conjecture, credit will also be due to the inevitable marriage between the two, hybrid systems that bring together the two great ideas of 20th century AI, symbol-processing and neural networks, both initially developed in the 1950s. Other new tools yet to be invented may be critical as well.

To a true acolyte of deep learning, anything is deep learning, no matter what it’s incorporating, and no matter how different it might be from current techniques. (Viva Imperialism!) If you replaced every transistor in a classic symbolic microprocessor with a neuron, but kept the chip’s logic entirely unchanged, a true deep learning acolyte would still declare victory. But we won’t understand the principles driving (eventual) success if we lump everything together. [Note 6]

9. No machine can extrapolate. It’s not fair to expect a neural network to generalize from even numbers to odd numbers.

Here’s a function, expressed over binary digits.

f(110) = 011;

f(100) = 001;

f(010) = 010.

What’s f(111)?

If you are an ordinary human, you are probably going to guess 111. If you are neural network of the sort I discussed, you probably won’t.

If you have been told many times that hidden layers in neural networks “abstract functions”, you should be a little bit surprised by this.

If you are a human, you might think of the function as something like “reversal”, easily expressed in a line of computer code. If you are a neural network of a certain sort, it’s very hard to learn the abstraction of reversal in a way that extends from evens in that context to odds. But is that impossible? Certainly not if you have a prior notion of an integer. Try another, this time in decimal: f(4) = 8; f(6) = 12. What’s f(5)? None of my human readers would care that questions happens to require you to extrapolate from even numbers to odds; a lot of neural networks would be flummoxed.

Sure, the function is undetermined by the sparse number of examples, like all functions, but it is interesting and important that most people would (amid the infinite range of a priori possible inductions), would alight on f(5)=10.

And just as interesting that most standard multilayer perceptrons, representing the numbers as binary digits, wouldn’t. That’s telling us something, but many people in the neural network community, François Chollet being one very salient exception, don’t want to listen.

Importantly, recognizing that a rule applies to any integer is roughly the same kind of generalization that allows one to recognize that a novel noun that can be used in one context can be used in a huge variety of other contexts. From the first time I hear the word blicket used as an object, I can guess that it will fit into a wide range of frames, like I thought I saw a blicket, I had a close encounter with a blicket, and exceptionally large blickets frighten me, etc. And I can both generate and interpret such sentences, without specific further training. It doesn’t matter whether blicket is or not similar in (for example) phonology to other words I have heard, nor whether I pile on the adjectives or use the word as a subject or an object. If most machine learning [ML] paradigms have a problem with this, we should have problem with most ML paradigms.

Am I being “fair”? Well, yes, and no. It’s true that I am asking neural networks to do something that violates their assumptions.

A neural network advocate might, for example, say, “hey wait a minute, in your reversal example, there are three dimensions in your input space, representing the left binary digit, the middle binary digit, and rightmost binary digit. The rightmost binary digit has only been a zero in the training; there is no way a network can know what to do when you get to one in that position.” For example, Vincent Lostenlan, a postdoc at Cornell, said

“I fail to understand what you’re trying to prove on 3.11. f is the identity function trained on the vertices of an (n-1)-dimensional hypercube in the input space. How are you surprised by a DNN — or indeed any ML model — not “generalizing” to the n’th dim?”

Dietterich, made essentially the same point, more concisely:

“Marcus complains that DL can’t extrapolate, but NO method can extrapolate.”

But although both are right about why odds-and-evens are (in this context) hard for deep learning, they are both wrong about the larger issues for three reasons.

First, it can’t be that people can’t extrapolate. You just did, in two different examples, at the top of this section. Paraphrasing Chico Marx. who are you going to believe, me or your own eyes?

To someone immersed deeply — perhaps too deeply — in contemporary machine learning, my odds-and-evens problem seems unfair because a certain dimension (the one which contains the value of 1 in the rightmost digit) hasn’t been illustrated in the training regime. But when you, a human, look at my examples above, you will not be stymied by this particular gap in the training data. You won’t even notice it, because your attention is on higher-level regularities.

People routinely extrapolate in exactly the fashion that I have been describing, like recognizing string reversal from the three training examples I gave above. In a technical sense, that is extrapolation, and you just did it. In The Algebraic Mind I referred to this specific kind of extrapolation as generalizing universally quantified one-to-one mappings outside of a space of training examples. As a field we desperately need a solution to this challenge, if we are ever to catch up to human learning — even if it means shaking up our assumptions.

Now, it might reasonably be objected that it’s not a fair fight: humans manifestly depend on prior knowledge when they generalize such mappings. (In some sense, Dieterrich proposed this objection later in his tweet stream.)

True enough. But in a way, that’s the point: neural networks of a certain sort don’t have a good way of incorporating the right sort of prior knowledge in the place. It is precisely because those networks don’t have a way of incorporating prior knowledge like “many generalizations hold for all elements of unbounded classes” or “odd numbers leave a remainder of one when divided by two” that neural networks that lack operations over variables fail. The right sort of prior knowledge that would allow neural networks to acquire and represent universally quantified one-to-one mappings. Standard neural networks can’t represent such mappings, except in certain limited ways. (Convolution is a way of building in one particular such mapping, prior to learning).

Second, saying that no current system (deep learning or otherwise) can extrapolate in the way that I have described is no excuse; once again other architectures may be in the choppy water, but that doesn’t mean we shouldn’t be trying to swim to shore. If we want to get to AGI, we have to solve the problem.

(Put differently: yes, one could certainly hack together solutions to get deep learning to solve my specific number series problems, by, for example, playing games with the input encoding schemes; the real question, if we want to get to AGI, is how to have a system learn the sort of generalizations I am describing in a general way.)

Third, the claim that no current system can extrapolate turns out to be, well, false; there are already ML systems that can extrapolate at least some functions of exactly the sort I described, and you probably own one: Microsoft Excel, its Flash Fill function in particular (Gulwani, 2011). Powered by a very different approach to machine learning, it can do certain kinds of extrapolation, albeit in a narrow context, by the bushel, e.g., try typing the (decimal) digits 1, 11, 21 in a series of rows and see if the system can extrapolate via Flash Fill to the eleventh item in the sequence (101).

Spoiler alert, it can, in exactly the same way as you probably would, even though there were no positive examples in the training dimension of the hundreds digit. The systems learns from examples the function you want and extrapolates it. Piece of cake. Can any deep learning system do that with three training examples, even with a range of experience on other small counting functions, like 1, 3, 5, …. and 2, 4, 6 ….?

Well maybe, but only the ones that are likely do so are likely to be hybrids that build in operations over variables, which are quite different from the sort of typical convolutional neural networks that most people associate with deep learning.

Putting all this very differently, one crude way to think about where we are with most ML systems that we have today [Note 7] is that they just aren’t designed to think “outside the box”; they are designed to be awesome interpolators inside the box. That’s fine for some purposes, but not others. Humans are better at thinking outside boxes than contemporary AI; I don’t think anyone can seriously doubt that.

But that kind of extrapolation, that Microsoft can do in a narrow context, but that no machine can do with human-like breadth, is precisely what machine learning engineers really ought to be working on, if they want to get to AGI.

10. Everybody in the field already knew this. There is nothing new here.

Well, certainly not everybody; as noted, there were many critics who think we still don’t know the limits of deep learning, and others who believe that there might be some, but none yet discovered.

That said, I never said that any of my points was entirely new; for virtually all, I cited other scholars, who had independently reached similar conclusions.

11. Marcus failed to cite X.

Definitely true; the literature review was incomplete. One favorite among the papers I failed to cite is Shanahan’s Deep Symbolic Reinforcement (Garnelo, Arulkumaran, & Shanahan, 2016); I also can’t believe I forgot Richardson and Domingos’ (2006) Markov Logic Networks. I also wish I had cited Evans and Edward Grefenstette (2017), a great paper from DeepMind. And Smolensky’s tensor calculus work (Smolensky et al., 2016). And work on inductive programming in various forms (Gulwani et al., 2015) and probabilistic programming, too, by Noah Goodman (Goodman, Mansinghka, Roy, Bonawitz, & Tenenbaum, 2012) All seek to bring rules and networks close to together.

And older stuff by pioneers like Jordan Pollack (Smolensky et al., 2016). And Forbus and Gentner’s (Falkenhainer, Forbus, & Gentner, 1989) and Hofstadter and Mitchell’s (1994) work on analogy; and many others. I am sure there is a lot more I could and should have cited.

Overall, I tried to be representative rather than fully comprehensive, but I still could have done better. #chagrin.

12. Marcus has no standing in the field; he isn’t a practitioner; he is just a critic.

Hesitant to raise this one, but it came up in all kinds of different responses, even from the mouths of certain well-known professionals. As Ram Shankar noted, “As a community, we must circumscribe our criticism to science and merit based arguments.” What really matters is not my credentials (which I believe do in fact qualify me to write) but the validity of the arguments.

Either my arguments are correct, or they are not.

[Still, for those who are curious, I supply an optional mini-history of some of my relevant credentials in Note 8 at the end.]

13. Re: hierarchy, what about Socher’s tree-RNNs?

I have written to him, in hopes of having a better understanding of its current status. I’ve also privately pushed several other teams towards trying out tasks like Lake and Baroni (2017) presented.

Pengfei et al (2017) offers some interesting discussion.

14. You could have been more critical of deep learning.

Nobody quite said that, not in exactly those words, but a few came close, generally privately.

One colleague for example pointed out that there may be some serious errors of future forecasting around

there’s this feeling that successes will accrue exponentially faster… It is far more likely that we’re on broad fitness plane with lots of low hanging fruit, but once that fruit is gone, progress towards deep reasoning [will be slower]. Moreover, it’s not clear why one should think that the harder problems of AGI, ethics, morality, etc, are easily attainable now that we can identify cats 95% of the time. The former kinds of problems likely exist in a much more intricate space.

The same colleague added

[Researchers] have been too quick to claim victory in some domains. For example image processing: We’ve found a class of image processing problems that computers are better at, sure, but those same algorithms can still be confused by adversarial attacks. Moreover, when they are wrong, they are often wrong in crazy ways. Contrary to this, when driving down the street, I might misidentify a tree as a lampost, but I wouldn’t make the sorts of bizarre errors that these DLN’s make (which is because I deeply understand meaning & context). It’s true that these limitations are frequently acknowledged, but at the same time, there is this underlying perspective that as a consequence of the Imagenet results, computers are better at image recognition than people.

Another colleague, ML researcher and author Pedro Domingos, pointed out still other shortcomings of current deep learning methods that I didn’t mention:

Like other flexible supervised learning methods, deep learning systems can be unstable in the sense that slightly changing the training data may result in large changes in the resulting model.

They can require lots of data even when less would suffice. (Data augmentation, in particular, is very costly and, judging from humans, should not be necessary.)
They can be brittle: a small change to the data can cause catastrophic failure (e.g., flipping black and white in a digit dataset (Hosseini, Xiao, Jaiswal, & Poovendran, 2017)).
There’s often less to their accuracy than we infer (e.g., Ribeiro, Singh and Guestrin (2016) found that highly accurate discrimination of wolves from dogs on a dataset extracted from ImageNet was the result of detecting white snow patches in the wolf images).
In the history of machine learning. so far, each paradigm has tended to dominate for about a decade before losing prominence (e.g., neural networks dominated in the 80s, Bayesian learning in the 90s, and kernel methods in the 2000s).

As Domingos notes, there’s no guarantee this sort of rise and decline won’t repeat itself. Neural networks have risen and fallen several times before, all the way back to Rosenblatt’s first Perceptron in 1957. We shouldn’t mistake cyclical enthusiasm for a complete solution to intelligence, which still seems (to me, anyway) to be decades away.

If we want to reach AGI, we owe it to ourselves to be as keenly aware of challenges we face as we are of our successes.


  1. Thanks to Amy Bernard, Josh Cohen, Ernie Davis, Shlomo Shraga Engelson, Jose Hernandez-Orallo, Adam Marblestone, Melanie Mitchell, Ajay Patel, Omar Uddin and Brad Wyble for comments.

2. There are other problems too in relying on these 1,000 image sets. For example, in reading a draft of this paper, Melanie Mitchell pointed me to important recent work by Loghmani and colleague (2017) on assessing how deep learning does in the real world. Quoting from the abstract, the paper “analyzes the transferability of deep representations from Web images to robotic data [in the wild]. Despite the promising results obtained with [representations developed from Web image], the experiments demonstrate that object classification with real-life robotic data is far from being solved.”

3. And that literature is growing fast. In late December there was a paper about fooling deep nets into mistaking a pair of skiers for a dog [] and another on a general-purpose tool for building real-world adversarial patches: (See also It’s frightening to think how vulnerable deep learning can be real-world contexts.

And for that matter consider Filip Pieknewski’s blog on why photo-trained deep learning systems have trouble transferring what they have learned to line drawings, Vision is not as solved as many people seem to think.

4. As I will explain in the forthcoming paper, AlphaGo is not actually a pure [deep] reinforcement learning system, although the quoted passage presented it as such. It’s really more of a hybrid, with important components that are driven by symbol-manipulating algorithms, along with a well engineered deep-learning component.

5. AlphaZero, by the way, isn’t unsupervised, it’s self-supervised, using self-play and simulation as a way of generating supervised data; I will have a lot more to say about that system in a forthcoming paper.

6. Consider, for example Google Search, and how one might understand it. Google has recently added in a deep learning algorithm, RankBrain, to the wide array of algorithms it uses for search. And Google Search certainly takes in data and knowledge and processes them hierarchically (which according to Maher Ibrahim is all you need to count as being deep learning). But, realistically, deep learning is just one cue among many; the knowledge graph component, for example, is based instead primarily on classical AI notions of traversing ontologies. By any reasonable measure Google Search is a hybrid, with deep learning as just one strand among many.

Calling Google Search as a whole. “a deep learning system” would be grossly misleading, akin to relabeling carpentry “screwdrivery”, just because screwdrivers happen to be involved.

7. Important exceptions include inductive logic programming, inductive function programming (the brains behind Microsoft’s Flash Fill) and neural programming. All are making some progress here; some of these even include deep learning, but they also all include structured representations and operations over variables among their primitive operations; that’s all I am asking for.

8. My AI experiments begin in adolescence, with, among other thing, a Latin-English translator that I coded in the programming language Logo. In graduate school, studying with Steven Pinker, I explored the relation between language acquisition, symbolic rules, and neural networks. (I also owe a debt to my undergraduate mentor Neil Stillings.) The child language data I gathered (Marcus et al., 1992) for my dissertation have been cited hundreds of times, and were the most frequently-modeled data in the 90’s debate about neural networks and how children learned language.

In the late 1990’s I discovered some specific, replicable problems with multilayer perceptrons, (Marcus, 1998b; Marcus, 1998a)); based on those observation, I designed a widely-cited experiment. published in Science (Marcus, Vijayan, Bandi Rao, & Vishton, 1999), that showed that young infants could extract algebraic rules, contra Jeff Elman’s (1990) then popular neural network. All of this culminated in a 2001 MIT Press book (Marcus, 2001), which lobbied for a variety of representational primitives, some of which have begun to pop up in recent neural networks; in particular that the use of operations over variables in the new field of differentiable programming (Daniluk, Rocktäschel, Welbl, & Riedel, 2017; Graves et al., 2016) owes something to the position outlined in that book. There was a strong emphasis on having memory records, as well, which can be seen in the memory networks being developed e.g., at Facebook (Bordes, Usunier, Chopra, & Weston, 2015).) The next decade saw me work on other problems including innateness (Marcus, 2004) (which I will discuss at length in the forthcoming piece about AlphaGo) and evolution (Marcus, 2004; Marcus, 2008), I eventually returned to AI and cognitive modeling, publishing a 2014 article on cortical computation in Science (Marcus, Marblestone, & Dean, 2014) that also anticipates some of what is now happening in differentiable programming.

More recently, I took a leave from academia to found and lead a machine learning company in 2014; by any reasonable measure that company was successful, acquired by Uber roughly two years after founding. As co-founder and CEO I put together a team of some of the very best machine learning talent in the world, including Zoubin Ghahramani, Jeff Clune, Noah Goodman, Ken Stanley and Jason Yosinski, and played a pivotal role in developing our core intellectual property and shaping our intellectual mission. (A patent is pending, co-written by Zoubin Ghahramani and myself.)

Although much of what we did there remains confidential, now owned by Uber, and not by me, I can say that a large part of our efforts were addressed towards integrating deep learning with our own techniques, which gave me a great deal of familiarity with joys and tribulations of Tensorflow and vanishing (and exploding) gradients. We aimed for state-of-the-art results (sometimes successfully, sometimes not) with sparse data, using hybridized deep learning systems on a daily basis.


Bordes, A., Usunier, N., Chopra, S., & Weston, J. (2015). Large-scale Simple Question Answering with Memory Networks. arXiv.

Daniluk, M., Rocktäschel, T., Welbl, J., & Riedel, S. (2017). Frustratingly Short Attention Spans in Neural Language Modeling. arXiv.

Elman, J. L. (1990). Finding structure in time. Cognitive science, 14(2)(2), 179–211.

Evans, R., & Grefenstette, E. (2017). Learning Explanatory Rules from Noisy Data. arXiv, cs.NE.

Falkenhainer, B., Forbus, K. D., & Gentner, D. (1989). The structure-mapping engine: Algorithm and examples. Artificial intelligence, 41(1)(1), 1–63.

Fukushima, K., Miyake, S., & Ito, T. (1983). Neocognitron: A neural network model for a mechanism of visual pattern recognition. IEEE Transactions on Systems, Man, and Cybernetics, 5, 826–834.

Garnelo, M., Arulkumaran, K., & Shanahan, M. (2016). Towards Deep Symbolic Reinforcement Learning. arXiv, cs.AI.

Goodman, N., Mansinghka, V., Roy, D. M., Bonawitz, K., & Tenenbaum, J. B. (2012). Church: a language for generative models. arXiv preprint arXiv:1206.3255.

Graves, A., Wayne, G., Reynolds, M., Harley, T., Danihelka, I., Grabska-Barwińska, A. et al. (2016). Hybrid computing using a neural network with dynamic external memory. Nature, 538(7626)(7626), 471–476.

Gulwani, S. (2011). Automating string processing in spreadsheets using input-output examples., 46(1)(1), 317–330.

Gulwani, S., Hernández-Orallo, J., Kitzelmann, E., Muggleton, S. H., Schmid, U., & Zorn, B. (2015). Inductive programming meets the real world. Communications of the ACM, 58(11)(11), 90–99.

Hofstadter, D. R., & Mitchell, M. (1994). The copycat project: A model of mental fluidity and analogy-making. Advances in connectionist and neural computation theory, 2(31–112)(31–112), 29–30.

Hosseini, H., Xiao, B., Jaiswal, M., & Poovendran, R. (2017). On the Limitation of Convolutional Neural Networks in Recognizing Negative Images. arXiv, cs.CV.

Hubel, D. H., & Wiesel, T. N. (1959). Receptive fields of single neurones in the cat’s striate cortex. The Journal of physiology, 148(3)(3), 574–591.

Lake, B. M., & Baroni, M. (2017). Still not systematic after all these years: On the compositional skills of sequence-to-sequence recurrent networks. arXiv.

Loghmani, M. R., Caputo, B., & Vincze, M. (2017). Recognizing Objects In-the-wild: Where Do We Stand? arXiv, cs.RO.

Marcus, G. F. (1998a). Rethinking eliminative connectionism. Cogn Psychol, 37(3)(3), 243 — 282.

Marcus, G. F. (1998b). Can connectionism save constructivism? Cognition, 66(2)(2), 153 — 182.

Marcus, G. F. (2001). The Algebraic Mind: Integrating Connectionism and cognitive science. Cambridge, Mass.: MIT Press.

Marcus, G. F. (2004). The Birth of the Mind : how a tiny number of genes creates the complexities of human thought. Basic Books.

Marcus, G. F. (2008). Kluge : the haphazard construction of the human mind. Boston : Houghton Mifflin.

Marcus, G. (2018). Deep Learning: A Critical Appraisal. arXiv.

Marcus, G.F., Marblestone, A., & Dean, T. (2014a). The atoms of neural computation. Science, 346(6209)(6209), 551 — 552.

Marcus, G. F., Marblestone, A. H., & Dean, T. L. (2014b). Frequently Asked Questions for: The Atoms of Neural Computation. Biorxiv (arXiv), q-bio.NC.

Marcus, G. F. (2001). The Algebraic Mind: Integrating Connectionism and cognitive science. Cambridge, Mass.: MIT Press.

Marcus, G. F., Pinker, S., Ullman, M., Hollander, M., Rosen, T. J., & Xu, F. (1992). Overregularization in language acquisition. Monogr Soc Res Child Dev, 57(4)(4), 1–182.

Marcus, G. F., Vijayan, S., Bandi Rao, S., & Vishton, P. M. (1999). Rule learning by seven-month-old infants. Science, 283(5398)(5398), 77–80.

Nguyen, A., Yosinski, J., & Clune, J. (2014). Deep Neural Networks are Easily Fooled: High Confidence Predictions for Unrecognizable Images. arXiv, cs.CV.

Pengfei, L., Xipeng, Q., & Xuanjing, H. (2017). Dynamic Compositional Neural Networks over Tree Structure IJCAI. Proceedings from Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI-17).

Ribeiro, M. T., Singh, S., & Guestrin, C. (2016). “Why Should I Trust You?”: Explaining the Predictions of Any Classifier. arXiv, cs.LG.

Richardson, M., & Domingos, P. (2006). Markov logic networks. Machine learning, 62(1)(1), 107–136.

Sabour, S., dffsdfdsf, N., & Hinton, G. E. (2017). Dynamic Routing Between Capsules. arXiv, cs.CV.

Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A. et al. (2017). Mastering the game of Go without human knowledge. Nature, 550(7676)(7676), 354–359.

Smolensky, P., Lee, M., He, X., Yih, W.-t., Gao, J., & Deng, L. (2016). Basic Reasoning with Tensor Product Representations. arXiv, cs.AI.