Why the Deep Learning for AI Idea is Flawed

The Symbol Grounding Problem Revisited

Joachim De Beule
Code Biology
4 min read · Jun 4, 2014

--

Triggered by all the fuss about deep learning lately, I decided to read Yoshua Bengio's paper on learning deep architectures for AI. It's a fantastic paper, and deep learning is undoubtedly a step forward. It marks another swing of AI's dominant paradigm towards connectionism. But whether it will solve our deepest AI problems is far from certain. In this blog, I want to discuss why I think it won't. The reason is not new, however: it's just the symbol grounding problem again.

Deep Learning in a Nutshell

The goal of deep learning is to automate the detection of abstract concepts in data. Prototypical examples of abstract concepts are FACE and CAR. They are called abstract, I believe, because we think of them as consisting of several parts, so-called lower-level concepts such as EYES and HEADLIGHTS. These are in turn composed of even lower-level concepts, like EDGES and ORIENTATIONS, all the way down to PIXELS. Concepts at a lower level that interact in specific and correlated ways, be it spatially, temporally or causally, together form higher or more abstract concepts. In this view, abstract concepts are concepts that are higher up in the hierarchy of concepts.

The idea behind deep learning is that hierarchies of concepts can be made accessible in raw input data by passing it through cascades of nonlinear transformations. The difference between a picture of a cat and a picture of a house, for instance, is not a priori clear from looking at the binary image data. By transforming the data, the difference may become more apparent. This is what happens, for instance, when binary image data is transformed into an image of pixels on a screen. More technically, as more transformations are performed and combined hierarchically, an abstract feature space may emerge in which useful abstract concepts are well separated and easy to detect.
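To make the cascade idea concrete, here is a minimal sketch in Python with NumPy. The layer sizes, the ReLU nonlinearity and the random, untrained weights are placeholders of my own choosing, not part of any particular deep architecture:

```python
import numpy as np

def relu(x):
    # A simple nonlinearity; each layer applies one after a linear map.
    return np.maximum(0.0, x)

# Hypothetical layer sizes: raw pixels -> edges -> parts -> abstract features.
layer_sizes = [784, 256, 64, 16]

# Random, untrained weights, purely for illustration.
rng = np.random.default_rng(0)
weights = [rng.standard_normal((m, n)) * 0.01
           for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]

def deep_transform(x, weights):
    # Pass the raw input through the cascade; each step yields a new,
    # hopefully more "abstract", feature space.
    for W in weights:
        x = relu(x @ W)
    return x

raw_image = rng.random(784)          # stand-in for a flattened raw image
abstract_features = deep_transform(raw_image, weights)
print(abstract_features.shape)       # (16,) -- the top of the hierarchy
```

In a real deep learning system the weights would of course be learned, so that the final feature space separates the concepts of interest rather than being an arbitrary projection.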

Deep learning assumes that it is a priori possible to find a hierarchy of useful concepts in a large collection of raw real-world data, e.g. all pictures of real things on the Internet. This is a reasonable assumption; after all, people easily distinguish between pictures of faces and pictures of cars. Nevertheless, recent findings indicate otherwise. More likely, there are many hierarchies to find, possibly none of which fully captures our own concept of CAR. Furthermore, deciding which among the many hierarchies are useful and which just capture noise is not something that can be done based on the raw data alone. This is not an assumption of existing deep architectures (as far as I know, all of them involve a supervised training phase), but it is something that we will have to think about if we really want to build AI.

Turtles all the Way Down

Imagine that we wish to build a system that autonomously learns to recognise images of cars. Raw images are in binary format, so that each image is just a series of zeros and ones, that is, a large binary number. In other words, we want our system to find a function that maps a number to the label "car" if the corresponding picture contains a car, and to the label "no car" if it doesn't. In turn, these labels are numbers themselves. In the standard ASCII character set they are represented by the numbers 099097114 and 110111032099097114. So in fact we want our system to find a mapping between numbers representing images and numbers representing English labels. There is a practically infinite number of such mappings. Moreover, even if our system were to stumble upon the CAR/NO-CAR distinction, it still would not know that it had actually found the mapping we are looking for. No matter how many cascades of non-linear transformations are performed on the raw data, the result will always be just numbers.
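As an aside, the numbers above are simply the zero-padded ASCII codes of each label's characters, concatenated. A throwaway sketch (the helper name is mine) reproduces them:

```python
def label_to_number(label):
    # Concatenate the zero-padded ASCII code of each character.
    return "".join(f"{ord(c):03d}" for c in label)

print(label_to_number("car"))     # 099097114
print(label_to_number("no car"))  # 110111032099097114
```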

The Symbol Grounding Problem

This is only where the trouble starts, however. The real problem is that the final mapping, the one from the numbers 099097114 and 110111032099097114 to the English labels "car" and "no car", is symbolic. The defining characteristic of a symbolic mapping is that it is arbitrary. In the terminology of physics, it is degenerate: there just is no explanation for it in the data.

In fact the same problem already appears at the beginning: unless one knows how to decode a sequence of zeros and ones into an image of pixels on screen (the image's codec), the sequence could code for anything or nothing at all. For all you know, it is an encoding of a Michael Jackson song. It's just a number, remember!
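A tiny illustration of the point: the very same bytes mean completely different things depending on which codec you assume. The four-byte sequence below is an arbitrary example of my own:

```python
data = bytes([77, 84, 104, 100])    # an arbitrary four-byte sequence

# Read as ASCII text it spells "MThd" -- the header of a MIDI music file.
print(data.decode("ascii"))         # MThd

# Read as a big-endian integer it is just a number.
print(int.from_bytes(data, "big"))  # 1297377380

# Read as greyscale pixel values it is four pixels of an "image".
print(list(data))                   # [77, 84, 104, 100]
```

Nothing in the bytes themselves tells you which of these readings, if any, is the intended one.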

A problem is well defined if the question it asks is clear, that is, when it is clear how to test whether an answer is in fact a solution to the problem. The central question left unanswered by deep learning is how a learning system can decide autonomously whether what it has learned is useful or not. In other words, we must understand the distinction between the world of meaningless symbol manipulation and numbers on the one hand, and the real world that matters on the other. This is the symbol grounding problem.
