Google Brain’s Co-inventor Tells Why He’s Building Chinese Neural Networks

Andrew Ng on the state of deep learning at Baidu

Caleb Garling
Published Feb 2, 2015


To chat with Andrew Ng I almost had to tackle him. He was getting off stage at Re:Work’s Deep Learning Summit in San Francisco when a mob of adoring computer scientists descended on (clears throat) the Stanford deep learning professor, former “Google Brain” leader, Coursera co-founder and now chief scientist at Chinese web giant Baidu.

Deep learning has become one of computing’s hottest topics, in large part due to the last decade of work by Geoff Hinton, now a top Googler. The idea is that if you feed a computer lots of images of, say, dogs, the computer will eventually learn to recognize canines. And if we can teach machines that, technophiles and businesspeople alike hope, machines will soon (truly, in the human sense) understand language and images. This approach is being applied to aims as disparate as having computers spot tumors and building travel guides that recognize the mood of a restaurant.

Ng and I chatted about the challenges he faces leading the efforts for “China’s Google” to understand our world through deep learning. Ng insists that Baidu is “only interested in tech that can influence 100 million users.” Despite the grand visions, he is very friendly and soft-spoken, the kind of person you’d feel really guilty interrupting.

(Conversation edited so it doesn’t read like two people trying to hear each other at a loud conference)

[Caleb Garling] Often people conflate the wiring of our biological brains with that of a computer neural network. Can you explain why that’s not accurate?

[Andrew Ng] A single neuron in the brain is an incredibly complex machine that even today we don’t understand. A single “neuron” in a neural network is an incredibly simple mathematical function that captures a minuscule fraction of the complexity of a biological neuron. So to say neural networks mimic the brain, that is true at the level of loose inspiration, but really artificial neural networks are nothing like what the biological brain does.
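[To make the "incredibly simple mathematical function" concrete, here is a minimal sketch of one artificial neuron. The sigmoid activation and the example numbers are assumptions chosen for illustration; Ng does not name a specific function in this conversation.]

```python
import math

def artificial_neuron(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias, squashed by a sigmoid."""
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-total))

# Illustrative values only: the weighted sum is 2.0 - 2.0 = 0.0,
# and the sigmoid of 0.0 is 0.5.
output = artificial_neuron([1.0, 0.0], [2.0, -3.0], bias=-2.0)
```

[That is the whole unit: a few multiplications, an addition and one squashing function.]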

[Caleb Garling] Today machines can recognize, say, a dog jumping. But what if someone is holding a piece of meat above the dog? We recognize that that’s a slightly different concept, a dog trick. And the piece of meat isn’t just a piece of meat, it’s a treat, a different linguistic idea. Can we get computers to understand these concepts?

[Andrew Ng] Deep learning algorithms are very good at one thing today: taking an input and learning to map it to an output. X to Y. Learning concepts is going to be hard.

One thing Baidu did several months ago was input an image — and the output was a caption. We showed that you can learn these input-output mappings. There’s a lot of room for improvement but it’s a promising approach for getting computers to understand these high level concepts.

[Earlier at the conference I asked Simon Osindero, who leads the deep learning photo tagging efforts at Flickr, the same question about machines learning concepts: “We’d need to take a completely different approach for that.”]

[Caleb Garling] Mandarin and English are very different languages, in almost every way. How different are the machine frameworks for understanding the two languages?

[Andrew Ng] Man…the technology isn’t mature enough for me to give a pithy answer. We have the English language. Now we’re figuring out Chinese.

English has 26 letters in its alphabet. Chinese has roughly 5000. If you look at a modest-sized English corpus, you’ll see everything in the alphabet. If you look at a Chinese corpus there might be characters that you only see once. So how do you learn to recognize that Chinese character?
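[The rare-character problem Ng describes is easy to see by counting. The sketch below lists the characters that occur only once in a piece of text; the function name and the sample sentence are made up for illustration and are not taken from any Baidu corpus.]

```python
from collections import Counter

def rare_characters(text, max_count=1):
    """Return the characters that appear at most max_count times in text."""
    counts = Counter(ch for ch in text if not ch.isspace())
    return sorted(ch for ch, n in counts.items() if n <= max_count)

# Made-up sample sentence, not drawn from a real corpus.
sample = "我爱我的猫猫爱鱼"
rare = rare_characters(sample)
```

[In a real corpus, any character on this list gives a learning system only a single example to learn from.]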

Romance languages are much easier. Moving from French to English is much easier than Chinese to English.

[Caleb Garling] So if you have a picture tagged in English, how do you convert those tags to Chinese?

[Andrew Ng] I think there are a lot of things still worth trying; they haven’t all been explored yet.

One thing we do see is multi-task learning. Let’s say that you have a network for recognizing images with English tags and you want to train a network to recognize things with Chinese tags. If you train one network to do both tasks, chances are you will do better on both tasks than you would with a separate English network and a separate Chinese network.

The gains haven’t been huge but there are gains. The reason is that at the first level [the machine] might learn to detect edges in an image, and then it might learn to detect corners. This is knowledge that is common to the two languages. Once you’ve learned to recognize an object in English it will actually help to learn it in Chinese as well because you can detect those edges and the objects.
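[One common way to set up multi-task learning is a shared lower layer feeding two task-specific output layers. The sketch below shows only that structure, with random untrained weights; the layer sizes, tag counts and every name in it are assumptions for illustration, not a description of Baidu’s system.]

```python
import random

def relu_layer(inputs, weights):
    """Fully connected layer with ReLU; weights holds one row per output unit."""
    return [max(0.0, sum(w * x for w, x in zip(row, inputs))) for row in weights]

def linear_layer(inputs, weights):
    """Fully connected layer with no activation, used for the tag scores."""
    return [sum(w * x for w, x in zip(row, inputs)) for row in weights]

def random_weights(n_out, n_in, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]

rng = random.Random(0)
shared = random_weights(4, 8, rng)        # shared features (edges, corners, ...)
english_head = random_weights(3, 4, rng)  # scores for 3 hypothetical English tags
chinese_head = random_weights(3, 4, rng)  # scores for 3 hypothetical Chinese tags

def predict(pixels):
    features = relu_layer(pixels, shared)  # computed once, reused by both heads
    return linear_layer(features, english_head), linear_layer(features, chinese_head)

image = [rng.random() for _ in range(8)]   # stand-in for a tiny image
english_scores, chinese_scores = predict(image)
```

[During training, errors from both sets of tags would update the shared layer, which is where the common knowledge about edges and corners would live.]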

[Caleb Garling] What about words that don’t exist in one language versus another?

[Andrew Ng] In English there is one word for sister. In Chinese, there are two separate words, for elder and younger sister. This is actually a translation problem because if you see the word sister, you don’t know how to translate it to Chinese because you don’t know if it’s an elder sister or younger. But I think that having recognized your sister as distinct from all of the other objects in the room makes it easier to add this additional distinction than if you’d had to learn the concept of a “sister” from scratch.

Training becomes more expensive — the exception is if your neural network is too small.

[Caleb Garling] Wait, what’s a small neural network?

[Andrew Ng] [Laughs] That changes every day. One metric we use is the number of connections between neurons. [Baidu] often trains neural networks with tens of billions of connections.
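[As a rough sense of how connection counts add up: in a fully connected network, every unit in one layer connects to every unit in the next. The sketch below counts connections under that assumption; the layer sizes are invented, and real networks, including convolutional ones, are wired differently.]

```python
def count_connections(layer_sizes):
    """Connections in a fully connected network: product of each adjacent pair of layers."""
    return sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))

# Hypothetical small network: 1000 inputs, 500 hidden units, 10 outputs.
layers = [1000, 500, 10]
total = count_connections(layers)  # 1000*500 + 500*10 = 505000
```

[Even this toy network has about half a million connections, still several orders of magnitude short of the tens of billions Ng mentions.]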

[Caleb Garling] Let’s talk about language recognition. Does Baidu look at specific sounds or letter combinations, like a “th” (called phonemes), and then work from there?

[Andrew Ng] That’s what speech recognition used to do. All speech recognition used to have this standard pipeline where you input the audio and try to predict the phonemes. And then you have another system for mapping from phonemes to words.

But recently there has been a debate over whether phonemes are a fundamental fact of language or a fantasy of linguists. I tried for years to convince people that phonemes are a human construct, not a fundamental fact of language. They are a description of language invented by humans. Many linguists vehemently disagreed with me, sometimes in public.

One of the things we did with the Baidu speech system was not use the concept of phonemes. It’s the same as the way a baby would learn: we show [the computer] audio, we show it text, and we let it figure out its own mapping, without this artificial construct called a phoneme.

I learned to speak English long before anyone taught me what a phoneme was.

[Caleb Garling] What about movies? Are you guys at Baidu looking at that?

[Andrew Ng] There’s been a bunch of work with deep learning with video. But I don’t think they’ve been very successful so far at using the time dimension in a fundamental way. This is something deep learning researchers debate: how fundamental is time to the development of intelligence in our systems?

[Caleb Garling] Um, can you elaborate on studying time?

[Andrew Ng] By moving your head, you see objects in parallax. (The idea being that you’re viewing the relationship between objects over time.) Some move in the foreground, some move in the background. We have no idea: Do children learn to segment out objects, learn to recognize distances between objects because of parallax? I have no idea. I don’t think anyone does.

There have been ideas dancing around some of the properties of video that feel fundamental but there just hasn’t yet been that result. My belief is that none of us have come up with the right idea yet, the right way to think about time.

Animals see a video of the world. If an animal were only to see still images, how would its vision develop? Neuroscientists have run experiments on cats in a dark environment with a strobe, so the cats can only see still images, and those cats’ visual systems actually underdevelop. So motion is important, but what is the algorithm? And how does [a visual system] take advantage of that?

I think time is super important but none of us have figured out the right algorithms for exploring it.

[That was all we had time for at the Deep Learning Summit. But I did get to ask Ng a followup via email.]

[Caleb Garling] Do you see AI as a potential threat?

[Andrew Ng] I’m optimistic about the potential of AI to make lives better for hundreds of millions of people. I wouldn’t work on it if I didn’t fundamentally believe that to be true. Imagine if we can just talk to our computers and have them understand “please schedule a meeting with Bob for next week.” Or if each child could have a personalized tutor. Or if self-driving cars could save all of us hours of driving.

I think the fears about “evil killer robots” are overblown. There’s a big difference between intelligence and sentience. Our software is becoming more intelligent, but that does not imply it is about to become sentient.

The biggest problem that technology has posed for centuries is the challenge to labor. For example, there are 3.5 million truck drivers in the US, whose jobs may be affected if we ever manage to develop self-driving cars. I think we need government and business leaders to have a serious conversation about that, and I think the hype about “evil killer robots” is an unnecessary distraction.
