Baidu explains how it’s mastering Mandarin with deep learning


On Aug. 8 at the International Neural Network Society conference on big data in San Francisco, Baidu senior research engineer Awni Hannun presented on a new model that the Chinese search giant has developed for handling voice queries in Mandarin. The model, which is accurate 94 percent of the time in tests, is based on a powerful deep learning system called Deep Speech that Baidu first unveiled in December 2014.

In this lightly edited interview, Hannun explains why his new research is important, why Mandarin is such a tough language to learn and where we can expect to see future advances in deep learning methods.


SCALE: How accurate is Deep Speech at transcribing Mandarin?

AWNI HANNUN: It has a 6 percent character error rate, which essentially means that it gets 6 out of every 100 characters wrong. To put that in context, this is, in my opinion and to the best of our lab’s knowledge, the best system in the world at transcribing Mandarin voice queries.

In fact, we ran an experiment where we had a few people at the lab who speak Chinese transcribe some of the examples that we were testing the system on. It turned out that our system was better at transcribing the examples than they were, provided they were restricted to transcribing without the help of the internet and similar aids.
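To make the metric concrete: character error rate is commonly computed as the edit distance (substitutions, insertions and deletions) between the system's transcript and a reference transcript, divided by the length of the reference. The Python sketch below illustrates that standard definition. The function names are illustrative, and this is not Baidu's evaluation code.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two character sequences."""
    # prev[j] holds the distance between the reference prefix seen so far
    # and the first j characters of the hypothesis.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution_cost = 0 if ref_char == hyp_char else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance divided by the number of reference characters."""
    if not reference:
        raise ValueError("reference transcript must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Under this definition, a transcript with one wrong character out of a six-character reference scores 1/6, and a 6 percent rate corresponds to six edits per 100 reference characters.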

What is it about Mandarin that makes it such a challenge compared with other languages?

There are a couple of differences with Mandarin that made us think it would be very difficult to have our English speech system work well with it. One is that it’s a tonal language, so when you say a word in a different pitch, it changes the meaning of the word, which is definitely not the case in English. In traditional speech recognition, it’s actually a desirable property that there is some pitch invariance, which essentially means that it tries to ignore pitch when it does the transcription. So you have to change a bunch of things to get a system to work with Mandarin, or any Chinese for that matter.

Awni Hannun. Source: Baidu

However, for us, it was not the case that we had to change a whole bunch of things, because our pipeline is much simpler than the traditional speech pipeline. We don’t do a whole lot of pre-processing on the audio in order to make it pitch-invariant, but rather just let the model learn what’s relevant from the data to most effectively transcribe it properly. It was actually able to do that fine in Mandarin without having to change the input.

The other thing that is very different about Chinese — Mandarin, in this case — is the character set. The English alphabet is 26 letters, whereas in Chinese it’s something like 80,000 different characters. Our system directly outputs a character at a time as it’s building its transcription, so we speculated it would be very challenging to have to do that on 80,000 characters at each step versus 26. That’s a challenge we were able to overcome just by using characters that people commonly say, which is a smaller subset.
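One way to picture the "smaller subset" idea is a vocabulary built from the characters that occur most often in a set of transcripts, with everything else mapped to a single placeholder. The Python sketch below is only an illustration under that assumption. The names `build_vocab`, `encode` and the `<unk>` token are hypothetical, and the interview does not describe how Baidu actually selected its character set.

```python
from collections import Counter

UNK = "<unk>"  # hypothetical placeholder for characters outside the subset


def build_vocab(transcripts, max_size):
    """Map the max_size most frequent characters to integer ids."""
    counts = Counter(ch for text in transcripts for ch in text)
    most_common = [ch for ch, _ in counts.most_common(max_size)]
    # Id 0 is reserved for the placeholder token.
    return {ch: idx for idx, ch in enumerate([UNK] + most_common)}


def encode(text, vocab):
    """Convert a transcript to ids, using the placeholder for rare characters."""
    return [vocab.get(ch, vocab[UNK]) for ch in text]


if __name__ == "__main__":
    transcripts = ["你好", "你好吗", "你"]
    vocab = build_vocab(transcripts, max_size=2)
    print(encode("你好吗", vocab))
```

With a vocabulary like this, the model chooses among a fixed, much smaller set of outputs at each step rather than among every character in the language.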

Baidu has been handling a fairly high volume of voice searches for a while now. How is the Deep Speech system better than the previous system for handling queries in Mandarin?

Baidu has a very active system for voice search in Mandarin, and it works pretty well. I think in terms of total query activity, it’s still a relatively small percentage. We want to make that share larger, or at least enable people to use it more by making the accuracy of the system better.

Can you describe the difference between a search-based system like Deep Speech and something like Microsoft’s Skype Translate, which is also based on deep learning?

Typically, the way it’s done is there are three modules in the pipeline. The first is a speech-transcription module, the second is the machine-translation module and the third would be the speech-synthesis module. What we’re talking about, specifically, is just the speech-transcription module, and I’m sure Microsoft has one as part of Skype Translate.

Our system is different from that system in that it’s more what we call end-to-end. Rather than having a lot of human-engineered components that have been developed over decades of speech research (by looking at the system and saying which features are important or which phonemes the model should predict), we just have some input data, which is an audio .WAV file on which we do very little pre-processing. And then we have a big, deep neural network that outputs directly to characters. We give it enough data that it’s able to learn what’s relevant from the input to correctly transcribe the output, with as little human intervention as possible.

One thing that pleasantly surprised us is how little we had to change. Other than scaling it up and giving it the right data, the system we showed in December, which worked really well on English, works remarkably well in Chinese, too.

What’s the usual timeline to get this type of system from R&D into production?

It’s not an easy process, but I think it’s easier than the process of getting a model to be very accurate — in the sense that it’s more of an engineering problem than a research problem. We’re actively working on that now, and I’m hopeful our research system will be in production in the near term.

Baidu has plans — and products — in other areas, including wearables and other embedded forms of speech recognition. Does the work you’re doing on search relate to these other initiatives?

We want to build a speech system that can be used as the interface to any smart device, not just voice search. It turns out that voice search is a very important part of Baidu’s ecosystem, so that’s one place we can have a lot of impact right now.

The Baidu Eye wearable computer. Source: Baidu

Embedding deep learning models everywhere

Is the pace of progress and significant advances in deep learning as fast as it seems?

I think right now, it does feel like the pace is increasing because people are recognizing that if you take tasks where you have some input and are trying to produce some output, you can apply deep learning to that task. If it was some old machine learning task such as machine translation or speech recognition, which has been heavily engineered for the past several decades, you can make significant advances if you try to simplify that pipeline with deep learning and increase the amount of data. We’re just on the crest of that.

In particular, processing sequential data with deep learning is something that we’re just figuring out how to do really well. We’ve come up with models that seem to work well, and we’re at the point where we’re going to start squeezing a lot of performance out of these models. And then you’ll see that right and left, benchmarks will be dropping when it comes to sequential data.

Beyond that, I don’t know. It’s possible we’ll start to plateau or we’ll start inventing new architectures to do new tasks. I think the moral of this story is: Where there’s a lot of data and where it makes sense to use a deep learning model, success is with high probability going to happen. That’s why it feels like progress is happening so rapidly right now.

It really becomes a story of “How can we get the right data?” when deep learning is involved. That becomes the big challenge.

A chart showing steady decline in the error rate for Mandarin transcription. Source: Baidu / Awni Hannun

Architecturally, Deep Speech runs on a powerful GPU-based system. Where are the opportunities to move deep learning algorithms onto smaller systems, such as smartphones, in order to offload processing from Baidu’s (or anyone else’s) servers?

That’s something I think about a lot, actually, and I think the future is bright in that regard. It’s certainly the case that deep learning models are getting bigger and bigger but, typically, it has also been the case that the size and expressivity of the model are more necessary during training than during testing.

There has been a lot of work that shows that if you take a model that has been trained at, say, 32-bit floating point precision and then compress it to 8-bit fixed point precision, it works just as well at test time. Or it works almost as well. You can reduce it by a factor of four and still have it work just as well.
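The factor of four comes straight from the bit widths: 32 bits per weight versus 8. The Python sketch below shows one simple scheme, symmetric linear quantization with a single scale factor, purely as an illustration. The interview does not say which scheme the work Hannun has in mind uses, and the function names here are hypothetical.

```python
from array import array


def quantize_int8(weights):
    """Map floats to signed 8-bit integers that share one scale factor."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Clamp to [-127, 127] so every value fits in a signed byte.
    quantized = array("b", (max(-127, min(127, round(w / scale))) for w in weights))
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the 8-bit values."""
    return [q * scale for q in quantized]


if __name__ == "__main__":
    weights = [0.5, -1.0, 0.25, 0.0]
    quantized, scale = quantize_int8(weights)
    float_bytes = len(array("f", weights).tobytes())  # 4 bytes per weight
    int8_bytes = len(quantized.tobytes())             # 1 byte per weight
    print(float_bytes // int8_bytes)
    print(dequantize(quantized, scale))
```

Each recovered weight differs from the original by at most half a quantization step, which gives some intuition for why a trained network can often tolerate the lower precision at test time.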

There’s also a lot of work in compressing existing models, like how can we take a giant model that we’ve trained to soak up a lot of data and then, say, train another, much smaller model to duplicate what that large model does. But that small model we can actually put into an embedded device somewhere.
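One widely used version of this idea is knowledge distillation, in which the small model is trained to match the large model's softened output probabilities rather than only the correct labels. The Python sketch below shows just the loss such a setup would minimize. It is an illustration of that general technique, not a description of Baidu's method, and the function names and the temperature value are assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert raw scores to probabilities; higher temperature is softer."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Cross-entropy between the teacher's and student's softened outputs."""
    targets = softmax(teacher_logits, temperature)
    predictions = softmax(student_logits, temperature)
    return -sum(t * math.log(p) for t, p in zip(targets, predictions))


if __name__ == "__main__":
    teacher = [2.0, 1.0, 0.1]
    print(distillation_loss(teacher, [2.0, 1.0, 0.1]))  # student agrees
    print(distillation_loss(teacher, [0.1, 1.0, 2.0]))  # student disagrees
```

The loss is lowest when the student's output distribution matches the teacher's, so minimizing it over many examples pushes the small model toward reproducing the large model's behavior.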

Often, the hard part is in training the system. In those cases, it needs to be really big and the servers have to be really beefy. But I do think there’s a lot of promising work with which we can make the models a lot smaller and there’s a future in terms of embedding them in different places.

Of course, something like search has to go back to cloud servers unless you’ve somehow indexed the whole web on your smartphone, right?

Yeah, that would be challenging.


For some additional context on just how powerful a system Deep Speech is — and why Baidu puts so much emphasis on systems architecture for its deep learning efforts — consider this explanation offered by Baidu systems research scientist Bryan Catanzaro:

“As with other deep neural networks, our system gets more and more accurate as it is trained on larger and larger datasets. [Baidu] researchers have been working hard to find large datasets from which our model can learn all the nuances and complexities of spoken Chinese, which is a very diverse language with many dialects and local accents. As we amass these datasets, we encounter interesting systems problems as we try to scale the training of our system.
“To give some context, training Deep Speech on our full Chinese dataset takes tens of exaflops — that’s more than 10 quintillion (billion billion) multiplications and additions. In order to evaluate whether a new neural network or additional data will improve Deep Speech, we have to wait for this training process to converge, which can take quite some time. Accordingly, the more rapidly we can train Deep Speech, the more ideas we can evaluate, and the more rapidly we make progress.
“This is why we pay special attention to systems issues when training our models. We have noticed that as we improve the efficiency of our training system, accuracy improvements follow rather directly. We parallelize the training of our system across multiple GPUs in order to reduce this training time. Our current system sustains more than 26 teraflops while training a single model on 8 GPUs, which allows us to train Deep Speech on a large dataset in a matter of days. We continue pushing the boundaries of scalability, because we’ve observed that our accuracy continues to improve as we scale our training set.”
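As a rough check on how those figures fit together, the Python sketch below divides a total operation count by a sustained rate. It assumes "tens of exaflops" refers to the total number of operations in a training run, takes the quoted lower bound of 10 quintillion operations, and assumes the 26-teraflop rate holds for the entire run. These are simplifications for illustration, not Baidu's accounting.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def training_days(total_operations, sustained_ops_per_second):
    """Estimate wall-clock days for a run at a constant sustained rate."""
    return total_operations / sustained_ops_per_second / SECONDS_PER_DAY


if __name__ == "__main__":
    lower_bound_ops = 10e18  # 10 quintillion operations (quoted lower bound)
    sustained_rate = 26e12   # 26 teraflops sustained on 8 GPUs (quoted figure)
    print(round(training_days(lower_bound_ops, sustained_rate), 2))
    print(round(training_days(2 * lower_bound_ops, sustained_rate), 2))
```

Under these assumptions the lower bound works out to roughly four and a half days, and doubling the operation count roughly doubles that, which is consistent with the "matter of days" Catanzaro describes.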