Alexa, why don’t you understand me?

Jakob Havtorn
Corti
Oct 4, 2018 · 11 min read
Figure 1. Audio, model and then what?

In January 2017, a morning show on San Diego’s CW6 News covered a story on how a little girl from Dallas, Texas, accidentally ordered a $300 doll house and four pounds of sugar cookies by asking the family’s Amazon Alexa if it wanted to play dollhouse. The purpose of the show was to discuss a new set of issues that consumers were facing, as these voice-based assistants had made their entry into our homes.

However, the story didn’t end there. It took a turn for the worse and highlighted a fundamental problem with the voice technology in these digital assistants. Charmed by the little girl, one of the anchors of the show made his closing remarks and said: “I love the little girl, saying ‘Alexa order me a dollhouse’”. Minutes later, viewers all over Southern California started reporting spontaneous purchases of dollhouses being made by their voice assistants, triggered by the anchor’s “order”.

As the usage of speech-based interfaces started to surge, everybody from ‘The Jetsons’ nostalgics to George Orwell disciples voiced their opinions on how our society would now permanently change in the hands of these new digital assistants. However, the many limitations and vulnerabilities of voice-based digital assistants like Amazon’s Alexa have since been the overshadowing reality for most of us, as we haven’t experienced these services go beyond simple command-based interfaces that only work under ideal circumstances.

The sad truth is that although we were promised access to transcending cognitive capabilities through voice-based interfaces, these interfaces keep failing at distinguishing between speakers, coping with background noise, or handling one of the many other complexities that are a natural part of almost all of our daily interactions. As a disappointing consequence, automatic speech recognition (ASR) still hasn’t made a notable impact on society, despite what headlines such as “Siri is better than you” might suggest.

Building real-time ASR to have an impact

At Corti, we need ASR to help the people who help us on the worst days of our lives. We have a strong belief that speech-based technologies will assist medical professionals when they perform critical decision making. That is why we are building real-time decision support software that assists 911 operators in diagnosing and triaging patients faster and more accurately in emergency situations. In these life or death scenarios, there is no room for mistaken dollhouse purchases.

The audio we deal with is dialogue-based; it’s noisy; it contains many speakers, dialects, and accents; and it is often as frantic as you can imagine an emergency being. We tested numerous ASR services and found that even the biggest ASR platforms perform at a level that is unusable in practice when applied to a conversation like an emergency call.

So although we are a startup with constrained resources, in order to make an impact, we decided to build our entire speech recognition pipeline from the ground up.

“We tested numerous ASR services and found that even the biggest ASR platforms perform at a level that is unusable in practice when applied to a conversation like an emergency call.”

A few questions had to be answered: What exactly happens that makes state-of-the-art ASR systems fail so often? And how do we then build an ASR platform that can actually make an impact during as critical a conversation as an emergency call?

As with all machine learning, good data is essential for training well-performing ASR systems. The data has to be representative of the task that is solved, and enough of it has to be available for the ASR model to extract something general from it.

The ASR models we interact with today are often trained on datasets consisting of audio recordings of a single speaker reading written text aloud, be it news reports as in the Wall Street Journal (WSJ) dataset or audiobooks as in LibriSpeech. This means we are training our ASR frameworks to understand human vocal exchange as a “single-player game” performed in optimal acoustic circumstances. Other datasets, such as Fisher and Switchboard, are based on phone calls in which two strangers calmly discuss a predefined topic such as pets or politics. None of these datasets significantly resemble the natural interaction of people actually talking to each other in authentic ways, which is why we keep seeing ASR fail.

When we see ASR systems, such as voice assistants, fail to work in any but the most simple and nicely conditioned situations, this can be understood as a case of failing to generalize. The idea here is that when an ASR model is trained, it learns to transcribe speech to text. After training, it is the model’s ability to generalize knowledge from the training data to new and previously unseen data that defines its usability. When models fail to generalize, it often comes down to something called overfitting: the model has learned the training data by heart rather than understanding the general features that define the relationship between speech and text as was intended.

State-of-the-art ASR models have often learned all they know from training on data that is not generally representative of natural conversational speech. In this way, ASR models become completely unaware of many of the natural conditions of speech. Imagine training self-driving cars in the same way: No obstacles, pedestrians, or bad weather conditions, only open highways.

“Imagine training self-driving cars in the same way: No obstacles, pedestrians, or bad weather conditions, only open highways.”

To show you an example of this failure to generalize, let’s have a look at a state-of-the-art open-source ASR system and how it performs on a real-life transcription task.

State of the Art and Mozilla Deep Speech

Initially released in November 2017, Mozilla Deep Speech is an open-source implementation of the architecture described in Baidu’s similarly named research papers, Deep Speech and Deep Speech 2. The project is the result of a painstaking effort by brilliant people at Mozilla and provides access to a high-performing pre-trained neural ASR model that can be used to transcribe the speech in audio to text. Besides all the technical challenges related to implementation, the performance of the model is largely determined by its training data and an extensive tuning process.
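To make this concrete, here is a minimal sketch of transcribing a 16 kHz mono WAV file with the pre-trained model through the deepspeech Python package. The package’s API has changed across releases (early versions also required an alphabet file and feature parameters), so treat this as illustrative of the later, simplified interface; the model and audio file names are placeholders.

```python
import wave

import numpy as np
import deepspeech

# Load the pre-trained acoustic model (and, optionally, an external language-model scorer).
model = deepspeech.Model("deepspeech-models.pbmm")       # placeholder path
model.enableExternalScorer("deepspeech-models.scorer")   # placeholder path

# Read a 16 kHz, 16-bit, mono WAV file into a numpy array of int16 samples.
with wave.open("example_16khz.wav", "rb") as wav:        # placeholder path
    audio = np.frombuffer(wav.readframes(wav.getnframes()), dtype=np.int16)

# Run speech-to-text on the raw samples.
print(model.stt(audio))
```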

Figure 2. Illustration of Mozilla’s Deep Speech project. Credit: Mozilla

Similarly to many voice assistants, Mozilla Deep Speech is trained on a combination of speech corpora, including TED-LIUM, LibriSpeech, Fisher, and Switchboard. Table 1 gives a brief overview of these datasets, but the short story is that the data consists of thousands of hours of audio, recorded in nice acoustic environments with only one or two well-defined speakers and a high signal-to-noise ratio.

Table 1. List of four common ASR datasets along with brief descriptions of the type of data they contain.

As a combined result of the modeling, the data and the tuning, Deep Speech achieves a word error rate (WER) of about 6.5% on LibriSpeech. Although not flawless, this is state-of-the-art in ASR, and we will use this model as an example of that.¹
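For reference, WER is the word-level edit distance between the model’s transcript and a human reference transcript (counting substitutions, insertions, and deletions), divided by the number of words in the reference. A minimal sketch, with made-up example sentences:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic-programming table for the Levenshtein distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = d[i - 1][j] + 1
            insertion = d[i][j - 1] + 1
            d[i][j] = min(substitution, deletion, insertion)
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution ("an" -> "a") and one deletion ("avenue") over six reference words.
print(wer("call an ambulance to fifth avenue",
          "call a ambulance to fifth"))  # 2 / 6 ≈ 0.33
```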

Enter: Noise

So, the training procedure of Mozilla’s Deep Speech model adheres largely to the current ASR paradigm and performs well on clear-spoken speech. However, as soon as we introduce some of the irregularities that occur in human conversations, the performance drops significantly.

To demonstrate the impact of noise, we partnered with one of the world’s premier fire departments, the Seattle Fire Department, to benchmark ASR performance on an anonymized dataset of 600 hours of real-life emergency call data. This data consists of real people calling a 911 dispatch center to report real emergencies. We then compared the results with Wall Street Journal recordings, a more or less noiseless dataset often used for training ASR models.²

The Deep Speech model is trained on audio with a sample rate of 16kHz, whereas Seattle’s emergency calls are recorded at 8kHz. That means Seattle’s dataset would need to be upsampled to be compatible with the model, which may negatively affect results. To even the playing field, we therefore resampled the WSJ audio from its native 16kHz down to 8kHz and back up to 16kHz.

To further mitigate the effect of upsampling, we employed two different resampling methods. The first method is simple: we upsample the audio by copying each amplitude value forward once, effectively doubling the number of samples. Aurally inspecting the upsampled audio (yes, ‘aurally’ is indeed a word) confirms that the speech is understandable, although slightly ‘metallic’ sounding. The second method uses the SoX library to perform a high-quality down- and upsampling, which produces better-sounding samples than the simple method.
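As a rough illustration of the two approaches, rather than our exact pipeline, the naive method amounts to a single numpy call, while the high-quality round trip can be delegated to the sox command-line tool; the file names and the choice of the soundfile package for WAV I/O are assumptions made for the sake of the example.

```python
import subprocess

import numpy as np
import soundfile as sf  # assumed here purely for WAV reading/writing

# Naive 8 kHz -> 16 kHz upsampling: forward-copy every sample once.
audio, rate = sf.read("call_8khz.wav")                    # placeholder path
sf.write("call_16khz_naive.wav", np.repeat(audio, 2), rate * 2)

# High-quality round trip on WSJ audio with the SoX command-line tool:
# 16 kHz -> 8 kHz -> 16 kHz, mirroring the degradation of the call audio.
subprocess.run(["sox", "wsj_16khz.wav", "-r", "8000", "wsj_8khz.wav"], check=True)
subprocess.run(["sox", "wsj_8khz.wav", "-r", "16000", "wsj_16khz_roundtrip.wav"], check=True)
```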

Testing the model on the noisy Seattle emergency call data and the WSJ data, with the different resampling approaches outlined above, results in the WERs shown in Table 2.

Table 2. Overview of WER results of the Deep Speech model on WSJ and emergency call data from the Seattle Fire Department.

On the Seattle emergency call data, Mozilla Deep Speech’s WER is more than eight times that on the original 16 kHz WSJ data, and almost five times that on WSJ after resampling by forward copying.

On the Seattle emergency call data, the high-quality SoX resampler yields a WER that is 3.2 percentage points lower than forward copying. On WSJ, SoX resampling yields a WER that is 3 percentage points lower than forward copying, but still 3.6 percentage points higher than on the original 16 kHz samples.

Whether the WER on the Seattle emergency call data would have been roughly 3 percentage points lower on the original 8 kHz data than on the upsampled version, as it is on WSJ, can only be hypothesized. Given the WSJ baseline, it is however more than likely that the poor performance on the Seattle emergency call data is only in small part due to resampling, while most of the performance drop is due to the much more challenging data domain.

“On the Seattle dispatch data, Deep Speech’s WER is more than eight times that on the original 16 kHz WSJ data […]”

To better understand what this poor WER means in practice, we can have a look at a few transcripts. For comparison, here are some transcripts from the native WSJ data. Try to see if you can understand each model transcript before looking at the reference.

The transcripts are quite nice, albeit with a few mistakes that mostly stay close to the reference transcript. Notice, however, how a small error in the final part of the third example begins to change the entire meaning of the sentence. This is one inherent danger in ASR: small errors can inadvertently have significant impacts. Now consider a few examples from the dispatch data. See if you can understand the model transcripts.

The model struggles to construct readable and understandable transcripts from the emergency call data. Keep in mind that we use this kind of transcript to help dispatchers with diagnostics and triaging, and just as with a chain, the system cannot be stronger than its weakest link. It is essential that these transcripts are easily understandable.

How do we build ASR for the real world?

Although the state-of-the-art Deep Speech ASR model performs well on benchmark data, its WER above 60% on noisy, highly stressed conversational data makes it almost unusable; it shows a failure of the ASR model to generalize to real-world conversational input. Sometimes, the errors in the transcripts can even give a completely different meaning to a sentence, as was the case in one of the above transcription examples. It’s not hard to imagine how such an error in an emergency call could have unacceptable consequences.

This serves to show that current plug-and-play ASR models are not viable in real-life dialogue settings, which in turn underscores that the general perception of the performance of ASR technologies is often far from the reality. With the current machine learning methodologies, the application of ASR in mission-critical settings requires a much more focused approach.

“This serves to show that current plug-and-play ASR models are not viable in real-life dialogue settings, which in turn underscores that the general perception of the performance of ASR technologies is often far from the reality.”

You could argue that expecting models trained on virtually noiseless speech to generalize well to emergency calls isn’t reasonable. After all, how would they know better? But this is exactly my point. There is no way that we will achieve impactful ASR on natural speech by using models that are trained on “nice” datasets. The world of natural speech doesn’t play nice.

Instead, we need to change the way we think about ASR. ASR has definitely not been solved. Although some results are comparable to human level, depending on how you measure it, they come straight out of the laboratory. Laboratories are great, but ultimately it is the real-world application of these technologies that matters. And as we’ve seen, this step can be more than challenging if the paradigms of the laboratory have diverged from the real world. As it stands, the current state of the art is not applicable to the real-world problems that so sorely need ASR that actually works.

So how do we build impactful ASR systems? We need to achieve conversational interfaces in which you can seamlessly interact with technology instead of having to tune into your “Alexa voice”. However, the machine learning methods that these systems are based on are not ready for this sort of complexity. They fail to handle the broad range of conversational settings that may occur.

At Corti, we believe that different types of conversations need to be solved one by one, similarly to what Google so magnificently demonstrated with Duplex. But isn’t this exactly what command-based systems like Alexa are trying to achieve? The answer is yes; however, these systems cannot be certain that the input they receive is actually confined to the simple commands that they expect. Nevertheless, they often react as if it were. This is probably also why we will have to wait some time before a successful and seamless Google Duplex conversational interface is available, at least beyond simple restaurant bookings.

In fact, there are machine learning methodologies that can help ascertain whether a sample is within the comfort zone of the conversational interface or not. At Corti, we are keenly aware of these methods and are actively researching them. We want to know when the system is uncertain, or at least when it should be. We are constantly researching novel approaches to build better and more robust ASR systems that do not suffer from the pitfalls we have seen in this post.
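As one simple illustration of the kind of signal such methods can build on, and not a description of Corti’s approach, the entropy of an acoustic model’s per-frame output distribution is a rough proxy for how far an input lies from the model’s comfort zone: consistently flat, high-entropy outputs suggest the model does not recognize what it is hearing. A minimal sketch:

```python
import numpy as np

def mean_frame_entropy(frame_probs: np.ndarray) -> float:
    """Average entropy (in nats) of per-frame softmax outputs.

    frame_probs has shape (num_frames, num_symbols); each row is the acoustic
    model's output distribution over characters/phonemes for one time frame.
    """
    eps = 1e-12  # guard against log(0)
    entropy = -np.sum(frame_probs * np.log(frame_probs + eps), axis=1)
    return float(entropy.mean())

# Toy usage: a confident (peaked) frame versus an uncertain (flat) one.
confident = np.array([[0.97, 0.01, 0.01, 0.01]])
uncertain = np.array([[0.25, 0.25, 0.25, 0.25]])
print(mean_frame_entropy(confident))  # ~0.17 nats
print(mean_frame_entropy(uncertain))  # ~1.39 nats (log 4)
```

In practice, a threshold on a score like this would have to be calibrated on held-out in-domain data before it could be used to flag low-confidence transcripts for review.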

If you feel like learning more about the work we do with ASR and other topics within natural language processing and machine learning, have a look at some of our other blog posts at Corti’s Medium page or publications on our own research page.

Footnotes

  1. Recently, Microsoft and IBM have been competing on the state of the art with WERs closing in on 5% on Switchboard (among others). Human level performance on transcription is probably somewhere between 4% and 5.9% WER for “careful transcription” and about 9.6% for “quick transcription” depending on the type of speech and how the experiment is set up.
  2. The model is evaluated on the “dev93” subset of WSJ according to the Kaldi recipe.
