Automatic speech recognition for specific domains

Data Monsters · Published in Product AI · Jun 8, 2022

Automatic speech recognition is one of the oldest tasks involving natural language, with roots going back to 1952, when Bell Labs developed a phonetic digit-recognition system called “Audrey.” Since then, a lot has changed. Nowadays, most speech recognition systems used in industry rely either on neural networks or on hidden Markov models (HMMs), and in recent years they have demonstrated competitive performance on general-domain speech recognition tasks.

Before we turn to common benchmarks, it is worth recalling that the typical metric for evaluating ASR systems is word error rate (WER). It counts the minimum number of insertions, deletions, and substitutions needed to turn the predicted word sequence back into the reference one, normalized by the number of words in the reference.
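For illustration, here is a minimal Python sketch of WER computed with a word-level edit distance (libraries such as jiwer implement the same idea):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between the first i reference words and first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution / match
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("switch the audi breaker off", "switch the howdy breaker off"))  # 0.2
```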

General domain data

One benchmark typically used for evaluating ASR systems is the TIMIT Acoustic-Phonetic Continuous Speech Corpus, which contains recordings of 630 speakers of American English. At the moment, the best-performing system on this corpus is the wav2vec model proposed by Facebook in 2019.

Another benchmark frequently used for ASR evaluation is the LibriSpeech dataset. It was created from audiobook recordings and contains around 1,000 hours of speech. Its evaluation data is split into two parts, clean and other, according to how well ASR systems handle the recordings (the clean part is the easier one). The current state of the art on the clean part is held by a Google Research model based on the Conformer network with wav2vec pretraining; on the other part, the best result was achieved by the w2v-BERT model proposed by Google Brain in 2021.

The Wall Street Journal (WSJ) corpus is also used for evaluating ASR models. It is based on read articles from the WSJ newspaper and contains 80 hours of speech. The benchmark has two test sets, eval92 and eval93. On the former, the state of the art is the SpeechStew model; on the latter, it is DeepSpeech 2.

The results for all these datasets are given in Table 1.

Table 1. State of the art results on the most popular ASR benchmarks.

Domain-specific data

For general-domain speech recognition, the best systems consistently reach WER below 10% (i.e., over 90% of words are recognized correctly). However, in real-life conditions with domain-specific vocabulary, the situation worsens drastically.

This problem is especially visible when an ASR system trained on general-domain audio is applied to recordings full of specific terminology: WER can grow by more than 50%. The degradation comes either from the many UNK tags that replace out-of-vocabulary words or from general-domain words that the model substitutes for special terms (for example, the word “Audi” may come out as “howdy” in the output sequence).

While virtual assistants like Siri or Alexa can rely on models trained on general-domain data, such models rarely satisfy business requirements for applications like voice chatbots or call-center automation. These applications must recognize product and service names that are almost always out-of-vocabulary words specific to a particular company, often the names of specific manufacturing equipment. The problem gets even more complicated when the same device is known under several different names.

All of this calls for training the model on domain-specific data. In most cases, however, that is not possible, because companies either do not have enough data or cannot share it for legal reasons.

Nevertheless, even when data is scarce, it is possible to adapt the model and enrich its recognition capabilities. This can be done with fine-tuning: taking a large pre-trained model that has already captured general knowledge about the language and teaching it to recognize specific words and word combinations. This approach helps overcome the OOV problem and decreases the model’s WER.

Our case

Our company works on various tasks related to natural language processing, including automatic speech recognition. Our current ASR case is helping a call center specializing in electrical equipment improve speech recognition quality by enriching the model with domain-specific vocabulary.

The client provided us with 11.27 hours of labeled call recordings, which we split into 7.47 hours of speech for training and 3.8 hours for testing. The data contains dialogs with clients about the equipment produced by the company.

In order to verify that our model fulfills the client’s needs, we introduced an additional metric that lets us evaluate whether the client’s main goal is achieved. We call it term error rate (TER), and it is computed as the number of correctly recognized terms divided by the total number of terms in the reference text.
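A minimal sketch of this metric following the definition above, with a purely illustrative term list:

```python
# Hypothetical domain terms; in practice the list comes from the client's product catalog.
DOMAIN_TERMS = {"circuit breaker", "busbar", "contactor"}

def term_error_rate(reference: str, hypothesis: str, terms=DOMAIN_TERMS) -> float:
    """Share of domain terms from the reference that appear in the model's output."""
    ref, hyp = reference.lower(), hypothesis.lower()
    ref_terms = [t for t in terms if t in ref]
    if not ref_terms:
        return 1.0
    recognized = [t for t in ref_terms if t in hyp]
    return len(recognized) / len(ref_terms)

print(term_error_rate(
    "please reset the circuit breaker on the busbar",
    "please reset the circuit breaker on the bus bar",
))  # 0.5 - one of the two reference terms was recognized
```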

For our purposes, we chose the NVIDIA Riva Conformer-CTC model as the key solution, with Wav2Vec2 and Google Cloud Speech-to-Text as two possible alternatives.

Riva ASR models are well suited to our case: they are customizable, offer several neural network architectures, are optimized for GPUs, and even the out-of-the-box Riva models deliver close to state-of-the-art performance. For our task, we chose the Conformer-CTC model, a convolution-augmented variant of the Transformer architecture and the most accurate model available in Riva. To improve its results, we fine-tuned it on our small corpus; in addition, we augmented the model’s vocabulary and tuned the parameters of the language model used alongside the acoustic model.
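As an illustration of how such customization looks from the client side, here is a minimal sketch of offline recognition with word boosting through the nvidia-riva-client Python package; the server address, audio file, and boosted terms are placeholders, and details may differ between Riva versions:

```python
import riva.client

# Connect to a running Riva server (address is a placeholder).
auth = riva.client.Auth(uri="localhost:50051")
asr_service = riva.client.ASRService(auth)

config = riva.client.RecognitionConfig(
    encoding=riva.client.AudioEncoding.LINEAR_PCM,
    sample_rate_hertz=16000,
    language_code="en-US",
    max_alternatives=1,
    enable_automatic_punctuation=True,
)

# Boost domain-specific terms so the decoder prefers them (terms are illustrative).
riva.client.add_word_boosting_to_config(config, ["busbar", "contactor"], 20.0)

with open("call_recording.wav", "rb") as f:
    audio_bytes = f.read()

response = asr_service.offline_recognize(audio_bytes, config)
print(response.results[0].alternatives[0].transcript)
```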

Wav2Vec2 is a model proposed by Facebook AI in 2020, and at the moment its variations give the best results on many ASR benchmarks. It is pre-trained on the task of recovering masked parts of the audio and then fine-tuned for speech recognition. For our purposes, we fine-tuned Facebook’s pre-trained wav2vec2-large-robust model.
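For illustration, here is a minimal sketch (not our exact training pipeline) of how such a checkpoint can be fine-tuned with the Hugging Face transformers library; the vocabulary file, the example transcript, and the dummy audio are placeholders:

```python
import torch
from transformers import (Wav2Vec2CTCTokenizer, Wav2Vec2FeatureExtractor,
                          Wav2Vec2Processor, Wav2Vec2ForCTC)

# The robust checkpoint is pre-trained only, so we attach our own character-level
# tokenizer built from a vocabulary file ("vocab.json" is a placeholder path).
tokenizer = Wav2Vec2CTCTokenizer("vocab.json", unk_token="[UNK]",
                                 pad_token="[PAD]", word_delimiter_token="|")
feature_extractor = Wav2Vec2FeatureExtractor(feature_size=1, sampling_rate=16000,
                                             padding_value=0.0)
processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)

# Load the pre-trained encoder and add a randomly initialized CTC head sized to our vocabulary.
model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-large-robust",
    ctc_loss_reduction="mean",
    pad_token_id=tokenizer.pad_token_id,
    vocab_size=len(tokenizer),
)
model.freeze_feature_encoder()  # keep the convolutional front end frozen while fine-tuning

# One illustrative training step on a dummy batch; real training iterates over the labeled calls.
audio = torch.randn(16000)  # one second of fake 16 kHz audio
inputs = processor(audio.numpy(), sampling_rate=16000, return_tensors="pt")
labels = tokenizer("check the contactor", return_tensors="pt").input_ids
loss = model(inputs.input_values, labels=labels).loss
loss.backward()
```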

Google Cloud Speech-to-Text is another popular solution for speech recognition, whose main advantage is its relative ease of use. The core model used for recognition inside the service is also a Conformer. It offers only limited possibilities for adaptation, so we boosted it by introducing new words into the model’s vocabulary.
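A minimal sketch of this kind of adaptation with the google-cloud-speech Python library, using phrase hints to bias recognition toward domain terms (the phrases, boost value, and file name are illustrative):

```python
from google.cloud import speech

client = speech.SpeechClient()

# Bias recognition toward domain terms via a speech context (illustrative phrases).
speech_context = speech.SpeechContext(
    phrases=["busbar", "contactor", "circuit breaker"],
    boost=15.0,
)

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="en-US",
    speech_contexts=[speech_context],
)

with open("call_recording.wav", "rb") as f:
    audio = speech.RecognitionAudio(content=f.read())

response = client.recognize(config=config, audio=audio)
for result in response.results:
    print(result.alternatives[0].transcript)
```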

In Table 2 and Table 3, we present the results of all these models, both out of the box and fine-tuned. Table 2 shows the results on the test set provided by the client, and Table 3 shows the results on the LibriSpeech other test set.

Table 2. WER of the NVIDIA Riva Conformer-CTC model, Wav2Vec2 model and Google Cloud Speech-to-Text models on the customer’s dataset.
Table 3. WER of the NVIDIA Riva Conformer-CTC model, Wav2Vec2 model and Google Cloud Speech-to-Text models on the LibriSpeech other test set.

As we can see, Riva outperforms both Google and the current state-of-the-art solution for the ASR task. The difference is especially remarkable after the fine-tuning stage, where Riva achieves a WER of 10.79%, while Wav2Vec2 only reaches 16.23%.

The small difference between the Conformer-CTC and Wav2Vec2 models on the LibriSpeech other set means that they remain good at general speech not present in the domain-specific training data. The dramatic WER increase of Google Cloud indicates that this model is overfitted.

This suggests that NVIDIA Riva models may be the best solution in similar cases, where only a small amount of annotated data is available for fine-tuning.


________

Written by Anna Mosolova, Vladimir Nechaev and Marina Molchanova - Data Scientists at Data Monsters.
