Make your ASR battleworthy in real-life noisy environments

Data Monsters
Published in Product AI
3 min read · Oct 13, 2022

Automatic Speech Recognition (ASR) systems in the real world work with audio from authentic sources that is subject to various kinds of interference, distortion, and perturbation: city noise, a dog’s bark, speaker mode on a phone, digital and analog interference, various codecs for signal conversion, and so on. These and many other problems can seriously hinder high-quality speech recognition. For example, the sound of special urban vehicles can confuse the model, while a person would still recognize the speech.

At Data Monsters, we have developed a pipeline for producing robust ASR models that handle a wide range of signal issues. The aim is that if a person can still recognize the speech, the model can too. As a result, the model filters out irrelevant sounds, understands distorted speech, and returns an accurate transcription.

The model is trained on both source data and augmented data. Let’s take a closer look at the augmented data.

Data description

The original domain-specific data consists of audio files and corresponding manual transcriptions. Total duration: ~500 hours. The data was enriched with noise and perturbations.

We used the Room Impulse Response and Noise Database (SLR28):

  • 12 hours of authentic background and foreground noises from different sources
  • 20 hours of room impulse responses from different environments

We also significantly increased the amount of background and foreground noise with Google AudioSet:

  • ~5000 hours of labeled real-world noises
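To make the two ingredients above concrete, here is a minimal sketch of how a noise recording and a room impulse response are typically applied to clean speech. It is not the actual Data Monsters pipeline: the function names `mix_at_snr` and `apply_rir`, the 10 dB SNR, and the synthetic toy signals are all assumptions made for illustration.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech so the speech-to-noise power ratio equals snr_db."""
    # Loop or trim the noise so it covers the whole utterance.
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[: len(speech)]
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # Gain that scales the noise to the requested SNR.
    gain = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + gain * noise

def apply_rir(speech, rir):
    """Simulate a room by convolving speech with a room impulse response."""
    rir = rir / np.max(np.abs(rir))  # normalise the peak to 1
    return np.convolve(speech, rir)[: len(speech)]  # keep the original length

# Toy signals standing in for real recordings (assumed 16 kHz, 1 second).
rng = np.random.default_rng(0)
sr = 16000
t = np.arange(sr) / sr
speech = 0.5 * np.sin(2 * np.pi * 220 * t)                    # stand-in for speech
noise = rng.normal(size=sr // 2)                              # stand-in for a noise clip
rir = np.exp(-np.arange(800) / 100.0) * rng.normal(size=800)  # decaying toy "room"

noisy = mix_at_snr(speech, noise, snr_db=10)
reverberant = apply_rir(noisy, rir)
```

In practice `speech`, `noise`, and `rir` would be loaded from audio files at a common sample rate, and the SNR would be drawn from a range rather than fixed.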

In addition, different filters were applied to the original audio to simulate white noise, interference, signal issues, different codecs, etc. The perturbation parameters were chosen to match natural acoustic environments.
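As an illustration of this kind of filter, the sketch below adds Gaussian white noise at a chosen SNR and passes audio through an 8-bit mu-law round trip, the companding scheme used by G.711-style telephone codecs. The article does not say which codecs or parameters were used, so the function names, the 15 dB SNR, and the choice of mu-law are assumptions.

```python
import numpy as np

def add_white_noise(audio, snr_db, rng):
    """Add Gaussian white noise at the requested signal-to-noise ratio."""
    signal_power = np.mean(audio ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    return audio + rng.normal(scale=np.sqrt(noise_power), size=audio.shape)

def mu_law_round_trip(audio, mu=255):
    """Compress to 8-bit mu-law and expand again to mimic a telephone codec."""
    audio = np.clip(audio, -1.0, 1.0)
    compressed = np.sign(audio) * np.log1p(mu * np.abs(audio)) / np.log1p(mu)
    # Quantise to 256 levels; this is where information is lost.
    quantised = np.round((compressed + 1) / 2 * mu) / mu * 2 - 1
    # Invert the companding curve.
    return np.sign(quantised) * np.expm1(np.abs(quantised) * np.log1p(mu)) / mu

# Toy input: a 440 Hz tone in [-1, 1] at an assumed 16 kHz sample rate.
rng = np.random.default_rng(0)
sr = 16000
tone = 0.5 * np.sin(2 * np.pi * 440 * np.arange(sr) / sr)

noisy = add_white_noise(tone, snr_db=15, rng=rng)
decoded = mu_law_round_trip(tone)
```

The round trip keeps the signal recognisable while introducing the quantisation distortion a listener would hear on a narrowband phone line.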

Using our approach, the amount of training, dev, and test data has been increased tenfold.

The pipeline we have developed can be used to extend any audio dataset with unique augmented samples.

Examples of augmented data

Below are examples of augmented data.

Original audio sample without noise, “What is Natural Language Processing?”

Background and foreground noise:

1. Car driving on a wet road

2. Singing

3. Digital beeps

4. Purring cat

5. Dog’s bark

Different types of perturbations:

1. Changed room impulse response, big hall

2. White noise

3. Slow down without tone change

4. Speed up with tone change
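The "speed up with tone change" case can be sketched as plain resampling: reading the signal faster shortens it and raises the pitch by the same factor. The function name `change_speed` and the use of linear interpolation are assumptions for illustration; a production resampler would also apply anti-aliasing filtering.

```python
import numpy as np

def change_speed(audio, factor):
    """Resample so playback is `factor` times faster; pitch shifts with it."""
    n_out = int(round(len(audio) / factor))
    # Positions in the original signal from which the new samples are read.
    src_positions = np.linspace(0, len(audio) - 1, num=n_out)
    return np.interp(src_positions, np.arange(len(audio)), audio)

# Toy input: a 200 Hz tone at an assumed 16 kHz sample rate, 1 second long.
sr = 16000
tone = np.sin(2 * np.pi * 200 * np.arange(sr) / sr)

fast = change_speed(tone, 2.0)  # half as long, one octave higher
slow = change_speed(tone, 0.9)  # longer and lower in pitch
```

Slowing down or speeding up without a tone change is a different operation (time stretching, for example with a phase vocoder) and is not covered by this sketch.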

Combined perturbation (background and foreground noise plus a changed room impulse response):

1. Special urban transport signal

2. Phone ringing

3. City Park

4. Rooster crows

5. Clicks

6. Phone vibrating

Each time, a unique combination of noise and perturbations is generated. With a large pool of available noise, we can generate a large number of variants.
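One simple way to get a different combination every time is to sample an augmentation "recipe" per utterance. The sketch below is hypothetical: the catalogue entries, the SNR range, the speed factors, and the name `sample_recipe` are placeholders, not values from our pipeline.

```python
import random

# Hypothetical catalogues; real noise and impulse-response sets are far larger.
NOISES = ["car_wet_road", "singing", "digital_beeps", "purring_cat", "dog_bark"]
ROOMS = ["big_hall", "small_room", "stairwell"]
SPEED_FACTORS = [0.9, 1.0, 1.1]

def sample_recipe(rng):
    """Draw one random augmentation recipe for a single utterance."""
    return {
        "noise": rng.choice(NOISES),
        "snr_db": round(rng.uniform(5, 20), 1),  # assumed SNR range in dB
        "room": rng.choice(ROOMS + [None]),      # None means no impulse response
        "speed": rng.choice(SPEED_FACTORS),
    }

rng = random.Random(42)  # fixed seed so a run can be reproduced
recipes = [sample_recipe(rng) for _ in range(3)]
for recipe in recipes:
    print(recipe)
```

Even these small catalogues give 5 × 4 × 3 = 60 discrete combinations before the continuous SNR is taken into account, which is why a large noise pool yields so many variants.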

Robust ASR Model description

The Robust ASR Model is based on the Conformer-CTC architecture. For the initial weights, we used a Regular ASR Model pre-trained on the original (unenriched) data only.

The model was fine-tuned on an NVIDIA DGX-2 server with 16 NVIDIA V100 Tensor Core GPUs. The NVIDIA NeMo toolkit was used as the framework for training and evaluation.

Moreover, the model has been deployed with the NVIDIA Riva GPU-accelerated SDK to make it production-ready, with performance and quality benefits. The model shows excellent real-time performance.

Table 1. Regular and Robust model comparison.
1) Regular (Original) ASR Model trained without data augmentation
2) Robust ASR Model obtained using data augmentation
3) Test set of the original dataset
4) Test set of the augmented original dataset
5) Test set of another dataset without augmentation

The Robust ASR Model outperforms the Regular model not only on noisy data but also on the original data. This suggests that the model has learned to filter out noise and to better understand distorted signals.
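For readers who want to run a similar comparison on their own models, word error rate (WER) is a common metric for ASR quality. The article does not state which metric Table 1 reports, so treating it as WER is an assumption, and the helper `word_error_rate` below is an illustrative sketch rather than the evaluation code we used.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between the processed ref words and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)  # assumes a non-empty reference

reference = "what is natural language processing"
print(word_error_rate(reference, "what is natural language processing"))  # 0.0
print(word_error_rate(reference, "what is language processing"))          # 0.2
```

Computing this for both models on the same test sets (original, augmented, and an unrelated clean set) gives the kind of side-by-side comparison the table describes.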

You can see the improvement on the “Another non-noisy test” set. We didn’t add augmentations to the training part of this dataset, yet the Robust model works better on it. This suggests that enriching one dataset with our approach can also improve results on other datasets.

We plan to further develop the data enrichment pipeline to obtain high-quality ASR models.
