Building Custom Speech Recognition Models Within Minutes

Anchal Bhalla
IBM watsonx Assistant
6 min read · Aug 26, 2019

Ever wanted to create a personalized AI bot that can recognize whatever you say to it? You probably have at some point, but dropped the idea because training or building one seemed too complex or time consuming. In this blog you will learn how to create custom speech models using IBM Watson, and the best part is that you do not have to write a single line of code. Let’s get started!

Prerequisites:

Overview:

  • Introduction to IBM Watson Speech to Text
  • What custom acoustic and language models are and how to build them
  • Testing model accuracy with word error rate (WER)

Speech recognition and speech-to-text services have been around for a long time, but many do not offer full capabilities: for example, they handle only small audio sets or are not trainable. IBM Watson® Speech to Text on IBM Cloud transcribes real-time audio in seven different languages with a very high accuracy rate and supports a wide variety of use cases. Best of all, you can tweak and train the service to your own standards, and that is what we will look at now.

The Speech to Text service focuses on creating two different kinds of custom speech recognition models - acoustic models and language models. We will learn to create both of them.

Acoustic Models

An acoustic model lets you adapt a base model to the acoustic characteristics of your environment and speakers. You might create an acoustic model in cases such as these:

  • Your acoustic environment is unique. For example, the environment is noisy, microphone quality or positioning are sub-optimal, or the audio suffers from far-field effects.
  • The speaker’s speech pattern is atypical. For example, they speak too fast or too slow.
  • Your speaker has an accent.

Now let’s look at how to create an acoustic model that handles a noisy environment and speakers with Indian and Scottish accents.

Steps to create an Acoustic Model:

Let’s download some audio sets! You can download audio recorded in a noisy environment from this link (a subset of this data set). You can download audio with an Indian accent from here and with a Scottish accent from here.

Open the terminal on your computer and enter the following command. Replace ‘apikey’ with your own API key (credentials are available on IBM Cloud), and you can also give the model a name. I have named it ‘My first acoustic model’.

curl -X POST -u "apikey:{apikey}" --header "Content-Type: application/json" --data "{\"name\": \"My first acoustic model\", \"base_model_name\": \"en-US_BroadbandModel\", \"description\": \"This model contains noisy background audios and speakers with Indian and Scottish accents\"}" "https://stream.watsonplatform.net/speech-to-text/api/v1/acoustic_customizations"

This command returns a customization ID, which identifies the custom acoustic model we are going to create. This ID will also be used in your applications, so keep note of it.
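The create call responds with a small JSON object containing the new ID. A minimal Python sketch of extracting it, with a made-up ID value for illustration:

```python
import json

# Hypothetical response body from the acoustic_customizations endpoint;
# the service replies with a JSON object holding the new model's ID.
response_body = '{"customization_id": "74f4807e-b5ff-4866-824e-6bba1a84fe96"}'

customization_id = json.loads(response_body)["customization_id"]
print(customization_id)  # save this ID for all later calls
```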

Customization ID example

Next, we will add audio files to the base custom model we created so we can send it for training. There are 3 different audio sets (make sure they are in zipped folders), so we run this command 3 times, once per zip folder. Before running it, cd into the same directory as the audio files. Make sure to replace the apikey, the name of the audio folder, and the customization ID from the previous step:

curl -X POST -u "apikey:{apikey}" --header "Content-Type: application/zip" --header "Contained-Content-Type: audio/l16;rate=16000" --data-binary @voices_dirty.zip "https://stream.watsonplatform.net/speech-to-text/api/v1/acoustic_customizations/{customization_id}/audio/voices_dirty"
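If your recordings are loose files rather than a ready-made archive, a short Python sketch like this can bundle a folder of audio into the zip the command expects (folder and file names here are illustrative):

```python
import zipfile
from pathlib import Path

def zip_audio_folder(folder: str, archive: str) -> str:
    """Bundle all .wav files in `folder` into `archive` for upload."""
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for audio_file in sorted(Path(folder).glob("*.wav")):
            # store each file flat at the archive root
            zf.write(audio_file, arcname=audio_file.name)
    return archive

# e.g. zip_audio_folder("voices_dirty", "voices_dirty.zip")
```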

The next step is to train our model on these audio files, so enter the following command and wait about 20–30 minutes for it to finish:

curl -X POST -u "apikey:{apikey}" "https://stream.watsonplatform.net/speech-to-text/api/v1/acoustic_customizations/{customization_id}/train"

You can check the status of the training by entering the following command:

curl -X GET -u "apikey:{apikey}" "https://stream.watsonplatform.net/speech-to-text/api/v1/acoustic_customizations/{customization_id}"
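The status check returns JSON with a `status` field; per the Watson Speech to Text documentation, a custom model moves through values such as `pending`, `ready`, `training`, and finally `available` (or `failed`). A minimal sketch, assuming those status values, of deciding when to stop polling:

```python
import json

def training_finished(status_json: str) -> bool:
    """Return True once the model is ready to use or training has failed."""
    status = json.loads(status_json)["status"]
    return status in ("available", "failed")

print(training_finished('{"status": "training"}'))   # still training: keep polling
print(training_finished('{"status": "available"}'))  # done: model is usable
```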

Language Models

This service was created for a broad, general audience. The base model includes many words that we use in everyday conversation, but the vocabulary expands when we dive into an industry focus, for example the medical industry or the oil and gas industry. With language model customization we can tailor the vocabulary to include domain-specific terminology. This is done by adding a corpus file: a text file containing the words and phrases commonly used in the industry. We will focus on the oil and gas industry in this blog.
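A corpus file is just plain UTF-8 text containing sentences that use the domain vocabulary in context. A few hypothetical lines for an oil and gas corpus might look like:

```text
The drilling rig reached total depth ahead of schedule.
Wellhead pressure must be monitored during hydraulic fracturing.
The refinery processes crude oil into gasoline and diesel.
```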

Industry specific models

Steps to create a Language Model

The first step is to add the corpus text file to the model, so download the file from here. This file contains most of the terminology used in the oil and gas industry.

Now run the following command to create a custom language model; this will also return a customization ID:

curl -X POST -u "apikey:{apikey}" --header "Content-Type: application/json" --data "{\"name\": \"Oil and Gas Model\", \"base_model_name\": \"en-US_BroadbandModel\", \"description\": \"This is a custom language model for the Oil and Gas industry\"}" "https://stream.watsonplatform.net/speech-to-text/api/v1/customizations"

In this step we will add the corpus file to the model so we can send it for training on those words. Make sure you cd into the same directory as the corpus file when running the command and add the customization ID from the previous step:

curl -X POST -u "apikey:{apikey}" --data-binary @oil-gas-corpus.txt "https://stream.watsonplatform.net/speech-to-text/api/v1/customizations/{customization_id}/corpora/oilgas"

Now let’s look at what the model has analyzed. Run the command below:

curl -X GET -u "apikey:{apikey}" "https://stream.watsonplatform.net/speech-to-text/api/v1/customizations/{customization_id}/corpora/oilgas"

After running the command, you will see results like the following: the file contains 2193 words, of which the model is not trained on 10. Once we send the model for training, it will learn these 10 out-of-vocabulary words.
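The corpus status comes back as JSON. Assuming the field names from the Watson Speech to Text API (`total_words`, `out_of_vocabulary_words`, `status`), a small Python sketch of reading it, with values mirroring the ones above:

```python
import json

# Hypothetical GET /corpora/{name} response body.
corpus_status = """{
  "name": "oilgas",
  "total_words": 2193,
  "out_of_vocabulary_words": 10,
  "status": "analyzed"
}"""

info = json.loads(corpus_status)
oov = info["out_of_vocabulary_words"]
print(f"{oov} of {info['total_words']} words are new to the base model")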

Enter the following command to send it for training and wait for 15–30 minutes:

curl -X POST -u "apikey:{apikey}" "https://stream.watsonplatform.net/speech-to-text/api/v1/customizations/{customization_id}/train"

You can check the training status by entering the following command:

curl -X GET -u "apikey:{apikey}" "https://stream.watsonplatform.net/speech-to-text/api/v1/customizations/{customization_id}"

Accuracy & Testing

We have finally created our custom models on the cloud, but how can we actually test whether they work and how accurate they are? To test the models and check their accuracy, we will run them on a blind audio set that has a transcription.

A common technique for testing the accuracy of speech models is WER, the word error rate. The lower the rate, the better the accuracy. It is calculated using the formula below:

WER = (S + I + D) / N

where:

  • S: substitutions (replacing a word)
  • I: insertions (inserting a word)
  • D: deletions (omitting a word)
  • N: the number of words that were actually said

Note: WER will be calculated incorrectly if you forget to normalize capitalization, punctuation, numbers, etc. across all transcripts.

We will write a Python script to test the accuracy of both our models. Python offers a library called jiwer that automatically calculates the word error rate for speech recognition models.
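Before reaching for jiwer, it helps to see what it computes. A minimal sketch of WER as the word-level edit distance divided by the reference length; jiwer computes this same quantity (with extra text-normalization options) via jiwer.wer(reference, hypothesis):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (S + I + D) / N, computed as edit distance over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits needed to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

# One substitution in a four-word reference: WER = 1/4 = 0.25
print(word_error_rate("the well is deep", "the well is shallow"))
```

Remember to lowercase and strip punctuation from both transcripts first, per the normalization note above.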

Steps for testing and getting an accuracy rate:

Download some sample audio files from this audio set and get their transcriptions from here.

Download the following code files to get the results:

After running these 2 files on the samples, you will get a word error rate close to 0.04 in an Excel sheet, which is a very good result (remember, the lower the WER, the better).

Conclusion

Voila! You have finally created your personalized speech recognition models with a very high accuracy rate, within minutes and without writing any complex code. Now you can embed these models into any application.
