Data Collection and Training for Speech Projects

Andrew R. Freed
IBM Data Science in Practice
Dec 2, 2019 · 8 min read

Many AI projects have a speech recognition component, such as voice assistants (for example, IBM’s Watson Assistant for Voice Interaction). Speech recognition models, including IBM Speech to Text, require training to understand new domains, and that training requires a data collection exercise. This post describes data collection and speech training at a high level to demonstrate how each step relates to the others. For a deeper treatment of data collection and training, see the fantastic “How to Train Your Speech Dragon” series by Marco Noel: part 1, part 2, and part 3.

Data Collection components

The following diagram outlines the high-level steps in the data collection exercise and how these steps support the training of your speech model.

Flow diagram for Speech Data Collection and Training

1. Find out what users will need to say

2. Determine the domain-specific language

3. Build a “script” from a sample of the domain-specific language

4. Identify target population

5. Record humans reading from your script

6. Transcribe what the humans spoke

7. Build a test set

8. Train a language model

9. Train an acoustic model

Each step is described in greater detail in the remainder of this post. Our end goal is to collect a large enough set of audio data to train a speech model with maximum effectiveness. As of this writing we target 25 hours of audio for speech model training.

Step 1: What are they going to need to say?

In order to effectively train a speech model you need to understand what users are expected to say. A speech model is like any other machine learning model in that it requires data that is as representative as possible, so you should go as close to the source as you can.

If you are training a voice assistant to supplement an existing (text) chat assistant, take logs from user interactions. If you have call center recordings, transcribe some of them. If you are writing a new chat dialog, look at all the responses you nudge your users to provide. Use data from your eventual real users wherever you can, and augment it only if you must.

Cari Jacobs has a wonderful post on data collection strategies that you can review in detail for this step and the next.

Step 2: What are they going to say (that’s specific to my domain)?

In the previous step you will have collected many pages of textual data. In this step we separate out the “general” language from the “domain-specific” language. We do this because our speech model inherits a base of “general” training data and we only need to provide the data that would fall outside of this. You can think of “general” language as language your smartphone understands and “domain-specific” language as things you only say/hear at your job.

For example, a veterinarian may have a voicemail transcript that goes: “Hi I need an appointment. I need to neuter my Corgi.” The first sentence is entirely general. The second sentence has domain-specific language, particularly “neuter” and “Corgi”.

In this step you would extract the second sentence entirely. You may be tempted to extract just “neuter” and “Corgi”, but it is helpful for the speech model to hear these words in context. Speech training uses word sequences, not just individual words: by teaching the model that “to neuter” is likely in your domain, you also teach it that “tune you to her” is not a likely transcription despite its phonetic similarity.
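As a simple illustration, you could flag domain-specific sentences by matching against a hand-curated list of domain terms. This is only a sketch; the term list and the example transcript are assumptions for the veterinarian scenario.

```python
# Minimal sketch: flag sentences that contain hand-curated domain terms.
# The term list and the example transcript are illustrative assumptions.
import re

DOMAIN_TERMS = {"neuter", "spay", "corgi", "labrador", "microchip"}

def domain_sentences(transcript: str) -> list[str]:
    """Return the sentences that contain at least one domain-specific term."""
    sentences = re.split(r"(?<=[.!?])\s+", transcript.strip())
    return [s for s in sentences
            if any(term in s.lower() for term in DOMAIN_TERMS)]

transcript = "Hi I need an appointment. I need to neuter my Corgi."
print(domain_sentences(transcript))  # ['I need to neuter my Corgi.']
```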


Step 3: What will we record from humans?

You will take a set of statements from the previous step and collect them into a “script” that you will ask humans to read aloud while you record them. This script should include a representative sampling from the previous step. If you have an equal number of “neuter” and “spay” appointments, you should have an equal number of “neuter” and “spay” statements in the script. If you have twice as many “neuter” appointments as “microchip” appointments, you should have twice as many “neuter” statements as “microchip” statements.

Be sensitive to the length of the script. You don’t want to ask people to spend more than fifteen minutes reading it. In one recent data collection exercise we prepared a script of 140 statements containing approximately one thousand words. Speakers should be instructed to pause for 2–3 seconds between statements, so expect the reading to take longer than the average speaking pace of 125–150 words per minute would suggest.
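As a rough sketch, you could assemble the script proportionally and estimate its reading time like this. The statement pools, appointment counts, and pacing numbers below are illustrative assumptions.

```python
# Sketch: sample script statements in proportion to observed topic frequency,
# then estimate how long the script will take to read. Counts and pacing
# numbers are illustrative assumptions.
import random

statements_by_topic = {                      # pools gathered in step 2
    "neuter":    ["I need to neuter my Corgi.", "Can I book a neuter appointment?"],
    "microchip": ["I want to microchip my puppy."],
}
observed_counts = {"neuter": 200, "microchip": 100}   # e.g., from appointment logs
script_size = 140

total = sum(observed_counts.values())
script = []
for topic, pool in statements_by_topic.items():
    quota = round(script_size * observed_counts[topic] / total)
    script.extend(random.choices(pool, k=quota))

words = sum(len(s.split()) for s in script)
minutes = words / 140 + len(script) * 2.5 / 60   # ~140 wpm plus a 2.5 s pause per statement
print(f"{len(script)} statements, ~{words} words, ~{minutes:.0f} minutes to read")
```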

Make sure you start your script with an introduction statement where the subject identifies their accent, their environment, and their device. This will be useful metadata for later.

Step 4: Who should speak and under what conditions?

Identify your target population and build a data collection plan that covers this target population. You will want to record data from a variety of people (to cover speaking styles and accents) as well as a variety of environments and devices (landline/mobile/headset, noisy office/quiet room, etc.).

You want the distribution of people you collect data from to match the distribution of future users as closely as possible. If 80% of your callers will use headsets then 80% of your data collection subjects should be speaking on headsets.
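A quick sketch of turning that target distribution into recruitment quotas; the device mix and subject count are assumptions.

```python
# Sketch: turn an expected caller distribution into recruitment quotas.
# The distribution and total subject count are illustrative assumptions.
expected_mix = {"headset": 0.8, "mobile": 0.1, "landline": 0.1}
total_subjects = 50

quotas = {device: round(share * total_subjects) for device, share in expected_mix.items()}
print(quotas)  # {'headset': 40, 'mobile': 5, 'landline': 5}
```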

Step 5: Record people speaking

Set up an environment where you can record phone calls (for instance, IBM Voice Gateway has call recording capability). If you have an existing call center solution, you can use it to collect and record phone calls.

Distribute your script to your data collection subjects and have them call this environment. Instruct your users not to worry about mistakes they make — they should just continue reading the script. The process should be as easy as possible for your data collection subjects and you can deal with their mistakes later.


Step 6: Transcribe what the users actually said

Since the callers can make “mistakes,” you need to transcribe what they actually said. This does generate extra work for you, but it comes with a bonus: more varied training data!

This step requires human intervention, but you can simplify the work. Rather than having a human transcribe the entire call, you can ask the Speech to Text engine to produce a draft transcript and have your human transcriber correct its mistakes.

You should transcribe each statement to its own line in a text file. Use an organization scheme that allows you to easily match the text file to the audio file. For instance you may name a file 1234_us-south_landline_noisy to indicate it was call 1234, with US-South accent on a landline in a noisy office.
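Here is a minimal sketch of that workflow using the ibm-watson Python SDK; the API key, service URL, model choice, and file names are placeholders you would replace with your own.

```python
# Sketch: draft a first-pass transcript with Watson Speech to Text so a human
# only has to correct it. The credentials and file names are placeholders.
from ibm_watson import SpeechToTextV1
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

stt = SpeechToTextV1(authenticator=IAMAuthenticator("YOUR_API_KEY"))
stt.set_service_url("YOUR_SERVICE_URL")

with open("1234_us-south_landline_noisy.wav", "rb") as audio:
    response = stt.recognize(audio=audio,
                             content_type="audio/wav",
                             model="en-US_NarrowbandModel").get_result()

# One recognized utterance per line; a human transcriber corrects this file.
draft = "\n".join(r["alternatives"][0]["transcript"].strip()
                  for r in response["results"])
with open("1234_us-south_landline_noisy.txt", "w") as out:
    out.write(draft)
```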


Step 7: Build the test set

Take the text and audio pairs and segment them so that each segment contains one statement. The segmentation can be based on newlines (for the text files) and lengthy pauses (for the audio files). You will generate files like 1234_us-south_landline_noisy_line001, 1234_us-south_landline_noisy_line002, and so on. Be sure to verify that the segmentations generate the same number of segments and that each segment lines up (the line079 text should be the transcription of the line079 audio).

You will remove the segments that correspond to the user introduction (“I am calling with a US South accent on a landline in a busy office.”).

Take the segmented pairs and extract a random 20% to form a “test set”. These will NOT be used to train the model. After a model is trained you will use this test set to see how well your speech model transcribes audio it has not been trained on.
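A minimal sketch of the split, assuming one audio file (with a matching text file) per segment in a segments/ directory and the naming scheme from step 6:

```python
# Sketch: split the segmented (text, audio) pairs into an 80% training set and
# a 20% held-out test set. Paths and the intro-segment rule are assumptions.
import random
from pathlib import Path

segments = sorted(p.stem for p in Path("segments").glob("*.wav"))   # e.g. 1234_..._line001
segments = [s for s in segments if not s.endswith("_line001")]      # drop the intro statement (adjust to your script)

random.seed(42)                        # reproducible split
random.shuffle(segments)
cut = int(len(segments) * 0.2)
test_set, train_set = segments[:cut], segments[cut:]

Path("test_set.txt").write_text("\n".join(sorted(test_set)))
Path("train_set.txt").write_text("\n".join(sorted(train_set)))
```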

Step 8: Train a language model

Take the domain-specific statements from step 2 and put these into text files for language model training. You can generate additional variations that you did not specifically record. If you had “I need to neuter my Corgi” you can add “I need to spay my Corgi” and “I need to neuter my Labrador” as additional examples.
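As a small sketch, you could expand a recorded statement template with other in-domain terms to build such a corpus file; the template and term lists below are illustrative assumptions.

```python
# Sketch: expand a recorded statement into additional corpus variations by
# swapping in other in-domain terms. The term lists are illustrative.
from itertools import product

template = "I need to {procedure} my {breed}"
procedures = ["neuter", "spay", "microchip"]
breeds = ["Corgi", "Labrador", "Beagle"]

variations = [template.format(procedure=p, breed=b)
              for p, b in product(procedures, breeds)]
with open("corpus-veterinary.txt", "w") as corpus:
    corpus.write("\n".join(variations))
```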


If you have a domain-specific data input like ICD-10 codes, you can potentially list all of the valid codes (there is a word limit in language models). The language model can and should have additional variations beyond what you have recorded audio for. Marco’s speech training post has additional details on training.
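With the corpus file in hand, training the custom language model is a handful of API calls. Here is a minimal sketch assuming the ibm-watson Python SDK and the `stt` client from the step 6 sketch; the model and corpus names are placeholders.

```python
# Sketch: create and train a custom language model from the corpus file.
# Assumes the `stt` client from the step 6 sketch; names are placeholders.
lm = stt.create_language_model(
    name="veterinary-callers",
    base_model_name="en-US_NarrowbandModel").get_result()
customization_id = lm["customization_id"]

# Add the corpus file built above; the service analyzes it asynchronously.
with open("corpus-veterinary.txt", "rb") as corpus_file:
    stt.add_corpus(customization_id,
                   corpus_name="veterinary-corpus",
                   corpus_file=corpus_file,
                   allow_overwrite=True)

# Once the corpus has been processed, start training. Training is also
# asynchronous; poll get_language_model() until the status is "available".
stt.train_language_model(customization_id)
```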

Step 9: Train an acoustic model (optional)

For many solutions a language model provides sufficient accuracy without also building an acoustic model. In fact, you can test your language model against 100% of the audio you collected. For more advice on whether or not you need an acoustic model, see part 2 of Marco Noel’s speech training series.

If you decide to build an acoustic model, take the other 80% of audio segments and use them to train it. The acoustic model should be trained with a “helper” language model (different from the step 8 language model). You are in luck: your data collection script can be used as the source of your helper language model.
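Here is a minimal sketch of the acoustic training calls, again assuming the ibm-watson Python SDK and the `stt` client from the earlier sketches; the archive name and the helper language model ID are placeholders.

```python
# Sketch: create an acoustic model, add the training audio, and train it with
# the helper language model built from the collection script.
am = stt.create_acoustic_model(
    name="veterinary-acoustic",
    base_model_name="en-US_NarrowbandModel").get_result()
acoustic_id = am["customization_id"]

# Audio can be added one file at a time or as an archive of files.
with open("train_segments.tar.gz", "rb") as audio:
    stt.add_audio(acoustic_id,
                  audio_name="training-audio",
                  audio_resource=audio,
                  content_type="application/gzip")

# Placeholder: the customization_id of the helper language model trained on the script.
helper_lm_id = "HELPER_LM_CUSTOMIZATION_ID"
stt.train_acoustic_model(acoustic_id, custom_language_model_id=helper_lm_id)
```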


Step 10: Measure and iterate!

After you have trained the model (with the 80% training audio), test how well it transcribes the test set (the remaining 20%). Analyze the transcription mistakes and look for patterns: are mistakes concentrated on a particular word or phrase, an accent, an environment, or a device? You can add further training data to the language or acoustic models to close any accuracy gaps.
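Word error rate (WER) is the usual metric for this comparison. A minimal sketch, assuming you have paired human reference transcripts and model hypotheses for each test segment:

```python
# Sketch: score the trained model on the held-out test set with word error rate.
# Assumes parallel reference (human) and hypothesis (model) transcripts.
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # Standard edit distance (Levenshtein) over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("i need to neuter my corgi",
                      "i need to neuter my car key"))  # ~0.33
```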

Concluding remarks

This post introduced the major components of speech data collection and speech training and how they tie together. Continue learning about these topics in greater detail in the “How to Train Your Speech Dragon” series by Marco Noel. For help implementing these practices, reach out to IBM Data and AI Expert Labs and Learning.
