Testing Strategies For Speech Applications

Andrew R. Freed
IBM Data Science in Practice
Nov 12, 2018

Artificial intelligence systems using speech services require special testing considerations. I have previously written generally about testing cognitive systems and testing chatbots. In this post I will discuss how to test both kinds of speech service: Speech to Text and Text to Speech. I describe a “unit testing” approach that allows you to test your speech services without the rest of your application being in place.

Speech to Text testing

Speech to Text unit testing checks how accurately speech models transcribe user utterances into text. This is accomplished with a set of ground truth that includes snippets of audio and their corresponding transcriptions in text form. The model is trained on one subset of the ground truth and evaluated on another, and the results are then examined for patterns of errors.

When you train a speech model it is important to use a varied and representative set of ground truth. The variations should cover demographics, equipment, and environments, ideally in proportions representative of your application’s final runtime environment. You will split the ground truth into two buckets: a training set (seen by the model and used to build the model) and a test set (unseen by the model until testing time).
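
As a rough illustration, here is a minimal Python sketch of that split, assuming the ground truth is held as a list of (audio file, transcript) pairs; the file names, the 80/20 ratio, and the random seed are illustrative assumptions, not recommendations.

```python
# Minimal sketch: split ground truth into a training set and a held-out test set.
# The (audio file, transcript) pairs, the 80/20 split, and the seed are all
# illustrative assumptions.
import random

def split_ground_truth(samples, test_fraction=0.2, seed=42):
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]  # (training set, test set)

ground_truth = [
    ("audio/call_0001.wav", "i want to check my balance"),
    ("audio/call_0002.wav", "where else can i go"),
    # ... more (audio, transcript) pairs covering your demographics, equipment, and environments
]

training_set, test_set = split_ground_truth(ground_truth)
```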

At a high level Speech to Text unit testing follows these steps:

· Gather sample audio files.

· Transcribe them (using transcriber or other tools) into Segment Time Mark (STM) format and aggregate them into one large file. These form your ground truth.

· Transcribe them using Speech to Text into Conversation Time Mark (CTM) format and aggregate into one large file.

· Run sclite to generate test results (a minimal invocation is sketched below).
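
As a sketch of the scoring step, the following assumes sclite (from the NIST SCTK toolkit) is installed on your path and that the aggregated files from the previous steps are named ground-truth.stm and hypotheses.ctm; the file names and report options are illustrative, so check the SCTK documentation for the options that fit your setup.

```python
# Sketch: score Speech to Text output against ground truth with sclite.
# File names and output options are illustrative assumptions.
import subprocess

subprocess.run(
    [
        "sclite",
        "-r", "ground-truth.stm", "stm",  # reference transcripts (ground truth)
        "-h", "hypotheses.ctm", "ctm",    # Speech to Text transcriptions
        "-o", "sum", "dtl",               # summary and per-utterance detail reports
    ],
    check=True,
)
```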

The test results include the following:

· A list of each input and its transcription output, as well as a description of any errors.

· A word error rate (WER) summary statistic.

· A sentence error rate (SER) summary statistic.

Figure 1: Example summary output from sclite

The word error rate (WER) is derived by dividing the number of word errors (substitutions, deletions, and insertions) by the total number of words in the ground truth. The sentence error rate (SER) is applied to a complete sentence (or utterance): a sentence counts as an error if any of its words is transcribed incorrectly. A sentence with multiple word errors still only counts as a single sentence error. Figure 1 shows a summary output identifying WER (“Err” column) and SER (“S.Err” column).
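
To make the definitions concrete, here is a small Python sketch that computes WER and SER over a pair of hypothetical reference/hypothesis transcripts. sclite performs this alignment for you; the sketch only illustrates the arithmetic, and the transcripts are made up.

```python
# Sketch: compute WER and SER from (reference, hypothesis) transcript pairs.
# The example transcripts are hypothetical.
def word_errors(reference: str, hypothesis: str) -> int:
    """Minimum substitutions + deletions + insertions (word-level Levenshtein)."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between the first i reference words and first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return dp[len(ref)][len(hyp)]

pairs = [
    ("where else can i go", "well can i go"),              # one deletion + one substitution
    ("my account number is twelve", "my account number is twelve"),
]

total_errors = sum(word_errors(r, h) for r, h in pairs)
total_words = sum(len(r.split()) for r, _ in pairs)
wer = total_errors / total_words
ser = sum(1 for r, h in pairs if word_errors(r, h) > 0) / len(pairs)
print(f"WER: {wer:.2%}  SER: {ser:.2%}")
```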

Figure 2: Example detailed sclite output

Figure 2 shows example sclite output detailing an error. In this example the ground truth value is “where else” and the STT transcription value is “well”, which is recorded as a Deletion (D) and a Substitution (S). No single error is as interesting as finding patterns of errors. These patterns show you where training data needs to be augmented and where improvement effort should be focused.

WER and SER are interesting metrics, but they are not the most important one; for voice applications that is usually overall call completion. Error analysis should focus on finding patterns amongst the errors; these suggest areas to focus improvement with new language and acoustic models. You will not and should not aim for perfection in WER or SER; rather, stop improving your model once your solution metric (such as call completion) reaches a satisfactory level.

When to run Speech to Text testing

Speech to Text transcription testing should be run every time the system is updated with new training data. You will want to understand the influence of the new training data on your model’s performance. If the model degrades with the new training data, you can remove it.
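
One way to operationalize this check is a simple regression gate, sketched below; the numbers are placeholders for WER measured on the same held-out test set before and after retraining.

```python
# Sketch of a regression check after retraining. The values are placeholders
# for WER measured on the same held-out test set before and after adding
# the new training data.
baseline_wer = 0.18   # current model
candidate_wer = 0.21  # model retrained with the new data

if candidate_wer > baseline_wer:
    print("New training data degraded transcription; consider removing it.")
else:
    print("Keeping the new training data: WER improved or held steady.")
```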

Text to Speech testing

Text to Speech unit testing is the one kind of unit testing that can only be done manually. The process for testing Text to Speech is as follows:

· Identify all phrases Text to Speech will be asked to ‘speak’.

· Run each phrase through the Text to Speech service (click “View Demo”).

· Customize the solution for any phrases that sound unnatural.

Figure 3: Text to Speech testing interface — SSML
Figure 4: Text to Speech testing interface — Voice Transformation SSML

There are several ways to improve phrases that sound unnatural. First, there is a broad set of voice transformations that control volume, pitch, pacing, and other aspects of voice through a set of Speech Synthesis Markup Language (SSML) tags in an output phrase. Some Text to Speech voices also support expressive transformations that allow you to explicitly treat an utterance as “good news”, “apologetic”, or “uncertain”.
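
As an example, here is an SSML snippet held as a Python string: prosody is standard SSML, while express-as is the expressive extension that some IBM Watson Text to Speech voices support for the good-news, apologetic, and uncertain styles mentioned above. The attribute values are illustrative starting points, not recommended settings.

```python
# Illustrative SSML for tuning delivery; the attribute values are assumptions
# to experiment with, not recommended settings.
ssml_response = (
    "<speak>"
    '<express-as type="GoodNews">Great news, your payment went through.</express-as> '
    '<prosody rate="slow" pitch="+5%">Is there anything else I can help with today?</prosody>'
    "</speak>"
)
```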

You can also customize by word or phoneme, training Text to Speech how to speak specific words, jargon, or acronyms.
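
Standard SSML tags can cover many of these cases, as in the brief sketch below; the acronym expansion, product name, and IPA string are all hypothetical, and IBM Text to Speech also supports its own SPR notation and custom models for the same purpose.

```python
# Illustrative word-level customization: <sub> expands an acronym and <phoneme>
# pins a pronunciation. The product name and IPA string are hypothetical.
ssml_jargon = (
    "<speak>"
    'Your <sub alias="annual percentage rate">APR</sub> is unchanged. '
    'Ask about our <phoneme alphabet="ipa" ph="ˈkwɪkpeɪ">QuickPay</phoneme> option.'
    "</speak>"
)
```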

Lastly you can consider revising the output phrases themselves. Simple pacing fixes include adding strategic commas between words or following the advice “never use a long word when a diminutive one will do”.

The voice responses from your system will be part of your system’s personality. Be sure to work with your design team when scripting your output, making sure the system sounds like the persona you want the system to have.

When to run Text to Speech testing

You should run Text to Speech testing whenever you add or modify the responses the system will give and whenever you change the voice used by the system.

Conclusion

Speech applications present different testing challenges from other applications. It is possible to test the speech aspects of your application separately from the rest of the application, and it is important to do so. This post outlined ways to test the performance of both speech input (speech to text) and speech output (text to speech).

The key metric for a speech application is generally something like task completion rather than a specific speech accuracy target. The rest of your application may be able to adjust to speech errors. Even so it is good to understand the accuracy of your speech models, how to test these models, and how to improve them.


Andrew R. Freed
IBM Data Science in Practice

Technical lead in IBM Watson. Author: Conversational AI (manning.com, 2021). All views are only my own.