New Python Scripts to Measure Word Error Rate on Watson Speech to Text

Marco Noel
IBM Watson Speech Services
6 min read · Sep 1, 2021
Photo by Marcos Paulo Prado on Unsplash

In my previous series of Medium articles, I covered the overall methodology and best practices for training Watson Speech to Text (STT). Across the multiple training and testing iterations, one important aspect is how to measure improvements and track progress. Using very simple open-source Python scripts, this article walks you through the prerequisites: how to prepare your audio and text data, transcribe the audio files with Watson STT, and then analyze the results.

Download and configure the Python scripts

Photo by Chris Ried on Unsplash

Before you get started, you need to make sure you have Python 3.x installed and ready to use on your machine or in a virtual environment. To check if you have it installed, simply run the following command from your command prompt:

python -V

You should get a response similar to “Python 3.x.x”.

Download the code from the public GitHub repository into a folder on your local machine.

From that folder, run the following command to install the Python dependencies required by the scripts:

pip install -r requirements.txt

It will install these Python modules:

  • IBM Watson Python SDK (for Watson STT)
  • JIWER Python module (for Word Error Rate metrics)
  • ConfigParser (for reading main configuration file)
  • Pandas (for handling CSV files)

You will need a Watson STT instance to transcribe your audio files. To create one on IBM Cloud, you can follow the STT getting-started video. Once it is created, note the API key and URL of your STT instance; you will need them for the configuration file below.

The next step is to create a copy of “config.ini.sample” as your configuration file, which you will modify in the following steps.

cp config.ini.sample config.ini

Each of the following sub-sections describes the configuration parameters needed in the corresponding section of the configuration file.

SpeechToText

This section contains the configuration parameters of your STT instance. The following parameters are required:

  • apikey: API key for your Speech to Text instance
  • service_url: Reference URL for your Speech to Text instance
  • base_model_name: Base model for Speech to Text transcription

When needed, you can add these optional configuration parameters:

  • language_model_id: Language model customization ID (comment out to use the base model)
  • acoustic_model_id: Acoustic model customization ID (comment out to use the base model)
  • grammar_name: Grammar name (comment out to use the base model)
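Putting these together, the [SpeechToText] section of your config.ini might look like the sketch below. The key, URL, and IDs are placeholder values, and you should check “config.ini.sample” for the exact parameter names:

```ini
[SpeechToText]
apikey = your-stt-api-key
service_url = https://api.us-south.speech-to-text.watson.cloud.ibm.com
base_model_name = en-US_NarrowbandModel
; Optional customizations - comment out to use the base model
; language_model_id = xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
; acoustic_model_id = xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
; grammar_name = my-grammar
```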

Transcriptions

This section is specific to files used as transcription inputs and outputs to the Python scripts.

  • reference_transcriptions_file: Reference file containing the human transcriptions of the audio files (“labeled data” or “ground truth”). If present, it will be merged into the stt_transcriptions_file as the “Reference” column
  • stt_transcriptions_file: Output file for Speech to Text transcriptions
  • audio_file_folder: Input directory containing the audio files

ErrorRateOutput

The parameters in this section define the output files containing the detailed results and the overall summary of the whole experiment.

  • details_file: CSV file with a row for each audio sample, including the reference transcription, the hypothesis transcription, and the specific transcription errors
  • summary_file: JSON file with metrics for total transcriptions and overall word and sentence error rates

Transformations

This section deals with parameters used in post-processing to measure Word Error Rate (WER):

  • remove_word_list: removes specific words like hesitation markers
  • lower_case: transforms all text into lower case
  • remove_punctuation: removes all punctuation from text utterances
  • remove_multiple_spaces: removes extra spaces between words
  • remove_white_space: removes special white-space characters like \t, \n, \r
  • sentences_to_words: splits the utterances into individual words separated by spaces
  • strip: removes all leading and trailing spaces
  • remove_empty_strings: removes empty text strings
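The scripts delegate this normalization to the JIWER library, but the effect of the pipeline is easy to see in plain Python. The sketch below is my own illustration of what these transformations do, not the actual code from the repository (“%HESITATION” is the marker Watson STT emits for hesitations in US English):

```python
import re
import string

def normalize(text, remove_words=("%HESITATION",)):
    """Illustrative version of the Transformations pipeline from config.ini."""
    for word in remove_words:                 # remove_word_list
        text = text.replace(word, " ")
    text = text.lower()                       # lower_case
    # remove_punctuation: strip every ASCII punctuation character
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"[\t\n\r]", " ", text)     # remove_white_space
    text = re.sub(r" +", " ", text)           # remove_multiple_spaces
    text = text.strip()                       # strip
    # sentences_to_words + remove_empty_strings
    return [w for w in text.split(" ") if w]
```

For example, `normalize("The %HESITATION quick,\tBrown Fox!")` reduces both casing and punctuation away and returns `["the", "quick", "brown", "fox"]`, which is the word list the error-rate metrics are computed over.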

Prepare your data

Make sure you properly prepare your data in the right format
Photo by zibik on Unsplash

For the purpose of this article, I will use the sample files from the GitHub repository, located in the “sample-files” folder.

Before you get started, you will need:

  • a test set with audio files
  • your reference transcription of each audio file in your test set

You must save this information in a CSV file with the same name as configured in the config.ini (“reference_transcriptions_file”).

Below is the sample CSV file called “reference_transcriptions.csv”, showing four sample audio files with their reference transcription.

“reference_transcriptions.csv” with audio file and human transcription
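As a rough sketch, such a file pairs each audio file name with its human transcription. The column headers and transcription text below are illustrative assumptions, not the repository’s actual contents:

```csv
Audio File Name,Reference
lipitor.wav,I need a refill of my Lipitor prescription
tylenol.wav,Does Tylenol interact with this medication
```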

Running the Python scripts to transcribe

Run the following command from your command prompt to start the Speech to Text transcription of each audio file:

>>> python transcribe.py
Using default config filename: config.ini.
Transcribing from ./lipitor.wav
Transcribing from ./ibuprofen.wav
Transcribing from ./vicodin.wav
Transcribing from ./tylenol.wav
Wrote transcriptions for 4 audio files to stt_transcriptions.csv
Found reference transcriptions file - reference_transcriptions.csv - attempting merge with model's transcriptions
Updated stt_transcriptions.csv with reference transcriptions
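The merge step at the end of this run behaves like a pandas left join on the audio file name. Here is a minimal sketch of that idea; the column names and transcription values are assumptions for illustration, not necessarily the scripts’ actual headers:

```python
import pandas as pd

# Hypothetical STT output: one row per transcribed audio file.
stt = pd.DataFrame({
    "Audio File Name": ["lipitor.wav", "tylenol.wav"],
    "Transcription": ["I need a refill of my lipitor prescription",
                      "does tylenol interact with this medication"],
})

# Hypothetical human reference transcriptions ("ground truth").
ref = pd.DataFrame({
    "Audio File Name": ["lipitor.wav", "tylenol.wav"],
    "Reference": ["I need a refill of my Lipitor prescription",
                  "Does Tylenol interact with this medication?"],
})

# Left join keeps every STT row, even if a reference is missing.
merged = stt.merge(ref, on="Audio File Name", how="left")
merged.to_csv("stt_transcriptions.csv", index=False)
```

The resulting CSV then carries both the model’s hypothesis and the human reference side by side, which is what the analysis step needs.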

When complete, a CSV output file should be generated as configured in the config.ini (“stt_transcriptions_file”).

In our example, the sample CSV file called “stt_transcriptions.csv” has been created.

“stt_transcriptions.csv” now containing STT transcriptions

Measure the Word Error Rate

From the same command prompt, run the following Python script to measure the Word Error Rate of the results:

>>> python analyze.py
Using default config filename: config.ini.
Writing detailed results to wer_details.csv
Writing summary results to wer_summary.json

You should be getting two output files:

  • Word Error Rate detailed results as configured in the config.ini (“details_file”)
  • Word Error Rate summary results as configured in the config.ini (“summary_file”)

The sample CSV file “wer_details.csv” should look as follows:

  • the original first 3 columns from “stt_transcriptions.csv” (A, B, C)
  • a “cleaned” version of the “Reference” and “Transcription” columns (D, E), without punctuation, capitalization, or unnecessary spaces; these are the columns used to compute the metrics
  • the metrics and differences between the transcription and the reference
  • WER (word error rate), commonly used in ASR assessment, measures the cost of restoring the output word sequence to the original input sequence.
  • MER (match error rate) is the proportion of I/O word matches which are errors.
  • WIL (word information lost) is a simple approximation to the proportion of word information lost which overcomes the problems associated with the RIL (relative information lost) measure that was proposed half a century ago.
  • Hits are the correct word matches between reference and transcription
  • Substitutions are words that were replaced in the transcription
  • Deletions are words that were deleted in the transcription
  • Insertions are words that were added in the transcription
  • Differences are words from the reference that were not present in the transcription. This column is a good starting point to identify words and expressions to train Watson Speech to Text with.

For more details on the WER, MER and WIL metrics, you can read this paper from Morris, Andrew & Maier, Viktoria & Green, Phil. (2004): From WER and RIL to MER and WIL: improved evaluation measures for connected speech recognition.
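All of these numbers fall out of a word-level Levenshtein alignment between the cleaned reference and the cleaned transcription. The sketch below is my own illustration of the formulas from Morris et al., not the source of analyze.py, and it assumes a non-empty reference:

```python
def align_counts(ref_words, hyp_words):
    """Levenshtein alignment; returns (hits, substitutions, deletions, insertions)."""
    R, H = len(ref_words), len(hyp_words)
    d = [[0] * (H + 1) for _ in range(R + 1)]
    for i in range(R + 1):
        d[i][0] = i                      # cost of deleting i reference words
    for j in range(H + 1):
        d[0][j] = j                      # cost of inserting j hypothesis words
    for i in range(1, R + 1):
        for j in range(1, H + 1):
            sub = d[i - 1][j - 1] + (ref_words[i - 1] != hyp_words[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    # Walk back through the table to count each edit type.
    hits = subs = dels = ins = 0
    i, j = R, H
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and ref_words[i - 1] == hyp_words[j - 1]
                and d[i][j] == d[i - 1][j - 1]):
            hits += 1; i -= 1; j -= 1
        elif (i > 0 and j > 0 and ref_words[i - 1] != hyp_words[j - 1]
                and d[i][j] == d[i - 1][j - 1] + 1):
            subs += 1; i -= 1; j -= 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            dels += 1; i -= 1
        else:
            ins += 1; j -= 1
    return hits, subs, dels, ins

def error_rates(reference, hypothesis):
    h, s, d, i = align_counts(reference.split(), hypothesis.split())
    n_ref = h + s + d                    # words in the reference
    n_hyp = h + s + i                    # words in the hypothesis
    return {
        "wer": (s + d + i) / n_ref,            # word error rate
        "mer": (s + d + i) / (h + s + d + i),  # match error rate
        "wil": 1 - (h / n_ref) * (h / n_hyp),  # word information lost
        "hits": h, "substitutions": s, "deletions": d, "insertions": i,
    }
```

For instance, comparing the reference “the quick brown fox” against the hypothesis “the quick brown dog” gives 3 hits and 1 substitution, so WER and MER are both 0.25 while WIL is 0.4375.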

DEMO: How to use the Python scripts

In the following YouTube video, my colleague Andrew Freed walks you through how to prepare your CSV files with some audio files and how to use the different Python scripts. He also shows you the different output CSV files with the different metrics.

Please share your thoughts and ideas on these Python scripts.

Credits

Key contributors are Andrew R. Freed, Leo Mazzoli, and Andrew Pang.

The JIWER Python module was built by Nik Vaessen.
