IBM Watson Speech to Text is Awesome! But how do I know it's working?

Scott Graham
IBM Data Science in Practice
6 min read · Mar 21, 2018

Speech to Text (STT) is cool — hopefully you’ve already crafted an excellent solution that is providing significant business value for you. However, if you’ve even started playing around with STT, you’ve probably asked yourself:

How do I know it's working as expected?

In any STT system, the very first thing you will do is try to transcribe some sample audio; after all, that is its purpose. You will hit some roadblocks on ‘Audio Format’, and you may be overwhelmed with audio mumbo jumbo like sampling rate and bit rate. Don’t ignore this — it is very important, and I may dive into it in a separate entry. But here I really want to focus on the BIG ROADBLOCK you will hit: Quantifying Success.

As soon as you transcribe your first file, you will look at the results and say “Oh, that’s pretty good” or “Ugh, that’s terrible”. When you do that, you are comparing what you heard (the reference) to what the Speech To Text engine returned (the hypothesis). This will be your first impression, and it will likely stick with you for the duration of your evaluation. Don’t let it. What you have just done is make a judgement based on opinion, not on facts, and that will be extremely hard to validate and measure as you expand the system.

Definitions

  • reference — The actual transcript of an audio file, without any errors (or as few as possible, since even the human ear only gets about 95% accuracy)
  • hypothesis — The predicted transcription from a speech to text system (a Machine)

Measuring a Speech To Text system

Your mission is to generate a quantitative measure of the results. Luckily, a guy (Jon Fiscus at NIST) developed what appears to be the standard for comparing your ‘Reference’ to your ‘Hypothesis’ back in the 90s. The tool is called sclite, and it produces a set of measurements that can be used to determine quantitatively the success of your transcription. It will tell you the number of Correct, Substituted, Deleted, and Inserted words, along with calculating the primary measurement: the Word Error Rate.
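If you’re curious what’s under the hood, here is a minimal Python sketch of the word-level edit distance that the Word Error Rate is built on. sclite does far more (alignments, reports, per-speaker breakdowns); this is just the core idea:

def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits to turn the first i reference words
    # into the first j hypothesis words (standard Levenshtein distance).
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting every reference word
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting every hypothesis word
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("thank you very much for coming",
                      "thank you so much for calling"))  # 2 errors / 6 words ≈ 0.33

In other words, WER = (substitutions + deletions + insertions) / number of reference words, which is also why WER can exceed 100% if the engine inserts enough extra words.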

Honestly, you don’t have to use sclite and the Word Error Rate, but they are the industry standard and they enforce a consistent measure. How you measure is your choice, but consistency is key.

The Hard Part

So we know we have to measure the results, but that can only be done if we have a reference transcript created by a human. This is the hard part. Transcribing an audio file can take anywhere from 4 to 20 times the length of the file (a five-minute call can easily cost an hour of someone’s time). And it’s boring, really boring. Not only does a human have to listen, they ultimately have to provide the reference in a format that can be consumed by sclite. This is not an easy task, but it is necessary, and it is small compared to the volume of transcription you probably hope to automate.

Consider this scenario: Cool Service Company receives thousands of phone calls a month that they record and have transcribed via a Speech To Text engine. They want to evaluate the success of their system to make sure it is working satisfactorily. They don’t need to manually transcribe all of the calls, because that defeats the purpose, but they must manually transcribe some of them. How many is ultimately up to them, but I recommend somewhere between 10 and 20. Statistically, the goal is to approach a stable average.

Something like this:

[Figure: Accuracy (1 − WER) vs. number of samples]

At this point in our process, what the stable average is doesn’t really matter; it matters that we have one. What!?!? That’s right: many things are going to affect the stable average (of Accuracy or WER), including audio quality and TRAINING!
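To make that concrete, here is a toy Python sketch (the per-file accuracies below are invented) that prints the running mean as samples accumulate. You are looking for the point where adding another file barely moves the number:

# Made-up per-file accuracy (1 - WER) for 15 manually checked calls.
per_file_accuracy = [0.91, 0.78, 0.85, 0.88, 0.80, 0.86, 0.84,
                     0.87, 0.83, 0.85, 0.86, 0.84, 0.85, 0.86, 0.85]

running_total = 0.0
for n, accuracy in enumerate(per_file_accuracy, start=1):
    running_total += accuracy
    # The mean jumps around early, then flattens toward a stable average.
    print(f"after {n:2d} samples: mean accuracy = {running_total / n:.3f}")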

So how do I measure?

An end-to-end system is certainly the goal, but while working on that I’ve created a couple of tools that run as ‘IBM Cloud Functions’ so you can get started now. The gist of what we need to do is:

  1. Run STT on a file
  2. Create a reference for the file (using the STT Output)
  3. Use the STT Output and reference to determine Word Error Rate

Run STT on a file

This of course DEPENDS on you having a Watson STT account. You can read about Watson Speech To Text and the API here:

https://www.ibm.com/watson/developercloud/speech-to-text/api/v1

$ curl -X POST -u "{username}":"{password}" --header "Content-Type: audio/wav" --data-binary "@somefile.wav" "https://stream.watsonplatform.net/speech-to-text/api/v1/recognize?timestamps=true&speaker_labels=true" > somefile.json

You will now have a file somefile.json which contains the Speech To Text results with timestamps and speaker_labels. Timestamps are required to measure the results. We are going to edit this file in order to call the cloud function on it. somefile.json will look like this (with results and speaker_labels populated, of course):

{
  "results": [...],
  "result_index": 0,
  "speaker_labels": [...]
}
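Before moving on, it’s worth a quick sanity check that the response actually has what we need. This little Python snippet (my own convenience check, not part of the tooling) verifies the fields and shows where the timestamps live in the Watson response:

import json

with open("somefile.json") as f:
    stt = json.load(f)

# speaker_labels sit at the top level; timestamps sit inside each
# result's alternatives as [word, start_seconds, end_seconds] triples.
assert "results" in stt and "speaker_labels" in stt, \
    "re-run recognize with timestamps=true&speaker_labels=true"

first_alt = stt["results"][0]["alternatives"][0]
print(first_alt["transcript"])
print(first_alt["timestamps"][:3])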

Create a reference

In order to create a reference, you have to install the IBM Cloud Functions tooling into your Bluemix account. The following describes how to set it up: https://console.bluemix.net/docs/openwhisk/index.html#getting-started-with-cloud-functions

Once you have bx wsk installed and working from the previous link, you can run the following:

$ bx wsk action invoke /wincart_org_dev/stt-tools/watson-stt-transforms -P somefile.json --result > with_reference.json

with_reference.json will be in the format of:

{
  "reference": [
    {
      "start": 0.54,
      "end": 3.37,
      "id": "SPEAKER0-0",
      "speaker": 0,
      "text": "so thank you very much for coming Dave it's good to have you here"
    },
    ...
  ],
  "sttjson": {...}
}

Each line in the reference represents what Speech To Text thought was the utterance (text) for the time range in question (start to end).

Now you must edit this reference and make all of the text correct by listening to your audio file and fixing any mistakes!
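If you would rather review the text as a plain list than inside raw JSON, a tiny script like this (my own convenience helper, not part of the cloud functions) prints each segment with its time range so you can scrub through the audio alongside it:

import json

with open("with_reference.json") as f:
    data = json.load(f)

# One line per utterance: time range, speaker, and the text to verify.
for segment in data["reference"]:
    print(f'{segment["start"]:8.2f} - {segment["end"]:8.2f}  '
          f'speaker {segment["speaker"]}: {segment["text"]}')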

Calculate the Word Error Rate

When your reference is correct, you can measure your Word Error Rate. To do that, take the file with_reference.json that you edited to be correct and run it through the sclite-whisk Cloud Function:

$ bx wsk action invoke /wincart_org_dev/stt-tools/sclite-whisk -P with_reference.json --blocking --result > analysis.json

analysis.json now contains the results of running sclite on the reference and the sttjson. This looks like:

{ "alignment": {...},
"internal": "Success",
"summary" : {
"mean": {
"correct": "88.3",
"deletions": "1.8",
"error": "13.7",
"insertions": "2.0",
"number_of_sentences": "19.0",
"number_of_words": "445.0",
"sentence_errors": "63.2",
"substitutions": "9.9"
},

The definitions are relatively obvious; however, it is important to note that some are percentages and some are counts (the number_* ones). In particular, error is the Word Error Rate: substitutions + deletions + insertions (13.7 = 9.9 + 1.8 + 2.0 in the example above).
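If you want to track these numbers over time, pulling the headline figures out of analysis.json is straightforward. A short Python sketch, assuming the field names shown above (note the values arrive as strings, so convert them before doing arithmetic):

import json

with open("analysis.json") as f:
    mean = json.load(f)["summary"]["mean"]

# error is the WER as a percentage; accuracy is its complement.
wer = float(mean["error"])
print(f"WER: {wer:.1f}%  Accuracy: {100.0 - wer:.1f}%")
print(f"substitutions: {mean['substitutions']}%  "
      f"deletions: {mean['deletions']}%  "
      f"insertions: {mean['insertions']}%")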

And Finally…

We now know how to take Watson Speech To Text results, create a reference, correct the reference, and measure the Word Error Rate. This technique and idea works for any Speech To Text (STT) or Automatic Speech Recognition (ASR) system; the caveat being that you will have to do your own transformations if the STT engine is not Watson.

The value of this information is that we can now use it to see if we can improve the results. IBM Watson Speech To Text offers many knobs to turn to customize and train your own Language and Acoustic models. They are documented here. In my next piece, I’ll go through how to train a model.

About the Author

I joined IBM Watson from the IBM WebSphere team, where I had built a relay that transcoded phone audio (SIP/RTP) into PCM over a WebSocket so it could be streamed directly to Watson’s Speech to Text (STT) service. This eventually turned into the IBM Voice Gateway. Doing this naturally required building relationships with the Speech To Text development team.

When I moved to IBM Watson I was labeled the Speech To Text expert for our team, not because I was an expert, but because I had more experience than most. In any case, I have seen a lot of the missed expectations and pitfalls of implementing Speech To Text systems. And while still no ‘expert’, I do believe I have some salient advice. Take it as you see fit.
