Speech To Text (STT) using the Watson Speech Library

Abhilasha Mangal
IBM Data Science in Practice
Nov 8, 2022

Many companies now record customer service calls. To extract useful information from this speech data, we need speech-to-text and text analysis services. Here, we use the Watson Speech and Watson NLP libraries to extract that information.

Watson speech-to-text app

IBM Watson Speech to Text is a speech recognition service. It offers features such as speech recognition, audio pre-processing, background noise suppression, splitting of transcripts into phrases, and identification of the speakers in a conversation.

The purpose of this blog is to show you how to transform speech data into text using Watson Speech libraries and extract meaningful information from text data using Watson Natural Language Processing.

We use a single-container STT service. The service is installed on a cluster, and you connect to it through Docker. To learn more about using the single-container STT service, click here.

1. Data Collecting (Voice Data)

We can load the speech data and play it in a Python Jupyter Notebook.

Audio wave plot graph
Speech data output
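As a minimal sketch of this step, the snippet below reads a WAV file with the Python standard library and, inside a notebook, plots the waveform and returns an audio player. It assumes a mono, 16-bit PCM recording. The helper names (load_wav, duration_seconds, show_in_notebook) are illustrative and not taken from the original notebook.

```python
import array
import sys
import wave


def load_wav(path):
    """Read a 16-bit PCM WAV file and return (samples, sample_rate)."""
    with wave.open(path, "rb") as wav_file:
        if wav_file.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM audio")
        sample_rate = wav_file.getframerate()
        frames = wav_file.readframes(wav_file.getnframes())
    samples = array.array("h")
    samples.frombytes(frames)
    if sys.byteorder == "big":
        samples.byteswap()  # WAV data is little-endian
    return samples, sample_rate


def duration_seconds(samples, sample_rate, channels=1):
    """Length of the recording in seconds."""
    return len(samples) / (sample_rate * channels)


def show_in_notebook(path):
    """Plot the waveform and return a player widget (Jupyter only, mono audio)."""
    # Third-party packages, imported lazily so the loaders above work without them.
    import matplotlib.pyplot as plt
    from IPython.display import Audio

    samples, sample_rate = load_wav(path)
    times = [i / sample_rate for i in range(len(samples))]
    plt.plot(times, samples)
    plt.xlabel("Time (s)")
    plt.ylabel("Amplitude")
    plt.show()
    return Audio(filename=path)
```

In a notebook cell, calling show_in_notebook with the path to your own recording as the last expression draws the plot and renders the player.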

2. Speech-to-text recognition

To request a transcript, send a POST request to the Watson STT “/v1/recognize” endpoint. We use the Python requests module to call this service. The speech-to-text service is reached at the URL shown below.

Watson STT service URL

The speech recognition service supports many languages and audio formats. You can configure the headers and parameters with the code below. This example is for a WAV audio file (content type “audio/wav”) and an English language model.

Watson STT parameters
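The URL, headers, and parameters might be set up roughly as follows. This is a sketch under assumptions: the base URL (a container published on localhost port 1080) and the model name en-US_Multimedia are placeholders that depend on your own deployment, and the helper names are illustrative.

```python
# Assumption: the STT container is reachable at this address; change it to
# match how the service is exposed in your environment.
STT_BASE_URL = "http://localhost:1080/speech-to-text/api"


def recognize_url(base_url=STT_BASE_URL):
    """Return the full URL of the recognize endpoint."""
    return base_url.rstrip("/") + "/v1/recognize"


def build_request_options(content_type="audio/wav", model="en-US_Multimedia"):
    """Return (headers, params) for a recognition request.

    The Content-Type header describes the audio format; the model query
    parameter selects the language model (assumed name, adjust as needed).
    """
    headers = {"Content-Type": content_type}
    params = {"model": model}
    return headers, params


url = recognize_url()
headers, params = build_request_options()
```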

You can pass all these parameters, headers, and data to the speech recognition service.

Watson speech recognition service
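A hedged sketch of the call itself: the audio file is streamed as the request body, and the headers and parameters described above are passed along. The function name transcribe and its post argument are illustrative. The post argument exists only so the function can be exercised without a running service; by default it uses requests.post.

```python
def transcribe(audio_path, url, headers, params, post=None, timeout=60):
    """POST an audio file to the recognize endpoint and return the parsed JSON."""
    if post is None:
        import requests  # third-party; imported here so a stub can replace it

        post = requests.post
    with open(audio_path, "rb") as audio_file:
        response = post(
            url,
            headers=headers,
            params=params,
            data=audio_file,  # raw audio bytes as the request body
            timeout=timeout,
        )
    response.raise_for_status()  # surface HTTP errors instead of parsing them
    return response.json()
```

A call would look like transcribe("call.wav", url, {"Content-Type": "audio/wav"}, {"model": "en-US_Multimedia"}), where the file name, URL, and model are placeholders for your own values.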

The service returns the transcript and a confidence score as JSON.

STT output
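To pull the transcript and confidence out of the JSON, something like the following can be used. It assumes the usual response layout, a top-level "results" list whose entries carry an "alternatives" list. The sample response is made up for illustration and is not output from the original notebook.

```python
def extract_transcripts(response_json):
    """Return a list of (transcript, confidence) pairs, one per result."""
    segments = []
    for result in response_json.get("results", []):
        alternatives = result.get("alternatives", [])
        if not alternatives:
            continue
        best = alternatives[0]  # the first alternative is the top hypothesis
        segments.append((best.get("transcript", "").strip(), best.get("confidence")))
    return segments


# Hypothetical response, shaped like the service output.
sample_response = {
    "result_index": 0,
    "results": [
        {
            "final": True,
            "alternatives": [
                {"transcript": "thank you for calling ", "confidence": 0.94}
            ],
        }
    ],
}

segments = extract_transcripts(sample_response)
full_text = " ".join(text for text, _ in segments)
```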

3. Speech recognition parameters

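The speech recognition service provides many parameters for refining how voice data is transcribed, as shown below. Boolean parameters, such as smart formatting and speaker labels, are false by default and are enabled by setting them to true in a recognition request; others, such as background audio suppression, take a numeric value.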

3.1 Speech activity detection

The Speech to Text service offers speech activity detection parameters. These parameters control how sensitive the service is to speech and how much background noise it suppresses. For example, the background audio suppression parameter removes background noise from the audio.

Speech background audio suppression output
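As a sketch, the two speech activity detection settings can be added to the query parameters as numbers between 0.0 and 1.0. The helper name speech_activity_params is illustrative, and the values in the usage line are arbitrary examples, not recommendations.

```python
def speech_activity_params(background_audio_suppression=0.0,
                           speech_detector_sensitivity=0.5):
    """Build query parameters for speech activity detection.

    Both values must lie in [0.0, 1.0]. Higher suppression removes more
    background audio; higher sensitivity makes the detector more likely
    to treat audio as speech.
    """
    settings = {
        "background_audio_suppression": background_audio_suppression,
        "speech_detector_sensitivity": speech_detector_sensitivity,
    }
    for name, value in settings.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0.0 and 1.0, got {value}")
    return settings


# Merge with the base parameters of a recognition request (model name assumed).
params = {"model": "en-US_Multimedia", **speech_activity_params(0.5, 0.6)}
```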

3.2 Speech audio parsing

The Watson STT service offers speech audio parsing parameters. The “end of phrase silence time” parameter specifies how long a pause must last before the service splits the transcript into separate results, so a single recording can be split into several phrases.

Speech audio parsing output
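A possible way to set these parsing options and read the resulting phrases is sketched below. The parameter names follow the Watson STT API; the helper names and the sample response are illustrative. Boolean flags are sent as lowercase strings because requests would otherwise serialize Python's True as "True".

```python
def audio_parsing_params(end_of_phrase_silence_time=0.8,
                         split_transcript_at_phrase_end=False):
    """Build query parameters that control where the transcript is split.

    end_of_phrase_silence_time is the pause length in seconds (0.0 to 120.0).
    """
    if not 0.0 <= end_of_phrase_silence_time <= 120.0:
        raise ValueError("end_of_phrase_silence_time must be between 0.0 and 120.0")
    return {
        "end_of_phrase_silence_time": end_of_phrase_silence_time,
        "split_transcript_at_phrase_end": "true" if split_transcript_at_phrase_end else "false",
    }


def phrases(response_json):
    """Return one transcript string per result (one per detected phrase)."""
    return [
        result["alternatives"][0]["transcript"].strip()
        for result in response_json.get("results", [])
        if result.get("alternatives")
    ]


# Hypothetical response after the audio was split at two pauses.
sample_response = {
    "results": [
        {"alternatives": [{"transcript": "my card was charged twice "}]},
        {"alternatives": [{"transcript": "and nobody called me back "}]},
    ]
}
split_phrases = phrases(sample_response)
```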

3.3 Response formatting and filtering

The Speech to Text service includes smart formatting parameters that allow you to arrange the final transcript so that specific strings, such as dates, times, and numbers, appear in more conventional representations. It can also redact the final transcript to mask sensitive numeric information.

Response formatting and filtering feature
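Enabling these two options comes down to two boolean query parameters, smart_formatting and redaction. The sketch below uses an illustrative helper, formatting_params, and sends the flags as lowercase strings so they do not depend on how the HTTP client serializes Python booleans. The model name is an assumption.

```python
def _flag(value):
    """Render a Python boolean as the lowercase string used in the query."""
    return "true" if value else "false"


def formatting_params(smart_formatting=True, redaction=False):
    """Build query parameters for response formatting and redaction."""
    return {
        "smart_formatting": _flag(smart_formatting),
        "redaction": _flag(redaction),
    }


# Base request parameters plus formatting options.
params = {"model": "en-US_Multimedia", **formatting_params(redaction=True)}
```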

3.4 Speaker labels

The Speech to Text service can identify the speakers in a multi-participant exchange. With the speaker labels parameter, you can create a speaker-by-speaker transcript of an audio stream.

Speaker labels output
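When speaker labels are requested, the response carries word timestamps plus a "speaker_labels" list that assigns a speaker number to each word's time span. The sketch below, with made-up sample data and an illustrative function name, joins the two into a speaker-by-speaker transcript by matching start times.

```python
def speaker_turns(response_json):
    """Group consecutive words by speaker and return (speaker, text) pairs."""
    speaker_by_start = {
        label["from"]: label["speaker"]
        for label in response_json.get("speaker_labels", [])
    }
    turns = []
    for result in response_json.get("results", []):
        alternatives = result.get("alternatives", [])
        if not alternatives:
            continue
        # Each timestamp entry is [word, start_seconds, end_seconds].
        for word, start, _end in alternatives[0].get("timestamps", []):
            speaker = speaker_by_start.get(start)  # None if no label matches
            if turns and turns[-1][0] == speaker:
                turns[-1][1].append(word)
            else:
                turns.append((speaker, [word]))
    return [(speaker, " ".join(words)) for speaker, words in turns]


# Hypothetical response for a two-speaker exchange.
sample_response = {
    "results": [
        {
            "alternatives": [
                {
                    "transcript": "hello there hi ",
                    "timestamps": [
                        ["hello", 0.0, 0.4],
                        ["there", 0.4, 0.8],
                        ["hi", 1.2, 1.5],
                    ],
                }
            ]
        }
    ],
    "speaker_labels": [
        {"from": 0.0, "to": 0.4, "speaker": 0, "confidence": 0.6, "final": False},
        {"from": 0.4, "to": 0.8, "speaker": 0, "confidence": 0.6, "final": False},
        {"from": 1.2, "to": 1.5, "speaker": 1, "confidence": 0.7, "final": True},
    ],
}

turns = speaker_turns(sample_response)
```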

4. Microphone recognition

To enable real-time audio recording, we used the open-source Python libraries SpeechRecognition and PyAudio (v0.2.12) to let users record their voices. You can send these recordings to the Watson STT service to get transcripts in real time.
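One way this could look, as a sketch rather than the original notebook code: SpeechRecognition captures a phrase from the default microphone, and the resulting WAV bytes are posted to the recognize endpoint. The function names, the default model name, and the ten-second limit are assumptions. Recording needs a working microphone and PyAudio, so that part only runs on a local machine.

```python
def record_from_microphone(phrase_time_limit=10):
    """Record one phrase from the default microphone and return WAV bytes."""
    # Third-party: requires the SpeechRecognition and PyAudio packages.
    import speech_recognition as sr

    recognizer = sr.Recognizer()
    with sr.Microphone() as source:
        recognizer.adjust_for_ambient_noise(source)  # calibrate to room noise
        audio = recognizer.listen(source, phrase_time_limit=phrase_time_limit)
    return audio.get_wav_data()


def transcribe_wav_bytes(wav_bytes, url, model="en-US_Multimedia", post=None, timeout=60):
    """Send in-memory WAV bytes to the recognize endpoint and return the JSON."""
    if post is None:
        import requests  # third-party; a stub can be passed in for testing

        post = requests.post
    response = post(
        url,
        headers={"Content-Type": "audio/wav"},
        params={"model": model},
        data=wav_bytes,
        timeout=timeout,
    )
    response.raise_for_status()
    return response.json()
```

Chaining the two, transcribe_wav_bytes(record_from_microphone(), url), records a phrase and returns its transcript JSON, where url is the recognize endpoint of your own deployment.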

5. Transcribe customer calls and extract meaningful insights using the Watson NLP library

We converted consumer complaint voice data into text. This text can then be analyzed with the Watson NLP library to extract meaningful insights. We used the Watson Keywords & Phrase Extraction service to understand the main pain points of customers.

Watson NLP — keywords and phrase extraction

To learn more about the code for Watson NLP Keywords & Phrase Extraction, see the code on GitHub.

Conclusion:

We have seen how you can use the Watson Speech and NLP libraries to convert speech data into text and extract meaningful information. We have covered the main features of the Watson Speech-to-Text service. Follow this tutorial on IBM Developer to learn more about the Watson Speech and NLP libraries.

You can start your AI journey by browsing and building AI models through a guided wizard here. The IBM Build Lab team is here to work with you on your AI journey. For more information, see the Embeddable AI webpage.

You can also browse the collection of Embeddable AI self-serve assets at Tech Zone and on GitHub.
