Build a Watson Speech to Text (STT) service and consume it with a simple Java application

Himadri Talukder
IBM Data Science in Practice
4 min read · Nov 7, 2022

What is IBM Watson Speech to Text?

IBM Watson® Speech to Text converts speech into text using AI-powered speech recognition and transcription. It enables fast and accurate speech transcription in multiple languages for a variety of use cases, including but not limited to customer self-service, agent assistance and speech analytics.

In this step-by-step guide we will build and run a Java Spring Boot web application that relies on Watson Speech to Text (STT) as a back-end service. You can use the application code as a reference point when developing your own speech application.

The application demonstrates two types of interfaces that a client program can use to leverage Watson STT.

  • REST interface: a batch interface that processes audio files one at a time.
  • WebSocket interface: a streaming interface for live audio.

[Figure: Architecture diagram]

Prerequisites:

The tutorial is a two-step process:

  1. Start STT service in a docker container
  2. Run the STT Web Application

1. Start STT service in a docker container

We will build a Docker container image that bundles the Watson Speech to Text runtime with two pretrained models, so a single image contains both the models and the runtime.

In this step we run a single-container Speech to Text (STT) service on our local machine using Docker.

Prerequisite: You will need an entitlement key to access the IBM Entitled Registry.

Step 1.1: Login to the IBM Entitled Registry

The IBM Entitled Registry stores the container images for the Watson Speech to Text runtime and the pretrained models. To pull these images to your local machine, log in to the registry with your IBM entitlement key.

echo $IBM_ENTITLEMENT_KEY | docker login -u cp --password-stdin cp.icr.io

Step 1.2: Clone the sample code repository

git clone https://github.com/ibm-build-lab/Watson-Speech.git

Change to the directory that contains the source code for this step.

cd Watson-Speech/single-container-stt

Step 1.3: Update configurations for the set of models you want to use

A list of available models can be found in the models catalog. The configuration can be found in the configuration page.

The env_config.json file specifies where the models are located. Update its models list with the set of models you want to use. Here we use two pretrained models, `en-US-multimedia` and `fr-FR-multimedia`.

Also update sessionPools.yaml with the models being used.

Step 1.4: Build the container image from the provided Dockerfile

Here we are using two pretrained models to support two different languages:

  • `en-us-multimedia`
  • `fr-fr-multimedia`

docker build . -t stt-standalone

Step 1.5: Start the STT container service

Run the container image you built in the previous step.

docker run --rm --env ACCEPT_LICENSE=true --publish 1080:1080 stt-standalone

This service runs in the foreground, so use a separate terminal for the commands that follow.

Note: You can ignore the message `fatal: not a git repository (or any of the parent directories): .git`.

Step 1.6: List all language models available

curl "http://localhost:1080/speech-to-text/api/v1/models"

Step 1.7: Test the service with a /recognize request using an audio file you have

curl "http://localhost:1080/speech-to-text/api/v1/recognize" \
  --header "Content-Type: audio/wav" \
  --data-binary @output.wav

Here `output.wav` is the audio file you want to transcribe.
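
If you would rather issue the two curl calls above from Java, the sketch below builds the same requests with the JDK's built-in `java.net.http` client (Java 11+). The class name `SttRestSketch` and its helper methods are hypothetical and are not part of the sample repository; the URL paths and the `Content-Type` header are copied from the curl commands. Treat it as a rough illustration rather than the way the sample application talks to the service.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class; not part of the sample repository.
class SttRestSketch {
    // API prefix taken from the curl examples in this tutorial.
    static final String API_PREFIX = "/speech-to-text/api/v1";

    // GET request that lists the models served by the container.
    static HttpRequest listModelsRequest(String baseUrl) {
        return HttpRequest.newBuilder(URI.create(baseUrl + API_PREFIX + "/models"))
                .GET()
                .build();
    }

    // POST request that uploads WAV audio bytes for transcription.
    static HttpRequest recognizeRequest(String baseUrl, byte[] wavAudio) {
        return HttpRequest.newBuilder(URI.create(baseUrl + API_PREFIX + "/recognize"))
                .header("Content-Type", "audio/wav")
                .POST(HttpRequest.BodyPublishers.ofByteArray(wavAudio))
                .build();
    }

    public static void main(String[] args) throws Exception {
        // Assumes the container from this step is published on port 1080.
        String baseUrl = "http://localhost:1080";
        if (args.length == 0) {
            // Without an audio file argument, only show what would be sent.
            System.out.println("GET  " + listModelsRequest(baseUrl).uri());
            System.out.println("POST " + recognizeRequest(baseUrl, new byte[0]).uri());
            return;
        }
        HttpClient client = HttpClient.newHttpClient();
        byte[] audio = Files.readAllBytes(Path.of(args[0]));
        HttpResponse<String> response = client.send(
                recognizeRequest(baseUrl, audio), HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode());
        System.out.println(response.body());
    }
}
```

Run it with the path to a WAV file as the only argument to send the audio to the running container; with no argument it only prints the two request URLs.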

2. Run the STT Web Application

The Java Spring Boot application demonstrates use of both the STT batch (REST) API and the streaming (WebSocket) API. It uses the REST API to send audio files to the STT service and receives transcripts of the audio in return. It uses the streaming API to perform live transcription of a user’s voice. Feign is used to wrap the REST calls.
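
To give a feel for the streaming side, here is a minimal sketch of a WebSocket exchange using the JDK's `java.net.http.WebSocket` client. It assumes the container accepts the same message sequence as the public Watson STT WebSocket interface: a JSON `start` message, binary audio, then a JSON `stop` message, all on the `/speech-to-text/api/v1/recognize` path. The class name `SttStreamingSketch` and its helpers are hypothetical, and this is not the code the sample application uses. A real live-transcription client would send small audio chunks as they are captured instead of one large frame.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.WebSocket;
import java.nio.ByteBuffer;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.CompletionStage;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical helper class; not the code used by the sample application.
class SttStreamingSketch {
    // WebSocket URI of the recognize endpoint (path assumed to match the REST path).
    static URI recognizeUri(String wsBaseUrl) {
        return URI.create(wsBaseUrl + "/speech-to-text/api/v1/recognize");
    }

    // Text message that opens a recognition request for the given audio format.
    static String startMessage(String contentType) {
        return "{\"action\":\"start\",\"content-type\":\"" + contentType + "\"}";
    }

    // Text message that tells the service no more audio will follow.
    static String stopMessage() {
        return "{\"action\":\"stop\"}";
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            // Without an audio file argument, only show the control messages.
            System.out.println(startMessage("audio/wav"));
            System.out.println(stopMessage());
            return;
        }
        byte[] audio = Files.readAllBytes(Path.of(args[0]));
        CountDownLatch closed = new CountDownLatch(1);
        WebSocket.Listener listener = new WebSocket.Listener() {
            @Override
            public CompletionStage<?> onText(WebSocket ws, CharSequence data, boolean last) {
                System.out.print(data);          // a message may arrive in several parts
                if (last) System.out.println();
                ws.request(1);                   // ask for the next message
                return null;
            }
            @Override
            public CompletionStage<?> onClose(WebSocket ws, int status, String reason) {
                closed.countDown();
                return null;
            }
            @Override
            public void onError(WebSocket ws, Throwable error) {
                error.printStackTrace();
                closed.countDown();
            }
        };
        WebSocket ws = HttpClient.newHttpClient().newWebSocketBuilder()
                .buildAsync(recognizeUri("ws://localhost:1080"), listener).join();
        ws.sendText(startMessage("audio/wav"), true).join();
        ws.sendBinary(ByteBuffer.wrap(audio), true).join();
        ws.sendText(stopMessage(), true).join();
        // Give the service up to 30 seconds to return results, then close.
        closed.await(30, TimeUnit.SECONDS);
        if (!ws.isOutputClosed()) {
            ws.sendClose(WebSocket.NORMAL_CLOSURE, "done").join();
        }
    }
}
```

The fixed 30-second wait keeps the sketch short; a more careful client would parse the incoming JSON and close once the final result has arrived.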

To start, clone the Git repository (skip this if you already cloned it in Step 1.2).

git clone https://github.com/ibm-build-lab/Watson-Speech.git

This repository contains the code used in this tutorial. Follow the steps below to run the application on your local machine.

2.1: Build

Go to the directory that contains sample code for this tutorial.

cd Watson-Speech/STTApplication

Run the build command.

./mvnw clean package

The application is packaged as the JAR file target/STTApplication-0.0.1-SNAPSHOT.jar.

2.2: Run

Set the following environment variables. The Java application uses them to reach the STT service. We assume that your STT service is running on port 1080.

export STT_SERVICE_ENDPOINT=localhost:1080

To access the WebSocket streaming service, also set:

export STT_WSS_SERVICE_ENDPOINT=ws://localhost:1080
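
For illustration, the sketch below shows one way a Java program could read these two variables and turn them into base URLs. The class `SttEndpointsSketch`, the fallback values, and the rule that adds `http://` to a bare `host:port` value are all assumptions made for this example; the sample application may handle the variables differently.

```java
import java.util.Map;

// Hypothetical helper; the sample application may read these variables differently.
class SttEndpointsSketch {
    // Returns the variable's value, or the fallback when it is unset or blank.
    static String valueOrDefault(Map<String, String> env, String name, String fallback) {
        String value = env.get(name);
        return (value == null || value.isBlank()) ? fallback : value.trim();
    }

    // Adds "http://" when the REST endpoint is given as a bare host:port (assumption).
    static String restBaseUrl(Map<String, String> env) {
        String endpoint = valueOrDefault(env, "STT_SERVICE_ENDPOINT", "localhost:1080");
        return endpoint.contains("://") ? endpoint : "http://" + endpoint;
    }

    // The WebSocket endpoint already carries its ws:// scheme in this tutorial.
    static String wsBaseUrl(Map<String, String> env) {
        return valueOrDefault(env, "STT_WSS_SERVICE_ENDPOINT", "ws://localhost:1080");
    }

    public static void main(String[] args) {
        Map<String, String> env = System.getenv();
        System.out.println("REST base URL:      " + restBaseUrl(env));
        System.out.println("WebSocket base URL: " + wsBaseUrl(env));
    }
}
```

Passing the environment in as a `Map` keeps the helpers easy to check without changing the real process environment.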

Run the application.

java -jar target/STTApplication-0.0.1-SNAPSHOT.jar

The application will listen on port 8080.

2.3: Test

Access the application in your browser with the following URL.

http://localhost:8080

Summary

In this tutorial, we built a single-container STT service with two pretrained models to produce speech-to-text transcripts. We also learned how to build a Java Speech to Text (STT) web application that consumes the service exposed by the Docker container.
