How To Control Background Noise with Watson Speech-To-Text

Marco Noel
IBM Watson Speech Services
Mar 17, 2020

Background Noise

Think of all the times and places your customers use their smartphone to get information:

  • A customer is having lunch at a restaurant and wishes to file an insurance claim, inquire about their current claim status, and take immediate action if any information is missing.
  • A doctor is in their office, close to a busy waiting room with people talking and babies crying. They want to know whether a patient has specific coverage for a procedure and must call the health insurance provider to validate eligibility and benefits.
  • An employee in a warehouse calls to check the status of shipments, process orders, report a shortage of supplies, or flag issues with machinery.

You’ve looked at building a self-service IVR with Watson Assistant for Voice Interaction (WAVI), but you have concerns about how it will handle background noise and crosstalk (background speech) when your users call into your solution.

Speech Activity Detection


The Speech Activity Detection (SAD) engine in Watson Speech-To-Text processes the audio stream and determines which parts should be transcribed through speech recognition. As you can imagine, speech recognition can be affected by many external environmental factors: unexpected words get transcribed when no one is talking, or key spoken words get skipped, which can derail the experience and cause frustration.

Watson Speech-To-Text has introduced a couple of great parameters to mitigate these challenges:

  • speech_detector_sensitivity: Adjusts the sensitivity of the Speech Activity Detection (SAD) engine (speech vs. non-speech) to suppress non-speech audio insertions such as music, coughing, and keyboard typing.
Usage: curl -X POST -u "apikey:{apikey}" --header "Content-Type: audio/flac" --data-binary @{path}audio-file1.flac "{url}/v1/recognize?speech_detector_sensitivity=0.6"
  • background_audio_suppression: Suppresses background conversation (low-volume speech) that introduces unexpected transcriptions when no one is talking. This is very useful when dealing with constant background chatter in a call center or a waiting room.
Usage: curl -X POST -u "apikey:{apikey}" --header "Content-Type: audio/flac" --data-binary @{path}audio-file1.flac "{url}/v1/recognize?background_audio_suppression=0.5"

These parameters are independent and can be used individually or together.
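
Since the two are independent, a single request can also set both at once. A minimal sketch, reusing the same placeholder {apikey}, {url}, and audio file as the examples above:

Usage: curl -X POST -u "apikey:{apikey}" --header "Content-Type: audio/flac" --data-binary @{path}audio-file1.flac "{url}/v1/recognize?speech_detector_sensitivity=0.6&background_audio_suppression=0.5"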

Best Practices

Before you get excited and start trying them, stop and think for a minute. Just because you are dealing with noise does not mean you have to change these settings. By default, the SAD engine is already configured to provide optimal performance for common use cases.

Follow these simple guidelines:

- Collect representative audio from your target users in their work environment

- Categorize the audio files by the noise level and crosstalk

- Listen to the audio and conduct a human assessment (e.g., can you clearly hear what is being said?)

- Build a baseline with the default values and review the results; a scripted sketch of this step follows below (Note: Don’t forget about doing optimal language model (LM) and acoustic model (AM) customizations if not done already! Start here!)
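
To make the baseline step concrete, here is a minimal bash sketch that runs every collected test file through /v1/recognize with the default settings and saves one JSON result per file for later comparison. The test-sets/ and baseline/ directory names are assumptions for illustration, not part of the service:

# Baseline run: default SAD settings, one JSON result per test file
# test-sets/ holds the audio collected and categorized above (hypothetical layout)
mkdir -p baseline
for f in test-sets/*/*.flac; do
  curl -s -X POST -u "apikey:{apikey}" --header "Content-Type: audio/flac" --data-binary @"$f" "{url}/v1/recognize" > "baseline/$(basename "$f" .flac).json"
done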

Background Audio Suppression


Let’s take the example of a call agent working in an open call center with hundreds of other agents. They are constantly exposed to colleagues talking to their own clients and typing.

In this case, you could start by setting background_audio_suppression to a volume threshold of 0.5, the halfway point.

Default value is 0.0, which provides no suppression (background audio suppression is disabled). A value of 1.0 suppresses all audio (no speech is transcribed). Higher values can gradually reduce the audio that is passed for speech recognition, which can cause valid content to be lost from the transcript.

Conduct multiple iterations of experiments, compare the results against your baseline, then adjust the value in 0.1 increments until you get acceptable results.
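
Scripted, those iterations might look like the following sketch, which sweeps the suppression level in 0.1 increments over a representative file and saves one transcript per setting (the audio file and output names are assumptions):

# Sweep background_audio_suppression in 0.1 steps; compare each result to the baseline
for s in 0.5 0.6 0.7 0.8 0.9; do
  curl -s -X POST -u "apikey:{apikey}" --header "Content-Type: audio/flac" --data-binary @{path}audio-file1.flac "{url}/v1/recognize?background_audio_suppression=$s" > "suppression-$s.json"
done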

Speech Detector Sensitivity


Let’s consider another work environment where your target users are employees in a warehouse with a consistently higher noise level than normal and no background conversation. Using the test sets above, structure your experiments around different noise levels.

Start by setting the speech_detector_sensitivity parameter to 0.4, conduct experiments across the multiple audio test sets, and document the observations. Make adjustments in 0.1 increments, rinse and repeat.
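
A sketch of one such experiment, running the 0.4 setting across one of the categorized test sets (the test-sets/high-noise/ directory is an assumed name from the categorization step above):

# Fixed sensitivity of 0.4 across a noise-level test set; one JSON result per file
for f in test-sets/high-noise/*.flac; do
  curl -s -X POST -u "apikey:{apikey}" --header "Content-Type: audio/flac" --data-binary @"$f" "{url}/v1/recognize?speech_detector_sensitivity=0.4" > "sensitivity-0.4-$(basename "$f" .flac).json"
done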

Default value is 0.5, which provides a reasonable compromise for the level of sensitivity. A value of 0.0 suppresses all audio (no speech is transcribed). A value of 1.0 suppresses no audio (speech detection sensitivity is disabled). A low setting has less latency because less audio is passed, but it might discard chunks of audio that contain actual speech, losing viable content from the transcript. A high setting has higher latency, but it might pass chunks of audio that contain non-speech events, adding spurious content to the transcript.

Start experimenting with each parameter individually at first, then go a step further and combine the two in the same experiments to see if you can get better results.
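
For the combined experiments, a small grid sweep is one option. A sketch over illustrative ranges (narrow them to the individual values that worked best for you):

# Grid over both parameters on a representative file; ranges are illustrative
for sens in 0.3 0.4 0.5; do
  for supp in 0.4 0.5 0.6; do
    curl -s -X POST -u "apikey:{apikey}" --header "Content-Type: audio/flac" --data-binary @{path}audio-file1.flac "{url}/v1/recognize?speech_detector_sensitivity=$sens&background_audio_suppression=$supp" > "sens-$sens-supp-$supp.json"
  done
done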

To learn more about STT, you can also go through the STT Getting Started video HERE.

Marco Noel is an Offering Manager for IBM Watson Speech Services, focused on educating customers to successfully implement Watson Speech-To-Text, Text-To-Speech and Watson Assistant for Voice Interaction (WAVI). He always loves to see and learn how creative customers are with these technologies.
