How We Do Audio Segmentation-based Conversational Silence Detection on Contact Center Calls

Krishna Gogineni
The Observe.AI Tech Blog
4 min read · Aug 30, 2021
Photo by Aron Visuals on Unsplash

On an average contact center call, more than 35% of the call is silence, where neither the agent nor the customer is speaking. We refer to these periods as conversational silences.

Conversational silences consist mostly of hold music, automated recorded messages, or simple pauses when neither the agent nor the customer is actively speaking. Most of these conversational silences negatively impact important contact center KPIs (dead air hurts customer satisfaction, long hold times inflate average handle time, etc.).

In this article, I’ll demonstrate the underlying technology we use at Observe.AI to automatically identify conversational silence incidents.

Before we dive in, let’s define a few terms you’ll see throughout the article:

  • Conversational silence → A stretch of a call where neither the agent nor the customer is speaking. The two most important types of conversational silence in a contact center are dead air and hold time violations.
  • Dead air → A conversational silence that occurs without the agent giving the customer any prompt to expect a silence. A common limit for dead air is 10 seconds.
  • Hold time violation (HTV) → A conversational silence in which the agent puts the customer on hold following the proper protocol, but the duration of the hold exceeds a specified pre-set limit. A common limit for HTV is 120 seconds.

Next, let’s take a closer look at the pipeline that takes a call’s audio as input and outputs dead air and HTV tags. The pipeline has multiple components, and each one outputs a profile that serves as input to the next component in the pipeline.

Observe.AI’s system architecture for identifying conversational silences in calls
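
To make the data flow concrete, here is a minimal sketch of how such a pipeline could be wired together. Every name and signature below (run_pipeline, segment_audio, and so on) is an illustrative assumption rather than our actual interface; the individual stages are described in the sections that follow.

```python
from typing import Callable

# Assumed data shapes: a Segment is (start_sec, end_sec, label),
# a Word is (start_sec, end_sec, token). These formats are illustrative.
Segment = tuple[float, float, str]
Word = tuple[float, float, str]

def run_pipeline(
    audio_path: str,
    segment_audio: Callable[[str], list[Segment]],   # audio segmenter -> audio profile
    transcribe: Callable[[str], list[Word]],         # ASR -> word-level transcript
    merge_non_speech: Callable[[list[Segment]], list[tuple[float, float]]],
    context_around: Callable[[list[Word], float, float], str],
    prompt_given: Callable[[str], bool],             # text classifier
    categorize: Callable[[float, bool], str],        # -> dead_air / hold_time_violation / ignored
) -> list[tuple[float, float, str]]:
    """Profile the audio, find conversational silences, and tag each one."""
    profile = segment_audio(audio_path)
    words = transcribe(audio_path)                   # in production this runs in parallel
    tags = []
    for start, end in merge_non_speech(profile):
        prompted = prompt_given(context_around(words, start, end))
        tags.append((start, end, categorize(end - start, prompted)))
    return tags
```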

Audio file: Contact centers record their calls and share the recordings with Observe.AI. After redacting PCI and PII data, the audio recording flows through the dead-air/HTV tagging pipeline.

Raw audio spectrogram

Audio segmenter: This is the core technology piece in the pipeline. It takes the raw audio file as input and outputs the audio profile for the call.

It has two sub-components. The VggishFeatureExtractor first extracts the spectrogram of the audio file and then passes it through a fine-tuned VGGish network to obtain the corresponding audio embeddings. These embeddings are then fed into a bi-directional RNN classifier, trained on custom data, that produces the audio profile for the input call.
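
As a rough illustration, here is what a simplified segmenter could look like in PyTorch. The small CNN below merely stands in for the fine-tuned VGGish network so the sketch stays self-contained; the label set, layer sizes, and feature settings are assumptions, not our production configuration.

```python
import torch
import torch.nn as nn
import torchaudio

N_CLASSES = 4  # assumed label set: e.g. speech, hold music, recorded message, silence

class AudioSegmenter(nn.Module):
    def __init__(self, n_mels: int = 64, embed_dim: int = 128, hidden: int = 64):
        super().__init__()
        # Log-mel spectrogram, roughly in the spirit of VGGish's input features.
        self.melspec = torchaudio.transforms.MelSpectrogram(
            sample_rate=16000, n_fft=400, hop_length=160, n_mels=n_mels
        )
        # Stand-in for the fine-tuned VGGish embedding network.
        self.embed = nn.Sequential(
            nn.Conv1d(n_mels, embed_dim, kernel_size=5, padding=2),
            nn.ReLU(),
        )
        # Bi-directional RNN classifier over the embedding sequence.
        self.rnn = nn.LSTM(embed_dim, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, N_CLASSES)

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, samples) -> per-frame logits: (batch, frames, N_CLASSES)
        mels = self.melspec(waveform).clamp(min=1e-10).log()  # (batch, n_mels, frames)
        emb = self.embed(mels).transpose(1, 2)                # (batch, frames, embed_dim)
        out, _ = self.rnn(emb)                                # (batch, frames, 2 * hidden)
        return self.head(out)
```

With a 16 kHz input and a 160-sample hop, this produces one class prediction per 10 ms frame; taking the argmax per frame and grouping consecutive frames with the same label yields the segment-level audio profile.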

Audio profile for the above call, extracted by the AudioSegmenter component
All the non-speech labels in the audio profile are merged to obtain the conversational-silence label
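
The merge step itself is straightforward. Below is a minimal sketch; the (start, end, label) segment format, the label names, and the gap tolerance are assumptions:

```python
def merge_non_speech(
    segments: list[tuple[float, float, str]],
    speech_label: str = "speech",
    gap_tolerance: float = 0.0,
) -> list[tuple[float, float]]:
    """Collapse contiguous non-speech segments into conversational-silence spans."""
    silences: list[tuple[float, float]] = []
    for start, end, label in sorted(segments):
        if label == speech_label:
            continue  # speech breaks any ongoing silence span
        if silences and start - silences[-1][1] <= gap_tolerance:
            # Adjacent to the previous non-speech span: extend it.
            silences[-1] = (silences[-1][0], max(silences[-1][1], end))
        else:
            silences.append((start, end))
    return silences

# Example: hold music followed by silence merges into one 95-second span.
profile = [(0.0, 40.0, "speech"), (40.0, 100.0, "hold_music"),
           (100.0, 135.0, "silence"), (135.0, 180.0, "speech")]
print(merge_non_speech(profile))  # [(40.0, 135.0)]
```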

ASR: In parallel with the above pipeline steps, we also run automatic speech recognition (ASR) on the call to obtain its transcript. Our ASR model is trained on thousands of hours of annotated call center audio files.

Text classifier: The text context around each conversational silence is then used to detect whether the agent gave the customer a prompt to expect a silence. The required context is taken from the ASR transcript around the identified conversational silence and sent to an SVM classifier for prediction.
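
Here is a hedged sketch of that step with scikit-learn, using TF-IDF features and a linear SVM; the toy training phrases, the word-timestamp format, and the 15-second context window are all illustrative assumptions:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def context_around(words, silence_start, silence_end, window=15.0):
    """Join ASR words whose timestamps fall within `window` seconds of the silence."""
    return " ".join(
        token for start, end, token in words
        if silence_start - window <= end and start <= silence_end + window
    )

# Toy training data; the real classifier is trained on annotated call transcripts.
train_texts = [
    "let me place you on a brief hold while i check that",            # prompt given
    "please bear with me for a moment i will pull up your account",   # prompt given
    "okay so the billing address is",                                 # no prompt
    "thank you for calling how can i help you today",                 # no prompt
]
train_labels = [1, 1, 0, 0]  # 1 = hold prompt given

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
clf.fit(train_texts, train_labels)

# words: (start_sec, end_sec, token) triples from the ASR transcript (assumed format)
words = [(38.0, 38.4, "let"), (38.4, 38.6, "me"), (38.6, 39.0, "place"),
         (39.0, 39.2, "you"), (39.2, 39.5, "on"), (39.5, 39.8, "hold")]
context = context_around(words, silence_start=40.0, silence_end=135.0)
print(clf.predict([context])[0])  # -> 1 on this toy data (prompt detected)
```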

Predictions from the text classifier

Dead air and HTV tags: Finally, each conversational silence instance is categorized as dead air, a hold time violation, or neither, based on the user-configured minimum silence duration and maximum hold time thresholds, combined with the prompt category from the text classifier.
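
A minimal sketch of that decision logic, assuming the default thresholds mentioned earlier (10 seconds for dead air, 120 seconds for HTV); in production both thresholds are configured by the user:

```python
def categorize_silence(
    duration: float,
    prompt_given: bool,
    min_dead_air: float = 10.0,    # minimum silence duration threshold
    max_hold_time: float = 120.0,  # maximum allowed hold time threshold
) -> str:
    if not prompt_given:
        # No prompt from the agent: dead air once the silence is long enough.
        return "dead_air" if duration >= min_dead_air else "ignored"
    # Prompt given: a legitimate hold, violating only past the allowed limit.
    return "hold_time_violation" if duration > max_hold_time else "ignored"

print(categorize_silence(95.0, prompt_given=False))   # dead_air
print(categorize_silence(95.0, prompt_given=True))    # ignored (within the hold limit)
print(categorize_silence(180.0, prompt_given=True))   # hold_time_violation
```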
