Using GCP Transcription Service and custom NLP Models for Analyzing Customer-facing Speech Conversations

Saqib Awan
Published Feb 4, 2021 · 5 min read



One of the key concerns in a customer sales or service Call-Center is evaluating how well the call-agents are performing their job. Call-Center administrators analyze each call by listening to the recorded audio or reading the text transcript, and then score the call on several criteria against a standard they have set for the agent's activities during the call. This is a tedious and time-consuming exercise, as administrators may have to go through hundreds or even thousands of calls and score each of them. The results of this analysis are important for maintaining a high standard of customer service in the Call-Center and for training the call-agents and improving their performance.

Whenever human effort in a process needs to scale like this, Machine Learning should be considered as a candidate for automating it. It turns out that this problem can be solved efficiently and effectively with ML, using a combination of Speech-to-Text and Natural Language Processing techniques.

The Call Scoring Problem

To understand the problem, let's take a scenario where an agent makes an outbound call to a customer to sign them up for a product subscription. The agent is expected to collect several pieces of data from the customer by asking specific questions, offer appropriate rebuttals if the conversation departs from the script prescribed for that agent, be polite and courteous, introduce themselves properly, say a proper goodbye, and so on.

A human administrator has a list of criteria defining the standard for a call. Each criterion is phrased as a question about the call and is scored either Yes/No or numerically.

Some examples of such questions are as follows:

  • Did the agent inform the customer that the call was being recorded in the intro? (Yes/No)
  • Did the agent confirm the customer's Age and Date of Birth? (Yes/No)
  • Did a call transfer occur during the call? (Yes/No)
  • Did the agent provide the product's name and particulars correctly? (Yes/No)

As you can imagine, this can turn into a tedious and time-consuming exercise that takes many man-hours spread across multiple back-end administrators. An ML solution could combine Speech-to-Text transcription with one or more ML models that answer each question, and then use those answers to calculate a total score for the call. Such a solution could be used for statistical analysis as well as for creating training programs for individual agents.
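As a minimal sketch of how per-question answers could be rolled up into a total score: the article does not specify a scoring formula, so the weighted percentage below, the `passed` flag (meaning the answer matches the desired one), and the per-question `weight` are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class QuestionResult:
    question: str
    passed: bool         # True when the answer matches the desired answer (assumption)
    weight: float = 1.0  # hypothetical per-question weight


def total_score(results):
    """Return the weighted share of passed questions as a percentage (0-100)."""
    total_weight = sum(r.weight for r in results)
    if total_weight == 0:
        return 0.0  # no questions scored
    earned = sum(r.weight for r in results if r.passed)
    return 100.0 * earned / total_weight


results = [
    QuestionResult("Did the agent inform the customer that the call was being recorded?", True),
    QuestionResult("Did the agent confirm the customer's age and date of birth?", True),
    QuestionResult("Did the call stay with the same agent (no transfer)?", False),
    QuestionResult("Did the agent provide the product's name and particulars correctly?", True),
]
print(f"Call score: {total_score(results):.1f}%")  # Call score: 75.0%
```

Weights let an administrator make some questions, such as the recording disclosure, count for more than others without changing the rest of the pipeline.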

The Solution

GCP (Google Cloud Platform) provides an online transcription service, Cloud Speech-to-Text, that exposes a REST API for transcribing speech data to text. It is highly effective at transcribing speech in multiple languages and has advanced features such as profanity filtering, speaker diarization (telling speakers apart), and multiple fine-tuned ML models for specific scenarios such as telephony data and video data. This service can be combined with pre-trained or custom NLP models trained on transcribed data to answer scoring questions and prepare call reports.
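To make the features above concrete, here is a sketch of a JSON request body for the Speech-to-Text v1 REST method `speech:longrunningrecognize`. The field names follow the v1 `RecognitionConfig` resource; the specific settings (8 kHz audio, two speakers, the `phone_call` model) and the bucket path are assumptions for a typical telephony recording, not settings confirmed by the article.

```python
import json


def build_recognize_request(gcs_uri, sample_rate_hz=8000, language_code="en-US"):
    """Build a request body for POST https://speech.googleapis.com/v1/speech:longrunningrecognize."""
    return {
        "config": {
            "encoding": "LINEAR16",            # 16-bit PCM, as stored in a .wav file
            "sampleRateHertz": sample_rate_hz,  # assumed telephony sample rate
            "languageCode": language_code,
            "model": "phone_call",             # model tuned for telephony audio
            "useEnhanced": True,
            "profanityFilter": True,           # content filtering
            "enableAutomaticPunctuation": True,
            "diarizationConfig": {              # separate agent and customer
                "enableSpeakerDiarization": True,
                "minSpeakerCount": 2,
                "maxSpeakerCount": 2,
            },
        },
        "audio": {"uri": gcs_uri},
    }


# Hypothetical bucket and object name.
request_body = build_recognize_request("gs://example-bucket/calls/call-001.wav")
print(json.dumps(request_body, indent=2))
```

The body would be sent with an authenticated POST; long recordings go through the long-running method because the synchronous `speech:recognize` method is limited to short audio.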

The following diagram depicts the architecture of the Trillo Solution for the Call Scoring problem.

Solution Components

The solution consists of the following components:

1. Trillo Workbench

This is Trillo's flagship service creation and orchestration engine. It runs on top of GCP Compute services and uses several other GCP services such as Cloud Storage and Cloud SQL. It orchestrates the flow of the whole application and invokes multiple back-end microservices running as part of the Trillo Workbench for GCP.

2. Microservice Front-end

This is the main web service, exposing a REST API for transcription and call scoring. It takes the following inputs in JSON request payloads:

  • Path in GCP Cloud Storage to the speech file of the call to be scored
  • Path in GCP Cloud Storage where the transcribed text file is stored (both JSON and .txt formats are supported)
  • Path in GCP Cloud Storage where the scored output file (in JSON format) should be stored

This service routes the request to the specific back-end service for further processing.
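A rough sketch of what such a front-end might do with an incoming payload is shown below. The field names (`action`, `speech_path`, `transcript_path`, `output_path`), the service names, and the rule that all three paths are always required are hypothetical; the article only says that the payload carries three Cloud Storage paths and that requests are routed to a back-end service.

```python
from dataclasses import dataclass

# Hypothetical field names for the three Cloud Storage paths.
REQUIRED_FIELDS = ("speech_path", "transcript_path", "output_path")

# Hypothetical mapping from a requested action to the back-end service handling it.
ROUTES = {"transcribe": "transcription-service", "score": "call-scoring-service"}


@dataclass(frozen=True)
class CallRequest:
    action: str
    speech_path: str
    transcript_path: str
    output_path: str


def parse_request(payload: dict) -> CallRequest:
    """Validate a JSON payload (already decoded to a dict) and return a typed request."""
    action = payload.get("action")
    if action not in ROUTES:
        raise ValueError(f"unknown action: {action!r}")
    for field in REQUIRED_FIELDS:
        value = payload.get(field)
        if not isinstance(value, str) or not value.startswith("gs://"):
            raise ValueError(f"{field} must be a gs:// path")
    if not payload["transcript_path"].endswith((".json", ".txt")):
        raise ValueError("transcript_path must end in .json or .txt")
    return CallRequest(action, *(payload[f] for f in REQUIRED_FIELDS))


def route(request: CallRequest) -> str:
    """Return the name of the back-end service that should process the request."""
    return ROUTES[request.action]


payload = {
    "action": "score",
    "speech_path": "gs://example-bucket/calls/call-001.wav",
    "transcript_path": "gs://example-bucket/transcripts/call-001.json",
    "output_path": "gs://example-bucket/scores/call-001.json",
}
print(route(parse_request(payload)))  # call-scoring-service
```

Rejecting malformed paths at the front-end keeps the back-end services from failing later on a download or upload.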

3. Transcription Service

This service performs the following steps:

  • Automatically downloads the input speech file
  • Converts it to a PCM-encoded .wav speech file
  • Sends the PCM file to the GCP transcription service
  • Receives the converted text data and creates output files in both .txt and JSON formats
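The last step above can be sketched as follows. The input mirrors the shape of a Speech-to-Text v1 response (`results`, each with ranked `alternatives`); the layout of the JSON output file (`segments` and `text` keys) is a hypothetical choice, since the article does not describe its format.

```python
import json


def transcript_outputs(response: dict):
    """Turn a recognition response into (.txt content, JSON content)."""
    segments = []
    for result in response.get("results", []):
        alternatives = result.get("alternatives", [])
        # The first alternative is the most likely transcription of this segment.
        if alternatives and alternatives[0].get("transcript"):
            segments.append(alternatives[0]["transcript"].strip())
    text = "\n".join(segments)
    as_json = json.dumps({"segments": segments, "text": text}, indent=2)
    return text, as_json


# Hand-written sample in the response shape, not real service output.
sample_response = {
    "results": [
        {"alternatives": [{"transcript": "Hi, this call is being recorded.", "confidence": 0.93}]},
        {"alternatives": [{"transcript": " Can you confirm your date of birth?", "confidence": 0.90}]},
    ]
}
txt, js = transcript_outputs(sample_response)
print(txt)
```

Keeping one segment per line in the .txt file makes it easy to read, while the JSON file preserves the segment boundaries for downstream models.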

4. Call Scoring Service

This service performs the following steps:

  • Downloads the transcription file from GCP Cloud Storage
  • Runs pre-trained or custom trained NLP models on the transcribed text
  • Generates a score file that contains answers and scores for each standardized question for which an NLP model is available to the service
  • Uploads the score file to the given GCP Cloud Storage path
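The scoring steps above can be illustrated with a deliberately simple stand-in. The article uses pre-trained or custom NLP models; the regular expressions below are only a placeholder for such models so that the flow from transcript to score file is visible. The question ids, patterns, and score-file layout are all hypothetical.

```python
import json
import re

# Placeholder "models": one pattern per question for which a model is available.
QUESTION_PATTERNS = {
    "recording_disclosure": re.compile(r"\bcall (is|may be|will be) (being )?recorded\b", re.I),
    "dob_confirmed": re.compile(r"\bdate of birth\b", re.I),
    "call_transferred": re.compile(r"\btransfer(ring)? you\b", re.I),
}


def score_transcript(text: str) -> dict:
    """Answer each supported question and return the content of the score file."""
    answers = {
        question_id: "Yes" if pattern.search(text) else "No"
        for question_id, pattern in QUESTION_PATTERNS.items()
    }
    return {
        "answers": answers,
        "yes_count": sum(1 for a in answers.values() if a == "Yes"),
        "question_count": len(answers),
    }


transcript = (
    "Hi, this is Sam. This call is being recorded. "
    "Can you confirm your date of birth?"
)
score_file = score_transcript(transcript)
print(json.dumps(score_file, indent=2))
```

In a real service each pattern would be replaced by a model's prediction, and the resulting JSON would be uploaded to the output path supplied in the request.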

Common Pitfalls, Issues, and Their Solutions in Call Scoring

Call scoring with NLP models has some issues that need to be carefully considered when building and using such models. They are as follows:

  • Some questions cannot be answered simply by searching for the entities or words spoken by either the agent or the customer within the call. For example, to answer whether an agent recommended the correct product for a specific case, we might need to query a back-end database with other entities collected from the call to find out whether that data actually maps to the product offered by the agent. Therefore, integration with a back-end product catalog or similar database is required during or after the scoring process.
  • Another case is where the nature of the question is qualitative and the answer needs to be inferred indirectly, for example, whether the agent was polite overall during the call. To answer such questions, we may need to label call data with specific classes like highly-polite, satisfactory, unsatisfactory, etc., and then train and run a sentiment-analysis-style model for such questions.
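The first pitfall above, checking the offered product against a back-end catalog, might look like the sketch below. The in-memory `CATALOG`, the product names, the age bands, and the entity keys are all invented for illustration; a real system would query the actual product database with entities extracted from the transcript.

```python
# Hypothetical catalog: (age band, customer interest) -> product that should be offered.
CATALOG = {
    ("senior", "health"): "SilverCare Plan",
    ("adult", "health"): "Standard Health Plan",
    ("adult", "travel"): "Travel Basic",
}


def age_band(age: int) -> str:
    """Map an age to a catalog age band (the cut-off of 65 is an assumption)."""
    return "senior" if age >= 65 else "adult"


def recommended_correct_product(entities: dict, catalog=CATALOG):
    """Return True/False if the offer can be checked, or None if the catalog has no match."""
    key = (age_band(entities["age"]), entities["interest"])
    expected = catalog.get(key)
    if expected is None:
        return None  # cannot decide; leave the question for a human reviewer
    return expected.lower() == entities["offered_product"].lower()


# Entities as they might be extracted from one call transcript.
entities = {"age": 70, "interest": "health", "offered_product": "SilverCare Plan"}
print(recommended_correct_product(entities))  # True
```

Returning `None` for unknown combinations keeps the service from guessing when the catalog cannot confirm or refute the agent's recommendation.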


Using Machine Learning to process Call Center speech data is highly desirable for running a successful Call-Center business. It not only helps keep customers happy but also helps maintain well-trained call agents, whose performance can be evaluated frequently and automatically without spending hundreds of man-hours on manual work. Machine Learning is becoming a central tool for process automation and value creation in many businesses, and customer sales and service Call-Centers have huge potential for its use.

Trillo has created several successful solutions for its customers on GCP in this space and is helping businesses harness the power of ML for their processes with cost-effective solutions that work and deliver continuous value. If you are interested in having us build a solution at a very low cost then contact us at