‘Dear customer, I can feel you’ — from emotion detection to MVP

Rahul Choudhary
6 min read · Aug 11, 2017


Contact center operations are extremely critical: they involve direct interaction with customers at junctures that can produce either a ‘wow’ feeling or a ‘WTF’ feeling, and that thin line often decides whether the customer continues with your service or moves to an alternative.

If the customer has decided to call up and talk to a service representative (referred to as the agent from here on), it must be for reasons that need immediate attention. Contact center KPIs usually center on first-call resolution (FCR), customer satisfaction (C-SAT), average handle time (AHT) and call quality. A conversation that does not meet the customer’s expectations will drag down most of these, if not all. More than anything, the business faces the risk of losing a valuable customer.

Questions the operations team keeps wondering about

Thus, any dissatisfaction should be eliminated promptly, and the first step is to gauge it. Unfortunately, this analysis is usually done on a stratified random sample. By the time the causes of dissatisfaction are identified and any action is taken, the customer will have already decided to switch services. Discounts, guarantees and offers cannot work at that point; the damage is done.

Hence, contact centers started to use speech analytics solutions such as Verint Impact 360, NICE and Avaya to get near-real-time analysis of call recordings. Unfortunately, that too is slow (an analysis cycle of about 8 hours) and per-seat licenses are costly. Running it on 100% of the call volume does not make sense, given that audited call quality scores are usually 90% and above.

So, how can you identify these probable DSAT calls (calls where dissatisfaction is observed in the conversation) in real time and carry out a quick, effective recovery, for example by letting an expert agent talk to the customer and provide a resolution?

With this problem in mind, we started our speech analytics journey. We had two paths: a content-based filter (Path A in the illustration below) and a tone-based filter (Path B). To follow Path A, we must first cross the hurdle of speech-to-text transcription (STT). We found that only powerful APIs such as those from Google, IBM Watson and Dragon can do this with a low word error rate, but that is a costly route. So we decided to pursue Path B first, with the help of ample literature and freely available audio corpora.

As we wanted real-time DSAT identification, it made sense to carry out the analysis on short chunks of the live conversation with a lag of a few seconds. Transcription in batch mode is impractical for this purpose, so we decided to build something extremely light and efficient for the job.

Another factor is that clients usually hesitate to send their customer call recordings to a cloud server (the recordings sometimes contain PII, which must be thoroughly masked before an analyst touches it). Tone carries no PII, so we found it best to build the capability in-house.

Recommended framework

As suggested by the literature, we created the following acoustic features for 10 ms sound chunks (a minimal extraction sketch in Python follows the list):

  1. Pitch measures the shrillness or flatness of a voice: high pitch = shrill sound, low pitch = flat sound. Shrillness describes sounds that have a strident, raucous, screeching or harsh character. The pitch levels and pitch ranges of sad speech are lower and narrower than those of other emotional states, and the mean and standard deviation of pitch for sadness are smaller than for other emotions.
  2. Energy captures the loudness or intensity of a voice: high energy = loud sound, low energy = soft sound. It is a measure of the strength of the ear’s perception of the sound’s intensity.
  3. MFCC (Mel-frequency cepstral coefficients) capture the spectral envelope of the voice; non-robust MFCCs indicate additive noise or an irregular tone of voice. The shape of the vocal tract carries the information that distinguishes one speech sound from another in a given language; this information lies in the envelope of the short-time power spectrum, and MFCCs represent these envelopes.
  4. Formants determine the quality of vowel sounds as deep, full and reverberating. They determine which vowel you hear and, in general, are responsible for the differences in quality among different periodic sounds.
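
Here is a rough sketch of this kind of per-chunk feature extraction in Python with librosa only. Our actual pipeline mixed R packages (wrassp for pitch and formants, tuneR for energy) with librosa, so the 16 kHz rate, 10 ms hop, 65–400 Hz pitch range and 13 MFCCs below are illustrative assumptions rather than our exact configuration, and formants are omitted here.

```python
# Sketch: per-chunk pitch, energy and MFCC features rolled up to
# mean / max / standard deviation, as in the list above.
import numpy as np
import librosa

def acoustic_features(path, sr=16000, hop_ms=10):
    y, sr = librosa.load(path, sr=sr)
    hop = int(sr * hop_ms / 1000)               # ~10 ms analysis hop

    # Pitch (fundamental frequency) track via the YIN estimator
    f0 = librosa.yin(y, fmin=65, fmax=400, sr=sr,
                     frame_length=1024, hop_length=hop)

    # Short-time energy proxy: RMS amplitude per frame
    rms = librosa.feature.rms(y=y, frame_length=1024, hop_length=hop)[0]

    # 13 MFCCs per frame (short-time spectral envelope)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=1024, hop_length=hop)

    # Roll the frame-level tracks up to segment-level statistics
    def stats(x):
        return [float(np.mean(x)), float(np.max(x)), float(np.std(x))]

    feats = stats(f0) + stats(rms)
    for coeff in mfcc:                          # one row per MFCC coefficient
        feats += stats(coeff)
    return np.array(feats)                      # 3 + 3 + 13*3 values here
```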

Our training dataset: 1,440 audio recordings, each labeled with one of eight emotions (neutral, calm, happy, sad, angry, fear, disgust, surprise). As the objective of the exercise was to flag dissatisfaction in conversation, we grouped these into two classes: D-SAT (angry, disgust) and Others.
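
The re-labelling itself is a one-liner; a minimal sketch, assuming the strings match the corpus labels above:

```python
# Minimal sketch of the binary re-labelling described above.
DSAT_EMOTIONS = {"angry", "disgust"}

def to_dsat_label(emotion: str) -> int:
    """1 = D-SAT (angry/disgust), 0 = Others."""
    return int(emotion.strip().lower() in DSAT_EMOTIONS)

print(to_dsat_label("Angry"), to_dsat_label("calm"))   # prints: 1 0
```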

Our emotion detection methodology:

  1. Speaker separation using the R library fastICA, later improved with k-means clustering and an HMM, giving speaker-count and speaker-separation accuracies of roughly 79% and 72% respectively. The algorithm implemented is a variant of Fisher linear semi-discriminant analysis for speaker diarization.
  2. Feature extraction from the separated audio using open-source R and Python libraries (pitch and formants: wrassp; energy: tuneR; MFCC: librosa).
  3. 76 features extracted in total (mean, max and standard deviation of each measure). We iterated with derivatives of these variables and checked variable importance, but found the model to be better off without the derivative features.
  4. Built SVM and XGBoost models on the extracted features and optimized the precision-recall trade-off for the best results on D-SAT (a training sketch follows this list).
  5. The ROC AUC was 0.94 for the SVM and 0.89 for XGBoost.
  6. We also tried HMM and GMM models, but the results were a lot worse; this could be down to our inexperience in fine-tuning them, or to the feature preparation.
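
For reference, a hedged sketch of step 4 with scikit-learn and xgboost: X is the 76-column feature matrix and y the binary labels (1 = D-SAT) from the steps above, while the hyper-parameters and the 0.80 precision target are illustrative placeholders, not our tuned settings.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import precision_recall_curve, roc_auc_score
from xgboost import XGBClassifier

def train_dsat_models(X, y):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42)

    # RBF-kernel SVM with probability outputs (features standardized first)
    svm = make_pipeline(StandardScaler(),
                        SVC(kernel="rbf", C=1.0, probability=True))
    svm.fit(X_tr, y_tr)

    # Gradient-boosted trees
    xgb = XGBClassifier(n_estimators=300, max_depth=4, learning_rate=0.1)
    xgb.fit(X_tr, y_tr)

    for name, model in [("SVM", svm), ("XGB", xgb)]:
        proba = model.predict_proba(X_te)[:, 1]
        print(name, "ROC AUC:", round(roc_auc_score(y_te, proba), 3))

        # Precision-recall trade-off: take the lowest threshold that first
        # reaches the target precision, keeping recall as high as possible.
        precision, recall, thresholds = precision_recall_curve(y_te, proba)
        ok = precision[:-1] >= 0.80
        if ok.any():
            print(name, "threshold for 0.80 precision:",
                  round(float(thresholds[ok][0]), 3))
    return svm, xgb
```
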
Observed variable importance

At this point, we realized that the parameters could be optimized further to reach even higher precision and recall, though that would also need a lot more training data. So we moved on to creating a dashboard prototype that can help contact center managers carry out a quick recovery.

Product Goal: Identify conversation characteristics to improve customer satisfaction in a contact center environment.

Essential Product Features:

  1. Audio waveform representation that separates and highlights the speakers

  2. Overall DSAT conversation meter: the needle moves every 5 seconds, and the chunk scores are then rolled up with additional business logic (see the table below; a roll-up sketch follows this list)

  3. Word cloud for calls with a medium-to-high DSAT score only; speech-to-text via the Watson STT API

Simple rules on two crucial segments of the conversation

  4. Heat map of 5-second units with DSAT scores used for color coding; also useful for any post-facto analysis
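
The rule table itself is not reproduced in this post, so the sketch below only illustrates the mechanics of the roll-up: average the 5-second DSAT scores and give extra weight to the opening and closing segments of the call. The weights and segment lengths are made-up placeholders, not the production rules.

```python
from statistics import mean

def call_dsat_score(chunk_scores, head=6, tail=6):
    """chunk_scores: DSAT probabilities for consecutive 5-second units."""
    if not chunk_scores:
        return 0.0
    overall = mean(chunk_scores)
    opening = mean(chunk_scores[:head])      # roughly the first 30 seconds
    closing = mean(chunk_scores[-tail:])     # roughly the last 30 seconds
    # Placeholder rule: weight the two crucial segments more heavily
    return 0.5 * overall + 0.2 * opening + 0.3 * closing

# Example: a call that starts calm and ends tense moves the needle up
print(round(call_dsat_score([0.1, 0.1, 0.2, 0.4, 0.7, 0.8, 0.9]), 2))
```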

At a high level, it looks somewhat like this:

Speech dashboard — MVP features

Evolution steps: D-SAT levels can be made more pragmatic by applying custom rules to the emotions identified in different segments of the call. Rules over multiple segments will provide richer insights about the conversation. Agent behavior should also be factored in, using other metrics across the call such as % cross-over time and % off-script.

Currently, the team is focusing on speech-to-text conversion through feature generation at the acoustic level and the application of deep learning models to generate text, using the VCTK corpus (192 speakers with different accents, each reading about 400 sentences from newspapers). A 3-layer dilated CNN is being run on this 24 GB dataset, with 1-D convolutions applied at various stride rates. This yields character-by-character predictions for new files, which then have to be followed by spell correction with the fastText library.
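
For illustration only, here is a minimal sketch of a 3-layer dilated 1-D CNN in PyTorch that maps per-frame acoustic features to per-frame character logits. The framework choice, the 29-character vocabulary and the idea of training such a network with a CTC-style loss before the fastText spell correction are assumptions layered on top of the description above, not the exact network the team runs.

```python
import torch
import torch.nn as nn

class DilatedCharCNN(nn.Module):
    def __init__(self, n_feats=13, n_chars=29, channels=128):
        super().__init__()
        layers, in_ch = [], n_feats
        for dilation in (1, 2, 4):               # three dilated conv layers
            layers += [nn.Conv1d(in_ch, channels, kernel_size=3,
                                 dilation=dilation, padding=dilation),
                       nn.ReLU()]
            in_ch = channels
        self.convs = nn.Sequential(*layers)
        self.head = nn.Conv1d(channels, n_chars, kernel_size=1)

    def forward(self, x):                        # x: (batch, n_feats, time)
        return self.head(self.convs(x))          # (batch, n_chars, time)

# Shape check on a dummy clip of 300 feature frames
logits = DilatedCharCNN()(torch.randn(1, 13, 300))
print(logits.shape)                              # torch.Size([1, 29, 300])
```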

In a nutshell, these are the initial steps in solving a customer-centric business problem by using advanced analytics methods and developing a utility that allows managers to be swift in tackling customer dissatisfaction. There is still a lot to be done to make this utility powerful, on both the analytics and the engineering side.

We have been able to roll out an initial version to assess its true value, and the reviews are exciting.

Looking forward to your feedback as well!
