Reflections on working with AI with a human in the loop

During a proof-of-concept engagement at work, I worked on a framework that leveraged human-in-the-loop learning for speech-to-text systems. The task was to transcribe audio from English speakers with a distinctive local accent. At various junctures in the training process, I realised that human intervention was providing the right feedback to the S2T (speech-to-text) engine. It also helped identify key gaps and shortcomings in model performance. I am sharing my reflections in the hope that they will be of interest to businesses and the AI community looking to embrace human-in-the-loop learning in their initiatives.

Business Opportunity

Speech is an essential form of communication that generates a lot of data. As more systems offer speech as a mode of interaction, it becomes critical to be able to analyze human-to-computer interactions. Market trends point to voice being the future of UI, a claim further bolstered by people looking to embrace contactless surfaces amid the ongoing Covid-19 situation.

Interactions between agents and customers in a contact centre remain dark data that is often untapped. The ability to transcribe speech in local dialects/slang should be part of a call centre advanced analytics road map, such as the one proposed in this McKinsey recommendation. To enable this, we need a framework that brings together the best of the current speech transcription landscape and presents it as a coherent platform which businesses can leverage to get a head start on speech-to-text adaptation use cases.

Singapore has a local dialect on which extensive work is underway: Singlish, the local form of English that blends words borrowed from the country’s cultural mix of communities.

An example of what Singlish looks like

Efforts are underway to transcribe emergency calls made to Singapore’s Civil Defence Force (SCDF). AI Singapore has set up an initiative called Speech Lab to channel efforts in this direction. Thanks to the release of the IMDA National Speech Corpus, local AI developers now have the ability to customize AI solutions with locally accented speech data.

IMDA National Speech Corpus

The Infocomm Media Development Authority of Singapore has released a large dataset, which comprises:

• A three-part speech corpus, each part with 1,000 hours of recordings of phonetically balanced scripts from ~1,000 local English speakers.

• Audio recordings with words describing people, daily life, food, locations and brands commonly found in Singapore. These were recorded in quiet rooms using a combination of microphones and mobile phones to add acoustic variety.

• Text files containing the transcripts. Of note are certain Singlish terms such as ‘ar’, ‘lor’, etc.

These kinds of initiatives are a boon for the open AI community in accelerating efforts towards speech adaptation. With such efforts, the local AI community and businesses are poised for major breakthroughs in Singlish transcription in the coming years.

Adding customized audio snippets from locally accented speakers drove up the accuracy of locally accented speech transcription. An overview of the uptick is in the chart below. Without any customisation, the holdout set was transcribed with an accuracy of 73%. As more human-annotated data snippets are added, accuracy improves further, though some kind of plateau is inevitably observed.

On the left is the uplift in terms of accuracy. The right correspondingly shows the Word Error Rate dropping on addition of more audio snippets

The focus of the work I was involved in was not achieving maximal accuracy, but identifying the training direction that can keep the curve on an upward trajectory.

So, how can we monitor if feeding more data to the AI is actually driving accuracy? And how do we know what kind of datasets to add?

Keeping the human in the loop…

A concept fast gaining steam in AI training is Human-in-the-loop learning. An illustration of what human in the loop looks like is as below.

Illustration of Human In Loop Learning

In a nutshell, human-in-the-loop learning is about giving the AI the right calibration at appropriate junctures. An AI model starts learning a task, and its progress can eventually plateau over time. Timely interventions by a human in this loop can give the model the right nudge.

“Transfer learning will be the next driver of ML success.”- Andrew Ng, in his Neural Information Processing Systems (NIPS) 2016 tutorial

The idea of transfer learning was strongly advocated by Andrew Ng in his NIPS 2016 tutorial (Link to talk: https://www.youtube.com/watch?v=F1ka6a13S9I) (Source of image: https://ruder.io/transfer-learning/)

Not everybody has access to volumes of call centre logs and conversation recordings collected from a broad base of local speakers, which are some of the key sources of data for training localized speech transcription AI. In the absence of a significant amount of locally accented data with ground-truth annotations, transfer learning can be a powerful driver in accelerating AI development.

What I learnt when building such systems leveraging transfer learning is that one must also allow extensive room for human in the loop learning.

Some key considerations when building such systems are:

  • The speech-to-text model can be any kind of ASR engine, running on cloud or on premise. The framework can be designed to be agnostic to the ASR technology being used: for example, it can connect with major cloud providers such as Azure/AWS/Google as well as open-source projects such as Mozilla DeepSpeech. A scorecard approach, where the accuracy of each S2T engine is tracked on a leaderboard, can help deploy the best version for a use case (a hypothetical sketch of this scorecard idea appears after the WER discussion below).
  • Allow users to search for ground-truth snippets. In a lot of cases, when a result is available, a quick search of the training records can point to the number of records trained on, what words are in the corpus, and how much vocabulary has been covered. This can seem quite intuitive, but it is often missing from various S2T providers today.
  • Ability to break down Word Error Rates: the industry-standard measure for Automatic Speech Recognition (ASR) systems is the Word Error Rate, defined as
WER = (S + D + I) / N

where S refers to the number of words substituted, D refers to the number of words deleted, I refers to the number of words inserted by the ASR engine, and N is the total number of words in the human-labelled reference transcript.

A simple example is illustrated below, where there is 1 deletion, 1 insertion, and 1 substitution in a total of 5 words in the human-labelled transcript.

Word Error Rate comparison between ground truth and transcript (Source: https://docs.microsoft.com/en-us/azure/cognitive-services/speech-service/how-to-custom-speech-evaluate-data)

So, the WER of this result will be 3/5, which is 0.6. Most ASR engines will return the overall WER numbers, and some might return the split between the insertions, deletions and substitutions.
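To make this concrete, here is a minimal Python sketch (not the framework’s actual implementation, and with hypothetical example sentences) of how WER and its insertion/deletion/substitution split can be computed with a standard Levenshtein alignment:

```python
def wer_breakdown(reference: str, hypothesis: str) -> dict:
    """Compute WER = (S + D + I) / N along with the split of error types."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits to turn the first i reference words
    # into the first j hypothesis words (standard Levenshtein table).
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        dp[i][0] = i
    for j in range(1, len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]
            else:
                dp[i][j] = 1 + min(dp[i - 1][j - 1],  # substitution
                                   dp[i - 1][j],      # deletion
                                   dp[i][j - 1])      # insertion
    # Walk back through the table to count each error type.
    s = d = ins = 0
    i, j = len(ref), len(hyp)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and ref[i - 1] == hyp[j - 1]:
            i, j = i - 1, j - 1                       # words match, no error
        elif i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + 1:
            s, i, j = s + 1, i - 1, j - 1             # substitution
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            d, i = d + 1, i - 1                       # deletion
        else:
            ins, j = ins + 1, j - 1                   # insertion
    n = len(ref)
    return {"wer": (s + d + ins) / n, "S": s, "D": d, "I": ins, "N": n}


# Hypothetical example: 'will' is deleted, 'MRT' is substituted with 'MIT',
# and 'lah' is inserted, giving WER = 3/8.
print(wer_breakdown("the bus will arrive at the MRT station",
                    "the bus arrive at the MIT station lah"))
```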

The minimum components needed to leverage human-in-the-loop learning are a set of ground-truth transcripts and the corresponding transcription results, so that Word Error Rate analysis can be conducted. If the corresponding audio snippets are available, one can also inspect acoustic quality and harvest more training audio in that direction.

However, to fully understand the performance of S2T engines, the framework will need to provide a detailed split between the insertions, substitutions and deletions. Additionally, we can allow human annotators to plug in audio files with labelled transcriptions to augment the data.
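As a rough illustration of the scorecard approach mentioned in the list above, the wer_breakdown function from the earlier sketch can be reused to rank candidate engines on a common holdout utterance; the engine names and outputs below are purely hypothetical:

```python
# Hypothetical scorecard: rank candidate S2T engines by WER on the same
# holdout utterance, reusing wer_breakdown() from the sketch above.
ground_truth = "please alight at the next MRT station"
engine_outputs = {                      # engine names/results are made up
    "cloud_engine_a": "please alight at the next MIT station",
    "cloud_engine_b": "please alight at next MRT station ah",
    "open_source_engine": "please alight at the next MRT station",
}

scorecard = {name: wer_breakdown(ground_truth, out)
             for name, out in engine_outputs.items()}
for name, stats in sorted(scorecard.items(), key=lambda kv: kv[1]["wer"]):
    print(f"{name}: WER={stats['wer']:.2f} "
          f"(S={stats['S']}, D={stats['D']}, I={stats['I']})")
```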

Through this framework, one can broadly monitor the training process at various stages.

  1. Data Ingestion: Through the search explorer, one can monitor what vocabulary the model has been trained on, and what additional data to collect.
  2. Model interpretability: by digging deeper into WER, one can directly understand gaps in model performance. For example, in the Singlish context, Singapore is famous for its MRT system. One training error we kept discovering when using S2T systems based on US English was that ‘MRT’ kept getting substituted with “MIT”. Corrective action can be prescribed in the form of adding more ‘MRT’ audio snippets or adding a speech post-processing layer that takes text context into account (a hypothetical sketch of such a post-processing step follows this list). Making this judgement is the beauty of any AI training process!
  3. Model selection: A good practice can be to begin with transfer learning, perform loops of steps 1–2 above to understand the use case better, and proceed to tune the right parameters.
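As a rough illustration of the post-processing idea in point 2, the sketch below applies a small hand-curated correction map only when surrounding words suggest a transit context; the correction rules and context words are hypothetical, not the actual layer used in the engagement:

```python
import re

# Hypothetical corrections for recurring, locale-specific mis-transcriptions.
CORRECTIONS = {"MIT": "MRT"}
# Context words that make a transit reading likely (illustrative only).
TRANSIT_CONTEXT = {"station", "train", "alight", "interchange", "platform"}

def post_process(transcript: str) -> str:
    """Apply known corrections only when the sentence looks transit-related."""
    words = {w.lower() for w in re.findall(r"[A-Za-z]+", transcript)}
    if not words & TRANSIT_CONTEXT:
        return transcript  # no transit context, leave the transcript as-is
    fixed = transcript
    for wrong, right in CORRECTIONS.items():
        fixed = re.sub(rf"\b{re.escape(wrong)}\b", right, fixed)
    return fixed

print(post_process("take the MIT to the next station"))
# -> "take the MRT to the next station"
```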

While the crux of any human-in-the-loop system is an allowance for human nudges to the model, the key decision is still what such ‘nudges’ should be. These learnings are a small step towards helping one answer that question in the context of S2T systems.
