Maya — Domain Specific Speech Intelligence at NoBroker
Since the beginning of time, speech has been the most common form of human communication, and it is fascinating how effortlessly humans interpret it. At NoBroker, we generate huge amounts of speech data. We have a world-class customer support team providing assistance on call and chat to our customers, and our paid customers get on-phone assistance with their house-hunting and tenant-hunting needs. In this way we generate hundreds of hours of phone recordings per day. As a data-first organization, we strive to understand our data and what our customers are saying. For this, we built our own Domain Specific ASR model powered by Deep Learning. We call it Maya.
Understanding speech with machines has been a long-fought challenge. We humans do it seamlessly, without any effort. When we hear a speech snippet, we hear it along with all the noise in the surroundings and all the possible contexts. We filter out the noise, filter out the contexts, and understand the speech holistically — language, sentiment, emotion, tone, and content — everything in one go.
Maya is our effort to build complete speech understanding: identifying language, tone, and noise levels from the raw audio; converting speech signals to human-readable text; extracting the intent and sentiment of the speech; and generating machine-spoken audio responses. All of this forms part of Maya, and it paves our way towards complete AI-assisted customer support.
In this post, we describe the transcription component of Maya: how we built the capability to convert an audio representation into a text representation. Converting audio to text lets us translate the problem from the audio domain to the NLP domain, and since the latter is more mature, it lets us understand the semantics of speech better.
We did not want Speech Intelligence to be built on general human speech; we wanted it to be in the context of our business. We chose to take our first baby step by defining our problem statement:
“Build a Real Estate-Specific Speech Transcription System for Telephonic Indian Speech”
Speech Curation
As always, our first order of business was data. We needed to collect labeled data for the problem at hand. In the first run, we decided to focus on Indian English only. We have an in-house data curation team, which we set up to accelerate our efforts in AI, and curation dashboards where the team labels various data sets.
We labeled around 200 hours of telephonic speech: 100+ hours were English and the rest were other languages. We also used some publicly available datasets, so by the time we trained our English model, we had more than 200 hours of Indian English.
To build the transcription model, we experimented with a number of methodologies. We always evaluate community-validated systems before we invent our own; this way we don’t reinvent the wheel and can ship value faster. We started off with conventional HMM systems and tried a few iterations with Kaldi in the beginning. Due to its unsatisfactory performance on noisy data, we moved on to end-to-end Deep Learning-based ASR systems, which were dominating the speech recognition leaderboards.
We like the basic idea of Deep Learning: stack up a series of neurons with a specific set of characteristics, define a loss function, feed the input data forward, backpropagate the errors, and continuously improve the parameters of the stack until the network learns to do an abstract task that is otherwise unsolvable in the conventional programming paradigm.
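As a rough illustration of that loop (this is not Maya’s training code; the model, data, and hyperparameters below are placeholders), a minimal PyTorch sketch looks like this:

```python
# A generic training loop: forward pass, loss, backpropagation, parameter update.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10))
criterion = nn.CrossEntropyLoss()                          # the loss function we define
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Dummy batches of (features, labels); in practice this is a real DataLoader.
dataloader = [(torch.randn(32, 128), torch.randint(0, 10, (32,))) for _ in range(10)]

for features, targets in dataloader:
    predictions = model(features)                          # feed the input data forward
    loss = criterion(predictions, targets)
    optimizer.zero_grad()
    loss.backward()                                        # backpropagate the errors
    optimizer.step()                                       # improve the parameters
```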
From a series of experiments with ESPnet, Deep Speech 2, Mozilla’s DeepSpeech, and Listen, Attend and Spell (LAS), we had a number of learnings.
Among the experiments we tried, DeepSpeech and LAS gave us the most promising results. Both had their pros and cons.
Since our data was noisy, we went ahead with Mozilla’s DeepSpeech. We then added a set of modifications on top of it to achieve some very impressive results.
Maya
In Maya, a speech snippet goes through three steps: feature extraction, acoustic modeling, and language-model decoding. A raw waveform signal is first converted to a Mel spectrogram, a widely accepted feature representation of audio. We pass that to our acoustic model, the neural network that converts Mel spectrograms to character sequences.
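To make the first step concrete, here is a minimal sketch of the feature extraction using the open-source librosa library; the file name, 8 kHz sample rate, and window parameters are assumptions for illustration, not our production settings:

```python
import librosa

# Telephonic audio is typically narrowband, so an 8 kHz sample rate is assumed here.
waveform, sr = librosa.load("call_recording.wav", sr=8000)

mel = librosa.feature.melspectrogram(
    y=waveform, sr=sr,
    n_fft=400,        # ~50 ms analysis window at 8 kHz
    hop_length=160,   # ~20 ms hop between successive frames
    n_mels=80,        # number of Mel bands per frame
)
log_mel = librosa.power_to_db(mel)   # log-compressed Mel spectrogram

# log_mel has shape (n_mels, n_frames): one 80-dimensional feature vector
# per time frame, which is what the acoustic model consumes.
```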
The model consists of 6 layers: the first 3 are fully connected layers, followed by a recurrent neural network layer and another fully connected layer. The final layer is a fully connected output layer whose number of neurons equals the number of letters in the language’s alphabet + 1. For example, in English it is 26 + 1, the extra 1 denoting space.
DeepSpeech uses CTC (Connectionist Temporal Classification) loss to train the network.
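A rough PyTorch sketch of such a DeepSpeech-style acoustic model, trained with CTC loss, is shown below. The layer widths, the choice of a GRU for the recurrent layer, and the output size of 28 (26 letters + space + the CTC blank) are illustrative assumptions, not the exact production network:

```python
import torch
import torch.nn as nn

N_MELS, HIDDEN = 80, 1024
N_CHARS = 28                                    # 26 letters + space + CTC blank (assumed)

class AcousticModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Sequential(                # 3 fully connected layers
            nn.Linear(N_MELS, HIDDEN), nn.ReLU(),
            nn.Linear(HIDDEN, HIDDEN), nn.ReLU(),
            nn.Linear(HIDDEN, HIDDEN), nn.ReLU(),
        )
        self.rnn = nn.GRU(HIDDEN, HIDDEN, batch_first=True)  # recurrent layer
        self.fc2 = nn.Linear(HIDDEN, HIDDEN)                  # fully connected layer
        self.out = nn.Linear(HIDDEN, N_CHARS)                 # one neuron per character

    def forward(self, x):                       # x: (batch, time, n_mels)
        x = self.fc(x)
        x, _ = self.rnn(x)
        x = torch.relu(self.fc2(x))
        return self.out(x).log_softmax(dim=-1)  # per-frame character log-probabilities

# CTC loss aligns per-frame predictions with the target character sequence
# without requiring frame-level alignments. nn.CTCLoss expects (time, batch, chars).
ctc_loss = nn.CTCLoss(blank=N_CHARS - 1)

log_probs = AcousticModel()(torch.randn(4, 200, N_MELS))   # (batch, time, chars)
targets = torch.randint(0, N_CHARS - 1, (4, 30))           # dummy character indices
loss = ctc_loss(log_probs.transpose(0, 1), targets,
                torch.full((4,), 200, dtype=torch.long),
                torch.full((4,), 30, dtype=torch.long))
```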
We trained our model on an Nvidia P100 GPU for close to 48 hours.
Language Model
To improve our model further, we introduced a language model built on a real estate corpus. This language model adds domain context-awareness to the output of the acoustic model.
We used a popular and well-tested language modeling toolkit called KenLM. A beam search is performed over this language model using the output of the CTC layer to get the most probable words in the utterance. KenLM implements two data structures, Probing and Trie, for efficient language model queries, reducing both time and memory costs.
We used the Trie data structure for KenLM since we wanted to keep our ASR system lightweight. Using the KenLM library, we built a 6-gram language model. KenLM uses Kneser–Ney smoothing, a method for estimating the probability distribution of n-grams based on their histories. This helped us further reduce the Word Error Rate by more than 2%.
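To illustrate the language model’s role (the actual beam-search integration happens inside the decoder), here is how a KenLM model can be queried from Python to score competing word sequences; the model path and the example sentences below are made up for this sketch:

```python
import kenlm

# A 6-gram trie model built offline with KenLM's lmplz and build_binary tools;
# the file name here is a placeholder.
lm = kenlm.Model("realestate_6gram.klm")

candidates = [
    "i am looking for a two bhk flat near the metro station",
    "i am looking four a too bhk flat near the metro station",
]

# score() returns a log10 probability: the higher the score, the more fluent
# the word sequence is under the real-estate corpus. During beam search,
# hypotheses are kept or pruned using a mix of the acoustic score and this
# language-model score.
for sentence in candidates:
    print(f"{lm.score(sentence, bos=True, eos=True):8.2f}  {sentence}")
```

A domain-tuned language model like this is what lets the decoder prefer “two bhk flat” over acoustically similar but meaningless alternatives.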
Maya is in production, with the capability to transcribe ~160 hours of audio per day in near real-time. The platform that powers this large-scale inference deserves a detailed discussion of its own; we will cover it in an upcoming article.
Maya is just our first baby step towards understanding speech. The way forward is challenging, with far-reaching applications including conversational AI and virtual voice assistants.
In upcoming articles, we will share how we are solving other problems in audio, like Language Identification and Speech Synthesis, so do follow our blog!
If you’re the kind of person who is excited to wake up every day and change the world around you, come join us! We have plenty of adventures for your intellectual pursuits.