Speech Analytics Part 3: Creating a SpeechToText Classifier

Priya Sarkar · Published in Analytics Vidhya · 3 min read · Aug 17, 2020

Building a sound model is very similar to building a model for tabular data, NLP, or Computer Vision. The most important part is understanding the basics of a sound wave and how we pre-process it before feeding it to a model.

You can check out the previous Part 1 and Part 2 of this series to learn how we work with a sound wave.

We will be using the dataset from this competition to create our SpeechToText Classifier. The dataset consists of several recordings of 12 classes: "yes", "no", "up", "down", "left", "right", "on", "off", "stop", "go", "silence", and unknown sounds. Since we are building a classifier, the model will only be able to predict one of these 12 classes.

You can find the entire Notebook Here

Step 1: Browsing Sub Folders to read our Sound files
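The code for this step lives in the notebook; a minimal sketch, assuming the common layout where each sub-folder is named after its label and contains that word's .wav files (the function name `list_wav_files` is my own), could look like:

```python
import os

def list_wav_files(data_dir):
    """Walk data_dir and collect (path, label) pairs, taking each
    sub-folder's name as the label of the .wav files inside it."""
    samples = []
    for root, _, files in os.walk(data_dir):
        label = os.path.basename(root)
        for fname in files:
            if fname.endswith(".wav"):
                samples.append((os.path.join(root, fname), label))
    return samples
```

Each returned pair can then be loaded with any audio library and matched to its label for training.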

Step 2: Chopping and Padding Sound Files

One key requirement for a classifier model is that every input has to be the same length. Thus we chop off extra audio and pad silence at the end of recordings whose duration is not equal to 16 sec.
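The chop-and-pad logic can be sketched in a few lines of NumPy, working on the raw sample array (the target length in samples is whatever fixed duration the dataset uses):

```python
import numpy as np

def chop_or_pad(signal, target_len):
    """Force a 1-D sample array to target_len samples:
    truncate if too long, pad zeros (silence) at the end if too short."""
    if len(signal) >= target_len:
        return signal[:target_len]
    return np.pad(signal, (0, target_len - len(signal)))
```

After this step every recording has an identical shape, which is what the classifier's fixed input layer expects.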

Step 3: Feature Extraction from sound wave

To learn about this in depth, please refer to my previous blog. For this problem we will extract the MFCC features of the sound and use them as input to our classifier model.

Step 4: Model Architecture

Given that every word consists of phonemes, the first step the model needs to perform is extracting the necessary features/phonemes from the entire word. Thus we use a CNN to capture these features. In the next step the model needs to look at all the features/phonemes and classify the word into one of the categories. The sequence plays an important role here, so we add an LSTM layer as well.

The final architecture looks like:
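The architecture appears as an image in the original post; a minimal Keras sketch of the CNN-into-LSTM idea described above (the layer sizes, frame count, and MFCC count here are illustrative assumptions, not the notebook's exact values) could be:

```python
from tensorflow.keras import layers, models

def build_model(n_frames=98, n_mfcc=40, n_classes=12):
    """CNN layers extract local phoneme-like features from the MFCC
    sequence; the LSTM then reads them in order before classifying."""
    inp = layers.Input(shape=(n_frames, n_mfcc))
    x = layers.Conv1D(64, 3, activation="relu", padding="same")(inp)
    x = layers.MaxPooling1D(2)(x)
    x = layers.Conv1D(128, 3, activation="relu", padding="same")(x)
    x = layers.MaxPooling1D(2)(x)
    x = layers.LSTM(64)(x)
    out = layers.Dense(n_classes, activation="softmax")(x)
    return models.Model(inp, out)
```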

Step 5: Fit the Model
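The training call itself is standard Keras; a runnable sketch using a tiny stand-in model and random data (in the notebook, `X_train` would hold the MFCC arrays and `y_train` the one-hot word labels) might be:

```python
import numpy as np
from tensorflow.keras import layers, models

# Tiny stand-in model just to demonstrate the compile/fit calls.
model = models.Sequential([
    layers.Input(shape=(98, 40)),
    layers.LSTM(32),
    layers.Dense(12, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])

# Random placeholders for the real MFCC features and one-hot labels.
X_train = np.random.rand(8, 98, 40).astype("float32")
y_train = np.eye(12)[np.random.randint(0, 12, 8)]
history = model.fit(X_train, y_train, epochs=1, batch_size=4, verbose=0)
```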

Step 6: Record a word via the Microphone

Let’s learn how to take input from the microphone

Record input via the microphone
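One common way to capture a clip in Python is the `sounddevice` library (which needs PortAudio installed); this sketch is an assumption about the approach, not the notebook's exact code, and the sample rate must match the training data:

```python
def record_clip(duration=1.0, sr=16000):
    """Record a mono clip of `duration` seconds from the default
    microphone and return it as a flat float32 sample array."""
    import sounddevice as sd  # imported lazily; requires PortAudio
    frames = int(duration * sr)
    audio = sd.rec(frames, samplerate=sr, channels=1, dtype="float32")
    sd.wait()            # block until the recording finishes
    return audio.ravel()  # flatten (frames, 1) -> (frames,)
```

The returned array can then go through the same chop/pad and MFCC steps as the training data.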

Step 7: Convert the recording to Text

We will use the model created above to predict the word in our microphone recording.
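Mapping the model's softmax output back to a word is a simple argmax over the 12 labels; a sketch (the label ordering here is an assumption and must match however the training labels were encoded):

```python
import numpy as np

LABELS = ["yes", "no", "up", "down", "left", "right",
          "on", "off", "stop", "go", "silence", "unknown"]

def predict_word(model, mfcc):
    """Run one (n_frames, n_mfcc) feature array through the model
    and return the highest-probability word label."""
    probs = model.predict(mfcc[np.newaxis, ...], verbose=0)
    return LABELS[int(np.argmax(probs))]
```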

You can find the entire Notebook Here

This was your first step in building a SpeechToText engine. In reality the architecture is quite complex and requires high computation power, given the number of words it needs to learn (our final model should be able to predict any word, not just 12 words).

Some famous architectures used are:

  1. Bidirectional LSTMs + CTC
  2. Attention-based Seq2Seq models

Do wait for Part 4 of this series to learn more about such architectures.



Data Scientist JP Morgan, graduate from IIT Bombay. Love creating new products and trying new technology. https://www.linkedin.com/in/priya-sarkar-60248171