Speech Analytics Part 3: Creating a SpeechToText Classifier
Building a sound model is very similar to building a tabular-data, NLP, or Computer Vision model. The most important part is understanding the basics of a sound wave and how we pre-process it before feeding it to a model.
You can check out Part 1 and Part 2 of this series to learn how we work with a sound wave.
We would be using the dataset from this competition to create our SpeechToText classifier. The dataset consists of several occurrences of the words "yes", "no", "up", "down", "left", "right", "on", "off", "stop", and "go", plus "silence" and unknown sounds, giving 12 classes in total. Since we are building a classifier, the model will only be able to predict one of these 12 classes.
You can find the entire Notebook Here
Step 1: Browsing Sub Folders to read our Sound files
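A minimal sketch of this step, assuming the dataset has been extracted so that each label lives in its own sub-folder of wav files (the directory name `data_dir` and the helper `list_wav_files` are illustrative, not from the original notebook):

```python
import os

def list_wav_files(data_dir):
    """Walk each label sub-folder and collect (label, filepath) pairs."""
    samples = []
    for label in sorted(os.listdir(data_dir)):
        label_dir = os.path.join(data_dir, label)
        if not os.path.isdir(label_dir):
            continue  # skip stray files at the top level
        for fname in os.listdir(label_dir):
            if fname.endswith(".wav"):
                samples.append((label, os.path.join(label_dir, fname)))
    return samples
```

Each pair gives us the class label (from the folder name) and the path of a recording to load.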
Step 2: Chopping and Padding Sound Files
One key requirement for a classifier model is that the input for every word has to be the same length. Each recording in this dataset is roughly 1 second long (16,000 samples at 16 kHz), so we chop any extra samples and pad silence at the end of recordings that are shorter than that.
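This step can be sketched as a small NumPy helper (the name `pad_or_trim` and the 16,000-sample target are assumptions based on 1 second at 16 kHz):

```python
import numpy as np

TARGET_LEN = 16000  # 1 second at 16 kHz

def pad_or_trim(signal, target_len=TARGET_LEN):
    """Chop samples beyond target_len, or pad trailing silence (zeros)."""
    signal = np.asarray(signal, dtype=np.float32)
    if len(signal) > target_len:
        return signal[:target_len]
    return np.pad(signal, (0, target_len - len(signal)))
```

After this, every clip is exactly `TARGET_LEN` samples, so every feature matrix downstream has the same shape.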
Step 3: Feature Extraction from sound wave
To learn about this in depth, please refer to my previous blog. For this problem we would be extracting the MFCC features of each sound and using them as the input to our classifier model.
Step 4: Model Architecture
Since every word consists of phonemes, the first thing the model needs to do is extract the relevant features/phonemes from the word; we would use CNN layers to capture these. Next, the model needs to look at the whole run of features/phonemes and classify the word into one of the categories. The order of the sequence plays an important role here, so we would add an LSTM layer on top.
The final architecture looks like:
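A minimal Keras sketch of such a CNN + LSTM architecture (the layer sizes and the assumed input shape of 98 MFCC frames by 13 coefficients are illustrative, not the exact values from the notebook):

```python
from tensorflow.keras import layers, models

def build_model(time_steps=98, n_mfcc=13, n_classes=12):
    """CNN layers pick up local phoneme-like patterns; the LSTM reads
    them in order; a softmax head picks one of the 12 classes."""
    model = models.Sequential([
        layers.Input(shape=(time_steps, n_mfcc)),
        layers.Conv1D(64, 3, activation="relu", padding="same"),
        layers.MaxPooling1D(2),
        layers.Conv1D(128, 3, activation="relu", padding="same"),
        layers.MaxPooling1D(2),
        layers.LSTM(64),
        layers.Dense(64, activation="relu"),
        layers.Dropout(0.3),
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

`sparse_categorical_crossentropy` lets us feed integer class labels directly instead of one-hot vectors.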
Step 5: Fit the Model
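The fit call can be sketched as follows; the random arrays stand in for the real MFCC features and labels, and a deliberately tiny model is used here just to keep the example self-contained:

```python
import numpy as np
from tensorflow.keras import layers, models

# Dummy stand-ins for the real features: 64 clips of 98 frames x 13 MFCCs,
# with integer labels in [0, 12).
X = np.random.randn(64, 98, 13).astype(np.float32)
y = np.random.randint(0, 12, size=64)

model = models.Sequential([
    layers.Input(shape=(98, 13)),
    layers.Conv1D(16, 3, activation="relu", padding="same"),
    layers.MaxPooling1D(2),
    layers.LSTM(16),
    layers.Dense(12, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# validation_split holds out 20% of the data to monitor over-fitting.
history = model.fit(X, y, validation_split=0.2,
                    epochs=2, batch_size=16, verbose=0)
```

In practice you would train for many more epochs on the real features and watch the validation accuracy.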
Step 6: Record a word via the Microphone
Let’s learn how to take input from the microphone.
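One way to do this is with the `sounddevice` package (an assumption — any recording library would do); the `normalize` helper is an illustrative extra that scales the clip before feature extraction:

```python
import numpy as np

def record_clip(duration=1.0, sr=16000):
    """Record `duration` seconds of mono audio from the default microphone.

    Requires the `sounddevice` package and a working input device.
    """
    import sounddevice as sd
    audio = sd.rec(int(duration * sr), samplerate=sr,
                   channels=1, dtype="float32")
    sd.wait()  # block until the recording is finished
    return audio.ravel()

def normalize(signal):
    """Scale the recording to the [-1, 1] range."""
    signal = np.asarray(signal, dtype=np.float32)
    peak = np.abs(signal).max()
    return signal / peak if peak > 0 else signal
```

Recording for exactly 1 second at 16 kHz keeps the microphone input consistent with the training clips, so no extra chopping or padding is needed.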
Step 7: Convert the recording to Text
We would be using the model created above to predict the word spoken in our microphone recording.
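A sketch of the prediction step, assuming a label list in the same order used during training (the `LABELS` ordering shown here is illustrative):

```python
import numpy as np

LABELS = ["yes", "no", "up", "down", "left", "right",
          "on", "off", "stop", "go", "silence", "unknown"]

def predict_word(model, mfcc_features, labels=LABELS):
    """Classify one recording from its MFCC features.

    `mfcc_features` has shape (time_steps, n_mfcc); we add a batch axis,
    take the argmax of the softmax output, and map it back to a label.
    """
    batch = mfcc_features[np.newaxis, ...]       # (1, time_steps, n_mfcc)
    probs = np.asarray(model.predict(batch))[0]  # (n_classes,)
    return labels[int(np.argmax(probs))]
```

The key detail is that `labels` must match the integer-to-class mapping the model was trained with, or every prediction will be mislabeled.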
You can find the entire Notebook Here
This was your first step towards building a SpeechToText engine. In reality the architecture is quite complex and requires heavy computational power, given the number of words it needs to learn (our final model should be able to predict any word, not just 12 classes).
Some famous architectures used are:
- BiDirectional LSTMs + CTC
- Attention based Seq2Seq model
Do wait for Part 4 of this series to learn more about such architectures.