Speech Recognition Using CRNN, CTC Loss, DeepSpeech Beam Search Decoder, and KenLM Scorer
Today, three of the most popular end-to-end ASR (Automatic Speech Recognition) models are Jasper, Wave2Letter+, and Deep Speech 2. They are now available as part of the OpenSeq2Seq toolkit made by Nvidia. All these ASR systems are based on neural acoustic models, which produce a probability distribution Pt(c) over all target characters c at each time step t; this distribution is in turn evaluated by the CTC loss function:
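In its standard form, the CTC loss is the negative log-likelihood of the target transcription y given the input x, summed over all frame-level alignments π that collapse to y:

L_CTC(x, y) = -ln p(y|x) = -ln Σ_{π ∈ B⁻¹(y)} Π_{t=1..T} Pt(π_t)

Here B is the collapsing function that removes repeated characters and then 'blank' tokens, and Pt(π_t) is the model's probability for character π_t at time step t.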
Essentially, the end-to-end speech recognition system described in this article consists of several simple parts:
- Convert raw waveforms to spectrograms using librosa or torchaudio. This article provides an intuitive understanding of mel spectrograms and this article takes a closer look at the mathematics behind this transformation.
- A spectrogram is an image, so we can use convolutional layers to extract features from it. In this article, I’ll use a popular combination of a Conv2d layer and the GELU activation function (which has been reported to outperform ReLU across a range of experiments), with dropout for regularization. I also think it is beneficial to use layer normalization and skip connections for faster convergence and better generalization. As a result, the first part of the neural network will consist of the following layers:
# First Conv2d layer
Conv2d(1, 32, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))

# 7 blocks of these layers
# Skip connection is added to this Conv2d layer
LayerNorm((64,), eps=1e-05, elementwise_affine=True)
Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
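As a hypothetical sketch of how these layers might be combined into one residual block (an assumption on my part: layer normalization applied over the mel-feature dimension, then GELU, dropout, the Conv2d layer, and a skip connection around the block):

import torch.nn as nn
import torch.nn.functional as F

class ResidualCNNBlock(nn.Module):
    # Hypothetical sketch: LayerNorm over the feature dim -> GELU -> Dropout -> Conv2d + skip
    def __init__(self, channels=32, n_feats=64, dropout=0.1):
        super().__init__()
        self.layer_norm = nn.LayerNorm(n_feats)  # matches LayerNorm((64,)) above
        self.dropout = nn.Dropout(dropout)
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, stride=1, padding=1)

    def forward(self, x):
        # x: (batch, channels, n_feats, time)
        residual = x
        x = x.transpose(2, 3)        # put n_feats last for LayerNorm
        x = self.layer_norm(x)
        x = x.transpose(2, 3)        # back to (batch, channels, n_feats, time)
        x = self.dropout(F.gelu(x))
        return self.conv(x) + residual  # skip connection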
- At the same time, a spectrogram is time-series data, so it’s quite natural to use bidirectional RNN layers such as GRU to capture time-frequency patterns in the features detected by the CNN layers. For the same reasons as before, I’ll use layer normalization and dropout:
# I will use 5 blocks of these layers
LayerNorm((512,), eps=1e-05, elementwise_affine=True)
GRU(512, 512, batch_first=True, bidirectional=True)
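As a hypothetical sketch of one such recurrent block (the listing above shows the 512-input case; when bidirectional blocks are stacked, each subsequent block would presumably take the 1024-dimensional bidirectional output, which is also why the classifier below starts from 1024 inputs):

import torch.nn as nn

class BiGRUBlock(nn.Module):
    # Hypothetical sketch: LayerNorm -> bidirectional GRU -> Dropout
    def __init__(self, input_size=512, hidden_size=512, dropout=0.1):
        super().__init__()
        self.layer_norm = nn.LayerNorm(input_size)
        self.gru = nn.GRU(input_size, hidden_size, batch_first=True, bidirectional=True)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # x: (batch, time, input_size)
        x = self.layer_norm(x)
        x, _ = self.gru(x)  # (batch, time, 2 * hidden_size) due to bidirectionality
        return self.dropout(x)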
- What we actually need is to map each small vertical slice of the spectrogram to a character, so the model will produce, for each vertical feature vector, a probability distribution over the characters. This can be done using Linear (fully connected) layers:
# Input for this layer is an output from the last GRU layer
# We need to gradually reduce the number of outputs from 1024 to the number of target classes in the LibriSpeech dataset: 28 characters + the CTC 'blank' = 29
Linear(in_features=1024, out_features=512, bias=True)
Linear(in_features=512, out_features=29, bias=True)
- Then we can use a greedy decoder or a beam search decoder to produce the final transcription.
A greedy decoder takes the model’s output and, for each vertical feature vector, chooses the character with the highest probability; in CTC decoding, repeated characters are then collapsed and ‘blank’ tokens removed (see the sketch below).
A beam search decoder is slightly more complicated. Beam search relies on the heuristic that chains of random variables with high joint probability tend to have high-probability conditionals. Essentially, it takes the k most probable options for p(x1); then, for each of those, the k most probable options for p(x2|x1); it then keeps the k candidates with the highest value of p(x1) * p(x2|x1) and repeats this process step by step.
I think this video by Andrew Ng and this article are the most intuitive guides on this subject.
According to this paper on NMT by Google, “…we found…that a well-tuned beam search is crucial to obtaining state-of-the-art results.”
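To make the greedy strategy concrete, here is a minimal greedy CTC decoder sketch; the blank index of 28 matches the 29-class output above, and int_to_text is the index-to-character helper defined later in the article:

import torch

def greedy_ctc_decode(log_probs, blank=28):
    # log_probs: (time, n_classes) tensor of per-step (log-)probabilities
    best = torch.argmax(log_probs, dim=-1).tolist()  # most probable class per time step
    decoded, prev = [], blank
    for idx in best:
        # Collapse repeated characters, then drop CTC 'blank' tokens
        if idx != prev and idx != blank:
            decoded.append(idx)
        prev = idx
    return int_to_text(decoded)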
- As usual for CRNN models, CTC loss will be used during the training process. You can read more about this loss function here, here, or here.
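In PyTorch, this loss is available as nn.CTCLoss; a minimal usage sketch (blank index 28 matches the 29-class output above, and the shapes are purely illustrative):

import torch
import torch.nn as nn

criterion = nn.CTCLoss(blank=28, zero_infinity=True)

# CTCLoss expects log-probabilities of shape (time, batch, n_classes)
log_probs = torch.randn(200, 4, 29).log_softmax(dim=-1)
targets = torch.randint(0, 28, (4, 30), dtype=torch.long)  # integer-encoded labels
input_lengths = torch.full((4,), 200, dtype=torch.long)    # valid time steps per sample
label_lengths = torch.full((4,), 30, dtype=torch.long)     # label length per sample

loss = criterion(log_probs, targets, input_lengths, label_lengths)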
- Also, it’s quite convenient to use the Levenshtein distance and WER (word error rate) as metrics for measuring the difference between the original utterance and the generated transcription.
The resulting model has the following architecture:
In this article, I’ve used the LibriSpeech ASR corpus of approximately 1000 hours of segmented and aligned English speech derived from read audiobooks. Utterances in this dataset are made of 28 characters (target classes):
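The exact character set isn’t shown above, but 28 classes presumably correspond to the 26 lowercase letters plus the space and apostrophe; a hypothetical mapping:

# Assumed character set: space, apostrophe, and 26 lowercase letters (28 classes);
# index 28 is reserved for the CTC 'blank'
CHARS = " 'abcdefghijklmnopqrstuvwxyz"
char_to_int = {c: i for i, c in enumerate(CHARS)}
int_to_char = {i: c for i, c in enumerate(CHARS)}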
Then we need to transform waveforms to spectrograms using MelSpectrogram and define functions to convert text to integers and vice versa:
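A minimal sketch of this step with torchaudio (the 128-mel setting is an assumption, consistent with the LayerNorm((64,)) shape after the stride-2 convolution; the character mapping comes from the snippet above):

import torchaudio

# Assumed: 128 mel bins, so the stride-2 Conv2d halves the feature dim to 64
mel_transform = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_mels=128)

def text_to_int(text):
    return [char_to_int[c] for c in text.lower()]

def int_to_text(indices):
    return ''.join(int_to_char[i] for i in indices)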
This collate function is necessary to prepare tensors required by our model — spectrograms and labels along with their lengths. These ‘lengths’ tensors will be used later by the CTC loss function:
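A sketch of such a collate function, assuming torchaudio’s LIBRISPEECH item layout and the mel transform defined above:

import torch
import torch.nn as nn

def collate_fn(batch):
    # Hypothetical sketch: pad spectrograms and labels, collect their lengths for CTC loss
    spectrograms, labels, input_lengths, label_lengths = [], [], [], []
    for waveform, _, utterance, *_ in batch:  # torchaudio LIBRISPEECH item layout
        spec = mel_transform(waveform).squeeze(0).transpose(0, 1)  # (time, n_mels)
        label = torch.tensor(text_to_int(utterance), dtype=torch.long)
        spectrograms.append(spec)
        labels.append(label)
        input_lengths.append(spec.shape[0] // 2)  # time dim is halved by the stride-2 Conv2d
        label_lengths.append(len(label))
    specs = nn.utils.rnn.pad_sequence(spectrograms, batch_first=True)  # (batch, time, n_mels)
    specs = specs.unsqueeze(1).transpose(2, 3)                         # (batch, 1, n_mels, time)
    labels = nn.utils.rnn.pad_sequence(labels, batch_first=True)
    return specs, labels, torch.tensor(input_lengths), torch.tensor(label_lengths)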
Now we have to initialize DataLoaders using training and validation datasets and set the random seed to a fixed value to get reproducible results:
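A sketch of this setup (the splits, seed value, and batch size are assumptions):

import torch
import torchaudio
from torch.utils.data import DataLoader

torch.manual_seed(7)  # fix the random seed for reproducible results

train_dataset = torchaudio.datasets.LIBRISPEECH("./data", url="train-clean-100", download=True)
valid_dataset = torchaudio.datasets.LIBRISPEECH("./data", url="test-clean", download=True)

train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True, collate_fn=collate_fn)
valid_loader = DataLoader(valid_dataset, batch_size=16, shuffle=False, collate_fn=collate_fn)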
The next step is to define training and validation loops, choose an optimizer, hyperparameters, and metrics to evaluate the training progress. I’ve decided to use the AdamW optimizer with a fairly low learning rate of 5e-4, train this model for 25 epochs, and use the levenshtein and jiwer packages to calculate quality metrics:
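A condensed sketch of the training loop under these choices, reusing the CTC criterion from the earlier snippet (model stands for the assembled CRNN, which isn’t fully reproduced here):

import torch
from jiwer import wer

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4)

def train_epoch(model, loader, device="cuda"):
    model.train()
    for specs, labels, input_lengths, label_lengths in loader:
        specs, labels = specs.to(device), labels.to(device)
        optimizer.zero_grad()
        output = model(specs)                                   # (batch, time, 29)
        log_probs = output.log_softmax(dim=-1).transpose(0, 1)  # (time, batch, 29) for CTCLoss
        loss = criterion(log_probs, labels, input_lengths, label_lengths)
        loss.backward()
        optimizer.step()

# Word error rate between a reference and a hypothesis via jiwer
print(wer("the quick brown fox", "the quick brown box"))  # 0.25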
Pretrained model weights are available here, and the generated scorer is available here. Also, please note that the batch sizes were chosen based on the amount of available GPU memory; this model consumed about 22 GB of VRAM during training:
For testing purposes, I’ve sampled 20 random spectrograms from the test set:
As you can see, this relatively small model is definitely capable of recognizing human speech and shows good performance on the LibriSpeech dataset.
This project is also available on my GitHub.