Published in Geek Culture

Photo by Matt Botsford on Unsplash

Speech Recognition Using CRNN, CTC Loss, DeepSpeech Beam Search Decoder, and KenLM Scorer


Today, some of the most popular end-to-end ASR (Automatic Speech Recognition) models are distributed as part of a toolkit made by Nvidia. All of these ASR systems are based on neural acoustic models that produce a probability distribution over all target characters at each time step, which is in turn evaluated by the CTC loss function:

Summarization of CTC ASR pipeline’s architectures by Nvidia

Essentially, the end-to-end speech recognition system described in this article consists of several simple parts:

How to convert waveform to spectrogram using librosa and torchaudio
Waveform converted to spectrogram
  • A spectrogram is an image, so we can use convolutional layers to extract features from it. In this article, I’ll use a popular combination of a convolutional layer and a GELU activation function (which has been reported to perform better than ReLU across a range of experiments) with dropout for regularization. Also, I think it would be beneficial to use layer normalization and residual connections for faster convergence and better generalization. As a result, the first part of the neural network will consist of the following layers:
Conv2d(1, 32, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
LayerNorm((64,), eps=1e-05, elementwise_affine=True)
Dropout(p=0.2, inplace=False)
Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  • At the same time, a spectrogram is time-series data, so it’s quite natural to use bidirectional recurrent layers such as GRU to capture time-frequency patterns from the features detected by the convolutional layers. For the same reasons as before, I’ll use layer normalization and dropout:

LayerNorm((512,), eps=1e-05, elementwise_affine=True)
GRU(512, 512, batch_first=True, bidirectional=True)
Dropout(p=0.2, inplace=False)
  • Actually, what we need is to map each small vertical slice of the spectrogram image to a certain character, so our model will produce a probability distribution over the characters for each vertical feature vector. This can be done using linear (fully connected) layers:

Linear(in_features=, out_features=, bias=True)
Dropout(p=0.2, inplace=False)
Linear(in_features=, out_features=, bias=True)
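Putting the pieces together, the layers above can be assembled into a single CRNN module. The sketch below is a minimal illustration of that architecture; the exact layer sizes, block counts, and dropout rates are illustrative assumptions, not the article's precise configuration.

```python
import torch
import torch.nn as nn

# A minimal sketch of the CRNN acoustic model: CNN feature extractor,
# bidirectional GRU, and a linear classifier head.
class CRNN(nn.Module):
    def __init__(self, n_mels=64, n_classes=29, hidden=512):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1),
            nn.GELU(),
            nn.Dropout(0.2),
            nn.Conv2d(32, 32, kernel_size=3, stride=1, padding=1),
            nn.GELU(),
        )
        feat = 32 * (n_mels // 2)              # channels * downsampled mel bins
        self.norm = nn.LayerNorm(feat)
        self.rnn = nn.GRU(feat, hidden, batch_first=True, bidirectional=True)
        self.classifier = nn.Sequential(
            nn.Linear(2 * hidden, hidden),     # 2x because the GRU is bidirectional
            nn.GELU(),
            nn.Dropout(0.2),
            nn.Linear(hidden, n_classes),
        )

    def forward(self, x):                      # x: (batch, 1, n_mels, time)
        x = self.cnn(x)                        # (batch, 32, n_mels//2, time//2)
        b, c, f, t = x.shape
        x = x.permute(0, 3, 1, 2).reshape(b, t, c * f)
        x, _ = self.rnn(self.norm(x))          # (batch, time//2, 2*hidden)
        return self.classifier(x)              # per-step character logits

logits = CRNN()(torch.randn(2, 1, 64, 100))
print(logits.shape)   # torch.Size([2, 50, 29])
```

Note how the CNN halves the time axis, so the per-step logits cover two spectrogram frames each.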
  • Then we can use a greedy decoder or a beam search decoder to produce the final transcription.
    A greedy decoder takes in the model’s output and, for each vertical feature vector, chooses the character with the highest probability.
    A beam search decoder is slightly more complicated. Beam search is based on a heuristic that assumes that chains of random variables with high joint probability have fairly high-probability conditionals. Basically, it keeps the B highest-probability candidates for the first output, extends each of them with the highest-probability candidates for the next output, keeps the B extended sequences with the highest overall probability, and repeats.
    I think the explanations by Andrew Ng are among the most intuitive guides on this subject; Google’s work on NMT also discusses beam search in detail.
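The greedy decoding step described above can be sketched in a few lines: take the argmax character at each time step, collapse consecutive repeats, and drop the CTC blank token. The blank index of 0 and the toy alphabet here are assumptions for illustration.

```python
import torch

# Greedy CTC decoding: argmax per step, collapse repeats, remove blanks.
def greedy_decode(logits, idx_to_char, blank=0):
    best = logits.argmax(dim=-1)                # (time,) best class per step
    chars = []
    prev = blank
    for idx in best.tolist():
        if idx != prev and idx != blank:        # collapse repeats, skip blanks
            chars.append(idx_to_char[idx])
        prev = idx
    return "".join(chars)

idx_to_char = {0: "", 1: "a", 2: "b", 3: "c"}
logits = torch.tensor([[0.1, 0.9, 0.0, 0.0],    # 'a'
                       [0.1, 0.8, 0.1, 0.0],    # 'a' again -> collapsed
                       [0.9, 0.0, 0.1, 0.0],    # blank
                       [0.0, 0.1, 0.9, 0.0]])   # 'b'
print(greedy_decode(logits, idx_to_char))       # ab
```

A beam search decoder would keep B partial hypotheses per step instead of a single argmax path.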
  • As usual for CTC-based models, the CTC loss will be used during the training process. You can read more about this loss function in the original paper by Graves et al. or in one of the many online tutorials.
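In PyTorch, the CTC loss is available as `nn.CTCLoss`, which expects log-probabilities in `(time, batch, classes)` layout along with the input and target lengths. The shapes and blank index below are illustrative assumptions.

```python
import torch
import torch.nn as nn

# nn.CTCLoss wiring: note the transpose to (time, batch, classes)
# and the two 'lengths' tensors, which mask out padding.
ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)

batch, time_steps, n_classes = 4, 50, 29
logits = torch.randn(batch, time_steps, n_classes)
log_probs = logits.log_softmax(-1).transpose(0, 1)   # (time, batch, classes)

targets = torch.randint(1, n_classes, (batch, 20))   # padded labels, no blanks
input_lengths = torch.full((batch,), time_steps, dtype=torch.long)
target_lengths = torch.randint(10, 21, (batch,), dtype=torch.long)

loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
print(loss.item())   # a positive scalar (negative log-likelihood)
```

`zero_infinity=True` guards against infinite losses on degenerate alignments early in training.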
  • Also, it’s quite convenient to use and as the metrics for measuring the difference between original utterance and generated transcription.

Speech recognition model’s architecture (1,645,181 trainable parameters)



In this article, I’ve used the of approximately of segmented and aligned English speech, derived from reading audiobooks. Utterances in this dataset are made of characters (target classes):

Then we need to transform the waveforms into spectrograms using torchaudio and define functions to convert text to integers and vice versa:
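The text-to-integer mapping can be sketched as a simple lookup table. The alphabet below (space, apostrophe, and a–z, with index 0 reserved for the CTC blank) is a common choice and an assumption, not necessarily the article's exact character set.

```python
import string

# Character <-> integer mappings; index 0 is reserved for the CTC blank.
chars = [" ", "'"] + list(string.ascii_lowercase)
char_to_int = {c: i + 1 for i, c in enumerate(chars)}
int_to_char = {i + 1: c for i, c in enumerate(chars)}

def text_to_ints(text):
    return [char_to_int[c] for c in text.lower()]

def ints_to_text(ints):
    return "".join(int_to_char[i] for i in ints)

encoded = text_to_ints("hello world")
print(encoded)
print(ints_to_text(encoded))   # hello world
```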

This collate function is necessary to prepare the tensors required by our model and the CTC loss: padded spectrograms, padded labels, and their corresponding lengths. These ‘lengths’ tensors will be used later by the CTC loss function:
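A collate function along these lines pads variable-length spectrograms and labels into batch tensors. The tensor layout and the assumption that the CNN halves the time axis are illustrative, not the article's exact implementation.

```python
import torch
import torch.nn.utils.rnn as rnn_utils

# Pad (spectrogram, label) pairs into batch tensors plus length tensors.
def collate_fn(batch):
    specs, labels, input_lengths, label_lengths = [], [], [], []
    for spec, label in batch:                      # spec: (n_mels, time)
        specs.append(spec.transpose(0, 1))         # (time, n_mels) for padding
        labels.append(label)
        input_lengths.append(spec.shape[1] // 2)   # CNN halves the time axis
        label_lengths.append(len(label))
    specs = rnn_utils.pad_sequence(specs, batch_first=True)  # (batch, time, n_mels)
    specs = specs.transpose(1, 2).unsqueeze(1)               # (batch, 1, n_mels, time)
    labels = rnn_utils.pad_sequence(labels, batch_first=True)
    return specs, labels, torch.tensor(input_lengths), torch.tensor(label_lengths)

batch = [(torch.randn(64, 80), torch.tensor([1, 2, 3])),
         (torch.randn(64, 120), torch.tensor([4, 5]))]
specs, labels, in_lens, lab_lens = collate_fn(batch)
print(specs.shape, labels.shape)   # torch.Size([2, 1, 64, 120]) torch.Size([2, 3])
```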

Now we have to initialize data loaders using the training and validation datasets and set the random seed to a fixed value to get reproducible results:
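The setup can be sketched roughly as follows; the datasets here are stand-ins and the batch size and seed value are illustrative assumptions.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

SEED = 7
torch.manual_seed(SEED)  # fix global randomness (weight init, dropout, ...)

train_dataset = TensorDataset(torch.randn(100, 64), torch.randint(0, 29, (100,)))
valid_dataset = TensorDataset(torch.randn(20, 64), torch.randint(0, 29, (20,)))

# a seeded generator makes the shuffling order itself reproducible
g = torch.Generator().manual_seed(SEED)
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True, generator=g)
valid_loader = DataLoader(valid_dataset, batch_size=16, shuffle=False)

xb, yb = next(iter(train_loader))
print(xb.shape)   # torch.Size([16, 64])
```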


The next step is to define the training and validation loops, choose an optimizer, hyperparameters, and metrics to evaluate the training progress. I’ve decided to use an adaptive optimizer with a fairly low learning rate, train this model for a fixed number of epochs, and use external packages to calculate the quality metrics:

Training and validation loops
Model implementation directly follows the architecture described in the previous paragraph
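The core of the training loop looks roughly like the sketch below; the model and loader are stand-ins, and the hyperparameters are illustrative assumptions rather than the article's exact values.

```python
import torch
import torch.nn as nn

# One training epoch: forward pass, CTC loss on (time, batch, classes)
# log-probabilities, backward pass, optimizer step.
def train_epoch(model, loader, optimizer, ctc_loss, device="cpu"):
    model.train()
    total = 0.0
    for specs, labels, input_lengths, label_lengths in loader:
        specs, labels = specs.to(device), labels.to(device)
        optimizer.zero_grad()
        logits = model(specs)                               # (batch, time, classes)
        log_probs = logits.log_softmax(-1).transpose(0, 1)  # (time, batch, classes)
        loss = ctc_loss(log_probs, labels, input_lengths, label_lengths)
        loss.backward()
        optimizer.step()
        total += loss.item()
    return total / len(loader)
```

The validation loop is the same minus the backward pass and optimizer step, wrapped in `torch.no_grad()`.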

For inference, I’ve used the by It allows us to use an based on the toolkit.

Generation of the custom alphabet mapping file that will be used for Alphabet initialization
KenLM model and external .scorer generation
Inference process using ds-ctcdecoder
Training and validation results

Pretrained model weights and the generated scorer are both available for download. Also, please note that the batch sizes were chosen based on the amount of GPU memory available during training:

CUDA device used for this project


For testing purposes, I’ve sampled 20 random spectrograms from the test set:

Model inference and displaying spectrograms
Results for a few random samples from the test set

As you can see, this relatively small model is definitely capable of recognizing human speech and showed good performance on the LibriSpeech dataset.


