# Speech Recognition Using CRNN, CTC Loss, DeepSpeech Beam Search Decoder, and KenLM Scorer

# Theory

Today, three of the most popular end-to-end ASR (Automatic Speech Recognition) models are **Jasper**, **Wave2Letter+**, and **Deep Speech 2**. All of them are now available as part of the **OpenSeq2Seq** toolkit made by NVIDIA. These ASR systems are based on neural acoustic models that produce a probability distribution **Pt(c)** over all target characters **c** at each time step **t**, which is in turn evaluated by the **CTC loss** function:
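The original equation image did not survive here; a standard form of the CTC objective, written in the notation above, sums the probabilities of all frame-level alignments that collapse to the target transcription **y** (this is the textbook formulation, not necessarily the exact figure from the original post):

$$
\mathcal{L}_{\mathrm{CTC}} = -\log \sum_{a \,\in\, \mathcal{B}^{-1}(y)} \prod_{t=1}^{T} P_t(a_t)
$$

where $\mathcal{B}$ is the collapsing function that removes repeated characters and blanks, and $\mathcal{B}^{-1}(y)$ is the set of all length-$T$ alignments $a$ that map to $y$.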

Essentially, the end-to-end speech recognition system described in this article consists of several simple parts:

- Convert raw waveforms to **spectrograms** using **librosa** or **torchaudio**. **This article** provides an intuitive understanding of **mel spectrograms**, and **this article** covers the **mathematics behind this transformation**.

- A spectrogram is an image, so we can use **convolutional layers** to extract features from it. In this article, I’ll use a popular combination: a **Conv2d** layer with **GELU** activation (**it is better** than **ReLU** here) and **dropout** for regularization. Also, I think it is beneficial to use **layer normalization** and **skip connections** for faster convergence and better generalization. As a result, the first part of the neural network consists of the following layers:

```
# First Conv2d layer
Conv2d(1, 32, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))

# 7 blocks of these layers
LayerNorm((64,), eps=1e-05, elementwise_affine=True)
GELU()
Dropout(p=0.2, inplace=False)
# Skip connection is added to this Conv2d layer
Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
```
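The repeated block above, together with its skip connection, can be sketched as a small PyTorch module. This is my own reconstruction (the class name is made up), assuming inputs shaped `(batch, channels, features, time)` with `LayerNorm` applied over the feature axis:

```python
import torch
import torch.nn as nn

class ResidualCNN(nn.Module):
    """One CNN block: LayerNorm -> GELU -> Dropout -> Conv2d, plus a skip connection."""

    def __init__(self, channels=32, n_feats=64, dropout=0.2):
        super().__init__()
        self.layer_norm = nn.LayerNorm(n_feats)
        self.gelu = nn.GELU()
        self.dropout = nn.Dropout(dropout)
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, stride=1, padding=1)

    def forward(self, x):
        # x: (batch, channels, n_feats, time)
        residual = x
        out = x.transpose(2, 3)        # move the feature axis last for LayerNorm
        out = self.layer_norm(out)
        out = out.transpose(2, 3)
        out = self.dropout(self.gelu(out))
        out = self.conv(out)
        return out + residual          # skip connection
```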

- At the same time, a spectrogram is time-series data, so it’s quite natural to use bidirectional **RNN** layers such as **GRU** to capture time-frequency patterns from the features detected by the **CNN** layers. For the same reasons as before, I’ll use layer normalization and dropout:

```
# I will use 5 blocks of these layers
LayerNorm((512,), eps=1e-05, elementwise_affine=True)
GELU()
GRU(512, 512, batch_first=True, bidirectional=True)
Dropout(p=0.2, inplace=False)
```
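One of these recurrent blocks can be sketched as a PyTorch module. The class name and the `input_dim` handling are my own assumptions: since a bidirectional GRU outputs `2 * rnn_dim` features, blocks after the first would take 1024-dimensional input, which also matches the `in_features=1024` of the final classifier:

```python
import torch
import torch.nn as nn

class BiGRUBlock(nn.Module):
    """One recurrent block: LayerNorm -> GELU -> bidirectional GRU -> Dropout."""

    def __init__(self, input_dim=512, rnn_dim=512, dropout=0.2):
        super().__init__()
        self.layer_norm = nn.LayerNorm(input_dim)
        self.gru = nn.GRU(input_dim, rnn_dim, batch_first=True, bidirectional=True)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # x: (batch, time, input_dim)
        out = torch.nn.functional.gelu(self.layer_norm(x))
        out, _ = self.gru(out)         # (batch, time, 2 * rnn_dim)
        return self.dropout(out)
```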

- Actually, what we need is to map each small vertical slice of the spectrogram image to a certain character, so our model will produce a probability distribution over the characters for each vertical feature vector. This can be done using **Linear** layers:

```
# Input for this layer is an output from the last GRU layer
# We need to gradually reduce the number of outputs from 1024
# to the number of characters used in the LibriSpeech dataset: 28 + 'blank' = 29
Linear(in_features=1024, out_features=512, bias=True)
GELU()
Dropout(p=0.2, inplace=False)
Linear(in_features=512, out_features=29, bias=True)
```

- Then we can use a **greedy decoder** or a **beam search decoder** to produce the final transcription.

A **greedy decoder** takes in the model’s output and, for each vertical feature vector, chooses the character with the highest probability.
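For CTC outputs, greedy decoding additionally collapses repeated characters and removes the ‘blank’. A minimal sketch in plain Python (my own function name; `probs` is a time × characters table of probabilities):

```python
def greedy_decode(probs, charset, blank=0):
    """Pick the argmax character per time step, collapse repeats, drop blanks."""
    best = [max(range(len(p)), key=p.__getitem__) for p in probs]
    out, prev = [], None
    for idx in best:
        if idx != prev and idx != blank:
            out.append(charset[idx])
        prev = idx
    return "".join(out)
```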

A **beam search decoder** is slightly more complicated. Beam search is based on the heuristic that chains of random variables with high probability tend to have fairly high-probability conditionals. Basically, it takes the **k** highest-probability solutions for **p(x1)**, then for each of those takes the **k** highest-probability solutions for **p(x2|x1)**. Then we keep the **k** of those with the highest value of **p(x2|x1) * p(x1)** and repeat.
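The procedure just described can be sketched in a few lines. Note this is the generic, textbook beam search over per-step distributions, not the CTC prefix beam search an ASR decoder actually uses; function and variable names are my own:

```python
import math

def beam_search(step_probs, k=3):
    """Keep the k highest-probability chains at each step.

    step_probs: a (time x num_symbols) table of per-step probabilities.
    Returns a list of (sequence, log-probability) pairs, best first.
    """
    beams = [((), 0.0)]                        # (sequence, log-prob)
    for probs in step_probs:
        # extend every kept chain by every possible next symbol
        candidates = [
            (seq + (i,), score + math.log(p))
            for seq, score in beams
            for i, p in enumerate(probs)
        ]
        # keep only the k best-scoring chains
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
    return beams
```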

I think **this video** and **this article** are the most intuitive guides on this subject.

According to **this paper**, **“…we found…that a well-tuned beam search is crucial to obtaining state-of-the-art results.”**

- As usual for **CRNN** models, **CTC loss** will be used during the training process. You can read more about this loss function **here**, **here**, or **here**.

- Also, it’s quite convenient to use the **Levenshtein distance** to calculate **WER** and evaluate the quality of the transcriptions.
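Both ideas reduce to edit distance computed by dynamic programming. A minimal sketch with my own helper names (the code later in this article uses the **levenshtein** and **jiwer** packages instead):

```python
def levenshtein(a, b):
    """Edit distance between two sequences, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]

def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    return levenshtein(ref, hyp) / len(ref)
```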

**The resulting model has the following architecture:**

# Practice

## Dataset

In this article, I’ve used the **LibriSpeech ASR corpus** of approximately **1000 hours** of segmented and aligned English speech, derived from read audiobooks. Utterances in this dataset are made of **28** characters (target classes):

Then we need to transform waveforms to spectrograms using **MelSpectrogram** and define functions to convert text to integers and vice versa:
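The spectrogram side is `torchaudio.transforms.MelSpectrogram`, shown earlier. The text side can be sketched as a small mapping class; this is my own reconstruction, which assumes the 28-character LibriSpeech set (26 letters, space, apostrophe) with index 0 reserved for the CTC ‘blank’:

```python
class TextTransform:
    """Maps transcript text to integer labels and back."""

    def __init__(self):
        chars = "' abcdefghijklmnopqrstuvwxyz"            # 28 target characters
        self.char_to_int = {c: i + 1 for i, c in enumerate(chars)}  # 0 = blank
        self.int_to_char = {i: c for c, i in self.char_to_int.items()}

    def text_to_int(self, text):
        return [self.char_to_int[c] for c in text.lower()]

    def int_to_text(self, labels):
        return "".join(self.int_to_char[i] for i in labels)
```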

This collate function is necessary to prepare tensors required by our model — **spectrograms and labels along with their lengths**. These ‘lengths’ tensors will be used later by the CTC loss function:
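A collate function along these lines pads the batch and records the true lengths. This is a sketch, not the article’s exact code: the helper names are assumptions, and the halving of `input_lengths` assumes the stride-2 convolution at the front of the network:

```python
import torch
import torch.nn as nn

def collate_fn(batch, text_transform, mel_transform):
    """Turn (waveform, transcript) pairs into padded tensors plus CTC lengths."""
    spectrograms, labels, input_lengths, label_lengths = [], [], [], []
    for waveform, transcript in batch:
        spec = mel_transform(waveform).squeeze(0).transpose(0, 1)  # (time, n_mels)
        spectrograms.append(spec)
        target = torch.tensor(text_transform.text_to_int(transcript))
        labels.append(target)
        # CTC needs the model-output length; time is halved by the stride-2 conv
        input_lengths.append(spec.shape[0] // 2)
        label_lengths.append(len(target))
    spectrograms = nn.utils.rnn.pad_sequence(spectrograms, batch_first=True)
    spectrograms = spectrograms.unsqueeze(1).transpose(2, 3)  # (batch, 1, n_mels, time)
    labels = nn.utils.rnn.pad_sequence(labels, batch_first=True)
    return spectrograms, labels, torch.tensor(input_lengths), torch.tensor(label_lengths)
```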

Now we have to initialize **DataLoader**s using the training and validation datasets and set the random seed to a fixed value to get reproducible results:

## Training

The next step is to define training and validation loops, choose an optimizer, hyperparameters, and metrics to evaluate the training progress. I’ve decided to use the **AdamW** optimizer with a fairly low learning rate of **5e-4**, train this model for **25** epochs, and use the **levenshtein** and **jiwer** packages to calculate quality metrics:
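The core of the training loop can be sketched as a single step with `nn.CTCLoss` and `AdamW`. This is a sketch under the assumption that the model maps a batch of spectrograms to `(batch, time, n_classes)` scores; it is not the article’s exact code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def train_step(model, batch, optimizer, criterion):
    """One optimization step with CTC loss."""
    spectrograms, labels, input_lengths, label_lengths = batch
    optimizer.zero_grad()
    output = model(spectrograms)                              # (batch, time, n_classes)
    log_probs = F.log_softmax(output, dim=2).transpose(0, 1)  # CTC wants (time, batch, n_classes)
    loss = criterion(log_probs, labels, input_lengths, label_lengths)
    loss.backward()
    optimizer.step()
    return loss.item()

# setup with the hyperparameters from the article:
# optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4)
# criterion = nn.CTCLoss(blank=0)
```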

For inference, I’ve used the **beam search decoder** by **DeepSpeech**. It allows us to use an **external language model scorer** based on the **KenLM** toolkit.

Pretrained model weights are available **here**, and the generated scorer is available **here**. Also, please note that the batch sizes were chosen based on the amount of available GPU memory; this model consumed about **22 GB** of VRAM during training:

## Testing

For testing purposes, I’ve sampled 20 random spectrograms from the test set:

As you can see, this relatively small model is definitely capable of recognizing human speech and showed good performance on the LibriSpeech dataset.

This project is also available **on my GitHub**.