From Wav2Vec2 to Decoded Sentences

Shiry Yonash
9 min read · Jun 5, 2022


Image by Pixabay.

Have you read about Wav2Vec2 and wondered how its output gets decoded into correct sentences? Have you started working on a Speech-to-Text pipeline using Wav2Vec2?

This blog explains how a number of buzz-words fit together: Fine-Tuned Wav2Vec2 model, Output Decoding, CTC Encoding, Beam-Search, Language Model, and Hot-Words Boosting. They are all pieces of the pipeline explained below.

The blog will explain how to fine-tune Wav2Vec2 and use it to do inference and decoding. It will skip the base Wav2Vec2 model since many excellent blogs already cover it (see links in the section below).

Before diving into the various parts, let’s first look at how it all combines:

From Wav2Vec2 to Decoded Sentences

With the full process visual in our heads, let’s examine its parts and how each one contributes.

Wav2Vec2

The Wav2Vec2 model is trained in a self-supervised manner. It is first pre-trained on audio alone for representation learning, and then fine-tuned for a specific task (for example, STT) with additional labels.

Wav2Vec2 can be fine-tuned for STT in English, but if STT is required for another language, XLS-R should be used. XLS-R is the multilingual version of Wav2Vec2 and was trained on 128 languages.

More great resources on Wav2Vec2 & XLS-R:

The Wav2Vec2 Paper

Wav2vec 2.0: Learning the structure of speech from raw audio

XLS-R: Self-supervised speech processing for 128 languages

An Illustrated Tour of Wav2vec 2.0

Fine-Tuning Wav2Vec2

The Fine-Tuned Model:

The first component of Wav2Vec2 consists of a stack of CNN layers that are used to extract acoustically meaningful — but contextually independent — features from the raw speech signal. According to the Wav2Vec2 paper, this part of the model has already been sufficiently trained during pre-training and does not need to be fine-tuned anymore. For this reason, we freeze these layers when fine-tuning.

On top of the CNN layers, we have the transformer that outputs a context representation. The transformer is trained while fine-tuning (it is not frozen). Since the task is STT, the model has to map this sequence of context representations to its corresponding transcription. To achieve this, a linear layer is added on top of the transformer block. This linear layer is used to classify each context representation into a token (character) class.

The output size of this linear layer corresponds to the number of characters in the vocabulary, which is derived from the labeled dataset used for fine-tuning. The fine-tuned model can also be referred to as the acoustic model.
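
As a rough sketch, this setup can be expressed with the Hugging Face Transformers library (the checkpoint name, vocabulary size, and pad-token id below are placeholders that depend on your dataset and tokenizer, and a recent version of transformers is assumed):

```python
# Minimal fine-tuning setup sketch with Hugging Face Transformers.
from transformers import Wav2Vec2ForCTC

model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-base",   # self-supervised pre-trained checkpoint
    ctc_loss_reduction="mean",  # fine-tuning uses the CTC loss (next section)
    pad_token_id=0,             # id of the padding/BLANK token in your vocabulary
    vocab_size=32,              # number of characters derived from the labeled dataset
)

# The CNN feature extractor was trained sufficiently during pre-training,
# so it stays frozen; only the transformer and the new linear head are fine-tuned.
model.freeze_feature_encoder()
```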

The Fine-Tuning Loss:

Wav2Vec2 is fine-tuned using Connectionist Temporal Classification (CTC) loss. CTC is an algorithm used to train neural networks for sequence-to-sequence problems, mainly in automatic speech recognition and handwriting recognition.

The concept of CTC encoding is explained below. Using CTC as a loss function is beyond the scope of this blog.

For more details on CTC and how it is used as a loss function check out this excellent blog: https://distill.pub/2017/ctc/

The Challenge Of Decoding

Now that we have a fine-tuned STT model, how do we get the transcribed text for a speech sample?

Given an audio sample, the audio is sliced into evenly spaced chunks of time. These chunks are passed through the fine-tuned acoustic model, which outputs a per-timestamp probability matrix: the acoustic model predicts character probabilities for each time slice.

The image below shows an audio signal (top row), transformed into a spectrogram (middle row) and the acoustic model’s per-timestamp output as a heat-map (bottom row). Each heat-map time slot can be viewed as a histogram of the character probabilities.

Character probabilities predicted for each audio slice

Decoding the model output is the process of taking this probabilities matrix and producing a human-readable text from it.
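
As an illustration, here is a minimal inference sketch with Hugging Face Transformers (the checkpoint name and audio file are placeholders; the audio is assumed to be 16 kHz mono):

```python
import soundfile as sf
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

speech, sample_rate = sf.read("sample.wav")       # hypothetical 16 kHz mono recording
inputs = processor(speech, sampling_rate=16_000, return_tensors="pt")

with torch.no_grad():
    logits = model(inputs.input_values).logits    # shape: (1, time_steps, vocab_size)

# per-timestamp character probabilities: the matrix the decoder works on
probs = torch.softmax(logits, dim=-1)[0]          # shape: (time_steps, vocab_size)
```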

Naive Decoding

Our first, naive decoding solution: For each audio slice, choose the most probable character using the Argmax function.
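
A sketch of this naive decoder, reusing the probs matrix from the inference snippet above (id_to_char, a mapping from class index to character, is assumed to exist):

```python
# Naive decoding: take the argmax character at every time step.
naive_ids = probs.argmax(dim=-1)                              # one class id per time step
naive_text = "".join(id_to_char[int(i)] for i in naive_ids)   # one character per time step

# A few seconds of audio yields hundreds of time steps, so the result is full of
# repeats, e.g. "HHHEEELLLLLOOO" where the spoken word was "HELLO".
```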

What are the problems with the naive decoder?

  1. The model’s output length is proportional to the audio length, not to the transcribed text length. Since there are usually many more timestamps than transcribed characters, we end up with many more predicted characters than actual transcribed characters.
  2. People speak at different speeds so we might get the same character predicted in multiple consecutive time steps. A possible solution could be to ignore duplicated consecutive characters (by collapsing them to a single character). In such a case, how would we handle words that are spelled with duplicated characters like HELLO?

Naive Decoding Example

CTC Encoding

CTC stands for Connectionist Temporal Classification. It is at once an algorithm, an encoding method, and a loss function.

How can we use CTC encoding to solve the problems described above?

CTC adds a BLANK symbol to the possible predicted character set. This BLANK can be used in the predicted output when:

  • We do not want to transcribe a character (for example, when there is silence in the audio)
  • We need to separate consecutive identical characters that should not be collapsed (like the double L in HELLO)

CTC Decoding Example
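
A minimal sketch of this collapsing rule, assuming the BLANK symbol is written as “_” as in the figures:

```python
def ctc_collapse(predicted_chars: str, blank: str = "_") -> str:
    """Greedy CTC decoding: collapse consecutive duplicates, then drop BLANKs."""
    collapsed = []
    prev = None
    for ch in predicted_chars:
        if ch != prev:          # keep only the first character of each run of duplicates
            collapsed.append(ch)
        prev = ch
    return "".join(ch for ch in collapsed if ch != blank)

print(ctc_collapse("HHE_LL_LO"))   # -> "HELLO": the BLANK preserves the double L
```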

Note that the same final output could be decoded from multiple CTC encoded predictions (see figure below). We will come back to this in the next section.

Many legitimate alignments produce the same output

Why Is This Not an Optimal Decoding Solution?

The naive Argmax + CTC decoding algorithm described above is not an optimal solution for two reasons:

Problem 1

The most likely (argmax) output may not correspond to the most likely collapsed output string. Why?

Let’s assume that we have a decoder that can choose more character options per time step, rather than only the argmax character. Such a decoder will end up with more than one possible output sentence.

The table below shows an example of the best 3 outputs (decoded sentences) that such a decoder may yield, along with their output probabilities. Later sections will describe how these probabilities are calculated, but for now, let’s ignore that. In the example, the most probable decoded output is “AB_” (‘_’ stands for BLANK), which gives the sentence “AB” with a probability of 0.3. The second most probable output is “B_B”, which gives the sentence “BB” with a probability of 0.25, and the third most probable output also gives the sentence “BB”, with a probability of 0.1. What is the probability of getting the sentence “BB”? 0.25 + 0.1 = 0.35, which is larger than the probability of getting “AB”. This shows that even though “AB” is the argmax output, “BB” is the more likely output.

If the argmax output is not always the most probable, we should consider a less greedy decoding strategy, one that takes into account other options as well.
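
A small sketch of this idea using the numbers from the example above (the exact third path is hypothetical; the example only states that it also collapses to “BB”):

```python
from collections import defaultdict
from itertools import groupby

def ctc_collapse(path: str, blank: str = "_") -> str:
    # same rule as before: collapse consecutive duplicates, then drop BLANKs
    return "".join(ch for ch, _ in groupby(path) if ch != blank)

# the best 3 paths and their probabilities (the third path is illustrative)
beams = [("AB_", 0.30), ("B_B", 0.25), ("B_B_", 0.10)]

totals = defaultdict(float)
for path, prob in beams:
    totals[ctc_collapse(path)] += prob

print(dict(totals))   # {'AB': 0.3, 'BB': 0.35} -> "BB" is more likely than the argmax output "AB"
```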

Problem 2

Since the decoded output comes from an acoustic model, the output sentence may end up having:

  • Misspelled words
  • Something that is not a word (“phor” instead of “for”)
  • A word that sounds the same but has a completely different meaning (bear vs. bare, or knot vs. not)

Beam Search to The Rescue

To handle problem #1 described above, we could do the decoding in a better way.

The true solution would be:

  • Score every possible path in the character probabilities matrix
  • Combine the scores of equivalent paths.
  • Choose the path with the highest combined score as the final output.

This gives an exact solution. Great! But the number of possible paths is huge, and going over all of them is computationally infeasible. Also, in most cases, the majority of these paths will have a tiny probability.

As a fast approximate solution, we use Beam-Search. Beam-search is a (breadth-first) heuristic search algorithm that explores a graph by expanding the most promising nodes in a limited set.

  1. Start by taking the N best characters from the first time slice and keep their probabilities. These are the N initial beams.
  2. In the next time step, try to add a second character to each of the beams. Each extended beam is scored by multiplying the probabilities of its characters.
  3. If multiple paths arrive at the same output, collapse them into one and add their scores.
  4. Prune the number of beams kept down to N (according to their scores).
  5. Repeat these steps for each of the time steps.

A CTC Beam Search example for the characters {a,b} and a beam size of 3 (image taken from here)

For more on beam search with CTC: Sequence Modeling With CTC
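
These steps are implemented in off-the-shelf libraries. As one possible sketch, here is the pyctcdecode library applied to the processor and logits from the inference snippet above (the beam width is illustrative, and in practice the tokenizer’s special tokens such as the word delimiter may need small adjustments):

```python
from pyctcdecode import build_ctcdecoder

# The label list must follow the acoustic model's class-id order,
# so we take it from the fine-tuned model's tokenizer.
vocab = processor.tokenizer.get_vocab()
labels = [token for token, _ in sorted(vocab.items(), key=lambda kv: kv[1])]

decoder = build_ctcdecoder(labels)                        # beam search only, no language model yet
text = decoder.decode(logits[0].numpy(), beam_width=100)  # logits: (time_steps, vocab_size)
```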

Language Model as the Game Changer

As stated in problem #2 above, the output we have decoded so far comes from the acoustic model alone. This may cause errors such as decoded words that do not exist. To fix this, we would like to include more information about the language in our output: if we knew how likely a given output is in our language, we could lower the score of outputs that are less probable and raise the score of those that are more probable. In general, a language model takes a text and outputs a likelihood score for it.

To have a good language model, one would need to use a large corpus of text which is different from the audio + transcriptions dataset used to train the acoustic model.

A very common way to add a language model to the decoding process is with an N-grams language model. Using more advanced language models is beyond the scope of this blog.

In a generic language model, the probability of a sentence is the product of the probabilities of each word given all the words that came before it.

Sentence Probability in a Generic Language Model

In an N-gram model, the probability of each word depends only on the previous N-1 words instead of on all the previous words.

N-Grams Language Model

Example of a text’s probability according to a 2-Gram LM
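
Written out, the formulas in the figures above look roughly like this (a sketch of the standard formulation, with a 2-gram example sentence chosen for illustration):

```latex
% generic language model: chain rule over all previous words
P(w_1, \dots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \dots, w_{i-1})

% N-gram approximation: condition only on the previous N-1 words
P(w_i \mid w_1, \dots, w_{i-1}) \approx P(w_i \mid w_{i-N+1}, \dots, w_{i-1})

% 2-gram example
P(\text{the cat sat}) \approx P(\text{the}) \cdot P(\text{cat} \mid \text{the}) \cdot P(\text{sat} \mid \text{cat})
```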

How do we use the language model?

While decoding and performing the beam search described above, the score for each path takes into account both the acoustic model score (denoted as P-CTC) and the language model score (denoted as P-LM). The LM score is multiplied by a configurable weight parameter.

As desired, a very unlikely text will get a low LM probability compared to a likely text.
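
A sketch of adding a KenLM N-gram model to the pyctcdecode decoder from the previous section (the .arpa file path and the alpha/beta values are illustrative; alpha plays the role of the configurable LM weight mentioned above):

```python
from pyctcdecode import build_ctcdecoder

decoder_with_lm = build_ctcdecoder(
    labels,                         # same label list as before
    kenlm_model_path="5gram.arpa",  # hypothetical KenLM N-gram model trained on a large text corpus
    alpha=0.5,                      # weight of the language-model score
    beta=1.0,                       # score added per word (word-insertion bonus)
)
text = decoder_with_lm.decode(logits[0].numpy(), beam_width=100)
```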

Hot-Word Boosting

A pre-defined list of hot words can be prepared and introduced to the decoder. When calculating the score of a text, the number of hot words is counted and the text’s score is boosted according to this count.

This is helpful when we want to boost specific words for a particular domain, and also when some words are missing from the language model’s vocabulary. When words are missing, the language model could be retrained to include them, but retraining is not always feasible.

This hot-words probability (P-HW) is given a user-configurable weight (W-HW) and they are added to the output path probability:
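
In the blog’s notation (writing W-LM for the configurable LM weight from the previous section), the combined path score is then roughly a weighted sum of the three terms (a sketch; implementations typically work with log-probabilities):

Score = P-CTC + W-LM · P-LM + W-HW · P-HW

With pyctcdecode, hot-word boosting is a decoding-time option; the hot-word list and weight below are illustrative:

```python
hot_words = ["wav2vec", "kenlm", "pyctcdecode"]   # hypothetical domain-specific words

text = decoder_with_lm.decode(
    logits[0].numpy(),
    beam_width=100,
    hotwords=hot_words,      # pre-defined list of words to boost
    hotword_weight=10.0,     # W-HW: how strongly matched hot words boost a path's score
)
```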

Conclusion

To conclude, let's briefly go over the flow of decoding a Wav2Vec2 fine-tuned speech-to-text model:

  • The Wav2Vec2 model generates contextualized representations that are then used by the fine-tuned STT model.
  • The model is trained to produce CTC encoded texts.
  • The model outputs a matrix of probabilities per character in the vocabulary (and the BLANK CTC character).
  • A beam search is performed on this matrix. In each time slot, the texts from all the paths are collapsed in a CTC manner. The text for each path is scored (as described above) and the best N paths continue to the next time slot.
  • Finally, we end up with N scored paths and output the one with the highest score.

Note that the decoding process described above has no specific relation to Wav2Vec2, so it can be used with any other STT model that outputs a character probability matrix.
