Inference with Wav2vec 2.0

Georgian Impact Blog · Mar 24, 2021

By Zilun Peng, Akshay Budhkar, Jumana Nassour, Ilana Tuil and Jason Levy

We talked about wav2vec 2.0 in our first post in this series, and in our second post we showed how to compress wav2vec 2.0 to increase inference speed. To round out the series, in this post we’ll show you how to perform inference with wav2vec 2.0.

We’ll start by walking you through the code of a Viterbi decoder to decode wav2vec 2.0. Then, we’ll compare the Viterbi decoder with the beam search decoder. We’ll also describe how to run inference efficiently using Ray, a distributed computing framework. Ray parallelizes inference tasks across multiple CPU cores, making inference much more efficient. Finally, we’ll show how using Ray on top of knowledge distillation gives a total 6x increase in wav2vec 2.0 inference speed.

Decoder and wav2letter

In our previous post, we showed you how wav2vec 2.0 and a decoder work together in a speech recognition system. Here, we’ll look at the Viterbi decoder and show you how to use one.

The Viterbi decoder finds the most likely token sequence given the token probability distributions that wav2vec 2.0 outputs. A token can be a character or a sentence boundary.

In our previous post, we passed the output from wav2vec 2.0, emissions, into the decode method of the decoder, like this:
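The original code snippet is not reproduced here, but the call itself is a one-liner. A minimal sketch, assuming decoder is the Viterbi decoder object described below and emissions is the tensor of per-token probabilities produced by wav2vec 2.0:

```python
# Hedged sketch: pass the wav2vec 2.0 output (emissions) to the decoder.
decoded_results = decoder.decode(emissions)
```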

Before showing you what happens inside the decode function, we import the methods we need from wav2letter.
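A sketch of those imports is below. The module path follows the older wav2letter Python bindings used by fairseq’s speech recognition examples; newer releases ship these utilities under the flashlight package, so the path may differ in your environment.

```python
# Viterbi decoding utilities from the wav2letter Python bindings
# (path may differ in newer, flashlight-based releases).
from wav2letter.criterion import CpuViterbiPath, get_data_ptr_as_bytes
```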

We explain CpuViterbiPath and get_data_ptr_as_bytes when we use them below. Now, let’s dive into the decode method!
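Since the original gist is not reproduced here, below is a hedged reconstruction of what such a decode method can look like, loosely following fairseq’s W2lViterbiDecoder. The line numbers in the walkthrough refer to the original notebook’s version, so they map only approximately onto this sketch; the inline comments mark the corresponding steps.

```python
import torch
# CpuViterbiPath and get_data_ptr_as_bytes come from the wav2letter imports above.

def decode(self, emissions):
    # "Line 2" in the walkthrough: batch size (B), output length (T)
    # and number of tokens (N) from the emissions tensor.
    B, T, N = emissions.size()

    # "Line 4": transition probabilities between tokens. All zeros here,
    # so the Viterbi decoder gets no transition information.
    transitions = torch.FloatTensor(N, N).zero_()

    # "Line 5": tensor that will hold the decoded token indices.
    viterbi_path = torch.IntTensor(B, T)

    # "Line 6": contiguous workspace buffer for the decoder's internal
    # arrays, sized by CpuViterbiPath.get_workspace_size.
    workspace = torch.ByteTensor(CpuViterbiPath.get_workspace_size(B, T, N))

    # "Line 8": run the Viterbi algorithm in C++, passing raw pointers
    # to the tensors created above.
    CpuViterbiPath.compute(
        B, T, N,
        get_data_ptr_as_bytes(emissions),
        get_data_ptr_as_bytes(transitions),
        get_data_ptr_as_bytes(viterbi_path),
        get_data_ptr_as_bytes(workspace),
    )

    # "Line 18": post-process every decoded sequence in the batch.
    return [self.get_tokens(viterbi_path[b].tolist()) for b in range(B)]
```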

In line 2, we get the dimensions of emissions. B is the batch size, i.e. the number of data samples we pass to the decoder in one iteration. T is the length of the output representation from wav2vec 2.0, and N is the number of tokens, 32 in our case.

In line 4, we create transitions, a matrix containing transition probabilities between tokens. We use a zero matrix here, so we’re not giving this information to the Viterbi decoder.

In line 5, we create viterbi_path. This tensor stores the results the decoder returns.

In line 6, we create workspace, a contiguous block of memory for the arrays the Viterbi decoder uses internally. Its size comes from calling CpuViterbiPath.get_workspace_size(B, T, N).

Now, we’re ready to decode. In line 8, we call CpuViterbiPath.compute. This method runs the Viterbi algorithm and returns the most likely token sequence. Note that we call get_data_ptr_as_bytes on the tensors we created earlier. This method returns pointers to those tensors. By calling CpuViterbiPath.compute, we pass these pointers to the C++ method which implements the Viterbi algorithm.

In line 18, we do some post-processing on the decoded sequence (viterbi_path) by calling self.get_tokens to remove the unnecessary blank tokens. We do this for every decoded sequence in the batch.

The code in this section is here and we used the decode method in this notebook.

That’s it! Now you have a good understanding of how we actually convert the output of wav2vec 2.0 into text using the Viterbi decoder. The Viterbi decoder is not the only decoder choice: wav2vec 2.0’s authors use a beam search decoder. In the next section, we’ll compare the beam search decoder and Viterbi decoder.

Beam search decoder and language models

Wav2vec 2.0’s authors used a beam search decoder, but how is it different from a Viterbi decoder? In a Viterbi decoder, only the most likely token is saved and considered to decode the next token. The beam search decoder looks at k probable tokens, where k is the beam size specified by the user. As a result, the beam search decoder outputs k probable text sequences.

How do we know which decoded sequence is best? This is where language models (LM) come into play. Wav2vec 2.0’s authors used an n-gram LM and a transformer LM. The n-gram LM learns conditional word probabilities by counting their occurrences in a corpus. The transformer LM has a multi-head attention mechanism and linear layers, and is trained on a huge corpus. Multi-head attention helps the model focus on words at different positions in a sentence. Both the n-gram LM and the transformer LM are capable of evaluating the likelihood of a sentence.

What is distributed inference?

In our previous post on compressing wav2vec 2.0, we introduced knowledge distillation and showed that a distilled student model is at least twice as fast as the original wav2vec 2.0 model. We can further increase a student model’s inference speed using distributed inference.

Let’s start by defining an inference task: feeding a speech audio waveform into the ASR system and getting back the transcribed text. It is a waste of computing resources for the ASR system to perform these tasks sequentially, because we don’t need to wait for the result of one audio waveform before starting on another. We use distributed inference to perform multiple inference tasks simultaneously and make full use of all available computing resources.

Distributed inference with Ray

Ray is an open source distributed execution framework. The figure below shows a set of inference tasks. In each task, we convert raw audio waveforms into text. We distribute these tasks to multiple CPU cores using Ray.

We created a data loader for retrieving audio waveforms in this post, and we repeat the same step here.

In the rest of this section, we’ll show you how to do distributed inference with Ray.

First, import and launch Ray.
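A minimal sketch; the num_cpus value is just an example (by default, Ray uses all available cores):

```python
import ray

# Start Ray locally. num_cpus caps how many CPU cores Ray may use
# (we used 30 of the machine's 48 cores in our experiments).
ray.init(num_cpus=30)
```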

Next, we tell Ray which part of the code we want to parallelize.
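A sketch of what that declaration can look like. Here, process_data_sample is the existing (non-distributed) inference function from our earlier notebook, so the exact signature shown is an assumption:

```python
@ray.remote
def remote_process_data_sample(batch, model, decoder, target_dict):
    # Wrap the regular inference function so Ray can schedule it as a task
    # on any available CPU core. process_data_sample (defined elsewhere)
    # runs the encoder and decoder on one audio waveform and returns text.
    return process_data_sample(batch, model, decoder, target_dict)
```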

remote_process_data_sample is declared with @ray.remote. Ray treats it as a task and distributes tasks to different CPU cores at run time.

Inside remote_process_data_sample, process_data_sample feeds raw audio waveform (batch) into the encoder (model). The output from the encoder is fed into the decoder, and the result is the transcribed text.

process_data_sample also takes in target_dict, a map from tokens to indices, which is used to process the decoder output. We explain this in more detail in our previous post on speech processing.

We run inference tasks in parallel processes, and each audio waveform passes through the encoder (model) and then the decoder (decoder). We use ray.put to place the encoder and decoder in shared memory managed by Ray. This saves memory because all worker processes share these two objects instead of each holding its own copy.
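For example (a sketch, assuming model and decoder are the loaded encoder and Viterbi decoder objects):

```python
# Place the encoder and decoder in Ray's shared object store once,
# so every worker process reads the same copy instead of duplicating it.
model_id = ray.put(model)
decoder_id = ray.put(decoder)
```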

Now, let’s create a set of inference tasks and start the distributed inference!
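A sketch of that loop, assuming data_loader is the data loader created above:

```python
# Launch one Ray task per batch. Each .remote() call returns immediately
# with a future; the actual work runs in parallel across the CPU cores.
prediction_futures = []
for batch in data_loader:
    prediction_futures.append(
        remote_process_data_sample.remote(batch, model_id, decoder_id, target_dict)
    )
```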

In the code above, we get every data sample from the data loader. batch contains the audio waveform and the ground truth transcribed text. We pass the data sample (batch), references to the encoder (model_id) and decoder (decoder_id), and target_dict into remote_process_data_sample, defined earlier. remote_process_data_sample does not block, and we immediately get back a future object. Later, we use these future objects to retrieve the inference results.

```python
predictions = ray.get(prediction_futures)
```

In the code above, we retrieve predictions by passing future objects to ray.get.

Now that we have the predictions, we calculate prediction quality by word error rate (WER), using the jiwer package.

We first import wer from jiwer, then get the WER score by passing both ground_truths and predictions to wer.
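A sketch, assuming ground_truths is the list of reference transcriptions collected from the data loader:

```python
from jiwer import wer

# Word error rate over the whole evaluation set: lower is better.
error_rate = wer(ground_truths, predictions)
print(f"WER: {error_rate:.3f}")
```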

Check out this notebook if you are interested in distributing inference using Ray.

Distributed inference results

Let’s look at some results after distributing inference tasks with Ray.

Let’s look at two models here: wav2vec_big_960h and a student wav2vec 2.0 model. wav2vec_big_960h is the original wav2vec 2.0 model we talked about in our previous post. The student wav2vec 2.0 model is smaller than the original model, and we obtained it through knowledge distillation. Again, you can read more about it here.

The student model’s inference time should be faster than wav2vec_big_960h’s, because it’s smaller. As the first two rows of the table show, it’s actually 2.9 times faster than wav2vec_big_960h. Note that for the first two rows, we ran inference on the batches sequentially using PyTorch’s default CPU inference settings. When we distribute the inference tasks using Ray, as the third row shows, the student model’s inference is six times faster than the original model’s. Comparing the second and third rows, we can see that using Ray to distribute inference is twice as fast as using PyTorch’s default inference setting. This is partly because we are using batches of size one. For more information, see the PyTorch documentation on inference and CPU threading.

We have seen inference results on the entire dataset, which consists of 2703 data samples. If the task is to transcribe one speech audio waveform, then distributing inference using Ray is not as efficient as running inference in PyTorch. It would be interesting to conduct a more thorough comparison between the two frameworks using different batch sizes and tweaking PyTorch’s inference settings.

The experiments above were conducted on a 48 CPU core machine. We faced some problems trying to configure Ray to work with all 48 cores, so we set it to use 30 cores instead.

Wrapping up

In this blog post, we showed you how to use a Viterbi decoder to convert the output of wav2vec 2.0 into text. In our previous post, we saw that you can compress the wav2vec 2.0 model to make it run faster. Now you have seen that, when running inference over many input examples, distributed inference makes wav2vec 2.0 even faster.

About Georgian R&D

Georgian is a fintech that invests in high-growth software companies.

At Georgian, the R&D team works on building our platform that identifies and accelerates the best growth stage software companies. As part of this work, we take the latest AI research and use it to help solve the business challenges of the companies where we are investors. We then create reusable toolkits so that it’s easier for our other companies to adopt these techniques.

We wrote this series of posts after an engagement where we collaborated closely with the team at Chorus. Chorus is a conversation intelligence platform that uses AI to analyze sales calls to drive team performance.

Take a look at our open opportunities if you’re interested in a career at Georgian.
