Exploring ASR Model Development: Fine-Tuning XLS-R Wav2Vec2 Model with Swahili Data

Antony M. Gitau
Mar 24, 2023

So you’ve just finished preparing your audio clips and accompanying transcripts for speech recognition model training. You now want to dive into developing a first-version model. I was at this point a few weeks ago, and I had to make some decisions that led me to successfully publish my first fine-tuned Swahili speech recognition model on the Hugging Face Hub[1].


In this blog post, I’ll be sharing my journey of fine-tuning a Swahili speech recognition model and publishing it on the Hugging Face hub. I’ll walk you through my thought process, methodology, and lessons learned so that you can benefit from my experience and save time in your own work.

What options exist in ASR model development?

1. You can train the model from scratch.

This means that you start by defining the model architecture, which involves selecting the type and number of layers, the activation functions, and the connections between the layers. You also need to define the loss function, which measures the error between the predicted output and the true output, and the optimizer, which adjusts the weights and biases of the model to minimize the loss.

— What is the challenge with this approach?

Training models from scratch is relatively rare due to the significant computational and time requirements involved. A few notable examples of models that have been trained from scratch and have achieved impressive results include:

  1. The original Transformer model[2] was trained on a large corpus of text data for the task of neural machine translation. It introduced a novel self-attention mechanism that allows the model to capture long-range dependencies.
  2. The BERT model[3] (Bidirectional Encoder Representations from Transformers) was trained on a massive corpus of text data using an unsupervised technique of masking some words in a sentence and expecting the model to predict them. For example, in the sentence “I want to [MASK] a taxi”, the model would be required to predict the missing word “order”. This technique is called masked language modeling, and it allows the model to learn from large amounts of unlabeled text data.
  3. The GPT-2 model[4] (Generative Pre-trained Transformer 2) was trained on a large corpus of text data using a language modeling objective, which involves training a model to estimate the probability distribution of the next word in a sequence given the preceding words. For example, given the sentence “The cat is sitting on the…”, the language model would predict the probability distribution of the next word, which could be “table”, “floor”, “chair”, or some other word in the language. This model can generate coherent and realistic text that is often difficult to distinguish from human-written text.

Since such models have achieved state-of-the-art performance, especially on natural language tasks, and are publicly available for free, it made more sense to make use of them.

2. Fine-tune a pre-trained model (PTM).

Fine-tuning is a specific type of transfer learning where a pre-trained model is adapted to a particular task (in our case, speech recognition) by training it on a smaller, task-specific dataset (Swahili clips and transcripts in this case). The idea is that the pre-trained model has already learned general features and patterns that can be useful for the new task. This is a plausible idea because PTMs, thanks to their large-scale nature, can effectively capture knowledge from massive amounts of labeled and unlabeled data. Additionally, substantial work on fine-tuning PTMs has been validated through experimental verification and empirical analysis. This is why we went for this methodology.

— The benefits of PTMs

PTMs presented us with three main advantages that led us to favor this approach:

  1. Reduced training time: We wanted to develop a simple first model that we could improve on.
  2. Improved performance: Pre-trained models have already learned general features and patterns of language from large datasets and attained state-of-the-art results, so they could help us achieve better Swahili speech recognition performance than starting from scratch.
  3. Lower data requirements: To fine-tune an already trained model, we needed far less data than we would to train from scratch.

Fine-tuning the XLS-R Wav2Vec2 model

After completing the preparation of the Swahili clips and transcripts from Common Voice[5], we embarked on the fine-tuning exercise. Our choice of this model, Wav2Vec2-XLS-R-300M[6], was based on three factors:

  1. The pre-training task. The Wav2Vec2-XLS-R-300M model is a large-scale multilingual model pre-trained on speech.
  2. Model architecture. The model is based on wav2vec 2.0[7], which showed that fine-tuning on just one hour of labeled data outperformed the previous state of the art trained on the 100-hour subset. The architecture is shown in figures one, two, and three.
  3. Corpora used for training. The model was pre-trained on 436k hours of unlabeled speech covering 128 languages, including Swahili.
Fig 1. From the wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations Paper [7]
Fig 2. From the XLS-R: Self-supervised Cross-Lingual Speech Representation Learning at Scale Paper 2021[8]
Fig 3. From the Unsupervised Cross-lingual Representation Learning for Speech Recognition Paper 2020[9]

According to the authors[8, 9], to fine-tune the model we need to add a classifier on top of it representing the output vocabulary of the downstream task (in this case, a Swahili vocabulary that we generated from the training set), and then train on the labeled data with a Connectionist Temporal Classification (CTC) loss[10].
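For concreteness, here is a minimal sketch of how such a character-level vocabulary and CTC tokenizer can be built with the Transformers library. The train_transcripts variable is an assumption standing in for our prepared training transcripts, not our exact code.

# A rough sketch of building a character vocabulary and CTC tokenizer.
# `train_transcripts` is an assumed list of Swahili transcript strings.
import json
from transformers import Wav2Vec2CTCTokenizer

chars = sorted(set("".join(train_transcripts)))
vocab = {c: i for i, c in enumerate(chars)}

# CTC decoding needs a word delimiter plus unknown and padding tokens.
vocab["|"] = vocab.pop(" ", len(vocab))  # use "|" in place of the space character
vocab["[UNK]"] = len(vocab)
vocab["[PAD]"] = len(vocab)

with open("vocab.json", "w") as f:
    json.dump(vocab, f)

tokenizer = Wav2Vec2CTCTokenizer(
    "vocab.json",
    unk_token="[UNK]",
    pad_token="[PAD]",
    word_delimiter_token="|",
)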

Fine-tuning a model hosted on Hugging Face requires defining four main components: a data collator, a metric function, the training arguments, and the pre-trained model checkpoint. These are then passed to a Trainer class. The Trainer class[11] is optimized for Hugging Face Transformers models and offers an interface for feature-complete training in PyTorch.

The data collator

This function is used for padding input sequences and labels to the same length. Since the input and output have different modalities — the input being audio and the output being text — we employ different padding strategies for them. For input sequences, we pad to the longest sequence in the batch, or no padding is done if only a single sequence is provided. For instance, if we have a batch of audio input sequences with different lengths such as:

batch = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6, 0.7],
    [0.8, 0.9]
]

We will pad the sequences to have the same length as the longest sequence in the batch, resulting in a padded batch like this:

padded_batch = [
    [0.1, 0.2, 0.3, 0.0],
    [0.4, 0.5, 0.6, 0.7],
    [0.8, 0.9, 0.0, 0.0]
]

Then, we pad the labels to the longest label in the batch and mask the label padding so that it is ignored by the loss. We adopted the data collator function[12] from the Hugging Face Transformers repository on GitHub; a condensed sketch is shown below.
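The sketch mirrors the CTC data collator in the Transformers examples[12] and may differ in small details from the code we actually ran.

# Condensed sketch of the CTC data collator, following the Transformers example [12].
from dataclasses import dataclass
from typing import Dict, List, Union

import torch
from transformers import Wav2Vec2Processor

@dataclass
class DataCollatorCTCWithPadding:
    # Pads audio inputs and text labels separately, since the two
    # modalities have unrelated lengths.
    processor: Wav2Vec2Processor
    padding: Union[bool, str] = True

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        input_features = [{"input_values": f["input_values"]} for f in features]
        label_features = [{"input_ids": f["labels"]} for f in features]

        batch = self.processor.pad(input_features, padding=self.padding, return_tensors="pt")
        labels_batch = self.processor.pad(labels=label_features, padding=self.padding, return_tensors="pt")

        # Mask label padding with -100 so it is ignored by the CTC loss.
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)
        batch["labels"] = labels
        return batch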

Model checkpoints

We now load the pre-trained checkpoint of Wav2Vec2-XLS-R-300M. Its lower layers, as you can see in the model architectures in figures one, two, and three, are responsible for extracting acoustic features from the audio input and have already been trained with useful information. As a result, it’s not necessary to modify these layers during training, and we can keep them frozen to save computation time and memory. In doing so, we only update the higher layers and the Connectionist Temporal Classification (CTC) head, which is responsible for the transcription.
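A minimal sketch of that step is shown below. The keyword arguments and the processor object are assumptions based on the standard Transformers CTC setup, not a verbatim copy of our notebook.

# Sketch: load the pre-trained multilingual checkpoint and attach a CTC head
# sized to the Swahili vocabulary.
from transformers import Wav2Vec2ForCTC

model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-xls-r-300m",
    ctc_loss_reduction="mean",
    pad_token_id=processor.tokenizer.pad_token_id,  # `processor` wraps the tokenizer and feature extractor
    vocab_size=len(processor.tokenizer),
)

# Keep the convolutional feature encoder (the lower layers) frozen;
# only the transformer layers and the CTC head are updated.
model.freeze_feature_encoder()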

Training parameters

The training arguments are a set of hyperparameters that determine how the training process is executed. In this case, the training arguments were inspired by both Patrick’s blog[13] and a Swahili fine-tuned model[14]. The specific parameters can be found on our model card[1] on the Hugging Face Hub.
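For illustration, the snippet below shows the shape of these arguments with placeholder values in the spirit of Patrick’s blog[13]; the exact hyperparameters we used are on the model card[1].

# Illustrative training arguments; treat the numbers as placeholders.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="wav2vec2-xls-r-300m-swahili",  # placeholder repository name
    group_by_length=True,                      # batch clips of similar length together
    per_device_train_batch_size=16,
    gradient_accumulation_steps=2,
    evaluation_strategy="steps",
    num_train_epochs=30,
    fp16=True,                                 # mixed precision to fit limited GPU memory
    save_steps=400,
    eval_steps=400,
    logging_steps=400,
    learning_rate=3e-4,
    warmup_steps=500,
    save_total_limit=2,
    push_to_hub=True,
)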

Metric function

This function calculates the word error rate (WER) by comparing the predicted output string with the true label string using the WER formula: (S+D+I)/N, where S represents the number of substitutions, D represents the number of deletions, I represents the number of insertions, and N represents the total number of words in the reference.
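In code, this is a small compute_metrics function. The sketch below assumes the processor object from the data collator step and uses the evaluate library’s WER implementation; it mirrors the standard Transformers CTC example rather than being a verbatim copy of our notebook.

# Sketch of the metric function: decode predictions and references, then compute WER.
import numpy as np
import evaluate

wer_metric = evaluate.load("wer")

def compute_metrics(pred):
    # Pick the most likely token at every time step.
    pred_ids = np.argmax(pred.predictions, axis=-1)

    # Restore the padding token so the reference labels decode back to text.
    pred.label_ids[pred.label_ids == -100] = processor.tokenizer.pad_token_id

    pred_str = processor.batch_decode(pred_ids)
    label_str = processor.batch_decode(pred.label_ids, group_tokens=False)

    # WER = (S + D + I) / N
    return {"wer": wer_metric.compute(predictions=pred_str, references=label_str)}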

Trainer

After defining the necessary functions, the next step is to instantiate the trainer. The trainer takes in the previously defined components and begins the fine-tuning process upon calling its train function; a minimal sketch of how everything is wired together is shown below. Once fine-tuning is complete, the resulting model is pushed to the Hugging Face Hub for others to access and use. The entire notebook we used to fine-tune can be found on GitHub[15], and figure 4 shows the training and validation loss during training.
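Here is roughly how the pieces fit together. The train_dataset and eval_dataset names stand in for our prepared Common Voice splits and are not the literal variable names from the notebook.

# Sketch: wire the components into the Trainer and start fine-tuning.
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=DataCollatorCTCWithPadding(processor=processor),
    compute_metrics=compute_metrics,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=processor.feature_extractor,  # saved alongside the model
)

trainer.train()
trainer.push_to_hub()  # publish the fine-tuned model to the Hugging Face Hub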

Fig 4. Trainer Output during Training

Evaluation

The evaluation of the Swahili speech recognition model yielded a loss of 1.2834 and a word error rate of 0.5834 on the evaluation set. Subsequently, we conducted more tests on the model using the hosted inference API on Hugging Face. While the model was able to output a readable sentence, it was not without errors. A notable one was mistranscribing the word “shilingi” as “shilingimu”. Additionally, the model failed to recognize that “namna” is a single word and instead output it as “na mna”.

Fig 5. Testing the model on the Hugging Face Inference API
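If you prefer to test the model locally rather than through the hosted widget, a pipeline call along these lines should work. The model id below is a placeholder for the repository named on our model card[1], and the audio file path is just an example.

# Sketch: local inference with the published model.
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="<our-swahili-model-id-on-the-hub>",  # placeholder: see the model card [1]
)
print(asr("swahili_clip.wav")["text"])  # example audio file path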

Although the model performed well overall, these errors indicate that there is still room for improvement. In particular, we will focus on fine-tuning the model with more Swahili data and incorporating a language model to improve decoding accuracy.

Conclusion

We have successfully developed a Swahili speech recognition model by fine-tuning the Wav2Vec2-XLS-R-300M model and publicly hosting it on the Hugging Face Hub. Our model was trained on 3,726 audio clips and their transcripts, achieving a loss of 1.2834 and a word error rate of 0.5834 on the evaluation set. The entire model development took place in a Google Colab notebook, whose freely available version offers limited computing power of roughly 12 GB of RAM and 78 GB of disk space.

However, we recognize that there is still room for improvement and plan to add more Swahili data for training and experiment with language models to enhance our model’s decoding accuracy. We welcome any suggestions or alternative approaches to further improve our model.

References

  1. Our fine-tuned Swahili speech model on the Hugging Face Hub.
  2. Attention is all you need paper
  3. BERT paper
  4. GPT-2 paper
  5. Our journey of data preparation blog
  6. Facebook’s wav2vec2 XLS-R 300 million parameters model on Hugging Face Hub.
  7. wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations paper.
  8. Self-supervised Cross-Lingual Speech Representation Learning at Scale paper
  9. Unsupervised Cross-lingual Representation Learning for Speech Recognition paper
  10. Sequence Modeling with CTC blog
  11. Trainer class documentation on Hugging Face Docs
  12. Data collator function code
  13. Fine-tuning xls-r wav2vec model blog
  14. Swahili fine-tuned model
  15. Our fine-tuning notebook

Other resources

  1. Pre-trained Models for Natural Language Processing: A Survey
  2. Pre-trained models: Past, present and future
