From Raw Data to Accurate Speech Recognition (ASR): My Journey of Data Preparation

Antony M. Gitau
Feb 15, 2023


My journey started by visiting the Mozilla Common Voice project[1], a publicly available database of crowd-sourced voice datasets for speech recognition software. As of 2019, Common Voice was home to 39 languages with 2,500 hours of collected audio[2], a figure that has since increased more than tenfold, as shown in figure one.

Fig 1. Statistics of all contributions in Common Voice (from Common Voice)

I was particularly interested in the Swahili audio dataset and its corresponding transcripts, which contained a remarkable 86,000 sentences and 900 hours of audio. With over 300 hours of validated audio (see figure two), the dataset is ideal for fine-tuning a pre-trained model. Fine-tuning can be likened to tailoring a suit to fit a specific individual — just as a tailor takes measurements and uses their expertise to make adjustments for the perfect fit, fine-tuning adjusts a pre-trained model’s parameters to fit new data for improved performance in a specific task. Our research focuses on fine-tuning a model for Swahili speech recognition, which makes this dataset a valuable resource.

Fig 2. Swahili statistics (both diagrams are from Common Voice)

Join me on a journey as I share my experience of preparing Swahili data from Common Voice for speech recognition. Using the freely accessible Google Colab platform, I prepared the data, which is now ready for fine-tuning. However, this task was not without its challenges, and I'll share how I overcame the roadblocks.

Approach

Considering that low-resource languages are primarily spoken in low-resource regions, my approach to preparing Swahili data for speech recognition made use of freely accessible tools. This way, I hope to make the process more accessible to a wider audience.

1. Using the Coqui STT toolkit

The Coqui STT toolkit[3] uses deep learning to transcribe speech to text by predicting text from audio and checking it for spelling and grammar with a language model. While setting up the Coqui STT environment[4] on Google Colab, we encountered a challenge: Coqui STT relies on Docker containers to provide a consistent collection of software, dependencies, and environments from a pre-built Coqui STT Docker image. However, Google Colab does not support Docker natively, making it impossible to run the Coqui STT Docker image directly on Colab.

The next option was to try installing all the dependencies contained in the pre-built Coqui STT Docker image directly on Google Colab. We quickly ran into problems while trying to install TensorFlow 1.15.4, which Coqui STT runs on: on Google Colab, only TensorFlow 2.2.0 and later versions are available.

What next? Perhaps running the pre-built Coqui STT Docker image on a GPU-powered machine, or using a different toolkit that is compatible with Google Colab and does not require Docker containers. The former violates our quest to use freely accessible resources.

2. Using Hugging Face

So what other tools can we use to prepare data and train a Swahili ASR model using freely accessible resources? The Hugging Face Hub[5]? It's a platform for hosting and sharing pre-trained models, datasets, and demos of machine learning projects. It's home to over 120,000 models, 20,000 datasets, and 50,000 demonstration applications, all open source and accessible to the public.

So I navigated to the datasets page and typed "common voice" into the search bar. As you can see in figure three, an impressive 74 datasets appeared from that search.

Fig 3. The Hugging Face Datasets page.

I clicked on the topmost result, "mozilla-foundation/common_voice_11_0", which was also the most recently updated repository. My next search was for Swahili, and voilà! Figure four showed up. The dataset entries included an audio path, the actual audio, transcripts, and various metadata such as upvotes and downvotes, contributor age, and gender columns.

Fig 4. A screenshot of the first five rows of Swahili training data on the Hugging Face datasets page

For speech recognition, we only care about showing the model audio and the corresponding transcript. We will have to clean up the other columns and then transform the raw mp3 audio into numbers. Computers only understand numbers, remember? Those numbers will be packaged as arrays of floating-point values that represent the amplitude of the audio, which makes it possible for the model to recognize different characters and words based on the variations in amplitude.

Now that we have found some Swahili data, we want to load it into a notebook and prepare it for fine-tuning a pre-trained model.

With three lines of code, the training and testing sets are loaded on Google Colab. This was after installing the Hugging Face datasets library[6]. Figure five shows a snippet, and a rough sketch of those lines follows the figure.

Fig 5. A code snippet of loading Swahili data from Hugging Face into Colab
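
For readers who want to follow along, here is a minimal sketch of what that loading step can look like, assuming the mozilla-foundation/common_voice_11_0 repository and its Swahili ("sw") configuration; the exact splits and authentication details may differ from what the screenshot shows.

```python
# A sketch of loading the Swahili Common Voice splits (assumes the "sw" config;
# the dataset is gated on the Hub, so accepting its terms and providing an
# access token may be required).
from datasets import load_dataset

common_voice_train = load_dataset("mozilla-foundation/common_voice_11_0", "sw", split="train")
common_voice_test = load_dataset("mozilla-foundation/common_voice_11_0", "sw", split="test")
```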

As we can see in figure six, the data-loading operation was successful: we have over 26k training entries and over 10k testing entries, with all the columns we saw on the Hugging Face dataset hub.

Fig 6. A code snippet showing the properties of the training and testing sets.

As we agreed earlier, we only need the audio and the corresponding transcripts for speech recognition. The gender, age, and accent of the contributors are not essential for our task, so we removed those columns using the remove_columns method from the Hugging Face datasets library, as shown in figure seven.

Fig 7. Dropping the columns we do not require for fine-tuning a pre-trained model
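
A rough sketch of that clean-up step is shown below; the column names are assumptions based on the Common Voice 11.0 dataset card and should be checked against your own copy of the data.

```python
# Drop the metadata columns we do not need (names assumed from the CV 11.0 card),
# keeping the audio, path, and sentence columns.
columns_to_drop = ["accent", "age", "client_id", "down_votes",
                   "gender", "locale", "segment", "up_votes"]
common_voice_train = common_voice_train.remove_columns(columns_to_drop)
common_voice_test = common_voice_test.remove_columns(columns_to_drop)
```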

We now have only the three columns we care about, as we can see in figure eight.

Fig 8. We only have the audio, the paths, and the transcripts

After dropping the unnecessary columns, we generated ten random sentences from the training set to see what they look like. Figure nine provides a screenshot.

Fig 9. Ten random sentences from the training set
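
The snippet behind that figure can be as simple as sampling a few rows and printing their transcripts; the sketch below assumes the transcript column is named "sentence", as it is in Common Voice.

```python
# Print ten random transcripts from the training set as a quick sanity check.
import random

for idx in random.sample(range(len(common_voice_train)), 10):
    print(common_voice_train[idx]["sentence"])
```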

We observed a mix of capital and small letters in our training set, along with full stops and other punctuation marks. However, uppercase and lowercase letters sound the same, and punctuation marks have no pronunciation for the model to learn. Thus, we converted all characters in both sets to lowercase and removed the circumflexed ("hatted") characters that were surprisingly present alongside the punctuation marks. After cleaning, we were left with only 26 distinct characters across the training and testing sets, as shown in figure ten.

Fig 10. Distinct characters from the training and testing sets.
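
The clean-up behind figure ten can be sketched as follows. The regular expression is an assumption on my part: it lists the punctuation and circumflexed characters we happened to notice, and should be adjusted to whatever actually appears in the data.

```python
import re

# assumption: the punctuation and "hatted" characters observed in the transcripts
chars_to_remove = re.compile(r"[,\?\.\!\-\;\:\"\'\“\”\%âêîôû]")

def clean_sentence(example):
    # lowercase first, then strip the unwanted characters
    example["sentence"] = chars_to_remove.sub("", example["sentence"].lower())
    return example

common_voice_train = common_voice_train.map(clean_sentence)
common_voice_test = common_voice_test.map(clean_sentence)

# collect the distinct characters that remain (the space should be among them)
vocab = set()
for ds in (common_voice_train, common_voice_test):
    for sentence in ds["sentence"]:
        vocab.update(sentence)
print(sorted(vocab))
```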

As you can see in figure ten, the space character is kept because we want the model to learn that words are separated by spaces. With the vocabulary ready, we converted it to a JSON file and loaded it into the XLS-R Wav2Vec2 CTC tokenizer. XLS-R Wav2Vec2[7] is a speech model pre-trained with unsupervised cross-lingual representation learning, and it will be the first model we fine-tune for the Swahili speech recognition task. The tokenizer prepares text inputs for the model by splitting sentences into smaller units called tokens. The Wav2Vec2 feature extractor, on the other hand, converts raw audio into arrays that the model accepts as inputs. Figure 11 shows how we initialized the Wav2Vec2 CTC tokenizer[8] and the Wav2Vec2 feature extractor[9].

Fig 11. Initializing the Wav2Vec2 CTC tokenizer and Wav2Vec2 feature extractor
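
In code, that initialization can look roughly like the sketch below: write the character set to vocab.json, then build the tokenizer and feature extractor. The special-token choices ([UNK], [PAD], and "|" as the word delimiter) follow the usual Wav2Vec2 convention and are assumptions rather than something shown in the screenshot.

```python
import json
from transformers import Wav2Vec2CTCTokenizer, Wav2Vec2FeatureExtractor

# build a character-to-id mapping from the vocabulary gathered earlier
vocab_dict = {char: idx for idx, char in enumerate(sorted(vocab))}
vocab_dict["|"] = vocab_dict.pop(" ")   # make the word delimiter an explicit token
vocab_dict["[UNK]"] = len(vocab_dict)   # unknown-character token
vocab_dict["[PAD]"] = len(vocab_dict)   # padding token used by CTC

with open("vocab.json", "w") as f:
    json.dump(vocab_dict, f)

tokenizer = Wav2Vec2CTCTokenizer("./vocab.json", unk_token="[UNK]",
                                 pad_token="[PAD]", word_delimiter_token="|")
feature_extractor = Wav2Vec2FeatureExtractor(feature_size=1, sampling_rate=16000,
                                             padding_value=0.0, do_normalize=True,
                                             return_attention_mask=True)
```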

To process the audio data, we first needed to downsample all the clips to 16 kHz, because the clips in the Common Voice dataset have a sampling rate of 48 kHz. To achieve this, we used the cast_column function together with the Audio feature from the Hugging Face datasets library[10], as shown in figure 12.

Fig 12. Downsampling the audio clips to 16 kHz.
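
The downsampling itself is a one-line cast per split, sketched below; the clips are then decoded at 16 kHz whenever they are accessed.

```python
from datasets import Audio

# decode every clip at 16 kHz instead of the original 48 kHz
common_voice_train = common_voice_train.cast_column("audio", Audio(sampling_rate=16_000))
common_voice_test = common_voice_test.cast_column("audio", Audio(sampling_rate=16_000))
```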

To streamline the data preparation process, we can wrap the feature extractor and tokenizer into a single Wav2Vec2Processor class provided by Hugging Face, as shown in figure 13.

Fig 13. Wrapping the Wav2Vec2 feature extractor and Wav2Vec2 CTC tokenizer into a single processor.
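
Wrapping the two objects is a single constructor call:

```python
from transformers import Wav2Vec2Processor

# one object that exposes both the feature extractor (for audio) and the tokenizer (for text)
processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)
```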

Wav2Vec2Processor is responsible for processing the data into the format expected by Wav2Vec2ForCTC for training, that is, 1-dimensional floating-point arrays sampled at 16 kHz. The data preparation process ends with running the Wav2Vec2Processor[11] on the training data, extracting the input values (the arrays that represent the amplitude) from the audio and encoding the target labels (the sentences), as shown in the code snippet in figure 14. The same function will be used in the training loop to preprocess each batch of data as it is loaded into the model during training.

Fig 14. Processing the training data using the Wav2Vec2 CTC tokenizer and Wav2Vec2 feature extractor
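
A sketch of that final step, assuming the column names used throughout this post and a recent transformers version where the processor accepts a text= argument, might look like this:

```python
def prepare_dataset(batch):
    audio = batch["audio"]
    # the 16 kHz amplitude array becomes the model input
    batch["input_values"] = processor(audio["array"],
                                      sampling_rate=audio["sampling_rate"]).input_values[0]
    # the cleaned transcript becomes the CTC target labels
    batch["labels"] = processor(text=batch["sentence"]).input_ids
    return batch

common_voice_train = common_voice_train.map(
    prepare_dataset, remove_columns=common_voice_train.column_names)
common_voice_test = common_voice_test.map(
    prepare_dataset, remove_columns=common_voice_test.column_names)
```

After this mapping, each example carries only input_values and labels, which is exactly what the training loop will batch and feed to the model.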

Next steps

  1. Fine-tuning the XLS-R Wav2Vec2 model[7] using the Swahili data we just prepared.
  2. Followed by creating an interface for the fine-tuned model using either Streamlit[12] or Gradio[13].
  3. Then fine-tuning our model to recognize healthcare domain speech using the radio corpus collected by Marconi Lab from healthcare shows.
  4. Finally, trying out Whisper[14], a more recent general-purpose speech recognition model.

We hope you found this article helpful. We welcome your feedback and encourage you to share any insights or alternative approaches you may have used.

References:

  1. Mozilla Common Voice project page
  2. Common Voice
  3. Coqui STT documentation
  4. Setting up Coqui STT environment for transfer learning and fine-tuning
  5. Hugging Face Hub
  6. Hugging Face Dataset Library
  7. XLS-R Wav2Vec2 transformer documentation on Hugging Face docs
  8. Wav2Vec2CTC tokenizer docs
  9. Wav2Vec2Feature extractor docs
  10. Audio feature in Hugging Face Datasets Library
  11. Wav2Vec2Processor
  12. Streamlit spaces on Hugging Face
  13. Gradio spaces on Hugging Face
  14. Whisper Model
