African Language Speech Recognition

Rafaa Ahmed · Published in CodeX · Jun 10, 2022

If you think about the easiest way to communicate with the people around you, the spontaneous answer is speaking! As the world develops, so do our ways of communicating, and machines have become one of the most powerful tools in that process.

Automatic Speech Recognition (ASR) is the technology that allows human beings to use their voices to speak with a computer interface in a way that resembles normal human conversation, and it is often paired with Natural Language Processing (NLP) to interpret what was said. Even though there are many mature speech recognition systems available, e.g. Google Assistant and Amazon Alexa, all of those voice assistants work for a limited set of languages only!

Now the World Food Program wants to deploy an intelligent form that collects nutritional information about food bought and sold at markets in two African countries, Ethiopia and Kenya. Selected people install an app on their mobile phones and, whenever they buy food, use their voice to register the list of items they just bought in their own language. The intelligent system in the app is expected to transcribe that speech to text live.

In this blog we will learn how to build a deep learning model that is capable of transcribing speech to text for both the Amharic and Swahili languages.

For this project we will be working with audio and text files, but before we start looking at the data let's first refresh some basic concepts about sound and see how we can apply deep learning models to this kind of data.

Sound is created by the vibration of air. Every sound we hear is a combination of high and low pressure that we usually represent as a waveform. By sampling the waveform we can convert it into an array of numbers, which is exactly the form a deep learning model can work with!
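As a tiny illustration of sampling (the tone, rate and duration here are arbitrary, not taken from the datasets):

import numpy as np

# Sample a 440 Hz sine wave (a pure tone) at 16 kHz to turn a continuous
# waveform into a discrete array of amplitude values.
sample_rate = 16000            # samples per second (illustrative choice)
duration = 1.0                 # seconds
t = np.linspace(0, duration, int(sample_rate * duration), endpoint=False)
waveform = np.sin(2 * np.pi * 440 * t)

print(waveform.shape)          # (16000,) -> one amplitude value per sample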

Data

  • Amharic: speech recognition (ASR) format, ~20 hrs training and ~2 hrs test
  • Swahili: speech recognition (ASR) format, ~10 hrs training and ~1.8 hrs test

Data Preprocessing Steps with Python

Loading audio. To preprocess audio data in Python we used a package called librosa, which lets us load audio into the notebook as a NumPy array for analysis and manipulation. We then loaded the transcriptions as given for both the Swahili and Amharic datasets. For Amharic, most of the transcriptions are single characters which are actually parts of words, while for the Swahili text we noticed that stop words form a major chunk of the data, which means more cleaning was needed (a sketch of the loading step is shown after the sample below). For instance, this is how the Amharic text looks after cleaning:

['የተለያዩ የትግራይ አውራጃ ተወላጆች ገንዘባቸውን አዋጥተው የልማት ተቋማትን እንዲመሰርቱ ትልማ አይፈቅድም',
'የጠመንጃ ተኩስ ተከፈተና አራት የኤርትራ ወታደሮች ተገደሉ',
'ላነሷቸው ጥያቄዎች የሰጡትን መልስ አቅርበነዋል',
'እብዱ አስፋልቱ ላይ የኰለኰለው ድንጋይ መኪና አላሳልፍ አለ',
'ጠጁን ኰመኰመ ኰመኰመና ሚስቱን ሲያሰቃያት አደረ',
'ድንቹ በደንብ ስለተኰተኰተ በጥሩ ሁኔታ ኰረተ',
'በድህነቱ ላይ ይህ ክፉ በሽታ ስለያዘው ሰውነቱ በጣም ኰሰሰ',
'በሩን እንዲህ በሀይል አታንኳኲ ብዬ አልነበረም እንዴ',
'በለጠች የበየነ የበኩር ልጅ ነች',
'የቆላ ቁስልና ቁርጥማት በጣም አሰቃቂ በሽታዎች ናቸው']
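A rough sketch of the loading step described above (the file path is hypothetical; only the librosa call matters):

import librosa

# Hypothetical file path; the real dataset stores one clip per transcription line.
audio_path = "data/amharic/train/wav/sample_0001.wav"

# librosa returns the audio as a 1-D NumPy float array plus its sampling rate.
# sr=None keeps the file's native sampling rate instead of resampling to 22050 Hz.
signal, sr = librosa.load(audio_path, sr=None)
print(signal.shape, sr)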

Resampling. Speech data is represented as an array consisting of a sequence of numbers, each of which is a measurement of the intensity or amplitude of the sound at a particular moment. How many measurements we get is determined by the sampling rate, which defines the number of samples per second taken from the continuous signal to make it discrete. We chose a standard sampling rate of 44,100 Hz.
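A minimal resampling sketch with librosa (the path is hypothetical, and the 44,100 Hz target simply follows the choice stated above):

import librosa

target_sr = 44100
signal, sr = librosa.load("data/swahili/train/wav/sample_0001.wav", sr=None)
if sr != target_sr:
    # Convert the clip from its native sampling rate to the project's standard rate.
    signal = librosa.resample(signal, orig_sr=sr, target_sr=target_sr)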

Then we have to convert channels: some of the sound files are mono while most of them are stereo. Since the neural network expects all items to have the same dimensions, we converted the mono files to stereo using a function called convert_channels, which duplicates the original single channel of audio data into a second channel and saves the result.
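The function name convert_channels comes from the project; the body below is only a guess at what such a step could look like, assuming mono clips are 1-D arrays:

import numpy as np

def convert_channels(signal: np.ndarray) -> np.ndarray:
    """Sketch of a mono-to-stereo step: if the clip has a single channel,
    duplicate it so every item ends up with two channels."""
    if signal.ndim == 1:                       # mono
        return np.stack([signal, signal])      # shape (2, n_samples), stereo
    return signal                              # already stereo, leave as-is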

Augmentation of data helps us generate more data from the data we already have. We used numpy's roll function to apply time shifting, which helps the model generalize to a wider range of inputs.
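A small time-shifting sketch built on numpy's roll; the maximum shift and sampling rate are assumptions, not the project's actual settings:

import numpy as np

def time_shift(signal: np.ndarray, shift_max: float = 0.2, sr: int = 44100) -> np.ndarray:
    """Shift the waveform left or right by a random fraction of a second
    using np.roll, a simple time-shifting augmentation."""
    shift = np.random.randint(-int(shift_max * sr), int(shift_max * sr))
    return np.roll(signal, shift)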

Feature Extraction:

Commonly used features that are directly fed into neural network architectures are spectrograms and Mel-Frequency Cepstral Coefficients (MFCCs).

A spectrogram is a visual representation of the spectrum of frequencies of an audio signal as it varies with time, so it captures both the time and frequency aspects of the signal. It is obtained by applying the Short-Time Fourier Transform (STFT) to the signal and is usually depicted as a heat map: the vertical axis shows frequency and the horizontal axis shows the time within the clip.

Spectrogram
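A spectrogram like the one above can be produced with librosa; this sketch assumes signal and sr come from the loading step earlier, and the n_fft and hop_length values are typical defaults rather than the project's settings:

import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

# Short-Time Fourier Transform -> magnitude -> decibel scale.
stft = librosa.stft(signal, n_fft=1024, hop_length=256)
spectrogram_db = librosa.amplitude_to_db(np.abs(stft), ref=np.max)

librosa.display.specshow(spectrogram_db, sr=sr, hop_length=256,
                         x_axis="time", y_axis="hz")
plt.colorbar(format="%+2.0f dB")
plt.title("Spectrogram")
plt.show()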

Mel-Frequency Cepstral Coefficients (MFCCs) of a signal are a small set of features (usually about 10–20) that concisely describe the overall shape of the spectral envelope, and they model the characteristics of the human voice.
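Computing MFCCs with librosa is a one-liner; 13 coefficients is a common choice within the 10–20 range mentioned above, and signal/sr are assumed to come from the loading step:

import librosa

mfccs = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13)
print(mfccs.shape)   # (13, n_frames): one small feature vector per time frame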

Acoustic modelling is used in ASR to represent the relationship between an audio signal and the phonemes or other linguistic units that make up speech; the model attempts to map the audio signal to these basic units. In our case we use a store-audio-features function to prepare the inputs for our acoustic model, and the main features we extract and feed to the model for learning are chroma_stft, spec_cent, spec_bw, rolloff, zcr and, of course, the MFCC features. A sketch of this step is shown below.
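The function below is not the project's actual store-audio-features code, only a plausible reconstruction that extracts the features named above with librosa and summarises each clip:

import numpy as np
import librosa

def store_audio_features(signal: np.ndarray, sr: int) -> np.ndarray:
    """Summarise one clip with the mean of the features named in the post:
    chroma, spectral centroid, bandwidth, rolloff, zero-crossing rate and MFCCs."""
    chroma_stft = librosa.feature.chroma_stft(y=signal, sr=sr)
    spec_cent = librosa.feature.spectral_centroid(y=signal, sr=sr)
    spec_bw = librosa.feature.spectral_bandwidth(y=signal, sr=sr)
    rolloff = librosa.feature.spectral_rolloff(y=signal, sr=sr)
    zcr = librosa.feature.zero_crossing_rate(signal)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=20)
    return np.hstack([
        chroma_stft.mean(), spec_cent.mean(), spec_bw.mean(),
        rolloff.mean(), zcr.mean(), mfcc.mean(axis=1),
    ])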

Model Architecture

Now that the audio input data and the corresponding labels are in array format, it is easier to apply NLP techniques. We can convert the text labels into integers using label encoding so the machine can learn from them; these encoded labels are what the output layer of the neural network predicts. This turns the training and validation datasets into n-dimensional arrays.
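A minimal character-level label-encoding sketch; the sample transcriptions and the choice to reserve 0 for a CTC blank are assumptions for illustration:

# Build a character-to-integer vocabulary from the cleaned transcriptions.
transcriptions = ["habari za asubuhi", "chakula kizuri"]          # hypothetical samples
vocab = sorted(set("".join(transcriptions)))
char_to_int = {ch: i + 1 for i, ch in enumerate(vocab)}           # 0 reserved for the CTC blank

def encode(text: str) -> list[int]:
    """Map every character of a transcription to its integer label."""
    return [char_to_int[ch] for ch in text]

print(encode("habari"))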

At this stage we apply other pre-processing techniques such as dropping columns and normalization to produce the final training data, and then split the dataset into train, test and validation sets, just as we would for any other model. We can leverage deep neural architectures such as CNNs, RNNs, LSTMs and CTC to build and train models for speech applications like speech recognition. A model trained on fixed-size audio chunks of a few seconds, transformed into n-dimensional arrays with their respective labels, will predict output labels for test audio input. Because the output labels go beyond binary, we are building a multi-class classification method.

Deep learning can be defined as a machine learning technique that solves problems by providing inputs and desired outputs and letting a neural network find the mapping between them. Here, we want that mapping to convert speech to text.

Deep Learning Architecture: CNN, RNN (LSTM) & CTC
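The architecture above can be sketched, for example, as a small CNN + bidirectional LSTM acoustic model trained with CTC loss. The code below uses Keras purely for illustration; the layer sizes, feature dimension and vocabulary size are assumptions, not the project's actual configuration:

import tensorflow as tf
from tensorflow.keras import layers

def build_asr_model(n_features: int, vocab_size: int) -> tf.keras.Model:
    # Input: a variable-length sequence of feature frames (e.g. MFCC vectors).
    inputs = layers.Input(shape=(None, n_features), name="features")
    x = layers.Conv1D(128, kernel_size=11, padding="same", activation="relu")(inputs)
    x = layers.BatchNormalization()(x)
    x = layers.Bidirectional(layers.LSTM(256, return_sequences=True))(x)
    x = layers.Dense(256, activation="relu")(x)
    # One softmax unit per character plus one extra unit for the CTC blank.
    outputs = layers.Dense(vocab_size + 1, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)

def ctc_loss(y_true, y_pred):
    # Simplification: use the padded lengths as the sequence lengths.
    batch = tf.shape(y_pred)[0]
    input_len = tf.fill([batch, 1], tf.shape(y_pred)[1])
    label_len = tf.fill([batch, 1], tf.shape(y_true)[1])
    return tf.keras.backend.ctc_batch_cost(y_true, y_pred, input_len, label_len)

model = build_asr_model(n_features=20, vocab_size=40)   # assumed sizes
model.compile(optimizer="adam", loss=ctc_loss)
model.summary()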

Evaluation*

Model Space Exploration*

Recording & Prediction*

Metrics: Word Error Rate (WER)*
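WER is the standard ASR metric: the number of word substitutions, deletions and insertions needed to turn the predicted transcript into the reference, divided by the number of reference words. A minimal reference implementation (the example sentences are made up):

def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate via the usual edit-distance dynamic programme."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("habari za asubuhi", "habari asubuhi"))   # 1 deletion / 3 words ≈ 0.33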

Conclusion

For this project we compared different neural network models. The best predictions we got came from the CNN.

*Future work will revisit the evaluation and the loss function, compare more models, and find the best prediction.

References:

https://github.com/rafaesam/nlp_swahili_amharic.git

https://heartbeat.comet.ml/the-3-deep-learning-frameworks-for-end-to-end-speech-recognition-that-power-your-devices-37b891ddc380

Ian Pointer, Programming PyTorch for Deep Learning (O'Reilly, 2019), ISBN 978-1-492-04535-9.

https://www.youtube.com/watch?v=RX_ptt44mq4
