African language Speech Recognition (Amharic language)

Biniyambelayneh
Jun 9, 2022


Speech-to-Text

Abstract

We processed more than 10,000 Amharic audio samples and prepared them for modeling. After building the model, different hyperparameter values (epoch number and batch size) were used to train it until the best result in terms of the WER (Word Error Rate) metric was obtained. The final WER we were able to reach, given the limited time available for experiments, was 0.9954.

Introduction

The World Food Program plans to use an intelligent form to collect nutritional information on food purchased and sold in three African markets.

The design of this intelligent form requires selected people to download an app to their phone, and then use their voice to activate the app to register the list of products they just bought in their own language anytime they buy food. The app’s intelligent technologies are anticipated to transcribe Amharic speech to Amharic text and organize the data in a database in an easy-to-process manner.

Amharic is a morphologically rich language with limited resources. The purpose of this project is to construct an Amharic STT (Speech-To-Text) converter. Because of the intelligent form’s design, selected people must use the web app, and anytime they buy food, they must use their voice to activate the app, which then registers the list of items they just purchased in their own language. This calls for a deep learning model capable of converting Amharic speech into Amharic text. The model we develop should be accurate and resistant to noise. We produced this project during the fourth week of a 10 Academy Machine Learning course.

Motivation

Automatic speech recognition systems are useful in a wide range of scenarios. In fact, ASR (Automatic Speech Recognition) systems can be used for almost any task involving computer interaction.

The following are some of the most common applications of speech recognition technology:

  • Dictation systems, including medical transcription,
  • Legal and business dictation,
  • Conventional word processing.

Recording one’s voice is easier than typing on a keyboard, and this is the primary incentive for this project. Non-technical WFP (World Food Program) employees make up the majority of those involved in this project. As a result, registering food products by recording the individual’s voice is the simplest and most straightforward method.

Preprocessing and Modeling

The preprocessing stage is where we processed the audio samples to prepare them for feature extraction and modeling. Here, the audio samples are made to have equal length by truncating some of them and padding the others; the audio is resampled to a common sampling rate, and all files are converted to stereo. Finally, random time shifting was applied to the audio samples to increase the prediction accuracy of the model after training.

After that, features need to be extracted from the data so that it is ready to train our model. Once the model is built and compiled with a loss function (the CTC loss in this case), training can commence. The main steps are:

  • Preparing metadata: used to track and load the audio samples one by one at runtime.
  • Standardizing the sampling rate
  • Audio resizing
  • Converting mono to stereo
  • Data augmentation
  • Feature Extraction
  • Model Building
  • Training

Preparing metadata

Metadata is the information used to link an audio file to its transcribed counterpart. In our dataset, a metadata file named ‘trsTrain.txt’ holds a whitespace-separated audio id and transcription on each line. Here’s how it looks after the metadata has been parsed and converted to a CSV file.
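A minimal sketch of how such a metadata file can be parsed, assuming each line of ‘trsTrain.txt’ holds an audio id followed by its transcription separated by white space; the audio directory, file extension, and column names here are illustrative, not the project’s actual layout:

```python
import pandas as pd

def load_metadata(path="trsTrain.txt", audio_dir="data/train/wav"):
    """Parse whitespace-separated <audio_id> <transcription> lines into a
    DataFrame linking each audio file to its transcription."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            audio_id, transcription = line.split(maxsplit=1)
            records.append({
                "audio_id": audio_id,
                "path": f"{audio_dir}/{audio_id}.wav",   # assumed layout
                "transcription": transcription,
            })
    return pd.DataFrame(records)

meta = load_metadata()
meta.to_csv("metadata.csv", index=False)
```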

Sample Rate Standardization

When we inspected the audio files in the train folder, most of them had different sampling rates. For the model to be fair and free from bias, this step makes the sampling rate constant for all the audio files: we resampled everything to 44 kHz.
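A small sketch of this step, assuming librosa and soundfile are used (the post does not name the libraries); the file paths are placeholders:

```python
import librosa
import soundfile as sf

TARGET_SR = 44_100  # 44 kHz target used in this project

def standardize_sample_rate(in_path, out_path, target_sr=TARGET_SR):
    """Load the clip and resample it to the common target rate."""
    audio, _ = librosa.load(in_path, sr=target_sr)  # librosa resamples on load
    sf.write(out_path, audio, target_sr)
```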

Audio Resizing

During this stage we found out that most of the audio clips were longer than 5 seconds and that a large part of their duration was silence. So we truncated each clip to at most 5 seconds and padded the shorter ones so that all samples share the same length.
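A minimal sketch of the resizing step; truncating to 5 seconds follows the text above, while zero-padding at the end of shorter clips is an assumption about how equal length was achieved:

```python
import numpy as np

MAX_SECONDS = 5

def resize_audio(audio, sr, max_seconds=MAX_SECONDS):
    """Truncate clips longer than max_seconds and zero-pad shorter ones
    so every sample ends up with the same number of frames."""
    max_len = int(sr * max_seconds)
    if len(audio) > max_len:
        return audio[:max_len]
    padding = max_len - len(audio)
    return np.pad(audio, (0, padding), mode="constant")
```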

Converting mono to stereo

Monaural or monophonic sound reproduction is intended to be heard as if it were a single channel of sound perceived as coming from one position. Stereophonic sound or, more commonly, stereo, is a method of sound reproduction that creates an illusion of multi-directional audible perspective.

So we converted all the audio files into a common stereo format.
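In practice this can be done by duplicating the single channel, as in the sketch below; the channel-first array shape is an assumption:

```python
import numpy as np

def to_stereo(audio):
    """Duplicate a mono signal onto two channels so every file shares a
    common stereo layout; stereo input is returned unchanged."""
    if audio.ndim == 1:                  # mono: (num_samples,)
        return np.stack([audio, audio])  # stereo: (2, num_samples)
    return audio
```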

Data Augmentation

The last step of our preprocessing was to augment the audio data. This step helps us build a model that is robust and resistant to noise. Among the different data augmentation techniques, we used a random time shift.
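A minimal time-shift sketch; the maximum shift fraction is an assumed parameter, not a value reported in the post:

```python
import numpy as np

def random_time_shift(audio, max_shift_fraction=0.2):
    """Roll the waveform by a random offset (up to a fraction of its length)
    so the model does not memorise where speech starts in the clip."""
    max_shift = int(audio.shape[-1] * max_shift_fraction)
    shift = np.random.randint(-max_shift, max_shift + 1)
    return np.roll(audio, shift, axis=-1)  # works for mono or (2, N) stereo
```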

Feature Extraction

After the previous steps, the data is now ready for feature extraction. This is where a spectrogram representation of the data is generated. A spectrogram represents the audio in the frequency domain instead of the time domain. To avoid losing the temporal structure of the audio, we divide it into small frames or windows and apply the FFT (Fast Fourier Transform) to each frame to generate the spectrogram. MFCCs (Mel-Frequency Cepstral Coefficients) are then used to reduce the number of coefficients representing the frequency bands. The spectrogram can be visualized as an image where the x-axis is time, the y-axis is frequency, and the darkness of the image describes the intensity of the audio.
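A sketch of this step using librosa; the frame size, hop length, and number of MFCC coefficients are illustrative assumptions rather than the values actually used:

```python
import librosa
import numpy as np

def extract_features(audio, sr=44_100, n_mfcc=13, n_fft=1024, hop_length=512):
    """Split the waveform into short frames, apply the FFT on each frame,
    and summarise the resulting spectrogram with MFCC coefficients."""
    if audio.ndim > 1:           # collapse stereo to mono for the transform
        audio = np.mean(audio, axis=0)
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=n_mfcc,
                                n_fft=n_fft, hop_length=hop_length)
    return mfcc.T                # shape: (time_steps, n_mfcc)
```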

Modeling

During the modeling stage we choose a certain architecture to generate the model. In machine learning, each type of artificial neural network is tailored to perform certain sets of tasks. Two common types of artificial neural networks are convolutional neural networks (CNNs) and recurrent neural networks (RNNs). Recurrent neural networks are designed to interpret temporal or sequential information: they use other data points in a sequence to make better predictions by taking in input and reusing the activations of earlier nodes in the sequence to influence the output. Convolutional neural networks, by contrast, are one of the most common types of neural networks used in computer vision to recognize objects and patterns in images, and they are not well suited to interpreting temporal information on their own.

For this project we used RNN since we are working with audio files, which are temporal by nature.
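The post names an RNN acoustic model trained with the CTC loss but does not show the architecture. Below is a minimal sketch assuming PyTorch (nn.CTCLoss); the layer sizes, vocabulary size, and tensor shapes are placeholders, not the project’s actual configuration:

```python
import torch
import torch.nn as nn

class AmharicSTT(nn.Module):
    """Illustrative RNN acoustic model: MFCC frames -> BiLSTM ->
    per-frame character scores, trained with CTC loss."""
    def __init__(self, n_mfcc=13, hidden_size=256, vocab_size=224):
        super().__init__()
        self.rnn = nn.LSTM(input_size=n_mfcc, hidden_size=hidden_size,
                           num_layers=2, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden_size, vocab_size + 1)  # +1 for the CTC blank

    def forward(self, x):            # x: (batch, time, n_mfcc)
        out, _ = self.rnn(x)
        return self.fc(out)          # (batch, time, vocab_size + 1)

model = AmharicSTT()
ctc_loss = nn.CTCLoss(blank=0)

# One hypothetical training step with dummy data
features = torch.randn(4, 200, 13)                    # batch of MFCC sequences
targets = torch.randint(1, 225, (4, 30))              # encoded Amharic characters
input_lengths = torch.full((4,), 200, dtype=torch.long)
target_lengths = torch.full((4,), 30, dtype=torch.long)

log_probs = model(features).log_softmax(dim=-1).permute(1, 0, 2)  # (time, batch, classes)
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
loss.backward()
```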

Hyperparameters

The hyperparameters used to tune the model were batch size and epoch number.

A limited number of experiments were run to get the following results.

The image below shows the results of our last experiment. Although the improvement is modest, we were able to decrease the error rate from 1.0 to 0.992.

Model results

Model training results
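The metric reported throughout is the Word Error Rate, which counts word-level substitutions, insertions, and deletions against the reference transcription. A minimal, self-contained way to compute it:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / words in reference,
    computed with a standard edit-distance dynamic program."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```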

During model training, the data and the pipeline were tracked using DVC. Below, you can take a look at two of the stages tracked by the DVC pipeline and their outputs.

DVC pipeline and data tracking
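The post does not show the pipeline definition itself; a minimal dvc.yaml sketch with hypothetical stage names, scripts, and paths could look like this:

```yaml
# dvc.yaml -- illustrative two-stage pipeline; script and data paths are placeholders
stages:
  preprocess:
    cmd: python scripts/preprocess.py
    deps:
      - data/train
      - scripts/preprocess.py
    outs:
      - data/processed
  train:
    cmd: python scripts/train.py
    deps:
      - data/processed
      - scripts/train.py
    outs:
      - models/model.pt
```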

Some of the experiment results were tracked using MLflow. The image below shows the tracked parameters and associated metrics for each run.

MLflow tracked parameters and metrics
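Logging a run of this kind with MLflow can be sketched as follows; the experiment name and the batch size and epoch values are placeholders, while the WER is the figure reported in the abstract:

```python
import mlflow

# Track one experiment run: hyperparameters as params, WER as the metric.
mlflow.set_experiment("amharic-stt")

with mlflow.start_run():
    mlflow.log_param("batch_size", 64)   # placeholder value
    mlflow.log_param("epochs", 10)       # placeholder value
    mlflow.log_metric("wer", 0.9954)
```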

Bias, Fairness

Because the data used to train the ASR model in this study is largely political in content, validation is essential. While this does not fully invalidate our model, it may not be appropriate to use it in other domains without adequate testing and validation, because it may have mostly learned political vocabulary and may be biased in the terms it suggests.

Future Work

To highlight upcoming work in the speech-recognition area, the following recommendations are made based on the experience gained and the obstacles encountered in this experiment. Because this research does not distinguish between native and non-native Amharic speakers, it ignores linguistic variation among non-native speakers. As a result, further research into dialectical differences among non-native speakers of the language is required.

Given the constraints, the results are encouraging and demonstrate that a general speech recognition system capable of converting Amharic speech to text could be built.

Review of previous work

Despite the fact that ASR (Automatic Speech Recognition) research for technologically favored languages has been ongoing for more than 50 years, Amharic speech recognition research has only recently begun. Speech recognition for Ethiopic languages has been the subject of a number of studies.

Some of the works are:

- Zegaye S. (2003): HMM-Based Large Vocabulary, Speaker-Independent, Continuous Amharic Speech Recognizer, M.Sc. Thesis, Addis Ababa University, Addis Ababa

- Bereket K. (2008): Developing a Speech Synthesizer for the Amharic Language Using Hidden Markov Models, M.Sc. Thesis, Addis Ababa University, Addis Ababa

- Solomon B. (2001): Isolated Amharic Consonant-Vowel (CV) Syllable Recognition: An Experiment Using the Hidden Markov Model, M.Sc. Thesis, Addis Ababa University, Addis Ababa

- Kinfe T. (2002): Sub-word Based Amharic Word Recognition: An Experiment Using the Hidden Markov Model, M.Sc. Thesis, Addis Ababa University, Addis Ababa

Reference

How to Choose Loss Functions When Training Deep Learning Neural Networks

Introduction to Loss Functions

Ethics, Fairness, and Bias in AI

Bias and Fairness as Dimensions of Trusted AI

Audio Deep Learning Made Simple: Automatic Speech Recognition (ASR), How it Works

Speech Recognition

Tilayealemu melaNet visualization using acoustic model

(PDF) Speech Recognition using Recurrent Neural Networks

Learning to Forget: Continual Prediction with LSTM | Neural Computation | MIT Press

Audio Deep Learning Made Simple: Sound Classification, step-by-step

https://towardsdatascience.com/audio-deep-learning-made-simple-automatic-speech-recognition-asr-how-it-works-716cfce4c706
