Speech Recognition for an African Language

Meron Abate
8 min read · Jun 9, 2022


Amharic Language | Speech-to-text

Abstract

This project aims to transform spoken Amharic audio into text. This report covers all the steps taken to achieve that and to build a model for that sole purpose. Once the model is trained, we also generate sample transcriptions for a set of training audio files.

Business Objective

The World Food Program wants to use an intelligent form to collect nutritional information on food bought and sold at markets in two African countries: Ethiopia and Kenya.

The design of this intelligent form requires selected people to install an app on their mobile phones; whenever they buy food, they use their voice to activate the app and register, in their own language, the list of items they just bought. The intelligent systems in the app are expected to transcribe the speech to text on the fly and organize the information in an easy-to-process way in a database.

Amharic is a morphologically rich language with limited resources. The purpose of this project is to construct an Amharic STT (speech-to-text) converter that supports this voice-driven workflow.

We therefore build a deep learning model capable of converting Amharic speech into Amharic text. The model we develop should be accurate and resistant to noise. We produced this project during the fourth week of the 10 Academy Machine Learning course.

What is speech recognition?

Speech recognition is an interdisciplinary subfield of computer science and computational linguistics that develops methodologies and technologies that enable the recognition and translation of spoken language into text by computers. It is also known as automatic speech recognition (ASR), computer speech recognition, or speech to text (STT). It incorporates knowledge and research in the computer science, linguistics, and computer engineering fields.

CTC (Connectionist Temporal Classification) is an algorithm used to train deep neural networks for speech recognition, handwriting recognition, and other sequence problems. CTC is used when we don't know how the input aligns with the output, for example how the characters in the transcript align with the audio frames.
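
To make this concrete, here is a minimal sketch of applying a CTC loss to per-frame character probabilities. It assumes PyTorch, and the tensor shapes are made up for illustration; it is not the exact setup used in this project.

```python
import torch
import torch.nn as nn

# Toy dimensions for illustration: 50 audio frames, batch of 2,
# 30 output characters plus the CTC "blank" symbol at index 0.
T, N, C = 50, 2, 31
log_probs = torch.randn(T, N, C).log_softmax(dim=-1)    # per-frame log-probabilities

targets = torch.randint(1, C, (N, 10))                   # target character indices (no blanks)
input_lengths = torch.full((N,), T, dtype=torch.long)    # frames per utterance
target_lengths = torch.full((N,), 10, dtype=torch.long)  # characters per transcript

# CTC sums over all possible alignments between the 50 frames and the 10
# characters, so no frame-level alignment has to be provided.
ctc = nn.CTCLoss(blank=0)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
print(loss.item())
```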

Motivation

Automatic speech recognition systems are useful in a wide range of scenarios. In fact, ASR (automatic speech recognition) can be used for almost any task involving interaction with a computer.

The following are some of the most common applications of speech recognition technology:

  • Dictation systems, such as medical transcription,
  • Legal and business dictation,
  • Conventional word processing.

Recording a voice note is easier than typing on a keyboard, and this is the primary incentive for this project. Non-technical WFP (World Food Program) employees make up the majority of the people involved in this project. As a result, registering food items by recording the person's own voice is the simplest and most straightforward method.

Data and Preprocessing

Data

Before using the data for our model, we had to clean and preprocess the provided audio files. This step gives us clean data that the model can process easily, resulting in more accurate predictions.

Some of the steps taken for this were:

  • Preparing metadata
  • Spectrogram generation
  • Standardizing the sample rate
  • Audio resizing
  • Converting mono to stereo
  • Data augmentation

Preparing metadata

Metadata is the information that links an audio file to its transcribed counterpart. Our dataset provides a metadata file named ‘trsTrain.txt’, in which whitespace separates the audio ID from its transcription. Here is how it looks after the metadata has been parsed and converted to a CSV file.
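
As a sketch of this step, assuming each line of ‘trsTrain.txt’ holds an audio ID followed by its transcription separated by whitespace (the output file name below is a placeholder of ours):

```python
import pandas as pd

rows = []
with open("trsTrain.txt", encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if not line:
            continue
        # Split only on the first whitespace: the rest of the line is the transcription.
        audio_id, transcription = line.split(maxsplit=1)
        rows.append({"audio_id": audio_id, "transcription": transcription})

meta = pd.DataFrame(rows)
meta.to_csv("metadata.csv", index=False)   # placeholder output name
print(meta.head())
```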

Figure 1.1: Preprocessed data

Spectrogram Generation

After preparing the metadata and the audio files, the next thing we did was generate spectrograms. A spectrogram is a visual representation of audio in which the x-axis is time, the y-axis is frequency, and the darkness of the image describes the intensity of the audio at that point.
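
A minimal sketch of generating such a spectrogram with librosa; the file path and STFT settings are placeholders rather than the exact ones used here:

```python
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

y, sr = librosa.load("train/audio_0001.wav", sr=None)        # placeholder path, native rate

stft = librosa.stft(y)                                        # short-time Fourier transform
spec_db = librosa.amplitude_to_db(np.abs(stft), ref=np.max)   # magnitude in decibels

librosa.display.specshow(spec_db, sr=sr, x_axis="time", y_axis="hz")
plt.colorbar(format="%+2.0f dB")
plt.title("Spectrogram")
plt.tight_layout()
plt.show()
```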

Figure 1.2: Spectrogram

Sample Rate Standardization

Looking at the sounds in the train folder, most of them had different sampling rates. For the model to treat all clips consistently and free from bias, we used this step to make the sampling rate constant across all the audio files: we standardized it to 44 kHz.
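
A sketch of the resampling step, assuming librosa; 44,100 Hz is used as the common rate to match the 44 kHz mentioned above:

```python
import librosa

TARGET_SR = 44_100   # 44 kHz, the common sampling rate chosen above

def load_resampled(path, target_sr=TARGET_SR):
    y, sr = librosa.load(path, sr=None)   # load at the file's native rate
    if sr != target_sr:
        y = librosa.resample(y, orig_sr=sr, target_sr=target_sr)
    return y, target_sr
```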

Figure 1.3: Sound wave

Audio Resizing

During this stage, we found that most of the audio clips were longer than 5 seconds and that more than half of their content was silence. So we truncated each clip to at most 5 seconds.
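
A sketch of this step on a NumPy waveform; the truncation follows the text above, while the zero-padding branch for clips shorter than 5 seconds is our own assumption for completeness:

```python
import numpy as np

def resize_audio(y, sr, max_seconds=5.0):
    """Truncate (or pad with silence) a waveform to exactly max_seconds."""
    max_len = int(sr * max_seconds)
    if len(y) > max_len:
        return y[:max_len]                    # truncate long clips
    return np.pad(y, (0, max_len - len(y)))   # pad short clips with zeros (assumption)
```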

Converting mono to stereo

Monaural or monophonic sound reproduction is intended to be heard as if it were a single channel of sound perceived as coming from one position. Stereophonic sound or, more commonly, stereo, is a method of sound reproduction that creates an illusion of multi-directional audible perspective.

So we converted all the audio files into a common stereo format.
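
A minimal sketch of the conversion, assuming waveforms are NumPy arrays; a mono signal is simply duplicated into two identical channels:

```python
import numpy as np

def to_stereo(y):
    """Return a (2, n_samples) stereo array; mono input is duplicated across channels."""
    if y.ndim == 1:                # mono: shape (n_samples,)
        return np.stack([y, y])    # duplicate the single channel
    return y                       # already has multiple channels
```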

Data Augmentation

The last preprocessing step was to augment the audio. This helps us build a model that is robust and resistant to noise. Among the different data augmentation techniques, we used time shift for our augmentation phase.
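
A sketch of time-shift augmentation on a NumPy waveform; the maximum shift fraction is an illustrative choice, not the value used in this project:

```python
import numpy as np

def time_shift(y, max_shift_fraction=0.2):
    """Randomly shift the waveform along the time axis, wrapping around the ends."""
    max_shift = int(len(y) * max_shift_fraction)
    shift = np.random.randint(-max_shift, max_shift + 1)
    return np.roll(y, shift)
```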

Review of previous work

Despite the fact that ASR (Automatic Speech Recognition) research for technologically favored languages has been ongoing for more than 50 years, Amharic speech recognition research has only recently begun. Speech recognition for Ethiopic languages has been the subject of a number of studies.

Some of the works are:

- Zegaye S. (2003): HMM-Based Large Vocabulary, Speaker-Independent, Continuous Amharic Speech Recognizer, M.Sc. Thesis, Addis Ababa University, Addis Ababa.

- Bereket K. (2008): Developing a Speech Synthesizer for the Amharic Language Using the Hidden Markov Model, M.Sc. Thesis, Addis Ababa University, Addis Ababa.

- Solomon B. (2001): Isolated Amharic Consonant-Vowel (CV) Syllable Recognition: An Experiment Using the Hidden Markov Model, M.Sc. Thesis, Addis Ababa University, Addis Ababa.

Deep learning architecture

In the deep learning era, neural networks have shown significant improvement on the speech recognition task. Various methods have been applied, such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs), while more recently Transformer networks have achieved great performance.

Recurrent Neural Networks

RNNs perform computations over the time sequence, since their current hidden state depends on all the previous hidden states. More specifically, they are designed to model time-series signals and to capture both long-term and short-term dependencies between different time-steps of the input. Concerning speech recognition applications, the network consumes the acoustic feature frames one step at a time and produces a distribution over output characters at each step.
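
As an illustration of how an RNN acoustic model can be paired with the CTC loss described earlier, here is a minimal PyTorch sketch; the feature size, hidden size, and character-set size are placeholders, not the values used in this project:

```python
import torch
import torch.nn as nn

class BiLSTMAcousticModel(nn.Module):
    def __init__(self, n_features=128, hidden_size=256, n_chars=224):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden_size, num_layers=2,
                            batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_size, n_chars + 1)   # +1 for the CTC blank

    def forward(self, x):                  # x: (batch, time, n_features)
        out, _ = self.lstm(x)              # (batch, time, 2 * hidden_size)
        return self.classifier(out).log_softmax(dim=-1)   # per-frame log-probabilities

model = BiLSTMAcousticModel()
frames = torch.randn(4, 200, 128)          # dummy batch of 4 utterances, 200 frames each
print(model(frames).shape)                 # torch.Size([4, 200, 225])
```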

Convolutional Neural Networks

A CNN (convolutional neural network) architecture is formed by a stack of distinct layers that transform the input volume through a differentiable function. It consists of an input layer, hidden layers, and an output layer. In any feed-forward neural network, the middle layers are called hidden because their inputs and outputs are masked by the activation function and final convolution. In a convolutional neural network, the hidden layers include layers that perform convolutions. Typically this includes a layer that computes a dot product of the convolution kernel with the layer's input matrix. This product is usually the Frobenius inner product, and its activation function is commonly ReLU. As the convolution kernel slides along the layer's input matrix, the convolution operation generates a feature map, which in turn contributes to the input of the next layer. This is followed by other layers such as pooling layers, fully connected layers, and normalization layers.

Convolutional methods can be grouped into 1-dimensional and 2-dimensional networks. 2D-CNNs construct 2D feature maps from the acoustic signal: similar to images, they organize acoustic features, e.g. MFCC features, into a 2-dimensional feature map where one axis represents the frequency domain and the other represents the time domain. In contrast, 1D-CNNs accept the acoustic features directly as input.
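
A small sketch contrasting the two views, assuming librosa for MFCC extraction and PyTorch for the convolutions; the file path and layer sizes are placeholders:

```python
import librosa
import torch
import torch.nn as nn

y, sr = librosa.load("train/audio_0001.wav", sr=44_100)    # placeholder path
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)          # shape: (40 coefficients, T frames)
feats = torch.tensor(mfcc, dtype=torch.float32)

# 2D view: treat the MFCC matrix like a one-channel image (frequency x time).
x2d = feats[None, None]                                     # (1, 1, 40, T)
conv2d = nn.Conv2d(in_channels=1, out_channels=16, kernel_size=3, padding=1)

# 1D view: treat each coefficient as a channel and convolve over time only.
x1d = feats[None]                                           # (1, 40, T)
conv1d = nn.Conv1d(in_channels=40, out_channels=16, kernel_size=3, padding=1)

print(conv2d(x2d).shape, conv1d(x1d).shape)
```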

Hyperparameters

Hyperparameters are the various settings used to control the learning process. CNNs use more hyperparameters than a standard multilayer perceptron. Some of these are kernel size, padding, and stride, among others.
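
A small illustration of how these settings interact, using PyTorch convolutions on a made-up spectrogram-sized input; the shapes are placeholders:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 128, 400)   # (batch, channels, frequency bins, time frames)

# The same kernel size with different stride/padding choices changes the feature-map size.
keep_size = nn.Conv2d(1, 16, kernel_size=3, stride=1, padding=1)
downsample = nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1)

print(keep_size(x).shape)    # torch.Size([1, 16, 128, 400])
print(downsample(x).shape)   # torch.Size([1, 16, 64, 200])
```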

Model Results

Some of the results from training our model are shown below.

Training On AWS

Bias, Fairness

Because the data used to train the ASR model in this study is highly political, validation is essential. While this does not fully invalidate our model, it may not be appropriate to use it in other domains without adequate testing and validation, because it may have learned mostly political terms and may be biased when predicting words.

Future Work

To guide upcoming studies in the speech domain, the following recommendations are made based on the experience gained and the numerous obstacles encountered in this experiment. Because this research treats native and non-native Amharic speakers together, it ignores linguistic variation among non-native Amharic speakers. As a result, further research into dialectal differences among non-native speakers of the language is required.

Given the constraints, the results are encouraging and demonstrate that a general speech recognition system capable of converting Amharic speech to text could be built.

