Speech-to-Text Recognition (Swahili and Amharic Languages)

Abel Mitiku
9 min read · Jun 9, 2022

Introduction

Speech recognition, also referred to as speech-to-text or voice recognition, is a technology that recognizes speech, allowing a voice to serve as the primary interface between human and computer.

Speech recognition technology allows for hands-free control of smartphones, speakers, and even vehicles in various languages. Companies have moved towards the goal of enabling machines to understand and respond to more and more of our verbalized commands. Many mature speech recognition systems are available, such as Google Assistant, Amazon Alexa, and Apple’s Siri. However, these voice assistants support only a limited set of languages.

The World Food Programme wants to deploy an intelligent form that collects nutritional information about food bought and sold at markets in two African countries, Ethiopia and Kenya. The design of this intelligent form requires selected people to install an app on their mobile phones; whenever they buy food, they use their voice to activate the app and register the list of items they just bought in their own language. The intelligent systems in the app are expected to transcribe the speech to text live and organize the information in an easy-to-process way in a database.

We were chosen to provide speech-to-text technology for the Amharic and Swahili languages. Our task was to create a deep learning model capable of converting speech to text; the model should be accurate and robust to background noise.

Data Features

The data for this challenge can be retrieved by clicking on the links below.

Amharic: Speech Recognition ASR format (~20hr training and ~2hrs test)

Swahili: Speech Recognition ASR format (~10hr training and ~1.8hrs test)

You can find more data for these two languages here: For Swahili & Amharic. For example, Amharic voice commands data (315MB) can be found here: Amharic Voice Commands | Zenodo

Input features (X): audio clips of spoken words

Target labels (y): a text transcript of what was spoken

Automatic Speech Recognition

Character-level speech recognition has two components: the acoustic model, which describes the distribution over acoustic observations given a character sequence, and the language model, which assigns a probability to every possible character sequence. A sequence-to-sequence model merges the acoustic and language models into a single neural network.

Our goal was to create a character-level ASR system in TensorFlow utilizing an encoder/decoder-based recurrent neural network with an attention mechanism that could conduct inference on an Nvidia GPU in the AWS cloud.

Amharic language meta-data generation

The dataset consists of 16 kHz audio files between 2 and 15 seconds long. Using the prepared scripts, the audio files were converted to single-channel (mono) WAV files (.wav extension) with a 256 kbps bit rate and a 16 kHz sample rate. The pre-processing applied to the text transcriptions includes removing any punctuation other than apostrophes and transforming all characters to lowercase.

# Manipulation of transcripts
name_to_text = {}
with open(filename, encoding="utf-8") as f:
    f.readline()  # skip the header line
    for line in f:
        # each line looks like: <s> transcript </s> (utterance_name)
        name = line.split("</s>")[1]
        name = name.replace('(', '')
        name = name.replace(')', '')
        name = name.replace('\n', '')
        name = name.replace(' ', '')
        text = line.split("</s>")[0]
        text = text.replace("<s>", "")
        name_to_text[name] = text

################################################
# Generate metadata that includes the audio file path,
# audio duration, and audio transcript for every utterance
import librosa

target = []
features = []
filenames = []
duration_of_recordings = []
for k in trans:
    filename = path + k + ".wav"
    filenames.append(filename)
    # load at the native sample rate to measure the clip duration
    audio, fs = librosa.load(filename, sr=None)
    duration_of_recordings.append(float(len(audio) / fs))
    label = trans[k]
    target.append(label)
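The channel conversion itself is not shown above. A minimal sketch of down-mixing a clip to a single channel and resampling it to 16 kHz, assuming librosa and soundfile are available (the file paths are illustrative):

import librosa
import soundfile as sf

def to_mono_16k(src_path, dst_path, target_sr=16000):
    # librosa.load with mono=True averages the channels and resamples in one step
    audio, sr = librosa.load(src_path, sr=target_sr, mono=True)
    # write a single-channel 16-bit PCM WAV file
    sf.write(dst_path, audio, target_sr, subtype='PCM_16')

# e.g. to_mono_16k('raw/tr_00001.wav', 'wav/tr_00001.wav')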

The script for transforming the generated meta-data into a JSON file for additional preprocessing and deep learning can be found in the snippet below.

# Convert the metadata dataframe to a JSON-lines file
# @param data: dataframe with 'key', 'duration', and 'text' columns
# @param path: path to save the JSON file
# @return: None
import json

try:
    with open(path, 'w') as out_file:
        for i in range(len(data['key'])):
            line = json.dumps({'key': data['key'][i],
                               'duration': data['duration'][i],
                               'text': data['text'][i]})
            out_file.write(line + '\n')
except KeyError:
    pass  # skip if a required column is missing
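For completeness, here is a minimal sketch of reading this JSON-lines metadata back at training time (the file name is illustrative; the field names match those written above):

import json

keys, durations, texts = [], [], []
with open('amharic_meta.json', encoding='utf-8') as json_file:
    for line in json_file:
        record = json.loads(line)
        keys.append(record['key'])            # path to the .wav file
        durations.append(record['duration'])  # clip length in seconds
        texts.append(record['text'])          # transcript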

Acoustic Feature Extraction for Speech Recognition

The first step in any automatic speech recognition system is to extract features, i.e. to identify the components of the audio signal that are useful for recognizing the linguistic content and to discard everything else that carries other information, such as background noise and emotion.

The purpose of feature extraction is to represent a speech signal by a predetermined number of components, because the full acoustic signal is too cumbersome to deal with and some of its information is irrelevant to the recognition task.

There are three primary methods for extracting features for speech recognition: raw audio waveforms, spectrograms, and MFCCs.

Raw Audio Waveforms

The primary dataset needs little cleaning, as it comes from fairly clean recordings that have already been preprocessed for background noise. This will, of course, lead to reduced performance in noisy environments.

This method uses the raw waveform of the audio file: a 1D vector of amplitudes, X = [x1, x2, x3, …].

Audio File Transcription : ከ አትላንታ ብዙ ሰዎች ተገኝ ተዋል
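As a quick illustration, a clip can be loaded as such a 1D amplitude vector with librosa (the file path is illustrative):

import librosa

# keep the native 16 kHz sample rate (sr=None)
audio, fs = librosa.load('wav/tr_00001.wav', sr=None)
print(audio.shape)   # e.g. (48000,) for a 3-second clip at 16 kHz
print(audio[:5])     # the raw amplitudes x1, x2, x3, ...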

Spectrograms

A spectrogram transforms the raw audio waveforms into a 2D tensor (using the Fourier transform) where the first dimension corresponds to time (the horizontal axis), and the second dimension corresponds to frequency (the vertical axis).

With a 16 kHz sample rate the frequencies range from 0 to 8000 Hz (the Nyquist limit), so if we have 161 features for each frame, each feature corresponds to roughly 50 Hz.

Shape of the Spectrogram : (293, 161)
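The 161 bins are consistent with a 20 ms analysis window (320 samples at 16 kHz). A minimal sketch of computing such a (time, frequency) log-power spectrogram with librosa, assuming a 10 ms hop:

import numpy as np
import librosa

audio, fs = librosa.load('wav/tr_00001.wav', sr=None)   # 16 kHz audio
# a 320-sample FFT gives 320/2 + 1 = 161 frequency bins up to 8 kHz
stft = librosa.stft(audio, n_fft=320, hop_length=160)
spectrogram = np.log(np.abs(stft).T ** 2 + 1e-10)       # shape: (time_steps, 161)
print(spectrogram.shape)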

Mel-Frequency Cepstral Coefficients (mfcc’s)

The mel-frequency cepstrum (MFC) is a representation of the short-term power spectrum of a sound, based on a linear cosine transform of a log power spectrum on a nonlinear mel scale of frequency.

Like the spectrogram, this turns the audio waveform into a 2D array. It works by mapping the powers of the Fourier transform of the signal onto the mel scale and then taking the discrete cosine transform of the log of those mel powers. This produces a 2D array with far fewer features than the spectrogram, effectively compressing it and speeding up training, as we are left with only 13 features per frame.

Shape of the MFCC : (293, 13)
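A minimal sketch of extracting 13 MFCCs per frame with librosa (the window and hop lengths are assumptions, chosen to match the spectrogram above):

import librosa

audio, fs = librosa.load('wav/tr_00001.wav', sr=None)
# 13 coefficients per 20 ms frame with a 10 ms hop
mfcc = librosa.feature.mfcc(y=audio, sr=fs, n_mfcc=13,
                            n_fft=320, hop_length=160).T
print(mfcc.shape)   # (time_steps, 13)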

Deep Neural Networks for Acoustic Modeling

Neural networks were inspired by the human brain, its functions, and the way it works. A deep neural network learns from previous examples and uses that experience to make predictions on new data.

Deep Neural Networks Architecture

Recurrent neurons are similar to feedforward neurons, except that they also have connections pointing backward. At each time step, each neuron receives the current input as well as its own output from the previous time step, so it has two sets of weights: one for the input and one for the output of the previous time step. Each layer takes vectors as inputs and outputs vectors. The model is trained by running forward propagation through each time step t and then backpropagation through the time steps. At every time step the speaker could have spoken any of many possible characters, so the output of the model at each time step is a list of probabilities over the possible characters.

The RNN combines an acoustic model and a language model. The language model scores sequences of characters, while the acoustic model scores sequences of acoustic labels over time. A decoding graph then maps valid acoustic label sequences to the matching character sequences. The score of a path is the sum of the score given to it by the decoding graph and the score given to it by the acoustic model, so speech recognition becomes a path search over the decoding graph: finding the character sequence that maximizes both the language and acoustic model scores.
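In its simplest form, this search reduces to greedy (best-path) decoding over the per-time-step character probabilities. A minimal sketch, assuming CTC-style outputs where the last index (28 for a 29-way output) is the blank label and index_map is a hypothetical index-to-character dictionary:

import numpy as np

def greedy_decode(probs, index_map, blank_index=28):
    # probs: (time_steps, num_characters) softmax output of the network
    best_path = np.argmax(probs, axis=1)
    chars = []
    previous = blank_index
    for idx in best_path:
        # collapse repeated labels and drop blanks (the CTC decoding rule)
        if idx != previous and idx != blank_index:
            chars.append(index_map[idx])
        previous = idx
    return ''.join(chars)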

RNN

In an RNN, the information cycles through a loop. When it makes a decision, it considers the current input and also what it has learned from the inputs it received previously. This model explores a simple RNN with 1 layer of Gated Recurrent Units.

RNN architecture

# Keras layers used by all of the models below
from tensorflow.keras.layers import (Input, GRU, Activation, Conv1D,
                                     BatchNormalization, TimeDistributed,
                                     Dense, Bidirectional)
from tensorflow.keras.models import Model

def rnn_model(input_dim, output_dim=29):
    # Input
    input_data = Input(name='the_input', shape=(None, input_dim))
    # Recurrent layer
    simp_rnn = GRU(output_dim, return_sequences=True,
                   implementation=2, name='rnn')(input_data)
    # Softmax activation layer
    y_pred = Activation('softmax', name='softmax')(simp_rnn)
    # Specify the model
    model = Model(inputs=input_data, outputs=y_pred)
    model.output_length = lambda x: x
    return model

######################################################
rnn = rnn_model(input_dim=161)  # 161 for Spectrogram / 13 for MFCC

#########################################################
# Train the model
train_model(input=rnn,
            pickle_path='rnn.pickle',
            save_model_path='rnn.h5',
            spectrogram=True)
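The train_model helper is defined elsewhere in the project; it is assumed to pair the network with a CTC loss before fitting. As a rough sketch only (this wiring is an assumption, not the project's actual helper), a CTC loss can be attached in Keras like this:

from tensorflow.keras import backend as K
from tensorflow.keras.layers import Input, Lambda
from tensorflow.keras.models import Model

def add_ctc_loss(acoustic_model):
    # extra inputs carrying the label sequences and the true sequence lengths
    labels = Input(name='the_labels', shape=(None,), dtype='float32')
    input_lengths = Input(name='input_length', shape=(1,), dtype='int64')
    label_lengths = Input(name='label_length', shape=(1,), dtype='int64')
    # map input lengths through the model's output_length lambda
    output_lengths = Lambda(acoustic_model.output_length)(input_lengths)
    # ctc_batch_cost expects (labels, y_pred, input_length, label_length)
    loss = Lambda(lambda args: K.ctc_batch_cost(*args), name='ctc')(
        [labels, acoustic_model.output, output_lengths, label_lengths])
    return Model(inputs=[acoustic_model.input, labels,
                         input_lengths, label_lengths],
                 outputs=loss)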

After training the model, we obtained the following predicted transcript for a test audio sample.

actual: በ ባህል በ ቋንቋ አንድ ናቸው
predicted: ተአንየንየንየንአንተንተንየንየንተንተንተንየ
WER: 1.0
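WER (word error rate) is the word-level edit distance between the prediction and the reference, divided by the number of reference words; because insertions also count as errors, it can exceed 1.0. A minimal sketch of computing it:

def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)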

This model's performance is insufficient, but we'll explore whether a more complicated model can help.

CNN + RNN

This model explores the addition of a Convolutional Neural Network to the RNN.

def cnn_rnn_model(input_dim, filters, activation, kernel_size, conv_stride,
                  conv_border_mode, units, output_dim=29):
    # Input
    input_data = Input(name='the_input', shape=(None, input_dim))
    # Convolutional layer
    conv_1d = Conv1D(filters, kernel_size,
                     strides=conv_stride,
                     padding=conv_border_mode,
                     activation=activation,
                     name='conv1d')(input_data)
    bn_cnn = BatchNormalization(name='bn_conv1d')(conv_1d)
    # Recurrent layer
    simp_rnn = GRU(units, activation=activation,
                   return_sequences=True, implementation=2, name='rnn')(bn_cnn)
    # Batch normalization
    bn_rnn = BatchNormalization()(simp_rnn)
    # TimeDistributed dense layer
    time_dense = TimeDistributed(Dense(output_dim))(bn_rnn)
    # Softmax activation layer
    y_pred = Activation('softmax', name='softmax')(time_dense)
    # Specify the model
    # cnn_output_length is a project helper that computes the number of
    # time steps remaining after the convolution
    model = Model(inputs=input_data, outputs=y_pred)
    model.output_length = lambda x: cnn_output_length(
        x, kernel_size, conv_border_mode, conv_stride)
    return model

############################################################
rnn_cnn = cnn_rnn_model(input_dim=161,  # 161 for Spectrogram / 13 for MFCC
                        filters=200,
                        kernel_size=11,
                        conv_stride=2,
                        conv_border_mode='valid',
                        activation='relu',
                        units=200)

#############################################################
# Train the model
train_model(input=rnn_cnn,
            pickle_path='rnn_cnn.pickle',
            save_model_path='rnn_cnn.h5',
            spectrogram=True)  # True for Spectrogram / False for MFCC

After training this model, we obtained the following predicted transcript for a test audio sample.

actual: የተባረሩ ኤርትራውያን ድርጅታቸው ተ ሸጦ ወኪሎ ቻቸው ከ አገር እንዲ ወጡ ተወሰነ
predicted: ተባሩ ኤርታራክተክታ ለው ከገር ወ
WER: 1.00

Adding a convolutional layer noticeably improved the predictions, but they are still insufficient. Next, we'll try adding bidirectional recurrent layers.

CNN + Bidirectional RNN

This model combines all of the ideas in the preceding models.

def cnn_brnn_model(input_dim, filters, activation, kernel_size, conv_stride,
                   conv_border_mode, recur_layers, units, output_dim=29):
    # Input
    input_data = Input(name='the_input', shape=(None, input_dim))
    # Convolutional layer
    conv_1d = Conv1D(filters, kernel_size,
                     strides=conv_stride,
                     padding=conv_border_mode,
                     activation=activation,
                     name='conv1d')(input_data)
    bn_cnn = BatchNormalization()(conv_1d)
    # Bidirectional recurrent layer
    brnn = Bidirectional(GRU(units, activation=activation,
                             return_sequences=True, name='brnn'))(bn_cnn)
    # Batch normalization
    bn_rnn = BatchNormalization()(brnn)
    # Additional bidirectional recurrent layers
    for i in range(recur_layers - 1):
        name = 'brnn_' + str(i + 1)
        brnn = Bidirectional(GRU(units, activation=activation,
                                 return_sequences=True, implementation=2,
                                 name=name))(bn_rnn)
        bn_rnn = BatchNormalization()(brnn)
    # TimeDistributed dense layer
    time_dense = TimeDistributed(Dense(output_dim))(bn_rnn)
    # Softmax activation layer
    y_pred = Activation('softmax', name='softmax')(time_dense)
    # Specify the model
    model = Model(inputs=input_data, outputs=y_pred)
    model.output_length = lambda x: cnn_output_length(
        x, kernel_size, conv_border_mode, conv_stride)
    return model

############################################################
cnn_brnn = cnn_brnn_model(input_dim=161,  # 161 for Spectrogram / 13 for MFCC
                          filters=200,
                          activation='relu',
                          kernel_size=11,
                          conv_stride=2,
                          conv_border_mode='valid',
                          recur_layers=2,
                          units=200)

#############################################################
# Train the model
train_model(input=cnn_brnn,
            pickle_path='cnn_brnn.pickle',
            save_model_path='cnn_brnn.h5',
            spectrogram=True)

After training this model, we obtained the following predicted transcript for a test audio sample.

actual: ጻድቃን ናቸው ተብሎ ይታመን ባቸዋል
predicted: ጻድቃ ና ቸው ተብሎው ይታመለ ባ ቸዋል
WER: 1.40

This model performed much better: adding the bidirectional recurrent layers considerably improved our model's predictions.

Conclusion

We have now trained a recurrent neural network for speech recognition and created an ASR model that can be used in a web application. The Python Flask web framework is used in this GitHub project.

Speech-to-Text Architecture
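As an illustration only (the endpoint, the featurize and decode helpers, and the loaded model are hypothetical, not the repository's actual code), a minimal Flask transcription endpoint could look like this:

from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route('/transcribe', methods=['POST'])
def transcribe():
    # the client uploads a WAV file; featurize() and decode() stand in for
    # the feature-extraction and greedy-decoding steps shown earlier
    audio_file = request.files['audio']
    features = featurize(audio_file)   # hypothetical helper
    probs = model.predict(features)    # trained Keras model (assumed loaded)
    return jsonify({'transcript': decode(probs)})

if __name__ == '__main__':
    app.run()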
