Building an English to Igbo Translation Project
Neural Machine Translation is a field that seems to have a new breakthrough every other day. I decided to get in on the act by building a model that can translate English to Igbo. Igbo is the primary native language of the Igbo people, who are found in Southeastern Nigeria, which is found in Western Africa, which is found …. you get the idea.
A Bit Of History
Before I start talking about how I built this project, let me take you back in time on the subject of machine translation.
In the 60s, machine translation used a literal (word-for-word) method: a dictionary mapped each word in the source language to its target-language equivalent, and the computer simply replaced every word in the sentence to be translated with that equivalent.
This method produced many translation errors, mainly because of the varying grammatical structures of different languages. Results began to improve drastically when statistical methods were employed in the 80s.
The introduction of Machine Learning into the field of machine translation was a game changer, and the publication of this paper in 2014 heralded even better times for machine translation.
Seq2Seq Model
The above paper described an encoder-decoder architecture built from Recurrent Neural Networks (RNNs).
A plain encoder-decoder architecture maps an input sequence to an output sequence, and there is a huge chance that their lengths differ. The encoder (a stack of Long Short-Term Memory (LSTM) or Gated Recurrent Unit (GRU) cells) accepts an input sequence and produces a final hidden state that tries to encapsulate the information in the input sequence, i.e. it encodes the input sequence.
This hidden state is then passed to the decoder (which also contains a stack of LSTM or GRU units); at each step, the decoder produces a hidden state of its own, which is passed through a softmax function to predict an output token.
In the context of this project, an English text is provided as input to the encoder, which produces an output vector that encodes the English text. This output vector is then propagated to the decoder, which predicts the right Igbo translation.
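To make that concrete, here is a minimal character-level encoder-decoder sketch in Keras, following the structure described above (the vocabulary sizes and layer size are illustrative, not the ones used in this project):

```python
from tensorflow.keras.layers import Input, LSTM, Dense
from tensorflow.keras.models import Model

num_encoder_tokens = 70   # size of the English character vocabulary (illustrative)
num_decoder_tokens = 60   # size of the Igbo character vocabulary (illustrative)
latent_dim = 256          # size of the hidden state

# Encoder: reads the English sequence and keeps only its final states.
encoder_inputs = Input(shape=(None, num_encoder_tokens))
_, state_h, state_c = LSTM(latent_dim, return_state=True)(encoder_inputs)
encoder_states = [state_h, state_c]

# Decoder: initialised with the encoder's states, it predicts the Igbo
# sequence one step at a time and ends with a softmax over the vocabulary.
decoder_inputs = Input(shape=(None, num_decoder_tokens))
decoder_outputs, _, _ = LSTM(latent_dim, return_sequences=True,
                             return_state=True)(decoder_inputs,
                                                initial_state=encoder_states)
decoder_outputs = Dense(num_decoder_tokens, activation="softmax")(decoder_outputs)

model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.compile(optimizer="rmsprop", loss="categorical_crossentropy")
```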
You can read a simple explanation of the above here.
The plain encoder-decoder architecture has a flaw: it doesn't work well with long sentences. The reason is that the single encoded vector produced by the encoder cannot capture all the information in the input sentence. An analogy is a translator who reads to the end of a sentence or paragraph, stores the information in his/her head and then, without going back over the original text, tries to translate it into the target language.
Attention Mechanism
The plain encoder-decoder architecture clearly differs from the human mode of translation.
Humans translate by looking at a few words at a time, understanding the context, and then translating those words into the target language without reading to the end of the sentence. Fortunately, there is a neural network architecture loosely based on this approach: the attention mechanism, which was introduced in this paper.
The attention mechanism is similar to the encoder-decoder architecture described above, with some differences.
The decoder doesn't just work with the final hidden state produced by the encoder but with all the hidden states the encoder produced along the input sequence. At each step of output generation, it uses a weighted combination of these hidden states to predict the right word.
In the above image, you can see an animation of how the attention model works. The opacity of each purple line connecting an English word to a French word represents how much that English word contributes to the prediction of the French word: the more opaque the line, the greater the contribution. For example, agreement contributes heavily to the target word accord, which incidentally is its French translation. Faded lines indicate words with a smaller contribution, and those words mostly provide context for the final translation.
We can also see that words in similar positions in their respective sentences have more opaque lines, which is kind of similar to the way we humans translate.
You can find detailed explanations about attention mechanism here and here.
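To make the "weighted combination" idea above concrete, here is a small NumPy sketch that scores every encoder hidden state against the current decoder state and takes the weighted average (a toy dot-product scoring scheme; real attention models learn how to score):

```python
import numpy as np

def attention_context(decoder_state, encoder_states):
    """Weighted combination of encoder hidden states.

    decoder_state:  (hidden_dim,)         the decoder's current hidden state
    encoder_states: (src_len, hidden_dim) one hidden state per source token
    """
    scores = encoder_states @ decoder_state         # one score per source token
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                        # softmax -> attention weights
    context = weights @ encoder_states              # weighted sum of encoder states
    return context, weights

# Toy example: a 5-token source sentence with hidden size 8.
context, weights = attention_context(np.random.randn(8), np.random.randn(5, 8))
print(weights)   # how much each source token contributes to this decoding step
```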
Attention mechanisms have their own flaws: they are computationally expensive and still struggle with very long sentences, even though they perform better than their plain encoder-decoder counterparts.
Transformer Model
In 2017, some researchers down at Google decided to ditch RNNs entirely and rely on attention alone, calling the new architecture the Transformer. You can find the paper, appropriately titled Attention Is All You Need, here.
The Transformer architecture used in machine translation is still a type of encoder-decoder architecture.
First, the input tokens are converted to vectors by the input embedding (the same happens to the target tokens). Because this model contains no recurrence, which would otherwise have provided information about the position of the tokens in the sequence, it needs another way to get that information. This is provided by the positional encoding, which uses sine and cosine functions of different frequencies. The positional encoding is added to the input embedding, and the result is propagated to the encoder.
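Here is a NumPy sketch of that sinusoidal positional encoding, following the formula in the paper (the dimensions are illustrative):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal positional encoding from "Attention Is All You Need".

    Even dimensions use sine and odd dimensions use cosine, each at a
    different frequency, so every position gets a unique pattern.
    """
    positions = np.arange(max_len)[:, np.newaxis]             # (max_len, 1)
    dims = np.arange(d_model)[np.newaxis, :]                  # (1, d_model)
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates                          # (max_len, d_model)

    encoding = np.zeros((max_len, d_model))
    encoding[:, 0::2] = np.sin(angles[:, 0::2])
    encoding[:, 1::2] = np.cos(angles[:, 1::2])
    return encoding

# The encoding is simply added to the token embeddings before the encoder:
# embedded = embedding(tokens) + positional_encoding(seq_len, d_model)
```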
The encoder is made up of several layers. Each layer consists of a multi-head attention block for self-attention and a feed-forward network, with residual connections around each of them.
The purpose of self-attention is similar to a human translator reading a sentence written in the source language in order to understand the author's intent. Self-attention allows the model to encode the input sequence better, replacing the function that RNNs served in the seq2seq and RNN attention models explained above.
The output of the multi-head attention shown below is a function of Q (the query), K (the key) and V (the value).
In the encoder's self-attention, Q, K and V all come from the input tokens or from the output of the previous encoder layer.
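At the heart of each attention head is the scaled dot-product attention from the paper, Attention(Q, K, V) = softmax(QKᵀ / sqrt(d_k)) V. Here is a NumPy sketch of that single computation (multi-head attention simply runs several of these in parallel on learned projections of Q, K and V):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q Kᵀ / sqrt(d_k)) V

    Q: (seq_len_q, d_k) queries, K: (seq_len_k, d_k) keys, V: (seq_len_k, d_v) values.
    """
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # how well each query matches each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over the keys
    return weights @ V                                 # weighted sum of the values
```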
The feed-forward network is made up of two linear (dense) layers with a ReLU activation between them.
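In Keras terms, that block is just two Dense layers (the 512/2048 sizes below are the ones from the paper, not necessarily the ones I used):

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

d_model, d_ff = 512, 2048   # sizes used in the paper; illustrative here

# Position-wise feed-forward network: applied to every position independently.
feed_forward = Sequential([
    Dense(d_ff, activation="relu"),
    Dense(d_model),
])
```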
The decoder is also made up of several layers, just like the encoder. Each of these layers consists of a multi-head attention block for self-attention, a multi-head attention block for encoder-decoder attention, and a feed-forward network.
The self-attention block performs the same function as the encoder's, except that Q, K and V all come from the target tokens or from the output of the previous decoder layer.
The encoder-decoder attention block differs from self-attention in that it acts on both the encoder output and the target tokens: the queries (Q) come from the previous decoder layer, while the keys (K) and values (V) are the encoder outputs. This functions like the attention mechanism found in RNN attention models.
The decoder's feed-forward network is similar to the encoder's. Its outputs are passed into a final linear (dense) layer, and the output of that linear layer is passed through a softmax function to get a prediction.
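Putting the decoder pieces together, here is a simplified sketch of one decoder layer using Keras' built-in MultiHeadAttention. This illustrates the structure rather than reproducing my project's code, and the causal mask that stops a position from attending to future target tokens is omitted for brevity:

```python
from tensorflow.keras import layers

d_model, num_heads, d_ff = 512, 8, 2048   # sizes from the paper, for illustration

def decoder_layer(targets, encoder_output):
    """One simplified decoder layer: self-attention, encoder-decoder attention,
    then the feed-forward network, each followed by a residual connection and
    layer normalisation.
    """
    # Self-attention over the target tokens.
    self_attn = layers.MultiHeadAttention(num_heads, d_model // num_heads)(
        query=targets, value=targets, key=targets)
    x = layers.LayerNormalization()(targets + self_attn)

    # Encoder-decoder attention: queries from the decoder,
    # keys and values from the encoder output.
    cross_attn = layers.MultiHeadAttention(num_heads, d_model // num_heads)(
        query=x, value=encoder_output, key=encoder_output)
    x = layers.LayerNormalization()(x + cross_attn)

    # Position-wise feed-forward network.
    ffn = layers.Dense(d_ff, activation="relu")(x)
    ffn = layers.Dense(d_model)(ffn)
    return layers.LayerNormalization()(x + ffn)

# The last decoder layer's output goes through a final Dense + softmax:
# predictions = layers.Dense(vocab_size, activation="softmax")(decoder_output)
```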
The advantage of the Transformer over RNNs and their variants is that all of its computation can be parallelized, unlike RNNs, where some computations have to be done sequentially. This makes training faster and, in my personal experience, less memory-hungry. The Transformer also has the capacity to learn longer sentences better.
A more detailed explanation of the Transformer model can be found here.
The Project
I worked on English-Igbo machine translation using all of the models described above. An adequate amount of training data is required for any machine learning project (your accuracy depends on it, of course).
For this project, I needed a dataset containing sentences in both their Igbo and English forms. The only prepared dataset I could find had just 18 sentences and was highly inadequate, so I had to build my own. To do this, I extracted the verses from the digital Igbo and English Jehovah's Witnesses Bible and then mined the relevant Igbo and English sentences from an online Igbo dictionary here. The BeautifulSoup Python package was used for both extractions.
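The extraction itself was ordinary BeautifulSoup work along these lines (the URL and the tag/class names below are placeholders, not the actual structure of either site):

```python
import requests
from bs4 import BeautifulSoup

# Placeholder URL and markup: the real pages are structured differently,
# so the tag and class names below are illustrative only.
page = requests.get("https://example.com/igbo-bible/genesis/1")
soup = BeautifulSoup(page.text, "html.parser")

# Grab the text of every verse, dropping the surrounding markup.
verses = [tag.get_text(strip=True) for tag in soup.find_all("p", class_="verse")]

with open("igbo_verses.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(verses))
```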
After getting my dataset, I had a choice of using either word-based machine translation or character-based machine translation.
Word-based machine translation involves tokenizing your sentences into words using the grammatical structure of the language and then converting them into numbers before feeding them into the model. To make this possible, you need a tokenizer for the language you are working in, and at the time of this project none was available for the Igbo language.
Character-based machine translation centers on converting every character in the text to a number and then one-hot encoding those numbers. It is used mainly for languages that do not have a tokenizer, which is why I had to use this method, as sketched below.
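A minimal sketch of that character-level encoding (toy sentences, not my actual preprocessing code):

```python
import numpy as np

sentences = ["Bia", "Rie nri"]   # toy Igbo examples

# Build a character vocabulary and map every character to an index.
chars = sorted(set("".join(sentences)))
char_to_index = {ch: i for i, ch in enumerate(chars)}

# One-hot encode every sentence: (num_sentences, max_len, vocab_size).
max_len = max(len(s) for s in sentences)
encoded = np.zeros((len(sentences), max_len, len(chars)), dtype="float32")
for i, sentence in enumerate(sentences):
    for t, ch in enumerate(sentence):
        encoded[i, t, char_to_index[ch]] = 1.0
```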
I started with an RNN attention model adapted from the one found in the Coursera Machine Translation assignment, and after training it for about a week, the results were 🤦🏽♂️. Here are some of them:
English Sentence: Come
Correct Translation: Bia
My Translation: aba<pad><pad><pad><pad><pad><pad>…
English Sentence: Eat
Correct Translation: Rie
My Translation: aba<pad><pad><pad><pad><pad><pad>…
From the above and many other results, you might deduce that the above model was in love with aba.
I then decided to use a simple encoder-decoder architecture without attention, which I adapted from François Chollet's seq2seq tutorial found here. It was miles better than the above model. Some of the results were 🤷🏽♂️. Here are some of them:
English Sentence: ascend
Correct Translation: gbago
My Translation: gba mgba
English Sentence: 19 Then he said to him Get up and be on your way your faith has made you well.
Correct Translation: 19 O wee si ya Bilie lawa okwukwe gi emewo ka ahu di gi mma.
My Translation: 19 O wee si ha O bu na ndi a ga asi Nne m aka , n ihi na o bu ezie na m na ekpe ikpe n aka.
English Sentence: desire
Correct Translation: gu aguu
My Translation: gu aguu
From the above, you can deduce that the model starts well and then veers off course. But we can all agree that it's better than the RNN attention model.
Buoyed by this progress, I decided to try the RNN attention model again with the better preprocessing I learnt along the way, and the results were 😕
English Sentence: 26 Give thanks to the God of the heavens , For his loyal love endures forever .
My Translation: 26 Kele ne Ce e e
English Sentence: language
My Translation: agu
English Sentence: 22 Then King Je hoi a kim sent El na than the son of Ach bor and other men with him to Egypt .
My Translation: 22 Eo Ee
From the above, you can see that even though these results are better than the initial RNN attention model's, they are worse than the simple seq2seq model's and objectively bad.
With the lessons learnt from the above experiments, I decided to use the Transformer model. But the Transformer is really tuned for word-based machine translation, and the Igbo language had no tokenizer I could use. What to do? 🤔
I renewed my search for an Igbo tokenizer and again failed to find one, but I was lucky enough to find the next best thing: a research paper containing an algorithm for an Igbo tokenizer and normalizer. I quickly implemented the algorithm along with some ideas I found on the web. The paper can be found here and the implemented tokenizer, packaged as a Python library, can be found here.
I used the word-based method in conjunction with the Transformer model and the results were a mix of 👌🏽 and 🧐
English Sentence: write against
Igbo Translation: detọ
My Translation: detọ
English Sentence: 7 The length of time that David lived in the countryside of the Phi·lisʹtines was a year and four months.
Igbo Translation: 7 Ụbọchị niile Devid biri n’ime ime obodo ndị Filistia dị otu afọ na ọnwa anọ.
My Translation: 7 ụbọchị niile devid biri n’ obodo ndị filistia dị otu afọ na ọnwa anọ
English Sentence: 23 Sing to Jehovah, all the earth! Announce his salvation day after day!
Igbo Translation: 23 Bụkuonụ Jehova abụ, unu ndị niile bi n’ụwa! Kwa ụbọchị, kpọsaanụ nzọpụta ọ na-enye!
My Translation: 23 bụkuonụ jehova abụ unu ndị niile bi n’ ụwa kwa ụbọchị kpọsaanụ nzọpụta ọ na- enye
English Sentence: 5 You must love Jehovah your God with all your heart and all your soul and all your strength.
Igbo Translation: 5 Jiri obi gị dum na mkpụrụ obi gị dum na ike gị dum hụ Jehova bụ́ Chineke gị n’anya.
My Translation: 5 jiri obi gị dum na mkpụrụ obi gị dum na jehova bu chineke gị n’ ihi na chineke gị hụrụ gị n’ anya
Well, you can see that there is a clear winner: the Transformer model.
All the models were written by me with help from the Coursera NLP Machine Translation course, this repo for the second attention model, this blog post for the seq2seq model, this repo by Google for the main Transformer models, and this awesome blog post and this repo for help in making my Transformer models maintainable. These aids were awesome, and I could have used them to just train on my data, but I wanted to learn how Neural Machine Translation works. Through all the tears and failures, I can only say that it was an education and it was worth it.
The GitHub repo for this Keras-based project can be found here.
Thanks to Chukwurah Chuka, Steven Oni (@steveoni), Chukwuebuka Okolo for reading drafts of this.
P.S.: While writing this post, I discovered a gorgeous site that contains parallel corpora in several languages. The site is called OPUS and you can visit it here.