NLP Resources for African Languages — Luganda/Kinyarwanda Translation Model
About four months ago, I set out to start building Natural Language Processing tools for African languages. The overall aim is to remove language as a social barrier to success in the global economy of the 21st century. People all over the world need to be able to learn in their own language, especially when using computers or accessing information on the Internet. If you’d like to read my initial thoughts at the start of all this, here’s the previous post.
I spent the better part of the past four months doing some research. The major language groups on the African continent are as follows (source: Wikipedia):
- The Niger-Congo family with a population of 800 million. Major groups include Akan, Fula, Igbo, Kongo, Mande, Mooré, Yoruba, Zulu and Swahili covering the West Africa, Central Africa, Southern Africa and East Africa regions
- The Afro-Asiatic family with a population of 100 million. Major groups include Amhara, Hausa, Oromo, Somali, Tachelhit and Berber, covering North Africa, the Horn of Africa and the Sahel regions
- The Nilo-Saharan family with a population of 60 million. Major groups include the Dinka, Kanuri, Luo, Maasai and Nuer occupying the Nile Valley, Sahel and East Africa regions
- The Khoisan family with a population of 1 million. Major groups include Nama, San, Sandawe and Kung ǃXóõ located in Southern Africa and Tanzania
Groups such as the Austronesian and the Indo-European are also widely spoken in Africa, but because of their origin and prominence elsewhere in the world, I chose not to focus on them. Below is a map with the geographic distributions of these languages. (Source: Tracing African Roots blog)
Whereas in my previous post I mentioned the intent to begin with Luhya, Kiswahili and Amharic, initial research made me begin to consider the languages not as independent entities, but rather as members of their larger ethno-linguistic groups, where a lot of grammatical characteristics and vocabulary are shared. Leveraging the similarities of languages in the same linguistic family opens up possibilities for exploring transfer learning to share lexical and sentence-level representations across multiple languages further along the line. That being said, I chose the Niger-Congo family, where the Bantu languages fall, for my initial focus, and particularly languages in the J and S zones according to the Guthrie Classification of Bantu Languages. These include Luganda, Chiga, Nyankore, Soga, Haya, Luhya, Nande, Nandi, Kinyarwanda, Rundi, Ha, Shona, Ndau, Sepedi, Sesotho, Tswana, South-Ndebele, North-Ndebele, Xhosa, Zulu, Swazi, Tsonga, Tswa, Ronga, Kalanga, Nambya and Venda. I am sure there are others.
My first project has been to build a Luganda-Kinyarwanda translation model based on word vector embeddings.
Word embeddings: Word embedding is the collective name for a set of language modeling and feature learning techniques in natural language processing (NLP) where words or phrases from the vocabulary are mapped to vectors of real numbers. Conceptually it involves a mathematical embedding from a space with one dimension per word to a continuous vector space with a much lower dimension. (definition from Wikipedia)
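To make the "one dimension per word" versus "much lower dimension" contrast concrete, here is a tiny Python sketch. The words are ones that show up later in this post, but the numbers in the embedding table are made up purely for illustration; real embeddings are learned from text and have many more dimensions.

```python
# Toy vocabulary; a real one has tens of thousands of entries.
vocab = ["nnyindo", "erinnya", "emu", "ekintu"]


def one_hot(word, vocab):
    """One dimension per word: all zeros except a 1 at the word's index."""
    return [1.0 if w == word else 0.0 for w in vocab]


# Hypothetical 2-dimensional embedding table (values invented for this sketch).
embeddings = {
    "nnyindo": [0.21, -0.73],
    "erinnya": [0.65, 0.10],
    "emu": [-0.44, 0.52],
    "ekintu": [0.05, 0.91],
}

sparse = one_hot("emu", vocab)  # length grows with the vocabulary
dense = embeddings["emu"]       # length stays fixed at the embedding dimension
print(len(sparse), len(dense))
```

The one-hot vector gets longer every time a word is added to the vocabulary, while the dense vector stays the same size no matter how many words there are.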
Step 1: Scraping some data from the Internet
I identified a website with Bible translations in several languages and was pleasantly surprised to find loads of translations into African languages. Perhaps I should not have been too surprised given that exploration of the African continent was first done by missionaries and, in a bid to evangelize, they began learning the local languages, teaching English and translating religious texts. Unfortunately, due to the terms and conditions of use of the site, I am unable to share the actual text data. Their terms prohibit scraping or harvesting of data for distribution purposes. However, if you are interested, here is a link to the tool that I used to scrape some data from the site. Feel free to do so for your own personal use.
As my current interest is in Bantu languages and, for now, particularly those in the J and S zones according to Guthrie’s Classification, I’ve got versions in Luganda, Rundi, Sesotho, Shona, Kinyarwanda, Sepedi, Setswana, Xhosa, Ndebele and IsiNdebele. Plus Kiswahili and English.
Step 2: Cleaning and tokenizing the data
I needed to do some basic preprocessing of the text. This involved removing punctuation from the texts, making everything lowercase, removing the numbers that marked chapters and verses of various books of the Bible and finally tokenizing the texts.
Tokenize: to break text into individual linguistic units
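A minimal Python sketch of this kind of preprocessing might look like the following. It is not the exact script used for this project; the function name `clean_and_tokenize` is made up for illustration, and splitting on whitespace is an assumption about what "tokenizing" means at this stage.

```python
import re
import string


def clean_and_tokenize(line):
    """Lowercase, drop digits and punctuation, then split on whitespace."""
    line = line.lower()
    # Chapter and verse markers are plain numbers, so drop all digit runs.
    line = re.sub(r"\d+", " ", line)
    # string.punctuation is ASCII only, so curly quotes are listed separately.
    line = re.sub("[" + re.escape(string.punctuation) + "‘’“”]", " ", line)
    return line.split()


print(clean_and_tokenize("3 I saved the world."))
```

One thing to watch with this approach: apostrophes inside a word are treated as punctuation too, so a word written with an apostrophe ends up as two tokens. Whether that is acceptable depends on the language's orthography.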
Step 3: Training of vector word embedding models
I came across GloVe, a model for learning word vector representations, and decided to give it a try. I went ahead and trained embeddings for Luganda and Kinyarwanda. There are several steps to achieving this.
The vocab_count script allows one to extract unigram counts from text data. Here’s the top 20 or so for English and Luganda, just to give you a sense of what the output is like. You get a file with each line containing a word and next to it the number of times that the word has appeared in the text you are working with.
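For a sense of what this step computes, here is a small Python sketch that produces the same kind of word-and-count listing. It is an illustration only, not GloVe's actual vocab_count implementation; the function name `unigram_counts` and the `min_count` parameter are assumptions made for this example.

```python
from collections import Counter


def unigram_counts(lines, min_count=1):
    """Count how often each whitespace-separated token appears.

    Returns (word, count) pairs, most frequent first, ties in alphabetical order.
    """
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    ordered = sorted(counts.items(), key=lambda wc: (-wc[1], wc[0]))
    return [(word, count) for word, count in ordered if count >= min_count]


toy_corpus = ["i am a girl", "i saved the world"]
for word, count in unigram_counts(toy_corpus):
    print(word, count)
```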
Next is finding the word-word co-occurrence statistics of our corpus using the cooccur script. This essentially analyzes the context within which a word is most often used by taking into account its neighbouring words. The output of this step is a binary file.
Here’s a simplistic example to try and explain that concept. From the sentences below, our co-occurrence statistics would pick up on the fact that the pronoun ‘I’ is more often than not followed by a verb as opposed to, say, a noun. The last sentence is ungrammatical, so sentences like it should rarely, if ever, show up in a real corpus.
I am a girl.
I saved the world.
I cows are always hungry.
Excerpt from the GloVe documentation to further (better) explain co-occurrence: As one might expect, ice co-occurs more frequently with solid than it does with gas, whereas steam co-occurs more frequently with gas than it does with solid. Both words co-occur with their shared property water frequently, and both co-occur with the unrelated word fashion infrequently.
I used the default window of 15 words to arrive at the co-occurrence statistics for this particular experiment.
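The idea of counting neighbours within a window can be sketched in a few lines of Python. This is a simplified illustration rather than what the cooccur tool does internally: it assumes a symmetric window and weights each pair by 1/distance (my understanding is that GloVe applies a similar weighting by default, but treat that as an assumption). The function name `cooccurrence_counts` is made up.

```python
from collections import defaultdict


def cooccurrence_counts(sentences, window=15):
    """Accumulate weighted co-occurrence counts for every pair of tokens
    that appear within `window` positions of each other in a sentence."""
    counts = defaultdict(float)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            # Look back at up to `window` preceding tokens.
            for j in range(max(0, i - window), i):
                weight = 1.0 / (i - j)  # closer neighbours count for more
                counts[(word, tokens[j])] += weight
                counts[(tokens[j], word)] += weight  # symmetric context
    return counts


pairs = cooccurrence_counts([["i", "am", "a", "girl"]], window=2)
print(pairs[("i", "am")], pairs[("i", "a")])
```

With a window of 2, ‘i’ and ‘girl’ are three positions apart and never get counted together, which is why the window size matters.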
The shuffle tool is used to shuffle entries in the word-word co-occurrence binary files and once again outputs binary files which we can then use to train our GloVe models.
The glove tool takes as input the vocabulary count files output by the vocab_count script as well as the shuffled binary files output by the shuffle script and gives, as output, the GloVe model. I chose to have the model represented as a .txt file with each word and its vector values. Below is sample output.
The first line of each file contains the count of words/tokens whose vector representations the file contains and the number of dimensions of the vectors. In this case, my Luganda file has 42,262 tokens, my Kinyarwanda file has 53,096 tokens and both files contain 50-dimensional vectors.
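Given that layout (a header line with the token count and dimension, followed by one word and its values per line), a file like this can be read with plain Python. The sketch below assumes space-separated values and UTF-8 encoding; `load_vectors` is a hypothetical helper, not part of GloVe.

```python
def load_vectors(path):
    """Read a text embedding file whose first line is '<count> <dimensions>'.

    Returns a (vectors, dimensions) pair, where vectors maps word -> list of floats.
    """
    vectors = {}
    with open(path, encoding="utf-8") as handle:
        count, dim = (int(value) for value in handle.readline().split())
        for line in handle:
            parts = line.rstrip("\n").split(" ")
            if len(parts) != dim + 1:
                raise ValueError("unexpected number of values for " + parts[0])
            vectors[parts[0]] = [float(value) for value in parts[1:]]
    if len(vectors) != count:
        raise ValueError("header count does not match number of vectors read")
    return vectors, dim
```

The two checks are there because a mismatch between the header and the body is an easy mistake to make when editing these files by hand.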
Step 4: Aligning the word vector embeddings
What I achieved in step 3 is to represent the words in my text corpora in a vector space. This however does nothing for my attempts at building a translation model unless these vector embeddings are aligned. If the vocabulary related to ‘family’ is located in the north-west region of my Luganda vector space while in the south-east region of my Kinyarwanda vector space, I need to first align them, have the vector representations overlap, before I can begin the task of translation based on the vectors of each word.
The family example is simplistic: saying north-west and south-east implies a 2-dimensional space, whereas in this instance my vectors have 50 dimensions. If the space really were 2-dimensional, aligning the vectors might have been a simple case of plotting and visualizing both, then rotating one set until they align.
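To make the 2-dimensional intuition concrete, here is a toy Python sketch that finds the single rotation angle that best lines up one set of 2-d points with another. The points are made up, and the sketch assumes we already know which source point corresponds to which target point, which is exactly the information I do not have for real word vectors.

```python
import math


def best_rotation(source, target):
    """Angle (radians) that rotates `source` onto `target` with the least
    squared error, assuming the points are paired up in order."""
    dot = sum(sx * tx + sy * ty for (sx, sy), (tx, ty) in zip(source, target))
    cross = sum(sx * ty - sy * tx for (sx, sy), (tx, ty) in zip(source, target))
    return math.atan2(cross, dot)


def rotate(points, theta):
    """Rotate each (x, y) point counter-clockwise by theta radians."""
    c, s = math.cos(theta), math.sin(theta)
    return [(x * c - y * s, x * s + y * c) for x, y in points]


# Made-up data: the "target space" is the source space turned by 90 degrees.
source = [(1.0, 0.0), (0.0, 2.0), (-1.0, 1.0)]
target = rotate(source, math.pi / 2)

theta = best_rotation(source, target)
aligned = rotate(source, theta)
print(math.degrees(theta))
```

In higher dimensions the same idea becomes a search for a rotation-like matrix rather than a single angle, and without paired points the correspondence itself has to be learned.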
Given the high dimensions in question, I used Facebook MUSE’s unsupervised learning approach to learn a mapping between the source space (Luganda vectors) and the target space (Kinyarwanda vectors). The output is once again .txt files with the various words and their vectors, only this time the two sets of vectors are aligned in a shared space. As I have no previously built dictionaries for the two languages, the only means left for me to do some degree of testing of the models is by visualizing the word vectors.
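Once two sets of vectors share a space, one simple way to read off candidate translations is a nearest-neighbour lookup by cosine similarity. The sketch below is not part of the pipeline described in this post; the function names are made up and the 2-d vectors are invented stand-ins for aligned 50-dimensional embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_neighbours(word, source_vectors, target_vectors, k=3):
    """Return the k target-language words closest to `word`, best first."""
    query = source_vectors[word]
    scored = [(cosine(query, vec), candidate)
              for candidate, vec in target_vectors.items()]
    scored.sort(reverse=True)
    return [(candidate, score) for score, candidate in scored[:k]]


# Invented vectors, chosen so that related words point in similar directions.
luganda = {"erinnya": [0.9, 0.1], "emu": [0.1, 0.8]}
kinyarwanda = {"izina": [0.8, 0.2], "rimwe": [0.2, 0.9], "izuru": [-0.7, 0.3]}

print(nearest_neighbours("erinnya", luganda, kinyarwanda, k=1))
```

With real embeddings the top neighbour will often be wrong, so looking at the top few candidates rather than just the first is usually more informative.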
Step 5: Visualizing the word vector embeddings
For this step, I used gensim to read in the vector models as KeyedVectors and then used a combination of matplotlib, a Python plotting library, and t-SNE from sklearn, a technique for visualizing high-dimensional data.
Visualizing all of the output looked something like this for Luganda…
There were just too many tokens, 42,262 to be exact, so I purposefully picked out a handful of tokens of interest and visualized those.
Here’s a visual of some random words I chose to plot. The first, cleaner visualization simply shows the words in Kinyarwanda and in Luganda. In some cases, I found more than one word for the same thing so the count is not necessarily equal. Luganda words are in green while Kinyarwanda words are in blue.
The second, slightly busier image is the same visualization except with English translations next to the words. If the word has a ‘null’ next to it, that just means I never got round to getting its translation. Doing that was a very tiring and annoying manual process, so I did the bare minimum.
Finally, here’s the same image another 5 times, each time highlighting a word of interest whose Luganda and Kinyarwanda forms appeared in the same ‘neighbourhood’ of the vector space, which made me very excited at my little win.
- nose(eng), nnyindo(lg), izuru(rw)
- to love each other(eng), kwagalana(lg), gukundana(rw)
- thing(eng), ikintu/ekintu/engoye(lg), kintu(rw) — besides being in the vicinity of each other in the vector space, the similarity in vocabulary to some extent validates my hypothesis that systematically building resources for closely related languages at a go is a good idea.
- one(eng), emu(lg), rimwe(rw)
- name(eng), erinnya(lg), izina(rw)
…aannnndd, there you have it folks. Of course there are other instances where the translations are quite unintuitive; let’s attribute that to the fact that I did not work too hard at refining the model. Here’s a gist with the visualizing code.
If you’re reading this and wondering how you can help, here’s a couple of ways you could…
- If you know of any websites with text translated to any bunch of African languages, please point me in their direction in the comments section, particularly Bantu languages in the J and S zones, but really just all African languages
- If you are a native speaker of any of the languages I am currently working on and can spare half an hour or so to help with the task of dictionary building, I’m starting to build out small dictionaries which would make future testing/validation and visualization of models easy. Again please reach out in the comments or via a message. I’m currently working through the details of how this will work efficiently
- If you see any other way you can plug in and help out, also ping me
As next steps, I’m going to be working on tools to better tokenize Bantu languages. Conventional tools are wanting in this instance because Bantu languages tend to be agglutinative. In many instances, the subject, verb, object and various qualifiers can be contained in one word whereas the equivalent in a European language would be a sentence made up of several words. I am also going to work on aligning the texts I already have at sentence level and vary the creation of word vector embeddings to see if I can get better models. Feel free to leave recommendations of any tools, techniques, papers and previous work you think I should check out.
That’s it for now. Let’s hope it isn’t another 4 months before you hear from me again.