How to build a Swahili question answering system for your M-Pesa statements.
Automatic question answering is a challenging area of AI research. There are many hurdles to clear when building a robust neural question answering system: first you must obtain the data needed to train your model, then you must build and tune the model so that it is not tripped up by the many nuances of natural language.
A number of initiatives have been launched to tackle the natural language understanding challenge. The Stanford Question Answering Dataset is a collection of 100,000+ question-answer pairs based on 500+ articles. With this dataset it is possible to train a robust machine learning model and test its reading comprehension. The most impressive attempts achieve exact match scores of 78 and F1 scores of 85.619. If questions like the ones pictured below can be answered correctly 78% of the time, there is reason to believe we are making progress towards machines that can fully understand language:

It seems, though, that if machines are ever to understand us fully, we will probably have to converse in English. This post is therefore a documentation of my feeble attempt to bring the Swahili language up to speed with current natural language understanding techniques.
So, to test the ability of my machine learning algorithm to understand Swahili, I came up with a simple challenge: use 16 rows of M-Pesa statement records to answer 300 Sheng/Swahili/English questions.
Example questions from my dataset include:

Here are the data types for the 16 M-Pesa rows.

And the M-Pesa rows:

The algorithm's job is to take in a question and point to whichever of the 16 M-Pesa statement rows it thinks contains the answer.
The following is a brief description of how the model works:
First, the words of the question are converted to random vectors of size 25, so the input matrix has dimensions [number_of_questions, number_of_words, embedding_size]. The 16 rows of M-Pesa statement records are converted to a matrix using the same random embedding. The two matrices are fed into an LSTM with attention, and the final states of the question and the M-Pesa rows are multiplied to get a similarity metric. This metric is then combined with a feature I call 'wordExists', which counts the number of words shared between the question and a given row of the M-Pesa statement.
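The input preparation above can be sketched roughly as follows. The fixed random embeddings of size 25 and the wordExists word-overlap count come from the description; the function names, the tokenizer (a plain whitespace split), and the toy rows are my own illustrative assumptions, not the actual implementation.

```python
import numpy as np

rng = np.random.default_rng(42)
EMBED_SIZE = 25  # each word becomes a random vector of size 25

def build_vocab(texts):
    """Assign every word a fixed random embedding (the embeddings are not trained)."""
    words = sorted({w for t in texts for w in t.lower().split()})
    return {w: rng.normal(size=EMBED_SIZE) for w in words}

def embed(text, vocab):
    """Return a [number_of_words, embedding_size] matrix for one question or row."""
    return np.stack([vocab[w] for w in text.lower().split()])

def word_exists(question, row):
    """The 'wordExists' feature: distinct words shared between question and row."""
    return len(set(question.lower().split()) & set(row.lower().split()))

# Toy stand-ins for two M-Pesa statement rows and one question
rows = ["01/03 paid to naivas 1500", "02/03 received from john 2000"]
question = "nililipa naivas pesa ngapi"

vocab = build_vocab(rows + [question])
q_mat = embed(question, vocab)                       # shape (4, 25)
overlap = [word_exists(question, r) for r in rows]   # [1, 0]
```

Because the embeddings are random rather than pretrained, the wordExists overlap carries a lot of the signal: it is the only feature that directly rewards exact word matches between a question and a row.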
The two combined similarity metrics are then passed to a two-layer feed-forward neural network, which outputs the probability that each row is the answer to a given question.
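A minimal numpy sketch of that final classifier: the LSTM similarity score and the wordExists count for each of the 16 rows are stacked into a feature matrix and passed through a two-layer feed-forward network with a softmax over the rows. The hidden size, weight initialisation, and random features here are illustrative assumptions, not the actual trained values.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows, hidden = 16, 8

# One (similarity, wordExists) feature pair per statement row;
# random numbers stand in for the real computed features.
features = rng.normal(size=(n_rows, 2))

W1, b1 = rng.normal(size=(2, hidden)), np.zeros(hidden)
W2, b2 = rng.normal(size=(hidden, 1)), np.zeros(1)

h = np.tanh(features @ W1 + b1)        # layer 1
logits = (h @ W2 + b2).ravel()         # layer 2: one score per row
probs = np.exp(logits - logits.max())
probs /= probs.sum()                   # softmax over the 16 rows

predicted_row = int(np.argmax(probs))  # the row most likely to hold the answer
```

Treating the task as a 16-way classification over rows, rather than span extraction as in SQuAD, is what keeps the problem tractable with only 300 questions.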
The results? 70% accuracy on the test set.

This architecture was inspired by this paper: https://arxiv.org/abs/1703.04816
The code for my model:
If you would like to suggest improvements or correct any errors in my code, reach out to me on Twitter.
