Q: What are the signs of labor?

A: Signs of labor include a jelly-like discharge, your water breaking, and regular and painful labor contractions. Make sure you can get to a hospital.

TL;DR: We gained access to a fairly large, anonymized, noisy, multilingual question-answering dataset in the maternal healthcare domain (MomConnect). Our goal was to investigate ways in which we can reduce the burden on the currently overwhelmed, fairly standardized MomConnect helpdesk through automation. We encountered several challenges within the dataset, most notably noisy questions, code-mixing, near-duplicate answers, and the fact that many of the languages are low-resource. We reduced the number of near-duplicate answers and made use of cross-lingual word embeddings, which learned from the shared context created by the standardized responses, to deal with the code-mixing and multilinguality. We also kept the whole vocabulary to preserve words from low-resource languages. We reduced the dataset to questions whose answers occurred more than 128 times in the entire dataset. We treated the problem as an answer selection task, where each question is classified to the correct answer in a predefined set of answers. Our best classification model was a 512-unit LSTM with the word embedding space trained end-to-end, achieving an accuracy of 62.13% and a recall@5 of 89.56% on the test set. The recall@5 results are promising, as such a model can provide top-5 suggestions to the MomConnect helpdesk. An MVP version is currently being developed by the Praekelt.org tech team!

MomConnect

MomConnect is a free service offered to women who register their pregnancy at a public health facility in South Africa. The service is available in all 11 official languages, and provides information and emotional support to women throughout and after their pregnancy using both text-messaging and WhatsApp. The messages are intended to be appropriate to each mother-to-be's stage of pregnancy (and the newborn's age).

MomConnect has successfully registered over 2.6 million users since 2014. As could be expected, sending out messages often generates responses and/or questions from the mothers-to-be registered on the programme! These incoming messages are handled manually by a staffed helpdesk, with questions posed via SMS and WhatsApp. The recent introduction of WhatsApp as an additional channel to SMS has increased the volume of questions substantially, causing a growing backlog of unanswered questions. Currently the median response time is about 20 hours.

The majority of questions are answered according to expert-crafted templates (as demonstrated in the introduction). While questions can be posed in any of the 11 official languages of South Africa, the template answers are still in English. The standardized nature of the answering process presented a unique opportunity for a feasibility study into automating the process. During 2018 and 2019, we (researchers from Stellenbosch University and Praekelt.org) did just that, and will be presenting our paper, “Towards Automating Healthcare Question Answering in a Noisy Multilingual Low-Resource Setting”, at The 57th Annual Meeting of the Association for Computational Linguistics (ACL) in Florence, Italy, at the end of July 2019.

This unique dataset provided us with the opportunity to apply computational linguistics techniques to a real-world application for social good. The ability to scale MomConnect would mean greater access to this healthcare platform in South Africa.

Data challenges encountered

As can be expected from a real-world dataset, there were several issues that needed to be overcome before we could proceed to a formal research problem. The text questions were very noisy, with many typos, spelling errors, and unwanted characters. Adding to the noisiness, there were many substantively identical answers that differed slightly in word content and/or punctuation. This made it harder to identify which answers were the most popular.

Language identification and separation were also quite challenging. A language label (for example “Northern Sotho”, “Afrikaans”, “Venda”) is recorded when a user signs up at the clinic, but users are free to ask questions in any language, and in many cases used multiple languages in the same sentence (code-mixing). In a country like South Africa, with 11 official languages, the end result is that language-specific models could not realistically be trained. The spread of languages across the dataset was also very unequal: for example, 51,250 users chose English as their sign-up language, but only 97 chose Ndebele. While technically challenging, this noisy, multilingual dataset is also fortunate for research purposes, as it reflects a realistic setting and is what could be expected in other, similar public health services.

Finally, many of the languages in our dataset are low-resource languages that do not have widely available annotated digital resources. Low-resource languages are often geographically constrained, resulting in a lower return-on-investment for language tools when compared to high-resource languages (e.g. English, Mandarin, French). As a result, low-resource languages often lack freely available datasets, parallel corpora, and software tools such as language detection, machine translation, spell-checkers, speech-to-text, and text-to-speech. In extreme cases, governments can also actively (or passively, through bureaucracy) suppress the teaching and development of certain languages, further compounding the problem.

In the author’s opinion, many low-resource languages are doomed to stay low-resource without active intervention: creating freely available, high-quality annotated datasets and funding research into building tools for these languages.

Experimental Design

We tackled the problem as an answer selection task, where a question had to be matched to an answer in a predefined set of answers. This is much like a Frequently Asked Questions setup, which economically reuses previously answered questions to guide future answers. While we had standard English answers, we did not have standard questions, which could also be in multiple languages. Our challenge was to match questions that are similar in intent but written differently, or in two different languages.

To address the multilinguality and code-mixing, we made use of cross-lingual word embedding spaces. We also reduced the number of near-duplicate answers by replacing each near-duplicate with its more popular twin, as sketched below.
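To make the de-duplication idea concrete, here is a minimal sketch of how near-duplicate answers could be collapsed onto their most popular variant. The use of Python's `SequenceMatcher` and the 0.9 similarity threshold are illustrative assumptions, not the exact procedure we used:

```python
# Minimal sketch of near-duplicate answer merging, assuming a dict mapping
# answer_text -> frequency. Similarity measure and threshold are assumptions.
from difflib import SequenceMatcher

def merge_near_duplicates(answer_counts, threshold=0.9):
    """Map each answer onto its most popular near-duplicate variant."""
    # Visit the most popular answers first so they become the canonical forms.
    ranked = sorted(answer_counts.items(), key=lambda kv: -kv[1])
    canonical = {}  # answer text -> canonical (most popular) variant
    kept = []       # canonical variants found so far
    for text, _count in ranked:
        match = next(
            (c for c in kept
             if SequenceMatcher(None, text.lower(), c.lower()).ratio() >= threshold),
            None,
        )
        canonical[text] = match if match is not None else text
        if match is None:
            kept.append(text)
    return canonical
```

Note that a pairwise comparison like this is quadratic in the number of unique answers, so in practice one would restrict or approximate the search; the sketch only illustrates the replace-with-popular-twin idea.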

Data preparation

Our first task was to identify a subset of reliable question-answer pairs that represents the majority of the dataset and would be most effective in addressing the backlog. In the original dataset there were 42,675 unique answers (many of which were small variations of one another). The frequency distribution of these answers approximated a power law: a handful of answers accounted for about 70% of the question-answer pairs, while the rest occurred only once or twice throughout the dataset.

We wanted to address the problem using the Pareto (80/20) principle, so we set a threshold on the number of questions per answer and kept all answers above it. This subset was subsequently used for training and testing. Choosing the threshold was difficult because we had to balance wanting as many questions per answer as possible against not excluding the less-frequent languages. Ultimately we decided not to deliberate too much over the threshold, as this was a proof-of-concept. After de-duplicating the answers, we used 128 as our threshold, which left us with 126 unique answers for the answer selection task. We then split the reduced dataset into training, validation, and test sets (60:20:20).
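As a rough sketch, the thresholding and split could look like the following, assuming a pandas DataFrame `df` with `question` and `answer` columns (the column names and the use of scikit-learn's `train_test_split` are assumptions for illustration):

```python
# Keep only questions whose (de-duplicated) answer occurs frequently enough,
# then split 60:20:20 into training, validation, and test sets.
import pandas as pd
from sklearn.model_selection import train_test_split

THRESHOLD = 128  # answers kept if they occur at least this often (assumed cut-off)

answer_counts = df["answer"].value_counts()
frequent_answers = answer_counts[answer_counts >= THRESHOLD].index
subset = df[df["answer"].isin(frequent_answers)]  # ~126 unique answers remain

train, rest = train_test_split(subset, test_size=0.4, random_state=42,
                               stratify=subset["answer"])
val, test = train_test_split(rest, test_size=0.5, random_state=42,
                             stratify=rest["answer"])
```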

Sample question-answer pairs. Questions can be posed in any of South Africa’s 11 official languages, while the template answers are currently all English.

Cross-lingual Word Embeddings

It is important to note that we did not remove any stop words, in order to preserve the limited vocabulary of some of the low-resource languages. Further, some languages are conjunctively written, and we did not have access to stopword lists for many of the languages anyway. This left us with a multilingual vocabulary of size 65,547.

Considering that our data contained questions in multiple languages (including low-resource languages), with instances of code-mixing, and that the standardized English responses created a shared context across languages, we decided to create a cross-lingual embedding space. A key benefit of this approach is that different spellings, shorthands, and translations of the same word end up close together in the embedding space.

For a peek at what the cross-lingual continuous bag-of-words embedding model does, here is an example of the same word in English, Zulu, and Xhosa, and their respective closest neighbors in the embedding space, using cosine distance:

  • child: baby, bbe, babe, bby, babay
  • ingane: ingan, yami, ngane, umtwana
  • umntwana: untwana, wam, umntana, wami

Different spellings and shorthand of the same word/context tend to be clustered together, which is quite useful when working with SMS and WhatsApp messages. Note the slight overlap in the Zulu and Xhosa examples, due to the two languages being closely related.
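Here is a minimal sketch of how such a cross-lingual CBOW embedding space could be trained with gensim and queried for nearest neighbors. The tokenisation, the hyperparameters, and the idea of appending the English answer to each question (so that all languages share context) are assumptions for illustration, not the exact setup from the paper:

```python
# Train a CBOW word embedding model on question + answer text, then inspect
# the nearest neighbors of a word by cosine similarity.
from gensim.models import Word2Vec

# `pairs` is assumed to be a list of (question, answer) string tuples.
sentences = [(q + " " + a).lower().split() for q, a in pairs]

model = Word2Vec(
    sentences,
    vector_size=100,  # embedding dimensionality (assumed)
    window=5,
    min_count=1,      # keep the whole vocabulary, including rare LR words
    sg=0,             # CBOW, as described above
)

# Closest neighbors, as in the child / ingane / umntwana examples.
print(model.wv.most_similar("child", topn=5))
print(model.wv.most_similar("umntwana", topn=5))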

Answer Selection vs Answer Generation

Generating answers (e.g. with sequence-to-sequence models), instead of selecting from a predefined set, might have seemed the more obvious choice for this problem, given the number of training pairs we had. But the complexity surrounding the problem meant that generating answers would be very risky, as generated answers would not necessarily be standardized nor approved by healthcare officials. Therefore we decided to treat this problem as an answer selection task.

Classification

To select the most appropriate answer for a given question, we classify each question to one of the 126 standard answers (answer selection). As a baseline, we train a multinomial Naive Bayes (MNB) classifier on a bag-of-words representation of each question, using as our vocabulary only the 7000 most frequent words in the training set.
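The MNB baseline can be sketched in a few lines of scikit-learn; the exact preprocessing settings are assumptions, and `train` / `test` refer to the hypothetical DataFrames from the split sketch above:

```python
# Multinomial Naive Bayes over a bag-of-words limited to the 7000 most
# frequent words, evaluated with plain accuracy (Recall@1).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

mnb = make_pipeline(
    CountVectorizer(max_features=7000),  # 7000 most frequent words
    MultinomialNB(),
)
mnb.fit(train["question"], train["answer"])
print(mnb.score(test["question"], test["answer"]))
```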

We then consider k-nearest neighbor (k-NN) classification on the averaged word embeddings of each question, using uniformly weighted majority voting, for increasing values of k. We also considered locality-sensitive hashing (LSH), an approximate nearest neighbor algorithm, which sacrifices some of k-NN's accuracy for efficiency. With LSH, averaged word embeddings are randomly hashed into short binary encodings that preserve local information, enabling nearest neighbor search in sub-linear time.
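A minimal sketch of the exact k-NN variant, reusing the Word2Vec model from the embedding sketch above, could look like this (the choice of k and the uniform weighting follow the description; everything else, including the variable names, is an assumption, and an LSH variant would substitute an approximate index library for the exact search):

```python
# k-NN classification on averaged word embeddings with uniform majority voting.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def average_embedding(text, wv):
    """Average the embeddings of all in-vocabulary tokens in a question."""
    vecs = [wv[t] for t in text.lower().split() if t in wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(wv.vector_size)

X_train = np.stack([average_embedding(q, model.wv) for q in train["question"]])
X_test = np.stack([average_embedding(q, model.wv) for q in test["question"]])

knn = KNeighborsClassifier(n_neighbors=5, weights="uniform")
knn.fit(X_train, train["answer"])
print(knn.score(X_test, test["answer"]))
```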

Finally, we trained various Long Short-Term Memory (LSTM) networks end-to-end, with the word embedding layer acting as a learned feature extractor and with an increasing number of hidden units. Each model took a variable-length sequence of word IDs as input and had a softmax output layer for classification.
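For reference, a minimal Keras sketch of such an LSTM classifier is shown below. The vocabulary size, number of answers, and 512 units follow the text; the embedding dimensionality, masking setup, and optimizer are assumptions:

```python
# LSTM classifier with the embedding layer trained end-to-end and a softmax
# output over the 126 standard answers.
import tensorflow as tf

VOCAB_SIZE = 65_547   # multilingual vocabulary (no stop words removed)
NUM_ANSWERS = 126     # standard answers after thresholding
EMBED_DIM = 100       # assumed embedding dimensionality
UNITS = 512           # best-performing configuration

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE + 1, EMBED_DIM, mask_zero=True),  # +1 for padding
    tf.keras.layers.LSTM(UNITS),
    tf.keras.layers.Dense(NUM_ANSWERS, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(padded_question_ids, answer_ids, validation_data=..., epochs=...)
```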

Results

The models are evaluated using classification accuracy on the test set of 30K unseen question-answer pairs. We also identify a “low-resource” (LR) part of the test set, based on the proportion of low-frequency words per sentence, and measure accuracy on this subset as well.
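As a rough sketch of how the LR subset could be selected, one could flag test questions where a large fraction of the words are rare in the training data; the specific frequency cut-off and fraction below are assumptions, not the values used in the paper:

```python
# Flag "low-resource" questions: those dominated by low-frequency words.
from collections import Counter

train_counts = Counter(t for q in train["question"] for t in q.lower().split())

def is_low_resource(question, min_count=10, fraction=0.5):
    tokens = question.lower().split()
    if not tokens:
        return False
    rare = sum(1 for t in tokens if train_counts[t] < min_count)
    return rare / len(tokens) >= fraction

lr_test = test[test["question"].apply(is_low_resource)]
```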

Classification accuracy (Recall@1) of various models on both the full and LR set

The MNB baseline performs well on both the full test set and the LR test set, though possibly due to a bias towards the high-resource languages in its bag-of-words vocabulary. The nearest neighbor models (k-NN and k-LSH) show almost no improvement over MNB, and actually do worse on the LR set. The LSTM models perform best. Increasing the number of LSTM units increases accuracy on the full set but decreases performance on the LR set.

While the LSTM shows a significant improvement over the other models, it reaches an accuracy of only 62.13% on the full test set. This is understandable given the complexities of noisy data, multilinguality, and code-mixing, but succeeding only about 6 times out of 10 is insufficient for a real-world implementation.

In order to gauge the feasibility of a top-5 recommender system assisting a human operator, we also measure recall@5 for the MNB baseline and LSTM models. Recall@5 is calculated as the proportion of test questions for which the correct answer appears among the top 5 predictions.
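Computing recall@5 from the LSTM's softmax outputs is straightforward; the variable names below (`padded_test_ids`, `test_answer_ids`) are hypothetical and follow on from the earlier sketches:

```python
# recall@5: the correct answer counts as a hit if it appears among the
# 5 highest-scoring answers for that question.
import numpy as np

probs = model.predict(padded_test_ids)    # shape: (num_examples, 126)
top5 = np.argsort(probs, axis=1)[:, -5:]  # indices of the 5 best answers
hits = [true in row for true, row in zip(test_answer_ids, top5)]
print(f"recall@5 = {np.mean(hits):.4f}")
```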

Recall@5 on the full and LR test set

The best recall@5 performance of 89.56% on the full test set and 81.23% on the LR set is encouraging, and could be considered for a real-world application (stay tuned!). Such a model can serve in a semi-automated answer selection process, with a human in the loop choosing the final answer. This could significantly reduce the burden on the current staffing complement if approximately 70% of queries can be dealt with in a semi-automated manner. Where the human operator does not agree with any of the suggested answers, they can still manually select the correct standardized response, as is currently done. This feedback can help improve the automated response service and assist future research.

Currently, an MVP that makes use of TensorFlow Serving is being developed by the Praekelt Foundation tech team.

Final Words

Thank you for reading all the way to the end! Feel free to email me at jeanne.e.daniel@gmail.com if you have any questions. Shout out to Monika Obrocka and Charles Copley for providing feedback and suggestions during the writing of this article!
