The Naive Bayes Classifier for Public Health Help Desk Triage

Charles Copley
Patient Engagement Lab
Apr 30, 2018
Word cloud of questions to the South African National Department of Health MomConnect SMS/WhatsApp Help Desk

MomConnect is one of the most successful examples of an mHealth platform in the world. Led by the South African National Department of Health, it currently enrols almost all mothers attending South African government clinics into a support platform that uses USSD, SMS and, now, WhatsApp. The service has recently been described in a series of papers in BMJ Global Health.

I will focus on one of the papers, Optimising mHealth helpdesk responsiveness in South Africa: towards automated message triage, which explores options for improving the efficiency of the Help Desk. The Help Desk is a key feature of the MomConnect programme and has been running for nearly three years, providing an invaluable service to mothers who need additional help. Mothers can SMS (and, more recently, WhatsApp!) questions to the service and receive assistance from qualified health care practitioners. Making this process more efficient using machine learning techniques would be incredibly useful, and that is what I will discuss today.

Naive Bayes

Naive Bayes is a technique that has been used since at least the 1950s for this kind of classification problem. It is conceptually easy to understand, highly scalable and robust, and often used as a benchmark for more elaborate classification algorithms. It is based on Bayes' theorem.

Thomas Bayes

To understand Naive Bayes, let's take a simple example. We are given the following sentences, each labelled as either Positive or Negative sentiment:

  1. Happy baby chuckling | Positive
  2. Happy bear rolls | Positive
  3. Chuckling baby smiles | Positive
  4. Angry baby cries | Negative
  5. Angry bear growls | Negative

We will use these sentences to train a classifier, then use the classifier to determine the sentiment of a new sentence it has not seen before:

Angry baby owl

The word owl is chosen arbitrarily here; it could be substituted with any word that was not included in the training set.

If we look at the example sentences, we could start with a prior probability: an unseen sentence has a 60% probability of being Positive and a 40% probability of being Negative. This uses the fact that 3/5 (60%) of the sentences are labelled Positive and 2/5 (40%) are labelled Negative.
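The prior calculation is a one-liner. Here is a minimal Python sketch using the five training sentences (variable names are mine, not from the paper):

```python
from collections import Counter

# Toy training set from the example: (sentence, label) pairs
training = [
    ("Happy baby chuckling", "Positive"),
    ("Happy bear rolls", "Positive"),
    ("Chuckling baby smiles", "Positive"),
    ("Angry baby cries", "Negative"),
    ("Angry bear growls", "Negative"),
]

# Prior probability of each label = fraction of training sentences carrying it
label_counts = Counter(label for _, label in training)
priors = {label: count / len(training) for label, count in label_counts.items()}
print(priors)  # {'Positive': 0.6, 'Negative': 0.4}
```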

However, we can do better than this by using conditional probabilities: the probability of something happening conditional on (or, equivalently, given) some other event happening. This just uses the additional information encoded by the words in the sentences. For example, the label is more likely to be Negative if the word angry appears in the sentence; the probability of a Negative label conditional on the word angry is higher than the probability of a Positive label conditional on the word angry. In fact, angry is only ever associated with the Negative label (2/6) and never with the Positive label (0/9). We write the conditional probabilities as:

P(angry | Negative) = 2/6

P(angry | Positive) = 0/9

The word baby is associated with both Negative and Positive.

P(baby | Negative) = 1/6

P(baby | Positive) = 2/9

The word owl is not associated with either of the labels so doesn’t help us much.

P(owl | Negative) = 0/6

P(owl | Positive) = 0/9
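These conditional probabilities are just relative word counts per label. A minimal Python sketch of the counting (the helper names are mine, not from the paper):

```python
from collections import Counter

# Training sentences from the example, grouped by label
positive = ["Happy baby chuckling", "Happy bear rolls", "Chuckling baby smiles"]
negative = ["Angry baby cries", "Angry bear growls"]

def word_counts(sentences):
    """Count lower-cased word occurrences across a list of sentences."""
    return Counter(w.lower() for s in sentences for w in s.split())

pos_counts, neg_counts = word_counts(positive), word_counts(negative)
pos_total, neg_total = sum(pos_counts.values()), sum(neg_counts.values())  # 9 and 6

def cond_prob(word, counts, total):
    """P(word | label) as a raw relative frequency (no smoothing yet)."""
    return counts[word.lower()] / total

print(cond_prob("angry", neg_counts, neg_total))  # 2/6
print(cond_prob("baby", pos_counts, pos_total))   # 2/9
print(cond_prob("owl", pos_counts, pos_total))    # 0/9 = 0.0
```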

How do we properly handle all of these conditional probabilities and combine them with the prior probabilities? This was originally formalised by Thomas Bayes, a Presbyterian minister in the 18th century, resulting in the famous Bayes' Theorem, which is today used in applications as diverse as poker playing and weather forecasting.

Bayes' Theorem is given below.

P(Label|Words) = P(Words|Label) x P(Label) / P(Words)

And can be read as :

“The probability of the Label conditional on the Words is equal to the probability of the Words conditional on the Label, multiplied by the prior probability of the Label and normalised by the probability of the Words over all possible outcomes.”

We already know the prior probabilities P(Label), i.e. P(Positive) = 60% and P(Negative) = 40%. The normalisation constant, P(Words), is only important if we want the probabilities to sum to one; since we are only interested in comparing relative likelihoods (as is often the case) we can omit this step for now and focus on calculating the conditional probabilities P(Words|Label), e.g. P(Happy,baby,chuckling|Positive).

In order to do this we make the (very strong) assumption of independence between words, giving rise to the “Naive” in our model. This assumption allows the following:

P(Happy,baby,chuckling|Positive) = P(Happy|Positive) x P(baby|Positive) x P(chuckling|Positive)

Having made this assumption, everything becomes a counting exercise. The counts below give us all the probabilities we need.

The six distinct words associated with Positive (nine word occurrences in total):

happy,baby,chuckling,bear,rolls,smiles

and the five distinct words associated with Negative (six occurrences in total):

angry,baby,cries,bear,growls

The conditional probabilities across all words associated with a label must sum to 1.

We wish to evaluate the probability of each label conditional on the words in the sentence

  1. P(Positive|angry,baby,owl)
  2. P(Negative|angry,baby,owl)

in order to decide which category the sentence belongs in.

Bayes' Theorem coupled with the assumption of independence says we just need to evaluate

  1. P(Positive|angry,baby,owl)

= P(angry|Positive) * P(baby|Positive) * P(owl|Positive) * P(Positive)

= 0/9 * 2/9 * 0/9 * 0.6

= 0

2. P(Negative|angry,baby,owl)

= P(angry|Negative) * P(baby|Negative) * P(owl|Negative) * P(Negative)

= 2/6 * 1/6 * 0/6 * 0.4

= 0
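A quick sketch makes the problem concrete: a single unseen word collapses both products to zero. The frequencies below are hard-coded from the counts worked out earlier:

```python
from math import prod  # Python 3.8+

priors = {"Positive": 0.6, "Negative": 0.4}
# Unsmoothed relative frequencies P(word | label) from the training counts
cond = {
    "Positive": {"angry": 0/9, "baby": 2/9, "owl": 0/9},
    "Negative": {"angry": 2/6, "baby": 1/6, "owl": 0/6},
}

scores = {}
for label in ("Positive", "Negative"):
    # One zero factor (owl never seen; angry never seen as Positive)
    # wipes out the whole product for both labels
    scores[label] = priors[label] * prod(cond[label][w] for w in ["angry", "baby", "owl"])

print(scores)  # {'Positive': 0.0, 'Negative': 0.0}
```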

Above we see the final problem: the zero values. We can avoid these by using Additive (or Laplace) Smoothing. In its simplest form we add one to each of the word counts and then add the total number of distinct words (ten in the example above, including owl) to the denominator, so that the probabilities still sum to one. Redoing the calculation with the smoothed counts:

  1. P(Positive|angry,baby,owl)

= P(angry|Positive) * P(baby|Positive) * P(owl|Positive) * P(Positive)

= 1/19 * 3/19 * 1/19 * 0.6

= 0.00026

2. P(Negative|angry,baby,owl)

= P(angry|Negative) * P(baby|Negative) * P(owl|Negative) * P(Negative)

= 3/16 * 2/16 * 1/16 * 0.4

= 0.00058

We can now actually calculate the normalising constant that we ignored earlier, P(Words). In this case the sum over all possible outcomes is 0.00058 + 0.00026 = 0.00084. Using this normalisation constant in Bayes' Theorem ensures that the probabilities sum to one. The probabilities of the two labels are then:

P(Negative|angry,baby,owl) = 0.00058/0.00084 = 0.69

P(Positive|angry,baby,owl ) = 0.00026/0.00084 = 0.31
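The entire worked example fits in a short, self-contained Python sketch (an illustration only; the fixed vocabulary size of ten is taken from the example, and this is not the production pipeline, which used Spark):

```python
from collections import Counter

training = [
    ("Happy baby chuckling", "Positive"),
    ("Happy bear rolls", "Positive"),
    ("Chuckling baby smiles", "Positive"),
    ("Angry baby cries", "Negative"),
    ("Angry bear growls", "Negative"),
]

# Priors: fraction of training sentences per label (0.6 / 0.4)
priors = {l: c / len(training) for l, c in Counter(l for _, l in training).items()}

# Per-label word occurrence counts
counts = {l: Counter() for l in priors}
for sentence, label in training:
    counts[label].update(w.lower() for w in sentence.split())

def classify(sentence, vocab_size=10):
    """Laplace-smoothed Naive Bayes: add 1 to each word count and add the
    vocabulary size (ten distinct words, including 'owl') to the denominator."""
    words = [w.lower() for w in sentence.split()]
    scores = {}
    for label in priors:
        total = sum(counts[label].values())  # 9 for Positive, 6 for Negative
        score = priors[label]
        for w in words:
            score *= (counts[label][w] + 1) / (total + vocab_size)
        scores[label] = score
    norm = sum(scores.values())  # the P(Words) normalisation constant
    return {label: s / norm for label, s in scores.items()}

result = classify("Angry baby owl")
print({l: round(p, 2) for l, p in result.items()})  # Negative: 0.69, Positive: 0.31
```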

From these numbers it is 2.23 times more likely that the sentence is Negative than Positive; expressed differently, there is a 69% chance that the sentence is Negative and a 31% chance that it is Positive. Depending on the costs of misclassification we could require different levels of certainty before classifying, or we could simply choose the most likely option.

For example, in an email spam classifier, if we mistakenly classify an email as spam (a false positive) you might never see it again. This could be very inconvenient, so we might require very high certainty (e.g. 99.99%) that an email is spam before classifying it as such. If we occasionally let a spam email through (a false negative) the user just marks it as spam, which is low cost, so we might tolerate a higher false-negative rate. Medical tests such as HIV tests are an example where false positives are potentially less costly than false negatives, if you accept the argument that someone falsely diagnosed with a disease can simply be tested again. Of course this still costs money, with estimates upwards of $4 billion spent on false-positive mammography results in the US.

Further Improvements

Further improvements can usually be made by normalising words; two examples of this are stemming and lemmatization, under which angry, anger and angriest might all be mapped to a single form. Stemming achieves this with a simple per-language, rule-based approach that ignores the context of each word, while lemmatization is more sophisticated and uses the word's context. In our case, questions are received over an SMS service: users often do not use standard vernacular, questions are asked in a number of languages, and multiple languages are often mixed within a single question. Since questions are frequently in non-standard English, in languages other than English, or in a mix of the two, a stemming algorithm based on standard English is likely to perform poorly, and lemmatization is unlikely to do much better under these conditions.
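To see why, consider a deliberately crude suffix-stripping stemmer (a toy sketch of the rule-based idea; a real system would use something like the Porter stemmer). English suffix rules do something on English inflections but nothing useful on, say, Nguni-language words, whose singular/plural distinction lives in prefixes:

```python
def crude_stem(word, suffixes=("iest", "ier", "ing", "ed", "er", "s")):
    """Strip the first matching English suffix — a toy illustration of
    rule-based stemming, not a real algorithm like Porter's."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Conflates standard English inflections tolerably well...
print([crude_stem(w) for w in ["angriest", "crying", "smiles"]])
# → ['angr', 'cry', 'smile']

# ...but 'child'/'children' in isiXhosa/isiZulu differ by prefix (um-/aba-),
# so English suffix rules cannot relate them at all
print([crude_stem(w) for w in ["umntwana", "abantwana"]])
# → ['umntwana', 'abantwana'] — left distinct, no normalisation achieved
```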

As more data are accumulated they can easily be incorporated into the model. The model itself is language-agnostic, so different languages can be handled simultaneously.

Results

We have trained a Naive Bayes classifier on the approximately 50 000 questions that were sent to the National Department of Health Maternal Help Desk between October 2016 and August 2017. The questions are in all eleven of the official South African languages with each question assigned one of nine possible labels by a trained nurse.

A random sample of 75% of the data was used to train the classifier, with the unseen 25% used to test its performance. The classifier was trained using the Apache Spark machine learning libraries, with the summary statistics created using the R caret package. The classifier achieved an accuracy of 85.4% on these unseen data. An existing limitation is that the data are heavily skewed towards the “Question” label; a more evenly distributed training set could improve the results in future.
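The split-and-score procedure itself is simple enough to sketch in plain Python (the real pipeline used Spark; the toy data, the "Question"/"Compliment" labels, and the majority-class baseline standing in for the trained model are all illustrative placeholders):

```python
import random
from collections import Counter

def evaluate(data, train_fn, test_frac=0.25, seed=0):
    """Random 75/25 split: train on 75% of (text, label) pairs,
    report accuracy on the held-out 25%."""
    rng = random.Random(seed)
    data = data[:]
    rng.shuffle(data)
    cut = int(len(data) * (1 - test_frac))
    train, test = data[:cut], data[cut:]
    predict = train_fn(train)  # train_fn returns a text -> label function
    correct = sum(predict(text) == label for text, label in test)
    return correct / len(test)

def majority_baseline(train):
    """Stand-in model: always predict the most common training label."""
    top = Counter(label for _, label in train).most_common(1)[0][0]
    return lambda text: top

# Illustrative skewed toy corpus, echoing the skew towards "Question" labels
toy = [("msg %d" % i, "Question" if i % 3 else "Compliment") for i in range(40)]
print(evaluate(toy, majority_baseline))
```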

A pregnant mom registered on WhatsApp

Conclusion

This is a promising early result towards providing simple infrastructure to aid question classification and, eventually, triage of messages into Urgent vs non-Urgent depending on the content of the question. If nurses were to label questions as urgent (e.g. domestic abuse, bleeding) vs non-urgent (e.g. nutritional advice, sleeping difficulties), this classifier could be used to automatically prioritise Help Desk messages. This would allow Help Desk efficiency to improve and, as more data are collected, the classifier to become increasingly accurate.

We believe that the results of this simple Naive Bayes algorithm point to a future where a classifier helps us run an efficient Help Desk, supporting the increased engagement we are seeing from moms following the WhatsApp integration and switchover.
