Comprehending Emoticons

“smiley paint on gray ground in front of people” by Nathan Dumlao on Unsplash

AWS’s Comprehend is a natural language processing service that uses machine learning to identify a text’s language, extract key phrases, understand sentiment (how positive or negative the text is) and automatically organise text by topic. You can analyse text and apply the results in a wide range of applications, finding insights and relationships in the text. There is no need to understand how it works: just call an API and get the results. Simple?

Machine learning algorithms need to be not only accurate but also understandable and explainable. It is important to test how they behave in your specific domain.

I was interested in analysing user feedback, which often includes emoticons. I was curious how Comprehend would handle these ubiquitous little characters.

To test this out, I called the Detect Sentiment API with a test string: “Comprehend makes me happy”. The results were what you would expect, with the API returning a positive sentiment overall and a positive score of 93%:

{
  "Sentiment": {
    "Sentiment": "POSITIVE",
    "SentimentScore": {
      "Positive": 0.932614266872406,
      "Negative": 0.008453690446913242,
      "Neutral": 0.05423713102936745,
      "Mixed": 0.004694902338087559
    }
  }
}
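As a rough sketch, a call like this could be made from Python with boto3. The helper names `make_client` and `sentiment_summary` and the default region are assumptions for illustration, not part of the original test. As far as I know, boto3 returns `Sentiment` and `SentimentScore` at the top level of the response rather than nested as in the output above, so the sketch reads them from there.

```python
def make_client(region_name="us-east-1"):
    """Create a Comprehend client (assumes boto3 is installed and AWS
    credentials are configured; the default region is an assumption)."""
    import boto3  # imported lazily so sentiment_summary can be tried with a stub
    return boto3.client("comprehend", region_name=region_name)


def sentiment_summary(client, text, language_code="en"):
    """Return the overall label and each score rounded to a whole percentage."""
    response = client.detect_sentiment(Text=text, LanguageCode=language_code)
    scores = response["SentimentScore"]
    percentages = {name: round(value * 100) for name, value in scores.items()}
    return response["Sentiment"], percentages


# Example usage (needs AWS access, so it is left commented out):
# comprehend = make_client()
# print(sentiment_summary(comprehend, "Comprehend makes me happy"))
```

Passing the client in as a parameter keeps the helper easy to exercise with a stand-in object when no AWS account is available.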

Replacing the word ‘happy’ with the emoticon ‘😀’ gave the updated test string “Comprehend makes me 😀”, which resulted in a neutral sentiment overall and a positive score of only about 9%:

{
  "Sentiment": {
    "Sentiment": "NEUTRAL",
    "SentimentScore": {
      "Positive": 0.08551795035600662,
      "Negative": 0.03741350769996643,
      "Neutral": 0.8686702847480774,
      "Mixed": 0.00839831680059433
    }
  }
}

What went wrong? To find out, let’s call the Detect Syntax API. The response shows that the emoticon is being interpreted as punctuation:

{
  "TokenId": 4,
  "Text": "😀",
  "BeginOffset": 20,
  "EndOffset": 21,
  "PartOfSpeech": {
    "Tag": "PUNCT",
    "Score": 0.9972658157348633
  }
}

This is not ideal: any text that contains emoticons is not being analysed correctly.
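The part-of-speech check above could be reproduced along these lines. This is a minimal sketch, assuming a boto3 Comprehend client is passed in and that the response lists its tokens under `SyntaxTokens`. The helper name `tokens_with_tag` is hypothetical.

```python
def tokens_with_tag(client, text, tag, language_code="en"):
    """Return the text of every token that Detect Syntax labels with `tag`."""
    response = client.detect_syntax(Text=text, LanguageCode=language_code)
    return [
        token["Text"]
        for token in response["SyntaxTokens"]
        if token["PartOfSpeech"]["Tag"] == tag
    ]


# Example usage (needs AWS access, so it is left commented out):
# import boto3
# comprehend = boto3.client("comprehend")
# print(tokens_with_tag(comprehend, "Comprehend makes me 😀", "PUNCT"))
```

Running this over a sample of real feedback is a quick way to see which emoticons end up tagged as punctuation.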

Fortunately, the solution is simple: before passing the text to Comprehend’s API, pre-process it to replace any emoticons with their corresponding words. For example, 😀 becomes ‘happy’, 💗 becomes ‘love’ and so on.
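A minimal sketch of that pre-processing step, assuming a small hand-written lookup table. The table holds only the two pairs mentioned above; the names `EMOJI_TO_WORD` and `replace_emoji` are hypothetical, and a real mapping would need to cover far more characters.

```python
import re

# Hand-written lookup table; extend it for the emoticons seen in your own data.
EMOJI_TO_WORD = {
    "😀": "happy",
    "💗": "love",
}

# One pattern that matches any emoticon in the table.
_EMOJI_PATTERN = re.compile("|".join(re.escape(e) for e in EMOJI_TO_WORD))


def replace_emoji(text):
    """Swap each known emoticon for its word, padded with spaces so that
    'not😀' becomes 'not happy' rather than 'nothappy'."""
    spaced = _EMOJI_PATTERN.sub(lambda m: f" {EMOJI_TO_WORD[m.group(0)]} ", text)
    # Collapse the extra whitespace; note this also flattens line breaks.
    return " ".join(spaced.split())


print(replace_emoji("Comprehend makes me 😀"))  # Comprehend makes me happy
```

The cleaned string can then be sent to Detect Sentiment in place of the original text.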

Notes:

  1. If you are using Python, there is a great little emoticon project that will ‘demojize’ text.
  2. Other approaches deal with emoticons by assigning each one a sentiment: if the text contains a ‘positive’ emoticon, the text is treated as positive. This approach is flawed, because the text could contain negative modifiers (not😀), which would not be interpreted correctly.

If you have an interest in machine learning and love solving complex problems, come and have a chat with the Momenton crew.