Emoji Generation for News Headlines ✌️ Overview of Short Text Classification Techniques

Chloé Lagrue
15 min read · Apr 15, 2019


The human brain processes images 60,000 times faster than text, and 90% of information transmitted to the brain is visual.

Introduction

As technology advances and the volume of data available to the public multiplies daily, the attention span of the people who consume that data shortens, and it becomes crucial for platforms that provide information to capture their users' attention within seconds.

Visual information is the format the human brain processes fastest [1], so it makes sense that adding visual information to the content you deliver would help capture your users' attention. In the hope that adding these bits of visual information to snippets of text could help users scan a list of headlines faster, and reach the articles that interest them before they tire of browsing, I was assigned the task of building an emoji generator as part of my work at Snipfeed: a platform that processes thousands of headlines a day to deliver them all in one place, and cannot afford to emojize headlines manually.

Tools

I used a dataset of news headlines with “gold annotations” (i.e. emojis that were put there by the people who uploaded the articles to a news platform), and the emojis library for Python, along with the usual tools for NLP (NLTK, WordNet) and machine learning (SKLearn, PyTorch, and some Keras).

The annotated dataset isn't available for download, but I'm sure that with some scraping skills, you can build one for yourself (check out Snipfeed).

The Data — Pre-processing

The data I used contained 13,027 news articles, each with its annotated title (example below) and a link (from which we can scrape the text of the article, and more, using the newspaper library for Python).

AI-Powered, Self-Driving Robots Are Taking On a Bigger Role at Walmart Stores 🤖

I divided emojis into two groups (emotions vs. topics), and only kept the emojis that appeared at least 50 times. That turns one multilabel, multiclass problem into two single-label multiclass problems (in the end, we only want to predict one emotion emoji and one topic emoji for each headline). The 50-appearance threshold ensures we get sufficiently meaningful data about the type of headline associated with each emoji.

I also created emoji clusters by hand. This means that, for instance, all the emojis that represent fun (😜 😝 😂 😄 😅 😆) are clustered into one group. All the headlines containing one of those emojis will now contain 😂 instead, which is equivalent to assigning the label 'funny' to the headline.
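In code, the clustering boils down to a simple mapping from each member emoji to its cluster label. Here's a rough sketch (the 'fun' cluster is the one from above; the dictionary name is just illustrative):

```python
# every member of the hand-built 'fun' cluster maps to the same label emoji
FUN_EMOJIS = ["😜", "😝", "😂", "😄", "😅", "😆"]
EMOJI_TO_LABEL = {e: "😂" for e in FUN_EMOJIS}   # other clusters would be added here

def relabel(emoji):
    """Replace a gold-annotation emoji with its cluster label (or keep it as-is)."""
    return EMOJI_TO_LABEL.get(emoji, emoji)

print(relabel("😅"))   # -> 😂
```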

In the end, I’m left with the following emojis:

topic emojis
emotion emojis

There were 3,446 headlines containing a topic emoji, and 2,747 containing an emotion emoji. As the following bar charts show, both multiclass datasets are imbalanced (I used words instead of emojis for the labels because some get cut off depending on the browser, but the point here is only to illustrate the imbalance anyway):

Distributions of our subclasses within our 2 classes

It might look like the list of topics is quite short (and it is, to be honest), but the idea here is first and foremost to find a model that predicts these labels correctly despite the small size of our dataset, the multiclass setting, AND the imbalance problem; then we can worry about adding more labels.

Being Basic About It: Bags of Words & Multinomial Naive Bayes

Bags of Words

The Bag of Words (BoW) representation is a sparse matrix representation, where each item (piece of text/document) is on a row, and each word in the vocabulary (all the words used in the corpus) is on a column. In the approach we're going to use (TF-IDF, for Term Frequency-Inverse Document Frequency), each cell contains the frequency of the word in the document, weighted down by how common that word is across the corpus, so that ubiquitous words count for less than rare, distinctive ones.
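For illustration, here's a minimal scikit-learn sketch of building such a matrix (the two-headline corpus is just a toy stand-in):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "robots are taking on a bigger role at walmart stores",
    "walmart tests new self-driving robots in its stores",
]

vectorizer = TfidfVectorizer()          # builds the vocabulary and computes TF-IDF weights
X = vectorizer.fit_transform(corpus)    # sparse matrix: one row per document, one column per word

print(X.shape)                          # (2, size_of_vocabulary)
```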

Stopwords and Stemming

Stopword removal and stemming are part of the hall of fame of NLP pro-tips. Stopwords are all those little words that are all over the place and have no semantic significance (the, and, to, at, because, etc.). They're basically parasites, and removing them only makes our (the model's) job easier.

Stemming is the art of cutting a word down to its root. For instance, it turns plurals into singulars, trims verbs so the model doesn't differentiate between past and present tense, and does many more wonderful things.
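With NLTK, both steps look roughly like this (assuming the stopword list has been downloaded with nltk.download('stopwords')):

```python
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

stops = set(stopwords.words("english"))
stemmer = PorterStemmer()

headline = "Robots are taking on a bigger role at Walmart stores"
tokens = [w for w in headline.lower().split() if w not in stops]  # drop the parasites
stems = [stemmer.stem(w) for w in tokens]                          # e.g. "stores" -> "store"
print(stems)
```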

Multinomial Naive Bayes

Naive Bayes classifiers are probabilistic classifiers that use conditional probabilities (Bayes' theorem) to estimate the probability of each class for a sample. They're called Naive because they assume the variables are independent (i.e. the probability of several events occurring at the same time is equal to the product of the probabilities of each event occurring on its own). The term Multinomial comes from the probability distribution the model assumes its variables follow: a Gaussian Naive Bayes assumes continuous variables follow a normal distribution, whereas the Multinomial variant models discrete counts. The Multinomial distribution is a generalization of the Binomial distribution (which is usually people's favorite because it's so simple).

Multinomial Bayes classifiers are pretty popular in NLP, since they're good at making predictions from many discrete variables (word counts, for example).
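A minimal scikit-learn example of the classifier itself (the tiny count matrix below is a stand-in for our real TF-IDF features):

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# toy word-count features standing in for the TF-IDF matrix
X_train = np.array([[3, 0, 1], [0, 2, 4], [2, 1, 0]])
y_train = ["sports", "tech", "sports"]

clf = MultinomialNB()
clf.fit(X_train, y_train)
print(clf.predict_proba([[1, 0, 2]]))   # probability of each class for a new document
```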

Pre-processing & Formatting

We’re going to be using the Newspaper3K library for Python to scrap the texts of the articles attached to our headlines. That means we’ll be dealing with a few hundreds of words for each document, instead of a dozen.

All our documents are stored in a TF-IDF sparse matrix, and we use 2,229 samples for training and 500 for testing.

I handled the imbalance problem with over-sampling: I take the largest class, count the number of elements in it, and for every other class, I repeat elements at random until I obtain the same number of samples. In the end, there are 439 samples per label.
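Here's roughly what that over-sampling step looks like (variable names are illustrative):

```python
import random
from collections import defaultdict

def oversample(samples, labels):
    """Repeat samples at random until every class is as big as the largest one."""
    by_label = defaultdict(list)
    for x, y in zip(samples, labels):
        by_label[y].append(x)
    target = max(len(group) for group in by_label.values())
    X, Y = [], []
    for y, group in by_label.items():
        X.extend(group + random.choices(group, k=target - len(group)))
        Y.extend([y] * target)
    return X, Y
```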

Results

Let’s go ahead and look at some examples of predictions our model has made:

samples of predictions made by our MNB

The first line is the first 100 characters of the article. As you can see, the newspaper library isn't flawless, and sometimes it will scrape ads or those pop-up windows asking you to disable AdBlock… as if.

However, the predictions seem really good… If you continue looking, you can see they sometimes don’t correspond exactly to the original label, but still make a lot of sense:

samples of predictions made by our MNB: valid mistakes

And sometimes, they’ll even surpass the original labels, which were set by human annotators:

samples of predictions made by our MNB: correcting gold annotators

In the examples above, you can see our model has ‘corrected’ a rocket emoji for Nike shoes to a basketball emoji, as the article was centered around the San Antonio Spurs (who play basketball, for those I’ve lost here). It changed the money emoji to a justice emoji for that headline about Trump paying off some people to stay quiet. Honestly, I was pretty happy with that.

Let’s look at our predictions for the testing dataset in numbers:

precision and recall on each subclass of the topic category for the MNB

For each target, you can see the number of predictions that were made (which influences the confidence you can have in the statistics), precision on the class, and recall on the class. The bottom lines contain the average for each metric, as well as the weighted average (the weight of each class = the number of predictions that were made in that class).

To get an even better sense of what our predictions look like, we can also generate a confusion matrix:

confusion matrix for the MNB model

Our diagonal looks clear enough, which is a yay!
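Both of these views can be generated with scikit-learn in a couple of lines (toy labels stand in for the real predictions here):

```python
from sklearn.metrics import classification_report, confusion_matrix

# toy labels standing in for the test set's true and predicted topics
y_true = ["sports", "politics", "sports", "tech", "politics"]
y_pred = ["sports", "politics", "tech", "tech", "politics"]

print(classification_report(y_true, y_pred))   # per-class precision, recall, support, plus averages
print(confusion_matrix(y_true, y_pred))        # rows = true classes, columns = predicted classes
```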

Latent Vector Representations & Recurrent Neural Networks

Latent Vector Representations (Word2Vec vs Count-Based)

Another hall-of-fame representation for text is the latent vector representation.

There’s two types of vector representations:

  • Word2Vec, where the vectors are computed to improve probabilistic generation of words surrounding the one you’re focusing on (trained with a feed-forward neural network and an optimizer — usually SGD);
  • count-based approaches, where you basically build a co-occurence matrix (see what words often appear together or not), and perform dimension reduction to obtain vectors out of it.

In this project, I used GloVe, a pre-computed count-based representation made public by the Stanford NLP group. A trustworthy bunch.

illustration of vector representations for words

Above is a visual representation of what latent vectors should look like. One word's representation on its own is meaningless: the interesting part of vector representations is that you can compare them, and even perform computations with them. For instance, you can see on the graph that if you add the vector that goes from 'king' to 'man' to the vector for 'queen', you land on 'woman'. So the great thing about these representations is that they capture semantics.
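If you want to check the analogy yourself, here's a bare-bones way to do it (assuming you've downloaded glove.6B.100d.txt from the Stanford NLP website):

```python
import numpy as np

def load_glove(path="glove.6B.100d.txt"):
    """Load GloVe vectors from a plain-text file into a word -> vector dict."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, *values = line.split()
            vectors[word] = np.array(values, dtype=float)
    return vectors

glove = load_glove()
cosine = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# 'queen' + ('man' - 'king') should land close to 'woman'
target = glove["queen"] + glove["man"] - glove["king"]
print(cosine(target, glove["woman"]))
```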

Recurrent Neural Networks

Recurrent neural networks are neural networks that have a vector "running through them", carrying long-term information from one end to the other. At each input (word), that vector updates itself with the new information it receives. So the big difference from a basic feed-forward network is that recurrent networks take time into account.

You might have deduced earlier from the BoW section that that representation completely disregards the order the words come in. Recurrent networks, however, do take order into account, so the same word can have a different effect on our predictions depending on its context.

illustration of an RNN

Above is an illustration of what an RNN looks like. h1 is a vector you initialize with your RNN (usually just a vector of zeros). Each x is an input (here, a word represented as a vector). The input x1 mixes with h1 to form h2, which now contains the initial information plus the information from x1, and that can go on for as many steps as you want (typically, the number of words in your sentence).
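To make the recurrence concrete, here's a vanilla RNN cell unrolled by hand in NumPy (the weights are random, just to show the mechanics):

```python
import numpy as np

def rnn_forward(word_vectors, W_xh, W_hh, b):
    """Run a vanilla RNN cell over a sequence of word vectors, one step at a time."""
    h = np.zeros(W_hh.shape[0])                    # h1: initial hidden state, all zeros
    for x in word_vectors:                         # one word vector per time step
        h = np.tanh(W_xh @ x + W_hh @ h + b)       # mix the new input with the running memory
    return h                                       # final state summarizes the whole sentence

dim, hidden = 100, 200
words = [np.random.randn(dim) for _ in range(8)]   # a fake 8-word headline
h_final = rnn_forward(words, np.random.randn(hidden, dim), np.random.randn(hidden, hidden), np.zeros(hidden))
print(h_final.shape)                               # (200,)
```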

Pre-Processing & Formatting

The inputs for our RNN are going to be sentences in the form of matrices, where each row is a vector that represents a word. These matrices will be of shape (n_words, dimension), where n_words is the number of words in the sentence, and dimension is the size we’ve chosen for our latent vectors.
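Building one of those input matrices from the GloVe dictionary loaded earlier could look like this (out-of-vocabulary words are simply dropped here, which is a simplification):

```python
import numpy as np

def headline_to_matrix(headline, glove, dim=100):
    """Stack the GloVe vectors of the headline's known words into an (n_words, dim) matrix."""
    vectors = [glove[w] for w in headline.lower().split() if w in glove]
    return np.stack(vectors) if vectors else np.zeros((1, dim))   # fallback for all-unknown headlines
```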

Problems & Solutions

To train an RNN on this dataset, we'll have to solve two problems:

  • RNNs expect their input batches to have a constant shape. However, not all headlines have the same number of words, so not all our matrices will be of the same shape. To remedy that, I chose to train the RNN one sample at a time: the list of matrices the RNN receives as input therefore contains only one sample, and the shapes within it are trivially constant (because there's only one).
  • The other problem is the imbalance between our classes. As previously stated, some topics are prevalent and others barely exist. So instead of picking a sample at random each time, we're going to pick a class at random, and then a sample within that class. That way, all classes get equal representation (we're basically doing over-sampling; see the short sketch after this list).
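The class-balanced sampling from the second point is only a few lines (names are illustrative):

```python
import random

def sample_balanced(samples_by_class):
    """Pick a class uniformly at random, then a training sample within that class."""
    label = random.choice(list(samples_by_class))          # every class is equally likely
    return label, random.choice(samples_by_class[label])   # then any of its samples
```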

Training — Topic Emojis

learning curve for the RNN

Our training dataset contains 2,949 samples, which leaves 500 for testing.

Our RNN has only one hidden layer, of size 200, and uses a softmax output, which returns a probability distribution for single-label classification. If we wanted to predict several emojis at once, we might want to use a sigmoid output instead.
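A minimal PyTorch sketch of such a network (the exact architecture, loss and optimizer may differ from what was actually used; 15 classes and 100-dimensional GloVe vectors are assumptions here):

```python
import torch
import torch.nn as nn

class HeadlineRNN(nn.Module):
    """Vanilla RNN with one hidden layer of size 200 and a softmax output over topic emojis."""
    def __init__(self, embedding_dim=100, hidden_size=200, n_classes=15):
        super().__init__()
        self.rnn = nn.RNN(embedding_dim, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, n_classes)

    def forward(self, x):                  # x: (1, n_words, embedding_dim), one headline at a time
        _, h = self.rnn(x)                 # h: (num_layers=1, batch=1, hidden_size)
        return torch.softmax(self.out(h[-1]), dim=-1)   # probability for each emoji class
```

(For training, you'd typically feed the pre-softmax scores to nn.CrossEntropyLoss instead of applying softmax explicitly.)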

The graph above shows the learning curve on the training dataset for topic emojis.

Let’s look at our predictions in numbers:

precision and recall for each subclass of the topic category for the RNN

We’re about 10 points below our first system with this one, but we have to remember this model only uses the titles of our news.

Now, again our confusion matrix:

confusion matrix for the RNN

Our diagonal is clear enough here. Of course, there are only 500 testing samples, but classification seems to be pretty good.

Let’s look at a few predictions from the testing dataset (Title = news headline, Emoji = the true label, and then the 3 lines underneath are our top 3 predictions, with their negative score next to them):

samples of predictions made by the RNN

So our predictions are pretty good! We can see that some mistakes (like the one made on the first example) are forgivable. Switching from a computer screen to a mobile phone when talking about Facebook wouldn’t be a problem when applying our model to new samples. Other mistakes, like the one made on the third to last sample (about Breanna Stewart) are a little more problematic. Mistaking a basketball player for a soccer player isn’t the greatest thing, especially when our model is so confident about it.

Switching to GRU

We’re now going to try things out with a RNN slightly more robust than the vanilla version. I’m using Keras this time to put together a GRU (which follows the idea of LSTMs but have been known to outperform them in tasks with small datasets — such as ours).

One thing we’re also adding is attention. Attention is a layer in the model that computes weights to give to each input token for the prediction of each output token.

illustration of sequential attention

As you can see on the left, attention gives us a form of interpretability for neural network models. For each output in sequential predictions, the attention layer computes weights to give to each input, and adjusts the predictions accordingly. For instance, on the left, you can see the word "equipment" carried a much heavier weight in the prediction of "équipement" than the rest of the inputs.

This technique has been shown to improve the results of recurrent networks, notably on translation models.
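The Keras code isn't shown here, but a GRU classifier with a simple dot-product attention layer over its outputs could look roughly like this (the layer sizes, the pooling choice, and the use of tf.keras.layers.Attention are assumptions, not the exact setup used for the results below):

```python
from tensorflow.keras import layers, models

MAX_LEN, DIM, N_CLASSES = 30, 100, 15            # assumed padding length, GloVe size, label count

inputs = layers.Input(shape=(MAX_LEN, DIM))      # padded sequence of word vectors
hidden = layers.GRU(200, return_sequences=True)(inputs)
attended = layers.Attention()([hidden, hidden])  # dot-product self-attention over the GRU outputs
pooled = layers.GlobalAveragePooling1D()(attended)
outputs = layers.Dense(N_CLASSES, activation="softmax")(pooled)

model = models.Model(inputs, outputs)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()
```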

I’m only going to add our chosen metrics for direct comparison, as the rest of the model is pretty much the same as the previous:

precision and recall on each subclass of the topic category for the GRU

We’re 2 points above our performances with the vanilla RNN, which — yeah, it’s worth switching to GRU. This model outperforms the MNB, which ran on whole articles and not just their titles!

Now that we have a pretty good model, we're going to add labels to it, and see if our performance holds up. I did that by augmenting the dataset (by hand). There are now 20 labels instead of 15 (though only 18 of them end up being predicted). Let's look at the performance:

precision and recall on each subclass of the topic category for the GRU: augmented labels

And they’re even better! Of course, a bigger number of samples means there’s more material for learning, but it’s good to know adding classes to our predictions doesn’t lower our model’s performances.

Review of the Models’ Performances

review of performances for our models

Applying the model to emotion emojis

Now, emotions are a little trickier. Two people may not agree on which emoji should go with what. Political views, and honestly your personality, can affect the way you view certain news. Some people may be broken-hearted over sitcoms, while others find them funny. One lesson here: it's dangerous to label news headlines with emotion emojis in the first place. However, there must be a way to label some of them with confidence. And that's precisely what we're going to tweak here: confidence.

Predictions are always made with a confidence score (here, it's simply the probability returned by the predict_proba function of our models). We're using a softmax layer on our neural networks, and Multinomial Bayes probabilities automatically sum to one, so all of our outputs are probability distributions. That means if one of our predictions has a confidence score above 50%, all other predictions put together sum to under 50%, which makes it a pretty high confidence score. By setting a 50% confidence threshold on our predictions, we can have a little more faith in them:

maria menounos gets married again see wedding photos !


airwolf actor jan michael vincent has died
😔

it s not the economy , stupid


former nba all star kenny anderson suffers stroke
😔

ideas for valentine day treats
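The thresholding itself is tiny; something along these lines (names are illustrative):

```python
import numpy as np

def confident_emoji(probas, labels, threshold=0.5):
    """Return an emotion emoji only when the model is more confident than all other classes combined."""
    best = int(np.argmax(probas))
    return labels[best] if probas[best] > threshold else None

print(confident_emoji(np.array([0.7, 0.2, 0.1]), ["😔", "😂", "😱"]))    # -> 😔
print(confident_emoji(np.array([0.4, 0.35, 0.25]), ["😔", "😂", "😱"]))  # -> None
```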

Below are our metrics for the TF-IDF + MNB model (on the left) and the GRU model (on the right). For both models, we set a 50% certainty threshold for validating predictions.

precision and recall on each subclass of the emotion category for the MNB and the GRU

Each model has its pros and cons. The major pro of the MNB is that it makes predictions for each class, so we can hope to obtain more diversity from it. The major pro of the GRU is… well, the performance. The predictions are fewer, but the ones we do get are very trustworthy.

One of our main goals here has to be avoiding headlines like “3 Dead in Airplane Crash 😄”, so my decision here was to go with the GRU. However, with more research, why not consider an ensemble method that combines both?

Manually increasing Diversity

The final step here is to add some diversity to our predictions. Our models predict classes of emojis, but there are often several possible emojis for the same class. In other words, we'd like to "de-cluster" the clusters of emojis we built at the beginning of this article.

All we have to do is reverse the dictionary we built, so that our chosen label emojis become keys and lists of similar emojis become values, and then, for each prediction, pick an emoji at random from the list corresponding to our label.
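In code, that's just the reversed mapping from earlier plus a random pick (the 'fun' cluster again serves as the example):

```python
import random

# label emoji -> list of similar emojis (the clusters from the pre-processing step, reversed)
LABEL_TO_CLUSTER = {"😂": ["😜", "😝", "😂", "😄", "😅", "😆"]}

def diversify(label_emoji):
    """Swap a predicted label emoji for a random member of its cluster."""
    return random.choice(LABEL_TO_CLUSTER.get(label_emoji, [label_emoji]))
```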

Parsing & Using a Database

Machine learning is awesome (and a sure way to sound cool when people ask you what you do), but sometimes the simplest solutions are the best. In this case for instance, a database containing a list of all existing emojis with labels is gonna be a pretty nice thing to have on hand.

The emojis library (mentioned earlier) has that in store. Each emoji has an alias (a description in one or two words), and tags. For instance, the 🎅 emoji has the alias 'santa' and is tagged 'christmas'.

By selecting some important keywords, this library makes it easy for us to obtain automatic tags. You just have to be careful what words to send in for tagging (you might want to avoid words like ‘on’ that will get you emojis like 🔛). I chose to do automatic tagging in three steps:

  • Building a list of all emojis that are worth automatically tagging. For example, the house emoji 🏠 might not be worth adding to a headline, and the link emoji 🔗 is quite pointless, even misleading. So what you want is a list of emojis that will for sure be interesting to have in your headline, as they will automatically appear there once the corresponding word is detected.
  • Part-of-speech tagging the headlines and only selecting nouns for potential automatic tagging. You have to be careful with headlines that are entirely capitalized, so I lowercase the whole thing before POS-tagging. I used NLTK's pos_tag function to obtain the tags. That's not 100% accurate (especially when you lowercase the text), but it gets the job done.
  • For each word in the headline (well, each word fit for automatic tagging, so nouns): looking for an exact match in emoji aliases. For instance, the word 'horse' will directly find the 🐴 emoji. If that doesn't work, look for an emoji that matches the stem of the word; that means 'horses' will become 'horse' and find the emoji. If neither of those methods works, look for emojis that carry the word as a tag. That last option is particularly helpful when it comes to countries (once again, I suggest building your own list of countries, those with famous flags) and concepts such as Christmas or Halloween. You can then obtain a list of emojis that correspond to your word, and pick one. I chose to pick one at random each time, to increase diversity, but you can just pick the first one returned if you like.
fooled by fake paperwork , the met bought a stolen egyptian coffin for 4 million
⚰️
tyrrell winston is the artist turning new york s trash into art
🎨
ringing the vegetarian bell : taco bell making 2019 commitments with new menu launch
🌮
how the usta national campus has transformed the american tennis scene
🎾
georgia man charged in plot to attack white house with anti tank rocket
🚀
the history of the world according to cats
🐱

Above are a few examples of headlines that were automatically tagged. (If you can't see the emojis properly, just copy-paste them into a chat bot or something; the first one's a coffin and the second a palette.) As you can see, this method works pretty well, and it's the shortest piece of code in this whole thing. Typical.
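For reference, here's a sketch of what those three steps can look like. The alias and tag dictionaries below are tiny stand-ins for the emojis library's database (its exact lookup API varies by version), and NLTK's tokenizer and tagger data need to be downloaded first:

```python
from nltk import pos_tag, word_tokenize
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# tiny stand-ins for the full alias/tag database of the emojis library
ALIAS_TO_EMOJI = {"horse": "🐴", "coffin": "⚰️", "tennis": "🎾", "art": "🎨"}
TAG_TO_EMOJI = {"christmas": "🎅"}

def auto_tag(headline):
    """Step 1: exact alias match; step 2: stemmed alias match; step 3: tag match."""
    nouns = [w for w, tag in pos_tag(word_tokenize(headline.lower())) if tag.startswith("NN")]
    for word in nouns:
        for key in (word, stemmer.stem(word)):
            if key in ALIAS_TO_EMOJI:
                return ALIAS_TO_EMOJI[key]
        if word in TAG_TO_EMOJI:
            return TAG_TO_EMOJI[word]
    return None

print(auto_tag("How the USTA national campus has transformed the American tennis scene"))   # expected: 🎾
```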

References

  1. T-Sciences, Humans process visual data better
