🙋‍♀️ Does Emoji Use Correlate With Twitter Engagement?

Data analysis of tweets from the emoji angle (with charts!). Bonus: Most used emoji revealed.

Marta
The Startup

I don’t know about you, but the day I showed the emoji keyboard to my mum was the beginning of a new era of comedy.

My life was complete, and all channels of communication became emojified.

You might not be an emoji person, I get it. 💁‍♀️ That’s why you have me. I’m the one who’ll add custom emoji to your team’s Slack channel and probably to every single company tweet. Which brings us to the topic of this post.

But this post is not about the history, science (?), or language of emoji (plenty has been written already), nor even about the CTR of emails or notifications that incorporate them (apparently higher than for regular ones). This post is even better!

It’s a data analysis of emojis used in our brand’s (JAM) tweets.

Since it’s mostly me who sends these tweets, it’s also to some extent an analysis of my favourite emojis or, even more accurately, of emojis I consider most relevant to the text of the tweets and the audience.

In any case, read on and expect wonders.

🚰 Emoji source: The data set

The dataset was downloaded from Twitter Analytics and covers the period from April to September 2020. As much as I’d love to have more, Twitter doesn’t store data going further back or, if it does, it doesn’t make it available to the account owner.

The dataset includes all the tweets sent in this period from the account @makingjam. It has 529 rows and 40 columns, most of which we won’t need.

I’ll spare you a report on data cleaning. All the steps and the full analysis of this data are in the notebook.

Emoji? Emojis? Let’s count the uncountable.

There are two libraries you can use to handle emojis in text, emojis and emoji. They offer different methods, but for my needs, “emojis” was sufficient. As a first step, we’ll count the emojis used in the ‘text’ column, which holds the text of each tweet.
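
The counting step might look something like the sketch below. It is not the notebook's code: to stay dependency-free it swaps the "emojis" library for a rough regular expression over common emoji code-point ranges. `EMOJI_RE`, `count_emoji`, and the sample tweets are all illustrative assumptions, and the pattern will miss some emoji (flags and keycaps, for example).

```python
import re

# Rough approximation of "an emoji" (an assumption, not the `emojis` library's logic):
# a pictograph, optional skin-tone/variation modifiers, and any zero-width-joiner
# continuations, so that a joined sequence counts once.
_BASE = "[\U0001F300-\U0001FAFF\u2600-\u27BF]"
_MOD = "[\U0001F3FB-\U0001F3FF\uFE0F]"
EMOJI_RE = re.compile(f"{_BASE}{_MOD}*(?:\u200D{_BASE}{_MOD}*)*")


def count_emoji(text: str) -> int:
    """Return the number of emoji (or emoji sequences) found in text."""
    return len(EMOJI_RE.findall(text))


# Made-up stand-ins for the 'text' column of the Twitter Analytics export.
tweets = ["New post is live 🚀🎉", "Thanks for joining us!", "Tickets on sale 🎫"]
emoji_counts = [count_emoji(t) for t in tweets]
print(emoji_counts)
```

Treating a joined sequence as a single emoji is a design choice; counting raw code points instead would inflate the totals for emoji built from several parts.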

Now we can plot the distribution of emoji.
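
Before reaching for a plotting library, a quick tally gives the same picture in text form. This is only a sketch: `emoji_counts` holds invented per-tweet counts, not the real data, and `distribution` is a name chosen here.

```python
from collections import Counter

# Hypothetical per-tweet emoji counts (the real ones come from the 'text' column).
emoji_counts = [0, 0, 1, 0, 2, 1, 0, 3, 0, 1]

# Map "emojis in a tweet" -> "number of tweets with that many emojis".
distribution = Counter(emoji_counts)

for n_emoji in sorted(distribution):
    n_tweets = distribution[n_emoji]
    print(f"{n_emoji} emoji: {'#' * n_tweets} ({n_tweets} tweets)")
```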

Contrary to my expectation, most tweets have 0 emojis in them!

I’ll have to compensate for that in the future. 😝 In all honesty, those are probably replies, as they are typically short.

Next we want to count all emojis in the corpus. I used the same method on the regular and on the tokenized version of the text and observed a disparity.

emoji_sum = 544 and emoji_sum_tk = 493

This seemed like a strange result. But, looking at a few rows of the tokenised text explains where it came from.

There are many emotions in these tweets…

When several emojis appear next to each other, the tokenizer treats them as one token rather than as separate ones.
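
A tiny example makes the gap concrete. The notebook's tokenizer isn't shown in this post, so the sketch below assumes plain whitespace splitting as a stand-in, and `is_emoji_char` is a rough, hypothetical code-point check rather than a library call.

```python
def is_emoji_char(ch: str) -> bool:
    """Rough check (an assumption): is this code point in a common emoji block?"""
    cp = ord(ch)
    return 0x1F300 <= cp <= 0x1FAFF or 0x2600 <= cp <= 0x27BF


text = "Launch day 😍😍😍 see you there 🎉"  # invented example tweet

# Counting on the raw text sees every emoji character.
emoji_in_text = sum(is_emoji_char(ch) for ch in text)

# Counting on tokens sees a run of adjacent emojis as a single token.
tokens = text.split()
emoji_tokens = sum(1 for tok in tokens if is_emoji_char(tok[0]))

print(emoji_in_text, emoji_tokens)  # 4 and 2
```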

Lesson learned! For further analysis we’ll therefore use 544 as the total number of emojis in the corpus.

What does this number tell us? If we also count characters and words, it can tell us what percentage of all words and of all characters are emojis.

The percentage of emoji among all characters is 0.53%.

And the percentage of emoji among all words is: 3.59%

3.6% emoji per word. That makes it one emoji every 28 words — be grateful this text is not written like that. 😅
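
The "one emoji every 28 words" figure is just the reciprocal of the per-word share. A quick check, using only the percentage reported above (the variable names are mine):

```python
emoji_share_of_words = 0.0359  # 3.59% of words are emoji, as reported above

# If a fraction p of all words are emoji, then on average one word in 1/p is an emoji.
words_per_emoji = 1 / emoji_share_of_words
print(round(words_per_emoji))
```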

💄 Line up the participants for the top emoji contest!

To figure out which emoji has been used most frequently, we’ll have to create a frequency list. We can easily create a frequency list of all words with a counter object, and then extract the emojis from there.

Then it’s just a matter of sorting them and turning them into a tidy data frame.
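
In outline, those two steps could look like this. It is a sketch rather than the notebook's code: the tokens are invented, `is_emoji_token` is a rough hypothetical filter based on code-point ranges, and the result is left as a sorted list of pairs instead of a data frame to avoid extra dependencies.

```python
from collections import Counter


def is_emoji_token(token: str) -> bool:
    """Rough filter (an assumption): does the token start with an emoji code point?"""
    cp = ord(token[0])
    return 0x1F300 <= cp <= 0x1FAFF or 0x2600 <= cp <= 0x27BF


# Invented tokens standing in for the tokenised corpus.
tokens = ["join", "us", "👉", "tickets", "🎫", "here", "👉", "🎉", "👉", "🎫"]

# 1. Frequency list of every token, words and emojis alike.
freq = Counter(tokens)

# 2. Keep only the emoji entries, most frequent first.
emoji_freq = sorted(
    ((tok, n) for tok, n in freq.items() if is_emoji_token(tok)),
    key=lambda pair: pair[1],
    reverse=True,
)
print(emoji_freq)
```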

What’s up with that ticket? My friends, I truly don’t know.

Now it’s time for very sad news.

I know what you expected next: a neat chart with emojis on the x axis and the frequency on the y axis. Turns out this is far more complex to execute than we’d like it to be. I tried. But when, 20 Stack Overflow tabs later, there was still no answer, I hope you’ll forgive me for dropping the case and finding a workaround.

The workaround is: we can decode the emoji, that is, represent them as words, and then plot the chart. So, here is a function that decodes the emoji and plots the frequency of use.
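
As a dependency-free sketch of the decoding idea, the standard library is enough. Note the substitution: instead of the "emojis" library's decoder, this uses Unicode character names from `unicodedata`, and `decode_emoji` and the counts are illustrative assumptions. The resulting labels and values are what you would hand to a bar chart.

```python
import unicodedata


def decode_emoji(emoji: str) -> str:
    """Turn a single-code-point emoji into a readable label via its Unicode name."""
    return unicodedata.name(emoji[0], "unknown").lower()


# Invented (emoji, count) pairs standing in for the real frequency list.
emoji_freq = [("🚀", 5), ("🔥", 3), ("🎫", 2)]

labels = [decode_emoji(e) for e, _ in emoji_freq]
values = [n for _, n in emoji_freq]
print(labels)  # text labels for the x axis
print(values)  # bar heights for the y axis
```

Plain text labels sidestep the font problems that make emoji glyphs hard to render on chart axes.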

A careful observer will see immediately that this uses parts of the code above.

Let’s run…

plot_top_emoji(10)

Challenge: send me a tweet that uses all of these emojis and still makes sense.

Looks like we have a winner! Without pointing fingers…👉👉

Pointing fingers at correlations

That’s all great, you say, but what does it all tell us about Twitter engagement?

We can check correlations between the number of emojis in a tweet and various engagement metrics. To do this we turn to the Pearson correlation coefficient.

Note: As an earlier step I created a data frame with selected numerical columns, df_correl, and calculated sentiment for each tweet (watch for a separate post on text analysis!).
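
For reference, the Pearson coefficient can be computed by hand in a few lines. The sketch below is not the notebook's code, which works on the df_correl data frame; the `pearson` helper and both lists are invented for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation: covariance of xs and ys divided by the product of their spreads."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Invented toy values, NOT the real tweet data.
emoji_count = [0, 1, 2, 3, 4]
character_count = [40, 55, 90, 80, 120]

r = pearson(emoji_count, character_count)
print(round(r, 3))
```

The result always falls between -1 and 1, which is what makes the coefficients for different metric pairs comparable.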

Output, aka the BIG REVELATION!

The number of emojis in a tweet correlates most negatively with sentiment_positive (-0.113), and most positively with character_count (0.385).

As the standard interpretation of these values has it, 0.7 would be a strong positive correlation, 0.5 moderate, and 0.3 weak (add a minus sign and you get the negative counterparts). So, going by the numbers alone, there is nothing exciting here.

Now, time for the big revelation: the more emojis, the more characters in a tweet!

Phew, aren’t you glad we checked that?! Who would have thought! 😝

Well, we didn’t know what correlations we’d find until we checked, so good job we did.

🚶‍♂️ Next steps

What I’d like to investigate is whether the use of any particular emoji can predict higher engagement.

To do this, I imagine I’d have to extract the emojis from each tweet, vectorize them, and, at the simplest level, calculate correlations between each emoji-specific column and the engagement rate. This data frame is probably too small to produce any statistically significant results.
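
The vectorizing step could be as simple as one presence column per emoji. In this sketch the per-tweet emoji lists are invented and assumed to be already extracted, and `vectorize` is a hypothetical helper; each resulting column could then be correlated with the engagement rate.

```python
def vectorize(emojis_per_tweet):
    """Build one 0/1 column per emoji: 1 if the tweet contains it, else 0."""
    vocabulary = sorted({e for tweet in emojis_per_tweet for e in tweet})
    return {
        emoji: [1 if emoji in tweet else 0 for tweet in emojis_per_tweet]
        for emoji in vocabulary
    }


# Invented example: the emojis found in each of four tweets.
emojis_per_tweet = [["🚀", "🎉"], [], ["🚀"], ["🎫", "🚀"]]

columns = vectorize(emojis_per_tweet)
for emoji, column in columns.items():
    print(emoji, column)
```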

If you have better ideas on how to execute this, or see flaws in my thinking, tell me!

📈 Aspiring data scientist. Rationality fan. EA. Vegan. Working to improve global mental health at MindEase.io