How to Write a Successful STEM-related Tweet (Part 2)

What the data has to say about Retweets, likes, and replies

Code AI Blogs | Published in CodeAI | 5 min read | May 22, 2021

Introduction
Data Source
Setup
Exploratory Data Analysis
Conclusion
References

Introduction

Searching high and low for what makes a Tweet popular, we’ve scoured Twitter, gathering data on thousands of posts. But finding the secret to Tweet success by scrolling through data entries in an Excel spreadsheet is like looking for a needle in a haystack. This is where data analysis comes in.

With data collected, we are one step closer to our goal of predicting the number of Retweets a STEM-related Tweet would receive.

Here in part 2, we’ll dive into just enough data exploration to build our final machine learning model.

Data Source

For this initial data analysis, we’ll use the data collected in part 1 of this Tweet success series:

How to Write a Successful STEM-related Tweet (Part 1)

What I learned while failing to create a Tweet success predictor

codeaiblogs.medium.com

The gathered data consists of 2873 entries that each have the following attributes:

datetime: Tweet creation date and time in UTC
id: Tweet ID
url: permalink pointing to Tweet location
tweet: text content of Tweet
retweets: Retweet count
likes: like count
replies: reply count
quotes: count of users that quoted the Tweet and replied
media: list of media types within Tweet
num_mentions: number of mentions
num_links: number of links
user num followers: number of followers
user num statuses: number of statuses
user num favourites: number of Tweets that the user has liked
user num listed: user listed count
user verified: whether the user account is Twitter verified
user account creation: user account creation date and time in UTC

Setup

First things first, I’ll import all the Python libraries I’ll need for this project:

Next, we’ll configure our graph settings for later data analysis.

And now, we’ll load our dataset into a pandas DataFrame.

Before moving onto the data analysis, we’ll need to clean up our data a bit. Let’s start by removing duplicate Tweets:

# dropping tweets with duplicate ids (keep first occurrence)
tweets_df = tweets_df.drop_duplicates(subset='id', ignore_index=True)

Now we’ll drop the id and url columns as we no longer need the Tweet IDs and won’t be using their URLs.

# dropping id column
tweets_df = tweets_df.drop(['id'], axis=1)
# dropping url column
tweets_df = tweets_df.drop(['url'], axis=1)

We’ll also remove links and mentions from the Tweet content:
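A minimal sketch of this cleanup step might look like the following. It assumes that links start with http://, https://, or www. and that mentions are an @ followed by word characters; the function name strip_links_and_mentions is hypothetical.

```python
import re

# Assumed patterns: links begin with http(s):// or www., mentions are @ plus word characters.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+")
MENTION_PATTERN = re.compile(r"@\w+")


def strip_links_and_mentions(text: str) -> str:
    """Remove URLs and @mentions, then collapse the leftover whitespace."""
    text = URL_PATTERN.sub("", text)
    text = MENTION_PATTERN.sub("", text)
    return " ".join(text.split())


cleaned = strip_links_and_mentions(
    "Thanks @nasa for the launch! https://t.co/abc123 #space"
)
print(cleaned)
```

On a DataFrame, the same function could be applied with tweets_df['tweet'].apply(strip_links_and_mentions). Note that the mention pattern would also match the tail of an email address, so it may need tightening for messier text.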

Finally, we’ll find and filter out giveaway Tweets.

Based on some preliminary data analysis, giveaway Tweets tend to attract disproportionately large amounts of engagement, which may skew our results. We’re also more interested in genuine STEM-related Tweets for this project, and giveaway Tweets aren’t representative of those.
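One simple way to flag giveaway Tweets is a keyword match on the Tweet text. The sketch below assumes that approach; the keyword list and the is_giveaway helper are hypothetical and would likely need tuning against real data.

```python
import re

# Hypothetical keyword list for spotting giveaways; adjust to the dataset.
GIVEAWAY_PATTERN = re.compile(r"\b(giveaway|give away|giving away)\b", re.IGNORECASE)


def is_giveaway(text: str) -> bool:
    """Return True if the Tweet text mentions a giveaway keyword."""
    return GIVEAWAY_PATTERN.search(text) is not None


# Toy examples, made up for illustration.
tweets = [
    "GIVEAWAY: retweet to win a Raspberry Pi!",
    "New paper on protein folding is out today",
]
kept = [tweet for tweet in tweets if not is_giveaway(tweet)]
print(kept)
```

With pandas, the equivalent filter would be tweets_df = tweets_df[~tweets_df['tweet'].apply(is_giveaway)].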

Exploratory Data Analysis

Now that we have our DataFrame set up, let’s dive into data analysis!

To start, we’ll double-check that we have no null values while taking a closer look at our data using df.info():

We’ll also look at some quick stats using df.describe().
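To show what these two calls report, here is a small sketch on a toy DataFrame. The values are made up for illustration and are not taken from the real dataset; toy_df is a stand-in for tweets_df.

```python
import pandas as pd

# Toy stand-in for tweets_df; the numbers are invented for illustration only.
toy_df = pd.DataFrame(
    {
        "retweets": [0, 1, 3, 250],
        "likes": [0, 4, 10, 900],
        "replies": [0, 0, 1, 12],
    }
)

toy_df.info()                          # prints column dtypes and non-null counts
null_counts = toy_df.isnull().sum()    # explicit per-column count of missing values
stats = toy_df.describe()              # count, mean, std, min, quartiles, max

print(null_counts)
print(stats.loc[["mean", "max"]])
```

Comparing the mean row with the max row is a quick way to spot columns dominated by a few very large values.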

Next, let’s take a look at our engagement distribution:

Unfortunately, the vast majority of Tweets do not receive any engagement at all. To take a closer look, we’ll plot each type of engagement individually:

Since the peaks of our engagement distributions are on the far left of their plots, all of our histograms are heavily right-skewed. Because of this, it’s hard to see the variation at lower values. To solve this, let’s plot engagement on a log scale:
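To see why a log scale helps, here is a small sketch of one common transform, log10(1 + x), applied to toy counts. The values are made up, and the choice of this particular transform is an assumption rather than a record of how the original plots were drawn.

```python
import math

# Toy engagement counts, invented to span several orders of magnitude.
engagement = [0, 1, 9, 99, 999, 9999]

# Adding 1 keeps zero-engagement Tweets defined, since log10(0) is not.
log_engagement = [math.log10(1 + value) for value in engagement]

for raw, logged in zip(engagement, log_engagement):
    print(f"{raw:>5} -> {logged:.2f}")
```

A range of 0 to 9999 is squeezed into 0 to 4, so the crowded low end gets as much room as the long tail. With matplotlib, the transformed values can be passed straight to a histogram; setting a log-scaled axis on the raw counts is an alternative, though it cannot display zeros.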

If we look back at our quick stats from df.describe(), the mean number of Retweets is 47, but the highest number is 2686! Similarly for likes, the mean is 170, but the max is 11434. This implies that our dataset contains a few outliers that receive far more engagement than all other Tweets.

Some of our data isn’t in the most suitable form at the moment, so we’ll begin extracting new features from our existing data.

For our user-related attributes, we’ll transform the user account creation attribute from a DateTime to the number of days since account creation.
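A minimal sketch of this conversion follows. It assumes the age is measured relative to the Tweet's own creation time, which is only one possible reference point (the collection date would be another); account_age_days is a hypothetical helper.

```python
from datetime import datetime, timezone


def account_age_days(created_at: datetime, reference: datetime) -> int:
    """Whole days between account creation and the reference time."""
    return (reference - created_at).days


# Toy timestamps in UTC, made up for illustration.
account_created = datetime(2020, 1, 1, tzinfo=timezone.utc)
tweet_created = datetime(2021, 5, 22, tzinfo=timezone.utc)

age = account_age_days(account_created, tweet_created)
print(age)
```

If both columns are already parsed as datetimes, the pandas equivalent would be (tweets_df['datetime'] - tweets_df['user account creation']).dt.days.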

We’ll also convert user verified from Booleans to integers:

tweets_df['user verified'] = tweets_df['user verified'].astype(int)

Moving onto our Tweet-related attributes, we’ll start by looking at hashtags. We’ll create a column with the number of hashtags in each Tweet while removing hashtags from the Tweet content.
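Both parts of this step can be handled by one small function. The sketch below assumes a hashtag is a # followed by word characters; extract_hashtag_features is a hypothetical name.

```python
import re
from typing import Tuple

# Assumed pattern: a hashtag is '#' followed by one or more word characters.
HASHTAG_PATTERN = re.compile(r"#\w+")


def extract_hashtag_features(text: str) -> Tuple[int, str]:
    """Return the hashtag count and the text with hashtags removed."""
    hashtags = HASHTAG_PATTERN.findall(text)
    stripped = " ".join(HASHTAG_PATTERN.sub("", text).split())
    return len(hashtags), stripped


num_hashtags, text_without_hashtags = extract_hashtag_features(
    "Loving this #Python tip for #DataScience beginners"
)
print(num_hashtags, text_without_hashtags)
```

Counting before stripping matters here: once the hashtags are removed from the text, the count can no longer be recovered.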

We could also create one-hot encodings for the most popular hashtags in our dataset. But from my trial runs, they did not contribute to model performance and are thus not included in our final model.

Next, we’ll create an attribute with the number of emojis in each Tweet:
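Counting emojis usually comes down to a regular expression over Unicode ranges. The sketch below uses a deliberately simplified pair of ranges, which is an assumption: it misses flag emojis and counts skin-tone modifiers as separate characters, so treat the result as approximate.

```python
import re

# Simplified, approximate emoji ranges (an assumption, not a complete emoji definition).
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport, supplemental symbols
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "]"
)


def count_emojis(text: str) -> int:
    """Count characters that fall inside the approximate emoji ranges."""
    return len(EMOJI_PATTERN.findall(text))


# Two rocket emojis (U+1F680) and one sparkles emoji (U+2728), written as escapes.
num_emojis = count_emojis("Rocket launch \U0001F680\U0001F680 today \u2728")
print(num_emojis)
```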

Now we’ll look at the media column. Let’s see what unique values we have:

From the output, the Tweets in our dataset contain three media types: videos, photos, and gifs. We’ll create categorical features for each of these media types:
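One way to build these features is a 0/1 flag per media type. The sketch below assumes each entry in the media column is a list of type labels such as "Photo", "Video", or "Gif"; the exact labels and the has_* feature names are hypothetical.

```python
from typing import Dict, List

# Assumed media labels; the real column may spell them differently.
MEDIA_TYPES = ("photo", "video", "gif")


def media_flags(media: List[str]) -> Dict[str, int]:
    """Return one 0/1 flag per media type present in the Tweet."""
    present = {item.lower() for item in media}
    return {f"has_{kind}": int(kind in present) for kind in MEDIA_TYPES}


flags = media_flags(["Photo", "Photo", "Gif"])
print(flags)
```

Flags record presence rather than counts, so a Tweet with two photos and a Tweet with one photo get the same has_photo value.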

The length of Tweets may also influence Tweet success, so we’ll next create an attribute for the number of words in each tweet.
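A simple way to count words is to split on whitespace, as in this sketch. It assumes whitespace-separated tokens are a good enough definition of a word; count_words is a hypothetical helper.

```python
def count_words(text: str) -> int:
    """Count whitespace-separated tokens; repeated spaces are ignored."""
    return len(text.split())


num_words = count_words("Black holes are   surprisingly simple objects")
print(num_words)
```

Running this after links, mentions, and hashtags have been stripped keeps those tokens from inflating the count.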

Let’s take a look at the relationship (if any) between some of the features we just created and retweets:

Starting with the number of links, having fewer links doesn’t guarantee more Retweets, but the few Tweets that have 2 or more links didn’t receive much engagement.

The same holds true for the number of hashtags and emojis. The Tweets that receive the most engagement tend to have 5 or fewer hashtags and 3 or fewer emojis.

As for when the Tweet was created, label encodings for weekday and hour may be more effective than DateTimes.
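Label encoding a timestamp this way takes only two attribute lookups. In the sketch below, the weekday is encoded as 0 for Monday through 6 for Sunday (Python's convention) and the hour as 0 through 23; encode_weekday_and_hour is a hypothetical helper.

```python
from datetime import datetime, timezone
from typing import Tuple


def encode_weekday_and_hour(created_at: datetime) -> Tuple[int, int]:
    """Return (weekday, hour): weekday is 0=Monday..6=Sunday, hour is 0..23."""
    return created_at.weekday(), created_at.hour


# Toy timestamp in UTC: Saturday, May 22, 2021 at 14:30.
weekday, hour = encode_weekday_and_hour(
    datetime(2021, 5, 22, 14, 30, tzinfo=timezone.utc)
)
print(weekday, hour)
```

On a parsed datetime column, pandas exposes the same values as tweets_df['datetime'].dt.weekday and tweets_df['datetime'].dt.hour.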

Now we can analyze the number of Retweets by weekday:
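Underneath a plot like this sits a group-by aggregation. The sketch below computes the mean number of Retweets per weekday on toy rows; the numbers are made up, and using the mean (rather than the total or median) is an assumption.

```python
from collections import defaultdict
from statistics import mean

# Toy (weekday, retweets) pairs, invented for illustration; 0 = Monday.
rows = [(0, 10), (0, 30), (5, 4), (6, 2)]

retweets_by_day = defaultdict(list)
for weekday, retweets in rows:
    retweets_by_day[weekday].append(retweets)

# Average the Retweet counts within each weekday.
mean_retweets = {day: mean(values) for day, values in sorted(retweets_by_day.items())}
print(mean_retweets)
```

With a weekday column in place (a hypothetical column name), the pandas equivalent would be tweets_df.groupby('weekday')['retweets'].mean().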

We can also look at the number of Tweets by weekday in general:

From our plots, it looks like people Tweet about STEM more often on weekdays than on weekends.

Completing the same analysis for the hour of the day, we get:

From these plots, it is clear that there is a cyclical trend in the number of Tweets posted at each hour. People Tweet about STEM most often at 2pm UTC and least often at 4am UTC.

For further analysis into the characteristics of successful Tweets, stay tuned for part 4 of this Tweet success series!

Conclusion

That’s all for our preliminary data analysis. Now that we have a better idea of what we’re working with, it’s time to move on to the next step.

Look out for part 3, where we’ll build our machine learning model.

References

In addition to the ones linked throughout this article, I wouldn’t have been able to complete this project without the help of these awesome examples and tutorials:

[1] Medium | How to Scrape Tweets With snscrape by Martin Beck

[2] RegEx Testing | Regex to match all emoji

[3] GitHub | Predicting Popularity by Belinda Zeng, Roseanne Feng, Yuqi Hou, and Zahra Mahmood

[4] Springer Link | Retweet Predictive Model for Predicting the Popularity of Tweets by Nelson Oliveira, Joana Costa, Catarina Silva, and Bernardete Ribeiro

[5] Medium | How to Write a Successful Data Science Article on Medium by Lukas Frei