How to Write a Successful STEM-related Tweet (Part 3)
A machine learning novice’s crack at creating a Tweet success predictor
Introduction
When I first had the idea of building a machine learning model to predict STEM-related Tweet success, I knew it would be tough. But it ended up even more difficult than I had expected.
My goal was to predict the number of Retweets a STEM-related Tweet would receive.
In this article, I’ll go over how I created my best (but still terrible) Tweet success predictor and what I learned along the way. We’ll pick up where we left off in part 2 of this Tweet success series by first tackling natural language processing.
Data Source
For my machine learning model, I’ll be using the data I collected in part 1:
To recap, the gathered data consists of 2873 entries that each have the following attributes:
- `datetime`: Tweet creation date and time in UTC
- `id`: Tweet ID
- `url`: permalink pointing to Tweet location
- `tweet`: text content of Tweet
- `retweets`: Retweet count
- `likes`: like count
- `replies`: reply count
- `quotes`: count of users that quoted the Tweet and replied
- `media`: list of media types within Tweet
- `num_mentions`: number of mentions
- `num_links`: number of links
- `user num followers`: number of followers
- `user num statuses`: number of statuses
- `user num favourites`: number of Tweets that the user has liked
- `user num listed`: user listed count
- `user verified`: whether the user account is Twitter verified
- `user account creation`: user account creation date and time in UTC
Exploratory Data Analysis
Picking up where we left off in part 2 of this Tweet success series, we’re moving on to sentiment analysis! We’ll use TextBlob, a Python library for processing textual data. You can read more about the TextBlob library here:
We’ll first define a few functions for cleaning Tweets and analyzing sentiment:
Now, we can create a sentiment analysis attribute with the result of the analysis. The sentiment values range from very negative at -1 to very positive at 1.
Alternatively, we can create one-hot encodings for our sentiment analysis like so:
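One way to build such one-hot encodings, assuming the continuous `SA` column from the step above (the ±0.05 class cutoffs are my own illustrative choice):

```python
import pandas as pd

tweets_df = pd.DataFrame({"SA": [-0.8, 0.0, 0.6]})

# Bin polarity into three classes, then one-hot encode the labels
labels = pd.cut(tweets_df["SA"], bins=[-1.01, -0.05, 0.05, 1.0],
                labels=["negative", "neutral", "positive"])
one_hot = pd.get_dummies(labels, prefix="SA")
tweets_df = pd.concat([tweets_df, one_hot], axis=1)
```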
But from my trial runs, one continuous sentiment feature trained better predictors than one-hot encodings.
Now, we’ll take a look at term frequency-inverse document frequency, or TF-IDF. It is a statistical measure that evaluates how relevant a word is to a document in a collection. To do this, we’ll use both scikit-learn and the Natural Language Toolkit, or NLTK. Scikit-learn is a machine learning library, while NLTK is a platform for natural language processing. You can read more about both of these libraries at the following links:
To start, we’ll remove stop words, the English words that do not add meaning to sentences, from the Tweet content:
Now we’ll create n-grams, a sequence of n words, from the Tweet content. We’ll consider unigrams, bigrams, and trigrams for our model:
We can see that the keywords science and women are the most important to the Tweet content in our collected data.
For further analysis into the characteristics of successful Tweets, stay tuned for part 4 of this Tweet success series!
Data Preprocessing
Now that we have a better understanding of our data, it’s time to prep our dataset for training our machine learning model.
First off, we have to choose an indicator for Tweet success. In this article, I’ll use `retweets`:

Alternatively, you can use `likes`, `replies`, `quotes`, or some combination of these engagement indicators.
Next up, we’ll encode `weekday` and `hour (24-hour clock)` as cyclical features. Our current label encodings do not represent the cyclical nature of time: Sunday is encoded as 0 while Saturday is encoded as 6, for example, even though they are adjacent days. Cyclical encodings solve this through sine and cosine transformations. You can read more about encoding cyclical features here:
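A minimal sketch of the sine/cosine transformation, assuming integer `weekday` (0–6) and `hour (24-hour clock)` (0–23) columns:

```python
import numpy as np
import pandas as pd

tweets_df = pd.DataFrame({"weekday": [0, 3, 6], "hour (24-hour clock)": [0, 6, 23]})

tweets_df["sin_weekday"] = np.sin(2 * np.pi * tweets_df["weekday"] / 7)
tweets_df["cos_weekday"] = np.cos(2 * np.pi * tweets_df["weekday"] / 7)
tweets_df["sin_hour"] = np.sin(2 * np.pi * tweets_df["hour (24-hour clock)"] / 24)
tweets_df["cos_hour"] = np.cos(2 * np.pi * tweets_df["hour (24-hour clock)"] / 24)

# On the (sin, cos) circle, 23:00 now sits next to 00:00 instead of far away
pts = tweets_df[["sin_hour", "cos_hour"]].to_numpy()
dist_23_to_0 = np.linalg.norm(pts[2] - pts[0])
```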
If we create sorted versions of `tweets_df` for both `weekday` and `hour (24-hour clock)`, we can visualize these cyclical encodings.
To finalize our data features, we’ll drop the `weekday` and `hour (24-hour clock)` columns, reset the index to remove the Tweet content, and get rid of any null values that popped up during preprocessing:
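As a sketch with a small stand-in frame (the real one carries many more columns), the finalization step might be:

```python
import numpy as np
import pandas as pd

# Stand-in frame indexed by Tweet content, as in the earlier EDA steps
tweets_df = pd.DataFrame(
    {"weekday": [0, 1], "hour (24-hour clock)": [9, 17],
     "sin_hour": [0.71, np.nan], "SA": [0.2, -0.1]},
    index=["tweet one", "tweet two"],
)

final_df = (
    tweets_df.drop(columns=["weekday", "hour (24-hour clock)"])  # drop raw time columns
    .reset_index(drop=True)                                      # remove the Tweet content index
    .dropna()                                                    # discard rows with nulls
)
```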
Now that we have our final data frame, we can take a look at the correlation between all of our features:
Most of our features have a relatively low correlation with engagement; the few that do stand out are the user-related features, `sin_hour`, and `SA`. It would be hard to increase your social media engagement by directly increasing your number of followers, but Tweeting at an ideal time is definitely doable!
Finally, we’ll scale engagement to go from 0 to 1 and split our dataset into training and testing sets.
Scaling isn’t necessary for the algorithm we’ll be using, but it’s useful for performance comparison with models that do require this step, like LSTM models built using TensorFlow.
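A sketch of the scaling and splitting step, assuming a feature matrix `X` and raw Retweet counts `y` (random placeholders stand in for the real data here):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))                     # placeholder feature matrix
y = rng.integers(0, 500, size=100).astype(float)  # placeholder raw Retweet counts

# Squash engagement into [0, 1]
y_scaled = MinMaxScaler().fit_transform(y.reshape(-1, 1)).ravel()

X_train, X_test, y_train, y_test = train_test_split(
    X, y_scaled, test_size=0.2, random_state=42
)
```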
Training the Model
We’re ready to train our model! For our machine learning algorithm, we’ll use XGBoost, which stands for eXtreme Gradient Boosting.
Gradient boosting is a supervised learning algorithm that tries to predict a target variable by combining the estimates of a set of simpler, weaker models.
In our case, the target variable is the number of Retweets.
eXtreme Gradient Boosting is an efficient implementation of this gradient boosted trees algorithm. You can find the documentation for XGBoost here:
Results
Now that we have our trained model, it’s time to evaluate its performance! We’ll first plot the real number of Retweets against the number of Retweets predicted by the algorithm.
From our graph, we can see that we’ve built a pretty terrible model. Our mean absolute error is 0.0121, while our root mean square error is 0.0218. In terms of Retweets, these correspond to errors of about 24 and 38 Retweets for MAE and RMSE, respectively. That might not sound too bad, but considering that most Tweets have under 100 Retweets, it isn’t much better than pure guessing. Much of the difficulty in accurately predicting the number of Retweets comes down to the randomness involved.
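Both metrics are straightforward to compute with scikit-learn. Note that an error on the scaled target converts back to Retweets by multiplying by the original scaling range; the numbers below are illustrative, not the article's actual values:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([0.01, 0.05, 0.00, 0.10])  # scaled Retweet counts (illustrative)
y_pred = np.array([0.02, 0.03, 0.01, 0.06])

mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))

# An error on the [0, 1] scale maps back to Retweets via the raw max - min range
retweet_range = 2000.0  # hypothetical spread of raw Retweet counts
mae_in_retweets = mae * retweet_range
```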
We can also take a look at what features were the most important for predicting the number of Retweets a Tweet receives:
We see that user features heavily affect Tweet success, as a user’s follower and favourite counts are two of the most vital features. As for more actionable attributes, sentiment and the length of the Tweet also play a significant role.
Conclusion
We’ve successfully analyzed the makings for great Tweets while building a pretty terrible model along the way! But we have yet to create the holy grail of this Tweet success task: a machine learning model that can accurately predict Tweet engagement. Here are some steps that may bring us closer to our final goal:
- Collect more data
- Try varying Tweet scraping configurations
- Try other machine learning models, such as an LSTM model
- Modify the task from regression to classification by grouping the dataset into different levels of engagement
- Tune our model’s hyperparameters according to the following tutorial:
- Improve the accuracy of our sentiment analysis according to this article:
Another great place to look for Tweet success tips is the Twitter Developer Documentation:
That’s it for part 3 of this Tweet success series. Stay tuned for part 4, where I summarize my findings and dive into further analysis on how to craft the perfect Tweet!
References
In addition to the ones linked throughout this article, I wouldn’t have been able to complete this project without the help of these awesome examples and tutorials:
[1] Medium | How to Scrape Tweets With snscrape by Martin Beck
[2] Tutorials Point | Python — Remove Stopwords
[3] RegEx Testing | Regex to match all emoji
[4] Medium | Seaborn Heatmaps: 13 Ways to Customize Correlation Matrix Visualizations by Okoh Anita
[5] GitHub | Predicting Popularity by Belinda Zeng, Roseanne Feng, Yuqi Hou, and Zahra Mahmood
[6] Springer Link | Retweet Predictive Model for Predicting the Popularity of Tweets by Nelson Oliveira, Joana Costa, Catarina Silva, and Bernardete Ribeiro
[7] Medium | How to Write a Successful Data Science Article on Medium by Lukas Frei