How to Write a Successful STEM-related Tweet (Part 3)

A machine learning novice’s crack at creating a Tweet success predictor

Code AI Blogs

Published in CodeAI

7 min read · Jul 3, 2021

Introduction
Data Source
Setup
Exploratory Data Analysis
Data Preprocessing
Training the Model
Results
Conclusion
References

Introduction

When I first had the idea of building a machine learning model to predict STEM-related Tweet success, I knew it would be tough. But it ended up even more difficult than I had expected.

My goal was to predict the number of Retweets a STEM-related Tweet would receive.

In this article, I’ll go over how I created my best (but still terrible) Tweet success predictor and what I learned along the way. We’ll pick up where we left off in part 2 of this Tweet success series by first tackling natural language processing.

Data Source

For my machine learning model, I’ll be using the data I collected in part 1:

How to Write a Successful STEM-related Tweet (Part 1)

What I learned while failing to create a Tweet success predictor

codeaiblogs.medium.com

To recap, the gathered data consists of 2873 entries that each have the following attributes:

datetime: Tweet creation date and time in UTC
id: Tweet ID
url: permalink pointing to Tweet location
tweet: text content of Tweet
retweets: Retweet count
likes: like count
replies: reply count
quotes: count of users that quoted the Tweet and replied
media: list of media types within Tweet
num_mentions: number of mentions
num_links: number of links
user num followers: number of followers
user num statuses: number of statuses
user num favourites: number of Tweets that the user has liked
user num listed: user listed count
user verified: whether the user account is Twitter verified
user account creation: user account creation date and time in UTC

Exploratory Data Analysis

Picking up where we left off in part 2 of this Tweet success series, we’re moving on to sentiment analysis! We’ll use TextBlob, a Python library for processing textual data. You can read more about the TextBlob library here:

TextBlob: Simplified Text Processing — TextBlob 0.16.0 documentation

Release v0.16.0. TextBlob is a Python (2 and 3) library for processing textual data. It provides a simple API for…

textblob.readthedocs.io

We’ll first define a few functions for cleaning Tweets and analyzing sentiment:
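A minimal sketch of such helpers could look like the following. The function names clean_tweet and get_polarity and the specific cleaning rules are illustrative assumptions; the only TextBlob call used is TextBlob(text).sentiment.polarity.

```python
import re


def clean_tweet(text: str) -> str:
    """Strip links, @mentions, and '#' symbols, then collapse whitespace."""
    text = re.sub(r"https?://\S+", " ", text)  # remove URLs
    text = re.sub(r"@\w+", " ", text)          # remove @mentions
    text = text.replace("#", " ")              # keep hashtag words, drop '#'
    return re.sub(r"\s+", " ", text).strip()


def get_polarity(text: str) -> float:
    """Return TextBlob's polarity for the cleaned text, from -1 to 1."""
    # Imported here so clean_tweet can be used without TextBlob installed.
    from textblob import TextBlob

    return TextBlob(clean_tweet(text)).sentiment.polarity
```

Cleaning before scoring keeps links and handles from being treated as words by the sentiment analyzer.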

Now, we can create a sentiment analysis attribute with the result of the analysis. The sentiment values range from very negative at -1 to very positive at 1.

Alternatively, we can create one-hot encodings for our sentiment analysis like so:
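One possible way to build such encodings with pandas is sketched below. The 0.05 threshold, the label names, and the sample scores are assumptions chosen for illustration, not values taken from this project.

```python
import pandas as pd


def sentiment_label(polarity: float, threshold: float = 0.05) -> str:
    """Bucket a polarity score into a coarse label (threshold is hypothetical)."""
    if polarity > threshold:
        return "positive"
    if polarity < -threshold:
        return "negative"
    return "neutral"


# Illustrative polarity scores standing in for the SA column
scores = pd.Series([-0.6, 0.0, 0.4], name="SA")

labels = scores.apply(sentiment_label)
one_hot = pd.get_dummies(labels, prefix="SA")  # one column per label
```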

In my trial runs, however, a single continuous sentiment feature produced better predictors than one-hot encodings.

Now, we’ll take a look at Term Frequency — Inverse Document Frequency, or TF-IDF. It is a statistical measure that evaluates how relevant a word is to a document in a collection. To do this, we’ll use both scikit-learn and the Natural Language Toolkit, or NLTK. Scikit-learn is a machine learning library, while NLTK is a platform for natural language processing. You can read more about both of these libraries at the following links:

scikit-learn

“We use scikit-learn to support leading-edge basic research […]” “I think it’s the most well-designed ML package I’ve…

scikit-learn.org

Natural Language Toolkit — NLTK 3.6.2 documentation

NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use…

www.nltk.org

To start, we’ll remove stop words, common English words that add little meaning to a sentence, from the Tweet content:
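As a rough sketch, stop-word removal can be a simple filter over the words of each Tweet. The function name and the tiny sample word set below are assumptions for illustration; in practice NLTK's stopwords.words("english") list (available after nltk.download("stopwords")) could be passed in instead.

```python
# A tiny illustrative stop-word set, not NLTK's full English list.
SAMPLE_STOP_WORDS = frozenset(
    {"a", "an", "the", "is", "are", "in", "of", "to", "and", "for"}
)


def remove_stop_words(text, stop_words=SAMPLE_STOP_WORDS):
    """Drop every word whose lowercase form is in stop_words."""
    kept = [word for word in text.split() if word.lower() not in stop_words]
    return " ".join(kept)
```

For example, remove_stop_words("Women in science are the future of STEM") keeps only "Women science future STEM".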

Now we’ll create n-grams, sequences of n consecutive words, from the Tweet content. We’ll consider unigrams, bigrams, and trigrams for our model:
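One way to do this is with scikit-learn's TfidfVectorizer, whose ngram_range parameter controls which n-gram sizes are extracted. The three sample documents below are placeholders rather than Tweets from the dataset, and ranking terms by their summed TF-IDF weight is just one assumed way to judge importance.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Placeholder documents standing in for cleaned Tweet text
docs = [
    "women science",
    "science education matters",
    "women engineering careers",
]

# ngram_range=(1, 3) extracts unigrams, bigrams, and trigrams
vectorizer = TfidfVectorizer(ngram_range=(1, 3))
tfidf = vectorizer.fit_transform(docs)

# Terms listed in the same order as the matrix columns
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

# Sum each term's TF-IDF weight over all documents and rank the terms
totals = np.asarray(tfidf.sum(axis=0)).ravel()
ranked = sorted(zip(terms, totals), key=lambda pair: pair[1], reverse=True)
```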

We can see that the keywords science and women are the most important to the Tweet content in our collected data.

For further analysis into the characteristics of successful Tweets, stay tuned for part 4 of this Tweet success series!

Data Preprocessing

Now that we have a better understanding of our data, it’s time to prep our dataset for training our machine learning model.

First off, we have to choose an indicator for Tweet success. In this article, I’ll use retweets:

Alternatively, you can use likes, replies, quotes, or some combination of these engagement indicators.
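In pandas terms, this choice is a single column assignment. The sketch below uses a small made-up data frame with the column names from the attribute list above; the unweighted sum is only one assumed way to combine indicators.

```python
import pandas as pd

# Made-up rows using the engagement columns described earlier
tweets_df = pd.DataFrame(
    {
        "retweets": [3, 120, 0],
        "likes": [10, 400, 2],
        "replies": [1, 15, 0],
        "quotes": [0, 8, 0],
    }
)

# Use the Retweet count as the success indicator
tweets_df["engagement"] = tweets_df["retweets"]

# A possible alternative: an unweighted sum of all four indicators
combined = tweets_df[["retweets", "likes", "replies", "quotes"]].sum(axis=1)
```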

Next up, we’ll encode weekday and hour (24-hour clock) as cyclical features. Our current label encodings do not represent the cyclical nature of time: Sunday is encoded as ‘0’ while Saturday is encoded as ‘6’, for example, so the two look as far apart as possible even though they are adjacent days. Cyclical encodings solve this by mapping each value onto a circle with sine and cosine transformations, so that the end of a cycle sits right next to its start. You can read more about encoding cyclical features here:

Encoding cyclical continuous features — 24-hour time

Some data is inherently cyclical. Time is a rich example of this: minutes, hours, seconds, day of week, week of month…

ianlondon.github.io
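The transformation itself is short. Here is a minimal sketch, assuming a period of 7 for weekdays and 24 for hours; the function name cyclical_encode and the cos_hour name are illustrative.

```python
import math


def cyclical_encode(value, period):
    """Map a value in [0, period) to a (sin, cos) point on the unit circle."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


# 23:00 and 00:00 end up close together on the circle
sin_hour, cos_hour = cyclical_encode(23, 24)

# Saturday (6) is now as close to Sunday (0) as Sunday is to Monday (1)
saturday = cyclical_encode(6, 7)
sunday = cyclical_encode(0, 7)
monday = cyclical_encode(1, 7)
```

Both the sine and cosine columns are needed: with only one of them, two different times of day would share the same value.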

If we create sorted versions of tweets_df for both weekday and hour (24-hour clock), we can visualize these cyclical encodings.

To finalize our data features, we’ll drop the weekday and hour (24-hour clock) columns, reset the index to remove the Tweet content, and get rid of any null values that popped up during preprocessing:
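A sketch of this cleanup step with pandas might look like the following. It assumes the Tweet text is stored as the data frame's index and that the raw time columns are named weekday and hour; both are assumptions, as are the sample values.

```python
import numpy as np
import pandas as pd

# Made-up rows; the Tweet text is assumed to be the index
tweets_df = pd.DataFrame(
    {
        "weekday": [0, 6, 3],
        "hour": [9, 23, 14],
        "sin_hour": [0.71, -0.26, np.nan],
        "engagement": [3, 120, 7],
    },
    index=pd.Index(["tweet a", "tweet b", "tweet c"], name="tweet"),
)

final_df = (
    tweets_df.drop(columns=["weekday", "hour"])  # raw time columns no longer needed
    .reset_index(drop=True)                      # discard the Tweet text index
    .dropna()                                    # remove rows with null values
)
```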

Now that we have our final data frame, we can take a look at the correlation between all of our features:
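The correlation matrix can be computed with DataFrame.corr(). The sketch below uses a few made-up rows and assumed column names, and ranks features by the strength of their correlation with engagement.

```python
import pandas as pd

# Made-up values for illustration only
final_df = pd.DataFrame(
    {
        "user_num_followers": [100, 5000, 250, 12000],
        "SA": [0.1, 0.5, -0.2, 0.3],
        "engagement": [2, 90, 5, 300],
    }
)

corr_matrix = final_df.corr()  # pairwise Pearson correlations

# Correlation of each feature with engagement, strongest first
engagement_corr = (
    corr_matrix["engagement"]
    .drop("engagement")
    .sort_values(key=abs, ascending=False)
)

# For a heatmap, seaborn.heatmap(corr_matrix, annot=True) is one option.
```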

Most of our features have a relatively low correlation with engagement, but the few that do stand out are the user-related features, sin_hour, and SA (the sentiment score). It would be hard to increase your social media engagement by increasing your number of followers directly, but Tweeting at an ideal time is definitely doable!

Finally, we’ll scale engagement to go from 0 to 1 and split our dataset into training and testing sets.

Scaling isn’t necessary for the algorithm we’ll be using, but it’s useful for performance comparison with models that do require this step, like LSTM models built using TensorFlow.
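With scikit-learn, the scaling and splitting steps could be sketched like this. The random data, the 80/20 split, and the random_state are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X = rng.random((20, 3))                                 # made-up feature matrix
retweets = rng.integers(0, 200, size=20).astype(float)  # made-up Retweet counts

# Scale the target to the range 0 to 1
scaler = MinMaxScaler()
y = scaler.fit_transform(retweets.reshape(-1, 1)).ravel()

# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```

Keeping the fitted scaler around makes it possible to convert predictions back to Retweet counts later with scaler.inverse_transform.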

Training the Model

We’re ready to train our model! For our machine learning algorithm, we’ll use XGBoost, which stands for eXtreme Gradient Boosting.

Gradient boosting is a supervised learning algorithm that tries to predict a target variable by combining the estimates of a set of simpler, weaker models.
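To make the idea concrete, here is a deliberately simplified sketch of gradient boosting for squared error, where each weak model is a one-split decision tree fitted to the errors left by the models before it. It illustrates the general technique only and is not how XGBoost is implemented; the function names and default settings are assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def fit_boosted_stumps(X, y, n_rounds=50, learning_rate=0.1):
    """Fit a sequence of depth-1 trees, each on the current residuals."""
    base = float(np.mean(y))                 # start from the mean prediction
    prediction = np.full(len(y), base)
    trees = []
    for _ in range(n_rounds):
        residuals = y - prediction           # what the ensemble still gets wrong
        tree = DecisionTreeRegressor(max_depth=1).fit(X, residuals)
        prediction = prediction + learning_rate * tree.predict(X)
        trees.append(tree)
    return {"base": base, "learning_rate": learning_rate, "trees": trees}


def predict_boosted(model, X):
    """Add up the base value and every tree's scaled correction."""
    prediction = np.full(X.shape[0], model["base"])
    for tree in model["trees"]:
        prediction += model["learning_rate"] * tree.predict(X)
    return prediction
```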

In our case, the target variable is the number of Retweets.

eXtreme Gradient Boosting is an efficient implementation of the gradient-boosted trees algorithm. You can find the documentation for XGBoost here:

Python API Reference — xgboost 1.5.0-SNAPSHOT documentation

This page gives the Python API reference of xgboost, please also refer to Python Package Introduction for more…

xgboost.readthedocs.io
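A training run could be sketched as follows, using XGBoost's scikit-learn style XGBRegressor. The data is synthetic and the hyperparameter values are placeholders rather than the settings used for this project. So that the sketch runs even where xgboost is not installed, it falls back to scikit-learn's GradientBoostingRegressor, which is a different implementation of gradient boosting.

```python
import numpy as np

try:
    from xgboost import XGBRegressor as Regressor  # the library used in this article
except ImportError:
    # Stand-in only: scikit-learn's own gradient boosting, not XGBoost
    from sklearn.ensemble import GradientBoostingRegressor as Regressor

# Synthetic data: the target depends mostly on the first feature
rng = np.random.default_rng(0)
X_train = rng.random((200, 4))
y_train = 0.8 * X_train[:, 0] + 0.05 * rng.random(200)
X_test = rng.random((50, 4))

# Placeholder hyperparameters, not tuned values
model = Regressor(n_estimators=100, max_depth=3, learning_rate=0.1)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)  # predictions on the scaled target
```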

Results

Now that we have our trained model, it’s time to evaluate its performance! We’ll first plot the real number of Retweets against the number of Retweets predicted by the algorithm.

From our graph, we can see that we’ve built a pretty terrible model. Our mean absolute error (MAE) is 0.0121, while our root mean square error (RMSE) is 0.0218. In terms of Retweets, this means our error is about 24 and 38 Retweets for MAE and RMSE, respectively. This doesn’t sound too terrible, but considering that most Tweets have under 100 Retweets, it isn’t much better than pure guessing. Much of the difficulty in accurately predicting the number of Retweets comes from the randomness involved.
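For reference, both metrics and the conversion from a scaled error back to Retweets can be sketched in a few lines. The helper names and the toy numbers are assumptions; the conversion assumes the target was min-max scaled, in which case an error scales by the range of the original Retweet counts.

```python
import numpy as np


def mae(y_true, y_pred):
    """Mean absolute error."""
    return float(np.mean(np.abs(y_true - y_pred)))


def rmse(y_true, y_pred):
    """Root mean square error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def to_retweets(scaled_error, retweet_min, retweet_max):
    """Convert an error on the 0-to-1 scale back to a number of Retweets."""
    return scaled_error * (retweet_max - retweet_min)


# Toy values on the scaled target, not results from this project
y_true = np.array([0.0, 0.1, 0.5])
y_pred = np.array([0.1, 0.1, 0.3])

scaled_mae = mae(y_true, y_pred)
scaled_rmse = rmse(y_true, y_pred)
```

RMSE is never smaller than MAE because squaring gives large misses more weight, which is why the two are often reported together.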

We can also take a look at what features were the most important for predicting the number of Retweets a Tweet receives:

We see that user features heavily affect Tweet success, as the number of followers and favourites a user has are two of the most vital features. As for more actionable attributes, sentiment and the length of the Tweet also play a significant role.
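A ranking like this can be produced by pairing feature names with importance scores and sorting them. In the sketch below, the helper name, the feature names, and the scores are all made up for illustration; with a fitted XGBRegressor, the scores would come from its feature_importances_ attribute.

```python
def rank_features(names, importances, top_n=5):
    """Return the top_n (name, importance) pairs, most important first."""
    pairs = sorted(zip(names, importances), key=lambda pair: pair[1], reverse=True)
    return pairs[:top_n]


# Made-up names and scores, not results from this project
example_names = ["feature_a", "feature_b", "feature_c"]
example_scores = [0.2, 0.5, 0.3]

top_two = rank_features(example_names, example_scores, top_n=2)
```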

Conclusion

We’ve successfully analyzed the makings of great Tweets while building a pretty terrible model along the way! But we have yet to create the holy grail of this Tweet success task: a machine learning model that can accurately predict Tweet engagement. Here are some steps that may bring us closer to our final goal:

Collect more data
Try varying Tweet scraping configurations
Try other machine learning models, such as an LSTM model
Modify the task from regression to classification by grouping the dataset into different levels of engagement
Tune our model’s hyperparameters according to the following tutorial:

Hyperparameter tuning in XGBoost

This tutorial is the second part of our series on XGBoost. If you haven’t done it yet, for an introduction to XGBoost…

blog.cambridgespark.com

Improve the accuracy of our sentiment analysis according to this article:

Sentiment Analysis in Python: TextBlob vs Vader Sentiment vs Flair vs Building It From Scratch …

Sentiment analysis is one of the most widely known Natural Language Processing (NLP) tasks. This article aims to give…

neptune.ai
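As an example of the regression-to-classification idea from the list above, Retweet counts could be grouped into a few engagement levels. The level names and cut-offs below are hypothetical and would need to be chosen from the actual distribution of the data.

```python
from bisect import bisect_right

LEVELS = ["low", "medium", "high"]
THRESHOLDS = [10, 100]  # hypothetical cut-offs, in Retweets


def engagement_level(retweets):
    """Map a Retweet count to a level: below 10, 10 to 99, or 100 and above."""
    return LEVELS[bisect_right(THRESHOLDS, retweets)]
```

A classifier trained on these labels only has to place a Tweet in the right bucket, which may be more forgiving than predicting an exact count.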

Another great place to look for Tweet success tips is the Twitter Developer Documentation:

Documentation Home

Guides and reference materials to help you get started, integrate, optimize, and troubleshoot your use of the Twitter…

developer.twitter.com

That’s it for part 3 of this Tweet success series. Stay tuned for part 4, where I summarize my findings and dive into further analysis on how to craft the perfect Tweet!

References

In addition to the ones linked throughout this article, I wouldn’t have been able to complete this project without the help of these awesome examples and tutorials:

[1] Medium | How to Scrape Tweets With snscrape by Martin Beck

[2] Tutorials Point | Python — Remove Stopwords

[3] RegEx Testing | Regex to match all emoji

[4] Medium | Seaborn Heatmaps: 13 Ways to Customize Correlation Matrix Visualizations by Okoh Anita

[5] GitHub | Predicting Popularity by Belinda Zeng, Roseanne Feng, Yuqi Hou, and Zahra Mahmood

[6] Springer Link | Retweet Predictive Model for Predicting the Popularity of Tweets by Nelson Oliveira, Joana Costa, Catarina Silva, and Bernardete Ribeiro

[7] Medium | How to Write a Successful Data Science Article on Medium by Lukas Frei