Whether you are growing a business or want your ideas to reach as many people as possible, increasing your social media engagement is the key to victory. But what factors influence engagement? Time of post? Sentiment? Emojis? For a machine learning novice like me, this was the perfect challenge.
My goal was to predict the number of Retweets a STEM-related Tweet would receive.
So what’s step 1, you may ask? Data collection, of course! Data collection is one of the most vital steps in any machine learning project. In part 1 of this Tweet success series, I’ll be going over how I collected data using the snscrape library.
My first roadblock in this journey was gathering data. Unable to find an existing dataset of STEM-related Tweets, I had to find a way to scrape Twitter data myself. To do this, I used the snscrape library:
snscrape is a scraper for social networking services (SNS). It scrapes things like user profiles, hashtags, or searches…
I based my Tweet scraping on this tutorial:
How to Scrape Tweets With snscrape
A quick guide to scraping tweets after recent updates to Twitter’s API
You can install the snscrape library with pip, either from your command line or inside a notebook cell like so:
%pip install snscrape
Next, I’ll import the snscrape library and the rest of the libraries we’ll need:
import snscrape.modules.twitter as scraper
import pandas as pd
import datetime
We have to decide what timeframe we want to scrape Tweets from. I arbitrarily chose the period from January 1st, 2020 to March 1st, 2021.
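Here is a minimal sketch of how that timeframe could be split into one start/end pair per day. The helper name `daily_ranges` is my own placeholder, and treating March 1st, 2021 as an exclusive end date is an assumption rather than something the original setup confirms.

```python
import datetime


def daily_ranges(start, end):
    """Yield (day, next_day) pairs from start (inclusive) to end (exclusive)."""
    one_day = datetime.timedelta(days=1)
    day = start
    while day < end:
        yield day, day + one_day
        day += one_day


# Timeframe from the text; the end date is assumed to be exclusive.
start_date = datetime.date(2020, 1, 1)
end_date = datetime.date(2021, 3, 1)

date_ranges = list(daily_ranges(start_date, end_date))
```

Each pair can later be turned into the bounds of a single day's search.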
Next up is configuring the type of data we want. I collected Tweets that contained either ‘#STEM’ or ‘coding’ while filtering out Retweets, replies, and quotes. You can read more about advanced Twitter search here:
These operators work on Web, Mobile, Tweetdeck. There is some overlap, but largely these will not work for v1.1 Search…
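As a rough sketch, a search string for this configuration could be assembled like this. The `-filter:retweets`, `-filter:replies`, and `-filter:quote` operators come from Twitter's advanced search syntax as I understand it; the `build_query` helper and the exact combination of operators are assumptions, not necessarily the query used for this project.

```python
# Keywords and exclusions described in the text.
keywords = ["#STEM", "coding"]
excluded = ["retweets", "replies", "quote"]


def build_query(keywords, excluded):
    """Join keywords with OR and negate each filter (hypothetical helper)."""
    keyword_part = "(" + " OR ".join(keywords) + ")"
    filter_part = " ".join(f"-filter:{name}" for name in excluded)
    return f"{keyword_part} {filter_part}"


base_query = build_query(keywords, excluded)
print(base_query)
```

Pasting the printed string into Twitter's search bar is a quick way to check that it returns what you expect before scraping with it.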
Earlier, we generated individual start and end dates for our Tweet scraping. This is because we need to limit the number of Tweets we scrape per day, or else this process would take forever (unless you have a powerful CPU). I’ve arbitrarily set this limit to 100.
One of the limitations of the snscrape library is that you can only scrape Tweets in reverse chronological order rather than according to popularity. In our case, this means that for any given day, we start collecting Tweets at 11:59pm, move backward in time until we hit the limit or reach the previous day, then skip forward in time to the next day.
As we only scrape 100 Tweets per day, we need to increase the chances of scraping meaningful data, such as Tweets with high engagement. Otherwise, we would be scraping data from only the same few hours per day, which wouldn’t represent reality.
To do this, we’ll set some minimum engagement requirements:
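Twitter's advanced search has `min_retweets:`, `min_faves:`, and `min_replies:` operators that can express requirements like these. The sketch below shows one way to append them to a query; the helper name and the threshold values are placeholders for illustration, not the numbers I actually used.

```python
def add_min_engagement(query, min_retweets=0, min_faves=0, min_replies=0):
    """Append minimum-engagement operators to a search query (hypothetical helper)."""
    parts = [query]
    if min_retweets:
        parts.append(f"min_retweets:{min_retweets}")
    if min_faves:
        parts.append(f"min_faves:{min_faves}")
    if min_replies:
        parts.append(f"min_replies:{min_replies}")
    return " ".join(parts)


# Placeholder thresholds, chosen only for illustration.
engaged_query = add_min_engagement("(#STEM OR coding)", min_retweets=2, min_faves=5)
print(engaged_query)
```

Setting the thresholds too high shrinks the dataset quickly, so it is worth trying a few values on a single day first.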
Now we’ll tie all of our data configurations together and scrape our Tweet data! For each Tweet, we create a list of the desired pieces of data. We then append this list to
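The per-day loop can be sketched roughly as follows. To keep the example runnable without network access, the function takes a `search` callable instead of calling snscrape directly. The function name, the chosen fields, and the Tweet attribute names (`date`, `content`, `likeCount`, `retweetCount`) are assumptions; attribute names in particular vary between snscrape versions.

```python
import datetime
from itertools import islice


def scrape_day(search, base_query, day, limit=100):
    """Collect up to `limit` rows of Tweet data for a single day.

    `search` is any callable that takes a query string and returns an
    iterable of Tweet-like objects, newest first.
    """
    next_day = day + datetime.timedelta(days=1)
    # since: and until: restrict the search to one calendar day.
    query = f"{base_query} since:{day.isoformat()} until:{next_day.isoformat()}"

    rows = []
    for tweet in islice(search(query), limit):
        # One list per Tweet, holding the pieces of data we care about.
        rows.append([tweet.date, tweet.content, tweet.likeCount, tweet.retweetCount])
    return rows


# With snscrape imported as `scraper`, `search` might look like:
#   search = lambda q: scraper.TwitterSearchScraper(q).get_items()
```

Calling `scrape_day` once per day in the timeframe and extending a single list with the results gives one flat collection of rows.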
Next, we’ll load all our gathered data into a pandas DataFrame:
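A minimal sketch of that step is below. The column names and the two rows are made-up placeholders standing in for the scraped lists, not real Tweets; only the `tweets_df` name matches the rest of this post.

```python
import pandas as pd

# Hypothetical column names, one per piece of data stored for each Tweet.
columns = ["date", "content", "likes", "retweets"]

# Dummy rows for illustration only; in practice this is the scraped list.
rows = [
    ["2020-01-01 23:59:00", "placeholder tweet text", 12, 3],
    ["2020-01-01 23:58:00", "another placeholder", 40, 9],
]

tweets_df = pd.DataFrame(rows, columns=columns)
print(tweets_df.head())
```

Passing `columns` explicitly keeps the DataFrame readable, since a plain list of lists carries no field names of its own.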
Finally, we’ll export the data into a CSV file! The data scraping process takes a long time, so we don’t want to scrape the same data every time we need it.
# Export dataframe into a CSV
tweets_df.to_csv('STEM_tweets.csv', sep=',', index=False)
Here’s a peek at our collected data:
We’ve successfully gathered data on thousands of Tweets! We’re now ready to analyze our data and build our Tweet success predictor.
That’s it for part 1 of this series. Stay tuned for part 2!