Scraping Twitter in Search of People’s F😆👻lings
Why look?
Twitter is a mall food court where you can do some really amazing people watching. Except it's not just the next table over that you can eavesdrop on, it's every public account from around the world. This has some powerful implications. Information from various corners is not always siloed into niche communities. If information is important and/or viral enough, it has the velocity to spread like wildfire across the internet. I've been deeply curious about how we can map and visualize this information to gain insight into people's opinions and thoughts at a greater scale than has historically been possible. The following is a brief walk towards that goal.
The Technicals
Twitter API or Scraper
I was tempted to use the Twitter API, but was quickly turned off by the rate limits imposed on the entry tier. Instead, I chose to use a scraping library to bootstrap data. Finding a working library was surprisingly difficult, as frequent Twitter UI changes seem to break these libraries. I wound up using the Scweet library (Python). I did run into some minor reliability issues, such as not being able to run it in a headless state and having to clean some of the data it produces before going further with it. Once I had a cleaned data set (in a dataframe) it was something like:
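The cleanup pass can be sketched roughly like this. Note this is a sketch under assumptions: the column names (`Timestamp`, `Embedded_text`) follow the Scweet CSV output I saw, but they may differ between versions, and the toy rows stand in for real scraper output.

```python
import pandas as pd

# Toy stand-in for Scweet's raw output: a duplicate row and a row
# missing its timestamp, both of which showed up in practice.
raw = pd.DataFrame({
    "Timestamp": ["2021-01-01T10:00:00", "2021-01-01T10:00:00", None],
    "Embedded_text": ["first tweet", "first tweet", "third tweet"],
})

def clean_tweets(df):
    """Drop incomplete and duplicate rows, then parse timestamps."""
    df = df.dropna(subset=["Timestamp", "Embedded_text"])  # rows missing fields
    df = df.drop_duplicates(subset=["Embedded_text"])      # scraper re-collects some tweets
    df["Timestamp"] = pd.to_datetime(df["Timestamp"])      # needed for resampling later
    return df.reset_index(drop=True)

tweets = clean_tweets(raw)
print(len(tweets))  # 1
```

Parsing `Timestamp` into a real datetime here is what makes the daily resampling in the visualization step possible.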
Using Sentiment Analysis
After getting a collection of tweets I wanted to run some sentiment analysis with a commercial API to see how it might perform on informal user content. I opted to test AWS Comprehend's sentiment analysis (AWS has a pretty generous free tier to test with, and GCP is likely similar). All I needed to do was pass a tweet to the API, and I received:
```json
{
    "Sentiment": {
        "Sentiment": "NEUTRAL",
        "SentimentScore": {
            "Positive": 0.16761767864227295,
            "Negative": 0.026001444086432457,
            "Neutral": 0.8055717349052429,
            "Mixed": 0.0008091145427897573
        }
    }
}
```
Here we get back four confidence scores and an overall sentiment label for the text. Like most ML outputs, each score is normalized to a float from 0 (least confident) to 1 (most confident), which might be helpful for us later.
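Unpacking that response is straightforward. A minimal sketch, run against the sample response above (I'm assuming the nested shape shown there; `top_sentiment` is a hypothetical helper, not part of the Comprehend SDK):

```python
# Sample response, copied from the output above.
response = {
    "Sentiment": {
        "Sentiment": "NEUTRAL",
        "SentimentScore": {
            "Positive": 0.16761767864227295,
            "Negative": 0.026001444086432457,
            "Neutral": 0.8055717349052429,
            "Mixed": 0.0008091145427897573,
        },
    }
}

def top_sentiment(resp):
    """Return (label, score) for the most confident sentiment class."""
    scores = resp["Sentiment"]["SentimentScore"]
    label = max(scores, key=scores.get)
    return label, scores[label]

label, score = top_sentiment(response)
print(label, round(score, 3))  # Neutral 0.806
```

Keeping the full score dict around, rather than just the winning label, is what lets us average confidences per day later instead of merely counting labels.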
Visualization
My goal for this first pass was to be able to visualize (1) what people were saying about a particular search term and (2) how they felt towards that topic over time.
1. I generated a word cloud using the popular wordcloud library. The resulting output looked like:
2. I then took all the sentiment scores computed for the collected tweets and graphed the daily mean using the dataframe's resampling ability. Note that this is not a representation of the raw count of tweets for each sentiment type, but rather the mean of each type's confidence score.
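The daily averaging in step 2 can be sketched with pandas like so. The scores below are made-up stand-ins for Comprehend's output; the real frame would have one row per tweet, indexed by its timestamp:

```python
import pandas as pd

# Toy per-tweet sentiment scores indexed by tweet time (values illustrative).
scores = pd.DataFrame(
    {
        "Positive": [0.9, 0.1, 0.5, 0.7],
        "Negative": [0.05, 0.8, 0.3, 0.1],
    },
    index=pd.to_datetime(
        ["2021-03-01 09:00", "2021-03-01 17:00",
         "2021-03-02 08:00", "2021-03-02 20:00"]
    ),
)

# Bucket into calendar days and take the mean of each column --
# the mean of the confidence scores, not a count of labeled tweets.
daily = scores.resample("D").mean()
print(daily)
```

Each row of `daily` is then one point per sentiment line on the time-series plot.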
What’s next?
- While positive/negative/neutral scoring is nice, it would be great to have a more granular description of the data.
- Informal social content contains semantic quirks like sarcasm, which I found was often mislabeled.
- Social interaction is inherently networked, and visualizing the graph structure of these interactions might yield interesting insights.
- Informal communication online is not just written content. Memes and images are used ubiquitously on social media, and classification of them is nowhere to be found in the commercial ML space.