Hashtags of Ferguson
A few weeks ago I wrote about the importance of preserving the digital content shared via Twitter in response to the killing of Michael Brown and the subsequent protests it sparked in Ferguson, MO and across the country. Recently Ed Summers and I have been doing a little more thinking around that topic in an attempt to figure out an effective strategy for getting at digital content that is actually related to Ferguson and helps document the events. This is especially important when working with Twitter datasets, which can be massive. We decided that one way to begin making sense of the data, and to get through the first layer of noise around the good stuff, was to focus on the hashtags used in the tweets. Since hashtags were a popular way for people to engage in the Twitter conversations around the Ferguson events, it made sense to us to start there.
Our dataset was Ed’s Ferguson Twitter archive, collected from August 10th-27th, 2014, which includes 13,238,863 tweets. This is a very small slice of time, by the way. As we know, the Ferguson protests lasted for months, with spikes in activity as recently as March 2015 when the DOJ report was released. Ed’s data was collected using the Ferguson keyword, so it picked up all kinds of tweets that were unrelated to the actual Ferguson events. These include tweets by bots, individual spammers, idiots, etc. So digging in to figure out the hashtags proved a valuable way to get an idea of what was there.
We found 112,149 unique hashtags in the Ferguson Twitter archive of 13 million+ tweets. The top five hashtags mentioned were #ferguson (9,044,800), #mikebrown (868,876), #michaelbrown (209,714), #tcot (117,283), and #justiceformikebrown (116,674). Below is a simple visualization of the top 25. Here is the full list including hashtags and number of mentions.
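For anyone who wants to try a similar tally on their own collection, here is a minimal sketch of one way to count hashtags. It is not necessarily how we produced the numbers above. It assumes the archive is stored as line-delimited JSON with one tweet per line, that each tweet carries the `entities.hashtags` list found in Twitter's v1.1 tweet format, and that hashtags should be compared case-insensitively. The function name `count_hashtags` and the sample tweets are made up for illustration.

```python
import json
from collections import Counter


def count_hashtags(lines):
    """Tally hashtag mentions across line-delimited tweet JSON.

    Assumes each line is one tweet with an `entities.hashtags` list
    (Twitter v1.1 style). Tags are lowercased before counting, which
    is an assumption about how mentions should be grouped.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet = json.loads(line)
        for tag in tweet.get("entities", {}).get("hashtags", []):
            counts[tag["text"].lower()] += 1
    return counts


# Made-up sample tweets standing in for lines read from an archive file.
sample = [
    json.dumps({"entities": {"hashtags": [{"text": "Ferguson"}, {"text": "MikeBrown"}]}}),
    json.dumps({"entities": {"hashtags": [{"text": "ferguson"}]}}),
    json.dumps({"entities": {"hashtags": []}}),
]

top = count_hashtags(sample).most_common(2)
print(top)  # [('ferguson', 2), ('mikebrown', 1)]
```

With a real archive you would pass an open file handle instead of `sample`, and `len(counts)` would give the number of unique hashtags.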
So why is it a good idea to know the hashtags? One theory is that the people tweeting and retweeting in the early days, right after an event occurs, are generally the people closest to the event and the ones who are most engaged and interested, and therefore the ones producing the most authentic content. It’s true that bots can catch on quickly and flood a hashtag, but hashtags generally become popular in the first place because a core group of highly interested people consistently use them while sharing content.
For example, we know that there are 3,972,668 tweets (including retweets) with embedded media in the full Ferguson Twitter archive. This does not include tweets that link out to other media sharing sites like Instagram, YouTube, Vine and elsewhere on the Web, so the total number of media items (audio, video, image) will be much higher when those links are included. Of those 3,972,668 tweets, we know that 285,804 are original tweets (not retweets). What if we examined these 285,804 tweets to find the image files posted between August 10th-12th that also use one or more of the top five or so hashtags? Is that a good way to try to zoom in on the more authentic content? Can you think of some other methods for identifying authentic embedded imagery in these tweets?
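To make that question a little more concrete, here is a rough sketch of what such a filter could look like. It is only an illustration under some assumptions: tweets are parsed from Twitter's v1.1 JSON, retweets are identified by the presence of `retweeted_status`, embedded images appear in `entities.media` with type `photo`, and `created_at` is in the usual `Mon Aug 11 03:12:45 +0000 2014` form. The names `TOP_TAGS` and `is_candidate`, and the sample tweets, are hypothetical.

```python
from datetime import datetime, timezone

# The top five hashtags from the counts above, lowercased (an assumption).
TOP_TAGS = {"ferguson", "mikebrown", "michaelbrown", "tcot", "justiceformikebrown"}

# August 10th-12th, 2014 in UTC; END is exclusive so all of the 12th is included.
START = datetime(2014, 8, 10, tzinfo=timezone.utc)
END = datetime(2014, 8, 13, tzinfo=timezone.utc)


def is_candidate(tweet):
    """Return True for an original tweet with an embedded photo, posted
    in the early window, that uses at least one of the top hashtags."""
    if "retweeted_status" in tweet:
        return False  # retweet, not an original tweet
    entities = tweet.get("entities", {})
    photos = [m for m in entities.get("media", []) if m.get("type") == "photo"]
    if not photos:
        return False
    created = datetime.strptime(tweet["created_at"], "%a %b %d %H:%M:%S %z %Y")
    if not (START <= created < END):
        return False
    tags = {t["text"].lower() for t in entities.get("hashtags", [])}
    return bool(tags & TOP_TAGS)


# Made-up tweets: only the first one should pass every check.
photo = {"type": "photo", "media_url": "http://example.com/a.jpg"}
early, late = "Mon Aug 11 03:12:45 +0000 2014", "Wed Aug 20 03:12:45 +0000 2014"
sample = [
    {"id": 1, "created_at": early,
     "entities": {"hashtags": [{"text": "Ferguson"}], "media": [photo]}},
    {"id": 2, "created_at": early, "retweeted_status": {"id": 1},
     "entities": {"hashtags": [{"text": "Ferguson"}], "media": [photo]}},
    {"id": 3, "created_at": late,
     "entities": {"hashtags": [{"text": "Ferguson"}], "media": [photo]}},
    {"id": 4, "created_at": early,
     "entities": {"hashtags": [{"text": "Ferguson"}]}},
]

candidates = [t for t in sample if is_candidate(t)]
print([t["id"] for t in candidates])  # [1]
```

A filter like this only narrows the pile. It says nothing on its own about whether an image is authentic, so the results would still need a human look.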
By the way, did you know about #tcot? Could be interesting to dig into the content of those tweets.