How Do We Monitor Social Media?

R. Michael Alvarez
Trustworthy Social Media
3 min read · Nov 24, 2020

One important step towards building a more trustworthy social media is to develop reliable methods for collecting social media data.

How do we monitor social media data, especially over long periods of time?

Our research group has been monitoring various topics on Twitter since 2014. For example, our election monitor has run in the 2014, 2016, 2018, and 2020 US election cycles. Below is an example of a word cloud of tweets we have collected recently on election fraud.

Word Cloud of November 30, 2020 Tweets About Election Fraud

We’ve also used our monitor to collect conversations on Twitter over long periods of time on other topics, like immigration, gay marriage, abortion, and gun control. Members of our research group have put these data to use; for example, Nicholas Adams-Cohen used one of these datasets to study how the landmark Obergefell v. Hodges decision altered conversations about gay marriage on Twitter.

Since 2014, we’ve been building various types of architectures to reliably and robustly collect data from Twitter’s public “streaming API” over long periods of time. We’ve found that many things can interfere with collecting Twitter data in real time over weeks, months, and even years. For example, when we have used local servers to collect these data, sometimes the servers fail; sometimes the local network fails; and in other situations, events force the monitors to stop collecting data, sometimes at important moments.
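One generic pattern that helps with these interruptions is to wrap the stream connection in a reconnect loop with exponential backoff. The sketch below is not our production code; `collect_with_retries` and `flaky_stream` are hypothetical names, and the "stream" is simulated by a plain callable, but the retry logic illustrates the idea.

```python
import random
import time


def collect_with_retries(connect, max_retries=5, base_delay=1.0):
    """Run a streaming-collection callable, reconnecting with
    exponential backoff (plus jitter) after transient failures.

    `connect` is any zero-argument callable that opens a stream and
    blocks while collecting; it raises ConnectionError on disconnect.
    """
    attempt = 0
    while attempt <= max_retries:
        try:
            return connect()
        except ConnectionError:
            # Back off before reconnecting: 1x, 2x, 4x, ... the base delay.
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            time.sleep(delay)
            attempt += 1
    raise RuntimeError("stream failed after %d reconnect attempts" % max_retries)


# Example: a simulated "stream" that drops twice before succeeding.
calls = {"n": 0}

def flaky_stream():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("dropped")
    return "collected"

print(collect_with_retries(flaky_stream, base_delay=0.01))  # -> collected
```

In a real deployment the same wrapper would surround the call that opens the streaming connection, so a server hiccup or network drop costs seconds of data rather than ending the collection run.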

Recently, to try to alleviate these problems, we’ve been moving our architectures to the cloud. While that might seem like an easy thing to do, in particular for readers who know how to write Python code and who can work in the cloud, in all honesty it’s harder than you might think!

Even better, we developed the architecture to collect, pre-process, and store our Twitter monitor data using three different cloud platforms: Google Cloud Platform, Oracle Cloud, and Amazon Web Services. While the details vary by platform, we have built a generalized approach that others can use for their research using Twitter’s public streaming APIs (as well as other public streaming APIs).

Basically, our architecture uses a simple workflow with four parts. In the first stage, a producer requests the social media data from a public API, say tweets matching certain election keywords (as in our current election monitor). The producer then passes the data in real time to a data stream for short-term storage, essentially a place where we can time-stamp the data and make sure that we are capturing it. The data consumer then reads the data from the stream and, at that point, can either pass them to other workflows for pre-processing (say, labeling or geotagging) or send them directly to a storage solution. Sometimes the storage component is short-term cloud storage; in other cases we secure the data in the cloud but also pass it to the research group through an accessible storage and sharing application (like Google Drive, Box, or Dropbox).

We’ve written a working paper that presents this architecture in more detail, and provides information about how we implement it using three different cloud computing systems. The paper, “Reliable and Efficient Long-Term Social Media Monitoring”, written by Jian Cao, Nicholas Adams-Cohen, and myself, is available as a preprint online.

We’re working to make the code used for these architectures available for others to use, and recently Jian Cao (a postdoctoral scholar in our research group) put it on his GitHub repo. We hope that our paper and the availability of our code encourage others to use these tools, to improve upon them, and to share their own methods and tools for collecting and analyzing long-term social media datasets.

We’re also working on a process for sharing much of the archival data that we have collected using these processes going back to 2014. More on that soon!
