Twarc-Cloud: Twitter data collection in the cloud

During my tenure at George Washington University Libraries, I was a member of the team that developed Social Feed Manager (SFM). SFM is open source software that harvests social media data from Twitter, Tumblr, Flickr, and Sina Weibo. It is intended to empower researchers, faculty, students, and archivists to collect, manage, and export social media data.

While SFM was a huge success in most respects and well received by the community, it is not widely adopted. There are around a dozen deployments in academic institutions of which I’m aware. In my experience, the reason for SFM’s deployment deficit is evident: while we worked hard to simplify deployment by using Docker (only 3 commands to bring up SFM!), at the end of the day needing a server and/or a sysadmin is a formidable barrier in most academic institutions.

Twarc-Cloud is an attempt to remove this barrier: instead of requiring a server or a sysadmin, it relies on cloud-based services, in particular Amazon Web Services (AWS). This motivated Twarc-Cloud’s design principles:

  • Serverless, so no server to maintain or pay for when not in use.
  • Use as few AWS services as possible to reduce complexity and cost.

Twarc-Cloud collects tweets using DocNow’s Twarc running as Fargate Elastic Container Service (ECS) tasks. Tweets are written to S3 buckets as gzip-compressed, newline-delimited JSON. Twarc-Cloud is managed using a command-line application. Thus, there is no web server, database, message queue, etc.
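Because harvested tweets are just gzip-compressed, newline-delimited JSON, a downloaded file can be read with Python’s standard library alone. Here’s a minimal sketch (the filename and function name are mine, for illustration):

```python
import gzip
import json

def read_tweets(path):
    """Yield one tweet (a dict) per line from a gzip-compressed,
    newline-delimited JSON file like those Twarc-Cloud writes to S3."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip any blank lines
                yield json.loads(line)

# Example: count the tweets in a downloaded harvest file.
# count = sum(1 for _ in read_tweets("harvest.jsonl.gz"))
```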

An additional benefit of using these AWS services is that scalability becomes a non-issue. You can collect as many tweets as you want and AWS will automagically scale the compute and storage resources.

Using a cloud provider immediately raises the question of cost. Since Twarc-Cloud is stingy in its use of AWS services and only uses serverless resources, the cost is remarkably low:

  • Fargate ECS: $0.012345 per harvest per hour
  • S3: $0.023 per GB per month
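With just those two line items, a back-of-the-envelope monthly estimate is simple arithmetic. A sketch (the function and constant names are mine, and the Fargate rate varies by region and task size):

```python
FARGATE_PER_HARVEST_HOUR = 0.012345  # USD, the figure quoted above
S3_PER_GB_MONTH = 0.023              # USD, S3 standard storage

def estimate_monthly_cost(harvest_hours, stored_gb):
    """Rough monthly AWS cost for Twarc-Cloud: compute plus storage."""
    return (harvest_hours * FARGATE_PER_HARVEST_HOUR
            + stored_gb * S3_PER_GB_MONTH)

# e.g. 30 one-hour harvests and 5 GB of stored tweets:
# estimate_monthly_cost(30, 5) -> about $0.49/month
```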

Twarc-Cloud supports collecting user timelines, searches, and filter streams from Twitter’s API. User timeline and search harvests can be run as needed or according to a schedule. Filter streams are run continuously (i.e., they are either on or off). Filter streams run as an ECS service so that if there is a fault such as a network hiccup, collecting will be restarted.

I’d characterize Twarc-Cloud as situated between Twarc (command-line) and SFM (web-based). While Twarc is easy to get installed and running, it doesn’t provide scheduling or uninterrupted collecting. And, as mentioned, while Twarc-Cloud doesn’t require a server like SFM, it lacks a web interface and the enterprise-like features of SFM such as user/group management.

Here’s a whirlwind tour of using Twarc-Cloud.

A collection is specified by a collection.json file. Here’s an example for collecting some user timelines:

{
  "id": "dem_candidates",
  "keys": {
    "consumer_key": "mBbq9ruEcnggfQzgTHUhr8eKn0",
    "access_token": "2875189485-cf3rCYq59k1adfdhV88fyShQZ0rUsbZszp1"
  },
  "type": "user_timeline",
  "users": {
    "216776631": {
      "screen_name": "BernieSanders"
    },
    "357606935": {
      "screen_name": "ewarren"
    }
  },
  "delete_users_for": ["protected", "suspended", "not_found"],
  "timestamp": "2019-03-14T01:20:37.851912"
}
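Since collection.json is plain JSON, it’s also easy to inspect or script against with the standard library. A sketch assuming the structure shown above (credentials omitted, values abbreviated):

```python
import json

# An abbreviated collection.json, following the structure shown above.
config_text = """
{
  "id": "dem_candidates",
  "type": "user_timeline",
  "users": {
    "216776631": {"screen_name": "BernieSanders"},
    "357606935": {"screen_name": "ewarren"}
  }
}
"""

config = json.loads(config_text)
# Pull out the screen names being collected.
screen_names = [u["screen_name"] for u in config["users"].values()]
```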

Twarc-Cloud provides a number of commands to make creating a collection.json easier. For example:

$ python collection-config screennames @KamalaHarris
Getting users ids for screen names. This may take some time …
Added screen names to collection.json.

A collection can then be added and scheduled:

$ python collection add
Collection added.
Don’t forget to start or schedule the collection.
$ python collection schedule dem_candidates "rate(7 days)"

Twarc-Cloud will now run a harvest for this collection every 7 days.
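The "rate(7 days)" string is a CloudWatch Events schedule expression. If you want to sanity-check such expressions locally before scheduling, a small parser sketch (my own helper, not part of Twarc-Cloud):

```python
import re
from datetime import timedelta

def parse_rate(expression):
    """Convert a CloudWatch-style rate expression, e.g. 'rate(7 days)',
    into a timedelta. Supports minutes, hours, and days."""
    m = re.fullmatch(r"rate\((\d+) (minutes?|hours?|days?)\)", expression)
    if not m:
        raise ValueError(f"Not a valid rate expression: {expression!r}")
    value = int(m.group(1))
    unit = m.group(2).rstrip("s") + "s"  # normalize to plural for timedelta
    return timedelta(**{unit: value})

# parse_rate("rate(7 days)") -> timedelta(days=7)
```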

Harvests can be monitored:

$ python harvest list
tweets_mentioning_trump => Bucket: twarc-cloud2. Status: RUNNING
dem_candidates => Bucket: twarc-cloud2. Status: RUNNING
$ python harvest last dem_candidates
dem_candidates => Bucket: twarc-cloud2. Harvest timestamp: 2019-03-14T01:29:18.684991. Tweets: 9,641. Files: 1 (3M)
No user changes.

For user timelines, not only are a user’s tweets collected, but so is information on the user’s account. If any changes are noticed for a user, these are recorded.
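The change-tracking idea is essentially a snapshot diff: compare the user metadata recorded at the last harvest with what the API returns now. A minimal sketch of that approach (the field names follow Twitter’s user object; the function is illustrative, not Twarc-Cloud’s actual code):

```python
def diff_user(old, new, fields=("screen_name", "name", "protected")):
    """Return {field: (old_value, new_value)} for each watched field
    that changed between two snapshots of a Twitter user object."""
    return {
        f: (old.get(f), new.get(f))
        for f in fields
        if old.get(f) != new.get(f)
    }

# e.g. a user renamed their account:
# diff_user({"screen_name": "old_name"}, {"screen_name": "ewarren"})
# -> {"screen_name": ("old_name", "ewarren")}
```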

And when you’re ready, a collection can be downloaded:

$ python collection download dem_candidates
Collection downloaded to download/twarc-cloud2/collections/dem_candidates

Lastly, Twarc-Cloud keeps track of every change that is made to a collection:

$ python collection-config changes dem_candidates
users -> 216776631 -> screen_name changed from None to BernieSanders on 2019-03-14T01:27:14.420473
users -> 357606935 -> screen_name changed from None to ewarren on 2019-03-14T01:27:14.420473
users -> 30354991 -> screen_name changed from None to KamalaHarris on 2019-03-14T01:27:14.420473

For more details, check out the quickstart in Twarc-Cloud’s docs.

Some additional possibilities:

  • Twarc-Cloud is a potential step in the direction of Twitter data collection as a service (similar to Internet Archive’s Archive-It for web archiving) that conforms with Twitter’s Developer Policies. If you’re interested in exploring this, please reach out.
  • Having your Twitter data stored on S3 opens up all sorts of analytic possibilities, including EMR (Hadoop), SageMaker (machine learning), Comprehend (text analysis), and Elasticsearch (fulltext indexing).

Twarc-Cloud builds upon the work of the teams at GW Libraries and DocNow. I am greatly appreciative of their contributions to this field, as well as my colleagues at Stanford University Libraries’ Digital Library Systems and Services for schooling me on AWS.

As always, feedback, discussion, tickets, and pull requests are welcome.