Social Media Imagery in the Archive
TL;DR Pictured above is a screenshot of a viewer for #FreddieGray imagery that is being shared on Twitter. Read on for the details of how it was assembled, and why.
In keeping with our theme of exploring the use of Twitter as a lens for archival work, I thought I would briefly describe some work Bergis and I have been doing around collecting #FreddieGray tweets, and more specifically the media associated with these tweets.
On April 12 Freddie Gray was chased and arrested by police in Baltimore. He was in a coma within hours of his arrest, and died a week later. 80% of his spine was severed at his neck. Approximately 4,000 people are protesting at City Hall in Baltimore as I write this. I should really be there instead of collecting tweets.
As of today (April 25th) at 2:30PM there have been 190,971 tweets with the #FreddieGray hashtag since April 15th.
As Bergis indicated in a previous post we are particularly interested in the media (images, video and audio) that are shared via social media. Twitter provides an unprecedented source of documentary evidence for events on the ground. And the social interactions (replies, retweets, favorites) likewise form an essential part of the context around these media files.
Also, next week the Parren Mitchell Symposium at the University of Maryland (just down the road from the protests in Baltimore) is focused on Intellectual Activism, Social Justice, & Criminalization. I was hoping to provide a view into some of this Twitter content with media, to possibly put on display during the event. So here’s what I did.
- I used twarc to collect tweets using the #FreddieGray hashtag between April 15th and April 25th. This yielded 190,971 tweets.
- I kept only the tweets with embedded media. This decreased the pool of tweets to 73,424 tweets (38%).
- I then kept only the original tweets (removing retweets), which further decreased the pool to 5,822 (3%).
- When multiple tweets include the same image, the image URL included in the data for each tweet is exactly the same. Perhaps the image is hashed in some way to generate the unique URL. So we can further reduce the pool of tweets by deduplicating identical images. This yields 4,863 tweets (2.5%).
- Finally I only included tweets that had been retweeted at least once. The rationale here is that it will be an indicator that someone else found the content valuable, and that it’s not spam. This lowered the pool to just 1,979 tweets (1%).
So you can see the results in our browser. You should be able to reuse the code if you build a similar list of Twitter identifiers. I’ve included the Python script that runs through steps 2–5 for you, if you’ve got a tweet dataset.
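For readers without the script to hand, here is a minimal sketch of how the filtering steps above might look. It is not the script mentioned in this post. It assumes line-oriented JSON as written by twarc, with tweets in the Twitter v1.1 format (`entities.media[].media_url`, `retweeted_status`, `retweet_count`). The function names, and the choice to deduplicate on the first image URL of each tweet, are my own assumptions for illustration.

```python
import json


def read_tweets(path):
    """Yield one tweet dict per non-empty line of a line-oriented JSON file."""
    with open(path) as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


def media_urls(tweet):
    """Return the media URLs embedded in a tweet (empty list if none)."""
    media = tweet.get("entities", {}).get("media", [])
    return [m["media_url"] for m in media if "media_url" in m]


def filter_tweets(tweets):
    """Apply the four filtering steps in the order described in the list."""
    seen = set()
    for tweet in tweets:
        urls = media_urls(tweet)
        # Keep only tweets with embedded media.
        if not urls:
            continue
        # Keep only original tweets; retweets carry a retweeted_status.
        if "retweeted_status" in tweet:
            continue
        # Deduplicate on the first media URL (an assumption; the first
        # tweet seen with a given image wins).
        if urls[0] in seen:
            continue
        seen.add(urls[0])
        # Keep only tweets that were retweeted at least once.
        if tweet.get("retweet_count", 0) < 1:
            continue
        yield tweet
```

With a hypothetical file called `tweets.json`, something like `ids = [t["id_str"] for t in filter_tweets(read_tweets("tweets.json"))]` would give the list of tweet identifiers. Because deduplication runs before the retweet check, an image whose first appearance was never retweeted is dropped even if a later copy was; swapping the last two steps would change that.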
Here is a small sampling of some of the tweets I saw go by:
Obviously this is just a start. How do we automate this media collection so it’s easy? How do we preserve the media (these images in this post are referenced on Twitter)? How do we present the context around the images: the retweets, favorites, replies and conversation? How do we use the social context as a filter for the content? How can we involve the media creators in the archive so that we respect their intellectual property rights? Let us know if you have ideas by commenting here or getting in touch with Bergis or me on Twitter.
Thanks to Kyle Bickoff at MITH for mobilizing the tweet collection!