How Geocoded Tweets Turned Into A Junk Drawer Of Job-Wanted Ads and Bar Checkins (And Why That’s Pretty Good For User Privacy)
The tl;dr: If you are considering using geocoded Twitter to understand places more granular than cities, reconsider. Most currently geocoded tweets are automatic bot posts or cross-posts from other social media. This data will tell you little about the activity of a location beyond job-postings and social media checkins. For a few years data visualizers, journalists, and researchers had access to such a rich datasource in geocoded tweets mainly because Twitter users didn’t really understand they were divulging their precise location when tweeting. So it’s good Twitter fixed it.
Recently, the Creators Project reposted a 2014 article/tutorial describing how MapBox data visualizer Eric Fischer made a set of world-wide maps of geocoded tweets. Fischer stored six billion tweets over the course of three years, and his map is made up of over 590 million unique points. Even two years later, it’s impressive visualization work, confirming little details you always knew to be true (Times Square is lit up with tweets!) while providing opportunities for new discoveries (I didn’t realize how popular highways were for tweeting). Together, these tweets show fine-grained information about where millions of people are at any given time, and the text and timestamps behind each dot tell us something about how people feel and act at every point in time and space.
New data visualizers finding this post today may be tempted to recreate and extend Fischer’s work by making a map of their own community’s activity in 2016. Sadly, visualizers attempting to recreate Fischer’s work will find far fewer precisely-geotagged tweets. Those geotagged tweets that are being posted today are mostly automated spam and crossposts from Instagram, Foursquare, and Untappd.
This change in the quantity and quality of the data wasn’t gradual, nor was it the result of organic changes in user behavior. Instead, Twitter realized that its users didn’t correctly understand how much geographic information they were divulging, and it changed the Twitter mobile app to better match user expectations. This is a great example of changing an application’s interface to better match its users’ mental models, but it is bittersweet for people who like to play with and explore data.
Comparing 2014 and 2015 Geocoded Tweets
In order to illustrate exactly how much geocoded tweets have changed, let’s look at the data! I haven’t been collecting tweets of any kind for that long, so for this investigation, people have kindly given me data to work with. One set of data we’ll look at is the precisely geocoded tweets from a number of large American cities and their surrounding areas from January 2014 to the present. I got this data from Dan Tasse (see our awesomely named paper from this dataset: Our House, In The Middle Of Our Tweets). Another set we’ll look at later is a subsample of the Twitter decahose for July to September 2014. The decahose records a random 10% sample of all tweets, and I was given a subsample of that sample by Jürgen Pfeffer. Thanks to both of them for giving me this historical data that would be impossible for me to access otherwise.
For brevity’s sake, I’m going to stop saying precisely-geocoded and just say geocoded. When I say that a tweet is geocoded, I mean that the tweet has latitude/longitude coordinates that indicate the exact location where the user was when they posted the tweet. For example, a tweet may be geocoded to the point <-79.972013, 40.556166> which will give the user’s location within a few feet.
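To make the definition concrete, here is a minimal sketch of how one might test whether a tweet is precisely geocoded. It assumes each tweet is a dict parsed from the JSON that Twitter's v1.1 REST and streaming APIs return, where the `coordinates` field is either null or a GeoJSON point listing longitude first. The function name and the two sample tweets are made up for illustration.

```python
def precise_coordinates(tweet):
    """Return (longitude, latitude) if the tweet carries an exact point, else None."""
    coords = tweet.get("coordinates")
    if not coords or coords.get("type") != "Point":
        return None
    # GeoJSON order is [longitude, latitude], matching the example point above.
    lon, lat = coords["coordinates"]
    return lon, lat


# Hypothetical tweets: one precisely geocoded, one tagged only with a place.
geocoded = {
    "text": "hello",
    "coordinates": {"type": "Point", "coordinates": [-79.972013, 40.556166]},
}
place_only = {
    "text": "hi",
    "coordinates": None,
    "place": {"place_type": "city", "full_name": "Pittsburgh, PA"},
}

print(precise_coordinates(geocoded))
print(precise_coordinates(place_only))
```

A tweet that only names a place has no `coordinates` value, so it is not counted as geocoded here.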
First, let’s figure out if the number and content of the geocoded tweets have changed in the past few years. In order to explore this, we’ll use the cities dataset to stand in for precisely-geocoded tweets everywhere. This is a pretty safe generalization as Twitter is mostly used by urban and suburban users, and the cities we are sampling are from throughout the United States. Of course, these are only American cities, and they may not generalize well internationally. Particularly, the date of the drop in tweets that we’ll see may be different if Twitter rolled out changes at different times for different countries.
Looking at the geocoded city tweets, we see that sometime in April of 2015, something changed and the number of geocoded tweets dropped precipitously. Where before the cities saw between 50,000 and 200,000 geocoded tweets per week, now these cities have at most 35,000 tweets. We’re looking at an approximately 80% drop in geocoded tweets.
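If you want to produce this kind of weekly count from your own collection, a rough sketch follows. It assumes each tweet is a dict with the `created_at` string that Twitter's v1.1 API uses (for example "Wed Apr 01 09:30:00 +0000 2015") and an English locale for parsing day and month names. The three sample timestamps are invented, not drawn from the cities dataset.

```python
from collections import Counter
from datetime import datetime

# Timestamp layout used by Twitter's v1.1 API for created_at.
TWITTER_TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def weekly_counts(tweets):
    """Count tweets per (ISO year, ISO week number)."""
    counts = Counter()
    for tweet in tweets:
        created = datetime.strptime(tweet["created_at"], TWITTER_TIME_FORMAT)
        iso_year, iso_week, _ = created.isocalendar()
        counts[(iso_year, iso_week)] += 1
    return counts


# Invented timestamps for illustration only.
sample = [
    {"created_at": "Mon Mar 30 14:05:00 +0000 2015"},
    {"created_at": "Wed Apr 01 09:30:00 +0000 2015"},
    {"created_at": "Mon Apr 06 18:45:00 +0000 2015"},
]

for week, count in sorted(weekly_counts(sample).items()):
    print(week, count)
```

Grouping by ISO week keeps every bucket exactly seven days long, which makes a sudden drop easier to spot than it would be with calendar months.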
The culprit is Twitter’s 2015 decision to make the privacy implications of geocoding more transparent to users (and to better integrate its then-new partnership with Foursquare). In the original mobile application, if a user chose to mark their tweet with a place like Joe’s Diner or Pittsburgh, PA, no matter how broad that place may have been, they were also geocoding their tweet and divulging their precise location. New versions of the mobile app now require the user to explicitly share their latitude/longitude coordinates separately from sharing a place. The interface even shows the coordinates the user will divulge, and the user must opt in again every time they tweet. Great for privacy and consumer rights, but it’s clear why this would lead to a drop in geocoded tweets.
Differences in the Content and the Users
Intrepid explorers of data may be noting right now that there are still approximately 75,000 geocoded tweets per week being produced by the Twitter users in the cities graphed above. That is not an amount of data to be scoffed at and could contain valuable data.
But in order for this reduced stream of tweets to be as valuable to us in 2016 as the original stream was for Eric Fischer’s visualization of human activity, the reduction in geocoded tweets would have to be random, with eight out of ten formerly-geocoding tweeters opting out of geocoding. Let’s see if this is a correct assumption or if the users geocoding their tweets now are not representative of Twitter users in general.
In order to do this, let’s focus on the city of Pittsburgh in particular because we have the longest span of data, pick two timeframes before and after April 2015, and compare who is tweeting and what kinds of things they are saying. I’ve chosen two chunks that begin and end on the same month and day, but one is from 2014/2015 and one is from 2015/2016. This will allow us to compare times without worrying about issues of seasonality. 2014/2015 tweets total 1,141,350 and the 2015/2016 tweets total 216,830.
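As a quick check, the two Pittsburgh totals can be turned into a percentage drop. This small sketch uses only the counts quoted above. The function name is my own.

```python
def percent_drop(before, after):
    """Percentage decrease from `before` to `after`."""
    return 100.0 * (before - after) / before


tweets_2014_2015 = 1_141_350
tweets_2015_2016 = 216_830

drop = percent_drop(tweets_2014_2015, tweets_2015_2016)
print(f"{drop:.1f}% fewer geocoded tweets")
```

The result is about 81%, in line with the roughly 80% drop seen across all of the cities.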
Reading through the tweets, it is immediately obvious that the two samples do not come from the same group of users. The older 2014/2015 tweets are hard to categorize and include a mixture of commentary, link-sharing, and discussion that will be familiar to any Twitter user. Perhaps the only systematic pattern is cross-posted tweets from Instagram.
A sample of the old tweets:
- 😩I’m not shit she said ya dad was on the news I said oh did he die 😩
- “@EmrgencyKittens: He’s a little lion! http://t.co/sjAZqlKYvV” 😍😭❤️
- @tylerxcii omg bless ur soul. I’ll fan girl about this on my tumblr
- To: @HouseofCards Thru: @netflix nice woman on this show
- Food trucks galore! Grab an early dinner, and head to our bar for $1 off drafts! #happyhour til 6pm @… http://t.co/YE2Ga4r1Wt
The new, 2015/2016 geocoded tweets have little in common with what any regular Twitter user would expect of their own feed. Instead of the normal stream of observations, link-sharing, and conversation, there is an endless stream of photo captions, job-wanted postings, and social media checkins.
A sample of the new tweets:
- @pehicc @rickygervais Hope tbose thieving,cruel bast*rds ROT!!!
- Lets Go Pens (@ CONSOL Energy Center — @consolenergyctr for Chicago Blackhawks vs Pittsburgh Penguins) https://t.co/nFrhMudnr6
- #PITTSBURGH, PA #CustomerService #Job: Sales Assistant at OfficeTeam https://t.co/d3s1rgZQlm #OfficeTeam #Jobs #Hiring
- partly cloudy -> fair temperature down 64°F -> 62°F humidity up 63% -> 72% wind 6mph -> 8mph pressure 29.98in falling
- Want to work in #Pittsburgh, PA? View our latest opening: https://t.co/3U28TsRU0i #Nursing #Veterans #Job #Jobs #Hiring #CareerArc
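More generally, most of the new geocoded tweets can be broken down into a few key categories: crossposts from Instagram, checkins from Foursquare and Untappd, job postings, and automated informational posts such as weather and traffic reports.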
We can piece together exactly why each of the categories above has come to dominate the new geocoded data. Foursquare and Untappd are two social networks whose main feature is alerting others that a person has been at a specific location. Foursquare does this generally, and Untappd caters to beer enthusiasts. When these third-party applications offer to crosspost a location checkin, it is reasonable that the crossposted tweet contains specific geocoded information as well. Instagram behaves similarly even though it is not explicitly location-focused.
It is important to note that none of these cross-posted geocodes actually indicate where the user specifically is located. Instead, the geocodes indicate an entrance or center-point of a place. For larger locations like cities, this center-point may be far away from where the user really was when they posted. The specific coordinates also depend on which service is doing the geocoding. Instagram’s center-point location for Pittsburgh is 1.5 km away from Google Maps’ center-point, for example. David Shamma has a nice post about the social implications of the collapse of all tweets to center-points: The Social Concerns of Geo-Located Rectangles.
Instagram alone makes up almost 40% of the current geocoded tweets. Untappd and Foursquare make up another 10%. These crossposts, though not novel data sources, still at least tell us something about daily activity if we filter out overly broad locations.
The other class of users who geocode today are accounts that automatically post information without any intervention of a human being. These accounts may be spammy, may be in good faith, but they tell us very little about the places they are geotagging.
Tweets with #hiring or #jobs make up 23% of all geocoded tweets today. That’s staggering and implies that one in four users are actively looking for jobs or to hire people! Of course they aren’t. Instead human-resource companies like peoplefluent.com, jibe.com, and careerarc.com post job postings to Twitter and geocode the center point of whatever geographic location they are posting the job for.
What kinds of jobs might you get through the Twitter Wanted ads? Looking through the postings, medical professional jobs are popular: nurses, home health aides, and medical assistants. Also popular are hospitality and restaurant workers, as well as customer service representatives.
The other common class of automatic posts is civically-minded, informational posts about particular locations from dedicated individuals, news organizations, or government entities. Many usernames have “WX” appended to them, indicating they report weather for a particular location. (Fun fact: WX was the telegraph shortening for weather and is still used as shorthand, “especially within the weather fraternity.” Thanks, Yahoo Answers.) Every traffic issue out of the state of New York is posted and geocoded at @511NY. There’s also an @511Alaska and @511Alberta. Is there a vendor who operates all these accounts? Most likely.
Taken as a whole, the new geocoded tweets are clearly different from the older set. They have little in common with normal Twitter and are instead made up mostly of cross-posts and automatic tweets. If you are interested in traffic in New York state, this information will perhaps interest you. For the rest of us though, this data is pretty meager and only tangentially related to general human activity.
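For readers who want to try a similar breakdown on their own data, here is a rough sketch of one way to bucket tweets. It is not the exact method behind the percentages in this article. It assumes each tweet dict has a `text` field and the `source` field from Twitter's v1.1 API (the name of the posting application, usually wrapped in an HTML link). The app-name substrings, the category labels, and the `source` values in the sample are my own guesses for illustration.

```python
import re
from collections import Counter

# Assumption: the posting app's name appears somewhere in the source string.
CROSSPOST_APPS = ("instagram", "foursquare", "untappd")
# The two hashtags used above to spot job postings.
JOB_HASHTAGS = re.compile(r"#(hiring|jobs)\b", re.IGNORECASE)


def categorize(tweet):
    """Assign a tweet to a crosspost app, 'job posting', or 'other'."""
    source = (tweet.get("source") or "").lower()
    for app in CROSSPOST_APPS:
        if app in source:
            return app
    if JOB_HASHTAGS.search(tweet.get("text") or ""):
        return "job posting"
    return "other"


# Tweet texts are shortened from the samples above; sources are hypothetical.
sample = [
    {"text": "Lets Go Pens (@ CONSOL Energy Center)",
     "source": '<a href="http://foursquare.com" rel="nofollow">Foursquare</a>'},
    {"text": "Want to work in #Pittsburgh, PA? #Nursing #Job #Jobs #Hiring",
     "source": "hypothetical job board client"},
    {"text": "partly cloudy -> fair temperature down 64°F -> 62°F",
     "source": "hypothetical weather bot"},
]

print(Counter(categorize(tweet) for tweet in sample))
```

Weather and traffic bots land in "other" here. Separating them out would need extra rules, such as matching usernames that end in "WX".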
Can Twitter’s Place Location Provide Adequate Replacement?
Twitter gives users the option to tag their tweet with a particular Foursquare location, a process I’m calling placecoding. Though the tweet does not have precise latitude and longitude coordinates, if the user has provided a specific enough placecode, then their location can be inferred from the placecode. If specific enough places are common, the loss of the precise geocodes that we have seen would not be so big after all.
In order to explore whether the number of placecoded tweets has increased or decreased, we need to look at a different dataset than our cities dataset, as the data we’ve been studying only contains geocoded tweets, not placecoded tweets or non-geographic tweets. To do this, I’m going to examine the Twitter decahose and compare it to a sample of tweets I streamed from Twitter’s public streaming sample. While these two samples are from different points in time and of vastly different volume, they’ll allow us to roughly compare all the placecoded tweets.
First, we see in this different dataset the same precipitous drop in geocoded tweets. The cities dataset saw an approximately 80% decrease. Here we see an 84% decrease. But placecoded tweets have not changed in volume much.
We have the volume of placecoded tweets, but are the places specific enough to be useful for research, visualization or journalism? If they are as specific as Foursquare venues or Untappd bars, that will probably be enough for most uses.
Twitter helpfully categorizes every placecode based on its size. These range from a place of interest—similar to a Foursquare venue—all the way up to a country, with neighborhoods, cities, and admin areas (states/provinces) in the middle.
Looking at the randomly sampled data, of those tweets that do placecode, the majority do so at the city level or admin level. Places of interest and neighborhoods each get less than one percent of placecodes. Again, we can attribute this distribution to Twitter’s mobile applications. When a user does opt to placecode, the city-level is the default option.
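A tally like this takes only a few lines to compute. The sketch below assumes tweets parsed from Twitter's v1.1 JSON, where a placecoded tweet has a `place` object whose `place_type` is one of "poi", "neighborhood", "city", "admin", or "country". The five sample tweets are toy data and do not reflect the real distribution.

```python
from collections import Counter


def place_type_shares(tweets):
    """Share of placecoded tweets at each place_type, ignoring tweets with no place."""
    counts = Counter(
        tweet["place"]["place_type"] for tweet in tweets if tweet.get("place")
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {place_type: count / total for place_type, count in counts.items()}


# Toy data for illustration only.
sample = [
    {"place": {"place_type": "city"}},
    {"place": {"place_type": "city"}},
    {"place": {"place_type": "admin"}},
    {"place": {"place_type": "poi"}},
    {"place": None},
]

for place_type, share in sorted(place_type_shares(sample).items()):
    print(f"{place_type}: {share:.0%}")
```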
Because the majority of placecodes are for very large geographic areas, placecodes are helpful for understanding the activity of cities, states, and countries. But for more granular comparisons, the majority of placecoded tweets are not useful.
Final Thoughts And An Interactive Map
In order to give a sense of the status of geocoded and placecoded tweets in June of 2016, I’ve put together an interactive map of a weekend’s worth of geocoded and placecoded tweets for the state of Pennsylvania. The weekend’s worth of data reconfirms the conclusions this article has already made. My filtering method for the map, if anything, overestimates “real” geocoded tweets, as I only filtered out the broadest categories of automatic posts.
Twitter has been researchers’ and visualizers’ favorite social network because of the ease with which curious people can extract and stream its data. Many studies and visualizations have been built upon aggregating and exploring millions of 140-character tweets. For a number of years, we had it good when it came to geocoded tweets, but that knowledge was based on users’ lack of awareness of how much information they were really divulging.
While the general spigot of tweets continues on, it seems like the end of the road for those of us interested in using Twitter to understand the spontaneous thoughts and activity that occur at specific locations. That is unfortunate. While Foursquare offers checkins and Instagram offers photo captions, no social media service has quite the same breadth of observations and feelings that Twitter offers. Twitter users have greater control of their own privacy today, but that privacy has meant that a wonderful source of information about human activity has disappeared. I only wish there was a way for us to have our cake and eat it too.
If you want to talk more about this, feel free to send me an email at SciutoAlex@gmail.com. Or tweet me. I’m @sciutoalex. And if you are curious, I rarely to never geocode/placecode my tweets.