Some Quick Data Science: A Look At The San Francisco Tweetspace
Right after grad school I started working on a dream project of mine: A location-tied message posting app. After a lot of consideration I decided that the best way to implement this idea was as a Twitter client, and so, Landburd (now in beta!) was born.
At its most basic level of functionality, Landburd searches and returns tweets associated with various apps in order to give the user a sort of basic sense of what’s happening nearby. The kinds of tweets I’m currently presenting are posted via Swarm, Foursquare, and Instagram. I also show regular tweets if they’re heavily favorited and have a picture attached. This means I’m naturally kind of curious about where and why people post tweets like this, and it’d be cool to get a sense of how useful their information is to my users.
So when I was presented with the opportunity to do a small project for a data science fellowship interview, I decided to examine about 8.5 days of tweets posted within 10 miles of San Francisco, CA. I did this by querying the Twitter search API from Python and dumping the results to a local MySQL database. From there I can run interesting queries or make plots using the Python matplotlib package.
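For anyone who wants to try something similar, here is a minimal sketch of the "filter by distance, then dump to a database" step. It is not the code behind this project: it uses Python's built-in sqlite3 as a stand-in for MySQL, the center coordinates are an approximate value I picked for San Francisco, and the tweet dictionaries with their field names (id, source, text, lat, lon, favorites) are hypothetical rather than the real API response format.

```python
import math
import sqlite3

# Assumptions: approximate city-centre coordinates and the 10-mile radius.
SF_LAT, SF_LON = 37.7749, -122.4194
RADIUS_MILES = 10.0
EARTH_RADIUS_MILES = 3958.8


def miles_between(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles, using the haversine formula."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def store_nearby(conn, tweets):
    """Store tweets that fall inside the radius.

    Returns the number of tweets that passed the distance filter.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tweets ("
        "id INTEGER PRIMARY KEY, source TEXT, text TEXT, "
        "lat REAL, lon REAL, favorites INTEGER)"
    )
    kept = [
        (t["id"], t["source"], t["text"], t["lat"], t["lon"], t["favorites"])
        for t in tweets
        if miles_between(SF_LAT, SF_LON, t["lat"], t["lon"]) <= RADIUS_MILES
    ]
    # INSERT OR IGNORE skips ids that are already in the table, which can
    # happen when overlapping searches return the same tweet twice.
    conn.executemany(
        "INSERT OR IGNORE INTO tweets VALUES (?, ?, ?, ?, ?, ?)", kept
    )
    conn.commit()
    return len(kept)
```

Calling store_nearby(sqlite3.connect("tweets.db"), batch) once per batch of search results would build up the table over time; the file name here is just a placeholder.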
So what does San Francisco look like in tweets? Well, something like this:
This map has a lot of fun features. My favorites are the way that Market Street seems to be a line of symmetry for the distribution, how you can see all the people tweeting from ferries and boats, and whatever it is that creates tweets at even intervals across the Bay Bridge. It’s also easy to see that, despite Instagram’s poor integration with Twitter, posting to your feed via Instagram is very popular.
One way that my app determines the priority of a tweet’s information is by how many favorites it has. So how are people currently favoriting location-tagged tweets?
As you can see in this 1-D histogram of the number of favorites for all of the tweets captured, posts with more than 5 favorites are pretty rare. Let’s see what the map looks like when I make a cut for favorites > 5.
With this cut for only “well favorited” tweets, we can definitely see some clustering. This hints, but doesn’t prove, that where a tweet is posted from may affect how many favorites it receives. An alternative hypothesis would be that this clustering results from multiple posts by heavily followed accounts that have a baseline high engagement level relative to other users. I’d have to look closer at the other data from tweets in these clusters to know for sure.
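The favorites histogram and the "more than 5 favorites" cut can be sketched in a few lines of plain Python. This is only an illustration under assumptions: each tweet is a dictionary with hypothetical favorites and user fields, and the last helper is one simple way to start probing the alternative hypothesis, by checking whether a handful of accounts dominate the well-favorited set.

```python
from collections import Counter


def favorites_histogram(tweets):
    """Map each favorite count to the number of tweets with that count."""
    return Counter(t["favorites"] for t in tweets)


def well_favorited(tweets, threshold=5):
    """Keep only tweets with strictly more than `threshold` favorites."""
    return [t for t in tweets if t["favorites"] > threshold]


def tweets_per_user(tweets):
    """Count how many tweets each account contributes."""
    return Counter(t["user"] for t in tweets)
```

If tweets_per_user(well_favorited(all_tweets)) shows that a few accounts supply most of the tweets in a cluster, that would favor the heavily-followed-account explanation over a purely location-driven one.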
As I alluded to earlier, posting to Twitter via Instagram remains popular even though you can’t directly see Instagram photos from inside the official Twitter client. But does this affect engagement (as measured by favorites) when compared against photos posted using Twitter’s native image service? This is something I can inspect with some simple MySQL queries of the data set.
So, yes: Even though photos posted via Instagram are far more numerous, they average about a full favorite fewer than their Twitter pic counterparts. It’s a good thing I added the ability to view Instagram photos from directly within my app! Not being able to see an image easily really affects how likely people are to favorite it.
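A comparison like this boils down to one GROUP BY query. The sketch below shows the shape of it, assuming a hypothetical photo_tweets table with source and favorites columns; it runs against sqlite3 so it works without a MySQL server, though the same SELECT should be valid in MySQL too. The table and column names are mine, not the project's actual schema.

```python
import sqlite3

# Count and mean favorites for each posting source.
QUERY = """
SELECT source, COUNT(*) AS n_tweets, AVG(favorites) AS mean_favorites
FROM photo_tweets
GROUP BY source
ORDER BY source
"""


def create_table(conn):
    """Create the (hypothetical) table the query expects."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS photo_tweets ("
        "id INTEGER PRIMARY KEY, source TEXT, favorites INTEGER)"
    )


def favorites_by_source(conn):
    """Return a list of (source, n_tweets, mean_favorites) rows."""
    return conn.execute(QUERY).fetchall()
```

One caveat on the design: a mean is sensitive to a few very popular tweets, so comparing medians, or the share of photos with at least one favorite, would be a reasonable cross-check.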
Finally, I’d like to take a closer look at posts to Twitter made via Swarm. When you check in with Swarm and post it to Twitter, the default text starts with “I’m at…” if you chose not to say anything about that particular check-in. This is not entirely useless for the purposes of Landburd, since such posts still mention the location of the check-in, giving my users a sense of what kind of stuff is nearby, but posts that have more informational content are preferred. So the question is: What percent of tweets posted via Swarm only have the default “I’m at…” text? Here are the raw numbers:
In this sample of data almost half, 49.1%, of the tweets posted via Swarm carry no comment about the check-in. I consider that to be a fairly high number of comment-less posts, but not so high that I’m overly concerned about it. After all, I can always start filtering out tweets that start “I’m at…” in my app later if I decide that they just don’t carry enough relevant information.
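Both the percentage and the possible filter come down to a prefix check on the tweet text. Here is a minimal sketch, assuming the default text always begins with "I'm at" and that the apostrophe may arrive as either a straight or a curly character; the function names are made up for illustration.

```python
# Accept both the straight and the curly apostrophe.
DEFAULT_PREFIXES = ("I'm at", "I\u2019m at")


def is_default_checkin(text):
    """True if the tweet text begins with the default check-in phrase."""
    return text.lstrip().startswith(DEFAULT_PREFIXES)


def default_share(texts):
    """Percentage of texts that begin with the default phrase."""
    texts = list(texts)
    if not texts:
        return 0.0
    return 100.0 * sum(is_default_checkin(t) for t in texts) / len(texts)


def drop_default_checkins(texts):
    """Filter out texts that begin with the default phrase."""
    return [t for t in texts if not is_default_checkin(t)]
```

Note that a prefix check cannot tell a bare default post from one where someone typed their own sentence starting with the same words, so the resulting share is an upper bound on truly comment-less check-ins.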
Thanks for looking at this quick project! I found it really fun to look at an urban environment in terms of the number and kind of tweets produced, and it gave me some data that may prove useful as I continue to tweak Landburd.
You can view the GitHub for this project here.