Detecting the Pope Using Machine Learning and TLC Data

Geoff P
Sep 28, 2017 · 5 min read

I recently made a trip down to Bogotá, Colombia to participate in Bloomberg’s Data For Good Exchange Immersion program — Xavier Gonzalez and I were sent down there to assist Bogotá’s Veeduría Distrital (Office of Anti-corruption and Oversight) in building a data dashboard to allow city leaders to better understand and address citizen complaints. The program was an incredible experience — but what was also incredible was the fact that Pope Francis blessed us (and the dashboard) with a visit to Colombia right in the middle of our immersion.

Ready to see the pope in Cartagena

This reminded me of the data science work I did a few months ago with Ben Miller and Chris Streich — we were able to build a “Pope Detector” through machine learning and urban open data. More specifically, we built a set of algorithms that predicted a visit from Pope Francis to New York City by looking at NYC Taxi & Limousine Commission data. We used a number of machine learning anomaly detection techniques to pick out outlier days where yellow taxi pickups and drop-offs were way out of whack — we found that one of the biggest outlier days was when Pope Francis visited St. Patrick’s Cathedral on September 24th, 2015.

To build the “Pope Detector”, we first needed to download TLC data for 2015:

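# loop through and download TLC data for each month of 2015
# NOTE: the download URL was lost from this copy of the post; TLC_URL_PREFIX is a
# placeholder for everything in the URL that comes before "-$num.csv"
mkdir -p ./data
for num in $(seq -w 01 12); do
  echo "$num"
  curl -o "./data/tlc_yellow_2015_$num.csv" "${TLC_URL_PREFIX}-$num.csv"
done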

Next, after a bit of date conversion and data cleaning, we subset the data to a few blocks around St. Patrick’s Cathedral (lat: 40.758477, lon: -73.976223). After the subset, we then aggregated the data by day, summing the number of trips and averaging the total cost for each taxi trip.
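The subset-and-aggregate step can be sketched roughly as follows. This is a minimal illustration, not the code from the repository: the column names follow the 2015 yellow taxi CSV layout as I understand it, the `BOX` half-width is an assumed stand-in for "a few blocks", and the sample rows are made up.

```python
import pandas as pd

# Coordinates of St. Patrick's Cathedral, as given in the post.
CATHEDRAL_LAT, CATHEDRAL_LON = 40.758477, -73.976223
# Assumption: half-width of the bounding box in degrees (roughly a few blocks).
BOX = 0.003


def daily_summary(trips: pd.DataFrame) -> pd.DataFrame:
    """Keep pickups near the cathedral, then aggregate them by calendar day."""
    near = trips[
        trips["pickup_latitude"].between(CATHEDRAL_LAT - BOX, CATHEDRAL_LAT + BOX)
        & trips["pickup_longitude"].between(CATHEDRAL_LON - BOX, CATHEDRAL_LON + BOX)
    ].copy()
    near["day"] = pd.to_datetime(near["tpep_pickup_datetime"]).dt.date
    return near.groupby("day").agg(
        n_trips=("total_amount", "count"),   # number of trips that day
        avg_total=("total_amount", "mean"),  # average total cost per trip
    )


# Tiny made-up sample, only to show the shape of the result.
sample = pd.DataFrame({
    "tpep_pickup_datetime": ["2015-09-24 08:00:00", "2015-09-24 09:30:00",
                             "2015-09-25 10:00:00", "2015-09-25 11:00:00"],
    "pickup_latitude": [40.7585, 40.7590, 40.7584, 40.7000],
    "pickup_longitude": [-73.9762, -73.9770, -73.9760, -73.9762],
    "total_amount": [10.0, 20.0, 12.5, 99.0],
})
daily = daily_summary(sample)
print(daily)
```

The last sample row sits well south of the cathedral, so it is dropped before the daily aggregation.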

In the plots above, we can see peaks and troughs that reflect day-of-the-week variation over time. There are also some outlier spikes in the data. Our next task is to find exactly when those outliers occur, how extreme each one is, and why those dates stand out.

We chose to first perform a k-means outlier detection analysis, using only the number of taxi pickups and the average total amount for a taxi trip, aggregated by day, as the feature set. Looking at the data, there seem to be two groups: days with a lot of trips and a higher average cost per trip, and days with fewer trips and a lower average cost per trip. The silhouette score supports this two-cluster split.
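As a rough sketch of how a silhouette score can back up the choice of two clusters, the snippet below uses scikit-learn on synthetic data standing in for the daily table. The numbers are invented, and standardizing the two features first is my assumption, not something the post states.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the daily table: columns are (pickups, average total amount).
rng = np.random.default_rng(0)
busy = rng.normal(loc=[3000, 16.0], scale=[150, 0.5], size=(200, 2))
quiet = rng.normal(loc=[1800, 14.0], scale=[150, 0.5], size=(100, 2))

# Assumption: put both features on a comparable scale before clustering.
X = StandardScaler().fit_transform(np.vstack([busy, quiet]))

# Compare a few candidate cluster counts by silhouette score (higher is better).
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(scores, best_k)
```

Without scaling, the pickup counts (in the thousands) would dominate the distance calculation over the average cost (in the tens of dollars), which is why the scaling step is included here.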

K-means clustering results, with the top 10 outliers in red

In the plot above, you can see the two groups colored in light and dark blue, with the top 10 outlier points colored in red. Labeled are the days I find to be the most interesting. A snowstorm on January 27th ranks as the third highest outlier, as measured by distance from the closest K-means cluster center. September 24th, the day of Pope Francis’s visit to St. Patrick’s, ranks as the second highest outlier. Finally, May 31st, a rainy Sunday, ranks as the most distant outlier. I thought it would be interesting to also include the most normal day, as in, the data point / day closest to a cluster center. That happened to be April 17th, so I included it to give us all a good point of reference to what a typical taxi day around St. Patrick’s Cathedral looks like.
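Ranking days by distance from the closest cluster center might look something like this. It is a sketch on synthetic, already-scaled features with one planted unusual day, and the variable names are hypothetical rather than taken from the repository.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic daily features (already scaled), plus one planted unusual day at index 0.
rng = np.random.default_rng(1)
normal_days = np.vstack([
    rng.normal([1.0, 1.0], 0.2, size=(60, 2)),
    rng.normal([-1.0, -1.0], 0.2, size=(40, 2)),
])
X = np.vstack([[[4.0, -3.0]], normal_days])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# transform() gives the distance from every point to every cluster center;
# the row minimum is the distance to the closest center.
dist = km.transform(X).min(axis=1)

order = np.argsort(dist)
most_typical = order[0]          # closest to a center, the "most normal" day
top_outliers = order[::-1][:10]  # ten most distant days, farthest first
print(top_outliers, most_typical)
```

In the real analysis, the row indices would map back to calendar dates, which is how specific days such as September 24th can be read off the ranking.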

From L to R — TLC pickup map for April 17th, one of the most “normal” days — pickup map for September 24th, “Pope Day” — pickup map for May 31st, “Rainy Sunday”

The maps above (from L to R) show taxi pickups for April 17th (the most “normal” day), September 24th (“Pope Day”), and May 31st (a rainy Sunday). You can see right away how September 24th differs from the rest: the area around St. Patrick’s Cathedral (between 50th and 51st Streets, 5th Avenue and Madison) has no pickups, presumably due to a police cordon.

We also ran an isolation forest anomaly detection algorithm to test the robustness of our model and results. May 31st and January 27th still topped the list, with the “Pope Day” of September 24th falling to a not-too-distant fifth place among the outliers.
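For comparison, an isolation forest ranking could be sketched as below using scikit-learn's `IsolationForest`. The data is synthetic with one planted unusual day, and the hyperparameters are illustrative guesses, not the settings used in the original analysis.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic daily features with one planted unusual day at index 0.
rng = np.random.default_rng(2)
X = np.vstack([
    [[4.0, -3.0]],
    rng.normal([1.0, 1.0], 0.2, size=(60, 2)),
    rng.normal([-1.0, -1.0], 0.2, size=(40, 2)),
])

forest = IsolationForest(n_estimators=200, random_state=0).fit(X)

# score_samples() returns lower values for more anomalous points.
scores = forest.score_samples(X)
ranking = np.argsort(scores)  # most anomalous day first
print(ranking[:10])
```

Because an isolation forest scores points by how easily random splits isolate them rather than by distance to a cluster center, it can order the same outliers differently from k-means, which is consistent with the shuffle in rankings described above.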

So what does this all mean? Making “predictions” about events that happened over two years ago might not seem like the most powerful thing, but the fact that we were able to pick this out programmatically just by looking at aggregated TLC data is, in my opinion, extremely interesting. Much like astronomers scanning the sky for anomalous light sources, scanning the city for anomalous taxi patterns could prove to be very powerful. Taxis and traffic patterns are the lifeblood of New York City, and in addition to event detection, one could use them as a proxy to measure the movement of citizens and the vitality of different areas of the city.

As always, feedback is more than welcome, and all code is posted on the github repository.

Written by

Geoff P

urban data scientist - DS @ Ford Smart Mobility formerly: bloomberg fellow @ detroit land bank authority /// NYU's CUSP
