NY’s Top Ten Taxi Drivers

Hannes Mühleisen
Sep 1, 2014


New York’s iconic taxi cabs have arrived in the modern age, and much of their activity is now routinely logged. Through a clever Freedom of Information request, some of this data has now become available online. The data contains about 170 million entries, each for a single cab ride in 2013. We get the driver and car registration (encoded, more on this later), trip duration and distance, and – importantly – the time and place where the passengers were picked up and dropped off, respectively.

Picture by Flickr user bobjagendorf

We wanted to know how cab rides are influenced by traffic. Trip distance and duration are included in the data, but they are not enough on their own. Two trips of about the same distance can have completely different durations even in similar traffic, because of turn restrictions, general road availability, differing road types and so on. We therefore had to find out how long a trip between two points would take under ideal circumstances. People who want to know how long it takes to get from A to B normally ask a TomTom device or Google Maps. However, neither can realistically be used for millions of trips.

Fortunately, the OpenTripPlanner (OTP) project has developed a free and Open Source trip planner that works with OpenStreetMap data, which is also freely available. OTP can take arbitrary coordinates and create a car route between them. We plugged OTP into a small program (available on GitHub) and fed one million of the trip records into it on one of the beefy servers of our SciLens cluster. For each of these trips, we now have the actual travel time between the two points and the time the trip planner says it should take. To be able to compare those trips with each other, we calculated the delay, i.e. how much longer the actual trip took than the planned one, in minutes per kilometer (take that, imperial unit barbarians). This allows us to plot the delay, for example by time of day:

New York Taxi Delays by Hour

Clearly, the famous phrase by Frank Sinatra, “I want to wake up, In that city that doesn’t sleep” is not that accurate. Judging from its traffic, the city sleeps between eight in the evening and six in the morning. Perhaps New Yorkers are even slacking off on weekends? Let’s plot the data by weekday:

New York Taxi Delays by Weekday and Hour

We can see how Saturday night is much busier than the other nights, and that Sunday already starts to resemble a weekday. The other weekdays are very similar.
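The delay metric and the grouping behind both plots can be sketched in a few lines of Python. This is a minimal sketch, not the program on GitHub: the record layout, the function names and the three sample trips are made up for illustration, and the delay is assumed to be the difference between actual and planned travel time divided by the trip distance.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def delay_min_per_km(actual_s, planned_s, distance_km):
    """Delay relative to the planner's estimate, in minutes per kilometer."""
    return (actual_s - planned_s) / 60.0 / distance_km


def mean_delay_by(trips, key):
    """Average the per-trip delay within each group given by key(pickup)."""
    groups = defaultdict(list)
    for pickup, actual_s, planned_s, distance_km in trips:
        if distance_km <= 0:
            continue  # skip degenerate records without a usable distance
        groups[key(pickup)].append(
            delay_min_per_km(actual_s, planned_s, distance_km)
        )
    return {k: mean(v) for k, v in sorted(groups.items())}


# Hypothetical records: (pickup time, actual seconds, planned seconds, km)
trips = [
    (datetime(2013, 3, 4, 8, 15), 900, 600, 5.0),  # a Monday morning
    (datetime(2013, 3, 4, 8, 40), 720, 600, 4.0),
    (datetime(2013, 3, 9, 23, 5), 540, 600, 2.0),  # a Saturday night
]

# Group by hour of day, then by (weekday, hour) with Monday == 0.
by_hour = mean_delay_by(trips, key=lambda t: t.hour)
by_weekday_hour = mean_delay_by(trips, key=lambda t: (t.weekday(), t.hour))
```

A negative value means the cab was faster than the planner predicted, which is what makes the driver ranking further down possible.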

As mentioned, the license of the cab driver is also included in the data, albeit in encoded form. The people who published the data simply applied MD5, a widely used cryptographic hash function that produces a 128-bit (16-byte) hash value, typically written as a 32-digit hexadecimal number. It is supposed to be infeasible to compute the text that was hashed from the hash value. For example, the hash value of the text in the previous sentence is “40079160df469c618bca69665a73ec72”. However, as other researchers have pointed out, the license numbers that were hashed are highly regular. In particular, they are mostly seven-digit numbers that start with a “5”. That leaves so few candidates that we can simply calculate the hashes of all of them and look up the actual license numbers of the cab drivers. In addition, New York’s Taxi & Limousine Commission publishes a list of licensed cab drivers with the drivers’ full names. We can now find the cab drivers that have the lowest delay rate. As promised in the title, these are obviously New York’s best (fastest) cab drivers:

So congratulations to Amer Mashni, who actually managed to beat the route planner’s prediction by 0.67 minutes per kilometer on average. That’s it for now. I hope you enjoyed this; if you would like to do more, you can just download my entire dataset.
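For the curious, the license lookup described above can be sketched with Python’s standard `hashlib` module. This is only an illustration under stated assumptions: the helper names are made up, the hashed input is assumed to be the plain decimal license number, and only a small slice of the seven-digit range starting with 5 is enumerated to keep it fast.

```python
import hashlib


def md5_hex(text):
    """Return the MD5 digest of an ASCII string as 32 hex characters."""
    return hashlib.md5(text.encode("ascii")).hexdigest()


def build_lookup(start, stop):
    """Map md5(license) -> license for every number in [start, stop)."""
    return {md5_hex(str(n)): str(n) for n in range(start, stop)}


# Assumption: licenses are seven-digit numbers starting with 5, so the
# full candidate range would be 5000000..5999999. A small slice is
# enough to show the idea.
lookup = build_lookup(5_000_000, 5_001_000)

# Stand-in for a hash taken from the published data (hypothetical value).
hashed = md5_hex("5000123")

# Lower-case the hash first in case the data uses upper-case hex digits.
recovered = lookup.get(hashed.lower())
```

Because the candidate space is only about a million values, the full table is small enough to build on a laptop, which is why hashing alone did not protect the license numbers.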

We would like to acknowledge the support of the COMMIT/ project for this work.
