Tech Post: NYC Holiday Taxi Visualization

ImageWork Technologies
Nov 24, 2014

In our last post, we announced a fun holiday-themed data visualisation using NYC Taxi trip and fare data from 2013. The feedback we’ve received on Reddit and Twitter has been inspiring and insightful.

As promised, we’ve released our code to this GitHub repo. This post describes our approach to building the visualization itself. It was written by Dheeraj Sayala, our front-end lead on this project.

Note: this visualization project was inspired by Chris Whong’s work with NYC Taxi Trip Data. The result of his work was a beautiful visualization called “NYC Taxis: A Day in the Life”.

Getting Directions

The raw data set had 173.2M rows, which we filtered down to the 270,000 rows whose pickup point fell within a certain radius of an airport terminal passenger pickup zone. This was a quick Hadoop 2.4.1 MapReduce job. We also prepared companion data (Terminal->Airline mappings, Terminal Coordinates) to assist with any render-time lookups that might be necessary. All data was uploaded to Google Sheets. Processed data was exported and converted to CSVs using xlsx2csv.

I used bin/extract.py to extract the columns we needed and to find the terminal nearest to each pickup point. The nearest terminal is the one with the minimum distance to the pickup point, with distances calculated using the Haversine formula.
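The nearest-terminal step can be sketched as follows. This is a minimal illustration and not the contents of bin/extract.py: it is written in JavaScript to match the front-end code discussed later, the function names are made up for this example, and the terminal coordinates are rough placeholders.

```javascript
// Mean Earth radius in kilometres.
const EARTH_RADIUS_KM = 6371;

const toRadians = (degrees) => (degrees * Math.PI) / 180;

// Great-circle distance between two {lat, lng} points, in kilometres.
function haversineKm(a, b) {
  const dLat = toRadians(b.lat - a.lat);
  const dLng = toRadians(b.lng - a.lng);
  const h =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(toRadians(a.lat)) * Math.cos(toRadians(b.lat)) * Math.sin(dLng / 2) ** 2;
  return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(h));
}

// Return the terminal closest to the pickup point, with its distance.
function nearestTerminal(pickup, terminals) {
  let best = null;
  for (const terminal of terminals) {
    const distanceKm = haversineKm(pickup, terminal);
    if (best === null || distanceKm < best.distanceKm) {
      best = { terminal, distanceKm };
    }
  }
  return best;
}

// Hypothetical, approximate coordinates, for illustration only.
const terminals = [
  { name: 'JFK (approx.)', lat: 40.6413, lng: -73.7781 },
  { name: 'LGA (approx.)', lat: 40.7769, lng: -73.874 },
];

const match = nearestTerminal({ lat: 40.65, lng: -73.78 }, terminals);
console.log(match.terminal.name, match.distanceKm.toFixed(2) + ' km');
```

With only a handful of terminals, a linear scan per row is cheap enough that no spatial index is needed.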

The next step was getting directions between pickup and drop-off locations using a Directions service. My first attempt used the Google Directions API, but I switched to the OpenMapQuest alternative, which was just as good for our needs and had no daily quota limit. All we needed from the response was the encoded polyline representation of each route. This was also the point at which we trimmed all the unnecessary columns from our data.

Our web server, built using Express on Node.js, runs SQL queries based on user requests and returns a JSON response containing trips.

Map and Animations

Leaflet supports a lot of tile providers. Though there were many free options, we chose Mapbox to be able to customize styles. Hosting our own NYC tile server was an option, but we opted for convenience and Mapbox did the trick.

The map setup is short: create the map, load the tiles and fit the bounds to New York.
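A minimal sketch of such a setup is shown here, not the project's actual code. It uses Leaflet's `L.map`, `L.tileLayer` and `fitBounds`. The function name `setupMap`, the tile URL template and the bounding box values are assumptions made for illustration. The Leaflet namespace is passed in as a parameter so the function does not depend on a browser global.

```javascript
// Approximate bounding box for New York City: [[south, west], [north, east]].
// These values are an assumption, not taken from the project.
const NYC_BOUNDS = [[40.4774, -74.2591], [40.9176, -73.7004]];

// `L` is the Leaflet namespace. `tileUrl` is a tile URL template such as
// 'https://{s}.example.com/{z}/{x}/{y}.png' (a placeholder, not a real endpoint).
function setupMap(L, elementId, tileUrl) {
  const map = L.map(elementId);
  L.tileLayer(tileUrl, {
    maxZoom: 18,
    attribution: 'Map data © OpenStreetMap contributors',
  }).addTo(map);
  map.fitBounds(NYC_BOUNDS);
  return map;
}

// In the browser, with Leaflet loaded and a <div id="map"> on the page:
//   const map = setupMap(window.L, 'map', tileUrl);
```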

Next, we start the animation by projecting polylines onto the map using SVG. D3 works great with Leaflet, as explained here. The decoded latLngs were 4–5 times the size of the encoded polylines, which would have meant a lot of extra bandwidth and latency, so I decided to do the decoding on the client.
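Client-side decoding can be sketched as below. This assumes the routes use the widely used encoded polyline format with five decimal places of precision. The post does not state which precision the Directions service returned, so `precision` is left as a parameter. The function name is made up for this example.

```javascript
// Decode an encoded polyline string into an array of [lat, lng] pairs.
// `precision` is the number of decimal places used by the encoder (assumed 5).
function decodePolyline(encoded, precision = 5) {
  const factor = Math.pow(10, precision);
  const points = [];
  let index = 0;
  let lat = 0;
  let lng = 0;

  // Read one variable-length, zig-zag encoded signed integer.
  function nextDelta() {
    let result = 0;
    let shift = 0;
    let chunk;
    do {
      chunk = encoded.charCodeAt(index++) - 63;
      result |= (chunk & 0x1f) << shift;
      shift += 5;
    } while (chunk >= 0x20);
    return result & 1 ? ~(result >> 1) : result >> 1;
  }

  // Each point is stored as a delta from the previous point.
  while (index < encoded.length) {
    lat += nextDelta();
    lng += nextDelta();
    points.push([lat / factor, lng / factor]);
  }
  return points;
}

// Example string from Google's polyline format documentation:
//   decodePolyline('_p~iF~ps|U_ulLnnqC_mqNvxq`@')
//   -> [[38.5, -120.2], [40.7, -120.95], [43.252, -126.453]]
```

Because every point is a small delta packed into a few ASCII characters, the encoded string stays compact, which is why shipping it over the wire and decoding in the browser is attractive.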

For the visualization to run smoothly, without any breaks, I had to prefetch and cache server responses (elaborated below). A setInterval timer updates the clock on the left side one minute at a time, at a rate determined by the playback speed. For each minute, we add trip lines to the page and animate them using D3's attrTween transition.
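One common way to draw a line progressively with attrTween is to tween the path's `stroke-dasharray`. The post does not say which attribute was tweened, so the following is a sketch of that general technique under that assumption, with a made-up helper name.

```javascript
// Build the interpolator an attrTween on "stroke-dasharray" could return.
// At progress t in [0, 1], the first t * totalLength units of the path are
// drawn as a dash and the remainder is left as a gap.
function dashTween(totalLength) {
  return function (t) {
    return totalLength * t + ',' + totalLength;
  };
}

// In the browser this might be wired up roughly as follows, where `path` is
// a D3 selection of an SVG <path> and `durationMs` is chosen by the caller:
//   path.transition()
//     .duration(durationMs)
//     .attrTween('stroke-dasharray', function () {
//       return dashTween(this.getTotalLength());
//     });

const tween = dashTween(200);
console.log(tween(0), tween(0.5), tween(1));
```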

Graph

I followed this great D3 tutorial on bar charts. Using the SQL query above, I got the daily counts and formatted them as JSON. Adding ticks (to show important dates) and a tooltip was pretty straightforward.

Optimisation

Caching API responses

Waiting for server responses while animating was not an option. I used jQuery’s Ajax promises to query and queue data for processing, so that while one batch is running, the next four have already been cached in the background.

To know when to prefetch the next chunks, I added a boolean flag to the trip that falls exactly in the middle of each batch.
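The idea can be sketched as follows. This is an illustration and not the project's code: it uses native Promises in place of jQuery's Ajax promises, and the names `markPrefetchTrigger`, `triggersPrefetch`, `createPrefetcher` and `fetchBatch` are hypothetical.

```javascript
// Flag the middle trip of a batch so the player knows when to prefetch.
function markPrefetchTrigger(trips) {
  const middle = Math.floor(trips.length / 2);
  return trips.map((trip, i) =>
    Object.assign({}, trip, { triggersPrefetch: i === middle })
  );
}

// Keep up to `depth` batch requests in flight or cached ahead of playback.
// `fetchBatch(n)` must return a promise for batch number n.
function createPrefetcher(fetchBatch, depth = 4) {
  const queue = [];
  let nextBatch = 0;

  // Top the queue up; requests start immediately and resolve in the background.
  function fill() {
    while (queue.length < depth) {
      queue.push(fetchBatch(nextBatch++));
    }
  }

  return {
    fill,
    // Hand out the oldest queued promise (usually already resolved).
    next() {
      fill();
      return queue.shift();
    },
    pending: () => queue.length,
  };
}

// Usage: call prefetcher.fill() when the animation reaches a trip whose
// triggersPrefetch flag is true, and prefetcher.next() when a batch ends.
```

Queuing the promises rather than the resolved data keeps the ordering simple: the player always awaits the head of the queue, whether or not that response has arrived yet.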

setTimeout vs setInterval

This was interesting. It was difficult to keep track of time with thousands of timers created using setTimeout. The CPU was revving up, and there was no clean way to clear all those timers when needed. D3's transition.delay had similar problems. The animations also drifted out of sync with the Moment.js clock, so I had to force-update it, which made the animation jerky.

Having a single, central setInterval timer was much better. It adds a minute to the clock and triggers the animation for any trips starting in that minute (the pickup times were accurate only to the minute anyway).
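A rough sketch of the single-timer approach follows. The field name `pickupMinute` and the callbacks `onTrips` and `onTick` are assumptions for illustration. The tick logic is kept separate from setInterval so it can be stepped manually.

```javascript
// Group trips by pickup minute so each tick is a single lookup.
// `pickupMinute` is a hypothetical field: minutes since the start of playback.
function bucketByMinute(trips) {
  const buckets = new Map();
  for (const trip of trips) {
    if (!buckets.has(trip.pickupMinute)) buckets.set(trip.pickupMinute, []);
    buckets.get(trip.pickupMinute).push(trip);
  }
  return buckets;
}

// One clock for the whole page: each tick animates the trips starting in
// the current minute, updates the display, then advances by one minute.
function createClock(buckets, onTrips, onTick) {
  let minute = 0;
  return {
    tick() {
      const trips = buckets.get(minute) || [];
      if (trips.length > 0) onTrips(trips, minute);
      onTick(minute);
      minute += 1;
    },
  };
}

// Usage: a single timer drives everything and is trivial to stop.
//   const clock = createClock(bucketByMinute(trips), animateTrips, updateClockLabel);
//   const timer = setInterval(clock.tick, 1000 / speed);
//   clearInterval(timer);
```

Pausing or changing speed then means clearing and recreating one timer instead of tracking thousands of pending setTimeout handles.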

Ignoring accurate map bounds

A better knowledge of SVG would have saved me a lot of hours. I found from Chrome’s profiling tool that calculating the exact bounds for a group of lines takes as much work as projecting all those lines. I did everything from calculating bounds manually on the server, to stupidly looking for ways to do this faster. In the end, I realized that it wouldn’t matter as long as we provide good enough bounds for drawing.

Skipping some points on the Polyline

Encoded polylines produced a long list of latLngs on the path. Picking every other latLng made the line projection faster. With so many lines being painted every second, this was an effective optimization.
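The thinning step can be sketched like this. The function name is made up for this example, and always keeping the final point (so the line still ends at the drop-off location) is an assumption rather than something the post states.

```javascript
// Keep every `step`-th point of a route. The final point is always kept so
// the drawn line still ends where the route ends.
function thinPoints(points, step = 2) {
  if (points.length <= 2) return points.slice();
  const kept = points.filter((_, i) => i % step === 0);
  if ((points.length - 1) % step !== 0) {
    kept.push(points[points.length - 1]);
  }
  return kept;
}

// Hypothetical route of six [lat, lng] points, for illustration only.
const route = [
  [40.64, -73.78], [40.65, -73.79], [40.66, -73.8],
  [40.67, -73.81], [40.68, -73.82], [40.69, -73.83],
];
console.log(thinPoints(route).length); // 4 points: indices 0, 2, 4 and the last
```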

Other stuff

Good old jQuery, Bootstrap and Underscore play their roles. Moment.js is an excellent library for datetime manipulation. I found node-sql pretty handy for creating simple SQL queries. Forever and Nginx keep the backend running.

Overall, it was great fun learning and working on a Data Visualisation project. With data available like never before, there are hundreds of datasets waiting to be processed, visualised and made sense of.
