Where do people go in NYC? — The recipe of an analysis

Alexina Coullandreau
13 min read · Aug 15, 2019


The life of this article started two or three weeks ago, when I decided to explore the HUGE dataset of the TLC Trip Record Data of the NYC yellow taxi rides.
Since that day, I have asked myself multiple times: why am I doing this again? But I always found a new argument showing how amazing this dataset is. Proof is, you can find countless analyses of it, from so many angles, over so many years, and with so many ways of representing the results.

I graduated in urban planning, so, spontaneously, when you tell me ‘Hey, I have over a billion data points from the taxi rides in NYC, want to play?’ — I naturally strike crazy poses to show you how fun this sounds and how excited I am to start.

So here we are, on the TLC Trip Record Data page (https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page). About 10 years of data available, with files divided per month and per taxi type (yellow, green, FHV). I try to download one month — ouch, 700 MB. Yup, this is going to be fun!

For the rest of this article, be aware that I used the following environment:

  1. Jupyter notebook 6.0.0
  2. Python 3.6
  3. MariaDB (5.7.18)
  4. Several libraries:
  • numpy 1.16.4
  • pandas 0.25.0
  • shapefile (pyshp) 2.1.0
  • pyproj 1.9.6
  • matplotlib 3.1.0
  • OpenCV 4.1.0
  • mysql-connector-python 8.0.16

Also, the code is available on my GitHub profile (https://github.com/acoullandreau/nyc_taxi_trips), in two Jupyter notebooks (one for each approach).

Step 1 — Let’s see what the data looks like

The dictionary of the dataset is available on the TLC Trip Record Data page (https://www1.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf).

Besides, a lookup table (https://s3.amazonaws.com/nyc-tlc/misc/taxi+_zone_lookup.csv) is provided to match zone id with zone name and associated borough name (this file will be very useful!).

To get familiar with the data, let’s load one dataset into the Jupyter notebook, for example the one from February 2018.

Shape, head and description of Feb 2018 taxi rides DataFrame

Quite a lot of trips just in February already! Let’s look at the missing data:

Missing data from Feb 2018 DataFrame
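For reference, these summaries can be obtained with a few lines of pandas (a minimal sketch, assuming the February 2018 CSV was downloaded locally under its usual file name):

```python
import pandas as pd

# Assumption: the file was downloaded from the TLC page to the working directory
df = pd.read_csv('yellow_tripdata_2018-02.csv')

print(df.shape)       # number of rows and columns
df.head()             # first few trips
df.describe()         # summary statistics of the numerical columns

# Count the missing values per column
df.isnull().sum()
```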

Pretty clean dataset, this is great! So here is the first set of observations we can make:

  • For the month of February 2018, we have more than 8 million data points, and no missing data at all
  • There seem to be three categorical columns [‘RatecodeID’, ‘store_and_fwd_flag’, ‘payment_type’]
  • Among those columns, we cannot make much sense of store_and_fwd_flag, so we decide to drop it for further analysis
  • The two timestamp columns, [‘tpep_pickup_datetime’, ‘tpep_dropoff_datetime’], typed ‘object’, should be converted to new columns to split the date from the time
  • The total_amount column sums up the values of all the price columns (fare_amount, extra, mta_tax, tip_amount, tolls_amount, improvement_surcharge) — we may then decide to keep only the total_amount column as a meaningful one, and let go of the others
  • As cash tips are not recorded, we will consider the amount of tips paid in cash negligible, to simplify the analysis of the paid amount
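For instance, the timestamp conversion and the column dropping could look like this (a sketch, assuming df is the DataFrame loaded above):

```python
import pandas as pd

# Convert the timestamp columns and split the date from the time
df['tpep_pickup_datetime'] = pd.to_datetime(df['tpep_pickup_datetime'])
df['tpep_dropoff_datetime'] = pd.to_datetime(df['tpep_dropoff_datetime'])
df['pickup_date'] = df['tpep_pickup_datetime'].dt.date
df['pickup_time'] = df['tpep_pickup_datetime'].dt.time

# Keep only total_amount and drop the individual price columns,
# along with the flag column we could not make sense of
df = df.drop(columns=['fare_amount', 'extra', 'mta_tax', 'tip_amount',
                      'tolls_amount', 'improvement_surcharge',
                      'store_and_fwd_flag'])
```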

When taking a look at January 2018, we reach the same conclusions as for February 2018.

However, the structure is slightly different when we look further back in time. It seems that up to 2016, the dataset contained pick-up and drop-off latitude and longitude instead of the pick-up and drop-off location IDs.

Head of Feb 2013 and Feb 2016 DataFrame with different header

We therefore decide to keep it simple by sticking to 2017 or 2018, and using only the zone ids rather than the exact pick-up and drop-off locations.

Step 2 — What value can we give to this data?

OK. So now that we know what the data looks like, what sense can we make of it?

As mentioned in the introduction, I tried to make sense of it from an urban planner’s point of view. The few questions I could come up with were therefore the following:

  • Can we see trends in the flow of passengers in 2018?
  • Is there a difference on holidays, or on the hottest and coldest days of the year?
  • Is there a difference between weekdays and weekends?
  • Depending on the zone we look at, where are people most likely to come from? To go to? Is it different between weekdays and weekends?

Although at first glance the data seemed pretty clean (meaning, no NaN values), we still need to make sure we don’t have any unexpected values.
To do that, I built a few functions to:

  • clean missing values (NaN values if any, and rows that do not have a valid zone id (Unknown ID))
  • process the timestamp columns [‘tpep_pickup_datetime’, ‘tpep_dropoff_datetime’]
  • properly handle negative trip durations, trip distances, and money-related features
  • properly handle trips with no passengers at all
  • encode categorical variables if necessary
  • drop the features that we do not need for our analysis
  • merge the PU and DO location borough and zone names (this is optional)

Note that the first function, which removes missing values, will drop all rows that do not have a valid zone id (Unknown ID). From the lookup table, we see that zone ids 264 and 265 are associated with an unknown zone. Although we are ‘losing’ points, it doesn’t make sense for our analysis to keep these rows.

Note also that we handle separately the negative values of a few columns, as well as trips with no passengers at all. I decided to keep the rows with no passengers in a separate dataframe, which may or may not be relevant, as this value is entered by the taxi driver. Meaning, if it is set to 0, chances are he/she simply forgot to set it.
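To give an idea of what these cleaning steps look like in practice, here is a simplified sketch (not the exact functions from the notebooks):

```python
import pandas as pd

def clean_trips(df):
    """Simplified version of the cleaning steps described above."""
    # Drop the rows pointing to the two 'Unknown' zones of the lookup table
    df = df[~df['PULocationID'].isin([264, 265])]
    df = df[~df['DOLocationID'].isin([264, 265])]

    # Drop negative money-related values and non-positive distances
    df = df[(df['total_amount'] >= 0) & (df['trip_distance'] > 0)]

    # Drop trips with a non-positive duration
    duration = (pd.to_datetime(df['tpep_dropoff_datetime'])
                - pd.to_datetime(df['tpep_pickup_datetime']))
    df = df[duration.dt.total_seconds() > 0]

    # Keep the trips without passengers aside, in a separate dataframe
    no_passenger = df[df['passenger_count'] == 0]
    df = df[df['passenger_count'] > 0]

    return df, no_passenger
```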

Finally, later on during this project, I realised that the lookup file provided by the TLC contained two lines with the exact same combination of Borough/Zone names but with different ids (56 and 57). In order to distinguish the two, and based on their positions on the map, I edited the lookup file and changed the names to:
id 56 — Corona (South)
id 57 — Corona (East)
I therefore used a modified copy of the lookup file linked above.

Step 3 — Can we use matplotlib?

(SPOILER ALERT: I will not stick to this approach….)

If you got to this point, it means we know what questions we want to answer, and we have a clean dataset. Great! So let’s start plotting some visualizations!
The approach was to use matplotlib to plot the *link* between two zones, using the number of passengers per link as the weight determining the transparency of the link.

Basically, I wrote a few functions that would plot a base map from the shapefile of the city (also provided by the TLC: https://s3.amazonaws.com/nyc-tlc/misc/taxi_zones.zip), filter and format the dataframe by grouping the passenger count per origin and destination, and plot the links with their respective weights on the map.
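A simplified sketch of the idea (approximating each zone by the centre of its bounding box, skipping the drawing of the base map itself, and assuming df is the cleaned dataframe from step 2):

```python
import shapefile               # pyshp
import matplotlib.pyplot as plt

# Approximate each zone by the centre of its bounding box
# ('LocationID' is the field name used in the TLC shapefile)
sf = shapefile.Reader('taxi_zones/taxi_zones')
centroids = {}
for shape_rec in sf.shapeRecords():
    x_min, y_min, x_max, y_max = shape_rec.shape.bbox
    centroids[shape_rec.record['LocationID']] = ((x_min + x_max) / 2,
                                                 (y_min + y_max) / 2)

# Count the passengers per (origin, destination) pair
links = df.groupby(['PULocationID', 'DOLocationID'])['passenger_count'].sum()
max_count = links.max()

fig, ax = plt.subplots(figsize=(12, 12))
for (pu, do), count in links.items():
    if pu in centroids and do in centroids:
        (x1, y1), (x2, y2) = centroids[pu], centroids[do]
        # Transparency proportional to the number of passengers on the link
        ax.plot([x1, x2], [y1, y2], color='steelblue',
                alpha=min(1.0, count / max_count), linewidth=0.5)
ax.set_axis_off()
plt.show()
```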

Which brought me to something like…. this:

First plotting attempt of the whole city

Or that:

First plotting attempt of Manhattan

I even tried to add points to display the count either at the origin or at the destination:

First plotting attempt of Manhattan with weight of the number of passengers at the origin
First plotting attempt of Manhattan with weight of the number of passengers at the destination

Note that, knowing I wouldn’t stick to this approach, I did not spend too much time making these visualizations look good!

So at this point, what can we conclude?

  • you cannot see it here, but I can tell you that it takes way too much time to load the dataframe (only one month!) and to render a single map from the filtered dataframe
  • visualising a flow with a static representation doesn’t make as much sense as using an animation

So basically, I need to try something else….

Step 4 — Let’s start over with a database and OpenCV!

And this new approach relies on two main players: a database, and a computer vision library (OpenCV). Now we are talking!

Loading all the data of 2018 was actually quite a challenge, as I did not optimize my queries prior to loading it (that’s experience talking….). So I ended up rewriting the database multiple times. And yes…. I had about 100 million rows at the beginning of the operation, so every little change took between 20 and 50 minutes to execute. Don’t do that, unless you wish a short life upon your hard drive, for example.
In summary, here are the queries that I should have executed to create the table and load the data.

Create a table MySQL query (example of 2017)
Load the data and clean up the table MySQL queries (example of 2017)
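As those queries were shared as screenshots, here is an illustrative version of the same idea, executed through mysql-connector-python (table layout, file name and credentials are placeholders, not the exact ones from the repository):

```python
import mysql.connector

# Illustrative sketch: names and credentials are placeholders
conn = mysql.connector.connect(user='user', password='password',
                               database='nyc_taxi',
                               allow_local_infile=True)
cursor = conn.cursor()

cursor.execute("""
    CREATE TABLE IF NOT EXISTS trips_2017 (
        tpep_pickup_datetime DATETIME,
        tpep_dropoff_datetime DATETIME,
        passenger_count INT,
        trip_distance FLOAT,
        PULocationID INT,
        DOLocationID INT,
        total_amount FLOAT
    )
""")

# Assumes a CSV reduced to exactly the columns of the table (header included),
# and a server configured to allow LOAD DATA LOCAL INFILE
cursor.execute("""
    LOAD DATA LOCAL INFILE 'trips_2017_clean.csv'
    INTO TABLE trips_2017
    FIELDS TERMINATED BY ','
    IGNORE 1 LINES
""")

# Example of a clean-up query: drop the rows pointing to the 'Unknown' zones
cursor.execute("""
    DELETE FROM trips_2017
    WHERE PULocationID IN (264, 265) OR DOLocationID IN (264, 265)
""")

conn.commit()
cursor.close()
conn.close()
```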

Note that all the cleaning steps that were detailed in the first approach are also used here (of course….).

Phew! Now it is just a matter of writing a script that will render either the animation or the choropleth map.

To render the animations, we want the script to:

  1. process the shapefile
  2. render a base map
  3. prepare the query
  4. connect to the database and execute the query
  5. process the query results
  6. render each point by interpolating its position
  7. render each frame
  8. render the animation

And for the map:

  1. Do the same until step 5
  2. render two maps per zone (NYC and borough-focused map) looking at incoming flow
  3. render two maps per zone (NYC and borough-focused map) looking at outgoing flow

Note that for the choropleth maps, a few of the functions used for the animations were reused entirely. Still, some specific functions were needed and created especially for the rendering of these maps.
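To give a flavour of steps 6 to 8 of the animation script, here is a minimal sketch of how one dot can be interpolated between two points and written frame by frame to a video with OpenCV (coordinates, colors and durations are made up for the example):

```python
import cv2
import numpy as np

width, height, fps, n_frames = 1920, 1080, 30, 60
origin, destination = (400, 600), (1500, 300)   # pixel coordinates (made up)

writer = cv2.VideoWriter('animation.mp4',
                         cv2.VideoWriter_fourcc(*'mp4v'), fps,
                         (width, height))

for i in range(n_frames):
    frame = np.zeros((height, width, 3), dtype=np.uint8)   # black background
    t = i / (n_frames - 1)
    # Linear interpolation of the dot position between origin and destination
    x = int(origin[0] + t * (destination[0] - origin[0]))
    y = int(origin[1] + t * (destination[1] - origin[1]))
    cv2.circle(frame, (x, y), 6, (84, 201, 253), -1)       # filled dot
    writer.write(frame)

writer.release()
```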

Since the point of this blog post is not to document my code, I invite you to look at the documentation in the GitHub repository.

Code logic graph to render animations
Examples from the documentation of the functions available on GitHub

However, I want to highlight a few decisions that I took.

Code structure choices

Two comments here:

  • I like it when code is flexible, and I tend to want to pass pretty much everything as a parameter — so I used a lot of dictionaries as input objects for my functions
  • I like it when code is reusable — so I used a lot of functions

But although I tried my best to meet these two requirements, I also hard-coded some attributes in several functions, such as:

  • the special dates calendar for 2018 (Christmas, National Day, hottest and coldest day, ….)
  • the colours to render
  • the positions of the displayed text (legend, titles, …)
  • the scaling of the points
  • the number of frames per second to render

Besides, as mentioned before, we use the pick-up date as the reference date to assign the flow of passengers to a travel date.

Rendering choices for the animations

Regarding the colour code used:

  • I chose a black background, so that the map stands out and the contrast is more visible
  • I picked the viridis color palette, recommended for its smooth transitions, which applies especially well to choropleth maps; I also used two of its colors to represent the dots in the animations

Regarding the video parameters:

  • I chose a rather high resolution (1920x1080) to allow the image to be of good quality (the more detail the better, without exaggeration)
  • I chose to render at 30 fps, to give time to see the animation at normal speed. But I could have gone for 60 fps, to be able to slow the footage down with video editing afterwards

Regarding the plot itself:

  • I chose to normalize the weight of each point based on the maximum number of passengers over the whole analysed period, which means that from one day to the next, the points have a size varying between the minimum and maximum passenger counts of the whole period. As a result, on days with low traffic, the contrast in the size of the points may not be very obvious.
  • What is represented is actually the flow of people from one zone to another, interpolated to make the point move between its origin and its destination. I.e. not an itinerary, and not the time-accurate position of people. Just an animation of the flow of people between one origin and one destination, averaged or counted per day.
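In other words, the size of a dot on a given day can be thought of as something like this (a sketch; the parameter names are mine):

```python
def dot_radius(count, max_count, min_radius=2, max_radius=12):
    # max_count is the maximum daily passenger count over the whole period
    return int(min_radius + (count / max_count) * (max_radius - min_radius))
```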

Rendering choices for the choropleth maps

Regarding the colour code used:

  • To be consistent with the animation choices, I chose a black background, so that the map stands out and the contrast is more visible
  • However, I used another color palette, where darker (closer to the background color) means fewer people traveling, and lighter means more people traveling. To plot the difference between weekdays and weekends, we use two different tones for positive and negative values, but the logic is the same.

Regarding the plot itself:

  • I use a color scale that spans from 0 to the max value, and normalize the weight using this scale. It can happen that, if the min value is closer to the max value than to 0, the contrast between the plotted colors is not evident.
  • One map is dedicated to one zone, highlighted with a thicker yellow outline.

Besides, I decided to create an extra table with preprocessed data, in order to speed up the queries used to render the maps.
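An illustrative version of such a preprocessing query (table names are mine): aggregate the passenger counts per day and per origin/destination pair once, so that rendering the maps only reads a much smaller summary table.

```python
import mysql.connector

conn = mysql.connector.connect(user='user', password='password',
                               database='nyc_taxi')
cursor = conn.cursor()

# Aggregate once, render many times
cursor.execute("""
    CREATE TABLE daily_flows AS
    SELECT DATE(tpep_pickup_datetime) AS travel_date,
           PULocationID, DOLocationID,
           SUM(passenger_count) AS passengers
    FROM trips_2018
    GROUP BY travel_date, PULocationID, DOLocationID
""")

conn.commit()
cursor.close()
conn.close()
```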

Libraries choices

The same general comments apply here regarding the choice of libraries.

I chose to use OpenCV as I was dealing with rendering images and videos. Although it makes rendering an image or a video almost trivial, there are two main limitations I didn’t manage to get around:

  • the size of the text can only be specified as an integer, as well as the diameter and center of a circle
  • there is no relative positioning (we have to specify the position of one pixel used as a reference to draw the shape or the text).

Regarding the other libraries, they appeared to be the most appropriate for the tasks to be performed, and I tried to limit them to the strict minimum.
Note that I used a library for the projection of the coordinates in the first approach, but I ended up writing my own projection function when working on the second approach.
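This is not the actual projection function from the repository, but the principle can be sketched as a simple linear mapping from a longitude/latitude bounding box to pixel coordinates:

```python
def project(lon, lat, bbox, width, height):
    """Map a (lon, lat) point linearly onto a width x height image."""
    lon_min, lat_min, lon_max, lat_max = bbox
    x = (lon - lon_min) / (lon_max - lon_min) * width
    # Flip the y axis: image rows grow downwards, latitude grows upwards
    y = (lat_max - lat) / (lat_max - lat_min) * height
    return int(x), int(y)

# Example: a point in Manhattan on a 1920x1080 canvas covering roughly NYC
print(project(-73.98, 40.75, (-74.26, 40.49, -73.70, 40.92), 1920, 1080))
```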

In the end, what do we get?

  • an animation for the whole year 2018, with 1–2 seconds of animation per day showing the flow of passengers (i.e. the movement of dots on the map representing passengers going from one point to another)
  • an animation only for aggregated weekdays and weekends data — which would be represented by 1–2 seconds per week, with either just weekdays or just weekends.
  • several maps showing for each zone the number of incoming passengers traveling on average overall during the year
  • several maps showing for each zone the number of outgoing passengers traveling on average overall during the year
  • several maps showing for each zone the difference between the average of incoming passengers between weekdays and weekends (so for each zone we have a map, that uses a color code to show where people are mostly coming from, and whether it is more on weekdays or on weekends).
  • several maps showing for each zone the difference between the average of outgoing passengers between weekdays and weekends (so for each zone we have a map, that uses a color code to show where people are mostly going to, and whether it is more on weekdays or on weekends).

Now what?

Well, first of all, you are welcome to read the other post about the interpretation I make of the results obtained. I know, carried away by the process, we tend to forget why we even started this adventure…. Totally worth it, I would say from my side!

Just a comment on performance: I really tried to optimise both the queries and the code so as to minimise computation time and memory usage. There are probably still improvements that could be made.
To give an idea of how much time it took to run in my environment:

  • about 6 minutes to render the maps (so if we render both the whole year and the difference between weekdays and weekends, we need about 12 minutes)
  • about 23 minutes to render the video of NYC with the whole year
  • an extra 15 minutes to render another video of a borough with the same query results
  • about 13 minutes to render the video of NYC with only weekdays, aggregated per week
  • an extra 3 minutes to render another video of a borough with the same query results

Then there is a list of things to either improve, or analyse further! For example:

  • Some improvements on the rendering
  • Looking at the code structure and my general expectations, I would say going for OOP wouldn’t be a bad idea, as it would make the code easier to maintain and would greatly simplify its structure.
  • Make the map function more flexible (choose which maps to render)
  • Use the coordinates instead of just the location id
  • Compare the results across more years
  • Represent the variation over time within one day
  • Include other datasets (green taxi and FHV) for the whole set of analysis
  • Compare the observations on the flow with the availability of the subway lines
  • Go further in terms of the analysis of the correlation of the various transportation types (public transportation, bikes, personal cars, walking,….)
  • Analyse whether there is a correlation between the date, time, pick-up and drop off zones and the tip percentage, or the payment type.
  • Propose a predictive model of how long a ride will take depending on the origin-destination pair

This project was amazing to work on. It is really rewarding to be able to make sense of such an amount of raw data of great quality, and to play with animations to share my conclusions!
If you have never done it, I can only recommend that you give it a try! Besides, I see it as a personal achievement, as it is the first time I have worked on such a big ‘coding challenge’ (remember, I graduated in urban planning and only started coding very recently ;)).

What do you think about this project? I would love to hear your opinion!
