Data Visualization and Capital Bikeshare

Jared Blakney
Mission Data Journal
5 min read · May 16, 2016

In 2016 our Labs team is committed to exploring various new tech. Thus far we have worked on projects relating to IoT, iBeacons, haptics, and the Amazon Echo. Now we are about to begin exploring data visualization and analytics. This is the beginning of a running commentary that will cover our experimentation with that topic by using data from DC’s Capital Bikeshare. We’ll use the #bikeshare tag to track stories related to this topic.

Here at Mission Data we have started using the Pandas data analysis library to dive into the Trip History Data released by Capital Bikeshare. We decided to start with an exercise to gain familiarity with the data. It is downloaded as comma-separated files (.csv) from the Capital Bikeshare website, one file for each quarter of the year going back to Q4 of 2010. Each quarter contains roughly 0.5 to 1 million trips. The data contains attributes for:

  • Duration — Elapsed time between the bike being removed from a rack and returned to one
  • Start Date — Start date and time
  • End Date — End date and time
  • Start Station — Starting station name and number (each station has a unique identifier)
  • End Station — Ending station name and number
  • Bike # — Unique ID of the bike used for the trip
  • Member Type — Casual or Registered

Pandas provides the methods we will need to explore the data. However, in order for the data to be consumable by Pandas it will need to be scrubbed and merged into a single data set that can be loaded into memory. Remember that the data is spread across multiple files, a CSV file for each quarter.

In Pandas, we call this in-memory information a data frame. As we load each CSV file, we normalize the column headers to follow standard naming conventions. We needed to do this because the column headers are labeled differently across the raw files. For example, trip durations are labeled as ‘duration’, ‘duration (ms)’, or ‘total duration (ms)’, using multiple forms of capitalization. To normalize the columns we define a hash to map all the different source names to our standard ones.

{'duration': 'Duration', 'duration (ms)': 'Duration', 'total duration (ms)': 'Duration'}
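Pulling these pieces together, the load-and-normalize step can be sketched roughly as follows. The file locations, the helper name `load_trips`, and the abbreviated column map are our own illustrative choices, not the exact code from our pipeline:

```python
import glob
import pandas as pd

# Map the inconsistent source headers onto our standard names
# (abbreviated here; the real map covers every variant in the raw files).
COLUMN_MAP = {
    'duration': 'Duration',
    'duration (ms)': 'Duration',
    'total duration (ms)': 'Duration',
    'start date': 'Start Date',
    'end date': 'End Date',
}

def load_trips(paths):
    """Load each quarterly CSV, normalize its headers, and merge the results."""
    frames = []
    for path in paths:
        df = pd.read_csv(path)
        # Lowercase first so a single map handles every capitalization variant.
        df.columns = [c.strip().lower() for c in df.columns]
        frames.append(df.rename(columns=COLUMN_MAP))
    return pd.concat(frames, ignore_index=True)

# e.g. trips = load_trips(glob.glob('data/*.csv'))
```

Lowercasing before renaming keeps the map small: one entry per spelling rather than one per spelling-and-capitalization combination.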

Additionally, the format of the values for trip duration differ from file to file. In some, they are expressed as a millisecond integer and in others they are expressed as a string formatted as “hours:minutes:seconds”.
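One way to reconcile the two duration formats is a small conversion function applied to the whole column. This is a sketch under the assumption that normalizing to whole seconds is acceptable; the function name is hypothetical:

```python
def duration_seconds(value):
    """Normalize a raw trip duration to whole seconds, whether the source
    file stored it as a millisecond integer or an 'h:mm:ss' string."""
    if isinstance(value, str) and ':' in value:
        hours, minutes, seconds = (int(part) for part in value.split(':'))
        return hours * 3600 + minutes * 60 + seconds
    return int(value) // 1000

# Applied to the whole column:
# df['Duration'] = df['Duration'].apply(duration_seconds)
```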

Once the CSV files are loaded, the dates are also represented as formatted strings. The code snippet below shows how easy it is to create actual datetime objects from those strings using Pandas.

df['Start Date'] = pd.to_datetime(df['Start Date'], format="%m/%d/%Y %H:%M")

The key concept with the above format string is that pd.to_datetime applies it to the entire series, converting every value into an actual datetime object. After a bit more scrubbing, all the data is loaded into a single data frame and we can now slice it in different ways for analysis and exploration.

For this study, we focused on the number of trips from station to station because each record in our data frame contains a start and an end. However, our data frame contains approximately 400 unique stations (all of the ones operated by Capital Bikeshare) and approximately 11.4 million individual trips. This is simply too many stations to show in the visualization we wish to use. We either need to reduce the number of stations by grouping them into neighborhoods or find another visualization. For now we will continue with the original visualization method and will consider finding another one for a later post. To perform the grouping operation we again use a hash to map stations into neighborhoods. The snippet below shows how this is done with Pandas and our map, which is a native Python dictionary.

df['Start Neighborhood'] = df['Start Station'].apply(lambda x: mapping.get(x, x))

This gives us a new attribute derived from an existing one in the data frame. The same grouping operation is also performed on the end stations. Our mapping, which is currently limited to the Virginia stations, gives us 16 neighborhoods among which to count trips.

For counting we will use two Pandas methods, groupby and size. We group the data by starting neighborhood and then by ending neighborhood. The following code snippet shows how we perform the grouping operation and calculate the sizes.

trips = df.groupby(['Start Neighborhood', 'End Neighborhood']).size()
matrix = [[trips.get(k, {}).get(kk, 0) for kk in labels] for k in labels]

Above, the variable “labels” is simply a list of the unique neighborhood names. The result of these two lines is a square matrix where the rows represent the origin and the columns represent the destination. The values give the total count of trips between them. For example, suppose we have three neighborhoods in our data. The matrix in that case might look like the following:

[[100, 8, 4], [11, 130, 1], [6, 10, 125]]

Notice the relatively high trip counts along the diagonal of the matrix. These are trips where the origin and the destination are the same location, which is, not surprisingly, a common occurrence in the trip history data. It turns out that this square matrix is the exact data structure we will need for our visualization.
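Putting the grouping and matrix steps together on a tiny, made-up data set makes the result easy to check by hand. The neighborhood names here are hypothetical stand-ins, not our real mapping:

```python
import pandas as pd

# Four toy trips between two illustrative neighborhoods.
df = pd.DataFrame({
    'Start Neighborhood': ['Clarendon', 'Clarendon', 'Clarendon', 'Rosslyn'],
    'End Neighborhood':   ['Clarendon', 'Clarendon', 'Rosslyn',   'Rosslyn'],
})
labels = ['Clarendon', 'Rosslyn']

# Count trips per (start, end) pair, then pivot into a square matrix.
trips = df.groupby(['Start Neighborhood', 'End Neighborhood']).size()
matrix = [[trips.get(k, {}).get(kk, 0) for kk in labels] for k in labels]
# matrix is [[2, 1], [0, 1]]: two Clarendon round trips, one trip from
# Clarendon to Rosslyn, and one Rosslyn round trip.
```

The `.get(k, {})` fallback is what fills in zeros for neighborhood pairs with no trips, such as Rosslyn to Clarendon above.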

For the visualization, we chose to use a chord diagram produced with a JavaScript library called D3.js (Data-Driven Documents). The idea for this first analysis is inspired by Mike Bostock’s visualization of Uber data. To get our data into the chord diagram, we wrote our matrix to a JSON file and provided a CSV file with a label and color for each neighborhood. The example chord diagram depends on these two files. After a few minor tweaks to the example, we put all the needed files on a test web server and generated the visualization.
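Producing those two files is a few lines of Python. The matrix values, the neighborhood names and colors, and the file names below are illustrative; the names just need to match whatever the chord diagram example loads:

```python
import csv
import json

matrix = [[100, 8, 4], [11, 130, 1], [6, 10, 125]]
neighborhoods = [('Clarendon', '#1f77b4'),
                 ('Rosslyn', '#ff7f0e'),
                 ('Ballston', '#2ca02c')]

# The square matrix of trip counts, for the chord diagram to read.
with open('matrix.json', 'w') as f:
    json.dump(matrix, f)

# Labels and colors, one row per neighborhood, in the same order as the matrix.
with open('neighborhoods.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['name', 'color'])
    writer.writerows(neighborhoods)
```

Keeping the CSV rows in the same order as the matrix rows is the important detail: the chord diagram pairs them by position.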

You can view a fully interactive version of the chord diagram on our website.

Most of the effort for this analysis was in scrubbing the data and learning how to leverage Pandas to munge it and tally the trips. The capabilities of Pandas and D3.js are quite impressive, as we were able to go from raw data to visualization with very little code.

We are currently working on incorporating additional neighborhoods in the DC and Maryland area. We are interested in finding an effective way to visualize trips between stations within a given neighborhood. Additionally, we plan to look into presenting this visualization over a given time span and learn more about how the trips change over time. Specifically, how do trips change throughout the year, and how does that change compare to previous years? It would also be interesting to pull in other data sources, such as weather forecasts, to see how weather affects trips. Or perhaps we could obtain traffic data to see if there are any correlations.

Please check back with us or offer your own thoughts and experiences. We hope you have enjoyed this post and are looking forward to seeing more.

Have a web app, mobile app, or piece of custom software you need designed and developed? Drop us a line
