Building a Daily Aggregate Dataset from Raw Citibike Trip Data

Published in

qri.io

5 min readNov 19, 2020

This post describes the creation of nyc-transit-data/citibie_daily_counts_2020 , a Qri dataset which includes station-level trip start and end counts for the New York City bike share system. While this is a simple aggregation that any reasonable data analyst should be able to perform, the value add here is sharing it as its own product, so others can use it immediately without having to download several large CSV files and write the code on their own. We’re saving the world disk space and time!

tl;dr just give me the data please: https://qri.cloud/nyc-transit-data/citibike_daily_counts_2020

Photo: Flickr user Nestor Galina https://www.flickr.com/photos/nestorgalina

The Status Quo

Citibike does a great job publishing archival trip data on their website at https://citibikenyc/system-data. When you click “Download Citi Bike trip history data”, you land on the index page of an S3 bucket, where a zip file for each month of data awaits you. The archives go back to 2013!

Raw monthly citibike trip data lives in well-named ZIP files in an S3 bucket

So what’s in the data? Here’s a glimpse of what’s inside the dataset for October 2020 (101MB zipped, 427MB unzipped!). This is already becoming a substantial disk space commitment if you want to look at multiple months worth of data.

The raw data contains a row for each trip, with a lot of duplicate data about each station

We can see that each row is a single trip, with the duration, start time, end time, start station details (and location), end station details, etc. There’s even a birth year and gender. There’s also a bike id, meaning it is possible to track a single bike in its travels around the city. Someday it might be fun to animate a bike’s journey, similar to the way I did with Taxis back in 2014, but I digress.

The Code

I chose to do this aggregation in R Studio, but it could just as easily have been done in Python/Jupyter or even Javascript. The full code is on github if you want to take a look.

First, I manually downloaded each of the monthly zips for 2020 and unzipped them. This can be scripted too, but I already had them on disk from an earlier analysis.

Next, I wrote a function called processMonth which, as the name suggests:

reads a single months’s csv
pulls out the calendar date from the trip start timestamp
creates a lookup table for each unique station
counts the trip starts for each day/station
counts the trip ends for each day/station
joins the start counts, end counts, and station info

processMonth <- function(month) {
  raw <- read_csv(paste("/Users/chriswhong/Sites/citibike-daily/raw/", month, '-citibike-tripdata.csv', sep=''))
  
  # clean names, extract the date from the timestamp
  daily <- clean_names(raw) %>% mutate(date = as.Date(starttime))
  
  startstations <- distinct(daily, start_station_id, .keep_all = TRUE) %>%
    select(start_station_id, start_station_name, start_station_longitude, start_station_latitude) %>%
    rename(station_id = start_station_id, station_name = start_station_name, station_longitude=start_station_longitude, station_latitude = start_station_latitude)
  
  endstations <- distinct(daily, end_station_id, .keep_all = TRUE) %>%
    select(end_station_id, end_station_name, end_station_longitude, end_station_latitude) %>%
    rename(station_id = end_station_id, station_name = end_station_name, station_longitude=end_station_longitude, station_latitude = end_station_latitude)
  
  station_lookup <- bind_rows(startstations, endstations) %>% distinct(station_id, .keep_all = TRUE)
  
  
  # count daily trip starts for each station/date
  daily_starts <- group_by(daily, date, start_station_id) %>%
    summarize(startcount = n()) %>%
    rename(station_id = start_station_id)
  
  # count daily trip ends for each station/date
  daily_ends <- group_by(daily, date, end_station_id) %>%
    summarize(endcount = n()) %>%
    rename(station_id = end_station_id)
  
  # full join them so we don't lose any where there were starts but not ends and vice versa
  combined <- full_join(daily_starts, daily_ends, by = c('date', 'station_id'))
  
  with_station_info <- left_join(combined, station_lookup, by='station_id')
  
  return with_station_info
  
}

Once I had this function working correctly for one month, all I had to do was run it for each month, combine the results together, and a new CSV is born!

dailies <- list()months <- c('202001', '202002', '202003', '202004', '202005', '202006', '202007', '202008', '202009', '202010')
for (i in months) {
  dailies[[i]] <- processMonth(i)
}# combine all of the months into a single table
for (i in seq_along(months)) {
  if (i == 1) {
    combined <- dailies[[months[1]]]
  } else {
    combined <- bind_rows(combined, dailies[[months[i]]])
  }
}# write to CSV
write_csv(combined, '/Users/chriswhong/Sites/citibike-daily/citibike-daily-counts-2020.csv')

Here’s what citibike-daily-counts-2020.csvlooks like (now just 20MB!):

A new dataset is born! Daily starts and ends for each station for 2020 is easy to use, and is only 20MB in size

At the top of the dataset are all the aggregates for January 1, 2020, each row is a station with a count of starts, count of ends, a human-friendly station name, and lat/lon coordinates for mapping.

The 2020 dataset has been pushed to qri.cloud, where it’s hosted for all to find. If you find it useful, we’d love to hear about it!

Citibike Daily Counts 2020 | Dataset Published on qri.cloud

A dataset of aggregated station-level citibike trip starts and ends for the year 2020. Each contains the count of…

qri.cloud

A Quick Chart

The original prompt for the creation of this dataset came from a Qri user and fellow urban data analyst, Ben Oldenburg. He’s been making use of the 2020 NYC Subway Turnstile Counts dataset in some projects, and wondered if a similar “daily aggregates” dataset existed for Citibike. The answer was no, so we decided to make one.

This is the chart we were after, a quick visual of the effects of the arrival of coronavirus on Citibike traffic in March, followed by what looks like a recovery. We would need to compare this to previous years or ask a citibike expert (I am not one) to get a better feel for “normal”, but the curve is clearly quite different from what we see in the subway data.

Aggregating the starts for each yields single line chart for all of 2020, showing the dip in ridership in March/April due to covid19

For comparison, this chart shows NYC subway turnstiles for roughly the same time period (2020 YTD)

Is this dataset helpful to you? Is anything missing? Do you *really* want us to make the same dataset for 2013–2019 and Jersey City? Give us a yell on twitter or drop into our Discord and let us know what features you’d find helpful. Thanks for reading!

Building a Daily Aggregate Dataset from Raw Citibike Trip Data

Citibike Daily Counts 2020 | Dataset Published on qri.cloud

A dataset of aggregated station-level citibike trip starts and ends for the year 2020. Each contains the count of…

Written by Chris Whong