Exploring Citibike Ridership Data

Jeff Marvel
6 min read · Dec 19, 2021


For my final project at Flatiron, I performed a time series analysis on Citibike ridership data. In this first post, I’ll review the exploratory data analysis I performed as part of time series modeling, as well as some lessons learned from working with “big data”.

Bikeshare Background and Project Motivation

Citibike, New York City’s bikeshare program that’s owned and operated by Lyft, has been a huge success since it was first rolled out in 2013. Outside of a COVID-related hiccup in ridership in 2020, Citibike has experienced growth every year and is on pace for another record-breaking year, with an expected ~27mm rides in 2021. This growth mirrors the broader increase in popularity of bikesharing programs across the country: there are currently 60+ cities in the U.S. with some form of bikeshare program¹. These programs provide a vital transportation alternative that can be scaled relatively cheaply compared to other public transit options. Other documented benefits include lower pollution² and improved safety for all cyclists on the road³.

I am an avid user of Citibike myself, having taken ~1,300 rides since I first signed up in 2015. So when I learned that Citibike freely publishes its individual ride data going back to launch, I was keen to apply my Data Science chops to the problem of forecasting ridership. In particular, I wanted to answer two questions: what is the total expected ridership in 2022, and which NYC neighborhoods are projected to grow the fastest? These are important questions to answer, as Citibike itself has said that record ridership this year has strained the system⁴. An accurate projection of future growth could allow Citibike to appropriately plan staffing and inventory levels.

Working with “Big Data”

Citibike publishes their data here. Each zip contains a CSV with ride-level data for that month. “JC” refers to Jersey City, which I ignored for this analysis. Each file has a wealth of detailed information about the ride, including start / stop station, start / stop time, ride type (member or casual), gender, birth year, bike ID, and station coordinates (lat / long).
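To get a feel for a sample CSV, here’s a minimal sketch that loads one month with Pandas (the file name is a hypothetical example, and the exact column names vary across years of the export):

```python
import pandas as pd

# Load one month of ride-level data. The file name is an example;
# Citibike's column names and file naming have changed over the years.
rides = pd.read_csv("201907-citibike-tripdata.csv")

print(rides.shape)   # rows x columns for the month
print(rides.head())  # start/stop time, stations, coordinates, etc.
```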

My first approach was to manually save down all the CSV files, read them in individually using pd.read_csv(), and concatenate them together. Don’t do this! It’s silly for a couple of reasons. One, we’re Python programmers; there are plenty of existing tools to scrape and save down zip files from websites. Two, the final dataset was 20GB with 135mm rows, and I ran out of memory on my 16GB Mac performing even simple Pandas operations.

To solve the first issue, I wrote a scraping script to automatically pull down the CSVs through a specified date (link to Github below). To work with a dataset that large, there were a couple of options. The first was to use server capacity from a third-party service like Google Colab or AWS to process the data. However, since I knew my models would run on my local machine (it was just a matter of processing the data), I decided to go a different route and “batch process” the data. In other words, as I read in each monthly CSV, I re-aggregated it to a daily, station-level view so the final file size was a fraction of the original. The downside to this approach is that you have to be confident up front about the final column set you’ll need: running the script can take hours, and you lose some detail in the process. In my case, I was confident I could safely drop several of the columns for my business problem, and successfully worked around this “big data” issue.
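A condensed sketch of that download-and-aggregate loop (the S3 bucket URL, file-name pattern, and column names are assumptions based on one era of Citibike’s public export, and all of them have changed over time):

```python
import io
import zipfile

import pandas as pd
import requests

# Citibike's public S3 bucket (URL and file-name pattern are assumptions;
# both have varied over the years of the program).
BASE_URL = "https://s3.amazonaws.com/tripdata"

def monthly_ride_counts(year: int, month: int) -> pd.DataFrame:
    """Download one monthly zip and collapse it to daily station-level counts."""
    fname = f"{year}{month:02d}-citibike-tripdata.csv.zip"
    resp = requests.get(f"{BASE_URL}/{fname}")
    resp.raise_for_status()
    with zipfile.ZipFile(io.BytesIO(resp.content)) as zf:
        csv_name = [n for n in zf.namelist()
                    if n.endswith(".csv") and not n.startswith("__MACOSX")][0]
        rides = pd.read_csv(zf.open(csv_name), parse_dates=["starttime"])
    rides["date"] = rides["starttime"].dt.date
    # Keep station coordinates so stations can be mapped later.
    keys = ["start station id", "start station latitude",
            "start station longitude", "date"]
    return rides.groupby(keys).size().reset_index(name="rides")

# Only one raw monthly file is ever in memory at a time, which is
# what keeps peak memory usage low.
frames = [monthly_ride_counts(2021, m) for m in range(1, 13)]
pd.concat(frames, ignore_index=True).to_csv("daily_station_rides.csv", index=False)
```

On to EDA!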

Time Series EDA

My final processed data set had 2.1mm rows, with columns for station ID, date, station lat / long, NYC neighborhood, and borough (the latter two I appended using a GeoJSON file, which is quite interesting! I’ll write a future blog post on this). One of the first steps of any time series analysis is converting the date to a datetime object and making it the index of the dataframe. This allows for easier plotting and is a critical component of time series models.
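That conversion takes just a couple of lines (assuming the processed file from the sketch above):

```python
import pandas as pd

daily = pd.read_csv("daily_station_rides.csv")

# Parse the date strings and promote them to the index; resampling,
# plotting, and SARIMA models all expect a datetime index.
daily["date"] = pd.to_datetime(daily["date"])
daily = daily.set_index("date").sort_index()
```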

Next, I wanted to check for outliers. First, I filtered for the unique station IDs. Then, I converted the dataframe to a GeoPandas GeoDataFrame, which makes it easy to plot the coordinates.
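Something like the following (the coordinate column names are assumptions carried over from the aggregation sketch):

```python
import geopandas as gpd

# One row per station, keeping its coordinates.
stations = daily.drop_duplicates(subset="start station id")

# Build point geometries from the lat/long columns; EPSG:4326 is
# plain latitude/longitude.
gdf = gpd.GeoDataFrame(
    stations,
    geometry=gpd.points_from_xy(stations["start station longitude"],
                                stations["start station latitude"]),
    crs="EPSG:4326",
)
gdf.plot(markersize=2)
```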

There appear to be two groups of outliers: stations with missing coordinates and a clump of stations that appear well north of NYC (actually in Montreal, oddly enough). Since these represented a tiny fraction of rides, I decided to remove them from my dataset, then re-plot the data.
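Dropping both groups is a filter on the coordinates; the NYC bounding box below is an eyeballed assumption:

```python
# Remove rows with missing coordinates, then keep only points inside a
# rough NYC bounding box (the Montreal cluster sits far outside it).
gdf = gdf.dropna(subset=["start station latitude", "start station longitude"])
gdf = gdf.cx[-74.3:-73.6, 40.45:41.0]  # .cx slices by (x, y) bounds
gdf.plot(markersize=2)
```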

With the outliers removed, the map very closely resembles NYC! Next up is a simple plot of daily ridership through time.
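Since the processed data is one row per station per day, the citywide series is just a resample-and-sum:

```python
import matplotlib.pyplot as plt

# Collapse station-level rows into a single citywide daily total.
citywide = daily["rides"].resample("D").sum()

citywide.plot(figsize=(12, 4))
plt.title("Citibike daily ridership")
plt.show()
```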

From this plot, a few obvious trends common to time series problems are apparent:

  • Upward trend: ridership is increasing through time
  • Seasonality: ridership very obviously peaks during the summer months
  • Increasing variance: the difference between the peaks and troughs increases through time

Performing a seasonal decomposition of the time series makes this even more obvious.
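statsmodels does this in one call; period=365 treats the seasonality as annual, which is an assumption for daily data:

```python
from statsmodels.tsa.seasonal import seasonal_decompose

# Split the daily series into trend, seasonal, and residual components.
decomposition = seasonal_decompose(citywide, model="additive", period=365)
decomposition.plot()
```

Given the growing variance, model="multiplicative" is also worth a look; the log transform discussed below accomplishes something similar.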

The combination of these observations strongly suggests the time series is not stationary. Stationarity means that the time series’ statistical properties do not change through time, and having stationary data can materially improve a time series model’s performance. That is clearly not the case with our data. Fortunately, there are straightforward transformations to address each of these problems:

  • Upward trend: differencing the data by one day (subtracting the previous value from the current one for every entry) removes the upward trend
  • Seasonality: this can be specified explicitly as a term in a SARIMA model (more on that next post)
  • Increasing variance: log transforming the data keeps the variance roughly constant

Performing the above transformations (minus the seasonal term, which will be passed as a model parameter), the data is obviously more stationary. In other words, there is no clear trend with time.
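In Pandas terms, that’s a log transform followed by a first difference; the ADF test at the end is one way to quantify “more stationary” (a small p-value is evidence for stationarity):

```python
import numpy as np
from statsmodels.tsa.stattools import adfuller

# Log transform to stabilize the variance (log1p tolerates zero-ride days),
# then take the first difference to remove the upward trend.
stationary = np.log1p(citywide).diff().dropna()

stat, pvalue, *_ = adfuller(stationary)
print(f"ADF statistic: {stat:.2f}, p-value: {pvalue:.4f}")
```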

On to modeling! (to be continued)


Jeff Marvel

Beginning my adventure in Data Science through the Flatiron School. Writing here on all things Data Science as I continue learning.