Processing 42 Million Bike Share Rides into a Leaderboard with Zero-True (Part 1)

How I proved I’m the fastest person to ever bike from my house to my cofounder’s house using Zero-True.

Red Giuliano
Zero-True
6 min read · Nov 28, 2023


The ICONIC CitiBike, courtesy of NYC Bike Maps, https://www.nycbikemaps.com/

Most people who use their city’s bike share service do not think it’s a competition, but I’m here to prove them wrong. I’ve used CitiBike for several years now and love how easy it is to get around NYC that way. I also love Strava, and I’m a big fan of their KOMs/segments that allow you to create a sort of virtual leaderboard on a certain stretch of road or trail.

I’ve always had the gnawing suspicion that I clocked the fastest-ever time from the station near my house to the one near my cofounder’s house. I’ve searched online for a CitiBike leaderboard but never been able to find one, so I finally built one myself using the Zero-True notebook. Let me run you through the steps to get that done:

Exploring Data Sources

In order to create the leaderboard, I had to scour the internet for sources. Thankfully there are plenty of CitiBike enthusiasts and they’ve even created a subreddit which ranks in the top 9% by size on reddit! Interestingly, it seems that riding some of the older bikes around is a point of pride in the community.

Bike Number 14, Image by reddit user u/Tuttikanaynee, source: https://www.reddit.com/r/Citibike/comments/nax08i/taking_bike_00014_probably_one_of_the_oldest/

Going down a bit of a rabbit hole in this subreddit brought me to a post by a reddit user who actively maintains a dashboard at https://ctbk.dev/. They’ve done an amazing job documenting everything and have published their code on github. This is their twitter profile, if you’d like to check them out: https://twitter.com/RunsAsCoded.

From this point onwards I was able to google around and find the official CitiBike s3 data store, which is also documented relatively well. These are the relevant columns for our analysis that they list in their documentation:

  • Started at (time)
  • Ended at (time)
  • Start station name
  • End station name
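
Just to make the idea concrete before we open the notebook, here’s a minimal pandas sketch of the leaderboard logic, using a couple of made-up rides and placeholder station names (the columns mirror the fields listed above): subtract the two timestamps to get a ride time, then keep the fastest ride for each start/end pair.

import pandas as pd

# Two made-up rides between placeholder stations, with columns mirroring
# the fields listed in the CitiBike documentation above.
rides = pd.DataFrame({
    "started_at": pd.to_datetime(["2022-01-05 08:00:00", "2022-01-09 18:30:00"]),
    "ended_at": pd.to_datetime(["2022-01-05 08:07:30", "2022-01-09 18:41:00"]),
    "start_station_name": ["Station A", "Station A"],
    "end_station_name": ["Station B", "Station B"],
})

# Ride time is just the difference between the end and start timestamps...
rides["total_seconds"] = (rides["ended_at"] - rides["started_at"]).dt.total_seconds()

# ...and the fastest ride for each start/end pair is the leaderboard entry.
leaderboard = (
    rides.groupby(["start_station_name", "end_station_name"])["total_seconds"]
    .min()
    .reset_index()
)
print(leaderboard)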

In theory, this should be all we need to create the leaderboard. Let’s get started with the Zero-True notebook.

Installing and Starting Zero-True Notebook:

Installing Zero-True is very easy. Simply run the following command in your terminal:

pip install zero-true

Once everything has installed you can type:

zero-true notebook

Then if you navigate to http://localhost:1326 in your browser, you should see an empty Zero-True notebook.

Downloading the Data

We want our analysis to be reproducible, so instead of downloading the files manually, let’s write a script to download the data using the requests library. To do this, let’s create a Python cell in our Zero-True notebook. Here’s the code we’re going to use to download the data; this will be the first code cell in our notebook:

import requests
import os
from zipfile import ZipFile
from io import BytesIO

dest_folder = "downloaded_data/"

# Create destination folder if it doesn't exist
# This prevents the data from being downloaded multiple times.
if not os.path.exists(dest_folder):
    os.makedirs(dest_folder)

url = "https://s3.amazonaws.com/tripdata/202201-citibike-tripdata.csv.zip"
print(f"Downloading {url} ...")
response = requests.get(url)

if response.status_code != 200:
    print(f"Failed to download {url}")
else:
    with ZipFile(BytesIO(response.content)) as zip_ref:
        zip_ref.extractall(dest_folder)

Thankfully, the tables follow a well-defined naming convention so when we want to expand our analysis we can simply generate the URLs in a loop and download the data.
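
As a sketch of what that loop might look like (assuming every monthly archive keeps the same YYYYMM-citibike-tripdata.csv.zip naming pattern as the January 2022 file above), something like this should work:

import requests
import os
from zipfile import ZipFile
from io import BytesIO

dest_folder = "downloaded_data/"
os.makedirs(dest_folder, exist_ok=True)

# Generate the archive names for every month of 2022 and download each one.
months = [f"2022{m:02d}" for m in range(1, 13)]

for month in months:
    url = f"https://s3.amazonaws.com/tripdata/{month}-citibike-tripdata.csv.zip"
    print(f"Downloading {url} ...")
    response = requests.get(url)
    if response.status_code != 200:
        print(f"Failed to download {url}")
        continue
    with ZipFile(BytesIO(response.content)) as zip_ref:
        zip_ref.extractall(dest_folder)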

Proof of Concept

Now that we have the data, let’s explore a bit and see if it’s even possible to create a leaderboard. Every dataset is different, and even when the column names seem descriptive, it’s always important to trust but verify.

To do this, let’s add a SQL cell to our notebook. The first step is just to select the first few rows from one table to get a better sense of the data. The SQL editor in Zero-True is powered by DuckDB, so you can simply point your query at a file path and get blazingly fast analytics.

SELECT * FROM 'downloaded_data/202201-citibike-tripdata.csv'
LIMIT 10

Running the POC query in a Zero-True notebook. Image by author.

It seems like the data is clearly labeled and we can probably just start building our query. Let’s choose a couple of stations to build our proof of concept. Thankfully duckdb provides built-in methods for dealing with timestamps so our queries are very simple:

SELECT start_station_name, end_station_name,
started_at, ended_at,
date_diff('SECOND', started_at, ended_at) as Total_Time
FROM 'downloaded_data/202201-citibike-tripdata.csv'
WHERE start_station_name = 'Carmine St & 6 Ave'
AND end_station_name = 'Howard St & Lafayette St'
ORDER BY Total_Time DESC
LIMIT 100

Ok, now we have a table with the rides from Carmine St & 6 Ave to Howard St & Lafayette St that happened in January 2022.

Turning our Notebook into a Dynamic Dashboard

To turn our notebook into an interactive dashboard, we are going to let users choose which start and end stations to query. Let’s add another cell to our notebook. We simply import zero_true at the top of the cell and add some logic so the user can pick the station names they want to query.

import pandas as pd
import zero_true as zt

# Build the list of unique station names to populate the dropdowns
stations = list(pd.read_csv('downloaded_data/202201-citibike-tripdata.csv', usecols=['start_station_name'])['start_station_name'].unique())

# Searchable dropdowns so the user can pick a start and end station
start_station = zt.Autocomplete(id='start_stations', items=stations)
end_station = zt.Autocomplete(id='end_stations', items=stations)

You’ll notice when you run this cell that two searchable select boxes appear at the bottom of the cell, where a user can pick the stations they’re interested in. To use the user-selected values in our query, we just add the following SQL below:

SELECT start_station_name, end_station_name,
started_at, ended_at,
date_diff('SECOND', started_at, ended_at) as Total_Time
FROM 'downloaded_data/202201-citibike-tripdata.csv'
WHERE start_station_name = {start_station.value}
AND end_station_name = {end_station.value}
ORDER BY Total_Time DESC
LIMIT 100

Now if we choose different start and end stations, we can see the query update accordingly! Zero-True automagically parses your notebook to figure out which cells are related!

Visualizing Our Results

The Zero-True notebook offers seamless integration between SQL and Python. For example, you’ll notice that at the top of our SQL cell we can define a variable name. This variable name will be assigned to the result of the query and will be accessible in Python! Let’s name our query “leaderboard_data”. Now, in the Python cell below, we can visualize the top 100 times as follows:

import plotly.express as px
fig = px.scatter(leaderboard_data, x=leaderboard_data.index, y='Total_Time', title='Total Time vs. Row Number')
fig.update_xaxes(title='Row Number')
fig.update_yaxes(title='Total Time (seconds)')
zt.PlotlyComponent.from_figure(id='plotly', figure=fig)

Ok, now we can see our data in a beautiful plot thanks to Zero-True and the Plotly library. We now have the foundation we need to process 42 million rows into a dashboard and prove that I’m the fastest, but you’ll need the patience to wait for part 2 for that!

Publishing our analysis with Zero-True Cloud:

** Some of the features in this section are in beta testing. If you are interested in publishing your Zero-True notebook, please comment below and I can get in touch. **

We are going to publish this data exploration as an application at a live URL for anyone to see and interact with. Usually this would be a relatively painful process, but with Zero-True you can just run a one-liner from your terminal and see your notebook go live in a couple of minutes. For those of you who are lucky enough to be in our limited beta, the command is:

zero-true publish [api-key] [user-name] [project-name] [project-source]

Check out our live data exploration here!
