Citi Bike Trips Analysis And Prediction

CHAPTER 1: DATA ANALYSIS

Aug 20, 2023

Introduction

CitiBike is a privately owned public bicycle-sharing system serving New York City that began operating in May 2013. It currently has about 17k bikes in operation, more than 1k stations, and an average daily ridership of around 48k (recorded in 2018). You can find more about CitiBike on Wikipedia.

In this blog, I explore CitiBike data for the year 2020. An analysis like this can surface hidden patterns in the data and lay the groundwork for a predictive model.

Objective

The objective of this project is to extract insights from CitiBike NYC trip history data for the year 2020. Along the way, we will look for problems in the data and state the assumptions we make about them. Based on this data, we will then try to predict the number of trips for Q1 of 2021. Model development will be covered in Part 2 of this blog.

Data Overview

CitiBike provides detailed data on trips: the start and end time of each trip, start and end station ids, rider type (Subscriber or casual rider), trip time, bike information, and many other related details. The data is distributed in monthly files, so I downloaded 12 files for the year 2020. The files are large (between 100 MB and 500 MB each), and processing all of them together can be difficult, so I applied some methods to reduce the size of the dataframes.

Techniques to reduce the size of the dataframes:

* Don't load all columns
* Shrink numerical columns by using smaller dtypes
* Shrink categorical columns by using the Categorical dtype

These techniques cut memory usage by about 50%. I then joined all 12 months of data into one single dataframe for the year 2020. The combined dataset is still large, so I worked in Google Colab, as my own system was not able to process it.
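The three techniques and the monthly join can be sketched roughly as follows. This is a minimal sketch, not the exact code used for this post: the column subset, the chosen dtypes, and the helper names `load_month` and `load_year` are assumptions, and the real files may contain missing values that need a nullable or float dtype instead.

```python
import io
import pandas as pd

# Assumed column subset and dtypes; column names follow the 2020 trip files.
DTYPES = {
    "tripduration": "int32",
    "bikeid": "int32",
    "usertype": "category",
    "birth year": "int16",
    "gender": "int8",
}
DATE_COLS = ["starttime", "stoptime"]

def load_month(source):
    """Read one monthly file, keeping only the needed columns in compact dtypes."""
    return pd.read_csv(
        source,
        usecols=list(DTYPES) + DATE_COLS,  # skip columns we do not need
        dtype=DTYPES,                      # smaller numeric and categorical dtypes
        parse_dates=DATE_COLS,
    )

def load_year(sources):
    """Stack the monthly frames into a single dataframe."""
    df = pd.concat([load_month(s) for s in sources], ignore_index=True)
    # concat falls back to object dtype if the monthly categories differ
    df["usertype"] = df["usertype"].astype("category")
    return df

# Tiny in-memory stand-in for a monthly CSV (values are made up).
sample = (
    "tripduration,starttime,stoptime,bikeid,usertype,birth year,gender,unused\n"
    "600,2020-01-01 00:00:00,2020-01-01 00:10:00,1001,Subscriber,1990,1,x\n"
)
df_2020 = load_year([io.StringIO(sample), io.StringIO(sample)])
print(df_2020.dtypes)
print(df_2020.memory_usage(deep=True).sum(), "bytes")
```

With real data you would pass the 12 file paths to `load_year` instead of the in-memory sample, and compare `memory_usage(deep=True)` before and after to see the actual saving.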

With reference to the data documentation provided by CitiBike, here are the key fields:
* tripduration: total time in seconds the ride was active
* starttime, stoptime: trip start time and end time
* start station id, start station name, end station id, end station name: start and end station details
* bikeid: bike details
* usertype: Customer = 24-hour pass or 3-day pass user; Subscriber = Annual Member
* birth year: year of birth
* gender: 0 = unknown; 1 = male; 2 = female

Quick Transformation:

I did some quick transformations here:

* Changed trip time from seconds to minutes, as we do not expect trips that last less than a minute.
* Converted birth year to user age. As the data was captured in 2020, we compute each user's age as of 2020:

userAge = 2020 - birthYear
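In pandas, both transformations are one-liners. The sketch below uses a few made-up rows, and the column names `trip_minutes` and `userAge` are my own choice; whether to truncate or round the minutes is also an assumption.

```python
import pandas as pd

# Toy rows standing in for the real trip data (values are made up).
trips = pd.DataFrame({
    "tripduration": [95, 600, 3600],   # seconds
    "birth year": [1990, 1969, 1873],
})

# Seconds to whole minutes; integer division drops the leftover seconds.
trips["trip_minutes"] = trips["tripduration"] // 60

# Age as of 2020, the year the data was recorded.
trips["userAge"] = 2020 - trips["birth year"]

print(trips)
```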

The distribution of the age column shows a maximum age of 147, which is clearly not plausible. We can consider 85 as the maximum and treat anything above that as an outlier.
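One simple way to apply this rule is a boolean mask. This is only a sketch on made-up ages; the constant name `MAX_PLAUSIBLE_AGE` is hypothetical, and whether to drop or cap the flagged rows is left open.

```python
import pandas as pd

MAX_PLAUSIBLE_AGE = 85  # threshold chosen in the text above

# Made-up ages for illustration.
ages = pd.Series([25, 40, 85, 120, 147], name="userAge")

# True where the age is above the threshold and treated as an outlier.
is_outlier = ages > MAX_PLAUSIBLE_AGE

# Share of rows flagged as outliers.
outlier_share = is_outlier.mean()
print(is_outlier.tolist(), outlier_share)
```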

Data Analysis

1. Trips And Users

We have a total of 19.5M (19,506,857) trips, made by 2 types of users:
1. Customers: users who use the 24-hour or 3-day plan
2. Subscribers: users who use the Annual membership plan
Looking at the data, we can see that the majority of trips were taken by Subscribers.
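The split between the two user types can be read off with `value_counts`. The sketch below uses a made-up column of four trips, so the printed shares are not the real figures.

```python
import pandas as pd

# Made-up usertype column; the real one has about 19.5M rows.
usertype = pd.Series(
    ["Subscriber", "Subscriber", "Subscriber", "Customer"], dtype="category"
)

# Fraction of trips per user type.
share = usertype.value_counts(normalize=True)
print(share)
```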

2. Daily Trips

To understand the data across the whole year, we can analyze the trip count per day. I grouped the data by start day to get a trip count for each day. This gives 366 rows (2020 is a leap year, so it has 366 days).
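A possible way to build the daily series is to group on the calendar day of `starttime`. The sketch below uses four made-up trips; the reindex step is my own addition so that days without any trips still appear with a count of 0.

```python
import pandas as pd

# Made-up trips around the leap day.
trips = pd.DataFrame({
    "starttime": pd.to_datetime([
        "2020-02-28 08:00", "2020-02-29 09:30",
        "2020-02-29 18:45", "2020-03-01 07:15",
    ])
})

# Count trips per calendar day (normalize() sets the time part to midnight).
daily_trips = trips.groupby(trips["starttime"].dt.normalize()).size()

# Make sure every day of 2020 is present, even days without trips.
full_year = pd.date_range("2020-01-01", "2020-12-31", freq="D")
daily_trips = daily_trips.reindex(full_year, fill_value=0).rename("trip_count")

print(len(daily_trips))
```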

3. Trips and Days of Week

Do people use CitiBike evenly throughout the week, or is there a spike on weekends?
To answer this question, let's group the daily data by day of the week.

The average number of trips is quite similar across weekdays, while weekends stand out: people choose to ride CitiBike on weekends over weekdays. Saturday shows the largest variation throughout the year, whereas Wednesday shows the smallest.
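The weekday comparison can be sketched by taking the mean and standard deviation of the daily counts per day name. The two weeks of counts below are made up, and using the standard deviation as the measure of "variation" is my assumption.

```python
import pandas as pd

# Two made-up weeks of daily trip counts, starting on a Monday.
daily = pd.DataFrame({
    "date": pd.date_range("2020-06-01", periods=14, freq="D"),
    "trip_count": [100, 110, 105, 108, 112, 150, 140,
                   104, 106, 103, 110, 108, 170, 150],
})

daily["weekday"] = daily["date"].dt.day_name()

# Average and spread of the daily counts for each day of the week.
by_weekday = daily.groupby("weekday")["trip_count"].agg(["mean", "std"])

# groupby sorts alphabetically, so restore calendar order.
order = ["Monday", "Tuesday", "Wednesday", "Thursday",
         "Friday", "Saturday", "Sunday"]
by_weekday = by_weekday.reindex(order)
print(by_weekday)
```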

4. Trips and Holidays

Do people choose to ride bikes on holidays?
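One way to check this is to flag holidays in the daily counts and compare the averages. The sketch below assumes US federal holidays from pandas' built-in `USFederalHolidayCalendar`, which may differ from the holiday list used for this post, and the trip counts are made up.

```python
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar

# Made-up daily counts around Labor Day 2020 (Monday, September 7).
daily = pd.DataFrame({
    "date": pd.date_range("2020-09-05", periods=5, freq="D"),
    "trip_count": [120, 125, 60, 100, 105],
})

# All US federal holidays observed in 2020.
holidays = USFederalHolidayCalendar().holidays(start="2020-01-01", end="2020-12-31")

daily["is_holiday"] = daily["date"].isin(holidays)

# Average trips on holidays versus all other days.
avg_by_holiday = daily.groupby("is_holiday")["trip_count"].mean()
print(avg_by_holiday)
```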

5. Trip Duration

Riding a bike requires energy, and we assume most riders are casual riders, so most rides should last no more than 60 minutes.

Plotting the distribution shows a very long right tail, which indicates outliers in the trip duration data.

Assumptions:
* The maximum trip duration is 6000+ minutes (100 hours), which is clearly wrong.
* Logically, a trip duration greater than 45–60 minutes makes little sense, so we treat such trips as outliers caused by some issue.
We can assume the issue is related to the trip-ending functionality: a malfunction while ending the trip may leave the trip timer running.

After capping the outliers, the data shows that most riders choose to ride for less than 20 minutes.
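Capping can be done with `clip`. In this sketch the cap of 60 minutes is taken from the assumption above, the constant name `CAP_MINUTES` is hypothetical, and the durations are made up.

```python
import pandas as pd

CAP_MINUTES = 60  # assumed upper bound for a plausible ride

# Made-up trip durations in minutes, including one runaway timer.
minutes = pd.Series([3, 12, 18, 45, 6000])

# Values above the cap are replaced by the cap; the rest are unchanged.
capped = minutes.clip(upper=CAP_MINUTES)

# Share of rides shorter than 20 minutes.
share_under_20 = (capped < 20).mean()
print(capped.tolist(), share_under_20)
```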

Top 10 Stations with Ride End Malfunctions:

We came up with a list of the top 10 stations that the CitiBike team should check, assuming some technical glitch is occurring there.
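A list like this could be produced by counting overly long trips per station. The sketch below is an assumption about the approach: it attributes each long trip to its end station, uses a 60-minute threshold, and runs on made-up rows with a hypothetical `trip_minutes` column.

```python
import pandas as pd

THRESHOLD_MINUTES = 60  # trips longer than this are treated as suspicious

# Made-up trips; station names are placeholders.
trips = pd.DataFrame({
    "end station name": ["A", "A", "B", "C", "B", "A", "A"],
    "trip_minutes": [500, 12, 90, 15, 700, 61, 120],
})

# Keep only the suspiciously long trips.
long_trips = trips[trips["trip_minutes"] > THRESHOLD_MINUTES]

# Stations where the most long trips ended, at most 10 of them.
top_stations = long_trips["end station name"].value_counts().head(10)
print(top_stations)
```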

6. Bike And Station Analysis

With the data we have, here is another insight:

Most Popular Bikes by number of times used and number of minutes used:
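Both rankings can come from a single groupby on `bikeid`. This is a sketch on made-up rows; the column `trip_minutes` and the output names `times_used` and `minutes_used` are my own.

```python
import pandas as pd

# Made-up trips for three bikes.
trips = pd.DataFrame({
    "bikeid": [1, 1, 1, 2, 2, 3],
    "trip_minutes": [5, 5, 5, 40, 30, 10],
})

# Number of trips and total minutes ridden per bike.
bike_usage = trips.groupby("bikeid")["trip_minutes"].agg(
    times_used="count", minutes_used="sum"
)

most_used = bike_usage.sort_values("times_used", ascending=False).head(10)
most_minutes = bike_usage.sort_values("minutes_used", ascending=False).head(10)
print(most_used)
print(most_minutes)
```

Note that the two rankings need not agree: in the toy data, bike 1 is used most often while bike 2 accumulates the most minutes.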

Calculating a Bike’s Idle Time:

We define a bike’s “idle time” at a station as the time it spends at that station between two consecutive trips.

I sort the data by bike id and start time, so we have continuous information for each bike. To calculate idle time, the end station of the previous trip and the start station of the next trip must be the same; only then can we say the bike was idle at that station.

This assumption will not hold in every case. For the same bike, the previous end station and the new start station might differ if the bike was relocated by the CitiBike crew, for example because there was less demand at the end station.

So the formula for calculating the idle time is:
idleTime = startTimeOfCurrentTrip - endTimeOfPreviousTrip

We will check that each pair of consecutive trips satisfies 2 conditions:

* both bike ids are the same
* both station ids are the same

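import numpy as np

# sort so that each bike's trips are consecutive and in time order
df_trip = df_trip.sort_values(['bikeid', 'starttime'])

# minutes between this trip's start and the previous row's end
# (astype('timedelta64[m]') is rejected by recent pandas versions)
df_trip['idletime'] = (df_trip['starttime'] - df_trip['stoptime'].shift(1)).dt.total_seconds() / 60

# 1 if the previous row belongs to the same bike, otherwise 0 (we won't use that row)
df_trip['bike_same'] = np.where((df_trip['bikeid'] == df_trip['bikeid'].shift(1)), 1, 0)

# False if this trip starts at a different station than the previous trip ended
df_trip['flag'] = df_trip['start station id'] == df_trip['end station id'].shift(1)
df_trip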
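To turn the two conditions into usable idle times, the rows that fail either check have to be masked out. The sketch below is one possible variant, not the code from this project: it shifts within each bike using `groupby`, which enforces the same-bike condition implicitly, and then keeps the idle time only where the stations match. The data is made up.

```python
import pandas as pd

# Made-up trips: bike 1 has three trips, bike 2 has one.
trips = pd.DataFrame({
    "bikeid": [1, 1, 1, 2],
    "start station id": [10, 20, 30, 20],
    "end station id": [20, 25, 40, 50],
    "starttime": pd.to_datetime([
        "2020-01-01 08:00", "2020-01-01 09:00",
        "2020-01-01 12:00", "2020-01-01 08:30",
    ]),
    "stoptime": pd.to_datetime([
        "2020-01-01 08:15", "2020-01-01 09:20",
        "2020-01-01 12:30", "2020-01-01 08:50",
    ]),
}).sort_values(["bikeid", "starttime"])

# Previous trip of the same bike; the first trip of each bike gets NaN/NaT.
prev = trips.groupby("bikeid")[["end station id", "stoptime"]].shift(1)

# The bike is idle at a station only if it starts where it last ended.
same_station = trips["start station id"] == prev["end station id"]

idle_minutes = (trips["starttime"] - prev["stoptime"]).dt.total_seconds() / 60
trips["idletime"] = idle_minutes.where(same_station)

# Average idle time per station, ignoring rows without a valid idle time.
idle_by_station = trips.groupby("start station id")["idletime"].mean().dropna()
print(idle_by_station)
```

In the toy data, only the second trip of bike 1 qualifies: it starts at station 20, where the previous trip ended 45 minutes earlier. The third trip starts at a different station than the previous end station, so it is treated as a relocation and skipped.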

7. Trips and Weather

Different weather conditions might affect the daily trip count. People might not ride if it is too cold outside, or they might use bikes more on sunny days.

To test this hypothesis, let's download climate data from NOAA for the Central Park station. This dataset has daily weather data with temperature (min/max), snowfall, and other weather information.
I removed all unnecessary columns to keep only the fields needed for the analysis.
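Trimming the weather file can be done while reading it. The sketch below assumes a CSV export with `DATE`, `PRCP`, `SNOW`, `TMAX`, and `TMIN` columns; the actual column names and units depend on the options chosen when ordering the data from NOAA, and the rows here are made up.

```python
import io
import pandas as pd

# Made-up sample in the assumed shape of a NOAA daily CSV export.
raw_csv = io.StringIO(
    "STATION,NAME,DATE,PRCP,SNOW,TMAX,TMIN\n"
    "S1,Central Park,2020-01-01,0.00,0.0,41,34\n"
    "S1,Central Park,2020-01-18,0.25,2.1,34,24\n"
)

# Assumed set of columns worth keeping for this analysis.
KEEP = ["DATE", "PRCP", "SNOW", "TMAX", "TMIN"]

weather = pd.read_csv(raw_csv, usecols=KEEP, parse_dates=["DATE"])
print(weather)
```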

Finally, after merging the weather data with the daily trip counts, I plotted some graphs.
Temperature and rides relation?

Snowfall and rides?

* Trips increase as the temperature increases, up to the hottest days
* People choose not to use bikes when there is snowfall
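The merge and the two comparisons can be sketched as below. All values are made up, the weather column names (`DATE`, `TMAX`, `SNOW`) are assumed, and the temperature bins are arbitrary; the real analysis used plots rather than these summary tables.

```python
import pandas as pd

# Made-up daily trip counts.
daily_trips = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-18", "2020-02-10", "2020-04-05",
                            "2020-06-20", "2020-07-21", "2020-12-17"]),
    "trip_count": [12000, 30000, 52000, 70000, 64000, 9000],
})

# Made-up weather rows for the same days.
weather = pd.DataFrame({
    "DATE": pd.to_datetime(["2020-01-18", "2020-02-10", "2020-04-05",
                            "2020-06-20", "2020-07-21", "2020-12-17"]),
    "TMAX": [30, 48, 66, 84, 97, 28],
    "SNOW": [0.0, 0.0, 0.0, 0.0, 0.0, 3.1],
})

# Attach the weather of each day to its trip count.
merged = daily_trips.merge(
    weather, left_on="date", right_on="DATE", how="left"
).drop(columns="DATE")

# Average trips on days with and without snowfall.
merged["snow_day"] = merged["SNOW"].gt(0).map({True: "snow", False: "no snow"})
avg_by_snow = merged.groupby("snow_day")["trip_count"].mean()

# Average trips per temperature band.
temp_bins = pd.cut(merged["TMAX"], bins=[0, 40, 60, 80, 100])
avg_by_temp = merged.groupby(temp_bins, observed=True)["trip_count"].mean()

print(avg_by_snow)
print(avg_by_temp)
```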

Summary

* The dataset is huge; you can use different techniques or other libraries to analyze the data.
* After converting the birth year column to age, we observe implausible values above 85, up to 147. The CitiBike team needs to recheck and fix this data issue.
* Of the total trips in 2020, 77% were taken by Annual Subscribers.
* Due to the effect of the lockdown and Covid, the trip count dropped in mid-March, but people started moving back towards CitiBike after 2 months.
* People like to ride on weekends, except on holidays. People also like to ride on sunny days, while the trip count drops when there is snowfall.

You can use other features of the data to get more insights, such as the coordinates to visualize trips on maps. We can also use this data to predict next year's trip count.

The source code for this project is available on GitHub. We shall meet in the next chapter. Happy learning!