Cyclistic Case Study

John Simko
May 28, 2023

Google’s Data Analytics Certification Capstone Project

Docked Bikes in Chicago

Overview

This is a capstone project as a part of the Google Data Analytics Professional Certificate course offered by Coursera. I am analyzing the usage patterns of Cyclistic, a fictitious bike-sharing company in Chicago, to identify differences between casual riders and annual members. My goal is to help the marketing team devise strategies to convert casual riders into annual members, ultimately driving company growth. I will follow a structured approach using Google’s data analysis process: ask, prepare, process, analyze, share, and act. By examining historical trip data over the past 12 months, I aim to answer key business questions, uncover trends, and provide actionable insights backed by compelling data visualizations. The resulting analysis and recommendations will be presented to the executive team to inform marketing strategies aimed at increasing the number of annual members.

About Cyclistic

In 2016, Cyclistic launched a successful bike-share offering. Since then, the program has grown to a fleet of 5,824 bicycles that are geotracked and locked into a network of 692 stations across Chicago. The bikes can be unlocked from one station and returned to any other station in the system anytime.

Customers who purchase single-ride or full-day passes are referred to as casual riders. Customers who purchase annual memberships are Cyclistic members. Cyclistic’s finance analysts have concluded that annual members are much more profitable than casual riders. Although the pricing flexibility helps Cyclistic attract more customers, Lily Moreno, the Director of Marketing, believes that maximizing the number of annual members will be key to future growth.

1. Ask

Business Task

Analyze the usage patterns of casual riders and annual members to help the marketing team devise strategies to convert casual riders into annual members, ultimately driving company growth.

Key Business Questions

  1. How do annual members and casual riders use Cyclistic bikes differently?
  2. Why would casual riders buy Cyclistic annual memberships?
  3. How can Cyclistic use digital media to influence casual riders to become annual members?

2. Prepare

About the Data

I am using Cyclistic’s historical trip data to identify trends. The data has been made available by Motivate International Inc. under license. There is a separate CSV file for each month of data, and I downloaded the previous 12 months (May 2022 through April 2023) of Cyclistic trip data.

3. Process

Tools

The tools I used for this project were RStudio (2023.03.1 Build 446) running R (version 4.3.0) and Microsoft Excel (Office 365 version), all on Windows 11 Home.

You can find my supporting files here.

Load Data

I imported the data from each .csv file into its own data frame. After performing some initial manual data integrity checks, I merged the individual data frames into a single data frame.
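
The original work was done in R. As a rough illustration of the load-and-merge step, here is a minimal Python sketch using only the standard library. The function name merge_monthly_trips and the header check are assumptions made for this example, not the author's actual code.

```python
import csv


def merge_monthly_trips(sources):
    """Merge rows from several CSV sources into one list of dicts.

    Each source is any iterable of text lines, such as an open file handle.
    """
    merged = []
    expected_header = None
    for source in sources:
        reader = csv.DictReader(source)
        header = reader.fieldnames
        # Simple integrity check: every monthly file must share one header.
        if expected_header is None:
            expected_header = header
        elif header != expected_header:
            raise ValueError(f"Header mismatch: {header} != {expected_header}")
        merged.extend(reader)
    return merged
```

In practice each source would be a file opened with open(path, newline=""), one per monthly file. Failing fast on a header mismatch is one way to automate part of the manual integrity check before merging.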

Verify and Clean the Data

I performed various validation and cleaning techniques against the merged data frame. Below are some key steps and findings.

  • I replaced empty string values with NA in the character columns. A high percentage of values are missing in the start and end station name/id columns. For example, 14% of start_station_name values are NA. There are also missing values in the end lat and long columns, which I analyzed but took no action on.
  • The data contains more than twice as many distinct start and end station names/ids as the 692 official Cyclistic stations. This could be because new official stations have been added, because of data entry issues, or because starting and ending locations are tracked even when they are not at official stations. No action taken.
  • There is inconsistency in both start and end station id naming convention. The values are in various lengths. Some are all numbers and some are character based. I identified and removed some outlier stations (e.g., DIVVY 001 — Warehouse test station).
  • I converted rideable_type and member_casual columns which contain categorical values into “factor” type to reduce data redundancy and save memory space.
  • I checked whether started_at was later than ended_at in the same row and, where it was, swapped the two values.
  • I verified that there are no records with started_at earlier than “2022-04-01” or later than “2023-05-01”, and none with ended_at later than “2023-05-01”.
  • I converted started_at and ended_at to date-time and created new columns (e.g., start_hour, start_dow, ride_duration) to assist in further analysis.
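
Several of the cleaning steps above (treating empty strings as missing, swapping reversed timestamps, and deriving start_hour, start_dow, and ride_duration) can be sketched in a few lines. This is a Python illustration rather than the R code used in the project. The timestamp layout, the use of minutes for ride_duration, and the helper name clean_ride are all assumptions.

```python
from datetime import datetime

# Assumed layout of the started_at / ended_at strings.
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"
DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def clean_ride(row):
    """Return a cleaned copy of one trip record (a dict of strings).

    Assumes started_at and ended_at are present in every row.
    """
    # Treat empty strings as missing values.
    cleaned = {key: (value if value != "" else None)
               for key, value in row.items()}

    start = datetime.strptime(cleaned["started_at"], TIMESTAMP_FORMAT)
    end = datetime.strptime(cleaned["ended_at"], TIMESTAMP_FORMAT)
    # Swap the timestamps when the ride appears to end before it starts.
    if start > end:
        start, end = end, start

    cleaned["started_at"] = start
    cleaned["ended_at"] = end
    cleaned["start_hour"] = start.hour
    cleaned["start_dow"] = DAY_NAMES[start.weekday()]
    # Ride duration in minutes (assumed unit).
    cleaned["ride_duration"] = (end - start).total_seconds() / 60
    return cleaned
```

Looking up the day name from a fixed list, rather than formatting it with the system locale, keeps the result the same on any machine.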

4. Analyze

Below are some key steps and findings during the analysis phase.

I ran summary() against ride_duration.

The statistics indicate extremely high ride duration values in the dataset (i.e., the max value is much higher than the third quartile). This is causing the mean to be much higher than the median. This indicates a positive (right) skew, meaning there are a small number of very large values (i.e., outliers). I separated the data into casual and member data sets and created a box plot.

Ride Duration Box Plot with No Adjustments

The outliers cause the two boxes to show as flat lines on the x-axis.
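
To make the skew argument concrete, the sketch below computes the same kind of statistics that R's summary() reports, using Python's statistics module. The function name summarize and the sample durations are invented for illustration and are not taken from the Cyclistic data.

```python
import statistics


def summarize(values):
    """Return min, quartiles, mean, and max for a list of numbers."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "mean": statistics.fmean(values),
        "q3": q3,
        "max": max(values),
    }


# Toy ride durations in minutes, with one extreme value.
toy_durations = [5, 8, 10, 12, 15, 1440]
toy_summary = summarize(toy_durations)
# A mean far above the median signals a right (positive) skew.
is_right_skewed = toy_summary["mean"] > toy_summary["median"]
```

With these toy numbers the single 1,440-minute ride pulls the mean far above the median, which is the same pattern described for ride_duration.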

I identified outliers using the interquartile range (IQR) and created new data sets to store the results. See (https://www.r-bloggers.com/2020/01/how-to-remove-outliers-in-r/) for details. This technique computes the IQR, which is the width of the range between the first and third quartiles, and uses it to define boundaries beyond which values are treated as removable outliers.
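
A minimal Python sketch of IQR-based outlier removal follows. It assumes the common convention of placing the boundaries 1.5 × IQR below the first quartile and above the third quartile. The post does not state the multiplier, so k=1.5 is an assumption, as are the function names.

```python
import statistics


def iqr_bounds(values, k=1.5):
    """Return (low, high) boundaries based on the interquartile range."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def remove_outliers(values, k=1.5):
    """Keep only the values that fall inside the IQR boundaries."""
    low, high = iqr_bounds(values, k)
    return [v for v in values if low <= v <= high]
```

For example, remove_outliers([5, 8, 10, 12, 15, 1440]) drops the 1,440-minute value and keeps the other five, because the upper boundary for that list works out to 22.875.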

In addition, I do not believe that rides lasting less than 1 minute should be included in this analysis, so I removed the 140,291 records that meet this criterion.

I created a new box plot based on the adjusted datasets with outliers and ride duration < 1 removed.

Ride Duration Box Plot with Outliers Removed

The new box plot shows that casual riders spend more time on their bikes than members on average.

However, members take more rides (3,198,857) than casual riders (2,098,900).

Number of Rides by Rider Type
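
The ride counts above translate into shares of total rides with simple arithmetic. The Python sketch below uses the two counts reported in this post. The function name ride_shares is a hypothetical helper.

```python
def ride_shares(counts):
    """Convert ride counts per rider type into percentages of the total."""
    total = sum(counts.values())
    return {rider: 100 * n / total for rider, n in counts.items()}


# Counts taken from the analysis above.
shares = ride_shares({"member": 3_198_857, "casual": 2_098_900})
```

Rounded to whole percentages, this gives roughly 60% of rides for members and 40% for casual riders.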

Casual riders took the most rides during the weekend. Members took the most rides during Monday through Friday.

Number of Rides by Rider by Day of Week
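
Grouped counts like these can be produced by tallying rides on a combination of columns. The sketch below is a Python illustration, not the R code used for the charts. It assumes each ride is a dict containing the member_casual and start_dow columns described earlier, and count_rides is a hypothetical helper.

```python
from collections import Counter


def count_rides(rides, *columns):
    """Count rides for each combination of values in the given columns."""
    return Counter(tuple(ride[c] for c in columns) for ride in rides)
```

Calling count_rides(rides, "member_casual", "start_dow") returns one count per rider type and day of week. Passing "start_hour" instead of "start_dow" gives the hourly breakdown.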

Members start their rides earlier in the day than casual riders do.

Number of Rides by Hour of the Day

Members used Classic and Electric Bikes more than casual riders. Docked Bikes were used only by Casual riders. Note: when researching the different bike types on the Divvy website, they don’t differentiate between Classic bikes and Docked bikes.

Total Rides by Bike Type

Electric bikes are starting to gain interest with members.

Member Rides by Month by Bike Type

Electric bikes are the preferred bike choice for casual riders.

Casual riders ride the Docked bike the longest, but this could be misleading since I think docked bikes should be treated like classic bikes.

Casual Ride Duration per Month by Bike Type

Streeter Dr & Grand Ave station is the most popular start station for casual riders with 44,526 rides.

Casual Start Stations

Kingsbury St & Kinzie St is the most popular start station for members with 24,377 rides.

Member Start Stations

Note: Members and casual riders do not share top 5 start stations.

Streeter Dr & Grand Ave station is the most popular end station for casual riders, with 45,329 rides.

Casual End Stations

Note: The Top 5 casual rider start and end stations are the same, with Streeter Dr & Grand Ave station being the most popular.

Kingsbury St & Kinzie St is the most popular end station for members, with 24,412 rides.

Note: The Top 5 member start and end stations are the same, with Kingsbury St & Kinzie St station being the most popular.
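
The station rankings in this section amount to counting rides per station for one rider type and keeping the largest counts. Here is a minimal Python sketch of that idea. It assumes each ride is a dict with member_casual, start_station_name, and end_station_name keys, and top_stations is a hypothetical helper rather than the author's R code.

```python
from collections import Counter


def top_stations(rides, rider_type, column="start_station_name", n=5):
    """Return the n most frequent stations for one rider type.

    Rides with a missing station name are skipped.
    """
    counts = Counter(
        ride[column]
        for ride in rides
        if ride["member_casual"] == rider_type and ride[column]
    )
    return counts.most_common(n)
```

Comparing the station names returned for "casual" and "member" as sets is one way to check whether the two groups share any top 5 stations.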

5. Share

Omitted Variables

The following is a list of data elements that were not provided in the dataset that would have been useful in the analysis.

1. Cyclistic does not provide cost information: we can’t calculate the amount charged for each ride, which would be useful in determining the break-even point for casual riders switching to a membership.
2. Cyclistic does not provide rider identification: we can’t determine the unique number of casual or member riders in the system, or the number of rides and ride duration of individual riders (both members and casual).
3. Cyclistic does not provide riders’ addresses: we can’t target local (Chicago) casual riders.
4. Cyclistic does not provide ride purpose (e.g., commute): we can’t separate commute rides from pleasure rides. It is not clear how Cyclistic determined that 30% of rides are for commuting.
5. Cyclistic does not provide distance traveled: we can calculate point-to-point distance using lat/long for start and end locations, but not the total distance covered by each ride.
6. Cyclistic does not provide an identifier of the official Cyclistic stations: we can’t perform analysis on rides starting and ending at Cyclistic vs non-Cyclistic stations.
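
The point-to-point distance mentioned in item 5 can be estimated with the haversine formula, which gives the great-circle distance between two latitude/longitude pairs. The Python sketch below is illustrative only: the function name and the mean Earth radius constant are assumptions, and the result is a straight-line distance, not the route actually ridden.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius, assumed for this sketch


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))
```

A ride that starts and ends at the same station yields a distance of zero even if the rider covered several kilometers, which is why this measure cannot stand in for total distance traveled.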

Business Question Responses

1. How do annual members and casual riders use Cyclistic bikes differently?

  • Members take more rides (60%) vs casual riders (40%)
  • Casual riders ride on average longer (~15min) than members (~10min)
  • Members take more than twice as many rides as casual riders from November through March
  • Casual riders ride more frequently on weekends and members ride more frequently during the week
  • Member and casual riders take approximately the same number of rides on the weekends, but casual riders ride for a longer amount of time
  • Member rides spike during commuter hours (6–9am and 4–7pm). However, they take more rides in the afternoon, which suggests that members also ride in the afternoon for pleasure
  • Casual riders start riding later in the morning compared to members. This is true on weekends also, which leads me to think that some members commute on weekends
  • Members used both classic and electric bikes more than casual riders
  • Docked bikes were used only by casual riders. When researching the different bike types on the Divvy website, they don’t differentiate between Classic and Docked bike types, so I would need to ask the Marketing team to confirm if there is a difference
  • Casual riders prefer electric bikes even if you combine docked and classic bikes
  • Docked average ride duration was much higher than classic and electric Bikes
  • Streeter Dr & Grand Ave station is the most popular start station for casual riders, with 44,526 rides
  • Kingsbury St & Kinzie St is the most popular start station for members, with 24,377 rides
  • Members and casual riders do not share top 5 start or end stations.

2. Why would casual riders buy Cyclistic annual memberships?

  • Cost savings: Casual riders who ride frequently may save money by switching to an annual membership. Unfortunately, Cyclistic does not provide price/cost or rider information, which makes it difficult to target riders who would benefit from potential cost savings.
  • Time savings: Casual riders could save time in commuting and getting around town for fun.
  • Special benefits: Casual riders could gain additional benefits (e.g., special bikes, Cyclistic user portal access, member parties) that are available only to members.

6. Act

The Act phase can be supported by responding to the 3rd Business Question.

3. How can Cyclistic use digital media to influence casual riders to become members?

  • Create a membership promotion on the cost savings if they ride their bike a certain number of times or for a certain amount of time
  • Create a membership promotion that highlights the number of stations throughout the city and the potential time savings on riding a bike to get around the city for both commuting and pleasure riding
  • Create a membership promotion with additional weekend discounts
  • Create an electric bike promotion, showcasing the member cost savings and benefits
  • Create a membership promotion that offers a reduced membership price to casual riders who share their “Cyclistic experience” on their social media platforms
