Data Analysis — Case Study II

TiFFium. ✿
7 min readAug 24, 2024

--

Introduction

Hi, I’m Tiffany. I’ve been studying for the Google Data Analytics Professional Certificate through Coursera for almost 5 months. I’m currently working on this capstone project, where I’ve learned about the 6 stages of the data analysis process.

ask, prepare, process, analyze, share, act

In this case study, I’ll perform a junior data analyst at the company — Cyclistic. In order to answer the key business question, I’ll use the analysis tools (SQL & Looker Studio) to get insight from the data.

Background

Cyclistic, launched in 2016, operates a successful bike-share program in Chicago with a fleet of 5,824 geotracked bicycles and 692 stations. The bikes can be unlocked and returned at any station, offering flexible mobility options. Cyclistic provides single-ride passes, full-day passes, and annual memberships. Casual riders use single-ride or full-day passes, while annual members enjoy more benefits and are more profitable. To boost growth, Cyclistic aims to convert casual riders into annual members by understanding their differences and leveraging digital media for targeted marketing.

Ask

  1. How do annual members and casual riders use Cyclistic bikes differently?
  2. Why would casual riders buy Cyclistic annual memberships?
  3. How can Cyclistic use digital media to influence casual riders to become members?

Prepare

Download and insert data

In order to conduct the analysis, I download historical Cyclistic bike trip data (July 2023 — July 2024) CSV files in my local drive and upload it to Bigquery.

Preview the data details which are match the ROCCC standard.

Reliable, Original, Comprehensive, Current, Cited

Process

Combining data

These tables are separated by year and month, and I’m supposed to combine all the data into one table for analysis. I’ll check all the fields’ attributes and value types, then organize the elements I need to use.

--combine the data--
CREATE TABLE
`sideproject-431707.divvy_tripdata.202307-202407`
AS SELECT *
FROM (
SELECT *
FROM `sideproject-431707.divvy_tripdata.202401-divvy-tripdata`
UNION ALL
SELECT *
FROM `sideproject-431707.divvy_tripdata.202402-divvy-tripdata`
UNION ALL
SELECT *
FROM `sideproject-431707.divvy_tripdata.202403-divvy-tripdata`
UNION ALL
SELECT *
FROM `sideproject-431707.divvy_tripdata.202404-divvy-tripdata`
UNION ALL
SELECT *
FROM `sideproject-431707.divvy_tripdata.202405-divvy-tripdata`
UNION ALL
SELECT *
FROM `sideproject-431707.divvy_tripdata.202406-divvy-tripdata`
UNION ALL
SELECT *
FROM `sideproject-431707.divvy_tripdata.202407-divvy-tripdata`
UNION ALL
SELECT *
FROM `sideproject-431707.divvy_tripdata.202307-divvy-tripdata`
UNION ALL
SELECT *
FROM `sideproject-431707.divvy_tripdata.202308-divvy-tripdata`
UNION ALL
SELECT *
FROM `sideproject-431707.divvy_tripdata.202309-divvy-tripdata`
UNION ALL
SELECT *
FROM `sideproject-431707.divvy_tripdata.202310-divvy-tripdata`
UNION ALL
SELECT *
FROM `sideproject-431707.divvy_tripdata.202311-divvy-tripdata`
UNION ALL
SELECT *
FROM `sideproject-431707.divvy_tripdata.202312-divvy-tripdata`

);

Data cleaning & reorganizing

--count all rows--
SELECT COUNT(*)
FROM
`sideproject-431707.divvy_tripdata.202307-202407`;

--count distinct ride id--
SELECT COUNT(DISTINCT ride_id)
FROM
`sideproject-431707.divvy_tripdata.202307-202407`;

--count duplicate ride_id--
SELECT
ride_id,
COUNT(ride_id) AS count_ride
FROM
`sideproject-431707.divvy_tripdata.202307-202407`
GROUP BY ride_id
HAVING count_ride >1
ORDER BY count_ride DESC;

--dupliacte ride_id detail--
SELECT
*
FROM
`sideproject-431707.divvy_tripdata.202307-202407`
WHERE ride_id = "5DCA3BA77BCB1B6D";
--- delete duplicate rider id--
MERGE INTO `sideproject-431707.divvy_tripdata.202307-202407` AS t
USING (
SELECT
*,
ROW_NUMBER() OVER(PARTITION BY ride_id ORDER BY started_at) AS rn
FROM
`sideproject-431707.divvy_tripdata.202307-202407`
) AS r
ON t.ride_id = r.ride_id AND t.started_at = r.started_at AND t.ended_at = r.ended_at
WHEN MATCHED AND r.rn > 1 THEN
DELETE;

I should delete the duplicate and null data, but since BigQuery’s DML requires payment to use, I have to skipped this cleaning step and proceeded to the next stage.

Analyze

Member type with bike type

--categorize the member type and ridealbe type-
SELECT
member_casual,
COUNT(member_casual) AS member_count,
rideable_type,
COUNT(rideable_type) as rideable_type_count
FROM `sideproject-431707.divvy_tripdata.202307-202407`
GROUP BY member_casual,rideable_type
ORDER BY member_count DESC;

There are two member types: member and casual.

  • The number of classic bikes is higher than that of electric bikes within both member types.

Total usage by year and month

SELECT
EXTRACT(YEAR FROM started_at) AS YEAR,
EXTRACT(MONTH FROM started_at) AS MONTH,
COUNT(DISTINCT ride_id) AS ride_count
FROM `sideproject-431707.divvy_tripdata.202307-202407`
GROUP BY YEAR, MONTH
ORDER BY YEAR, MONTH;
usage by year
SELECT
EXTRACT(YEAR FROM started_at) AS YEAR,
EXTRACT(MONTH FROM started_at) AS MONTH,
rideable_type,
COUNT(DISTINCT ride_id) AS ride_count
FROM `sideproject-431707.divvy_tripdata.202307-202407`
WHERE rideable_type = "classic_bike"
GROUP BY YEAR, MONTH, rideable_type
ORDER BY ride_count DESC;
eletric_bike and classic_bike usage by month

The usage ratio of electric bikes and classic bikes is similar across the months.

  • Usage is concentrated in July, August, and September.

Usage distribution by day of the week for different membership types.

SELECT
member_casual,
rideable_type,
COUNT(rideable_type) AS rideable_type_count,
CASE
WHEN EXTRACT(DAYOFWEEK FROM started_at) = 1 THEN 'SUN'
WHEN EXTRACT(DAYOFWEEK FROM started_at) = 2 THEN 'MON'
WHEN EXTRACT(DAYOFWEEK FROM started_at) = 3 THEN 'TUE'
WHEN EXTRACT(DAYOFWEEK FROM started_at) = 4 THEN 'WED'
WHEN EXTRACT(DAYOFWEEK FROM started_at) = 5 THEN 'THU'
WHEN EXTRACT(DAYOFWEEK FROM started_at) = 6 THEN 'FRI'
ELSE 'SAT'
END AS day_of_week
FROM `sideproject-431707.divvy_tripdata.202307-202407`
GROUP BY member_casual, rideable_type,
CASE
WHEN EXTRACT(DAYOFWEEK FROM started_at) = 1 THEN 'SUN'
WHEN EXTRACT(DAYOFWEEK FROM started_at) = 2 THEN 'MON'
WHEN EXTRACT(DAYOFWEEK FROM started_at) = 3 THEN 'TUE'
WHEN EXTRACT(DAYOFWEEK FROM started_at) = 4 THEN 'WED'
WHEN EXTRACT(DAYOFWEEK FROM started_at) = 5 THEN 'THU'
WHEN EXTRACT(DAYOFWEEK FROM started_at) = 6 THEN 'FRI'
ELSE 'SAT'
END
HAVING member_casual = "member"
ORDER BY rideable_type_count DESC;
  • Member usage is consistent throughout the week, with the highest usage on Wednesdays, Tuesdays, and Thursdays.
  • Casual usage, on the other hand, shows a clear difference, with the highest usage on Saturdays, Sundays, and Fridays.

Usage patterns by hour for the two membership types.

SELECT
member_casual,
-- rideable_type,--
COUNT(rideable_type) AS rideable_type_count,
EXTRACT(HOUR FROM started_at) AS time
FROM `sideproject-431707.divvy_tripdata.202307-202407`
GROUP BY member_casual, time
ORDER BY rideable_type_count DESC;
  • The peak usage times are at 8:00 AM and 5:00 PM, indicating that most people are commuting to and from work.

The average usage time for both members and casual members.

SELECT
member_casual,
AVG(TIMESTAMP_DIFF(ended_at, started_at, MINUTE)) AS avg_usage_time_minutes
FROM `sideproject-431707.divvy_tripdata.202307-202407`
GROUP BY member_casual;

From this, we can infer that member users tend to have shorter usage times, suggesting that they ride with a specific purpose during certain times (e.g., primarily for commuting). On the other hand, casual users have longer usage times, with a more gradual usage pattern, indicating that their rides are likely for leisure purposes.

member ‘s popular start station

SELECT
rideable_type,
start_station_name,
COUNT(start_station_name) AS popular_start_station,
start_lat,
start_lng
FROM `sideproject-431707.divvy_tripdata.202307-202407`
WHERE member_casual = "member"
GROUP BY
rideable_type,start_station_name,start_lat,start_lng
ORDER BY popular_start_station DESC
LIMIT 10;

The most popular starting stations for member users are mostly located in inland areas, and the usage frequency across different regions is relatively even.

casual ‘s popular start station

SELECT
rideable_type,
start_station_name,
COUNT(start_station_name) AS popular_start_station,
start_lat,
start_lng
FROM `sideproject-431707.divvy_tripdata.202307-202407`
WHERE member_casual = "casual"
GROUP BY
rideable_type,start_station_name,start_lat,start_lng
ORDER BY popular_start_station DESC
LIMIT 10;

On the other hand, the most popular starting points for casual users are mostly in scenic locations. Among them, the station at Streater Dr & Grand Ave, which is near Navy Pier, is the area with the highest usage frequency.

Share

  • Classic bikes are more popular than electric bikes for both types
  • Usage peaks in summer months (July-September)
  • Peak usage times are 8 AM and 5 PM, suggesting commute patterns.

Member Users:

  • Primarily use classic bikes.
  • Exhibit consistent riding habits, especially during commuting hours on weekdays.
  • Members take shorter, purposeful rides (e.g., commuting).
  • Most popular starting stations are located in inland areas.

Casual Users:

  • Casuals take longer leisure rides
  • Ride longer with peak usage on weekends.
  • Popular casual stations are in scenic areas, especially near Navy Pier

Act

Better Infrastructure

  • Make sure there are enough bikes at busy inland stations, especially during morning and evening rush hours.
  • Add more bikes and docking stations at scenic locations for weekend users. Consider providing more electric bikes for tourists.
  • Put more bikes at popular spots like Navy Pier
  • Make longer ride packages for users who like longer trips

Advertising

  • Run special deals in summer to get more customers
  • Spread out bike stations evenly in residential areas for members
  • Create commuter-friendly plans for regular weekday riders
  • Team up with local businesses to offer deals to riders
  • Start a points system to reward frequent riders and encourage more use

Reference:

https://medium.com/@iainselliott/google-data-analytics-capstone-project-cyclistic-case-study-8baed2f5a286

https://medium.com/@josephpatrick.garcia.98/google-data-analytics-capstone-project-cyclistic-case-study-sql-tableau-7381c4d69f8a

--

--