Google Data Analytics Capstone Project: Cyclistic Case Study (SQL & Tableau)
BACKGROUND: The analysis is based on the first case study of the Google Data Analytics Certificate. I am assuming the position of a ‘Jr. Data Analyst’ role at Cyclitic, a bike-share company based in Chicago. The aim of the analysis is to understand the behaviors between Cyclistic’s two main users: members and non-members. The reason why the company would like to understand the behaviors between the two is that the company’s membership numbers have been stagnant and the management team does not have strategies to help increase it. Analyzing the behaviors between the two would create insights that would help create strategies to promote their membership plan.
The data will be cleaned and analyzed using SQL and Tableau and will follow Google’s six-step process of processing data: ask, prepare, clean and process, analyze, share, and act.
Ask
In this phase, it is important to understand what are the major questions that drive the analysis. Given that the company has a problem with increasing its membership numbers and it has data about its users, it begs the question: “How do annual members and casual riders use Cyclistic bikes differently?”
Prepare
The data for this analysis was collected from here. It should be noted that only Quarter I data from 2023 were used due to the fact that not all data were available yet for 2023. Each table were in a .csv format and I was able to combine them all using the following function in SQL:
--combines the data
CREATE TABLE `keen-airlock-388921.Cylistics.Q1_trips` AS
SELECT *
FROM (
SELECT * FROM `keen-airlock-388921.Cylistics.apr_trip`
UNION ALL
SELECT * FROM `keen-airlock-388921.Cylistics.feb_trips`
UNION ALL
SELECT * FROM `keen-airlock-388921.Cylistics.march_trip`
UNION ALL
SELECT * FROM `keen-airlock-388921.Cylistics.jan_trips`
);
I also checked the length of ride_id’s (primary key) to make sure that I know what the primary key looks like. I also checked if there are any null values per row and if there are any duplicates.
--checks length of ride_ids
SELECT length(ride_id)
FROM `keen-airlock-388921.Cylistics.Q1_trips`
--checks total rows (1066014)
SELECT count(*)
FROM `keen-airlock-388921.Cylistics.Q1_trips`
-- checks if ride_ids have any duplicates (returns 1066014 values which means no duplicate ride_ids)
SELECT count(distinct (ride_id))
FROM`keen-airlock-388921.Cylistics.Q1_trips`
--check the number of nulls per row
SELECT
COUNT(*) - COUNT(ride_id) AS ride_id_count,
COUNT(*) - COUNT(rideable_type) AS rideable_type_count,
COUNT(*) - COUNT(started_at) AS started_at_count,
COUNT(*) - COUNT(ended_at) AS ended_at_count,
COUNT(*) - COUNT(start_lat) AS start_lat_count,
COUNT(*) - COUNT(start_lng) AS start_lng_count,
COUNT(*) - COUNT(end_lat) AS end_lat_count, --861 null values
COUNT(*) - COUNT(start_station_name) AS start_station_name_count, -- 151918 null values
COUNT(*) - COUNT(start_station_id) AS start_station_id_count, -- 152050 null values
COUNT(*) - COUNT(end_station_name) AS end_station_name_count, -- 161646 null values
COUNT(*) - COUNT(end_station_id) AS end_station_id_count, -- 161787 null values
COUNT(*) - COUNT(member_casual) AS member_casual_count,
COUNT(*) - COUNT(end_lng) AS end_lng_count -- 861 null values
FROM
`keen-airlock-388921.Cylistics.Q1_trips`
I also did the following queries to figure out how many stations there are, and how many member types there are
-- checks how many stations there are (50 stations)
SELECT DISTINCT end_station_name AS end_station
FROM `keen-airlock-388921.Cylistics.Q1_trips_valid`
-- confirms the station count (50 stations still)
SELECT DISTINCT start_station_name AS start_station
FROM `keen-airlock-388921.Cylistics.Q1_trips_valid`
-- confirms that member type is only 2 values which are member and casual riders
SELECT DISTINCT member_casual AS member_Type
FROM `keen-airlock-388921.Cylistics.Q1_trips_valid`
Clean and Process
Once I understood what values have NULL values, I created a new table where there would be no NULL values to continue cleaning
-- removes nulls
-- due to limitations in BigQuery, I cannot use the DELETE FROM function. Instead, I will be selecting the tables and creating a new table called Q1_trips_valid to bypass the paywall
SELECT *
FROM `keen-airlock-388921.Cylistics.Q1_trips`
WHERE
start_station_name IS NOT NULL AND
end_station_name IS NOT NULL AND
end_station_id IS NOT NULL AND
end_lng IS NOT NULL AND
start_station_id IS NOT NULL;
After there are no NULL values and duplicate values, it’s time to add new rows for analysis purposes. I appended the table to include the following:
- Day of the week AS day_of_week
- Month AS month
- Day AS day
- Year AS year
- ride length as ride_length
- formatted timestamp AS time
I also included data that have a valid time length (more than 1 minute and less than 24 hours)
SELECT
ride_id,rideable_type,start_station_name, end_station_name, start_lat, start_lng,end_lat, end_lng, member_casual AS member_type,started_at,
CASE
WHEN EXTRACT (DAYOFWEEK FROM started_at) = 1 THEN 'SUN'
WHEN EXTRACT (DAYOFWEEK FROM started_at) = 2 THEN 'MON'
WHEN EXTRACT (DAYOFWEEK FROM started_at) = 3 THEN 'TUE'
WHEN EXTRACT (DAYOFWEEK FROM started_at) = 4 THEN 'WED'
WHEN EXTRACT (DAYOFWEEK FROM started_at) = 5 THEN 'THU'
WHEN EXTRACT (DAYOFWEEK FROM started_at) = 6 THEN 'FRI'
ELSE 'SAT'
END AS day_of_week,
CASE
WHEN EXTRACT (MONTH FROM started_at) = 1 THEN 'JAN'
WHEN EXTRACT (MONTH FROM started_at) = 2 THEN 'FEB'
WHEN EXTRACT (MONTH FROM started_at) = 3 THEN 'MAR'
WHEN EXTRACT (MONTH FROM started_at) = 4 THEN 'APR'
ELSE 'UNKOWN'
END AS month,
EXTRACT (DAY FROM started_at) AS day,
EXTRACT (YEAR FROM started_at) AS year,
TIMESTAMP_DIFF (ended_at, started_at, minute) AS ride_length_m,
FORMAT_TIMESTAMP("%I:%M %p", started_at) AS time
FROM `keen-airlock-388921.Cylistics.Q1_trips_valid`
WHERE TIMESTAMP_DIFF (ended_at, started_at, minute) > 1 AND TIMESTAMP_DIFF (ended_at, started_at, hour) < 24
Analyze
Once the data were cleaned, I can now analyze the data answering the question “How do annual members and casual riders use Cyclistic bikes differently?”, which can be answered in two main categories: behaviors in the difference between time spent on the bicycle and the routes each users take. The analysis on the time spent on the bicycle were made through many queries because I would like to know the length of the ride in terms of minutes and hours while also knowing the trends when people ride the bikes in days and months. The following queries were used:
Trip time to calculate analysis on bike ride minutes (average ride length, minimum ride length, and maximum ride length)
-- trip time (minutes)
-- average trip time this is for total users
SELECT ROUND(AVG(ride_length_m), 2)
FROM `keen-airlock-388921.Cylistics.Q1_trips_valid_clean`
-- average trip time for each group. this shows 10.74 minutes for members and 19.64 for casual riders
SELECT ROUND(AVG(ride_length_m), 2) AS average, member_type
FROM `keen-airlock-388921.Cylistics.Q1_trips_valid_clean`
group by member_type
--max trip time for both members
SELECT ROUND(max(ride_length_m), 2) AS max_trip_time, member_type
FROM `keen-airlock-388921.Cylistics.Q1_trips_valid_clean`
-- minimum amount of time spent on a bike divided per group
group by member_type
SELECT ROUND(min(ride_length_m), 2) AS max_trip_time, member_type
FROM `keen-airlock-388921.Cylistics.Q1_trips_valid_clean`
group by member_type
Trip time to calculate when the bike ride is mostly being used throughout the week (Monday — Sunday)
-- trip time (day)
-- this query checks what days are the most popular ones within the week
SELECT count(day_of_week) AS count, day_of_week
FROM `keen-airlock-388921.Cylistics.Q1_trips_valid_clean`
GROUP BY day_of_week
-- this query checks for the most popular and least popular day for casual riders
SELECT count(day_of_week) AS count, day_of_week
FROM `keen-airlock-388921.Cylistics.Q1_trips_valid_clean`
WHERE member_type = 'casual'
GROUP BY day_of_week
-- this query checks for the most popular and least popular day for members riders
SELECT count(day_of_week) AS count, day_of_week
FROM `keen-airlock-388921.Cylistics.Q1_trips_valid_clean`
WHERE member_type = 'member'
GROUP BY day_of_week
Analysis of when bike rides are popular throughout the months of the first quarter (January — April)
--trip time (month)
-- this query selects the number of frequency of rides per month for members
SELECT count(month) AS count, month
FROM `keen-airlock-388921.Cylistics.Q1_trips_valid_clean`
WHERE member_type = 'member'
group by month
-- this query selects the number of frequency of rides per month for casual riders
SELECT count(month) AS count, month
FROM `keen-airlock-388921.Cylistics.Q1_trips_valid_clean`
WHERE member_type = 'casual'
group by month
Analysis of the popular hours each user rides the bikes
--trip time (hour)
SELECT
EXTRACT (HOUR from started_at) AS time_of_day, count (*) AS occurances, member_type
FROM `keen-airlock-388921.Cylistics.Q1_trips_valid_clean`
GROUP BY
time_of_day , member_type
ORDER BY
time_of_day DESC
Analysis of the stations of each bike ride (most and least popular start and end stations for members and non-members)
-- shows the least and most frequented start station name for casual riders
SELECT count(*) AS rank, start_station_name
FROM `keen-airlock-388921.Cylistics.Q1_trips_valid_clean`
WHERE member_type = 'casual'
GROUP BY
start_station_name
ORDER BY
rank DESC
-- shows the least and most frequented start station name for members riders
SELECT count(*) AS rank, start_station_name
FROM `keen-airlock-388921.Cylistics.Q1_trips_valid_clean`
WHERE member_type = 'member'
GROUP BY
start_station_name
ORDER BY
rank DESC
-- shows least and most frequented end station name for members
SELECT count(*) AS rank, end_station_name
FROM `keen-airlock-388921.Cylistics.Q1_trips_valid_clean`
WHERE member_type = 'member'
GROUP BY
end_station_name
ORDER BY
rank DESC
-- shows least and most frequented end station name for casual riders
SELECT count(*) AS rank, end_station_name
FROM `keen-airlock-388921.Cylistics.Q1_trips_valid_clean`
WHERE member_type = 'casual'
GROUP BY
end_station_name
ORDER BY
rank DESC
-- Checks the most popular routes
SELECT
COUNT(*) AS frequency,
end_station_name,
start_station_name
FROM
`keen-airlock-388921.Cylistics.Q1_trips_valid_clean`
GROUP BY
start_station_name,
end_station_name
ORDER BY
frequency DESC LIMIT 2;
-- checks the most popular routes for casual riders
SELECT
COUNT(*) AS frequency,
end_station_name,
start_station_name
FROM
`keen-airlock-388921.Cylistics.Q1_trips_valid_clean`
WHERE
member_type = 'casual'
GROUP BY
start_station_name,
end_station_name
ORDER BY
frequency DESC LIMIT 2;
-- checks the most popular routes for members
SELECT
COUNT(*) AS frequency,
end_station_name,
start_station_name
FROM
`keen-airlock-388921.Cylistics.Q1_trips_valid_clean`
WHERE
member_type = 'member'
GROUP BY
start_station_name,
end_station_name
ORDER BY
frequency DESC LIMIT 2;
-- checks the least popular route for member
SELECT
COUNT(*) AS frequency,
end_station_name,
start_station_name
FROM
`keen-airlock-388921.Cylistics.Q1_trips_valid_clean`
WHERE
member_type = 'member'
GROUP BY
start_station_name,
end_station_name
ORDER BY
frequency ASC LIMIT 2;
-- checks the least popular route for casual riders
SELECT
COUNT(*) AS frequency,
end_station_name,
start_station_name
FROM
`keen-airlock-388921.Cylistics.Q1_trips_valid_clean`
WHERE
member_type = 'casual'
GROUP BY
start_station_name,
end_station_name
ORDER BY
frequency ASC LIMIT 2;
Share
In this phase of the project, I am sharing the insights I received from my analysis through visualizations using Tableau. For a better experience, I highly recommend checking out my dashboard here.
Act
Key Takeaways:
Q1 members often use the bikes for work while non-members use the bike for leisure
Q1 members often use the bike during the weekdays while non-members often use the bikes during the weekends
Both users prefer classic bikes to electric
The Average ride time for members is less than casual riders (10.74 minutes for member riders and 19.64 minutes for casual riders)
Recommendations:
- ) Provide a membership package to categorize riders into two groups: leisure riders (people who use the bikes for leisure activities) and business riders (people who use the bikes for school or work).
- ) When marketing Cyclitic bikes, use classic bikes as the main model of the bikes because most users like that more
- ) Design classic bikes are unique for leisure riders and business riders. This enforces the idea of users being either a “leisure rider” or a “business riders”. This allows the user to be a part of a group identity and become a part of something bigger than them which could be attractive to some.
- ) Make sure that there are enough bikes to handle the volume of riders during the weekends for leisure riders and weekdays for business riders.
- ) Since the ride time for leisure riders are often longer, create a marketing effort to show them that users whose ride time is longer than 10 minutes can become a member
Next steps:
Some recommendations for the next steps are to create a breakdown on the analysis of the drop-off and pick-up stations. This will allow the company to understand what locations are most popular for members and non-members.
Thank you for reading my article! Hope you learned a lot here and understood how data analysis can create impactful insights on organizations. To learn more about my methodology on this case study, you can view the related case studies I am attaching below. Also, if you have any feedback on how I can improve, I am always open to feedback. Leave a comment and don’t forget to clap! Thank you!
Inspiration: