Cyclistic’s Bike Sharing

Benazir Kemunto
7 min readFeb 2, 2023

--

A walkthrough of the Google Data Analytics Capstone Project

This is my first case study as a Data Analyst, a first among many firsts I’m experiencing at the moment.

In this scenario, I’m a junior data analyst in Cyclistic, bike_sharing company and the Director of Marketing, Moreno, assigned me this project to answer some questions. She believes the company’s future success depends on maximizing the number of annual memberships. Therefore, the team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, the team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve my recommendations, so they must be backed up with compelling data insights and professional data visualizations. Data was provided in this link.

About the Program

It was launched in 2016 with 5,800 bicycles and 600 docking stations. It has grown to 5,824 bicycles that are geotracked and locked into a network of 692 stations which can be unlocked from one station and returned to any other station in the system anytime. Cyclistic’s marketing strategy relied on building general awareness and appealing to broad consumer segments but they recently decided to be inclined more on data and analytics. The client categories are casual riders — who purchase single-ride passes & full-day passes and annual members — who purchase annual memberships.

Cyclistic’s finance analysts have concluded that annual members are much more profitable than casual riders. Rather than creating a marketing campaign that targets all-new customers, Moreno believes there is a very good chance to convert casual riders to members since they’re already aware of the Cyclistic program and have chosen us for their mobility needs

ASK PHASE

Business Objective

Given the background above, the objective is to convert casual riders to annual members through a targeted marketing strategy.

Business Task

Moreno gave me a simple assignment i.e to find out how differently annual members and casual riders use the bikes.

Other stakeholders in this project are Marketing & Executive teams.

PREPARE PHASE

Data Description

Data was downloaded from the link provided and saved in a folder locally. The most recent 12 csv files were used for this project i.e from January 2022 to December 2022. They contained quantitative and qualitative data. The data was compatible since all 13 columns were the same across all files. The data is real and credible and has been made available by Motivate International Inc. under this license Data License Agreement | Divvy Bikes. Additionally, there’s no personally identifiable information in the dataset thus safety isn’t compromised by making it publicly accessible. The name has been changed to Cyclistic, a fictitious company, for the purposes of the case study.

Issues with the Data

Exploration of the data revealed the following:

· The started_at and ended_at columns have been reversed for some 77 observations.

· Some rides had impractical long hours but no break-times were provided.

· Out of the total number of observations i.e 5,667,717, there were 833,064 & 892,742 nulls in start_station_name & end_station_name columns. It was assumed there had to be a reasonable explanation for this e.g a merger. For purposes of this case study, it was assumed that Cyclistic made an acquisition from Planet9 Bikes, another bike_share company.

· 5,858 nulls in end_lng & end_lat columns.

· 95,957 observations had a ride_length (i.e difference between end and start of a ride) of 0 minutes.

· Some station id’s had numbers only while others had numbers and alphabets combined.

PROCESS & ANALYSIS PHASES

Tools Used

i) Microsoft Excel — Power Query is an advanced excel option which I used to extract, transform and load data. To get an idea of the visualizations I created a connection to the data model and used pivot tables and charts. Access the step-by-step process using this link.

One of the advantages of Excel is you can teach it what to do (using formulas). For instance, after extracting dates and times from started_at and ended_at columns, I used corresponding numbers of the week i.e 1 = Sunday, 2= Monday, etc and corresponding numbers for the months i.e 1 = January, 2 = February, etc.

A column named ride-length was also created using end_time — start-time in minutes which aided in further analysis. Some data transformations included changing data type to the most appropriate, removing duplicates, removing observations that were unaccounted for. View the excel file here.

ii) BigQuery — used for further cleaning and to gain more insight.

The files were uploaded into the Google Cloud Plaftorm (GCP) for further analysis using SQL analysis. To merge the files, I used the query below which copied data from one table to another and repeated for all 11 files.

INSERT INTO `vaulted-quarter-375910.Cyclistic.202201-divvy-tripdata`

SELECT *

FROM `vaulted-quarter-375910.Cyclistic.202202-divvy-tripdata`

I repeated all steps done using Excel to confirm results. Using SQL, I was able to deeper into the data review and gained a better understanding of issues picked with Excel. Data transformation included changing data types, removing duplicates, removing observations that were unaccounted for. Click here to view the SQL script and step-by-step process.

Advantages of BigQuery include: it was fast once all files were uploaded into the GCP and one can choose the option to auto-detect schema for the tables in the dataset. Which is a good thing especially if you’re not sure of the data types when creating a database. However the downside was, exporting my SQL script and merged data in order to use outside BigQuery was a challenge so I had to copy all code to an external document then save it.

Data was analyzed to determine the following for casual riders and annual members:

· Number of rides according to month

· Number of rides according to weekday

· Number of rides per start_hour

· Number of rides per end_hour

· Popularity of bike_type

· Total ride length per weekday

· Average ride length per weekday

· Descriptive statistics i.e max, min average

iii) Tableau — used for visualization

The 12 csv files were imported to Tableau Desktop and merged using union. Some cleaning was done since I already had insight from my SQL analysis. I used functions to extract the date and time from started_at and ended_at column, fix the null issue in start_station and end_station columns. Afterwards, I developed an interactive dashboard with findings from my analysis.

iv) R -used for processing, analysis & visualization

After importing the 12 files, I did some data wrangling and made some amazing discoveries which I hadn’t with the other 3 tools. I was able to comfortably account for all 5,667,717 observations. Having worked with the same dataset in Excel, SQL & Tableau made it even more fun and enabled me to get more insight. For the R script and change logs, interact with my Cyclistic repo in GitHub.

SHARE PHASE

Key findings

59% of the total riders i.e 5.6 million riders for Jan to Dec 2022 were annual members.

Most casual riders prefer electric bikes maybe because of their smooth transition, speed among other factors. Preference is almost the same for annual members.

Average length per weekday for casual riders in longer than annual riders. The members’ ride_length is almost consistent throughout the week, implying that they mostly they choose Cyclistic for their commuter needs to work.

There are more casual riders during weekends compared to other days, therefore implying that they use the bikes for leisure and to run errands which are most suitable for weekends.

Peak hours for annual members are 7–9 am & 4–6 pm. This could mean they mostly use the rides for commuting to work. As for casual members, the number steadily increases throughout the day and peaks at 3–6 pm which could mean they use the bikes to for short runs e.g to run errands. Same applies to end_hours.

Weather is a factor in bike usage which explains why it’s a bell curve. The normal distribution means that an increase in bike usage is experienced during summer months. A good number of the casual riders could be tourists.

More insights available in my interactive dashboard.

ACT PHASE

Recommendations

For the targeted marketing strategy, Moreno, the Executive team, Marketing team could consider the following:

i. Marketing the electric bikes more among the casual riders especially towards/during the weekends and towards/during summer months starting from April.

ii. Introduce a weekly subscription and make it easy to upgrade from being daily casual riders. After some time e.g quarterly performance can be reviewed before introducing monthly subscription in the 3rd quarter of the year.

iii. Make the prices more flexible and introduce discounts depending on ride_length especially during weekends.

iv. Partner with tourist attraction sites and leisure centres to get more casual riders to subscribe for annual membership at a discounted price.

— — — — —

Feel free to check out my portfolio for more details and follow the steps to reproduce your own analysis. If you’d like some collaboration feel free to engage me via LinkedIn. I’ll be glad to work with/for you.

— — — — —

--

--