Google Data Analytics Capstone Case Study: Cyclistic

Darryl Brown
6 min readFeb 26, 2023

--

Scenario

Stylistic is a bike share company based in Chicago. The director of marketing, Lily Moreno, believes the company’s future is based on maximizing annual memberships over casual hires and I have been tasked with analysing the data and providing insight to the current stakeholders in order to design a new marketing strategy.

Ask

My deliverable here is to produce a clear statement of the business task. The problem is understanding how annual members and casual riders use Cyclistic services differently. I must also consider my stakeholders who are Lily Moreno, director of marketing and the marketing analytics team as well as the Cyclistic executive team.

I will be required to provide insight into why casual riders would purchase annual memberships and how digital media could be used to influence this.

Therefore, the business task is.

To analyze how usage of Cyclistic bikes differ between annual and casual riders and provide insight into how and why we can convert casual riders into annual memberships.

Prepare

The dataset used is available here

After downloading the data, it is stored appropriately in a working folder on my computer. I reviewed one of the months of data in Microsoft Excel as they are in .csv format. Initially, I noticed approximately 180,000 rows so I made the decision to complete some processing in Excel as I inspected each of the 12 .csv files to ensure the layout and data was consistent.

My initial observations highlighted the following.

  1. There are 19 columns of data
  2. Columns 9–12 contained latitude/longtitude data and would not be required and could be removed
  3. The date and time column was combined and would need to be split
  4. Ride/hire duration would need to be calculated
  5. There was a substantial amount of missing data for the start_station_name, start_station_id, end_station_name and end_station_id. I would need to decide on if this data is necessary to provide analysis or if safe to remove.
  6. I would need to review the balance of annual/casual members to consider bias
  7. The ride_id is a unique reference and not required.

Process

In Excel I calculated a few more columns. These columns could have been created using a variety of methods and also pre-post merge. I decided that as I would verify each month of data out of the 12 it was sensible to do this in Excel pre-merge whilst inspecting.

snapshot of .csv data

Ride duration =D2-C2
Weekday =WEEKDAY(C2)
Start time =TIME(HOUR(C2),MINUTE(C2), SECOND(C2))
End time =TIME(HOUR(D2),MINUTE(D2), SECOND(D2))
Total days =DAYS(D2,C2)
Duration Minutes =N2*1440 (view as number)

These formulas were copied in and filled. Further processing of the data in R was the next step I knew I was working with tabular data that had a consistent structure.

R packages used

I have imported the relevant libraries needed to complete a few functions in R to work with merging and writing an output file. The following code (credit to Joachim Schork) allows the 12 files to be merged to create a combined dataset for a 12 month period roughly covering Jan-Dec 2022.

R code to merge .xlsx

I was able to view the data in R to determine how it would need to be cleaned and transformed and I identified the following.

My merged file was now complete and ready

Finally the following code writes an output file to a .csv file.

R code to write to .csv

I decided to continue processing the data in Power BI. I chose Power BI instead of Tableau and R at this stage based on the advantages that Power Query would have with my analysis as I was not sure at this stage how I would use the station data if at all and Power Query makes it easy to go back, change or add steps in the transformation stage.

In Power Query I completed the following steps

  1. Checked column names were headers for my columns
  2. Checked the datatypes for each column and made sure they were correct e.g. date/time/whole number
  3. Deleted my start & end time columns I created in Excel as they had created an error and were now displaying as whole numbers instead of time. I duplicated the started_at and ended_at columns and changed the datatypes so I now had separate start and end date and atime columns.
  4. Reordered the columns for structure
  5. Renamed some columns to more appropriate headers e.g started_at became start_date
  6. Deleted the latitude and longitude columns
  • At this stage I have opted to keep the null data for the stations in the dataset so I can perform some analysis however I will replace blank values with ‘unknown’
Power Query also documents data processing steps

The final processing stage is to convert the weekday in my data from a number to a day. Although this could also have been completed in Power Query I decided to create a weekday table in excel and add to my data

Weekday table

Once the weekday table was added it was connected to my data model in Power Pivot and the day(number) was connected to the weekday which will help with my visualizations.

Data model

Some quick Exploratory Data Analysis (EDA) revealed a small error where it appeared the start date and end date had probably been keyed backwards resulting in a hire period of negative 7 days. This row was deleted.

Analyze & Share

The analysis was completed in Power BI

--

--