Google Coursera Data Analysis Capstone (Cyclistic) — With R

Jonathan Shmulovich
Mar 9, 2023


Today I revisit an old project I once did with SQL and Tableau: analyzing the data for Cyclistic, a fictitious Chicago bike rental service. This time, however, I am doing the entire project in R, from cleaning to visualizing to analyzing. The goal of this project is to take the data on all ~5.5 million rides in 2022 and uncover insights about members and casual riders that can help us convert more casual riders into members (the company sees this as the best path to maximizing profits). You can find the project instructions in this link, and the data for each month in this link:

DATA IMPORTING & CLEANING

First, I install the necessary packages and load the relevant libraries:

install.packages(c("tidyverse","lubridate","hms","data.table","skimr","hydroTSM","gridExtra","R.utils","scales"))
install.packages("geosphere", repo="http://cran.r-project.org", dep=T)
packages <- c("tidyverse","lubridate","hms","data.table","skimr","hydroTSM","gridExtra","geosphere","R.utils","scales")
lapply(packages, require, character.only = TRUE)

Next, I set my working directory as the folder I made to hold each month’s data: 2022_Bike_Data.

Then I make a list of all the CSV file names, use lapply() to read each one with fread(), and bind the resulting dataframes into a single dataframe: df.

setwd("/Users/jonathanshmulovich/Downloads/2022_Bike_Data")
files <- list.files(pattern = "\\.csv$")
df <- rbindlist(lapply(files, fread))
df <- rename(df, bike_type = rideable_type, member_type = member_casual)
str(df)

We now have a dataframe to work with for the rest of the project, but we still need to add a few columns and fix a few issues:

Now It’s Time To Manipulate Our Data

First, add the five simpler columns, delete duplicate rows, and drop rows with missing values:

# ride_length is just the difference between started_at and ended_at, and as_hms makes it in hh:mm:ss format
df <- df %>% mutate(ride_length = as_hms(ended_at - started_at))

# !duplicated() keeps only the first occurrence of each ride_id, removing duplicate rides
df <- df[!duplicated(df$ride_id), ]

# here we add a column that returns the string value for day of the week the ride started
df <- df %>% mutate(day_of_week = weekdays(started_at))

#here we extract the month from the started_at datetime of the ride in a new column
df <- df %>% mutate(ride_month = format(started_at, "%m"))

#here we get the hour in a new column based on the started_at column
df <- df %>% mutate(ride_hour = format(started_at, "%H"))

#we get the season of the ride in a new column
df <- df %>% mutate(ride_season = time2season(started_at,out.fmt = "seasons"))

# remove rows with missing (NA) values; na.omit() is a redundant second pass kept for safety
df <- df[complete.cases(df),]
df <- na.omit(df)

str(df)

Now it’s time to add one more column, which is not as simple as the others: ride distance. This uses geosphere’s distGeo(), which takes matrices of start and end coordinates built with matrix(). After we get the distances in meters, we convert them to kilometers.

df$ride_distance <- distGeo(matrix(c(df$start_lng, df$start_lat), ncol = 2), matrix(c(df$end_lng, df$end_lat), ncol = 2))
df$ride_distance <- df$ride_distance / 1000
str(df)

Distance did not end up offering much in the way of actionable differences between members and casual riders (casual riders ride slightly longer distances on average, no matter how you slice the data: by day of the week, month, hour, etc.). Still, having this column will let someone else explore it for themselves in the future (if Cyclistic were a real company).

Below, we remove all rows with an unrealistically short duration (under 60 seconds), which probably do not represent real rides.

After that, we remove outliers using the 1st and 3rd quartiles of ride_length (the standard 1.5 × IQR rule):

df <- df[df$ride_length >=60, ]
min(df$ride_length)

Q1 <- quantile(df$ride_length, .25)
Q3 <- quantile(df$ride_length, .75)
IQR <- IQR(df$ride_length)
df <- subset(df, (df$ride_length > (Q1 - (1.5 * IQR))) & (df$ride_length < (Q3 + (1.5 * IQR))))
str(df)

Next, let’s remove outliers for distance. A distance of 0 is not automatically grounds for rejection, because someone might have rented the bike for exercise and returned it to the same station, as I sometimes did with Citi Bikes in NYC. However, if the ride covered a distance of 0 and lasted less than 10 minutes, I have reason to believe the rider changed their mind or the rental was an error.

Q1 <- quantile(df$ride_distance, .25)
Q3 <- quantile(df$ride_distance, .75)
IQR <- IQR(df$ride_distance)
df <- subset(df, (df$ride_distance > (Q1 - (1.5 * IQR))) & (df$ride_distance < (Q3 + (1.5 * IQR))))
df<- df[!((df$ride_distance == 0) & (df$ride_length < 600)), ]
str(df)

We see that the row count has shrunk dramatically, to about 4.8 million rows, meaning there were a lot of outliers and unrealistically short rides (note that a value of 0 did not itself qualify as an outlier under the IQR rule, despite being the shortest possible distance or duration).

Next, we squish out any extra spaces in the start and end station names. As extra assurance, I also delete all rows with blank station names:

library(stringr)
df$start_station_name <- str_squish(df$start_station_name)
df$end_station_name <- str_squish(df$end_station_name)
df <- df[(df$start_station_name != "") & (df$end_station_name != ""), ]
str(df)

Lastly, we save this refined dataframe to a CSV. That way, if something goes wrong later (say, your laptop battery dies), we don’t need to re-combine the original 12 files and clean and manipulate them all over again:

fwrite(df, "bike_data_sample_final.csv")
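If I come back to this later, I can reload the cleaned file instead of rebuilding it from the 12 monthly files. A minimal sketch (note that the CSV round trip doesn’t preserve every column class, so ride_length may need re-conversion):

# reload the cleaned data in a later session
df <- fread("bike_data_sample_final.csv")

# ride_length may come back as text ("HH:MM:SS") or as numeric seconds,
# depending on how the hms column was serialized; handle either case
df$ride_length <- if (is.character(df$ride_length)) {
  as_hms(df$ride_length)
} else {
  as_hms(as.numeric(df$ride_length))
}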

ANALYSIS & VISUALIZATION

While I looked at this data from plenty of angles, I am showing you the analyses and plots that give us the most actionable information.

Let’s first get some basics, like ride count by member type:
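The calc_2 dataframe plotted below is just a per-member-type ride count summary; a minimal sketch to build it would be:

# assumed definition of calc_2: total ride count per member type
calc_2 <- df %>%
  group_by(member_type) %>%
  summarise(ride_count = n())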

ggplot(data = calc_2, mapping = aes(x = member_type, y = ride_count, fill = member_type)) +
  geom_bar(stat = 'identity') +
  scale_y_continuous(labels = comma) +
  labs(x = "Member Type", y = "Ride Count", title = "Ride Count by Member Type")

As we see above, members account for roughly 3 million of the 4.8 million rides, or about 60%, while casual riders account for the remaining ~40%.
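For the exact shares rather than eyeballing the chart, a quick check off of the calc_2 summary above would be:

# proportion of total rides by member type
calc_2 %>%
  mutate(pct = ride_count / sum(ride_count))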

And to get a sense of the distribution of our numeric data, we can use skim_without_charts():
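A minimal call looks like this (the full skim output isn’t reproduced here):

# summary statistics for every column, without the inline histograms
skim_without_charts(df)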

Remember, we removed the outliers, which is why the max values for ride_distance and ride_length aren’t anything too extreme.

Also, let’s check the distribution of the bike types:

bike_counts <- df %>%
  group_by(member_type, bike_type) %>%
  summarise(count = n())

ggplot(data = bike_counts, mapping = aes(x = bike_type, y = count, fill = bike_type)) +
  geom_bar(stat = 'identity', position = 'dodge') +
  facet_wrap(~member_type) +
  scale_y_continuous(labels = comma) +
  labs(x = "Bike Type", y = "Count", title = "Bike Type Count by Member Type")

cas_count <- bike_counts %>%
  filter(member_type == "casual")
print(cas_count)
ggplot(cas_count, aes(x="",y=count, fill = bike_type)) + 
geom_bar(width = 1, size = 1, color = "white", stat= "identity") +
coord_polar("y") +
geom_text(aes(label = paste0(round((count*100)/sum(count)),"%")), position = position_stack(vjust=.5)) +
labs(x = NULL, y = NULL, fill = NULL, title = "Distribution of Bike Type Amongst Casual Riders") +
guides(fill = guide_legend(reverse = TRUE)) +
theme_classic() +
theme(axis.line = element_blank(), axis.text = element_blank(), axis.ticks = element_blank(), plot.title = element_text(hjust = .5, color = "saddlebrown"))

With the broad summary covered, we can go into more specific analysis. I made many visualizations, but these are the ones I found had the most actionable insights, starting with ride counts by day of the week.

To analyze this, the first step is to convert the day_of_week variable to a factor, so that the days are not treated as arbitrary strings but as ordered levels from Sunday through Saturday:

df$day_of_week <- factor(df$day_of_week, levels = c("Sunday","Monday","Tuesday","Wednesday","Thursday","Friday","Saturday"))

After that, we can plot it:

ggplot(df,aes(x=day_of_week, fill=member_type)) +
geom_bar() +
scale_y_continuous(labels=comma) +
labs(x="Days of the week", y = "Ride Count", title = "Ride Count by Day of Week and Member Type")

Insight #1: It is clear that casual riders take far fewer rides than members during the week, especially on Tuesday and Wednesday. Thus, sending a membership invite to casual riders who ride on weekdays could give us the most bang for our buck.
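To put rough numbers behind this, a quick check of the casual share of rides by day of the week would look something like:

# share of each day's rides taken by casual riders vs. members
df %>%
  count(day_of_week, member_type) %>%
  group_by(day_of_week) %>%
  mutate(share = n / sum(n))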

Let’s now look at these metrics by hour of the day.

calc_3 <- df %>%
  group_by(ride_hour, member_type) %>%
  summarise(average_duration = mean(ride_length) / 60,
            average_distance = mean(ride_distance),
            ride_count = n())

print(calc_3,n=48)

Let’s first check the ride count by hour:

ggplot(calc_3, aes(x=ride_hour, y=ride_count, group=member_type)) +
geom_line(aes(color=member_type)) +
geom_point(aes(color=member_type)) +
labs(x="Hour", y = "Ride Count",
title = "Ride Count By Hour and Member Type")
Here we see that membership rides understandably peak during rush hour before and after work. A closer look shows that at 5PM member rides aren’t even twice that of casual riders. However, at 8 AM, member rides are about 3x that of casual rides. Thus, I would as a starting experiment, recruit casual riders during morning rush hour rather than evening rush hour.
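To read those ratios off directly instead of from the plot, a sketch using tidyr’s pivot_wider() (loaded with the tidyverse) would be:

# member-to-casual ride count ratio for each hour of the day
calc_3 %>%
  select(ride_hour, member_type, ride_count) %>%
  pivot_wider(names_from = member_type, values_from = ride_count) %>%
  mutate(member_to_casual = member / casual)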

Let’s check the range of durations now:

ggplot(df, aes(x = ride_month, y = (ride_length / 60), fill = member_type)) +
  geom_boxplot() +
  labs(x = "Month", y = "Ride Duration (min)",
       title = "Duration Distribution by Month and Member Type")

Casual riders have a much larger interquartile range (IQR), meaning their ride durations vary more. A larger IQR also means that a long ride is less likely to count as an outlier, which is why casual riders show relatively few outlier points compared to members (we removed outliers from the original dataset, but the boxplots flag new ones within each group). It may therefore be worth targeting casual riders whose rides are long enough to be flagged as outliers by casual-rider standards, because rides that long are far more typical of members. And if the service charges by the mile or the minute, a casual rider who takes one long ride already has a real financial incentive to become a member, even if they won’t ride that long again for a while!
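If you wanted a concrete cutoff for "outlier-length for a casual rider," a sketch computing the upper whisker (Q3 + 1.5 × IQR) of casual ride durations, in minutes, would be:

# upper-whisker duration threshold for casual riders, in minutes
casual_min <- as.numeric(df$ride_length[df$member_type == "casual"]) / 60
q1 <- quantile(casual_min, .25)
q3 <- quantile(casual_min, .75)
casual_cutoff <- q3 + 1.5 * (q3 - q1)
casual_cutoff

Anything above casual_cutoff would be a reasonable trigger for a membership offer.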

Now let’s see what the most common start stations are for both types of riders, because if there are stations members often start from that casual riders rarely use, we may want the app to automatically send an invite to casual riders who start a ride there.

station_counts <- df %>%
  group_by(member_type, start_station_name) %>%
  summarise(ride_count = n())
station_counts <- data.frame(station_counts)

stations_top <- station_counts %>%
  arrange(desc(ride_count)) %>%
  group_by(member_type) %>%
  slice(1:20)

stations_filter <- stations_top %>%
  group_by(start_station_name) %>%
  summarise(count = n())

stations_filter <- stations_filter %>%
  filter(count == 1)
station_name_vector <- c(stations_filter$start_station_name)

stations_top_final <- stations_top %>%
  filter(start_station_name %in% station_name_vector) %>%
  arrange(desc(member_type))

print(stations_top_final, n = 24)

Of the top 20 start stations for members and for casual riders, the table below lists only the stations that made the top 20 for a single member_type. I filtered it this way because a casual rider who starts a ride at one of the member-only stations is riding a route popular with members, and could be prioritized for membership advertising.

For example, Kingsbury St & Kinzie St made the top 20 for members but not for casual riders. Thus, casual riders who start a ride at Kingsbury St & Kinzie St may be more convertible to membership.
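To turn this into an actual target list, a sketch building on stations_top_final would be:

# start stations that made the top 20 only for members
member_only_stations <- stations_top_final %>%
  filter(member_type == "member") %>%
  pull(start_station_name)

# casual rides that began at one of those member-popular stations
target_rides <- df %>%
  filter(member_type == "casual",
         start_station_name %in% member_only_stations)
nrow(target_rides)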

SUMMARY & RECOMMENDATIONS

First, I must say that with more data I could do even more to recruit members. For example, if each ride included a user_id, I could recommend membership to casual riders who took, say, 3+ rides in a month or a quarter (see the sketch after the list below). But given the data we are provided, I suggest the following to increase membership:

  1. Invite casual riders who took a ride on Tuesday or Wednesday, since those are the days when casual rides make up the smallest share of total rides.
  2. If you really want to home in on the promotion, direct it not just at Tuesday and Wednesday casual riders, but at those who took their ride during the morning rush hour.
  3. Invite any casual rider who took an extra-long ride by casual-rider standards (25+ minutes, say).
  4. Invite casual riders who started their ride at one of the especially member-popular stations in the table above, like "Kingsbury St & Kinzie St" or "Clinton St & Washington Blvd".
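And regarding the user_id idea from earlier: the provided data has no such column, but if it did, a purely hypothetical sketch for flagging casual riders with 3+ rides in a month would be:

# hypothetical: flag casual riders with 3+ rides in any given month
# (assumes a user_id column that the real dataset does not contain)
frequent_casuals <- df %>%
  filter(member_type == "casual") %>%
  group_by(user_id, ride_month) %>%
  summarise(rides = n()) %>%
  filter(rides >= 3) %>%
  distinct(user_id)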
