Google Data Analytics: Case Study 2 (Using RStudio)

Joe Yong
13 min readMay 24, 2022

--

Hello again! This will be another documentation of how I approached the 2nd case study within the Google Data Analytics Professional Certificate on Coursera.

As usual, I will be showcasing my understanding of the data analysis process which is: Ask, Prepare, Process, Analyze, Share & Act.

I will be using RStudio & Tableau in this article. I will as well, publish a separate article as to how I used BigQuery SQL for the data processing phase.

Scenario

You are a junior data analyst working on the marketing analyst team at Bellabeat, a high-tech manufacturer of health-focused products for women. Urška Sršen (aliased as U.S, the cofounder & Chief Creative Officer) believes that analyzing smart device fitness data could help unlock new growth opportunities for the company.

You have been asked to focus on one of Bellabeat’s products and analyze smart device data to gain insight into how consumers are using their smart devices which will help guide the marketing strategy for the company.

Products:

  • Bellabeat app: Provides users with health data related to their activity, sleep, stress, menstrual cycle, and mindfulness habits.
  • Leaf: A wellness tracker that can be worn as a bracelet, necklace, or clip. Connects with the Bellabeat app to track activity, sleep, and stress.
  • Time: A wellness watch with smart technology to track user activity, sleep, and stress. Connects to the Bellabeat app to provide insight into daily wellness
  • Spring: A water bottle that tracks daily water intake using smart technology to ensure you are adequately hydrated. Connects to the Bellabeat app to track hydration levels.
  • Bellabeat membership: A subscription-based membership program that gives users 24/7 access to fully personalized guidance on nutrition, activity, sleep, health, beauty, and mindfulness-based on their lifestyle and goals

About the company

Bellabeat is a high-tech company that manufactures health-focused smart products which are beautifully developed by U.S that informs and inspires women around the world regarding their activity, sleep, stress & reproductive health.

Bellabeat has also invested in traditional advertising media, such as radio, out-of-home billboards, print, and television but focuses on digital marketing extensively such as investing in Google Search, maintaining active Facebook & Instagram pages, consistently engaging consumers on Twitter, running video ads on Youtube and displays ads on the Google Display Network to support campaigns around key marketing dates.

U.S knows that an analysis of Bellabeat’s available consumer data would reveal more opportunities for worth. She has asked the marketing analytics team to focus on a Bellabeat product and analyze smart device usage data in order to gain insight into how people are using their smart devices. Using this information, she would like high-level recommendations for how these trends can inform Bellabeat marketing strategy.

Ask

U.S asks you to analyze smart device usage data in order to gain insight into how consumers use non-Bellabeat smart devices. She wants you to select one Bellabeat product to apply these insights to your presentation. These questions will guide your analysis:

  1. What are some trends in smart device usage?
  2. How could these trends apply to Bellabeat customers?
  3. How could these trends help influence Bellabeat marketing strategy?

Prepare

We will be using Fitbit Fitness Tracker Data (here) which was made available by Mobius (here) that contains personal fitness tracker from thirty Fitbit users who consented to the submission of personal tracker data, which includes minute-level output for:

  • Physical activity
  • Heart rate
  • Sleep monitoring

It also includes information about daily activity, steps, and heart rate that can be used to explore user habits.

Since the files are already grouped in a folder, there’s no need to organize them. The names of the files are also fairly easy to recognize given the context of the data, so we will not be modifying them as well.

Process

Lets load install and load the necessary packages required for this process which would be: Tidyverse, Janitor, Lubridate & Skimr

Disclaimed: Sentences followed after # are comments/explanation and are not lines of code

#Installing the packages
install.packages('tidyverse')
install.packages('janitor')
install.packages('lubridate')
install.packages('skimr')
#Loading the packages
library(tidyverse)
library(janitor)
library(lubridate)
library(skimr)

After this, we would need to import the datasets into RStudio using read.csv(). I will also be making slight name changes as well

#df_name <- read.csv(dataset_location)
daily_activity <- read.csv("Fitbit Data/dailyActivity_merged.csv")
daily_sleep <- read.csv("Fitbit Data/sleepDay_merged.csv")
weight_log <- read.csv("Fitbit Data/weightLogInfo_merged.csv")

A disclaimer: my csv’s were located inside the folder “Fitbit Data” and my working directory is in “Google Case Study 2” to separate the file types as shown below:

Let's inspect our data to see if there are any errors with formatting by using str()

#str(dataframe_name)
str(daily_activity)
str(daily_sleep)
str(weight_log)

and we would get the following output:

After a brief view of the output, there are a few issues that we need to address:

  • The naming of the column names (camelCase)
  • daily_activity$ActivityDate — Is formatted as CHR not as a date format
  • daily_sleep$SleepDay — Is formatted as CHR not as a date format
  • weight_log$Date — Is formatted as CHR not as a date format
  • weight_log$IsManualReport is formated as CHR not logical (for boolean values)

To clean the column names, we would use clean_names()

daily_activity <- clean_names(daily_activity)
daily_sleep <- clean_names(daily_sleep)
weight_log <- clean_names(weight_log)

Lets also format daily_activity$ActivityDate, daily_sleep$SleepDay, weight_log$Date into the proper date format. using as.Date() & as.POSIXct()

format(as.Date()) & format(as.POSIXct()) is giving me errors which is why I use as.Date() & as.POSIXct() instead

For weight_log$date, it's a little tricky because if you look closely, there’s the PM indicator at the end. POSIX.ct does not recognize this and will return all values as NA, so we will need to use parse_date_time from Lubridate.

And to format weight_log$is_manual_report to a logical format, we will use as.logical()

After a quick look at our current data, let's add a day of the week, sedentary hours & total active hours column for further analysis in daily_activity. I will not be adding a month column since the dataset only provides information collected within a month.

Let's also add new columns which convert the current minutes of collection to hours and round it using round() in daily_sleep. I will also be adding a column to indicate the time taken to fall asleep in daily_sleep as well.

We will also be removing weight_log$fat, as it has little to no context and would not be helpful during the analysis phase by using select(-c())

Lastly, I will also be adding a new column in weight_log called bmi2 which will indicate whether the user is underweight, healthy, or overweight by using a line of code I recently learned about which is case_when!

more information about case_when here

A̶n̶a̶l̶y̶z̶e̶

Before we move onto the phase where we actually start to analyze the dataframe, we need to remove any outliers from the data.

In this case, let's remove rows in which the total_active_hours & calories burned are 0. The reasoning behind this is that we’re using data collected from Fitbits, which are wearables. If they don’t wear their smart devices it doesn’t collect information, hence we will remove the clutter from the data frame. Users might have also disabled GPS/accelerometer functions that allow for the collection of steps taken.

#In laymans term, '!' means is not equals to
daily_activity_cleaned <- daily_activity[!(daily_activity$calories<=0),]
daily_activity_cleaned <- daily_activity_cleaned[!(daily_activity_cleaned$total_active_hours<=0.00),]

If you’re using an external visualization tool such as Tableau or PowerBI, we need to export our dataframe using

write.csv(daily_activity_cleaned, file ='fitbit_daily_activity.csv')
write.csv(daily_sleep, file = 'fitbit_sleep_log.csv')
write.csv(weight_log, file = 'fitbit_weight_log.csv')

Analyze & Share (RStudio)

I will be using ggplot for this section of the analysis phase. I will also be including another section in which I used Tableau instead.

As per usual, let's revisit our business task to ensure we are not plotting or trying to hypothesize information/relationships which will not help in solving the business task which are:

  1. What are some trends in smart device usage?
  2. How could these trends apply to Bellabeat customers
  3. How could these trends help influence Bellabeat marketing strategy?

After having a brief view of the current data, I will be plotting a few observations revolving around:

  1. The average: Steps taken, sedentary hours, very active minutes & total hours asleep.
  2. Which days are users the most active.
  3. The relationship between total active hours, total steps taken, and sedentary hours against calories burned.
  4. The relationship between weight, total active hours & steps taken
  5. The number of overweight users

Let's have a quick look at the average steps taken, sedentary hours, very active minutes & total hours of sleep using summary().

With a brief view of the outputs above:

  • The average number of steps per day were 8319, which is within the 6000–8000 recommended steps per day, however, 25% of people do not hit that recommended quota.
  • The average sedentary hours were 15.87 hours, which is absurdly high, shattering the recommended limit of 7–10 hours
  • The average very active minutes also falls short of the recommended 30 minutes of vigorous exercise every day. Only 25% of people manage to hit this quota
  • The average hours spent asleep (6.9) also barely hits the quota of the recommended sleep time of 7–9 hours

Now let's have a look at which days are users most active:

which produces the following:

As we can see, the most active days for the Fitbit users were on Sunday, with a slow decline throughout the week. This could be due to motivation levels being fairly high during the end of the week.

Next, let's investigate the relationship between total active hours, total steps taken, and sedentary hours against calories burned by using the following:

Which produces the following:

At a glance, we can tell that there is a positive correlation between calories burned and total steps taken/total active hours. However, in the last chart, we can see that the correlation is confusing.

I was expecting an inverse relationship with the first 2 charts however I was wrong. The relationship between sedentary hours and calories burned was fairly positive up till about the 17-hour mark.

For the relationship between weight & physical activity we would use:

Which would produce:

From the chart above, we can infer that users weighing around 60kg & 85kg are the most active.

As a disclaimer, I will not be using/reccomending violin charts as they often communicate information differently than what we think, further explanation in the Tableau viz.

We will carry out descriptive analysis to observe how many overweight & healthy users by using the following

more information regarding distinct syntax here

Out of the 30 users, only 8 submitted their responses regarding weight. 5 users are overweight and only 3 are within the healthy BMI range of 18.5–24.9

Analyze & Share (Tableau)

Here are the visualizations I've made from Tableau. My findings are shown below:

Distribution of total steps taken
Distribution of time spent sedentary
Distribution of time spent engaged in vigorous activity
Distribution of time spent asleep

Above are the distributions of the selected variables. As shown:

  1. The majority of users have taken a total of 5000–10,000 steps with a sharp drop off after that.
  2. While a bit confusing with the multiple spikes in recorded counts, most users just simply spend too much time sedentary, mainly 10–21hours.
  3. Most users barely exercise as we can see in the huge spike in recorded counts near the 0 on the x-axis. Even then, most users spend about 20 minutes exercising and even see a sharp drop after 70 minutes
  4. While the average user gets a good amount of sleep, we have quite a few records in which users only get about 5.5 to barely 7 hours of sleep.
Total steps taken by weekday
Total minutes engaged in vigorous activity by weekday
Total calories burned by weekday

Something to take note of, these are a collection of information collected throughout a month, which is then grouped by the day .

These are the visualizations to find out the activity of users in order to identify which days they spend the most time being active.

Here we can see that they spend a lot of time engaged in physical activity starting from Sunday, which then slowly trails lower and lower. This could be due to the fact that motivation levels were higher on the weekends.

Relationship between calories burned and minutes engaged in vigorous activity
Relationship between calories burned and total steps taken
Relationship between calories burned and sedentary hours

Here we can see a positive trend with the first 2 charts, which indicates that the more time you spend engaged in physical activity, the more calories you tend to burn.

For the last chart, I was expecting an inverse relationship with the first 2 charts. However I was proven wrong, the data speaks for itself.

As a disclaimer, this only displays the relationship between 2 variables. We do not have height data, which means we cannot calculate BMR hence we cannot claim that walking x steps burns x calories, and can only hypothesize that walking more steps burns more calories.

I suspect that the calories column, is calories burned THROUGHOUT the day which would be TDEE. I’ve come to this conclusion because to burn 3000 calories, you would need to walk an equivalent of 100k steps.
More information here

Relationship between weight and time spent engaged in vigorous activity
Relationship between weight and total steps taken

The thicker the lines, the more recorded counts of activity

As you can see here, while the 2 violin charts are plotted differently, It is, in fact, exactly the same over here.

Violin charts often “smooth” the distribution of data to make it look more pleasing to the eye. The width of the violin plot doesn’t always equate to a bigger count, in fact, it will often mean that there is a “wider” distribution (min max).

As we can see from the 2 charts above, the most active users are within the 50kg–85kg. We also see a sharp decline in activity (physically and in count) for users over 90kg.

BMI of users

In the last chart, we have the BMI of users. Out of the 30 users, only 8 submitted their weight records of which 5 of them are overweight and only 3 have a healthy BMI.

Act

In the previous section of Analyze & Share, we have covered the 1st and 2nd business task which are:

  1. What are some trends in smart device usage
  2. How could these trends apply to Bellabeat customers (I believe that displaying the trends would already indicate how Bellabeat customers would follow suit.)

Based on my findings after my analysis, I would like to share my hypothesis on this matter.

  1. Users spend more time engaged in physical activity specifically on Sundays, which then proceeds to wane throughout the week with a slight peak on Thursdays which then sees a slow climb on Saturdays.
    I suspect that:

Motivation levels & free time are higher on the weekends, which would provide an opportunity for users to sneak in a workout.
As work load decreases, a window of opportunity to exercise would present itself in the midweek (Thursdays)
We see an alltime low of recorded activity on Friday’s due to the possibility of social engagement with friends/coworkers after working hours.

Now to answer the final business task, I would like to share my recommendations based on my findings to help influence Bellabeat’s marketing strategy.

  1. Bellabeat could host events limited to those that are enrolled in their Bellabeat memberships which would reward users who engage in a healthy lifestyle(IE 8k steps a day, less than 7 hours sedentary etc.) with points. With enough points, users could then use points to purchase products that help supplement a healthy lifestyle.
  2. Bellabeat could partner with brands (IE wellness, sports, health) to reward users who consistently engage in a healthy lifestyle with coupons/store discounts.
  3. With the 2 previous points combined, Bellabeat could select previously unhealthy individuals (who are now healthy), interview them and publish motivational videos as to how Bellabeat encouraged them to have a change in lifestyle.

Next, I would provide some general recommendations to further improve Bellabeat’s products:

  1. Bellabeat could implement personalized milestones, to encourage users to slowly engage in a more healthy lifestyle. A simple way of doing this is to create some sort of AI companion on the app/product that would be grumpy/sad if the user does not hit the milestone.
  2. Bellabeat could implement a simple reminder to inform users that they’ve been sedentary for too long by indefinitely vibrating the device until the device picks up movement/increase in heart rate, which would indicate that they’ve engaged in some sort of physical activity.

Additional remarks:

  • Bellabeat should require users to input their height and their activity levels so that BMR calculations and a more accurate calculation of TDEE would be possible. This would aid future analysis as well.
  • Bellabeat should create devices that would track sleep more sophisticatedly (IE REM sleep tracking, deep sleep tracking) to provide more insights into sleep health, as in the dataset provided, we only had the quantity of sleep, not the quality of sleep.

Authors note:

That concludes the case study! I hope i have helped shed some light as to how to approach this case study, as it was a real challenge for me! Figuring out an answer for the marketing question was real tough for me!

--

--

Joe Yong

Ex cafe manager, looking to start a career as a data analyst!