Bellabeat App Behavior Analysis: Insights & Trends ⌚️

Published in Towards Data Engineering · 19 min read · Jan 1, 2024

In this blog post, I will document how I approached the Bellabeat case study as part of the Google Data Analytics Professional Certificate on Coursera. I will walk through my thought process, the steps I took, and the tools I used. I will also share some of the insights I gained from the case study.

The Bellabeat case study is a great opportunity to learn about data analysis and how it can be used to make business decisions. The case study provides a dataset of customer activity data, and the goal is to use this data to answer a series of business questions.

I approached the case study by first understanding the business questions that needed to be answered. Then, I explored the data to see what insights I could find. I used a variety of tools, including RStudio, Google Cloud Platform, and Tableau, to analyze the data.

I learned a lot from the Bellabeat case study. I learned how to ask the right questions, explore data, and use data to make business decisions. I also learned about the different tools that can be used for data analysis.

I will showcase my understanding of the data analysis process, which consists of the following phases:

Scenario
Ask
Prepare
Process
Analyze
Share
Act

Scenario

You are a junior data analyst working on the marketing analyst team at Bellabeat, a high-tech manufacturer of health-focused products for women. Urška Sršen (abbreviated below as U.S., the co-founder and Chief Creative Officer) believes that analyzing smart device fitness data could help unlock new growth opportunities for the company.

You have been asked to focus on one of Bellabeat’s products and analyze smart device data to gain insight into how consumers are using their smart devices, which will help guide the marketing strategy for the company.

Products

Bellabeat app: Provides users with health data related to their activity, sleep, stress, menstrual cycle, and mindfulness habits.
Leaf: A wellness tracker that can be worn as a bracelet, necklace, or clip. Connects with the Bellabeat app to track activity, sleep, and stress.
Time: A wellness watch with smart technology to track user activity, sleep, and stress. Connects to the Bellabeat app to provide insight into daily wellness.
Spring: A water bottle that tracks daily water intake using smart technology to ensure you are adequately hydrated. Connects to the Bellabeat app to track hydration levels.
Bellabeat membership: A subscription-based membership program that gives users 24/7 access to fully personalized guidance on nutrition, activity, sleep, health, beauty, and mindfulness based on their lifestyle and goals.

About the company

Bellabeat is a high-tech company that manufactures health-focused smart products. Beautifully designed and developed by U.S., these products inform and inspire women around the world regarding their activity, sleep, stress & reproductive health.

Bellabeat has also invested in traditional advertising media, such as radio, out-of-home billboards, print, and television, but focuses extensively on digital marketing: investing in Google Search, maintaining active Facebook & Instagram pages, consistently engaging consumers on Twitter, and running video ads on YouTube and display ads on the Google Display Network to support campaigns around key marketing dates.

U.S. knows that an analysis of Bellabeat’s available consumer data would reveal more opportunities for growth. She has asked the marketing analytics team to focus on a Bellabeat product and analyze smart device usage data to gain insight into how people are using their smart devices. Using this information, she would like high-level recommendations for how these trends can inform Bellabeat’s marketing strategy.

Ask

U.S. asks you to analyze smart device usage data to gain insight into how consumers use non-Bellabeat smart devices. She then wants you to select one Bellabeat product to apply these insights to in your presentation.

Question

These questions will guide your analysis:

1. What are some trends in smart device usage?
2. How could these trends apply to Bellabeat customers?
3. How could these trends help influence Bellabeat’s marketing strategy?

Prepare

We will be using the Fitbit Fitness Tracker Data (here), which contains personal fitness tracker data from thirty Fitbit users who consented to the submission of their personal tracker data, including minute-level output for:

Physical activity
Heart rate
Sleep monitoring

It also includes information about daily activity, steps, and heart rate that can be used to explore user habits.

Since the files are already grouped in a folder, there’s no need to organize them. The names of the files are also fairly easy to recognize given the context of the data, so we will not be modifying them either.

ROCCC analysis

Reliability: LOW. The dataset was collected from only 30 individuals whose gender is unknown.
Originality: LOW. Third-party data collected using Amazon Mechanical Turk.
Comprehensive: MEDIUM. The dataset contains multiple fields on daily activity intensity, calories used, daily steps taken, daily sleep time, and weight records.
Current: MEDIUM. The data was collected in 2016, but people’s daily habits do not change much over a few years.
Cited: HIGH. The data collector and the source are well documented.

Process

Let’s install and load the packages required for this process:

Tidyverse
Janitor
Lubridate
Skimr

# Install the packages (the function is install.packages(), with an "s")
install.packages("tidyverse")
install.packages("janitor")
install.packages("skimr")
install.packages("lubridate")

# Load the packages
library(tidyverse)
library(janitor)
library(skimr)
library(lubridate)
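If you rerun the script often, reinstalling every package each time is slow. Here is a minimal sketch of a check for what is still missing. The helper name missing_packages is my own and hypothetical; it is not part of any package.

```r
# Hypothetical helper: returns the packages from `pkgs` that are not installed yet
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

needed <- c("tidyverse", "janitor", "skimr", "lubridate")
to_install <- missing_packages(needed)

# Only point at install.packages() when something is actually missing
if (length(to_install) > 0) {
  message("Run install.packages() for: ", paste(to_install, collapse = ", "))
}
```

Note that requireNamespace() loads a package's namespace as a side effect, which is harmless here because we load the packages right afterwards anyway.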

Before importing the dataset, we set the working directory to the folder that holds the files. In RStudio, the menu path is:

Session > Set Working Directory > Choose Directory

The keyboard shortcut for choosing a directory is Ctrl + Shift + H.

The resulting command will be displayed in the console like this:

# Set the working directory
setwd("/cloud/project")

After this, we would need to import the datasets into RStudio using “read.csv()”.

#df_name <- read.csv(dataset_location)
daily_activity <- read.csv("Fitbit Data/dailyActivity_merged.csv")
daily_sleep <- read.csv("Fitbit Data/sleepDay_merged.csv")
weight_log <- read.csv("Fitbit Data/weightLogInfo_merged.csv")
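A wrong folder name is an easy mistake to make at this step. As a sketch only, a small wrapper can build the path with file.path() and stop early when the file is missing. The helper name read_fitbit_csv and the demo file are hypothetical and use made-up values.

```r
# Hypothetical helper: read one CSV from a data folder, stopping early if it is missing
read_fitbit_csv <- function(folder, file_name) {
  path <- file.path(folder, file_name)
  if (!file.exists(path)) {
    stop("File not found: ", path)
  }
  read.csv(path)
}

# Demonstration with a throwaway file in a temporary folder (made-up values)
demo_folder <- tempdir()
write.csv(data.frame(Id = c(1, 2), TotalSteps = c(100, 200)),
          file.path(demo_folder, "demo_activity.csv"), row.names = FALSE)

demo_activity <- read_fitbit_csv(demo_folder, "demo_activity.csv")
nrow(demo_activity)
```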

Let’s inspect our data to see if there are any errors with formatting by using “str()”.

#str(dataframe_name)
str(daily_activity)
str(daily_sleep)
str(weight_log)

and we would get the following output:

> str(daily_activity)
'data.frame': 940 obs. of  15 variables:
 $ Id                      : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
 $ ActivityDate            : chr  "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
 $ TotalSteps              : int  13162 10735 10460 9762 12669 9705 13019 15506 10544 9819 ...
 $ TotalDistance           : num  8.5 6.97 6.74 6.28 8.16 ...
 $ TrackerDistance         : num  8.5 6.97 6.74 6.28 8.16 ...
 $ LoggedActivitiesDistance: num  0 0 0 0 0 0 0 0 0 0 ...
 $ VeryActiveDistance      : num  1.88 1.57 2.44 2.14 2.71 ...
 $ ModeratelyActiveDistance: num  0.55 0.69 0.4 1.26 0.41 ...
 $ LightActiveDistance     : num  6.06 4.71 3.91 2.83 5.04 ...
 $ SedentaryActiveDistance : num  0 0 0 0 0 0 0 0 0 0 ...
 $ VeryActiveMinutes       : int  25 21 30 29 36 38 42 50 28 19 ...
 $ FairlyActiveMinutes     : int  13 19 11 34 10 20 16 31 12 8 ...
 $ LightlyActiveMinutes    : int  328 217 181 209 221 164 233 264 205 211 ...
 $ SedentaryMinutes        : int  728 776 1218 726 773 539 1149 775 818 838 ...
 $ Calories                : int  1985 1797 1776 1745 1863 1728 1921 2035 1786 1775 ...

> str(daily_sleep)
'data.frame': 413 obs. of  5 variables:
 $ Id                : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
 $ SleepDay          : chr  "4/12/2016 12:00:00 AM" "4/13/2016 12:00:00 AM" "4/15/2016 12:00:00 AM" "4/16/2016 12:00:00 AM" ...
 $ TotalSleepRecords : int  1 2 1 2 1 1 1 1 1 1 ...
 $ TotalMinutesAsleep: int  327 384 412 340 700 304 360 325 361 430 ...
 $ TotalTimeInBed    : int  346 407 442 367 712 320 377 364 384 449 ...

> str(weight_log)
'data.frame': 67 obs. of  8 variables:
 $ Id            : num  1.50e+09 1.50e+09 1.93e+09 2.87e+09 2.87e+09 ...
 $ Date          : chr  "5/2/2016 11:59:59 PM" "5/3/2016 11:59:59 PM" "4/13/2016 1:08:52 AM" "4/21/2016 11:59:59 PM" ...
 $ WeightKg      : num  52.6 52.6 133.5 56.7 57.3 ...
 $ WeightPounds  : num  116 116 294 125 126 ...
 $ Fat           : int  22 NA NA NA NA 25 NA NA NA NA ...
 $ BMI           : num  22.6 22.6 47.5 21.5 21.7 ...
 $ IsManualReport: chr  "True" "True" "False" "True" ...
 $ LogId         : num  1.46e+12 1.46e+12 1.46e+12 1.46e+12 1.46e+12 ...

After a brief view of the output, there are a few issues that we need to address:

The column names are in PascalCase rather than snake_case
daily_activity$ActivityDate is stored as chr, not as a date
daily_sleep$SleepDay is stored as chr, not as a date
weight_log$Date is stored as chr, not as a date-time
weight_log$IsManualReport is stored as chr, not as logical (for boolean values)
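str() shows types but not duplicated rows or missing values. A minimal sketch of two extra checks in base R, run on a toy data frame with made-up values rather than the real files:

```r
# Toy data frame with one duplicated row and one missing value (made-up values)
toy_sleep <- data.frame(
  id = c(1, 1, 2, 3),
  total_minutes_asleep = c(327, 327, 412, NA)
)

# Number of fully duplicated rows
n_duplicates <- sum(duplicated(toy_sleep))

# Number of missing values per column
na_per_column <- colSums(is.na(toy_sleep))

n_duplicates   # 1
na_per_column  # id: 0, total_minutes_asleep: 1
```

The same two lines can be pointed at daily_activity, daily_sleep, and weight_log.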

To clean the column names, we would use “clean_names()”.

#Change the column name style
daily_activity <- clean_names(daily_activity)
daily_sleep <- clean_names(daily_sleep)
weight_log <- clean_names(weight_log)
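To get a feel for what the renaming does to these particular columns, here is a rough base R approximation. It is only a sketch for the simple names in this dataset, not janitor's actual algorithm, and the helper name to_snake_case is hypothetical.

```r
# Rough approximation only: insert "_" between a lower-case letter or digit and an
# upper-case letter, then lower-case everything
to_snake_case <- function(x) {
  tolower(gsub("([a-z0-9])([A-Z])", "\\1_\\2", x))
}

to_snake_case(c("ActivityDate", "TotalSteps", "IsManualReport", "BMI"))
# "activity_date" "total_steps" "is_manual_report" "bmi"
```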

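Let’s also convert ‘daily_activity$ActivityDate’ and ‘daily_sleep$SleepDay’ (now ‘activity_date’ and ‘sleep_day’) into a proper date format using “as.Date()”. ‘weight_log$Date’ is handled separately right after.

# Convert strings into dates using as.Date()
# %Y matches the four-digit year; %y would read only "20" and return the year 2020
daily_activity$activity_date <- as.Date(daily_activity$activity_date, format = '%m/%d/%Y')
# The trailing "12:00:00 AM" in sleep_day is ignored by as.Date()
daily_sleep$sleep_day <- as.Date(daily_sleep$sleep_day, format = '%m/%d/%Y')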
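The year specifier matters more than it looks, because it also changes the weekday that is derived later on. A small sketch in base R, using the first date from the str() output above:

```r
raw_date <- "4/12/2016"

# %Y expects a four-digit year
four_digit <- as.Date(raw_date, format = "%m/%d/%Y")

# %y expects a two-digit year, so it reads only "20" and ignores the trailing "16"
two_digit <- as.Date(raw_date, format = "%m/%d/%y")

format(four_digit)  # "2016-04-12"
format(two_digit)   # "2020-04-12"

# ISO weekday number (1 = Monday, 7 = Sunday), independent of locale
format(four_digit, "%u")  # "2" (Tuesday)
format(two_digit, "%u")   # "7" (Sunday)
```

A quick range(daily_activity$activity_date) after the conversion is a cheap way to confirm the years are what you expect.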

For weight_log$date, it’s a little trickier because, if you look closely, there’s an AM/PM indicator at the end. My attempt with as.POSIXct() returned NA for every value, so we will use parse_date_time() from lubridate instead.

# Change string to date-time using parse_date_time()
weight_log$date <- parse_date_time(weight_log$date, '%m/%d/%y %H:%M:%S %p')

str(weight_log)
'data.frame': 67 obs. of  8 variables:
 $ id              : num  1.50e+09 1.50e+09 1.93e+09 2.87e+09 2.87e+09 ...
 $ date            : POSIXct, format: "2016-05-02 23:59:59" "2016-05-03 23:59:59" "2016-04-13 01:08:52" "2016-04-21 23:59:59" ...
 $ weight_kg       : num  52.6 52.6 133.5 56.7 57.3 ...
 $ weight_pounds   : num  116 116 294 125 126 ...
 $ fat             : int  22 NA NA NA NA 25 NA NA NA NA ...
 $ bmi             : num  22.6 22.6 47.5 21.5 21.7 ...
 $ is_manual_report: chr  "True" "True" "False" "True" ...
 $ log_id          : num  1.46e+12 1.46e+12 1.46e+12 1.46e+12 1.46e+12 ...
 
# As you can see, the date column is now POSIXct instead of chr.
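For readers who prefer base R, here is an alternative sketch. It assumes an English (or C) locale, since the AM/PM indicator is locale dependent, and it uses %I (12-hour clock) together with %p. The lubridate approach above remains the one used in this post.

```r
raw_stamp <- "5/2/2016 11:59:59 PM"

# %I is the hour on a 12-hour clock and %p the AM/PM indicator (locale dependent)
parsed <- as.POSIXct(raw_stamp, format = "%m/%d/%Y %I:%M:%S %p", tz = "UTC")

format(parsed, "%Y-%m-%d %H:%M:%S")
```

If this prints NA on your machine, the locale's AM/PM strings are the first thing I would check.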

To convert weight_log$is_manual_report to a logical type, we will use “as.logical()”.

# Convert string into logical (the syntax mirrors the other as.<type>() functions)
weight_log$is_manual_report <- as.logical(weight_log$is_manual_report)

str(weight_log)
'data.frame': 67 obs. of  8 variables:
 $ id              : num  1.50e+09 1.50e+09 1.93e+09 2.87e+09 2.87e+09 ...
 $ date            : POSIXct, format: "2016-05-02 23:59:59" ...
 $ weight_kg       : num  52.6 52.6 133.5 56.7 57.3 ...
 $ weight_pounds   : num  116 116 294 125 126 ...
 $ fat             : int  22 NA NA NA NA 25 NA NA NA NA ...
 $ bmi             : num  22.6 22.6 47.5 21.5 21.7 ...
 $ is_manual_report: logi  TRUE TRUE FALSE TRUE TRUE TRUE ...
 $ log_id          : num  1.46e+12 1.46e+12 1.46e+12 1.46e+12 1.46e+12 ...
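as.logical() only understands a small set of spellings, so it is worth knowing what happens to anything else. A tiny sketch (the value "yes" is made up for illustration and does not occur in the dataset):

```r
converted <- as.logical(c("True", "False", "TRUE", "yes"))
converted
# TRUE FALSE TRUE NA
```

Unrecognized strings become NA instead of raising an error, so a quick sum(is.na(weight_log$is_manual_report)) after the conversion is a sensible check.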

After a quick look at our current data, let’s add a day of the week, sedentary hours & total active hours column for further analysis in daily_activity. I will not be adding a month column since the dataset only provides information collected within a month.

Let’s also add new columns that convert the current minutes of collection to hours and round it using round() in daily_sleep. I will also be adding a column to indicate the time taken to fall asleep in daily_sleep as well.

We will also remove weight_log$fat using “select(-c())”, as it is mostly NA, comes with little to no context, and would not be helpful during the analysis phase.

# round() rounds off a number; the syntax is round(object, digits = x)
# select(-c(column_name)) removes a column from a data frame

# Day of the week as an abbreviated label (Sun, Mon, ...)
daily_activity$day_of_week <- wday(daily_activity$activity_date, label = TRUE, abbr = TRUE)
# Total active hours = very + fairly + lightly active minutes, converted to hours
daily_activity$total_active_hours <- round((daily_activity$very_active_minutes + daily_activity$fairly_active_minutes + daily_activity$lightly_active_minutes) / 60, digits = 2)
daily_activity$sedentary_hours <- round(daily_activity$sedentary_minutes / 60, digits = 2)

daily_sleep$hours_in_bed <- round(daily_sleep$total_time_in_bed / 60, digits = 2)
daily_sleep$hours_asleep <- round(daily_sleep$total_minutes_asleep / 60, digits = 2)
# Minutes spent in bed but not asleep
daily_sleep$time_taken_to_sleep <- daily_sleep$total_time_in_bed - daily_sleep$total_minutes_asleep

# Remove the fat column from weight_log
weight_log <- weight_log %>%
  select(-c(fat))

str(weight_log)
'data.frame': 67 obs. of  7 variables:
 $ id              : num  1.50e+09 1.50e+09 1.93e+09 2.87e+09 2.87e+09 ...
 $ date            : POSIXct, format: "2016-05-02 23:59:59" ...
 $ weight_kg       : num  52.6 52.6 133.5 56.7 57.3 ...
 $ weight_pounds   : num  116 116 294 125 126 ...
 $ bmi             : num  22.6 22.6 47.5 21.5 21.7 ...
 $ is_manual_report: logi  TRUE TRUE FALSE TRUE TRUE TRUE ...
 $ log_id          : num  1.46e+12 1.46e+12 1.46e+12 1.46e+12 1.46e+12 ...
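To sanity-check the new sleep columns, here is the arithmetic worked through for the first row of daily_sleep shown earlier (327 minutes asleep, 346 minutes in bed):

```r
total_minutes_asleep <- 327
total_time_in_bed <- 346

hours_asleep <- round(total_minutes_asleep / 60, digits = 2)     # 5.45
hours_in_bed <- round(total_time_in_bed / 60, digits = 2)        # 5.77
time_taken_to_sleep <- total_time_in_bed - total_minutes_asleep  # 19 minutes
```

Note that time_taken_to_sleep is really "minutes in bed but not asleep", so it also includes time awake during the night or after waking up.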

Lastly, I will add a new column in weight_log called bmi2, which indicates whether the user is underweight, healthy, or overweight, by using a function I recently learned about: “case_when”!

#case_when is an R equivalent of SQL's CASE WHEN
#The last TRUE is basically the R equivalent of SQL's ELSE

weight_log <- weight_log %>% 
  mutate(bmi2 = case_when(
    bmi > 24.9 ~ 'Overweight',
    bmi < 18.5 ~ 'Underweight',
    TRUE ~ 'Healthy'
  ))
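One thing to watch with a catch-all TRUE branch: case_when() treats an NA condition as FALSE, so a missing BMI would fall through and be labelled 'Healthy'. As a sketch, here are the same rules in base R, where a missing BMI stays NA. The helper name classify_bmi is hypothetical, and the 17.0 and NA inputs are made up for illustration.

```r
# Base R sketch of the same rules; a missing BMI stays NA here
classify_bmi <- function(bmi) {
  ifelse(bmi > 24.9, "Overweight",
         ifelse(bmi < 18.5, "Underweight", "Healthy"))
}

classify_bmi(c(22.6, 47.5, 17.0, NA))
# "Healthy" "Overweight" "Underweight" NA
```

In this dataset the bmi values shown above are all filled in, so the result is the same either way.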

Remove Outliers

Before we move on to the analysis phase, we need to remove any outliers from the data.

In this case, let’s remove rows in which either calories burned or total_active_hours is 0. The reasoning behind this is that we’re using data collected from Fitbits, which are wearables. If users don’t wear their smart devices, no information is collected, hence we will remove that clutter from the data frame. Users might have also disabled the GPS/accelerometer functions that allow for the collection of steps taken.

# '!' is the logical NOT operator, so each line keeps the rows where the condition is not true
daily_activity_cleaned <- daily_activity[!(daily_activity$calories<=0),]
daily_activity_cleaned <- daily_activity_cleaned[!(daily_activity_cleaned$total_active_hours<=0.00),]
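The same filter can be written in one step with subset(). A minimal sketch on a toy data frame with made-up values, which also reports how many rows were dropped:

```r
# Toy data: the second and third rows should be removed (made-up values)
toy_activity <- data.frame(
  calories = c(1985, 0, 1776),
  total_active_hours = c(6.10, 0.00, 0.00)
)

# Keep only rows where both calories and active hours are above zero
toy_cleaned <- subset(toy_activity, calories > 0 & total_active_hours > 0)

rows_removed <- nrow(toy_activity) - nrow(toy_cleaned)
rows_removed  # 2
```

Printing the number of removed rows makes it easy to spot a filter that is stricter than intended.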

Analyze (RStudio)

I will be using ggplot for this section of the analysis phase. I will also be including another section in which I used Tableau instead.

As usual, let’s revisit our business task to make sure we are not plotting or hypothesizing about relationships that will not help solve it:

What are some trends in smart device usage?
How could these trends apply to Bellabeat customers?
How could these trends help influence Bellabeat’s marketing strategy?

After having a brief view of the current data, I will be plotting a few observations revolving around:

The average: Steps taken, sedentary hours, very active minutes & total hours asleep.
Which days are users the most active?
The relationship between total active hours, total steps taken, and sedentary hours against calories burned.
The relationship between weight, total active hours & steps taken
The number of overweight users

Let’s have a quick look at the average steps taken, sedentary hours, very active minutes & total hours of sleep using “summary()”.

> summary(daily_activity_cleaned$total_steps)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      0    4920    8053    8319   11100   36019 
      
      
> summary(daily_activity_cleaned$sedentary_hours)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00   12.02   17.00   15.87   19.80   23.98 
   
   
> summary(daily_activity_cleaned$very_active_minutes)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00    0.00    7.00   23.21   36.00  210.00 
   
   
> summary(daily_sleep$hours_asleep)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.970   6.020   7.220   6.992   8.170  13.270

With a brief view of the outputs above:

The average number of steps per day was 8319, which is slightly above the recommended 6000–8000 steps per day. However, the first quartile is 4920, so at least 25% of the daily records fall short of that recommendation.
The average sedentary time was 15.87 hours per day, which is far above the recommended limit of 7–10 hours.
The average very active minutes (23.21) also falls short of the recommended 30 minutes of vigorous exercise every day. With a third quartile of 36 minutes, only around 25% of the daily records comfortably hit this quota.
The average time spent asleep (6.99 hours) sits just below the lower bound of the recommended 7–9 hours of sleep.
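Quartiles only hint at how many records meet a guideline. The share can be computed directly by taking the mean of a logical vector. As a sketch, the values below are just the first ten rows from the str() output earlier, so they describe a single user and are not representative of the full dataset:

```r
# First ten values of each column, copied from the str() output above
total_steps <- c(13162, 10735, 10460, 9762, 12669, 9705, 13019, 15506, 10544, 9819)
very_active_minutes <- c(25, 21, 30, 29, 36, 38, 42, 50, 28, 19)

# mean() of a logical vector gives the proportion of TRUE values
share_steps_ok <- mean(total_steps >= 6000)          # 1
share_active_ok <- mean(very_active_minutes >= 30)   # 0.5
```

Running the same two lines on daily_activity_cleaned would give the exact share of daily records that reach each target.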

Now let’s have a look at which days users are most active:

# options(scipen=) will remove any scientific notations

options(scipen = 999)
ggplot(data = daily_activity_cleaned) +
  aes(x = day_of_week, y = total_steps) +
  geom_col(fill =  'red') +
  labs(x = 'Day of week', y = 'Total steps', title = 'Total steps taken in a week')
ggsave('total_steps.png')

ggplot(data = daily_activity_cleaned) +
  aes(x = day_of_week, y = very_active_minutes) +
  geom_col(fill =  'green') +
  labs(x = 'Day of week', y = 'Total very active minutes', title = 'Total activity in a week')
ggsave('total_activity.png')

ggplot(data = daily_activity_cleaned) +
  aes(x = day_of_week, y = calories) +
  geom_col(fill =  'blue') +
  labs(x = 'Day of week', y = 'Calories burned', title = 'Total calories burned in a week')
ggsave('total_calories.png')

which produces the following:

As we can see, the most active days for Fitbit users were Sundays, with a slow decline throughout the week. This could be due to motivation levels being fairly high during the end of the week.
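One caveat when reading these bars: geom_col() stacks every record for a weekday, so the bar height is a sum, and a weekday with more records will look taller even if each day was less active. A sketch with made-up numbers comparing totals against per-day averages:

```r
# Toy data: three quiet Sundays and one busy Monday (made-up values)
toy_days <- data.frame(
  day_of_week = c("Sun", "Sun", "Sun", "Mon"),
  total_steps = c(5000, 5000, 5000, 9000)
)

# Totals favour the weekday with more records...
steps_total <- tapply(toy_days$total_steps, toy_days$day_of_week, sum)
# ...while averages compare a typical day on each weekday
steps_mean <- tapply(toy_days$total_steps, toy_days$day_of_week, mean)

steps_total  # Mon 9000, Sun 15000
steps_mean   # Mon 9000, Sun 5000
```

Checking table(daily_activity_cleaned$day_of_week) shows how many records each weekday contributes.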

Next, let’s investigate the relationship between total active hours, total steps taken, and sedentary hours against calories burned by using the following:

#We use geom_smooth() to help reveal a pattern when there is overplotting (when the points are too scattered or too close together)

ggplot(data = daily_activity_cleaned) +
  aes(x= total_active_hours, y = calories) +
  geom_point(color = 'red') +
  geom_smooth() +
  labs(x = 'Total active hours', y = 'Calories burned', title = 'Calories burned vs active hours')
ggsave('calories_burned_vs_active_hours.png')

ggplot(data = daily_activity_cleaned) +
  aes(x= total_steps, y = calories) +
  geom_point(color = 'orange') +
  geom_smooth() +
  labs(x = 'Total steps', y = 'Calories burned', title = 'Calories burned vs total steps')
ggsave('calories_burned_vs_total_steps.png')

ggplot(data = daily_activity_cleaned) +
  aes(x= sedentary_hours, y = calories) +
  geom_point(color = 'purple') +
  geom_smooth() +
  labs(x = 'Sedentary hours', y = 'Calories burned', title = 'Calories burned vs sedentary hours')
ggsave('sedentary_hours_vs_calories_burned.png')

Which produces the following:

At a glance, we can tell that there is a positive correlation between calories burned and both total steps taken and total active hours. In the last chart, however, the correlation is much less clear.

I was expecting the last chart to show the inverse of the first 2 charts, however I was wrong. The relationship between sedentary hours and calories burned was fairly positive up until about the 17-hour mark.
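A chart shows the direction of a relationship, and cor() puts a number on it. As a sketch, the values below are the first ten rows of total steps and calories from the str() output earlier, so the result describes only that small sample:

```r
# First ten values of each column, copied from the str() output above
total_steps <- c(13162, 10735, 10460, 9762, 12669, 9705, 13019, 15506, 10544, 9819)
calories <- c(1985, 1797, 1776, 1745, 1863, 1728, 1921, 2035, 1786, 1775)

# Pearson correlation: close to 1 means a strong positive linear relationship
r_steps_calories <- cor(total_steps, calories)
r_steps_calories
```

For this sample the coefficient comes out strongly positive. On the full data, cor(daily_activity_cleaned$sedentary_hours, daily_activity_cleaned$calories) would quantify the weaker third relationship.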

For the relationship between weight & physical activity, we would use the code below. I am crediting the internet for this part of the analysis, as I didn’t know how to create a violin chart.

#Let's merge the tables so we can carry out plotting. I am merging on id only and ignoring the date, as merging on the date wiped my data frame clean even after formatting and renaming.
activity_weight <- merge(daily_activity_cleaned, weight_log, by = c('id'))

# Each chart gets its own file name so the second ggsave() does not overwrite the first
ggplot(data = activity_weight) +
  aes(x = very_active_minutes, y = weight_kg) +
  geom_violin(fill = 'blue') +
  labs(x = 'Very active minutes', y = 'Weight(kg)', title = 'Relationship between weight and physical activity')
ggsave('weight_vs_very_active_minutes.png')

ggplot(data = activity_weight) +
  aes(x = total_steps, y = weight_kg) +
  geom_violin(fill = 'purple') +
  labs(x = 'Total steps', y = 'Weight(kg)', title = 'Relationship between weight and physical activity')
ggsave('weight_vs_total_steps.png')

Which would produce:

From the chart above, we can infer that users weighing around 60kg & 85kg are the most active.
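About the merge on id alone: it pairs every activity row of a user with every weight entry of that user, which inflates the record counts. A plausible reason the date merge came back empty, which I am assuming rather than confirming, is a type mismatch: activity_date is a Date while weight_log$date is a POSIXct date-time. A sketch on toy data with made-up values:

```r
# Toy data: one user, two activity days, one weight entry (made-up values)
toy_activity <- data.frame(
  id = c(1, 1),
  activity_date = as.Date(c("2016-05-02", "2016-05-03")),
  total_steps = c(13162, 10735)
)
toy_weight <- data.frame(
  id = 1,
  date = as.POSIXct("2016-05-02 23:59:59", tz = "UTC"),
  weight_kg = 52.6
)

# Merging on id alone pairs every activity row with every weight row of that user
by_id <- merge(toy_activity, toy_weight, by = "id")

# Convert the date-time to a plain Date so both key columns share a type
toy_weight$log_date <- as.Date(toy_weight$date, tz = "UTC")
by_id_and_date <- merge(toy_activity, toy_weight,
                        by.x = c("id", "activity_date"),
                        by.y = c("id", "log_date"))

nrow(by_id)           # 2
nrow(by_id_and_date)  # 1
```

With the date included, each weight entry lines up only with the activity recorded on that same day.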

We will carry out a descriptive analysis to count how many users are healthy, underweight, or overweight by using the following:

#The number of healthy users
nrow(filter(distinct(weight_log, id, .keep_all = TRUE), bmi2 == 'Healthy'))
[1] 3

#The number of underweight users
nrow(filter(distinct(weight_log, id, .keep_all = TRUE), bmi2 == 'Underweight'))
[1] 0

#The number of overweight users
nrow(filter(distinct(weight_log, id, .keep_all = TRUE), bmi2 == 'Overweight'))
[1] 5

#To illustrate what .keep_all does: it keeps the rest of the columns for each distinct id
distinct(weight_log, id, .keep_all = TRUE)
          id                date weight_kg weight_pounds   bmi is_manual_report        log_id       bmi2
1 1503960366 2016-05-02 23:59:59      52.6      115.9631 22.65             TRUE 1462233599000    Healthy
2 1927972279 2016-04-13 01:08:52     133.5      294.3171 47.54            FALSE 1460509732000 Overweight
3 2873212765 2016-04-21 23:59:59      56.7      125.0021 21.45             TRUE 1461283199000    Healthy
4 4319703577 2016-04-17 23:59:59      72.4      159.6147 27.45             TRUE 1460937599000 Overweight
5 4558609924 2016-04-18 23:59:59      69.7      153.6622 27.25             TRUE 1461023999000 Overweight
6 5577150313 2016-04-17 09:17:55      90.7      199.9593 28.00            FALSE 1460884675000 Overweight
7 6962181067 2016-04-12 23:59:59      62.5      137.7889 24.39             TRUE 1460505599000    Healthy
8 8877689391 2016-04-12 06:47:11      85.8      189.1566 25.68            FALSE 1460443631000 Overweight

Out of the 30 users, only 8 submitted their responses regarding weight. 5 users are overweight and only 3 are within the healthy BMI range of 18.5–24.9.
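The three separate counts can also be produced in one go with table(). As a sketch, the labels below are copied from the bmi2 column of the distinct() output above:

```r
# bmi2 labels of the 8 users, copied from the distinct() output above
bmi2 <- c("Healthy", "Overweight", "Healthy", "Overweight",
          "Overweight", "Overweight", "Healthy", "Overweight")

counts <- table(bmi2)         # Healthy 3, Overweight 5
shares <- prop.table(counts)  # Healthy 0.375, Overweight 0.625
```

Categories with no users, such as Underweight here, simply do not appear in the table.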

Share (Tableau)

Here are the visualizations I’ve made from Tableau. My findings are shown below:

Distribution of total steps taken

Distribution of time spent sedentary

Distribution of time spent engaged in vigorous activity

Distribution of time spent asleep

Above are the distributions of the selected variables. As shown:

The majority of users have taken a total of 5000–10,000 steps with a sharp drop off after that.
While the multiple spikes in recorded counts are a bit confusing, most users simply spend too much time sedentary, mainly 10–21 hours.
Most users barely exercise, as we can see from the huge spike in recorded counts near 0 on the x-axis. Even those who do mostly spend about 20 minutes exercising, with a sharp drop after 70 minutes.
While the average user gets a good amount of sleep, we have quite a few records in which users only get about 5.5 to barely 7 hours of sleep.

Total steps taken by weekday

Total minutes engaged in vigorous activity by weekday

Total calories burned by weekday

Something to take note of: these charts use information collected throughout a month, which is then grouped by day of the week.

These visualizations show user activity by weekday, to identify which days users spend the most time being active.

Here we can see that users spend a lot of time engaged in physical activity on Sunday, after which activity slowly trails lower and lower. This could be because motivation levels were higher on the weekends.

Relationship between calories burned and minutes engaged in vigorous activity

Relationship between calories burned and total steps taken

Relationship between calories burned and sedentary hours

Here we can see a positive trend with the first 2 charts, which indicate that the more time you spend engaged in physical activity, the more calories you tend to burn.

For the last chart, I was expecting the inverse of the relationship in the first 2 charts. However, I was proven wrong; the data speaks for itself.

As a disclaimer, these charts only display the relationship between 2 variables. We do not have height data, which means we cannot calculate BMR (basal metabolic rate). Hence we cannot claim that walking x steps burns y calories, and can only hypothesize that walking more steps burns more calories.
I suspect that the calories column is the calories burned THROUGHOUT the day, which would be TDEE (total daily energy expenditure). I’ve come to this conclusion because to burn 3000 calories through walking alone, you would need to walk the equivalent of roughly 100k steps.
More information here
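The reasoning behind that suspicion can be written out as arithmetic. The per-step figure below is simply what my 3000 calories per 100k steps estimate implies; it is an assumption for illustration, not a measured value:

```r
# Rate implied by the estimate above: 3000 calories for 100,000 steps
calories_target <- 3000
steps_needed <- 100000
kcal_per_step <- calories_target / steps_needed  # 0.03

# Calories that the average day (8319 steps) would explain at that rate
mean_steps <- 8319
kcal_from_mean_steps <- mean_steps * kcal_per_step  # about 250
```

Roughly 250 calories from walking is far below the daily values in the calories column (for example 1985 in the first row), which is consistent with the column recording the whole day's expenditure.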

Relationship between weight and time spent engaged in vigorous activity

Relationship between weight and total steps taken

The thicker the violin, the more records there are around that value.
As you can see here, while the 2 violin charts plot different variables, the picture they show is, in fact, exactly the same.
Violin charts “smooth” the distribution of the data to make it more pleasing to the eye. The width of a violin is therefore a smoothed estimate rather than an exact count, and a tall violin simply means a “wider” spread between the min and max.

As we can see from the 2 charts above, the most active users are within 50kg–85kg. We also see a sharp decline in activity (physically and in count) for users over 90kg.

In the last chart, we have the BMI of users. Out of the 30 users, only 8 submitted their weight records of which 5 of them are overweight and only 3 have a healthy BMI.

Act

In the previous section of Analyze & Share, we have covered the 1st and 2nd business tasks which are:

What are some trends in smart device usage
How could these trends apply to Bellabeat customers (I believe that displaying the trends would already indicate how Bellabeat customers would follow suit.)

Based on my findings after my analysis, I would like to share my hypothesis on this matter.

Users spend the most time engaged in physical activity on Sundays. Activity then wanes throughout the week, with a slight peak on Thursdays and a slow climb again on Saturdays.
I suspect that:

Motivation levels & free time are higher on the weekends, which gives users an opportunity to sneak in a workout.
As workload decreases, a window of opportunity to exercise presents itself midweek (Thursdays).
We see an all-time low of recorded activity on Fridays, possibly due to social engagements with friends/coworkers after working hours.

To answer the final business task, I would like to share my recommendations based on my findings to help influence Bellabeat’s marketing strategy.

Bellabeat could host events limited to those enrolled in a Bellabeat membership, rewarding users who maintain a healthy lifestyle (e.g. 8k steps a day, less than 7 hours sedentary, etc.) with points. With enough points, users could then purchase products that help supplement a healthy lifestyle.
Bellabeat could partner with brands (e.g. wellness, sports, health) to reward users who consistently engage in a healthy lifestyle with coupons/store discounts.
With the 2 previous points combined, Bellabeat could select previously unhealthy individuals (who are now healthy), interview them, and publish motivational videos about how Bellabeat encouraged them to change their lifestyle.

Next, I would provide some general recommendations to further improve Bellabeat’s products:

Bellabeat could implement personalized milestones, to encourage users to slowly engage in a more healthy lifestyle. A simple way of doing this is to create some AI companion on the app/product that would be grumpy/sad if the user does not hit the milestone.
Bellabeat could implement a simple reminder to inform users that they’ve been sedentary for too long by indefinitely vibrating the device until the device picks up movement/increase in heart rate, which would indicate that they’ve engaged in some sort of physical activity.

Additional remarks:

Bellabeat should require users to input their height and activity levels so that BMR calculations and a more accurate calculation of TDEE would be possible. This would aid future analysis as well.
Bellabeat should create devices that track sleep in more detail (e.g. REM sleep tracking, deep sleep tracking) to provide more insights into sleep health, as the dataset provided only had the quantity of sleep, not the quality of sleep.

Here’s the R code file → R code File.

If you want the R Markdown PDF file → R Markdown PDF.

Here is the raw Tableau workbook → Tableau Workbook.

In this workbook, some visuals differ from the graphs used in this blog, but this does not impact the overall conclusion.

If you want to use these files, feel free to use them in your capstone projects.

Happy analyzing!

If you’re interested in connecting with me, you can find me through the following link: Click here to get in touch