Google Capstone Project: How Can Bellabeat, a Wellness Technology Company, Play It Smart?

Sandeep Singh
11 min read · May 9, 2023

This is an optional case study from the capstone project of the Google Data Analytics course. I will perform data analysis for Bellabeat, a high-tech manufacturer of health-focused products for women, analyzing smart device data to gain insight into how consumers use their smart devices. The analysis will help guide future marketing strategies for the Bellabeat team. Along the way, I will perform numerous real-world tasks of a junior data analyst by following the steps of the data analysis process: Ask, Prepare, Process, Analyze, Share, and Act.

Step-1 Ask

The first phase of the analysis is called “problem definition” or “problem identification.” It involves defining the problem, identifying objectives and constraints, and specifying the desired outcome or solution. This phase sets the foundation for the rest of the case study and helps to ensure clarity and alignment among stakeholders. This case study is guided by the following questions:

  1. What are some trends in smart device usage?
  2. How could these trends apply to Bellabeat customers?
  3. How could these trends help influence Bellabeat marketing strategy?

The business task is to analyze data collected from FitBit fitness trackers in order to gain insight into how the FitBit app is used and to identify trends that can inform Bellabeat’s marketing strategy.

Deliverables:

  • A clear summary of the business task
  • A description of all data sources used
  • Documentation of any cleaning or manipulation of data
  • A summary of the analysis
  • Supporting visualizations and key findings
  • High-level content recommendations based on the analysis

Key Stakeholders

  • Urška Sršen: Bellabeat’s co-founder and Chief Creative Officer
  • Sando Mur: Mathematician, Bellabeat’s cofounder and key member of the Bellabeat executive team
  • Bellabeat marketing analytics team: A team of data analysts guiding Bellabeat’s marketing strategy.

Step-2 Prepare

During the Prepare stage, we identify the data being used and any limitations associated with it.

2.1 Information on data source:

Data is publicly available on Kaggle as “Fitbit Fitness Tracker Data” and is stored in 18 CSV files.

It was generated by respondents to a survey distributed via Amazon Mechanical Turk between 12 March 2016 and 12 May 2016.

30 FitBit users consented to the submission of personal tracker data.

2.2 Limitations of Data Set:

The data was collected in 2016, several years before this analysis. Users’ daily activity, fitness and sleeping habits, diet and food consumption may have changed since then, so the data may not be timely or relevant.

A sample of 30 FitBit users is not representative of the entire fitness population. As the data was collected through a survey, we are unable to ascertain its integrity or accuracy.

2.3 Is Data ROCCC?

A good data source is ROCCC, which stands for Reliable, Original, Comprehensive, Current, and Cited.

  • Reliable: LOW. The sample size is small, with only 30 respondents.
  • Original: LOW. The data was provided by a third party, Amazon Mechanical Turk.
  • Comprehensive: MED. The parameters match most of Bellabeat products’ parameters.
  • Current: LOW. The data dates from 2016 and may no longer be relevant.
  • Cited: LOW. The data was collected by a third party, so its origin is unknown.

Overall, the dataset is of low quality, and basing business recommendations solely on this data is not recommended.

2.4 Data Selection

The following file is selected and copied for analysis:

dailyActivity_merged.csv

2.5 Tools

We are using R for data cleaning, transformation and visualisation.

Step-3 Process

In this step, the data will be processed by performing various actions such as cleaning, verifying its correctness, relevance, and completeness, and removing any errors and outliers.

  • Explore and observe data
  • Check for and treat missing or null values
  • Transform data — format data type
  • Perform preliminary statistical analysis

3.1 Preparing the Environment

```{r warning=TRUE, include=FALSE}
library('tidyverse') #Transform and better present data.
library('skimr') #Provide summary statistics about variables in data frames, tibbles, data tables and vectors
library('janitor') #Used for Examining and cleaning dirty data
library('dplyr') #Used for data manipulation
library('lubridate') #Functions to work with date-times and time-spans
library('ggplot2') #Data visualization package for the statistical programming language R
library('treemap') #Function to display hierarchical data as a set of nested rectangles
```

3.2 Importing data set

# read.csv() reads the CSV file into a data frame
dataset <- read.csv("dailyActivity_merged.csv", header = TRUE, sep = ",")
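
For readers who do not have the Kaggle file at hand, the import step can be tried on a tiny stand-in file. This is only a sketch: the three rows are made up, and the raw header names (`Id`, `ActivityDate`, `TotalSteps`, `Calories`) are assumed to mirror the real file.

```r
# Write a tiny stand-in CSV to a temporary file (made-up rows, assumed header)
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Id,ActivityDate,TotalSteps,Calories",
  "1001,4/12/2016,13162,1985",
  "1001,4/13/2016,10735,1797",
  "1002,4/12/2016,0,0"
), csv_path)

# Stop early with an error if the file is missing
stopifnot(file.exists(csv_path))

# Read it the same way as the real file
sample_data <- read.csv(csv_path, header = TRUE, sep = ",")

dim(sample_data)    # number of rows and columns
names(sample_data)  # column names as they appear in the file
```

Checking `file.exists()` before reading is optional, but it gives a quicker signal than a failed `read.csv()` when the working directory is not where the file was saved.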

3.3 Data Cleaning and manipulation

To become acquainted with the data, perform the following steps:

  • Observe the data carefully and become familiar with it.
  • Check for any null or missing values in the data.
  • Perform a sanity check of the data.

To gain familiarity with the data, it is advised to preview the first 10 rows of the data.

# Preview the first 10 rows of the dataset with head() to become familiar with the data
head(dataset, 10)

This is only a screenshot of the table. The full table is here; the same applies to all tables and visuals below.

Cleaning the column names:

# clean_names() converts the column names to snake_case
dataset <- janitor::clean_names(dataset)
result <- data.frame(colnames(dataset))
result

Column names after cleaning the dataset
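
To see roughly what this renaming does, here is a base R sketch that turns CamelCase names into snake_case. The helper `to_snake_case()` is hypothetical and only approximates `clean_names()` for simple names; it is not how janitor implements the conversion. The raw names are assumed examples.

```r
# Hypothetical helper: approximates clean_names() for simple CamelCase names
to_snake_case <- function(x) {
  # Insert an underscore between a lowercase letter or digit and an uppercase letter
  x <- gsub("([a-z0-9])([A-Z])", "\\1_\\2", x, perl = TRUE)
  tolower(x)
}

# Assumed examples of raw column names
raw_names <- c("Id", "ActivityDate", "TotalSteps", "VeryActiveMinutes")
to_snake_case(raw_names)
# "id" "activity_date" "total_steps" "very_active_minutes"
```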

Next, I’ll check if there are any missing or null values present in the data.

# Count missing values per column: is.na() flags them and colSums() adds them up
result <- data.frame(colSums(is.na(dataset)))
result

The screenshot below shows the output of the code chunk above.

Count of null values present in each column
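
Since the real dataset has no missing values, the effect of this check is easier to see on a small made-up data frame that does contain some. The sketch below uses invented numbers only.

```r
# Made-up data frame with two missing values
toy <- data.frame(
  id = c(1, 2, 3),
  total_steps = c(5000, NA, 7200),
  calories = c(NA, 1800, 2100)
)

# is.na() gives a TRUE/FALSE matrix; colSums() counts the TRUEs per column
missing_per_column <- colSums(is.na(toy))
missing_per_column

# Rows that contain at least one missing value
rows_with_missing <- toy[!complete.cases(toy), ]
nrow(rows_with_missing)
```

Here `total_steps` and `calories` each report one missing value, and two rows are incomplete.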

To obtain fundamental details about the dataset, we can determine:

  • The number of rows and columns
  • The names of the columns
  • The count of non-null values
  • The data type of the variables in the dataset.

# Obtain basic details of the dataset
dim(dataset)  # number of rows and columns
non_null_count <- colSums(!is.na(dataset))
data_type <- sapply(dataset, class)
result <- data.frame(non_null_count, data_type)
result

Table showing the non-null count and data type of each column

Finding Out Unique IDs

#Finding Out Unique ID
unique_id<-unique(dataset$id)
unique_ct<-length(unique_id)
print(paste("Unique IDs:", unique_ct))
Table Shows Number of Unique Id Present in dataset
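
Beyond counting distinct IDs, it can help to see how many records each ID contributes, for example to spot users who logged only a few days. A minimal sketch on made-up IDs:

```r
# Made-up IDs: three users with different numbers of logged days
toy_ids <- c(101, 101, 101, 102, 102, 103)

# Number of distinct users
length(unique(toy_ids))

# Number of records per user, sorted from fewest to most
records_per_id <- sort(table(toy_ids))
records_per_id
```

On the real data the same idea would be `sort(table(dataset$id))`.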

Based on the above observations, we can conclude that:

  1. The dataset contains no null or missing values, as indicated by the ‘Non-Null Count’.
  2. The dataset has 15 columns and 940 rows.
  3. The ‘ActivityDate’ column is read as text rather than as a date and needs to be converted to R’s Date type.
  4. There are 33 unique IDs instead of the expected 30, which may be due to users creating additional IDs during the survey period.

Now that we have identified the problem areas in the data, we can perform data manipulation/transformation. This includes:

  1. Converting the ‘ActivityDate’ column to the Date type.
  2. Reformatting the ‘ActivityDate’ column to yyyy-mm-dd.
  3. Creating a new ‘DayOfTheWeek’ column (activity_Day in the code) by generating the day of the week from the ‘ActivityDate’ column for further analysis.
  4. Creating a new ‘TotalMins’ column (totactive_Minutes in the code) that sums up the ‘VeryActiveMinutes’, ‘FairlyActiveMinutes’, ‘LightlyActiveMinutes’, and ‘SedentaryMinutes’.
  5. Creating a new ‘TotalHours’ column (totactive_Hours in the code) by converting the ‘TotalMins’ column to the number of hours.
  6. Rearranging and renaming columns as needed.

To start with, we will convert the ‘ActivityDate’ column from text to the Date type, which R displays as yyyy-mm-dd. After this step, we will confirm whether the changes have been made successfully.

# Convert activity_date from text to the Date class (displayed as YYYY-MM-DD)
dataset$activity_date <- as.Date(dataset$activity_date, format = "%m/%d/%Y")
# Reprint the column classes to confirm the data type of activity_date
data_type <- data.frame(sapply(dataset, class))
data_type

Confirming the data type of the activity_date column

# Create a day-of-week column from the activity date
activity_Day <- wday(dataset$activity_date, label = TRUE)
dataset['activity_Day'] <- activity_Day

# Create a column for total logged minutes across all activity levels
totactive_Minutes <- (dataset$very_active_minutes + dataset$fairly_active_minutes +
  dataset$lightly_active_minutes + dataset$sedentary_minutes)
dataset['totactive_Minutes'] <- totactive_Minutes

# Create a column for total logged hours, rounded up to the next whole hour
totactive_Hours <- ceiling(totactive_Minutes / 60)
dataset['totactive_Hours'] <- totactive_Hours

View(dataset)

The data has undergone the necessary cleaning and manipulation processes and is now prepared for analysis.
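
The transformations can be sanity-checked on a single made-up row. The sketch below uses base R only, so the weekday is derived with `as.POSIXlt()` rather than lubridate's `wday()`; the row values are invented for illustration.

```r
# One made-up row using the cleaned column names from this article
toy <- data.frame(
  activity_date = "4/12/2016",
  very_active_minutes = 25,
  fairly_active_minutes = 13,
  lightly_active_minutes = 328,
  sedentary_minutes = 728
)

# Parse month/day/year text into a Date (printed as YYYY-MM-DD)
toy$activity_date <- as.Date(toy$activity_date, format = "%m/%d/%Y")

# Weekday as a number, 0 = Sunday ... 6 = Saturday (independent of locale)
toy$weekday_number <- as.POSIXlt(toy$activity_date)$wday

# Total minutes across the four intensity columns, then hours rounded up
toy$total_minutes <- toy$very_active_minutes + toy$fairly_active_minutes +
  toy$lightly_active_minutes + toy$sedentary_minutes
toy$total_hours <- ceiling(toy$total_minutes / 60)

toy
```

The row sums to 1,094 minutes, which `ceiling()` turns into 19 hours even though it is closer to 18. Rounding up is the choice made in the code above; `round()` would be an alternative if whole hours should be nearest rather than next.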

Step-4 Analyze

4.1 Perform calculations

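Pulling statistics for analysis:

  • count (number of rows)
  • mean (average)
  • std (standard deviation)
  • min and max
  • percentiles 25%, 50%, 75%

# Pull general statistics. summary() reports the min, quartiles, median, mean and max;
# the row count and standard deviation come from nrow() and sd()
summary(dataset)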
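
Because `summary()` does not print a row count or a standard deviation, those two statistics need a separate step. A minimal sketch on made-up numbers, with assumed column names:

```r
# Made-up values standing in for two numeric columns
toy <- data.frame(
  total_steps = c(4000, 8000, 12000),
  calories = c(1800, 2300, 2800)
)

# Count, mean and standard deviation for every column
extra_stats <- data.frame(
  count = sapply(toy, function(col) sum(!is.na(col))),
  mean = sapply(toy, mean, na.rm = TRUE),
  sd = sapply(toy, sd, na.rm = TRUE)
)
extra_stats

# Quartiles (25%, 50%, 75%) for one column
step_quartiles <- quantile(toy$total_steps, probs = c(0.25, 0.5, 0.75))
step_quartiles
```

On the real data, the same calls would be applied to the numeric columns of `dataset` only, since `mean()` and `sd()` are not meaningful for the date and weekday columns.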

The statistical analysis reveals that:

  1. The average number of steps taken by users was 7,637, or 5.4 km, which falls short of the recommended daily goal of 10,000 steps or 8 km for an adult female, according to the CDC. Meeting this goal is important for improving overall health, weight loss, and fitness, as per a Medical News Today article.
  2. Furthermore, the majority of users were leading sedentary lifestyles: they logged an average of 991 minutes, or about 16.5 hours, of sedentary behavior per day, accounting for 81% of total average minutes. This indicates that there is room for improvement in physical activity levels.
  3. Finally, the average calories burned by users were 2,303, equivalent to 0.6 pounds. However, it’s important to note that several factors such as age, weight, daily tasks, exercise, hormones, and daily calorie intake can impact individual calorie burn. Therefore, it is challenging to draw detailed conclusions without additional information, as stated in a Healthline article.
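
A quick unit check on the averages quoted above is plain arithmetic. The sketch below only re-expresses figures already given in the findings; the variable names are illustrative.

```r
# Averages quoted in the findings above
avg_steps <- 7637
avg_distance_km <- 5.4
avg_sedentary_minutes <- 991

# Sedentary time in hours
sedentary_hours <- avg_sedentary_minutes / 60
round(sedentary_hours, 1)   # 16.5

# Implied average stride length in metres
stride_m <- avg_distance_km * 1000 / avg_steps
round(stride_m, 2)          # 0.71

# Share of the 10,000-step goal reached on an average day, in percent
goal_share <- avg_steps / 10000 * 100
round(goal_share)           # 76
```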

Step-5 Share

In this step, we are creating visualizations and communicating our findings based on our analysis.

# Plot a bar chart of the number of records logged on each day of the week
p <- ggplot(data = dataset) +
  geom_bar(mapping = aes(x = activity_Day, fill = activity_Day)) +
  coord_flip() +
  theme_classic() +
  labs(title = "Number of times users logged in app across the week",
       x = "Activity Day",
       y = "Frequency",
       fill = "Day of the week") +
  guides(fill = guide_legend(title = "Days of the week", ncol = 2))

# Add grid lines on top of the classic theme
p + theme(panel.grid.major = element_line(color = "grey"),
          panel.grid.minor = element_line(color = "lightgrey"))

The bar chart represents the usage frequency of the FitBit app throughout the week.

  1. The data suggests that users tend to use the app more frequently on weekdays, specifically from Tuesday to Friday.
  2. There was a slight decrease in usage on Fridays and this trend continued into the weekends and on Mondays. It’s possible that users may forget to track their activity on the weekends, resulting in the lower usage frequency.
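
The bar heights are simply record counts per weekday, so they can also be tabulated directly. A minimal base R sketch on made-up dates; the fixed English labels are an assumption chosen to avoid depending on the system locale.

```r
# Made-up activity dates (already parsed as Date)
toy_dates <- as.Date(c("2016-04-12", "2016-04-12", "2016-04-13",
                       "2016-04-16", "2016-04-17", "2016-04-19"))

# Map each date to a weekday label without relying on the system locale
day_labels <- c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")
toy_days <- factor(day_labels[as.POSIXlt(toy_dates)$wday + 1],
                   levels = day_labels)

# Records per weekday: the same counts geom_bar() draws as bar heights
day_counts <- table(toy_days)
day_counts
```

Using a factor with all seven levels keeps days with zero records in the table, which makes gaps visible.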

# Plot a scatter plot of calories burned against steps taken

# Mean values taken from summary(dataset), drawn as dashed reference lines
mean_calories <- 2303
mean_steps <- 7637
ggplot(dataset, aes(x = total_steps, y = calories, color = calories)) +
  geom_point(alpha = 0.8) +
  scale_color_gradient(name = "Calories burned", low = "blue", high = "red",
                       breaks = seq(2000, 4000, by = 500)) +
  geom_vline(xintercept = mean_steps, color = "blue", linetype = "dashed",
             size = 0.8) +
  geom_hline(yintercept = mean_calories, color = "red", linetype = "dashed",
             size = 0.8) +
  labs(x = "Steps taken", y = "Calories burned",
       title = "Calories burned for every step taken") +
  theme_classic() +
  theme(legend.position = "right")

Based on the scatter plot, we can draw the following conclusions:

  • There is a positive correlation between the number of steps taken and the calories burned.
  • The plot shows that the calories burned increase as the number of steps taken increases, up to a maximum point of around 15,000 steps. Beyond this point, the rate of calorie burn tends to decrease.
  • We also observed a few outliers in the data, such as observations with zero or minimal calories burned despite a non-zero number of steps taken, and one observation with more than 35,000 steps but less than 3,000 calories burned. These outliers could be due to natural variation, changes in user behavior, or errors in data collection (such as miscalculations, contamination, or human error).
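
The strength of the relationship read off the scatter plot can be put into a number with a correlation coefficient and a simple linear fit. The sketch below uses made-up records, not the article's data, so its values say nothing about the real dataset.

```r
# Made-up daily records: steps and calories
toy <- data.frame(
  total_steps = c(0, 2000, 5000, 8000, 11000, 15000),
  calories = c(1500, 1750, 1950, 2350, 2550, 3000)
)

# Pearson correlation: values near +1 indicate a strong positive linear relationship
step_calorie_cor <- cor(toy$total_steps, toy$calories)
round(step_calorie_cor, 3)

# Simple linear fit: the slope is the extra calories associated with one more step
fit <- lm(calories ~ total_steps, data = toy)
coef(fit)
```

On the real data the equivalent call would be `cor(dataset$total_steps, dataset$calories)`.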

# Plot a scatter plot of calories burned against total hours logged
p <- ggplot(data = dataset)
p + geom_point(mapping = aes(x = totactive_Hours, y = calories, color = calories)) +
  scale_color_gradient(low = "blue", high = "red") +
  labs(title = "Calories burned for every hour logged", x = "Total Hours Logged", y = "Calories burned")

In the scatter plot, we can observe the following:

  • There is a weak positive correlation between hours logged and calories burned, which suggests that logging more hours does not necessarily lead to a significant increase in the number of calories burned. This is likely due to the fact that the average sedentary hours are plotted around the 16 to 17 hours range.
  • The plot also shows a few outliers, such as zero value outliers and a single red dot at 24 hours with zero calories burned. These outliers may be caused by similar factors as previously mentioned, including natural data variation, changes in user behavior, or errors in data collection.
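
Outliers like the ones described can be pulled out with a logical filter so they can be inspected row by row. A minimal sketch on made-up records; the column names follow the cleaned names used in this article and the values are invented.

```r
# Made-up records, including suspicious rows
toy <- data.frame(
  total_steps = c(9000, 4500, 0, 36000),
  totactive_hours = c(18, 17, 24, 19),
  calories = c(2400, 0, 0, 2700)
)

# Rows with zero calories burned
zero_calorie_rows <- toy[toy$calories == 0, ]

# Rows with steps recorded but no calories burned
steps_without_calories <- toy[toy$total_steps > 0 & toy$calories == 0, ]

nrow(zero_calorie_rows)
nrow(steps_without_calories)
```

Whether such rows should be removed or kept is a judgment call; the filter only makes them visible.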

# Calculate the total of each activity-minutes column
value <- c(sum(dataset$lightly_active_minutes), sum(dataset$fairly_active_minutes),
           sum(dataset$very_active_minutes), sum(dataset$sedentary_minutes))
group <- c("Lightly active minutes", "Fairly active minutes", "Very active minutes", "Sedentary minutes")
activity_data <- data.frame(group, value)

# Draw a treemap in which each rectangle's area is proportional to the total minutes
treemap(activity_data,
        index = "group",
        vSize = "value",
        type = "index",
        title = "Percentage of Activity"
)

The treemap shows that:

  1. The largest portion, at 81.3%, is represented by sedentary minutes. This implies that users mainly use the FitBit app to log daily activities such as daily commute, inactive movements, or running errands.
  2. The app is hardly being used to monitor fitness activities such as running, as indicated by the small percentage of fairly active activity (1.1%) and very active activity (1.7%). This is disheartening considering that the FitBit app was designed to promote fitness.
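
The treemap sizes each rectangle by total minutes but does not print the percentages themselves. Shares like those quoted above come from dividing each total by the overall sum, as in this sketch with made-up totals:

```r
# Made-up minute totals per activity level
activity_minutes <- c(
  "Sedentary minutes" = 8000,
  "Lightly active minutes" = 1600,
  "Fairly active minutes" = 150,
  "Very active minutes" = 250
)

# Each level's share of all logged minutes, in percent
activity_share <- round(activity_minutes / sum(activity_minutes) * 100, 1)
activity_share
```

With the real data, `activity_minutes` would be built from the same four `sum()` calls used for the treemap.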

Step-6 Act

In the final stage, we will be presenting our findings and providing suggestions based on our analysis.

Let’s revisit our initial business questions and share our high-level business recommendations:

1. What are the observed patterns?

  • Sedentary time makes up the majority (81.3%) of the minutes logged in the FitBit app, which suggests that users mainly track everyday, low-intensity activity rather than fitness and health habits.
  • Furthermore, users prefer to track their activities during weekdays rather than weekends, perhaps because they tend to be more active on weekdays and less active on weekends.

2. How could these patterns be relevant to Bellabeat customers?

  • Both companies aim to provide women with data regarding their health, habits, and fitness, as well as motivate them to comprehend their current habits and make healthier decisions.
  • As a result, these shared trends in health and fitness can be applied to Bellabeat customers as well.

3. How could these trends assist in shaping Bellabeat’s marketing strategy?

  • Bellabeat’s marketing team can motivate users by offering education and information on fitness benefits, recommending various forms of exercise (such as 10-minute workouts during weekdays and more intense workouts on weekends), and providing information on calorie intake and burn rate through the Bellabeat app.
  • Additionally, on weekends, the Bellabeat app can send notifications to encourage users to exercise.

The dataset and complete code can be found here.
