Google Capstone Project: How Can Bellabeat, a Wellness Technology Company, Play It Smart?

Katie Huang Xiemin
Published in Analytics Vidhya · 12 min read · Jun 30, 2021

This is an optional capstone project from the Google Data Analytics Course, posted on GitHub and Kaggle.

The analysis follows the 6 steps of Data Analysis taught in the Google course: Ask, Prepare, Process, Analyse, Share and Act.

STEP 1: ASK

In this step, we define the problem and objectives of our case study and its desired outcome.

1.1 Background

Bellabeat is a high-tech manufacturer of beautifully designed, health-focused smart products for women, founded in 2013. By inspiring and empowering women with knowledge about their own health and habits, Bellabeat has grown rapidly and quickly positioned itself as a tech-driven wellness company for women.

The co-founder and Chief Creative Officer, Urška Sršen, is confident that an analysis of non-Bellabeat consumer data (i.e., FitBit fitness tracker usage data) would reveal more opportunities for growth.

1.2 Business Task:

Analyze FitBit fitness tracker data to gain insights into how consumers are using the FitBit app and discover trends for Bellabeat marketing strategy.

1.3 Business Objectives:

  • What are the trends identified?
  • How could these trends apply to Bellabeat customers?
  • How could these trends help influence Bellabeat marketing strategy?

1.4 Deliverables:

  • A clear summary of the business task
  • A description of all data sources used
  • Documentation of any cleaning or manipulation of data
  • A summary of analysis
  • Supporting visualizations and key findings
  • High-level content recommendations based on the analysis

1.5 Key Stakeholders:

  • Urška Sršen: Bellabeat’s cofounder and Chief Creative Officer
  • Sando Mur: Mathematician, Bellabeat’s cofounder and key member of the Bellabeat executive team
  • Bellabeat marketing analytics team: A team of data analysts guiding Bellabeat’s marketing strategy.

STEP 2: PREPARE

In the Prepare phase, we identify the data being used and its limitations.

2.1 Information on Data Source:

  1. Data is publicly available on Kaggle (FitBit Fitness Tracker Data) and is stored in 18 CSV files.
  2. Generated by respondents to a survey distributed via Amazon Mechanical Turk between 12 March 2016 and 12 May 2016.
  3. 30 FitBit users consented to the submission of personal tracker data.
  4. Data collected includes physical activity recorded in minutes, heart rate, sleep monitoring, daily activity and steps.

2.2 Limitations of Data Set:

  • Data was collected five years ago, in 2016. Users’ daily activity, fitness and sleeping habits, diet and food consumption may have changed since then, so the data may not be timely or relevant.
  • A sample size of 30 FitBit users is not representative of the entire fitness population.
  • As the data was collected via a survey, we are unable to ascertain its integrity or accuracy.

2.3 Is Data ROCCC?

A good data source is ROCCC, which stands for Reliable, Original, Comprehensive, Current, and Cited.

  • Reliable — LOW — Not reliable as it only has 30 respondents
  • Original — LOW — Third party provider (Amazon Mechanical Turk)
  • Comprehensive — MED — Parameters match most of Bellabeat products’ parameters
  • Current — LOW — Data is 5 years old and may not be relevant
  • Cited — LOW — Data collected from third party, hence unknown

Overall, the dataset is of low quality, and it is not recommended to base business recommendations on this data.

2.4 Data Selection

The following file is selected and copied for analysis.

dailyActivity_merged.csv

2.5 Tool

We are using Python for data cleaning, transformation and visualisation.

STEP 3: PROCESS

Here, we will process the data by cleaning it and ensuring that it is correct, relevant, complete and free of errors and outliers. We will:

  • Explore and observe data
  • Check for and treat missing or null values
  • Transform data — format data type
  • Perform preliminary statistical analysis

3.1 Preparing the Environment

The Python libraries are imported and aliased for easy reading.

# import packages and alias them
import numpy as np               # numerical arrays
import pandas as pd              # data structures and data analysis
import matplotlib.pyplot as plt  # data visualization
import datetime as dt            # date and time handling

3.2 Importing data set

Reading in the selected file.

# read_csv function to read CSV file
daily_activity = pd.read_csv("../input/fitbit/Fitabase Data 4.12.16-5.12.16/dailyActivity_merged.csv")

3.3 Data cleaning and manipulation

  1. Observe and familiarize with data
  2. Check for null or missing values
  3. Perform sanity check of data

Previewing the first 10 rows to familiarise ourselves with the data.

# preview first 10 rows with all columns
daily_activity.head(10)
This is only a screenshot of the table; the full table is available here. The same applies to all tables and visuals below.

Then, I’m going to find out whether there are any null or missing values in the data.

# obtain the # of missing data points per column
missing_values_count = daily_activity.isnull().sum()

# look at the # of missing points in all columns
missing_values_count[:]
Id 0
ActivityDate 0
TotalSteps 0
TotalDistance 0
TrackerDistance 0
LoggedActivitiesDistance 0
VeryActiveDistance 0
ModeratelyActiveDistance 0
LightActiveDistance 0
SedentaryActiveDistance 0
VeryActiveMinutes 0
FairlyActiveMinutes 0
LightlyActiveMinutes 0
SedentaryMinutes 0
Calories 0
dtype: int64

Finding out basic information of the data:

  • no. of rows and columns
  • column names
  • non-null count
  • data type
# show basic information of data
daily_activity.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 940 entries, 0 to 939
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Id 940 non-null int64
1 ActivityDate 940 non-null object
2 TotalSteps 940 non-null int64
3 TotalDistance 940 non-null float64
4 TrackerDistance 940 non-null float64
5 LoggedActivitiesDistance 940 non-null float64
6 VeryActiveDistance 940 non-null float64
7 ModeratelyActiveDistance 940 non-null float64
8 LightActiveDistance 940 non-null float64
9 SedentaryActiveDistance 940 non-null float64
10 VeryActiveMinutes 940 non-null int64
11 FairlyActiveMinutes 940 non-null int64
12 LightlyActiveMinutes 940 non-null int64
13 SedentaryMinutes 940 non-null int64
14 Calories 940 non-null int64
dtypes: float64(7), int64(7), object(1)
memory usage: 110.3+ KB

We are also going to count unique IDs to confirm whether data has 30 IDs as claimed by the survey.

# count distinct value of "Id"
unique_id = len(pd.unique(daily_activity["Id"]))

print("# of unique Id: " + str(unique_id))
# of unique Id: 33

From the above observations, we noted that

  1. There are no null or missing values, as shown under ‘Non-Null Count’.
  2. Data has 15 columns and 940 rows.
  3. ActivityDate is wrongly classified as object dtype and has to be converted to datetime64 dtype.
  4. There are 33 unique IDs, instead of 30 unique IDs as expected. Some users may have created additional IDs during the survey period.

Now that we have identified the dirty data, we will then perform data manipulation/transformation.

  1. Convert ActivityDate to datetime64 dtype.
  2. Convert the format of ActivityDate to yyyy-mm-dd.
  3. Create new column DayOfTheWeek containing the day of the week, for further analysis.
  4. Create new column TotalMins as the sum of VeryActiveMinutes, FairlyActiveMinutes, LightlyActiveMinutes and SedentaryMinutes.
  5. Create new column TotalHours by converting TotalMins from no. 4 into hours.
  6. Rearrange and rename columns.

First off, we will convert ActivityDate from object to datetime64 dtype and reformat it as yyyy-mm-dd.

Then, we confirm that the dtype and the format have been updated.

# convert "ActivityDate" to datetime64 dtype and format to yyyy-mm-dd
daily_activity["ActivityDate"] = pd.to_datetime(daily_activity["ActivityDate"], format="%m/%d/%Y")
# re-print information to confirm
daily_activity.info()
# print the first 5 rows of "ActivityDate" to confirm
daily_activity["ActivityDate"].head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 940 entries, 0 to 939
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Id 940 non-null int64
1 ActivityDate 940 non-null datetime64[ns]
2 TotalSteps 940 non-null int64
3 TotalDistance 940 non-null float64
4 TrackerDistance 940 non-null float64
5 LoggedActivitiesDistance 940 non-null float64
6 VeryActiveDistance 940 non-null float64
7 ModeratelyActiveDistance 940 non-null float64
8 LightActiveDistance 940 non-null float64
9 SedentaryActiveDistance 940 non-null float64
10 VeryActiveMinutes 940 non-null int64
11 FairlyActiveMinutes 940 non-null int64
12 LightlyActiveMinutes 940 non-null int64
13 SedentaryMinutes 940 non-null int64
14 Calories 940 non-null int64
dtypes: datetime64[ns](1), float64(7), int64(7)
memory usage: 110.3 KB
0 2016-04-12
1 2016-04-13
2 2016-04-14
3 2016-04-15
4 2016-04-16
Name: ActivityDate, dtype: datetime64[ns]

Creating a new list of rearranged column names and renaming daily_activity to the shorter df_activity. Note that reindex also creates the not-yet-computed columns (DayOfTheWeek, TotalExerciseMinutes, TotalExerciseHours) as empty placeholders, to be filled in below.

# create new list of rearranged columns
new_cols = ['Id', 'ActivityDate', 'DayOfTheWeek', 'TotalSteps', 'TotalDistance', 'TrackerDistance', 'LoggedActivitiesDistance', 'VeryActiveDistance', 'ModeratelyActiveDistance', 'LightActiveDistance', 'SedentaryActiveDistance', 'VeryActiveMinutes', 'FairlyActiveMinutes', 'LightlyActiveMinutes', 'SedentaryMinutes', 'TotalExerciseMinutes', 'TotalExerciseHours', 'Calories']

# reindex function to rearrange columns based on "new_cols"
df_activity = daily_activity.reindex(columns=new_cols)

# print 1st 5 rows to confirm
df_activity.head(5)

Creating a new column containing the day of the week for further analysis.

# create new column "DayOfTheWeek" to represent the day of the week
df_activity["DayOfTheWeek"] = df_activity["ActivityDate"].dt.day_name()

# print 1st 5 rows to confirm
df_activity["DayOfTheWeek"].head(5)
0 Tuesday
1 Wednesday
2 Thursday
3 Friday
4 Saturday
Name: DayOfTheWeek, dtype: object
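One caveat worth noting (an addition of mine, not in the original notebook): day_name() returns plain strings, so sorting or plotting will order days alphabetically rather than by calendar. Converting the column to an ordered categorical fixes this; a minimal sketch on a small synthetic frame:

```python
import pandas as pd

# hypothetical mini-frame standing in for df_activity
df = pd.DataFrame({"ActivityDate": pd.to_datetime(
    ["2016-04-12", "2016-04-16", "2016-04-13"])})
df["DayOfTheWeek"] = df["ActivityDate"].dt.day_name()

# ordered categorical so sorting and plotting follow calendar order
day_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]
df["DayOfTheWeek"] = pd.Categorical(
    df["DayOfTheWeek"], categories=day_order, ordered=True)

print(df.sort_values("DayOfTheWeek")["DayOfTheWeek"].tolist())
```

With the ordered categorical in place, later charts grouped by day come out Monday-to-Sunday instead of alphabetically.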

Renaming columns from CamelCase (XxxYyy) to snake_case (xxx_yyy).

# rename columns from CamelCase to snake_case
df_activity.rename(columns = {
    "Id": "id",
    "ActivityDate": "date",
    "DayOfTheWeek": "day_of_the_week",
    "TotalSteps": "total_steps",
    "TotalDistance": "total_dist",
    "TrackerDistance": "track_dist",
    "LoggedActivitiesDistance": "logged_dist",
    "VeryActiveDistance": "very_active_dist",
    "ModeratelyActiveDistance": "moderate_active_dist",
    "LightActiveDistance": "light_active_dist",
    "SedentaryActiveDistance": "sedentary_active_dist",
    "VeryActiveMinutes": "very_active_mins",
    "FairlyActiveMinutes": "fairly_active_mins",
    "LightlyActiveMinutes": "lightly_active_mins",
    "SedentaryMinutes": "sedentary_mins",
    "TotalExerciseMinutes": "total_mins",
    "TotalExerciseHours": "total_hours",
    "Calories": "calories",
}, inplace = True)

# print column names to confirm
print(df_activity.columns.values)
df_activity.head(5)
['id' 'date' 'day_of_the_week' 'total_steps' 'total_dist' 'track_dist' 'logged_dist' 'very_active_dist' 'moderate_active_dist' 'light_active_dist' 'sedentary_active_dist' 'very_active_mins' 'fairly_active_mins' 'lightly_active_mins' 'sedentary_mins' 'total_mins' 'total_hours' 'calories']

Creating a new column total_mins as the sum of all minutes logged.

# create new column "total_mins" containing sum of total minutes.
df_activity["total_mins"] = df_activity["very_active_mins"] + df_activity["fairly_active_mins"] + df_activity["lightly_active_mins"] + df_activity["sedentary_mins"]
df_activity["total_mins"].head(5)
0 1094
1 1033
2 1440
3 998
4 1040
Name: total_mins, dtype: int64

Creating new column by converting total_mins to number of hours.

# create new column "total_hours" by converting minutes to hours, rounded to the nearest hour
df_activity["total_hours"] = round(df_activity["total_mins"] / 60)
# print 1st 5 rows to confirm
df_activity["total_hours"].head(5)
0 18.0
1 17.0
2 24.0
3 17.0
4 17.0
Name: total_hours, dtype: float64

Data cleaning and manipulation are complete; the data is now ready to be analysed.

STEP 4: ANALYSE

4.1 Perform calculations

Pulling statistics for analysis:

  • count — no. of rows
  • mean (average)
  • std (standard deviation)
  • min and max
  • percentiles 25%, 50%, 75%
# pull general statistics
df_activity.describe()

Interpreting statistical findings:

  1. On average, users logged 7,637 steps, or 5.4 km, which is not adequate. The CDC recommends that adults aim for at least 10,000 steps (about 8 km) per day for general health benefits, weight loss and fitness improvement. Source: Medical News Today article
  2. Sedentary time dominates, averaging 991 minutes (about 20 hours) and making up 81% of average total logged minutes.
  3. Average calories burned is 2,303 calories per day, equivalent to roughly 0.6 pounds of fat. We could not interpret this in more detail, as calories burned depend on several factors such as age, weight, daily tasks, exercise, hormones and daily calorie intake. Source: Health Line article
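The headline figures above can also be recomputed directly from the columns instead of reading them off describe(). A minimal sketch on a synthetic stand-in frame (the column names mirror the cleaned data, but the values are made up, so the results differ from the article's):

```python
import pandas as pd

# synthetic stand-in for df_activity; the values are illustrative only
df = pd.DataFrame({
    "total_steps":    [4000, 8000, 12000, 6000],
    "sedentary_mins": [900, 1000, 700, 1100],
    "total_mins":     [1100, 1200, 1000, 1300],
    "calories":       [1800, 2200, 2900, 2000],
})

# mean daily steps across all records
mean_steps = df["total_steps"].mean()

# sedentary minutes as a share of total logged minutes
sedentary_share = df["sedentary_mins"].mean() / df["total_mins"].mean() * 100

# number of days meeting the 10,000-step guideline cited above
days_meeting_goal = (df["total_steps"] >= 10000).sum()

print(f"mean steps: {mean_steps:.0f}")
print(f"sedentary share: {sedentary_share:.1f}%")
print(f"days meeting 10,000-step goal: {days_meeting_goal}")
```

Run against the real df_activity, the same three lines reproduce the 7,637-step mean, the 81% sedentary share and a per-day goal count.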

STEP 5: SHARE

In this step, we are creating visualizations and communicating our findings based on our analysis.

5.1 Data Visualisation and Findings

# import matplotlib package
import matplotlib.pyplot as plt

# plotting histogram
plt.style.use("default")
plt.figure(figsize=(6, 4))  # specify size of the chart
plt.hist(df_activity.day_of_the_week, bins=7,
         width=0.6, color="lightskyblue", edgecolor="black")

# adding annotations and visuals
plt.xlabel("Day of the week")
plt.ylabel("Frequency")
plt.title("No. of times users logged in app across the week")
plt.grid(True)
plt.show()

Frequency of usage across the week

In this histogram, we look at the frequency of FitBit app usage by day of the week.

  1. We discovered that users prefer, or better remember (giving them the benefit of the doubt that they simply forgot), to track their activity in the app midweek, from Tuesday to Friday.
  2. The frequency drops on Friday and stays lower through the weekend and Monday.
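As an aside (my addition, not part of the original notebook), the per-day frequencies behind the histogram can be read off directly with value_counts, avoiding the quirks of binning string labels through plt.hist; a sketch on made-up data:

```python
import pandas as pd

# hypothetical stand-in for df_activity["day_of_the_week"]
days = pd.Series(["Tuesday", "Tuesday", "Wednesday", "Saturday",
                  "Tuesday", "Wednesday", "Monday"])

# count how many records fall on each day of the week
counts = days.value_counts()
print(counts)
```

counts.plot(kind="bar") would then reproduce the chart with explicit control over bar order.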
# import matplotlib package
import matplotlib.pyplot as plt

# plotting scatter plot
plt.style.use("default")
plt.figure(figsize=(8, 6))  # specify size of the chart
plt.scatter(df_activity.total_steps, df_activity.calories,
            alpha=0.8, c=df_activity.calories, cmap="Spectral")

# add annotations and visuals
mean_calories = 2303  # mean calories burned, from describe() above
mean_steps = 7637     # mean daily steps, from describe() above
plt.colorbar(orientation="vertical")
plt.axvline(mean_steps, color="Blue", label="Mean steps")
plt.axhline(mean_calories, color="Red", label="Mean calories burned")
plt.xlabel("Steps taken")
plt.ylabel("Calories burned")
plt.title("Calories burned for every step taken")
plt.grid(True)
plt.legend()
plt.show()

Calories burned for every step taken

From the scatter plot, we discovered that:

  1. There is a positive correlation between steps taken and calories burned.
  2. The intensity of calories burned increases over the range of 0 to 15,000 steps, with the burn rate tapering off beyond 15,000 steps.
  3. We noted a few outliers:
  • Days with zero steps and zero-to-minimal calories burned.
  • One observation of more than 35,000 steps with fewer than 3,000 calories burned.
  • These outliers could be due to natural variation in the data, changes in users’ usage, or errors in data collection (i.e., miscalculations, data contamination or human error).
# import matplotlib package
import matplotlib.pyplot as plt

# plotting scatter plot
plt.style.use("default")
plt.figure(figsize=(8, 6))  # specify size of the chart
plt.scatter(df_activity.total_hours, df_activity.calories,
            alpha=0.8, c=df_activity.calories, cmap="Spectral")

# adding annotations and visuals
mean_calories = 2303        # mean calories burned, from describe() above
mean_hours = 20             # mean total hours logged
mean_sedentary = 991 / 60   # mean sedentary minutes converted to hours

plt.colorbar(orientation="vertical")
plt.axvline(mean_hours, color="Blue", label="Mean hours logged")
plt.axvline(mean_sedentary, color="Purple", label="Mean sedentary hours")
plt.axhline(mean_calories, color="Red", label="Mean calories burned")
plt.xlabel("Hours logged")
plt.ylabel("Calories burned")
plt.title("Calories burned for every hour logged")
plt.legend()
plt.grid(True)
plt.show()

Calories burned for every hour logged

The scatter plot is showing:

  1. A weak positive correlation: more hours logged does not necessarily translate into more calories burned. This is largely because sedentary hours dominate, with the mean sedentary time (the purple line) sitting in the 16-to-17-hour range.
  2. Again, we can see a few outliers:
  • The same zero-value outliers.
  • An unusual red dot at 24 hours logged with zero calories burned, which may be due to the same causes as above.
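To make the outlier hunt reproducible rather than purely visual, the frame can be filtered directly. A sketch with hypothetical thresholds (the cut-offs are my choices for illustration, applied to made-up rows):

```python
import pandas as pd

# synthetic stand-in rows for df_activity; values are illustrative only
df = pd.DataFrame({
    "total_steps": [0, 7000, 36000, 9000],
    "calories":    [0, 2100, 2600, 2400],
})

# days with zero steps (tracker idle or not worn)
zero_steps = df[df["total_steps"] == 0]

# extreme step counts paired with comparatively low calorie burn
extreme = df[(df["total_steps"] > 35000) & (df["calories"] < 3000)]

print(len(zero_steps), len(extreme))
```

The resulting sub-frames can then be inspected row by row before deciding whether to keep, correct or drop them.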
# import packages
import matplotlib.pyplot as plt
import numpy as np

# calculating totals of the individual minutes columns
very_active_mins = df_activity["very_active_mins"].sum()
fairly_active_mins = df_activity["fairly_active_mins"].sum()
lightly_active_mins = df_activity["lightly_active_mins"].sum()
sedentary_mins = df_activity["sedentary_mins"].sum()

# plotting pie chart
slices = [very_active_mins, fairly_active_mins, lightly_active_mins, sedentary_mins]
labels = ["Very active minutes", "Fairly active minutes", "Lightly active minutes", "Sedentary minutes"]
colours = ["lightcoral", "yellowgreen", "lightskyblue", "darkorange"]
explode = [0, 0, 0, 0.1]
plt.style.use("default")
plt.pie(slices, labels=labels,
        colors=colours, wedgeprops={"edgecolor": "black"},
        explode=explode, autopct="%1.1f%%")
plt.title("Percentage of Activity in Minutes")
plt.tight_layout()
plt.show()

Percentage of Activity in Minutes

As seen in the pie chart:

  1. Sedentary minutes take the biggest slice at 81.3%.
  2. This indicates that users mainly use the FitBit app to log daily activities such as commuting, low-intensity movement (moving from one spot to another) or running errands.
  3. The app is rarely used to track fitness (e.g., running), as reflected in the small shares of fairly active (1.1%) and very active (1.7%) minutes. This is discouraging, as the FitBit app was developed to encourage fitness.
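The percentages that autopct prints can also be computed directly from the four sums, which is handy for quoting exact figures in a report. A minimal sketch with made-up totals (not the article's actual sums):

```python
# illustrative activity totals in minutes (not the article's actual sums)
slices = {
    "very_active_mins": 20,
    "fairly_active_mins": 10,
    "lightly_active_mins": 170,
    "sedentary_mins": 800,
}

# each category's share of total logged minutes, as a percentage
total = sum(slices.values())
shares = {k: round(v / total * 100, 1) for k, v in slices.items()}
print(shares)
```

Substituting the real column sums from df_activity yields the 81.3% / 1.1% / 1.7% figures quoted above.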

STEP 6: ACT

In the final step, we will be delivering our insights and providing recommendations based on our analysis.

Here, we revisit our business questions and share with you our high-level business recommendations.

1. What are the trends identified?

  • The majority of usage (81.3% of logged minutes) tracks sedentary activity; users are not using the app to track their health and fitness habits.
  • Users track their activities more on weekdays than on weekends, perhaps because they spend more time outside on weekdays and stay in on weekends.

2. How could these trends apply to Bellabeat customers?

  • Both companies develop products that give women data on their health, habits and fitness, and encourage them to understand their current habits and make healthy decisions. These common trends surrounding health and fitness can very well apply to Bellabeat customers.

3. How could these trends help influence Bellabeat marketing strategy?

  • The Bellabeat marketing team can educate and equip users with knowledge about the benefits of fitness, suggest different types of exercise (e.g., a simple 10-minute routine on weekdays and a more intense workout on weekends), and provide calorie intake and burn-rate information in the Bellabeat app.
  • On weekends, the Bellabeat app can also send notifications prompting users to exercise.

The dataset and complete code can be found here.


Self-taught Data Analyst | SQL Enthusiast | GitHub @katiehuangx