Google Fit Data Analysis
A simple way to explain the main steps of data exploration.
Explanation
A while ago I started studying data analysis, and after some learning exercises I wanted to build my own exploratory data project. I thought about the topic and decided not to use common, widely available data sources such as Kaggle or other popular dataset platforms. I wanted something personal and interesting to me. My choice fell on data from the Google Fit application, which I installed on my phone several months ago to push myself to do some sports.
So I had two goals in mind when I started the project. The first was to go through the whole process of data exploration from beginning to end using the main concepts of data analysis: cleaning, exploration, and visualization. The second was to evaluate my progress in everyday sports activities: am I becoming more athletic, how much time do I invest in it, and what is the best kind of sport for me?
Both goals are interesting and important to me, and now I want to share the results of my small study with you.
Some theory
Usually, the process of data analysis includes these five steps:
- Research objectives
- Gathering data
- Data preparation
- Data exploration
- Interpreting results
Setting the purpose of the research means knowing what question(s) need to be answered. What do we really want to find out? What is our point of interest? This determines which data, parameters, and variables we need, and therefore which data sources we will look in.
After we have determined the purpose of the study and the kind of data we need, we should collect it from whatever sources are available to us. The main thing is that the data has to be as accurate and complete as possible. Otherwise, our research can lead to incorrect results and be useless.
But we very rarely have nice clean data that we can explore immediately. That’s why we take what we have, then clean it and prepare it for exploration.
Only after all the previous steps can we explore our data. This may include descriptive statistics, modeling, and the most useful tool of all, visualization. At this stage, we can confirm or reject our assumptions and answer the questions we asked when setting the goals at the first stage of the study.
And in the end, we come to conclusions that we can share with our target audience, stakeholders, or ourselves.
Ok, let’s start and go through all these stages step by step.
1. Research objectives
As I’ve said before, there are two purposes of this project: to practice data analysis and to get some feedback from my Google Fit application about my sports activities. My assumption is that by using this application every day, I become more committed to sports, and my sports activity increases over time.
2. Gathering data
The raw data for this research comes from the Google Fit application, which I installed on my phone in September 2019. To download the data from this application, we should use Google Takeout. This system is designed to provide users with their data from various Google sources. All you have to do is register there and specify which data from which application you want to receive, and Google will send it to your email in CSV format.
For the Fit application, Google provides a set of files for each day of the given period and a file with aggregated data. We will work with the latter, as it has all the information we need. The period we are going to explore is from 1 October to 31 December 2019.
For our study, we will use Jupyter Notebook with the Python modules pandas and Matplotlib.
Let’s start by importing the necessary modules…
import pandas as pd
import matplotlib.pyplot as plt
…and reading data from the file.
data = pd.read_csv('google-fit-data-file.csv')
Let’s take a quick first look at our data:
data.info()
We see that our dataset has 92 rows and 25 columns. We have quite a lot of empty cells. Some columns have no data at all (like Height and Heart Points). We’ll think about what to do with them later.
Now let’s take a glance at the first few rows of the data.
data.head()
So messy and uninformative. And by the way, this is why I love data science: it lets us get interesting and sometimes unexpected insights from this mess of numbers. Soon we will bring order to this mess.
Now that we know what our data looks like, we can start preparing it for further exploration.
3. Data preparation
3.1. Cleaning
First of all, we should clean the data. What does it mean? It means we should get rid of everything that has no value or significance for the research. For example, we don’t need some columns in our dataset, because they are not relevant to this study.
As we saw in the data info, the columns “Height” and “Heart Beats” are empty. We will not work with geographic data in this analysis, so we don’t need the latitude and longitude columns either. We don’t want to discuss weight, so we delete all the columns with this data. And we don’t need the columns with rest and sleep time, because we are talking about sport, not rest.
We can delete these columns with the drop function in pandas.
Passing axis=1 means that we work with columns, not rows. Passing inplace=True changes the dataset itself instead of creating a copy of the table to store the new data.
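A minimal sketch of such a drop call on a tiny made-up table. The column names and the list of columns to remove are assumptions for illustration and may differ from the real export:

```python
import pandas as pd

# Toy stand-in for the export; every column name here is an assumption.
data = pd.DataFrame({
    "Date": ["01.10.2019", "02.10.2019"],
    "Move Minutes count": [95, 40],
    "Height (m)": [None, None],
    "Low latitude (deg)": [32.0, 32.1],
    "Average weight (kg)": [None, None],
    "Sleep duration (ms)": [0, 0],
})

# Hypothetical list of columns that are irrelevant to the analysis.
columns_to_drop = [
    "Height (m)",
    "Low latitude (deg)",
    "Average weight (kg)",
    "Sleep duration (ms)",
]

# axis=1 targets columns; inplace=True modifies `data` itself.
data.drop(columns_to_drop, axis=1, inplace=True)

print(list(data.columns))
```

Only the columns that were not listed remain in the table.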
3.2. Formatting
Sometimes (frankly speaking, almost always) we have to change the data we have. The reasons vary, but at the end of the process we have a more appropriate format, or data that is easier to understand, describe, and visualize. You’ll see what I mean in the examples below.
3.2.1. Type Converting
First of all, I want to convert the date column to the datetime type. That lets us work with dates as dates, not as strings. It can be done with the built-in pandas to_datetime function:
data['Date'] = pd.to_datetime(data['Date'], dayfirst=True)
3.2.2. Dealing with Missing Data
We also want to fill all NaN (empty) values with 0 so that pandas treats them as numbers, not missing values. This is necessary for further analysis and visualization. It can be done with the pandas fillna function.
data.fillna(0, inplace=True)
3.2.3. Changing the Dimensions
Looking at the data, we see that some values are in inconvenient units. It’s very difficult to deal with time in milliseconds or speed in meters per second. So we will convert the numbers to more familiar units and change the column names accordingly.
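A short sketch of this kind of conversion on made-up numbers. The column names are assumptions, as is the choice of minutes, km/h, and kilometers as target units:

```python
import pandas as pd

# Toy data; the column names are assumptions about the export.
data = pd.DataFrame({
    "Walking duration (ms)": [3_600_000, 1_800_000],
    "Average speed (m/s)": [1.0, 1.5],
    "Distance (m)": [4200.0, 2500.0],
})

# Milliseconds -> minutes, m/s -> km/h, meters -> kilometers.
data["Walking duration (ms)"] = data["Walking duration (ms)"] / 60_000
data["Average speed (m/s)"] = data["Average speed (m/s)"] * 3.6
data["Distance (m)"] = data["Distance (m)"] / 1000

# Rename the columns so the headers match the new units.
data = data.rename(columns={
    "Walking duration (ms)": "Walking duration (min)",
    "Average speed (m/s)": "Average speed (km/h)",
    "Distance (m)": "Distance (km)",
})

print(data)
```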
Now the dataset looks much better and clearer.
3.3. Adding information
Sometimes we need to add information to our dataset. For example, I want to add one column to our data table: the day of the week (I’m curious whether my sports activities depend on it). In pandas, every datetime value has a built-in dayofweek property. Applying it, we get a weekday number for each date, from 0 (Monday) to 6 (Sunday). We don’t want to use numbers: they are not informative. So we convert them into words and change the order of the columns a little.
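One way this could look, sketched on three made-up dates. The column names "Day of Week" and "Move Minutes count" are assumptions for illustration:

```python
import pandas as pd

# Toy data; column names are assumptions.
data = pd.DataFrame({
    "Date": pd.to_datetime(["2019-10-07", "2019-10-12", "2019-10-13"]),
    "Move Minutes count": [120, 30, 80],
})

weekday_names = [
    "Monday", "Tuesday", "Wednesday", "Thursday",
    "Friday", "Saturday", "Sunday",
]

# dt.dayofweek yields 0 (Monday) to 6 (Sunday); map the numbers to names.
number_to_name = dict(enumerate(weekday_names))
data["Day of Week"] = data["Date"].dt.dayofweek.map(number_to_name)

# Reorder the columns so the weekday sits next to the date.
data = data[["Date", "Day of Week", "Move Minutes count"]]

print(data)
```

pandas also offers dt.day_name(), which returns the names directly; the explicit mapping above just mirrors the two-step description in the text.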
After all these transformations, our dataset looks much better.
4. Data Exploration
And now let’s start the most interesting part: understanding the data we have. Here I’ll explore and visualize only part of it for brevity and simplicity. For those who are interested, there is a link to the GitHub repo at the end of the text, where you can find the complete version of this project.
4.1. General activity
Ok, our dataset is clean and ready for exploration. Let’s dive into it.
We’ll begin with the general indicator of my daily activity: active minutes, that is, how many minutes of sports I did every day. Let’s draw a simple graph.
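A minimal sketch of such a line graph with Matplotlib. The seven values and the column name "Move Minutes count" are made up for illustration:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy data: one made-up week of active minutes.
data = pd.DataFrame({
    "Date": pd.date_range("2019-10-01", periods=7, freq="D"),
    "Move Minutes count": [95, 40, 120, 0, 75, 150, 60],
})

fig, ax = plt.subplots(figsize=(10, 4))
ax.plot(data["Date"], data["Move Minutes count"])
ax.set_xlabel("Date")
ax.set_ylabel("Active minutes")
ax.set_title("Daily active minutes")
fig.autofmt_xdate()  # tilt the date labels so they do not overlap
```

In a Jupyter notebook the figure is rendered at the end of the cell; in a plain script, call plt.show() afterwards.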
Interesting. It looks like a heartbeat. The number of minutes I’m active is neither constant from day to day nor increasing from month to month. It seems to be random.
Ok, if there is no trend here, at least we can answer one important question: do I do enough to stay healthy? The World Health Organization recommends at least 150 minutes of moderate-intensity physical activity per week. Let’s check whether my activity meets this recommendation.
First, we need to calculate the sums of minutes for each week. We see that the last week includes only one day, so we’ll discard this value.
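A sketch of one way to compute weekly sums and discard incomplete weeks, using resample on made-up data. The Monday-to-Sunday week boundaries (the pandas default for "W") are an assumption and may not match the weeks used in the project:

```python
import pandas as pd

# Toy data: 15 days starting on a Monday, 30 active minutes each day.
daily_minutes = pd.Series(
    30,
    index=pd.date_range("2019-10-07", periods=15, freq="D"),
)

# "W" groups the days into weeks ending on Sunday.
weekly = daily_minutes.resample("W").agg(["sum", "count"])

# Keep only the weeks that contain all seven days.
full_weeks = weekly.loc[weekly["count"] == 7, "sum"]

print(weekly)
print(full_weeks)
```

Here the third week contains a single day, so it is dropped and two full weeks remain.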
And now we can make a plot for this data. We add a horizontal line at 150 minutes to see whether my weekly activity exceeds it.
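A possible sketch of such a plot: weekly totals as bars and the 150-minute threshold drawn with axhline. The weekly numbers are made up for illustration:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up weekly totals of active minutes, indexed by week-ending Sunday.
weekly_minutes = pd.Series(
    [420, 310, 180, 505],
    index=pd.date_range("2019-10-13", periods=4, freq="W"),
)

fig, ax = plt.subplots(figsize=(8, 4))
labels = list(weekly_minutes.index.strftime("%d %b"))
ax.bar(labels, weekly_minutes.values)

# Horizontal reference line at the recommended weekly minimum.
ax.axhline(150, color="red", linestyle="--", label="150 minutes per week")
ax.set_ylabel("Active minutes per week")
ax.legend()

# Count the weeks that fall below the threshold.
weeks_below = int((weekly_minutes < 150).sum())
print(weeks_below)
```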
Looks fine. At least I can say that I make enough effort to stay healthy.
But how many minutes of sport do I usually do? A histogram can help us answer this question.
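A small sketch of a histogram with 30-minute bins. Both the values and the bin width are assumptions chosen for illustration:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up daily active minutes.
minutes = pd.Series([15, 45, 70, 95, 100, 110, 115, 130, 140, 170])

# Bin edges every 30 minutes: 0, 30, ..., 210.
bin_edges = list(range(0, 211, 30))

fig, ax = plt.subplots()
counts, edges, _ = ax.hist(minutes, bins=bin_edges, edgecolor="black")
ax.set_xlabel("Active minutes per day")
ax.set_ylabel("Number of days")

# Index of the most common interval.
busiest_bin = int(counts.argmax())
print(edges[busiest_bin], edges[busiest_bin + 1])
```

ax.hist returns the counts per bin together with the bin edges, so the most common interval can be read off without looking at the picture.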
We see that this histogram looks almost like a normal distribution. The most common interval lies between 90 and 120 minutes. Not so bad. But after two and a half hours the number of days drops sharply. And I never walked for more than three hours.
And the last question I have about these general values: do they depend on the day of the week? This is exactly why we added the “Day of Week” column to our dataset.
First, we sum the minutes for each day of the week. Then we can calculate the average activity in minutes per day of the week. But we should be careful: each day of the week may occur a different number of times in the period. So we need to count how many times each day of the week appears in our dataset, and only then divide the sums by these counts to get the average active minutes per day of the week.
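The three steps (sum, count, divide) can be sketched with groupby on made-up data. The column names are assumptions:

```python
import pandas as pd

# Toy data: Monday appears three times, Saturday twice.
data = pd.DataFrame({
    "Day of Week": ["Monday", "Monday", "Saturday", "Monday", "Saturday"],
    "Move Minutes count": [120, 90, 30, 60, 50],
})

grouped = data.groupby("Day of Week")["Move Minutes count"]

total_minutes = grouped.sum()    # minutes summed per weekday
day_counts = grouped.count()     # how often each weekday occurs
average_minutes = total_minutes / day_counts

print(average_minutes)
```

Dividing the sums by the counts is what grouped.mean() does in a single call; the explicit version just makes the reasoning visible.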
We can draw a chart that shows these results visually.
It looks very natural. Like many other people, I do a lot on Mondays and prefer to stay at home and rest on Saturdays (you should know that Sunday is a working day where I live).
4.2. Kinds of activities
And now it’s time to talk about all the different kinds of activities I did during those three months.
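A sketch of how the share of each activity could be computed from per-activity duration columns. The column names and numbers are made up, and the real project may build this breakdown differently:

```python
import pandas as pd

# Made-up minutes per activity for three days; column names are assumptions.
data = pd.DataFrame({
    "Walking duration (min)": [60, 80, 45],
    "Swimming duration (min)": [0, 30, 0],
    "Running duration (min)": [10, 0, 0],
})

# Total minutes per activity, then each activity's share in percent.
totals = data.sum()
shares = (totals / totals.sum() * 100).round(1).sort_values(ascending=False)

print(shares)
```

The resulting series can be passed to a pie or bar chart to show the proportions visually.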
Impressive, huh? I did almost nothing except walk around. Walking makes up more than 75% of all my activities. Swimming is in second place, then a little calisthenics, and then almost invisible running, gymnastics, and pilates.
Ok. This is no time to be upset. I can improve it later.
4.3. Walking
Since walking is my main kind of activity, let’s talk about it more specifically.
We see that the average walking time is 75 minutes a day. More than an hour. When we go to work or to the store every day, we don’t even think about how much time we spend moving from one place to another. But it’s important because it has an impact on our whole life.
But what distance did I walk over all those days?
The minimum distance was 0. This is logical: of course, there were days when I stayed at home. And the maximum distance I walked was a little more than 12 kilometers. That’s quite a lot. Let’s look at when it happened.
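A sketch of how such figures can be obtained with describe, idxmax, and sum. The three rows are made up and the column name "Distance (km)" is an assumption:

```python
import pandas as pd

# Made-up distances for three days.
data = pd.DataFrame({
    "Date": pd.to_datetime(["2019-10-01", "2019-10-02", "2019-10-03"]),
    "Distance (km)": [4.0, 9.5, 0.0],
})

# Count, mean, min, max, and quartiles in one call.
summary = data["Distance (km)"].describe()

# idxmax gives the row label of the largest value; loc fetches that row.
longest_day = data.loc[data["Distance (km)"].idxmax()]

# Total distance over the whole period.
total_km = data["Distance (km)"].sum()

print(summary)
print(longest_day["Date"].date(), total_km)
```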
The 28th of November. Oh, I remember that day. I traveled with my family. It was a really cool day.
One more takeaway from this data is that I usually walk more than four kilometers every day. Not so much, but not so little, huh?
And the last question about the distance: how many kilometers did I walk during those three months in total?
Wow! More than 400 kilometers!
One more metric related to distance is the number of steps. A widely cited target, often attributed to the World Health Organization, is at least 10,000 steps a day. Let’s see if I succeed or not.
Frankly speaking, no. I rarely take as many steps as I should. Let’s count on what percentage of days I met this target.
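A minimal sketch of this calculation on five made-up days. The series name "Step count" is an assumption:

```python
import pandas as pd

# Made-up daily step counts.
steps = pd.Series([12000, 8000, 10000, 4500, 9500], name="Step count")

goal = 10_000

# Boolean series: True on days that reach the goal.
met_goal = steps >= goal

# The mean of a boolean series is the fraction of True values.
percent_met = met_goal.mean() * 100

print(f"{percent_met:.1f}% of days reached {goal} steps")
```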
Not enough, I think. I should work on it.
5. Interpreting results
Here we should answer the question(s) we asked at the start of the research and either confirm or reject our assumption. As you remember, I assumed that using the Google sports application every day would make me more athletic.
Well, it’s wrong. None of the graphs shows a steady increase over time. So using the app, or any sort of app, will not do the job for you. But it can show you what is really happening. And with this data you can decide what to do. It’s up to you alone, and that is good news.
Now I know that walking is my strong suit. But I really have to make a huge effort to push myself into regular training. And I shouldn’t rely on external stimuli. The motivation is inside. I just need the courage to admit it and follow this way.
Conclusion
So we explored data from the Google Fit application for a period of three months. We saw what data we have, then cleaned, reorganized, described, and visualized it. We identified relationships and patterns and drew charts to illustrate them.
Working with data allows us to learn more about the phenomena that surround us, draw scientific conclusions that shed light on what was previously considered unknowable, and make informed decisions in all areas of application. That’s why I like data and data analysis. I hope to continue my studies, and if I find something interesting, I’ll share it with people.
And here is the link to the GitHub repository with the complete version of the project: https://github.com/shebeolga/Google-Fit-Data-Analysis. I’ll appreciate your comments and additions to the topic.