Exploratory Data Analysis about “my behavior” over a week

Kexin Zhai
Spring 2019 — Information Expositions
Mar 16, 2019

Using fake data to do data analysis

This week’s assignment was to combine a diary study with a speculative analysis of how a person could be re-identified. One difference in this week’s analysis was that the data were fake.

Load Data

My research question was: what factors affect my study performance? I used a website called Mockaroo to generate my fake dataset. All the data were generated randomly based on custom settings.

This was what my dataset looked like:

First five rows of the dataset

It had 14 columns and 1000 rows. Here is an explanation of each column.

  • Study time: hours spent studying per day
  • Sleep time: hours of sleep per day
  • Time spent on phone: hours spent on the phone per day
  • Breakfast: whether I ate breakfast that day
  • Coffee: whether I drank coffee that day
  • Buff Swipe: number of Buff OneCard swipes per day
  • Diet type: whether the day’s diet had more vegetables, more meat, or an equal balance
  • Mood: mood during the day
  • Weather: weather during the day
  • Study place: where I studied
  • Use of planner: whether I used a planner to plan my studying
  • Number of courses studied: number of classes I studied for per day
  • Website used when studying: website I clicked on while studying
  • Email account: email account used to log in to the websites

Then, I wanted to release a “personal” dataset that had my sleep hours, whether I had coffee and breakfast, the number of Buff OneCard swipes, my diet type for the day, and my mood.

“Personal” dataset

Next, there was a “factor” dataset that included time spent on the phone per day, weather conditions, and study place.

“Factor” dataset

Last, I wanted to release a “study” dataset that had study hours per day, study place, whether I used a planner, mood, the number of courses studied, and the websites used while studying.

“Study” dataset
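A minimal sketch of how this split could look in pandas. The column names, the category labels, and the random values are all hypothetical stand-ins for the Mockaroo export, which I am not reproducing here; only the 1000 x 14 shape and the column groupings come from the description above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000  # same row count as the fake dataset

# Hypothetical column names and category labels; the real export may differ.
full_df = pd.DataFrame({
    "study_time": rng.integers(0, 10, n),
    "sleep_time": rng.integers(3, 12, n),
    "phone_time": rng.integers(0, 8, n),
    "breakfast": rng.choice([True, False], n),
    "coffee": rng.choice([True, False], n),
    "buff_swipe": rng.integers(0, 6, n),
    "diet_type": rng.choice(["vegetable", "meat", "balanced"], n),
    "mood": rng.choice(["happy", "neutral", "tired"], n),
    "weather": rng.choice(["sunny", "cloudy", "snowy"], n),
    "study_place": rng.choice(["library", "home", "cafe"], n),
    "planner": rng.choice([True, False], n),
    "num_courses": rng.integers(1, 5, n),
    "website": rng.choice(["canvas", "youtube", "wikipedia"], n),
    "email": [f"user{i}@example.com" for i in range(n)],
})

# Each released dataset is just a subset of columns from the full table.
personal_df = full_df[["sleep_time", "breakfast", "coffee",
                       "buff_swipe", "diet_type", "mood"]]
factor_df = full_df[["phone_time", "weather", "study_place"]]
study_df = full_df[["study_time", "study_place", "planner",
                    "mood", "num_courses", "website"]]

print(full_df.shape, personal_df.shape, factor_df.shape, study_df.shape)
```

Note that “mood” appears in both the personal and study subsets, and “study place” appears in both the factor and study subsets, so those shared columns are what a later join can match on.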

Joining datasets and EDA

I used “study_df” as the left DataFrame and merged in the “personal_df” and “factor_df” DataFrames using right, left, inner, and outer joins, trying to find the join that gave the best re-identified dataset. All four join types produced exactly the same number of rows, with different structures. I decided to do exploratory data analysis on the datasets built with the inner join.
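To illustrate why the four join types can end up with the same row count, here is a small sketch. It assumes the join key is the shared “mood” column (the post does not say which key was used) and uses tiny made-up tables. When every key value appears on both sides, no rows are dropped or padded by any join type, so only the row order differs.

```python
import pandas as pd

# Tiny made-up stand-ins for study_df and personal_df.
study_df = pd.DataFrame({
    "mood": ["happy", "tired", "tired", "neutral"],
    "study_time": [5, 2, 3, 4],
})
personal_df = pd.DataFrame({
    "mood": ["tired", "happy", "neutral", "happy"],
    "sleep_time": [5, 8, 7, 6],
})

# Run the same merge with each join type and record the row count.
row_counts = {}
for how in ["left", "right", "inner", "outer"]:
    merged = study_df.merge(personal_df, on="mood", how=how)
    row_counts[how] = len(merged)

print(row_counts)  # {'left': 5, 'right': 5, 'inner': 5, 'outer': 5}
```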

Dataset 1 — “personal_study_inner_df”

I used “study_df” as the left DataFrame and merged in the “personal_df” DataFrame with an inner join to create “personal_study_inner_df”. This dataset had 334112 rows and 12 columns.
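The merged table is much larger than the 1000-row inputs. That can only happen when join key values repeat in both DataFrames, because each left row is paired with every right row that has the same key. The sketch below assumes the key was “mood” with three hypothetical labels, which is my guess rather than something stated above, and checks that the inner-join row count equals the sum over key values of (left count × right count).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
moods = ["happy", "neutral", "tired"]  # hypothetical mood labels
mood = rng.choice(moods, 1000)         # one mood per row, shared by both tables

study_df = pd.DataFrame({"mood": mood, "study_time": rng.integers(0, 10, 1000)})
personal_df = pd.DataFrame({"mood": mood, "sleep_time": rng.integers(3, 12, 1000)})

# Many-to-many inner join on a key with only three distinct values.
merged = study_df.merge(personal_df, on="mood", how="inner")

# Predicted size: for each mood, (rows on the left) * (rows on the right).
left_counts = study_df["mood"].value_counts()
right_counts = personal_df["mood"].value_counts()
expected_rows = int((left_counts * right_counts).sum())

print(len(merged), expected_rows)
```

With three roughly equal groups, the result lands near 1000 × 1000 / 3 rows, the same order of magnitude as the count reported above. One consequence is that each original day is repeated hundreds of times in the merged table, which is worth keeping in mind when reading the plots.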

The first five rows in “personal_study_inner_df”

I checked value_counts for some of the columns.
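For readers unfamiliar with it, `value_counts` tallies how often each value appears in a column. A minimal sketch with made-up mood values:

```python
import pandas as pd

# Made-up values, only to show the shape of the output.
df = pd.DataFrame({"mood": ["tired", "happy", "tired", "neutral", "tired", "happy"]})

counts = df["mood"].value_counts()                # absolute counts, most frequent first
shares = df["mood"].value_counts(normalize=True)  # same tallies as fractions of all rows

print(counts)
print(shares)
```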

When I slept 6 hours, I studied longer. The graph showed that 6–8 hours of sleep led to longer study time during the day. Sleeping longer than I needed also slightly affected how long I studied.

Sleep time vs. Study time
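A comparison like this one can be summarized numerically by grouping on sleep hours and averaging study hours. This is a sketch with made-up numbers and hypothetical column names, not the values behind the graph:

```python
import pandas as pd

# Made-up rows; sleep_time and study_time are hypothetical column names.
df = pd.DataFrame({
    "sleep_time": [5, 5, 6, 6, 7, 8, 10, 10],
    "study_time": [2, 3, 6, 7, 5, 5, 3, 4],
})

# Average study hours for each number of sleep hours.
mean_study = df.groupby("sleep_time")["study_time"].mean()

print(mean_study)
```

The same pattern works for the other factors (mood, coffee, planner use) by swapping the grouping column.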

Mood was not a big factor in study performance. Still, feeling tired during the day led to less study time than the other two moods.

Drinking coffee did not influence study performance. Having breakfast was not a factor either. (The two graphs looked similar, so I included only one of them.)

Diet type was not a factor influencing study performance.

Using a planner to organize my studying led to slightly longer study time.

The graph showed that when I studied for three classes in a day, I studied longer that day.

Dataset 2 — “factors_inner_df”

I used “study_df” as the left DataFrame and merged in the “factor_df” DataFrame with an inner join to create “factors_inner_df”. This dataset had 29718 rows and 9 columns.

I checked value_counts.

In this dataset, being productive went with better study performance.

The hard part here was finding out whether phone use had a positive or a negative influence on study.

Weather conditions were not a factor influencing study performance.

This dataset also showed that studying for three classes in a day went with longer study time.

Overall, diet-related factors did not influence study performance. Sleep hours were a critical factor affecting study hours. Lack of sleep would

Reflection

Part of this data could be collected online, and part could be collected through a diary study. If I didn’t dine in a campus cafeteria, the dining system would have no record of me. Going forward, I would change how much time I spend on my phone while studying and get enough sleep every day to protect the quality of my learning.
