Merging My Personal Data to Find Revealing Information About Myself

Most people interact with technology every day and rarely stop to consider what data is being collected about them and their every move in the digital world. Honestly, just thinking about how much data countless companies have on me begins to stress me out, because I don’t know what they know or what they plan to do with it. I was recently asked to consider two questions: What traces of data do you generate in a week, and how could these different pieces of data be combined to reveal aspects of your behavior that you wouldn’t necessarily want people to know?

I wanted to analyze my social media, Google, and credit card data over the span of three weeks to explore what I could find out about myself by joining these separate datasets. Credit card data alone is sensitive and potentially revealing, but when combined with social media and Google data I think it could be even more revealing. To begin my exploratory analysis, I used Mockaroo, a website that generates fake data, to craft a digital diary study. I created the mock datasets by considering how the data would be collected and organized if I were in charge of designing them. Many companies today collect an excessive amount of data on their users so that they can sell it to advertisers for a profit, but in my mock datasets I wanted to collect only the data that I deemed most important. The resulting datasets are shown below in Figure 1.

Figure 1: Column names of mock datasets

For the data analysis and joins, I generated 100 instances of credit card data, 300 instances of social media data, and 300 instances of Google data. After much trial and error, I used an outer join for the Google and social media dataframes, and then a right join between the resulting dataframe and the credit card history dataframe. The datasets were joined on the column containing the date of the digital footprint, because a date was present in all three datasets. I found that this combination of joins helped to limit the number of duplicates and null values in the final, three-way joined dataframe.
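The two-step join described above can be sketched in pandas roughly as follows. This is only a minimal illustration, not the actual notebook: the tiny dataframes, the dates, and the column names (`google_minutes`, `platform`, `amount`) are hypothetical stand-ins for the Mockaroo datasets.

```python
import pandas as pd

# Hypothetical miniature versions of the three mock datasets.
# Column names and values are assumptions made for illustration only.
google = pd.DataFrame({
    "date": ["2021-03-01", "2021-03-01", "2021-03-02"],
    "google_minutes": [12, 30, 8],
})
social = pd.DataFrame({
    "date": ["2021-03-01", "2021-03-01", "2021-03-03"],
    "platform": ["YouTube", "Reddit", "Instagram"],
})
card = pd.DataFrame({
    "date": ["2021-03-01", "2021-03-02"],
    "amount": [40.00, 15.50],
})

# Step 1: the outer join keeps every date that appears in either dataset.
online = google.merge(social, on="date", how="outer")

# Step 2: the right join keeps only dates that have a credit card transaction.
joined = online.merge(card, on="date", how="right")

print(online.shape)  # (6, 3)
print(joined.shape)  # (5, 4)
```

Even in this toy example the joined dataframe has more rows than any of its inputs, because every Google record on a given date is paired with every social media record on that date. The date with social media activity but no purchase (2021-03-03) is dropped by the right join.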

Now that the data was joined, I thought of three questions that I wanted to try to answer by analyzing the three-way joined dataset. The answers could expose aspects of my personality that might be used against me by advertisers or anyone else who would benefit from knowing my digital behavior. The three questions are:

  1. Does the amount of time I spend on Google and social media daily correlate with where I use my credit card?
  2. Is there any correlation between the number of times I click on a Google page or the number of ads I am exposed to in a day and the amount of money I spend?
  3. On average, is there a correlation between the social media platform I use and the amount of money I spend that day?

Even though I chose join methods that limited the number of duplicates and null values, a good number of them remained in the final merged dataframe. The dataframe was composed of 11,015 rows and 32 columns, which was significantly larger than the three original datasets, as seen in Figure 2.

Figure 2: Shape of the four dataframes (rows, columns)

Although the final dataframe was far bigger than the original ones, the join had little effect on the descriptive statistics. For example, the original credit card history dataset contained 100 transactions with an average of $92.48. The three-way joined dataframe contained 11,015 transaction records with an average only a little higher, at $94.83. The biggest difference was in volume: the three original datasets contained a combined 700 digital instances over three weeks, but after the three-way join the dataset contained 11,015 instances over the same time span. This happens because joining on the date pairs every record from one dataset with every record from the others that shares that date, so each real record is repeated many times. After the join, the average number of digital instances per day increased from 33 to 525, which is completely unrealistic. This increase is best seen in Figure 3, where the number of recorded instances is broken down by day.
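To see where the extra rows come from, here is a small sketch that predicts the size of the joined dataframe from nothing but the dates in each dataset. The function name `joined_row_count` and the sample dates are hypothetical, and the sketch assumes the same join order as above (outer join, then right join) with no missing dates.

```python
from collections import Counter

def joined_row_count(google_dates, social_dates, card_dates):
    """Predict the row count of: (google OUTER JOIN social) RIGHT JOIN card, all on date."""
    g, s, c = Counter(google_dates), Counter(social_dates), Counter(card_dates)
    # For each date with a transaction, every Google record pairs with every
    # social media record and every transaction. A dataset with no records on
    # that date contributes a single null-filled slot rather than zero rows.
    return sum(max(g[day], 1) * max(s[day], 1) * n_card for day, n_card in c.items())

# Hypothetical dates: two Google records and two social media records on the
# first day, one purchase on each of the first two days.
google_dates = ["03-01", "03-01", "03-02"]
social_dates = ["03-01", "03-01", "03-03"]
card_dates = ["03-01", "03-02"]

print(joined_row_count(google_dates, social_dates, card_dates))  # 5

# Per-day averages quoted above, assuming a 21-day (three-week) span.
print(round(700 / 21), round(11015 / 21))  # 33 525
```

Because each transaction is copied once for every Google and social media pairing on its date, the average transaction amount in the joined dataframe is effectively a weighted average. That would explain why it drifts slightly away from the original average instead of matching it exactly.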

Figure 3: Number of transactions before join (left) vs. after join (right)

After creating a mock diary study of my usage of Google, social media, and credit cards over a three-week span, joining the three datasets, and performing some initial exploratory analysis on the three-way joined dataframe, it was finally time to answer the three revealing questions I had posed above.

1. Does the amount of time I spend on Google and social media daily correlate with where I use my credit card?

After doing some analysis, I found that yes, the amount of time I spend on Google and social media correlates with where I use my credit card that day. My first finding was that the longer I am on Google and social media in a day, the more likely I am to shop at multiple stores. I also learned that I tend to shop at stores that sell beauty supplies, tools, and shoes on days when I spend a lot of time on social media, and that I shop for tools, gardening supplies, and baby products the most when I spend more time on Google.

Figure 4: Graphs that show the correlation between duration of time on Google per day and where credit cards are used
Figure 4: Graphs that show the correlation between duration of time on social media per day and where credit cards are used
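One way to check a relationship like this without the duplicated rows getting in the way is to collapse each dataset to one row per day before joining. The sketch below is an alternative to the three-way join used here, not the method behind the figures, and the data and column names (`google_minutes`, `store`, `amount`) are hypothetical.

```python
import pandas as pd

# Hypothetical per-record data; names and values are illustrative assumptions.
google = pd.DataFrame({
    "date": ["2021-03-01", "2021-03-01", "2021-03-02", "2021-03-03"],
    "google_minutes": [10, 20, 5, 40],
})
card = pd.DataFrame({
    "date": ["2021-03-01", "2021-03-01", "2021-03-02",
             "2021-03-03", "2021-03-03", "2021-03-03"],
    "store": ["Tools", "Shoes", "Tools", "Baby", "Garden", "Tools"],
    "amount": [20.0, 35.0, 10.0, 15.0, 25.0, 30.0],
})

# Collapse each dataset to one row per day so the join is one-to-one.
daily_google = google.groupby("date", as_index=False)["google_minutes"].sum()
daily_card = card.groupby("date", as_index=False).agg(
    stores_visited=("store", "nunique"),
    total_spent=("amount", "sum"),
)

# validate raises an error if either side still has repeated dates.
daily = daily_google.merge(daily_card, on="date", how="inner", validate="one_to_one")

# Pearson correlation between daily Google time and distinct stores visited.
r = daily["google_minutes"].corr(daily["stores_visited"])
print(daily)
print(r)
```

With only one row per day on each side, the merged dataframe has exactly as many rows as there are days, so the correlation is not inflated by repeated records.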

2. Is there any correlation between the number of times I click on a Google page or the number of ads I am exposed to in a day and the amount of money I spend?

As a result of merging the datasets, I discovered that, on average, the more I click and interact with a Google page, the more I spend on tools and games. The merged data also revealed that the more ads I am exposed to while on the Internet, the more money I spend on tools, gardening supplies, and baby products. This finding is consistent with what I found for question one: the more time I spend on Google, the more ads I am exposed to, and the more money I spend in the same three categories (tools, gardening, and baby).

Figure 5: Graphs that show the correlation between the clicks and ads on a Google page and where my credit cards are used

3. On average, is there a correlation between the social media platform I use and the amount of money I spend that day?

Lastly, I found a correlation between the social media platform I use and the amount of money I spend that day. The data revealed that, on average, I spend the most money on days when I use YouTube and Reddit. This is not something I would want advertisers to know, because they could target me even more heavily on those platforms, knowing that is when I am most likely to spend money. Meanwhile, they might show me less advertising on the platforms that don’t influence me to spend as much, such as Instagram, Twitter, and BuzzFeed.

Figure 6: Graph that shows the correlation between social media platform usage and average transaction amount that day
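A comparison like this can be sketched with a group-by on the platform. This is only an illustration under stated assumptions: the data is made up, the column names `platform` and `amount` are hypothetical, and "money spent that day" is taken to mean the total of that day's transactions.

```python
import pandas as pd

# Hypothetical records; names and values are illustrative assumptions.
social = pd.DataFrame({
    "date": ["2021-03-01", "2021-03-01", "2021-03-01", "2021-03-02", "2021-03-03"],
    "platform": ["YouTube", "YouTube", "Reddit", "Instagram", "YouTube"],
})
card = pd.DataFrame({
    "date": ["2021-03-01", "2021-03-01", "2021-03-02", "2021-03-03"],
    "amount": [60.0, 40.0, 20.0, 50.0],
})

# Total spend per day, so each day is counted once.
daily_spend = card.groupby("date", as_index=False)["amount"].sum()

# One row per (day, platform), so repeat visits do not add extra weight.
platform_days = social[["date", "platform"]].drop_duplicates()

# Attach the day's total spend to every platform used that day.
merged = platform_days.merge(daily_spend, on="date", how="left")

# Average daily spend on the days each platform was used.
avg_by_platform = (
    merged.groupby("platform")["amount"].mean().sort_values(ascending=False)
)
print(avg_by_platform)
```

In this toy data Reddit comes out on top (100.0), followed by YouTube (75.0) and Instagram (20.0). Note that a day on which two platforms were used counts toward both, so the averages describe spending on days a platform was used rather than spending caused by that platform.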

Reflecting on this experience of collecting my personal data, joining the datasets, and performing an analysis, I found it to be a very tedious process. I can’t think of many scenarios where a person would be willing to take the time to collect data and perform an exploratory analysis in an attempt to find out some juicy details about another person’s behavior. However, if you are still concerned about someone performing this kind of join, keep in mind that there is very little privacy left in this digitally connected world. If a service is free, it isn’t really “free”: you pay for it with your data, which the company sells for a profit. That means you should be cautious about how you search, post, and interact in the digital space, because people have the potential to identify who you are even from seemingly anonymized data. Nowadays so much data, of such variety, is being collected on each person that it would be hard to make yourself completely anonymous if someone were ambitious enough to identify you.
