Generating Fake Data and Examining Grouping and Aggregating

Published in

Spring 2019 — Information Expositions

4 min read · Mar 16, 2019

Many people worry that the data they hand over could be linked back to them, and one place this could happen is through their car company. When purchasing a car, you provide the company with personal information such as your name, car type, credit card information, email, gender, and more. This data could also be used to examine trends and patterns in the car industry. To demonstrate, here is an example dataset from Mockaroo that can be treated as data collected from car sales. The dataset has 1,000 rows and the headers name, city, state, email, gender, credit card, credit card type, money in the bank, company, college, car make, car model, and car make year.

From here, there are many ways that a person could be tracked via this data frame. One example to explore is a grouped comparison of the state a person is from with the average amount of money in their bank account. The first step in this process is to examine the value counts for the states in the data frame. This is seen below:
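The original code for this step is not reproduced here, so the following is only a minimal sketch of what counting rows per state could look like in pandas. The tiny data frame is a toy stand-in for the 1,000-row Mockaroo export, and the column names `state` and `money_in_bank` are assumptions rather than the dataset's actual headers.

```python
import pandas as pd

# Toy stand-in for the Mockaroo export (the real data has 1,000 rows).
# Column names are assumptions, not the dataset's actual headers.
mock_df = pd.DataFrame({
    "state": ["California", "Texas", "California", "Florida",
              "Texas", "California", "Maine"],
    "money_in_bank": [52000.0, 18000.0, 91000.0, 40000.0,
                      67000.0, 23000.0, 96000.0],
})

# Count how many rows fall in each state, most frequent first.
state_counts = mock_df["state"].value_counts()
print(state_counts)
```

Because `value_counts` sorts from most to least frequent by default, the states with the most rows appear at the top and the thinly represented ones at the bottom.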

From this first step, we already know that the largest numbers of subjects in this data set come from California, Texas, and Florida. Next, we want to get a grasp on the average amount of money a person has in the bank. To do this, we first collect the states in the data frame and then create a dictionary that stores the average of the money column for each state. Once this is complete, we can view the resulting averages as a series.
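One way the dictionary of averages might be built is sketched below. This is an illustration under assumptions, not the article's original code: the toy rows, the column names `state` and `money_in_bank`, and the variable names `state_means` and `mean_series` are all hypothetical.

```python
import pandas as pd

# Toy stand-in data; column names are assumptions.
mock_df = pd.DataFrame({
    "state": ["California", "California", "Texas", "Texas", "Maine"],
    "money_in_bank": [50000.0, 70000.0, 20000.0, 40000.0, 90000.0],
})

# Collect the distinct states, then average the money column for each one.
states = mock_df["state"].unique()
state_means = {}
for state in states:
    in_state = mock_df["state"] == state
    state_means[state] = mock_df.loc[in_state, "money_in_bank"].mean()

# Turn the dictionary into a Series, sorted so the extremes are easy to spot.
mean_series = pd.Series(state_means).sort_values(ascending=False)
print(mean_series)
```

Sorting the series puts the highest and lowest averages at either end, which is the comparison the next paragraph draws on.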

From this, we can observe that the highest average is in Maine, at $96,728.59, and the lowest is in Mississippi, at $13,315.18. However, we know from the value counts that Maine had only one participant and Mississippi had two, so these averages are not based on enough data to support any claims. With that caveat in mind, we can then group the data frame by state and compute the average amount of money a person has.
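The same averages can be produced in a single step with `groupby`. The sketch below is one plausible version, again using toy rows and the assumed column names `state` and `money_in_bank`. The `count` column is an addition made here so that states resting on only one or two rows stand out next to their averages.

```python
import pandas as pd

# Toy stand-in data; column names are assumptions.
mock_df = pd.DataFrame({
    "state": ["California", "California", "Texas", "Texas", "Maine"],
    "money_in_bank": [50000.0, 70000.0, 20000.0, 40000.0, 90000.0],
})

# Group rows by state and aggregate the money column.
# "count" is included so averages based on very few rows are easy to flag.
mock_df_gb = mock_df.groupby("state")["money_in_bank"].agg(["mean", "count"])
print(mock_df_gb)
```

Compared with the explicit loop over states, `groupby` does the splitting and averaging internally and returns a result indexed by state.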

Thanks to the groupby, we now have each state paired with its average amount of money, contained in mock_df_gb. When it comes to connecting this data to a person, one way is to compare the type of car a person has with the money they have in the bank. This can be done by using the person's state as the X value, the amount of money as the Y value, and the type of car to determine the color. This can be seen in the visualization created from this data below:
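The original figure is not reproduced here. As a rough sketch of how such a plot could be drawn with matplotlib, the example below puts the state on the X axis and the money on the Y axis, and draws each car make as a separate series so that it gets its own color. The toy rows and the column names `state`, `money_in_bank`, and `car_make` are assumptions, and the plotting library the author actually used is not known.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Toy stand-in data; column names are assumptions.
mock_df = pd.DataFrame({
    "state": ["California", "California", "Texas", "Texas", "Utah", "Utah"],
    "money_in_bank": [20000.0, 85000.0, 15000.0, 70000.0, 41000.0, 44000.0],
    "car_make": ["Ford", "Toyota", "Ford", "Honda", "Toyota", "Honda"],
})

fig, ax = plt.subplots(figsize=(8, 4))

# One scatter call per car make, so each make gets its own color and legend entry.
for make, rows in mock_df.groupby("car_make"):
    ax.scatter(rows["state"], rows["money_in_bank"], label=make)

ax.set_xlabel("State")
ax.set_ylabel("Money in the bank ($)")
ax.legend(title="Car make")
fig.tight_layout()
```

Calling `fig.savefig(...)` with a file name of your choice, or `plt.show()` under an interactive backend, would render the figure.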

A trend we notice in this data is that California, Texas, and Florida have the most spread-out amounts of money. That spread would lead someone using the data to avoid targeting people in these states, because the money range fluctuates so widely. They would instead target states with a more noticeable trend, such as Tennessee, Utah, Kansas, Ohio, Illinois, and Minnesota, and an argument could be made for a few others. These states show a noticeable average in the visualization, which could help car companies create a marketing plan for a certain state and income.
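The spread described above is read off the plot by eye. It could also be put into numbers with a per-state standard deviation, as in the sketch below. This is an addition for illustration, not part of the original analysis: the toy rows, the column names, and the `min_rows` cutoff are all assumptions.

```python
import pandas as pd

# Toy stand-in data; column names are assumptions.
mock_df = pd.DataFrame({
    "state": ["California", "California", "California",
              "Utah", "Utah", "Utah", "Maine"],
    "money_in_bank": [20000.0, 50000.0, 80000.0,
                      40000.0, 42000.0, 44000.0, 96000.0],
})

# Row count, mean, and sample standard deviation of money per state.
spread = mock_df.groupby("state")["money_in_bank"].agg(["count", "mean", "std"])

# Drop states with too few rows to say anything (hypothetical cutoff),
# then list the states with the tightest spread first.
min_rows = 2
tightest = spread[spread["count"] >= min_rows].sort_values("std")
print(tightest)
```

A state with a single row has no defined standard deviation, which is one more reason to filter on the row count before comparing states.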

When it comes to preventing this data from being connected to you, there aren’t many options that would stop someone from connecting the dots. One possibility is to use a debit or credit card solely for the purchase of the car and nothing else. Another is to move from the city you lived in when your data was recorded. A last option is to buy a car of a different make and model. These options would not necessarily prevent the data from being connected to you, but they would at least make the process more challenging. Data is recorded when we do just about anything, and it is hard to take preventative measures against something like purchasing a car because it is something so many of us do.

Written by Brandon Roten