The Data You Generate… and the Dangers that Lurk Within It

Zach Lyons
Spring 2019 — Information Expositions
Mar 16, 2019

With the big tech revolution nearing its peak in 2019, humans are generating an unprecedented amount of data every day. Just about everything we do generates a digital record somewhere, whether it is a Facebook like, a Tinder swipe, or a Netflix sign-up. In fact, by one widely cited estimate, around 90% of the world’s data has been generated in the past two years.

However, this increased connectivity comes with its own set of privacy concerns. If the data you generate every day falls into the wrong hands, you could be at serious risk of identity theft.

Most data is anonymized to some degree when it is published online. Data tables like voter registration records, real estate information, and others are often posted with some or all personally identifiable information (PII) removed. However, if the appropriate joins and merges are made with other data tables, your data can be re-identified and used to infer your behavior.

The Data We Generate

As I stated earlier, just about everything we do online creates some sort of digital footprint. The data we generate contains varying degrees of PII, so it’s important to make sure you trust the website you’re using before you click ‘yes’ on the Terms and Conditions. All of this data has a purpose: banks need to keep transaction records, websites need to keep your login credentials on file, and shopping websites like to track your preferences to more accurately predict products you’ll like.

A sample of some mock data

For this assignment, I’ve used the website Mockaroo.com to make some realistic-looking mock data that reflects a data-capturing scenario that users frequently experience in the real world. Here are the tables I made, with their attributes listed as well:

  • shopping_account_df: phone, username, password, credit card number (if saved), merchant, time zone, country, and state.
  • voter_records_df: first, last, phone, state, city, and postal code.
  • bank_account_df: sign-in email, password, credit card number (if saved), social security, and billing address.
  • transactions_df: credit card number, credit card type, amount spent, transaction type, merchant, and billing address.

Each of these reflects a different real-world scenario. For example, transactions_df was created to resemble a list of purchase records from the credit card company’s or merchant’s perspective. In the next section, we’ll use these tables to demonstrate what can happen if your data falls into the wrong hands.

Examples of Data Re-Identification and Malpractice

Even though the data tables I’ve created are intended to have varying degrees of anonymity, it is still easy to re-identify individuals by using the proper joins and merges. If two tables have a column of data in common, we can use the .merge() method in pandas to join them into a single table that reveals more about each individual entry.

Let’s start with the transactions_df data frame. This one is mostly anonymized, as billing address is its only piece of identifiable information. However, when we join this table with shopping_account_df using credit card numbers, we start to learn a little more about who is making the purchase.

A merge between shopping_account_df and transactions_df
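
The merge described above can be sketched roughly as follows. This is a minimal illustration, not the notebook behind the image: the rows are made up, and the column names (credit_card, amount, username, and so on) are assumptions, since the original only lists the attributes in plain English.

```python
import pandas as pd

# Hypothetical mock rows; column names are assumed for illustration.
transactions_df = pd.DataFrame({
    "credit_card": ["4111-0001", "4111-0002", "4111-0003"],
    "card_type": ["visa", "visa", "mastercard"],
    "amount": [25.50, 310.00, 12.75],
    "merchant": ["Acme", "Globex", "Acme"],
    "billing_address": ["12 Oak St", "99 Pine Ave", "7 Elm Rd"],
})

shopping_account_df = pd.DataFrame({
    "credit_card": ["4111-0001", "4111-0003"],
    "username": ["jdoe88", "msmith"],
    "phone": ["555-0101", "555-0199"],
    "state": ["CO", "TX"],
})

# Inner join: keep only transactions whose card number also appears
# in a shopping account, attaching the account details to each purchase.
purchaser_profiles = transactions_df.merge(
    shopping_account_df, on="credit_card", how="inner"
)

print(purchaser_profiles[["credit_card", "amount", "username", "phone"]])
```

Each surviving row now pairs a purchase and billing address with a username and phone number, which is more than either table revealed on its own.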

Now that the merge has turned this anonymous credit card record into a purchaser profile with a little more context, we can potentially find even more personal data. For example, if a person uses a username that is similar to their email, or uses the same username/password combo across multiple sites, that’s when things can start to get really dangerous. In the sample image above, I used a fake user’s shopping account username to find his bank account profile in bank_account_df, which is reflected in line 23. This bank account profile has his Social Security number and address, two pieces of information that could be used to steal his identity.
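
One way the username-to-email lookup could work is sketched below. It assumes the attacker guesses that the part of the sign-in email before the "@" matches the shopping username. The rows and column names (email, ssn, and so on) are hypothetical, and the exact matching used for the image is not stated in this post.

```python
import pandas as pd

# Hypothetical mock rows; column names are assumed for illustration.
shopping_account_df = pd.DataFrame({
    "username": ["jdoe88", "msmith"],
    "credit_card": ["4111-0001", "4111-0003"],
})

bank_account_df = pd.DataFrame({
    "email": ["jdoe88@example.com", "other.person@example.com"],
    "ssn": ["000-00-0001", "000-00-0002"],
    "billing_address": ["12 Oak St", "41 Birch Ln"],
})

# Derive a join key from the bank table: the text before the "@".
bank_keyed = bank_account_df.assign(
    username=bank_account_df["email"].str.split("@").str[0]
)

# Any shopping username that doubles as an email prefix links the two accounts.
linked = shopping_account_df.merge(bank_keyed, on="username", how="inner")

print(linked[["username", "email", "ssn", "billing_address"]])
```

Only the reused name links up here; the account with an unrelated username stays unmatched.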

Merchants can use this sort of info as well. For example, if they have access to transactions_df, they can filter the dataset down to transactions that took place at their store. From there, they can use the billing address listed in the data to merge with other tables, such as voter records, to see who is actually doing the buying.

Finding the merchant’s top spenders in this dataset

Merchants can use this type of research to identify who their top buyers are, and what type of products they buy. They can also use it to tailor marketing to you, a practice that has been made common by Google AdSense.
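
A rough sketch of the merchant scenario is below. The merchant name "Acme", the rows, and the column names are all made up. The voter table is also given an address column so the join has a shared key, which is an assumption beyond the attributes listed earlier.

```python
import pandas as pd

# Hypothetical mock rows; column names are assumed for illustration.
transactions_df = pd.DataFrame({
    "merchant": ["Acme", "Globex", "Acme", "Acme"],
    "amount": [25.50, 310.00, 12.75, 40.00],
    "billing_address": ["12 Oak St", "99 Pine Ave", "7 Elm Rd", "12 Oak St"],
})

# Assumption: the voter table carries a street address to join on.
voter_records_df = pd.DataFrame({
    "first": ["Jane", "Mark"],
    "last": ["Doe", "Smith"],
    "address": ["12 Oak St", "7 Elm Rd"],
})

# Step 1: keep only the transactions made at this merchant's store.
acme = transactions_df[transactions_df["merchant"] == "Acme"]

# Step 2: total spending per billing address, highest first.
top_spenders = (
    acme.groupby("billing_address", as_index=False)["amount"]
    .sum()
    .sort_values("amount", ascending=False)
)

# Step 3: attach names by matching billing address to voter address.
named = top_spenders.merge(
    voter_records_df, left_on="billing_address", right_on="address", how="left"
)

print(named[["first", "last", "billing_address", "amount"]])
```

The left join keeps every top spender even when no voter record matches, so unmatched addresses simply show up with missing names.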

Takeaway

To wrap things up, it’s important to consider the information you give out online and where it is going. Just about every click, swipe, or data entry you make on a device leaves some sort of digital footprint on a server somewhere. The samples I’ve shown you are just two of the many ways your personal data can be re-identified online.

Based on my analysis, I have two main recommendations. First, vary your usernames and passwords online, so that your data from two different sources cannot be easily joined. Second, avoid putting highly personal information (credit card numbers, Social Security numbers, etc.) into a website unless it is absolutely necessary. You will probably need to give this information to banks, landlords, employers, etc., but avoid circulating it anywhere else if possible.

Stay safe out there, friends.

Note: all data used is completely fictional and does not reflect any real people. Any similarities are merely coincidental.
