Poor Data Ethics: Using automotive customer survey with pharmaceutical data

Published in

Spring 2019 — Information Expositions

Mar 16, 2019

For this module assignment, I combine two Mockaroo datasets: one representing a customer purchasing survey and one representing private pharmaceutical information. I treat the work as a diary study, with a section of speculative EDA on how this data could be misused. The first dataset contains basic information about prior purchases, namely the make, model, and year of the car each user bought. The survey also collects each user’s email, first name, last name, city, and state, which gives background information that we will use to link records during the EDA process.

The second dataset contains much more private data, such as the drugs each person is prescribed, the doctor’s overall diagnosis, the person’s home address, and even credit card information from payment. I will merge these two datasets to try to link an individual’s public information to their personal data, in order to illustrate unethical information science practices. To begin, I load both generated datasets individually and take a first look through the rows and columns.
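As a rough illustration of this loading step, here is a minimal sketch. The two inline CSV strings are hypothetical stand-ins for the Mockaroo exports, and every column name and value in them is my own assumption rather than the real schema.

```python
import io

import pandas as pd

# Hypothetical stand-ins for the two Mockaroo CSV exports.
# Every column name and value here is invented for illustration.
survey_csv = io.StringIO(
    "email,first_name,last_name,city,state,car_make,car_model,car_year\n"
    "ada@example.com,Ada,Ng,Boulder,CO,Honda,Civic,2012\n"
    "ben@example.com,Ben,Ortiz,Denver,CO,Ford,Focus,2009\n"
)
pharma_csv = io.StringIO(
    "email,drug,diagnosis,address,credit_card\n"
    "ada@example.com,Metformin,Type 2 diabetes,12 Elm St,0000-0000-0000-0001\n"
    "cy@example.com,Ibuprofen,Back pain,9 Oak Ave,0000-0000-0000-0002\n"
)

survey = pd.read_csv(survey_csv)
pharma = pd.read_csv(pharma_csv)

# First look at the rows and columns of each table.
for name, df in [("survey", survey), ("pharma", pharma)]:
    print(name, df.shape)
    print(df.head())
```

With real exports, the StringIO objects would simply be replaced by the paths to the downloaded CSV files.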

Here is the first dataset for the automotive user survey:

You can see that this data provides the make, model, and year of each user’s car.

Pharmaceutical dataset:

This data contains much more detailed personal information about the individuals in this medical database, and it includes credit card information in the final columns. Next, we look through both datasets for duplicates or interesting values worth investigating further. As a first check, I count how many unique values each dataset has for prescribed drugs and user emails, using basic pandas functions with “len” to count the unique values.
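The unique-value check described above might look roughly like the sketch below. It runs on a tiny toy table, and the "email" and "drug" column names are assumptions on my part.

```python
import pandas as pd

# Toy data; the "email" and "drug" column names are assumptions.
survey = pd.DataFrame(
    {"email": ["ada@example.com", "ben@example.com", "cy@example.com"]}
)
pharma = pd.DataFrame(
    {
        "email": ["ada@example.com", "cy@example.com", "dee@example.com"],
        "drug": ["Metformin", "Ibuprofen", "Ibuprofen"],
    }
)

# len() over the array of distinct values gives the unique count.
unique_emails = len(survey["email"].unique())
unique_drugs = len(pharma["drug"].unique())

# Series.nunique() gives the same number when no values are missing.
same_as_nunique = unique_drugs == pharma["drug"].nunique()

print(unique_emails, unique_drugs, same_as_nunique)
```

If the unique email count equals the number of rows, as it does here, then no email is duplicated within that table.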

As expected, we have 1,000 unique user email addresses, and the number of unique drugs in this dataset is around 857. Finding common drug pairings will likely be difficult: with such a wide variety of medical products, few drugs repeat, so overlap is less prominent. Before continuing, I wanted to do a quick EDA to see which values recur in the data. To do this, I use basic value_counts on each mock data column that I’m interested in investigating.
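A small sketch of this value_counts step, again on toy data with assumed column names. The isin check at the end is my own addition, one possible way to see how many emails the two tables share.

```python
import pandas as pd

# Toy data; column names are assumptions, not the real Mockaroo schema.
survey = pd.DataFrame(
    {"email": ["ada@example.com", "ben@example.com", "cy@example.com"]}
)
pharma = pd.DataFrame(
    {
        "email": ["ada@example.com", "cy@example.com", "dee@example.com"],
        "drug": ["Metformin", "Ibuprofen", "Ibuprofen"],
    }
)

# Frequency of each drug, most common first.
drug_counts = pharma["drug"].value_counts()
print(drug_counts.head())

# Number of survey emails that also appear in the pharmaceutical table.
shared_emails = int(survey["email"].isin(pharma["email"]).sum())
print(shared_emails)
```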

From here, my next step is to join the datasets. I will use a right join, which keeps every row of the right-hand table and drops rows from the left-hand table that have no match, so the result focuses on the email and drug comparison. The basic idea is to show how even an email address could easily be used to link you to your private information if it is breached or unethically accessed. Now that we have seen some of the basic values, we can move on to the join itself. To begin, we use a very simple pd.merge call to get a better feel for the data.
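A simple pd.merge call of the kind mentioned here could be sketched as follows. The tables and column names are toy assumptions, and I join on "email" because that is the field the two datasets are described as sharing.

```python
import pandas as pd

# Toy data; column names are assumptions for illustration only.
survey = pd.DataFrame(
    {
        "email": ["ada@example.com", "ben@example.com", "cy@example.com"],
        "car_make": ["Honda", "Ford", "Kia"],
    }
)
pharma = pd.DataFrame(
    {
        "email": ["ada@example.com", "cy@example.com", "dee@example.com"],
        "drug": ["Metformin", "Ibuprofen", "Ibuprofen"],
    }
)

# With no "how" argument, pd.merge performs an inner join,
# keeping only the emails that are present in both tables.
merged = pd.merge(survey, pharma, on="email")
print(merged)
```

Each surviving row now holds a person's car and their prescribed drug side by side, which is exactly the kind of linkage this assignment is meant to warn about.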

After this, we can combine the datasets on specific columns. This consists of creating two values and then using set indexing to select the email and drug columns from each. The driver survey data is instead supplemented with a random column, since it will be removed during the right join.
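One possible reading of this step is sketched below. It is only my interpretation: I index both toy tables by email with set_index, keep a single column from each, and use "car_make" as a hypothetical stand-in for the filler survey column.

```python
import pandas as pd

# Toy data; column names are assumptions for illustration only.
survey = pd.DataFrame(
    {
        "email": ["ada@example.com", "ben@example.com", "cy@example.com"],
        "car_make": ["Honda", "Ford", "Kia"],
    }
)
pharma = pd.DataFrame(
    {
        "email": ["ada@example.com", "cy@example.com", "dee@example.com"],
        "drug": ["Metformin", "Ibuprofen", "Ibuprofen"],
    }
)

# Index both tables by email, keeping one column from each.
# "car_make" stands in for the filler survey column.
left = survey.set_index("email")[["car_make"]]
right = pharma.set_index("email")[["drug"]]

# A right join keeps every pharmaceutical row. Survey rows without a
# matching email are dropped, and pharmaceutical rows without a
# matching survey row get NaN in the survey column.
linked = left.join(right, how="right")
print(linked)
```

The filler column can be dropped afterwards with linked.drop(columns="car_make") if only the email-to-drug link is of interest.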

The end result looks something like this for now: as you can see, an individual’s pharmaceutical information can be linked through nothing more than an email address. The process could also be repeated using first name or last name. Combining two datasets offers many ways to link user information at scale, for example for targeted advertising. I wanted to do more plotting and visualization with my dataset, but most of my columns held non-numerical values, which made plotting during EDA difficult.

Joining multiple datasets lets someone compile massive amounts of information at once, and this information could be sold to advertising companies or drug corporations so they can target consumers with more personalized advertisements. Comparing the two datasets lets an individual link user information across multiple sources, even through basic survey information such as an email address, as long as the private dataset contains the same field connecting you to your medical history. There are many unethical data science practices, and they must be considered deeply and thought through fully before an information scientist decides to curate this kind of information.

Written by Justin Klemperer