Bad Big Data Ethics: Overlooking User Privacy to Boost Business Profits

Published in

Spring 2019 — Information Expositions

9 min read · Mar 14, 2019

In this module I combine a diary study with a section of speculative EDA (exploratory data analysis) findings. The diary study draws on two main datasets. The first was generated by scraping user data from a cell phone application. It contains the email address each user signed up with, the gender displayed on their profile, the state they reside in, their marital status, and the language chosen in their settings. The second was acquired from a clothing store, which records sales per user along with more detailed membership data and customer surveys. The sales data includes each user's shirt size, shoe size, and the credit card company used for each transaction. The customer survey data exposes each user's favorite app, how frequently they use it, and their car make and model. The store conducts these surveys to better serve customers based on its own user metrics and sales trends.

Understanding these customers benefits the clothing business by enabling customized, targeted advertising. The store can now target customers by location, language, gender, most frequently used app, and sales trends. Showing new items or sale merchandise in a customer's size or style on the apps they use most can promote stronger sales outside the store. Furthermore, understanding customers' financial status lets the store better serve both shoppers looking for standard or modest clothing and niche customers willing to pay more for top-brand fashion goods. Users can be separated by financial status using credit card type and car model and year. Forming these tiers of financial status and of spending on luxuries or necessities allows the store to advertise exclusive name-brand merchandise to its high-tier customers and standard merchandise to the average shopper. This should boost sales of high-value goods without wasting advertising budget promoting expensive products to average shoppers. Conversely, advertising average goods to a luxury customer can reduce the chance of selling the luxury merchandise in stock.

To better evaluate and serve these customers, I added a column that estimates each user's height in inches from their shoe size. The function that estimates height from shoe size is (4.5 * shoe size + 140) / 2.54, which gives height in inches. Knowing these user dimensions and clothing sizes is valuable for a clothing store's advertising and stocking decisions. Through this user diary study I was able to generate speculative analysis questions and then pivot, join, and combine the two separate datasets in preparation for my final EDA process.
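The height estimate described above can be written as a small function. This is only a sketch of the formula as stated in the text; the function name is hypothetical, and treating the intermediate value as centimeters is an assumption inferred from the division by 2.54.

```python
CM_PER_INCH = 2.54


def estimate_height_inches(shoe_size: float) -> float:
    """Estimate height in inches from shoe size using the formula in the text."""
    # Assumption: the numerator is a height in centimeters, since it is
    # divided by 2.54 to obtain inches.
    height_cm = 4.5 * shoe_size + 140
    return height_cm / CM_PER_INCH


# A size 10.5 shoe maps to roughly 73.7 inches under this formula.
print(round(estimate_height_inches(10.5), 1))
```

Applied to a whole column of shoe sizes, the same function would produce the estimated-height column mentioned above.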

EDA: Speculative Analysis (Shows customer behavioral patterns)

To begin my analysis I explored both datasets freely, looking for interesting connections and customer behavior trends. I first looked at users' shoe sizes to see which size was purchased most often. It turns out size 7 is the most popular with 39 sales, followed by size 11 with 37, size 14 with 36, and, surprisingly, size 15 with 35. This tells the shoe department it must stock the full range of sizes, because the store has both smaller-footed customers around size 7 and larger-footed customers who consistently buy size 15. The average shoe size overall is 10.5, which suggests that size would be a good one to keep in stock across all shoe models.
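A count like the one above might be produced along these lines. The list of sizes is a made-up sample for illustration, not the store's data, and the variable names are hypothetical.

```python
from collections import Counter
from statistics import mean

# Hypothetical sample of purchased shoe sizes (not the article's real data).
shoe_sizes = [7, 7, 7, 11, 11, 14, 15, 10.5]

# Count how often each size was sold and list the most common sizes first.
size_counts = Counter(shoe_sizes)
for size, count in size_counts.most_common(3):
    print(f"size {size}: {count} sales")

# The overall average size, used as a rough guide for baseline stock.
print("average size:", mean(shoe_sizes))
```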

Turning to user habits and advertising, I gathered the phone applications customers use most and plotted usage trends by shoe size, language, and gender. This gives the advertising team a critical understanding of where to focus, depending on which target customer they are trying to reach. Based on shoe size and app usage for the top-selling sizes, new size 7 inventory would be best advertised on Instagram, size 11 on Facebook and Instagram, size 14 on Twitter, and size 15 on Twitter or Instagram.

Targeting users by language, the store would want to advertise to English-speaking customers on Instagram, French speakers on Twitter or Facebook, German speakers on all apps except Reddit, and Spanish speakers on Twitter or Tinder. I also got the general impression that males use Facebook and Snapchat more than females do, while females primarily use Instagram, Reddit, Tinder, and Twitter. The store should therefore advertise male products more on Facebook and Snapchat and focus female products on the other apps.
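One way to tabulate app usage by gender (or by language) is a cross-tabulation. The sketch below uses pandas with invented rows and hypothetical column names; it illustrates the kind of table behind such a comparison, not the actual analysis.

```python
import pandas as pd

# Hypothetical survey rows; column names and values are assumptions.
surveys = pd.DataFrame({
    "gender": ["F", "F", "F", "M", "M"],
    "favorite_app": ["Instagram", "Twitter", "Instagram", "Facebook", "Snapchat"],
})

# Share of each gender's users per app (each row sums to 1).
app_share = pd.crosstab(surveys["gender"], surveys["favorite_app"], normalize="index")
print(app_share.round(2))
```

Swapping `gender` for a language column would give the per-language view in the same way.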

I wondered whether a user's shoe size could predict the make of car they drive. Based on the plot, no one with a shoe size under 13 drives a Hummer, a large vehicle, whereas the Ferrari is compact, and one Ferrari driver, who is small, wears a size 8 shoe. Finally, the Lamborghini, being a bit larger than the Ferrari, allows drivers with size 12 shoes to operate it comfortably.

Since I now had a solid understanding of my datasets and the users within them, I could begin my final EDA process. Before I began, I ran through a simple join checklist, which showed that I could join the two datasets on either user id or gender. I checked the user ids in both data frames for uniqueness to ensure the join would not add or lose any records. Because of how the clothing company collected this data, there were no user ids that appeared in one dataset but not the other. To complete my join checklist, I confirmed that the years in the car model data were consistent.
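The uniqueness and coverage checks in that checklist could look roughly like this in pandas. The data frames, column names, and the `join_checklist` helper are all hypothetical stand-ins.

```python
import pandas as pd

# Hypothetical stand-ins for the two datasets; column names are assumptions.
app_users = pd.DataFrame({"user_id": [1, 2, 3], "gender": ["F", "M", "F"]})
store_sales = pd.DataFrame({"user_id": [1, 2, 3], "shoe_size": [7.0, 11.0, 14.0]})


def join_checklist(left: pd.DataFrame, right: pd.DataFrame, key: str) -> dict:
    """Summarize whether `key` looks safe to join on."""
    left_ids, right_ids = set(left[key]), set(right[key])
    return {
        # Duplicate keys would multiply rows during the join.
        "left_key_unique": left[key].is_unique,
        "right_key_unique": right[key].is_unique,
        # Keys present on only one side would produce missing values or lost rows.
        "only_in_left": sorted(left_ids - right_ids),
        "only_in_right": sorted(right_ids - left_ids),
    }


report = join_checklist(app_users, store_sales, "user_id")
print(report)
```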

My strategy for this EDA was a left join. I picked a left join because it preserves all of the main user shopping data while dropping any customer survey records in the secondary dataset that have no matching shopper. After the left join I rechecked my data and confirmed that it had not added any repeated rows and still covered the proper range of car model years. The first EDA question I wanted to explore with the new data was: which states are the top shoppers from? This analysis gives the store an idea of how much overhead is needed per state, based on user visits. My second question was: what is the average shoe size per state? This way each store knows which sizes to keep in stock in its state. My final question was: what is the average shirt size of the store's customers? Knowing this, the store can prioritize stocking those sizes over others to improve sales.
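A left join with a post-join sanity check might be sketched as follows. The frames and column names are invented for illustration; `validate` and `indicator` are standard options of pandas `DataFrame.merge`.

```python
import pandas as pd

# Hypothetical data: sales is the primary (left) table, surveys the secondary one.
sales = pd.DataFrame({
    "user_id": [1, 2, 3],
    "state": ["TX", "CO", "CA"],
    "shoe_size": [10.5, 12.0, 11.0],
})
surveys = pd.DataFrame({
    "user_id": [1, 2],
    "favorite_app": ["Instagram", "Twitter"],
})

# how="left" keeps every sales row; validate raises if user_id is duplicated
# on either side; indicator adds a _merge column showing where each row matched.
merged = sales.merge(
    surveys, on="user_id", how="left", validate="one_to_one", indicator=True
)

# Recheck: the join must not add or drop sales rows.
assert len(merged) == len(sales)
print(merged["_merge"].value_counts())
```

In this sample, the shopper without a survey response is kept with a missing `favorite_app` value rather than being dropped.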

The first plot shows the store which states have the most customers. With this plot the store can understand how much overhead merchandise each state needs to remain in stock and reach optimal sales. As an example, I examine the top three of the 10 states to expose customer behavior. Texas customers are the most consistent, with a mean of around 170 total users. Interestingly, Colorado has the fewest shoppers visiting the store on average. While California also has a lower customer base, it has higher outliers than Texas produces. Beyond the top three states, it was surprising to see that the mean for New York, home to a city store, is comparable to Texas's, yet New York's overall customer performance is underwhelming.

The second plot shows the store the average shoe size of its customers per state. With this plot each store can stock the shoe sizes that are most common in its state. For example, to best serve its customers a Texas store must stock at least sizes 9.5 to 11.5, where the outliers lie, and overstock the mean customer size of 10.5. To retain sales, Colorado stores must stock a larger range, from size 10 to 14, with the most stock in size 12. California stores should generally be stocked the same as Colorado stores for shoe size. Colorado customers have the biggest feet compared with the other states. Interestingly, DC customers have the smallest feet on average, with a lowest size of 5 and a highest of 11, though most DC customers purchase a 9.5.
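The per-state summary behind a plot like this can be computed with a group-by. The rows below are invented and the column names are assumptions; the sketch only shows the shape of the calculation.

```python
import pandas as pd

# Hypothetical joined data; values are illustrative only.
merged = pd.DataFrame({
    "state": ["TX", "TX", "CO", "CO", "DC"],
    "shoe_size": [10.0, 11.0, 12.0, 13.0, 9.5],
})

# Count, mean, and range of shoe sizes per state, largest mean first.
size_by_state = merged.groupby("state")["shoe_size"].agg(["count", "mean", "min", "max"])
print(size_by_state.sort_values("mean", ascending=False))
```

The `min` and `max` columns give each state's stocking range, and `mean` marks the size to stock most heavily.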

The third plot shows the store its customers' most common shirt sizes. With this plot the store can stock the proper shirt sizes across all locations to maximize sales without overstocking unpopular sizes. The results are fairly self-explanatory: the business should keep the most stock in Large shirts, followed by XL, Small, 3XL, 2XL, and XS, with Medium the least-purchased size. To further guide these storefronts on which styles and sizes to stock, I averaged the estimated height of all customers. The store's average customer height is about 74 inches, or roughly 6 ft 2 in. The store could explore this data further if it needs to look into stocking pants in the future.

Reflection:

Because I was able to gather personal user metrics from a social media application and easily combine them with the same customers' shopping and spending habits from a separate dataset, there is a clear identity and privacy concern. Having an identical user id for each person on both data collection servers does help personalized advertising, as my EDA shows, but it undermines customer trust and anonymity. It may be a long shot, but with this data someone could target a specific customer by identifying their state and the clothing store in that state, then stake out the store and try to match customers by the cars they arrive in, the clothing sizes they purchase, and the credit card they pay with. While that scenario sounds far-fetched, it is possible because the store's lack of privacy safeguards allows simple data joins.

To prevent users from being identified through combined survey results, the clothing store needs to generate a random user id for each data collection rather than one consistent id based on the email a customer used to sign up for their social media accounts, join the store membership, and complete the online survey. Keeping anonymous user ids in the different database collections would protect user identity while still allowing the company to analyze efficient store stocking and advertising tactics based on consumer actions. To avoid this happening to me, I should be more careful about what personal data I post and give away on my social media accounts. I should also create a few throwaway, non-personal email accounts for signing up for surveys and social media apps so that my email can't simply link data from one to the other.
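The random-id recommendation above could be sketched like this. It is a minimal illustration under assumptions, not the store's system: the `pseudonymize` helper and the sample emails are hypothetical, and each dataset gets its own independently generated ids.

```python
import uuid


def pseudonymize(emails):
    """Map each email to a fresh random id (hypothetical helper).

    The mapping is generated independently per dataset, so the same
    customer receives unrelated ids in different collections.
    """
    return {email: uuid.uuid4().hex for email in emails}


# Hypothetical customers appearing in both collections.
emails = ["a@example.com", "b@example.com"]

app_ids = pseudonymize(emails)    # ids used only in the app dataset
store_ids = pseudonymize(emails)  # ids used only in the store dataset

# The same email no longer shares an id across datasets, so a simple
# join on the id column cannot link the two records.
print(app_ids["a@example.com"] != store_ids["a@example.com"])
```

Analysis within each dataset still works on the random ids, while linking across datasets would require the mapping tables, which can be stored separately or discarded.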


Written by Steven Rothaus