You Probably Don’t Need Personally Identifiable Info (PII) in Your Data Warehouse

James Densmore
Mar 20, 2019

It’s no secret that regulations such as GDPR are changing how companies handle personal data. The impact isn’t limited to marketers and those in charge of production databases, either. The legal and ethical issues in handling such data have made their way into the domain of data science, machine learning and data warehousing.

I’ve worked for and advised several companies confronting the dual challenges of using a modern data warehouse to support data analysts and data scientists while ensuring they protect customer data. Protecting private data in a warehouse that’s used for modeling and analysis is a different problem from protecting it in a production app database. First, more people typically get access to a data warehouse than to a production database. Second, the activities of data scientists and analysts often involve downloading raw data for use in Python scripts, Jupyter Notebooks and so on. That means data is being transferred to a laptop or a VM.

In the past, I and others have preached the “only take what you need” rule to data scientists. If you don’t need to know an email address, don’t query it. That was years ago, and though many people are still using that approach and genuinely mean well, it’s time to get serious about protecting user data while keeping your team productive.

What do you actually need?

What’s interesting is that in nearly every use case I’ve encountered, a data scientist or analyst doesn’t need any PII. That includes emails, usernames, granular geographies, and even IP addresses. What they need is a way to associate an attribute or event with a single user, a high-level geography, a signup date, and so on.

When I suggest that an organization remove PII from a data warehouse, or never add it in the first place, I get several pushbacks.

“I need to know the user’s email/username so we can personalize their experience on the site or in emails.”

This is true in the end, but you don’t need it to build a model. One way to deal with this problem is to generate a unique ID for each user and store only that in your data warehouse. Build your models using this ID and pipe your results back to your production database, where you can map the ID back to your users. This is in no way a new concept, but it’s one more step, so it’s often not taken.
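The unique ID approach can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class `UserIdMap`, the helper `to_warehouse_row`, and the field names are all hypothetical, the mapping lives in memory here where a real system would keep it in a production-side table, and the email stands in for whatever key production actually uses.

```python
import uuid


class UserIdMap:
    """Hypothetical production-side map between real user keys and opaque IDs."""

    def __init__(self):
        self._to_surrogate = {}
        self._to_user = {}

    def surrogate_for(self, user_key):
        """Return the stable opaque ID for a user, creating one on first use."""
        if user_key not in self._to_surrogate:
            surrogate = uuid.uuid4().hex  # random, so it reveals nothing about the user
            self._to_surrogate[user_key] = surrogate
            self._to_user[surrogate] = user_key
        return self._to_surrogate[user_key]

    def user_for(self, surrogate):
        """Resolve an opaque ID back to the real user key (production side only)."""
        return self._to_user[surrogate]


def to_warehouse_row(row, id_map, pii_fields=("email", "username")):
    """Copy a production row, replacing the PII fields with the opaque ID."""
    clean = {k: v for k, v in row.items() if k not in pii_fields}
    clean["user_id"] = id_map.surrogate_for(row["email"])
    return clean


id_map = UserIdMap()
production_row = {
    "email": "ada@example.com",
    "username": "ada",
    "plan": "pro",
    "signup_date": "2019-01-15",
}
warehouse_row = to_warehouse_row(production_row, id_map)

# A model built in the warehouse returns scores keyed by the opaque ID;
# production resolves them back to real users.
scores = {warehouse_row["user_id"]: 0.87}
resolved = {id_map.user_for(uid): score for uid, score in scores.items()}
```

The important design point is that the map itself never leaves production, so nothing in the warehouse can be turned back into an email or username.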

The same goes for the use case of building a dashboard or other in-app reporting at the user level. I’d argue that’s not a job for the data warehouse but rather for a production system. If you need to crunch some statistics and feed them back into production, that’s fine. You can map them back to the user on the production side in the same way.

“We’re storing PII in our production database anyway, why not store it in the data warehouse too?”

The simple answer is that the fewer places you store PII, the less likely it is to leak. It’s also easier to comply with a request to delete a user’s data if it’s in fewer places. On the political end of things, you’ll run into less pushback from your organization (and your customers) if you aren’t exposing PII in your analysis activities.

“Why not just use column level security and encryption to ensure access to sensitive columns is restricted and protected?”

Doing so limits the risk, but it doesn’t eliminate it. Furthermore, you’re still exposing yourself to scrutiny, complex audits and meetings with your legal department (yay!) to ensure your policies are implemented correctly. It’s a lot easier to simply not bring that data in, show legal a list of the columns you have and be done with it. It also sends your customers a signal that you’re serious about protecting their data.
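One way to make “a list of columns” concrete is an allowlist applied at load time. The sketch below is a minimal illustration, not a prescribed implementation: `ALLOWED_COLUMNS`, `project_allowed` and `dropped_columns` are hypothetical names, and the column list is made up for the example.

```python
# Hypothetical list of columns agreed with legal; anything else is not loaded.
ALLOWED_COLUMNS = {"user_id", "country", "signup_date", "plan"}


def project_allowed(row, allowed=ALLOWED_COLUMNS):
    """Keep only allowlisted columns, so new PII fields are dropped by default."""
    return {k: v for k, v in row.items() if k in allowed}


def dropped_columns(row, allowed=ALLOWED_COLUMNS):
    """List the columns that would be dropped, for reviewing a new source."""
    return sorted(set(row) - set(allowed))


source_row = {
    "user_id": "9f1c",
    "email": "ada@example.com",
    "ip_address": "203.0.113.7",
    "country": "US",
    "signup_date": "2019-01-15",
    "plan": "pro",
}
loaded_row = project_allowed(source_row)
```

An allowlist fails safe: if a new column appears in the source, it stays out of the warehouse until someone deliberately adds it to the list.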

Is it really that simple?

Well, nothing’s simple when it comes to this kind of thing! For one, defining “PII” is more challenging than it sounds. Email addresses, names, etc. are obvious, but non-obvious cases exist as well. For example, many websites add usernames to URLs, which go into logs, which in turn get pulled into the warehouse. In reality, it’s best to consult with your internal legal and technical leadership to ensure that you’re handling cases like that in a manner that complies with your (and your clients’) local regulations as well as any client contracts. There are creative ways to handle each case; it’s all about knowing which ones are worth the effort.
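For the username-in-URL case, one possible treatment is to scrub the path before the log line is loaded. The sketch below assumes a hypothetical URL scheme in which profile pages live under `/users/<username>`; the pattern and the `scrub_url` helper are illustrative and would need to match your own routes.

```python
import re

# Assumed (hypothetical) URL scheme: /users/<username>/...
USERNAME_IN_PATH = re.compile(r"(/users/)[^/?#\s]+")


def scrub_url(url):
    """Replace the username path segment with a fixed placeholder."""
    return USERNAME_IN_PATH.sub(r"\1[redacted]", url)


scrubbed = scrub_url("/users/ada/settings?tab=billing")
untouched = scrub_url("/pricing")
```

A fixed placeholder throws away the per-user signal entirely. If analysts need to group log lines by user, swapping in the opaque warehouse ID instead is an option, at the cost of a lookup during loading.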

Eliminating PII from your warehouse is also not an excuse for securing it improperly or being loose about how and where you store data extracted from it. Just because the data doesn’t contain PII doesn’t mean it’s not valuable (otherwise you wouldn’t have a job!) or confidential.

I can promise you one thing, though: the less you expose your data team to PII, the less time you’ll spend worrying about compliance. And that means more time doing the things that you love to do with data.
