REAP: 4 principles for Data Security

Published in KC AI Lab, LLC on Oct 12, 2018

It feels like our personal data gets less secure every day. Stories about data breaches are in the news alarmingly often.

As more of the economy goes digital, and as companies collect ever larger oceans of data to drive new and deeper insights through data science, we risk breaches that keep growing in scale, severity, and frequency.

As data scientists, each of us has a duty to protect the data we handle. The nature of our jobs requires that we sometimes have access to vast amounts of data. Very often this data will be proprietary to our employers or clients. It may personally identify millions of people. It may contain confidential business information and trade secrets. It may even contain government classified information. If data leaks to the public, or even just to the wrong person, it can ruin companies, destroy reputations, expose individuals to fraud, or worse. Businesses also face significant legal and regulatory risks when they collect data about their customers. The EU’s General Data Protection Regulation (GDPR) allows fines of up to 20 million euros or 4% of global annual turnover, whichever is higher, and more laws like GDPR and the California Consumer Privacy Act are certain to follow.

We are the stewards of the data we collect and process. It is incumbent on data professionals to make sure we don’t handle it in a careless manner that could expose sensitive information to unauthorized parties.

Four principles, whose initials spell REAP, help us reduce or eliminate the risk of leaking sensitive data:

Restrict data access to specific, authorized parties. For the sake of ease, it can be tempting to share data sets in a location that is widely (or even publicly) accessible, for instance by making an Amazon S3 bucket public or by sharing a Dropbox folder through a link that anyone can use. Resist the temptation to relax security. Grant access only to the specific parties who need it.
Encrypt any sensitive data you have a legitimate need to store. Also, use private, encrypted channels to send data to others. Don’t transmit sensitive data in the clear over the Internet.
Anonymize data for analysis by scrubbing sensitive, identifying attributes, such as PII or financial account numbers. A data set does not need these identifying attributes to yield valuable business insights. For example, a product recommendation model can be trained to predict countless valuable things about a person’s shopping behavior without ever needing to know that person’s name, address, phone number, or account number.
Purge sensitive data when you no longer need it. Think critically about what you genuinely need to keep. Analysts and engineers should not retain their own copies of sensitive data. Organizations should define the life cycle of their data, decide on a retention policy, put it in writing, and enforce it.
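
Two of these principles, Anonymize and Purge, are easy to illustrate in a few lines of Python. What follows is a minimal sketch, not a production implementation. The field names, the `customer_token` key, the function names, and the 30-day retention window are all hypothetical choices made for illustration. Note also that replacing an identifier with a keyed hash is, strictly speaking, pseudonymization: anyone holding the key can re-link records, so the key itself needs protecting, and GDPR still treats pseudonymized data as personal data.

```python
import hashlib
import hmac
import time
from pathlib import Path
from typing import List, Optional

# Hypothetical field names; adjust to the data set at hand.
DIRECT_IDENTIFIERS = {"name", "address", "phone", "account_number"}


def anonymize(record: dict, secret_key: bytes,
              id_field: str = "account_number") -> dict:
    """Drop identifying fields and replace the ID with a keyed hash.

    The keyed hash (HMAC-SHA256) lets rows for the same customer be
    joined without exposing the real account number.
    """
    token = hmac.new(secret_key,
                     str(record[id_field]).encode("utf-8"),
                     hashlib.sha256).hexdigest()
    cleaned = {k: v for k, v in record.items()
               if k not in DIRECT_IDENTIFIERS}
    cleaned["customer_token"] = token
    return cleaned


def purge_expired(directory: Path, retention_days: int,
                  now: Optional[float] = None) -> List[Path]:
    """Delete files last modified before the retention window.

    Returns the paths that were removed. This only unlinks the files;
    it does not scrub backups or other copies.
    """
    now = time.time() if now is None else now
    cutoff = now - retention_days * 86400  # seconds per day
    removed = []
    for path in directory.iterdir():
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            removed.append(path)
    return removed
```

In use, an analyst might call `anonymize(row, key)` on every row before the data reaches a notebook, and a scheduled job might call `purge_expired(Path("exports"), retention_days=30)` to enforce the written retention policy (the `exports` directory is likewise an assumed name).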

The future will be even more data-driven than the present. As data scientists, we get to help ensure that this data-driven world is done right. In fact, it is our duty to be good data stewards. We can help humanity reap the benefits of big data and machine learning without sacrificing privacy and confidentiality.

If you want to be a data superhero, drop us a line:

kcail.com

Written by Alexs Thompson