Automating Cloud Storage Data Classification — Part 1

Get Cooking in Cloud

Priyanka Vergadia
Google Cloud - Community
5 min read · Mar 4, 2020


Authors: Priyanka Vergadia, Jenny Brown

Introduction

“Get Cooking in Cloud” is a blog and video series to help enterprises and developers build business solutions on Google Cloud. In this series, we identify specific topics that developers want to architect on Google Cloud, and then create a miniseries on each one.

In this miniseries, we will go over automating data classification in Google Cloud Storage for security and organizational purposes.

  1. Overview of the Use Case and Overall Process (this article)
  2. Deeper Dive into Creating the Buckets and the Cloud Pub/Sub Topic and Subscription
  3. Creating Cloud Functions with the DLP API and Testing

In this article we will cover the overall use case and define the problem.

Check out the video

Use Case: Changing the Way We Think About Security

Sometimes security goes beyond protection from external threats. Maybe you need to keep part of your data separate from the rest of your dataset for confidentiality reasons, or your organizational structure necessitates a similar type of classification.

Either of these use cases can become a manually intensive process, so we’ll take a look at how to make these situations easier by automating the classification of data you upload to Google Cloud Storage.

In the next few blogs, we’ll use Cloud Functions, Cloud Storage, and the Cloud Data Loss Prevention (DLP) API to help automate the classification of data uploaded to Google Cloud Storage.
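
As a quick preview of the setup Part 2 walks through, here is a minimal sketch using the Python client libraries. Every resource name below (project, buckets, topic, subscription) is a hypothetical placeholder, not the series’ actual naming:

# Preview of the Part 2 setup, sketched with the Python client libraries.
# All names below (project, buckets, topic, subscription) are hypothetical.
from google.cloud import pubsub_v1, storage

PROJECT = "dinner-winner-demo"  # hypothetical project ID

# Three buckets: a quarantine bucket for raw uploads, plus one bucket each
# for classified sensitive and non-sensitive files. Bucket names must be
# globally unique, so adjust these placeholders.
storage_client = storage.Client(project=PROJECT)
for bucket_name in ("dw-quarantine", "dw-sensitive", "dw-recipes"):
    storage_client.create_bucket(bucket_name)

# A Pub/Sub topic and subscription to connect the two Cloud Functions.
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(PROJECT, "classify-requests")
publisher.create_topic(request={"name": topic_path})

subscriber = pubsub_v1.SubscriberClient()
sub_path = subscriber.subscription_path(PROJECT, "classify-requests-sub")
subscriber.create_subscription(request={"name": sub_path, "topic": topic_path})

The same resources can of course be created in the Cloud Console or with gcloud; the client-library version just makes the moving pieces explicit.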

Dinner Winner

To illustrate this scenario, we’ll work on helping our friends over at Dinner Winner (because we love food :)). Dinner Winner is an application that collects recipe submissions from users all over the world, and posts regular winning recipes for everyone to enjoy.

The recipe submissions need to be assessed for clarity and scrubbed of any identifiable information before they are sent on to judging. After the anonymous judging takes place, winners are contacted and their recipes are posted to the application!

To get a solid understanding of where we might run into issues, let’s see how their system works today:

Example image to show manual extraction of confidential information from a file.

During the competition, users upload their recipes in text format, and these recipes get stored in a blob store. From there, they’re manually checked for private information, like the participant’s email and phone number. This private information is then removed before the recipes are sent to the judges for voting. And, as we know, the winning recipe then makes it onto the platform for everyone to see.

Problems with this scenario?

  1. No room for growth: this is a manual process, and with a ton of volume, the system will scale poorly.
  2. Security: it’s difficult to know what has been reviewed and updated, and what still has private information in it. Any issues here could lead to corruption of the contest and brand damage for Dinner Winner.

So, the issue here is multi-layered. Submitted recipes can’t mix with reviewed recipes, because contestants’ private information could leak and corrupt the contest. And the current process is manual, which means problems across the board, especially when it comes to volume. Quarantining and classifying that data can be complicated and time-consuming, especially at hundreds or thousands of files a day.

In an ideal world, we would be able to upload to a quarantine location, and have files automatically classified and moved to the appropriate location based on the classification result.

So we’ve established that Dinner Winner needs an automated process for sorting submissions and categorizing reviewed submissions. Let’s break down the ingredients and review the recipe we’ll be following.

How do we automate this process?

Architecture and flow diagram of automated data classification using the DLP API, Cloud Functions, Google Cloud Storage, and Cloud Pub/Sub
  • The user uploads a text file (including the recipe, name, date, email, and phone number) through the web interface.
  • The uploaded file is sent to a Cloud Storage bucket, which triggers a Cloud Function that creates the Data Loss Prevention job and publishes a notification to Cloud Pub/Sub.
  • Once Pub/Sub receives the event, it triggers another Cloud Function, which calls the DLP API to look for confidential information in the recipe file.
  • If confidential infoTypes are found, they are removed and the function moves the file to a separate storage bucket. If the file does not contain any of the identified infoTypes, the function moves it to another bucket. A simplified sketch of these two functions follows this list.
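
To make the flow concrete, here is a simplified, hypothetical sketch of those two Cloud Functions (first-generation Python runtime). It compresses the series’ actual setup: the first function just publishes the file’s location to Pub/Sub rather than creating a DLP job, the redaction step is omitted, and every name (project, topic, buckets) is a placeholder:

# Simplified sketch of the two Cloud Functions in the flow above.
# All resource names are hypothetical; redaction is omitted for brevity.
import base64
import json
import os

from google.cloud import dlp_v2, pubsub_v1, storage

PROJECT = os.environ.get("GCP_PROJECT", "dinner-winner-demo")  # hypothetical
TOPIC = "classify-requests"        # hypothetical topic name
SENSITIVE_BUCKET = "dw-sensitive"  # hypothetical bucket names
NONSENSITIVE_BUCKET = "dw-recipes"

publisher = pubsub_v1.PublisherClient()

def on_upload(event, context):
    """Triggered when a recipe file lands in the quarantine bucket;
    publishes the file's location to Pub/Sub for classification."""
    message = {"bucket": event["bucket"], "name": event["name"]}
    # Block until the publish completes before the function exits.
    publisher.publish(
        publisher.topic_path(PROJECT, TOPIC),
        json.dumps(message).encode("utf-8"),
    ).result()

def classify(event, context):
    """Triggered by the Pub/Sub message; inspects the file with the
    DLP API and moves it to the matching bucket."""
    message = json.loads(base64.b64decode(event["data"]).decode("utf-8"))

    storage_client = storage.Client()
    src_bucket = storage_client.bucket(message["bucket"])
    blob = src_bucket.blob(message["name"])
    text = blob.download_as_text()

    # Ask DLP to look for the infoTypes we care about.
    dlp = dlp_v2.DlpServiceClient()
    response = dlp.inspect_content(
        request={
            "parent": f"projects/{PROJECT}",
            "inspect_config": {
                "info_types": [
                    {"name": "EMAIL_ADDRESS"},
                    {"name": "PHONE_NUMBER"},
                    {"name": "PERSON_NAME"},
                ],
            },
            "item": {"value": text},
        }
    )

    # Route the file based on whether any confidential findings came back.
    dest = SENSITIVE_BUCKET if response.result.findings else NONSENSITIVE_BUCKET
    src_bucket.copy_blob(blob, storage_client.bucket(dest), message["name"])
    blob.delete()

In this sketch, on_upload would be deployed with a Cloud Storage finalize trigger on the quarantine bucket, and classify with a trigger on the Pub/Sub topic; Part 3 walks through the real functions and how to test them.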

This request flow keeps the confidential data separate from the non-confidential data, and it’s automated: as soon as a file hits the Google Cloud Storage bucket, all the other processes kick off automatically to protect the data.

If you want to know more about constructing this setup, check out the next blogs in this series, where we’ll walk through it step by step!

Conclusion

If you’re looking to classify data in an automated fashion, you’ve now got a small taste of the challenges involved and the ingredients needed. Stay tuned for the next articles in the Get Cooking in Cloud series for more details.

