Automating Cloud Storage Data Classification — Part 3

Get Cooking in Cloud

Priyanka Vergadia
Google Cloud - Community
4 min read · Mar 16, 2020


Authors: Priyanka Vergadia, Jenny Brown

#GetCookingInCloud

Introduction

“Get Cooking in Cloud” is a blog and video series to help enterprises and developers build business solutions on Google Cloud. In this series we identify specific topics that developers want to architect on Google Cloud, and then create a miniseries around each one.

In this miniseries, we will go over the automation of data classification in Google Cloud Storage, for security and organizational purposes.

  1. Overview of the Use Case and Overall Process
  2. A deeper dive into creating the buckets and the Cloud Pub/Sub topic and subscription
  3. Creating Cloud Functions with the DLP API and Testing (this article)

In this article we will create the Cloud Functions, set up the DLP API, and test our application for automated data classification.

What you’ll learn and use

How to automate the upload and classification of data with Google Cloud Storage.

  • App Engine for the frontend
  • Google Cloud Storage to store recipe submission files
  • Pub/Sub for messaging
  • Cloud Functions for some quick automation in serverless fashion
  • DLP API to detect private information
  • This solution for automating the classification of data uploaded to Cloud Storage

Check out the video

Video: Automating Cloud Storage Data Classification: DLP API and Cloud Functions

Review

In the last two posts [1] [2], we talked about Dinner Winner, an application that collects recipes from users all over the world, judges them, and then posts the winning recipe.

We know there might be personal information initially tied to these submissions that needs to be removed before anything else can happen, and we want this to be a seamless, automated process.

We also saw an architecture showing how to automate the classification of uploaded data using Cloud Storage, Cloud Functions, the DLP API and Cloud Pub/Sub.

Architecture for automated data classification using Cloud Pub/Sub, Cloud Functions and the DLP API

Two Cloud Functions for Dinner Winner

For Dinner Winner, we need two Cloud Functions: one that is invoked when an object is uploaded to Cloud Storage, and another that is invoked when a message is received on the Cloud Pub/Sub topic.

Let’s start by creating the Cloud Function that is triggered by the GCS bucket.

Create Cloud Function — “create_DLP_job”
  1. We will open Cloud Functions and create a new function, and name it “create_DLP_job”.
  2. In the trigger field, we will select Cloud Storage.
  3. Select the quarantine bucket from the bucket browser.
  4. Select Python 3.7 as the runtime.
  5. In the inline editor, copy the code from the tutorial repository (linked in the testing steps below) into main.py and requirements.txt.
  6. You will find a create_DLP_job function there, which creates a DLP inspection job whenever a recipe is uploaded to the bucket (a sketch of it follows this list).
  7. In the “Function to execute” field, we will replace hello_gcs with create_DLP_job and create the function.
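
The exact code lives in the tutorial repository, but the GCS-triggered function looks roughly like the minimal sketch below, using the google-cloud-dlp client library. The project ID, topic name and info types here are placeholders, not the repository’s actual values.

```python
# Minimal sketch of a GCS-triggered function that starts a DLP inspection job.
# Project ID, topic name and info types below are placeholders.
import os

import google.cloud.dlp_v2

PROJECT_ID = os.environ.get("GCP_PROJECT", "your-project-id")  # placeholder
PUB_SUB_TOPIC = "classify-topic"                               # placeholder topic
INFO_TYPES = ["PERSON_NAME", "EMAIL_ADDRESS", "LOCATION", "PHONE_NUMBER"]

dlp = google.cloud.dlp_v2.DlpServiceClient()


def create_DLP_job(data, context):
    """Triggered when a recipe file lands in the quarantine bucket."""
    bucket_name = data["bucket"]
    file_name = data["name"]

    inspect_job = {
        "inspect_config": {
            "info_types": [{"name": t} for t in INFO_TYPES],
            "min_likelihood": "POSSIBLE",
        },
        "storage_config": {
            "cloud_storage_options": {
                "file_set": {"url": f"gs://{bucket_name}/{file_name}"}
            }
        },
        # When inspection finishes, DLP publishes to this topic, which in turn
        # triggers the resolve_DLP function created in the next section.
        "actions": [
            {"pub_sub": {"topic": f"projects/{PROJECT_ID}/topics/{PUB_SUB_TOPIC}"}}
        ],
    }

    dlp.create_dlp_job(
        request={"parent": f"projects/{PROJECT_ID}", "inspect_job": inspect_job}
    )
```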

Now, we need to create the Cloud Function that is invoked when a message is received on the Cloud Pub/Sub topic.

  1. It’s pretty much the same, but this time we will choose Cloud Pub/Sub in the trigger field and enter our Pub/Sub topic.
  2. We will use the same function code as before, but in the “Function to execute” field we enter resolve_DLP this time (a sketch of that function follows this list). And with that, both of our Cloud Functions are ready.
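
Again, the repository has the full code; a hedged sketch of the Pub/Sub-triggered function is below. It assumes the DLP Pub/Sub action passes the job name in the message attributes (DlpJobName), and the sensitive and non-sensitive bucket names are placeholders.

```python
# Minimal sketch of the Pub/Sub-triggered function: read the finished DLP job
# and move the inspected file to the matching bucket. Bucket names are placeholders.
import google.cloud.dlp_v2
from google.cloud import storage

SENSITIVE_BUCKET = "dinner-winner-sensitive"        # placeholder bucket
NONSENSITIVE_BUCKET = "dinner-winner-nonsensitive"  # placeholder bucket

dlp = google.cloud.dlp_v2.DlpServiceClient()
storage_client = storage.Client()


def resolve_DLP(data, context):
    """Triggered by the Pub/Sub message that DLP publishes when a job finishes."""
    job_name = data["attributes"]["DlpJobName"]  # assumes this attribute is set
    job = dlp.get_dlp_job(request={"name": job_name})

    # The job records which gs:// object it inspected.
    url = (job.inspect_details.requested_options.job_config
           .storage_config.cloud_storage_options.file_set.url)
    source_bucket_name, file_name = url.replace("gs://", "").split("/", 1)

    # If any configured info type produced findings, treat the file as sensitive.
    findings = sum(stat.count for stat in job.inspect_details.result.info_type_stats)
    dest_bucket_name = SENSITIVE_BUCKET if findings > 0 else NONSENSITIVE_BUCKET

    # Copy to the destination bucket, then remove the file from quarantine.
    source_bucket = storage_client.bucket(source_bucket_name)
    blob = source_bucket.blob(file_name)
    source_bucket.copy_blob(blob, storage_client.bucket(dest_bucket_name), file_name)
    blob.delete()
```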

Testing the Automation

And now, for the moment of truth. Has our recipe for secure application automation done the trick for Dinner Winner? We need to test to ensure proper automation and detection of private info types in the content.

For the purposes of this exercise, we are defining the sensitive data info types as name, email address, location and phone number, but you can change these in the code to suit your own use case.
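
In DLP terms these correspond to built-in infoType detectors; in the function code the list might look like the snippet below (the infoType names are standard, but treat the variable itself as illustrative).

```python
# Illustrative list of DLP infoTypes treated as sensitive in this example;
# edit it to match your own use case.
INFO_TYPES = ["PERSON_NAME", "EMAIL_ADDRESS", "LOCATION", "PHONE_NUMBER"]
```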

Dinner Winner
  1. Open Cloud Shell.
  2. Clone the git repo: https://github.com/GoogleCloudPlatform/dlp-cloud-functions-tutorials.git
  3. Navigate to the sample data directory.
  4. Copy the files over to the GCS quarantine bucket (a scripted alternative is sketched after this list).
  5. The DLP API inspects and classifies each file uploaded to the quarantine bucket and moves it to the appropriate target bucket based on its classification.
  6. Open the buckets and review the uploaded files.
  7. Check a file in the sensitive-data bucket to confirm that it does contain sensitive data.
  8. Check a file from the non-sensitive bucket to make sure it doesn’t contain any sensitive data.
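
If you would rather script step 4 than copy the files by hand, here is a small sketch using the google-cloud-storage library; the bucket name and sample-data path are placeholders for your own quarantine bucket and the cloned repository.

```python
# Upload every sample file to the quarantine bucket; each upload triggers
# the create_DLP_job function. Bucket name and path are placeholders.
import glob
import os

from google.cloud import storage

QUARANTINE_BUCKET = "dinner-winner-quarantine"            # placeholder bucket
SAMPLE_DIR = "dlp-cloud-functions-tutorials/sample_data"  # placeholder path

client = storage.Client()
bucket = client.bucket(QUARANTINE_BUCKET)

for path in glob.glob(os.path.join(SAMPLE_DIR, "*")):
    bucket.blob(os.path.basename(path)).upload_from_filename(path)
    print(f"Uploaded {path} to gs://{QUARANTINE_BUCKET}")
```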

So that’s it! Dinner Winner has their upload and classification automated, and can return to their “gamified” recipe submission with confidence.

Conclusion

If you’re looking to classify data in an automated fashion, you’ve got a small taste of the challenges involved and the ingredients needed. Stay tuned for more articles in the Get Cooking in Cloud series and check out the references below for more details.

Next steps and references:
