Automating Cloud Storage Data Classification — Part 2

Get Cooking in Cloud

Authors: Priyanka Vergadia, Jenny Brown

#GetCookingInCloud

Introduction

“Get Cooking in Cloud” is a blog and video series to help enterprises and developers build business solutions on Google Cloud. In this series, we identify specific topics that developers want to architect on Google Cloud, and then create a miniseries around each one.

In this miniseries, we go over automating data classification in Google Cloud Storage, for security and organizational purposes.

  1. Overview of the Use Case and Overall Process
  2. Deeper dive into Creating the Buckets and Cloud Pub/Sub topic and Subscription (this article)
  3. Creating Cloud Functions with the DLP API and Testing

In this article, we'll dive deeper into creating the Cloud Storage buckets we need and setting up Cloud Pub/Sub.

What you’ll learn and use

How to automate the upload and classification of data with Google Cloud Storage.

  • App Engine for the frontend
  • Google Cloud Storage to store recipe submission files
  • Pub/Sub for messaging
  • Cloud Functions for some quick automation in serverless fashion
  • DLP API to detect private information
  • This solution for automating the classification of data uploaded to Cloud Storage

Check out the video

Video: Automated cloud storage classification: Setup

Review

Dinner Winner is an application that collects recipe submissions from users all over the world, and posts regular winning recipes for everyone to enjoy.

The recipe submissions need to be assessed for clarity and scrubbed for any identifiable information, before they are sent on to judging. And after the anonymous judging takes place, winners are contacted, and their recipes are posted to the application!

Based on their architecture, we’ve identified the primary issues as security and cross-contamination risks, along with major efficiency gaps. The current manual process will inevitably lead to overload and lapses in quality, so we’ve decided on automation as our main fix.

A Four Step Process

Automated data classification architecture using Cloud Storage, Cloud Pub/Sub, Cloud Functions and DLP API
  1. Create Cloud Storage buckets to be used as part of the quarantine and classification pipeline.
  2. Create a Cloud Pub/Sub topic and subscription to notify between the two cloud functions.
  3. Create two simple Cloud Functions: one that is invoked when files are uploaded, and a second that uses the DLP API to inspect and classify the files and move them to the appropriate bucket.
  4. Upload some sample files to the quarantine bucket to invoke the Cloud Function.

In this blog, we’ll go over steps 1 and 2.

Setting up the Environment

Before we can get started on creating buckets, we need to set up our environment:

  • Create or select a GCP project and make sure billing is enabled.
  • Then go to the “APIs & Services” tab and enable the Cloud Functions, Cloud Storage, and Cloud Data Loss Prevention APIs.
Step 2: Enable the APIs and services (1)
  • Enable the correct permissions so our App Engine service account can connect with the DLP API.
  • To do that, open the IAM & Admin page in our project, locate the App Engine service account, and edit its roles.
  • Add the Project “Owner”, “DLP Administrator”, and “DLP API Service Agent” roles and save. NOTE: Since this is a demo, we are using the “Owner” role for the sake of simplicity. If you are developing a production app, specify more granular permissions than Project > Owner. For more information, see granting roles to service accounts.
Edit roles
  • Next, we grant the DLP service account the permissions it needs: locate the “DLP API Service Agent” account and add the Project “Viewer” role to it.
Edit permissions
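If you prefer the command line, the console steps above can also be scripted with the gcloud CLI. A minimal sketch, where the project ID (my-classification-project) is a placeholder you'd replace with your own; the App Engine default service account follows the standard PROJECT_ID@appspot.gserviceaccount.com naming:

```shell
# Enable the APIs the pipeline needs (project ID is a placeholder).
gcloud services enable cloudfunctions.googleapis.com \
    storage.googleapis.com dlp.googleapis.com \
    --project=my-classification-project

# Grant the DLP Administrator role to the App Engine default
# service account (use more granular roles than Owner in production).
gcloud projects add-iam-policy-binding my-classification-project \
    --member="serviceAccount:my-classification-project@appspot.gserviceaccount.com" \
    --role="roles/dlp.admin"
```

These commands require an authenticated gcloud session with permission to modify the project, so run them from Cloud Shell or a machine where you've run gcloud auth login.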

Creating Storage Buckets

  1. Navigate to Cloud Storage in the console.
  2. We need three buckets: one for sensitive data, one for non-sensitive data, and one to hold all files as they come in (the quarantine bucket). Name the buckets as you wish, with globally unique names. Once all three buckets are created, you should see them in the storage browser.
Cloud Storage Buckets
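Equivalently, the three buckets can be created with gsutil. The bucket names and region below are placeholders; bucket names must be globally unique, so substitute your own:

```shell
# Create the quarantine bucket plus the two destination buckets
# (names and region are placeholders; pick globally unique names).
gsutil mb -l us-central1 gs://my-quarantine-bucket
gsutil mb -l us-central1 gs://my-sensitive-data-bucket
gsutil mb -l us-central1 gs://my-non-sensitive-data-bucket
```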

Creating the Cloud Pub/Sub Topic and Subscription

  • Navigate to Cloud Pub/Sub, create a topic, give it a name, and hit Create.
Create a Pub/Sub Topic
  • Then, to create a subscription for the topic, click New Subscription, give it a name, and create it.
Create a Pub/Sub Subscription
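The same topic and subscription can be created from the CLI. The names here (classify-topic, classify-sub) are placeholders for illustration:

```shell
# Create the topic the first Cloud Function will publish to,
# and a subscription for the second function to receive from.
gcloud pubsub topics create classify-topic
gcloud pubsub subscriptions create classify-sub --topic=classify-topic
```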

Conclusion

If you’re looking to classify data in an automated fashion, now you know how to get started. Stay tuned for more articles in the Get Cooking in Cloud series, and check out the references below for more details.

Next steps and references:


A collection of technical articles and blogs published or curated by Google Cloud Developer Advocates. The views expressed are those of the authors and don't necessarily reflect those of Google.
