Automating Cloud Storage Data Classification — Part 2
Get Cooking in Cloud
Authors: Priyanka Vergadia, Jenny Brown
Introduction
“Get Cooking in Cloud” is a blog and video series to help enterprises and developers build business solutions on Google Cloud. In this series, we identify specific topics that developers want to architect on Google Cloud, and then create a miniseries on each topic.
In this miniseries, we will go over the automation of data classification in Google Cloud Storage, for security and organizational purposes.
- Overview of the Use Case and Overall Process
- Deeper dive into Creating the Buckets and Cloud Pub/Sub topic and Subscription (this article)
- Creating Cloud Functions with the DLP API and Testing
In this article, we will dive deeper into creating the Cloud Storage buckets we need and setting up Pub/Sub.
What you’ll learn and use
How to automate the upload and classification of data with Google Cloud Storage.
- App Engine for the frontend
- Google Cloud Storage to store recipe submission files
- Pub/Sub for messaging
- Cloud Functions for some quick automation in a serverless fashion
- DLP API to detect private information
- This Solution for automating the Classification of Data Uploaded to Cloud Storage
Check out the video
Review
Dinner Winner is an application that collects recipe submissions from users all over the world and regularly posts winning recipes for everyone to enjoy.
The recipe submissions need to be assessed for clarity and scrubbed of any identifiable information before they are sent on to judging. After the anonymous judging takes place, winners are contacted and their recipes are posted to the application!
Based on their architecture, we’ve identified the primary issues as security and cross-contamination risks, along with huge efficiency gaps. The current manual process will inevitably lead to an overwhelmed system and lapses in quality, so we’ve decided on automation as our main fix.
A Four-Step Process
- Create Cloud Storage buckets to be used as part of the quarantine and classification pipeline.
- Create a Cloud Pub/Sub topic and subscription to pass notifications between the two Cloud Functions.
- Create two simple Cloud Functions: one that invokes the DLP API when files are uploaded, and a second that uses the DLP API to inspect and classify the files and move them to the appropriate bucket.
- Upload some sample files to the quarantine bucket to invoke the Cloud Function.
In this article, we’ll go over steps 1 and 2.
Setting up the Environment
Before we can get started on creating buckets, we need to set up our environment:
- Create or select the GCP project and make sure billing is enabled.
- Then go to the “APIs & Services” tab and enable the Cloud Functions, Cloud Storage, and Cloud Data Loss Prevention APIs.
- Enable the correct permissions so our App Engine service account can connect with the DLP API.
- For that, we open the IAM & Admin page in our project, locate the App Engine service account, and edit its roles.
- Add the Project “Owner”, “DLP Administrator”, and “DLP API Service Agent” roles and save. NOTE: Since this is a demo, we are using the “Owner” role here for the sake of simplicity. If you are developing a production app, specify more granular permissions than Project > Owner. For more information, see granting roles to service accounts.
- Next, we grant the DLP service account the permissions it needs. Locate the “DLP API Service Agent” account and add the Project “Viewer” role to it.
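If you prefer the command line to the console, the same setup can be approximated with gcloud. The sketch below only builds and prints the commands so you can review them before running anything. The project ID and project number are hypothetical placeholders, and the API service names and service account email formats are our assumptions based on the usual defaults, so check them against your own project in IAM & Admin.

```python
import shlex

# Hypothetical placeholders: substitute your own project ID and project number.
PROJECT_ID = "my-project-id"
PROJECT_NUMBER = "123456789012"

# Assumed service names for the three APIs enabled in the steps above.
APIS = [
    "cloudfunctions.googleapis.com",
    "storage.googleapis.com",
    "dlp.googleapis.com",
]


def enable_api_commands(apis):
    """Return one `gcloud services enable` command per API."""
    return [["gcloud", "services", "enable", api] for api in apis]


def role_binding_commands(project_id, project_number):
    """Return `gcloud projects add-iam-policy-binding` commands for the demo roles."""
    # Assumed default email formats for the two service accounts.
    app_engine_sa = f"serviceAccount:{project_id}@appspot.gserviceaccount.com"
    dlp_agent_sa = (
        f"serviceAccount:service-{project_number}@dlp-api.iam.gserviceaccount.com"
    )
    bindings = [
        (app_engine_sa, "roles/owner"),  # demo only; too broad for production
        (app_engine_sa, "roles/dlp.admin"),
        (app_engine_sa, "roles/dlp.serviceAgent"),
        (dlp_agent_sa, "roles/viewer"),
    ]
    return [
        ["gcloud", "projects", "add-iam-policy-binding", project_id,
         "--member", member, "--role", role]
        for member, role in bindings
    ]


def all_commands(project_id, project_number):
    """API-enabling commands first, then the role bindings."""
    return enable_api_commands(APIS) + role_binding_commands(
        project_id, project_number
    )


if __name__ == "__main__":
    # Print the commands for review; nothing is executed here.
    for cmd in all_commands(PROJECT_ID, PROJECT_NUMBER):
        print(shlex.join(cmd))
```

Printing instead of executing is deliberate: role bindings change who can access the project, so it is worth reading each command before pasting it into a shell.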
Creating Storage Buckets
- We will need three buckets: one for all the files as they come in (the quarantine bucket), one for sensitive data, and one for non-sensitive data.
- Navigate to Cloud Storage and create them, giving each bucket a globally unique name of your choice. Once all three buckets are created, you should be able to see them in the storage browser.
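The three buckets can also be created from code. Here is a minimal sketch, assuming the google-cloud-storage Python client library is installed and application default credentials are configured. The bucket names, the "US" location, and the helper names are hypothetical choices for illustration, not part of the original walkthrough.

```python
# Hypothetical bucket names: bucket names are globally unique, so pick your own.
BUCKET_NAMES = {
    "quarantine": "dinner-winner-quarantine-example",
    "sensitive": "dinner-winner-sensitive-example",
    "non_sensitive": "dinner-winner-non-sensitive-example",
}


def create_buckets(client, bucket_names, location="US"):
    """Create each bucket we cannot already see and return the names created.

    `client` is expected to behave like google.cloud.storage.Client.
    Note: if a name is already taken by another project, the lookup or the
    create call will raise an error rather than return quietly.
    """
    created = []
    for purpose, name in bucket_names.items():
        if client.lookup_bucket(name) is not None:
            print(f"{purpose}: {name} already exists, skipping")
            continue
        client.create_bucket(name, location=location)
        print(f"{purpose}: created {name}")
        created.append(name)
    return created


def main():
    # Imported here so the helper above can be read and tested without
    # the library; requires `pip install google-cloud-storage`.
    from google.cloud import storage

    create_buckets(storage.Client(), BUCKET_NAMES)
```

Calling `main()` with valid credentials should leave the same three buckets visible in the storage browser as the console steps do.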
Creating the Cloud Pub/Sub Topic and Subscription
- Navigate to Cloud Pub/Sub and create a topic: give it a name and click Create.
- Then, to create a subscription for that topic, click New Subscription, give it a name, and click Create.
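The same topic and subscription can be created programmatically. Below is a minimal sketch, assuming version 2 or later of the google-cloud-pubsub Python client library (the `request=` calling style) and application default credentials. The project, topic, and subscription IDs are hypothetical placeholders, and the helper functions are ours.

```python
def topic_path(project_id, topic_id):
    """Fully qualified topic name in the form Pub/Sub expects."""
    return f"projects/{project_id}/topics/{topic_id}"


def subscription_path(project_id, subscription_id):
    """Fully qualified subscription name in the form Pub/Sub expects."""
    return f"projects/{project_id}/subscriptions/{subscription_id}"


def create_topic_and_subscription(
    publisher, subscriber, project_id, topic_id, subscription_id
):
    """Create a topic, then a subscription attached to it.

    `publisher` and `subscriber` are expected to behave like
    pubsub_v1.PublisherClient and pubsub_v1.SubscriberClient.
    """
    topic = topic_path(project_id, topic_id)
    subscription = subscription_path(project_id, subscription_id)
    # The topic must exist before a subscription can point at it.
    publisher.create_topic(request={"name": topic})
    subscriber.create_subscription(
        request={"name": subscription, "topic": topic}
    )
    return topic, subscription


def main():
    # Imported here so the helpers above can be read and tested without
    # the library; requires `pip install google-cloud-pubsub`.
    from google.cloud import pubsub_v1

    create_topic_and_subscription(
        pubsub_v1.PublisherClient(),
        pubsub_v1.SubscriberClient(),
        "my-project-id",          # hypothetical project ID
        "classify-topic",         # hypothetical topic name
        "classify-subscription",  # hypothetical subscription name
    )
```

The order matters here just as it does in the console: the subscription request references the topic by its full name, so the topic is created first.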
Conclusion
If you’re looking to classify data in an automated fashion, you now know how to set up the buckets and Pub/Sub messaging it requires. Stay tuned for more articles in the Get Cooking in Cloud series and check out the references below for more details.
Next steps and references:
- Follow this blog series on Google Cloud Platform Medium.
- Reference: Automating Cloud Storage Data Classification
- Check out the Codelab by Roger Martinez in collaboration with Jenny Brown: Automated Classification of Data Uploaded to Cloud Storage with the DLP API and Cloud Functions
- Follow the Get Cooking in Cloud video series and subscribe to the Google Cloud Platform YouTube channel
- Want more stories? Follow me on Medium and on Twitter.
- Enjoy the ride with us through this miniseries and learn about more Google Cloud solutions like this one :)