Handling Sensitive Data on the Google Cloud Platform

Victor Sonck
Google Cloud - Community
7 min read · Dec 14, 2018

By Victor Sonck and Stijn Decubber, ML engineers @ ML6

Privacy in the digital world is becoming more and more important in the public consciousness. We share an enormous amount of data with countless companies and governments. They might be doing good things with that data; they might even use it responsibly most of the time. But privacy goes much further than just behaving well. It’s about setting up the system in such a way that no one, not even in the future, when laws, norms and ethics change, will be able to use this data against its original owner. It’s not about what you might have to hide; it’s about not allowing anyone else to decide that for you.

That’s where the GDPR comes in. It doesn’t prohibit companies from using your data; it only demands that you be given the choice. But even when a company is allowed to use personal data, there are strong regulations to follow in terms of data protection and security. Rules such as not being allowed to use personal data outside production environments, together with strong encryption requirements, push companies to take a very good look at how they handle their data.

At ML6, we frequently work on machine learning projects that require us to handle sensitive data. In recent years, breakthroughs in deep learning and the release of software such as TensorFlow have cleared the road for exciting new software applications, often involving unstructured data such as images, audio or text files. However, identifying and masking sensitive parts in unstructured data is not always straightforward, and is sometimes even unfeasible with rule-based approaches. In this article, we will illustrate how Google Cloud machine learning services can be used to identify and mask sensitive data in unstructured datasets. We’ll start with some Cloud Vision examples, and then highlight how to tackle a more complex problem by combining Cloud Vision with Cloud Natural Language. As we will show, these services are powerful, leverage state-of-the-art models for highly accurate results, and can easily be integrated into existing services or upstream processing pipelines.

To follow along, you’ll need a Google Cloud Platform (GCP) project. If you don’t have an account yet: the trial period lasts one year and comes with $300 of credit. Create your first project here and make sure billing is enabled. The Google Cloud Console is the central place from which to manage anything you might want to do on GCP. In our case, we’ll need to enable the Cloud Vision API and the Cloud Natural Language API.

For authentication, you’ll need to create a service account, which allows any software you write to access the GCP project on its own. Select New Service Account from the dropdown menu and enter a name. For our testing purposes, let’s just give the account the Owner role, so that it has access to everything. As a last step, download the JSON keyfile and set the environment variable GOOGLE_APPLICATION_CREDENTIALS to its path. Your program will use this keyfile to authenticate.

Set the GOOGLE_APPLICATION_CREDENTIALS environment variable to set up authentication
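In a shell, that could look like the following (the path is just an example; use wherever you saved your own keyfile):

```shell
# Point the Google Cloud client libraries at the service account keyfile.
export GOOGLE_APPLICATION_CREDENTIALS="$HOME/keys/my-project-keyfile.json"
```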

Masking licence plates with Google Cloud Vision
Worldwide, we collectively took about 1.2 trillion photos in 2017 alone. All this data is being used for machine learning applications, from object detection to self-driving cars. But in many scenarios, parts of an image are both unimportant to the machine learning model and highly sensitive. A self-driving car doesn’t really care about a license plate, only that there is an obstacle. Google Street View has the same problem: the faces of pedestrians aren’t necessary for a good street view, but they are visible nonetheless, and we want to protect those people’s privacy.

Google’s Cloud Vision API is capable of powerful object and text detection in images. You can try it yourself at https://cloud.google.com/vision/

Now for the fun part: the code!

We start with our imports. The vision module from the google-cloud package lets us talk to GCP’s Vision API; we’re going to send it pictures of cars and detect their license plates. We use PIL to open each image, draw a black box hiding the license plate, and save the result. We use io to read the image bytes and os to make sure our paths work cross-platform.
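A minimal sketch of such a function, assuming the google-cloud-vision and Pillow packages (the function and helper names here are our own):

```python
import io
import os


def censored_path(image_path):
    """Return the output path for the censored copy, e.g. car.jpg -> car_censored.jpg."""
    name, ext = os.path.splitext(image_path)
    return name + '_censored' + ext


def mask_license_plates(image_path):
    """Detect all text in an image and paint a black box over each block."""
    # Third-party imports kept local so the helper above stays importable
    # without the packages installed.
    from google.cloud import vision
    from PIL import Image, ImageDraw

    client = vision.ImageAnnotatorClient()

    # Read the raw bytes and send them to the Vision API for text detection.
    with io.open(image_path, 'rb') as image_file:
        content = image_file.read()
    response = client.text_detection(image=vision.Image(content=content))

    # The first annotation spans all detected text at once; the rest are the
    # individual blocks, which is what we want to mask one by one.
    image = Image.open(image_path)
    draw = ImageDraw.Draw(image)
    for annotation in response.text_annotations[1:]:
        box = [(v.x, v.y) for v in annotation.bounding_poly.vertices]
        draw.polygon(box, fill='black')

    image.save(censored_path(image_path))
```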

You can read the explanation of each component in the comments of the code, but in short we write a function that takes an image’s path as input and saves a censored version of the image. Note that most Google Cloud services, including the ones discussed here, can also be called via a REST API. Below is an example of the script at work.

Normal Image
License plate obscured!

Masking faces
Now that we have our base code and know what to do, we can apply the same method to the other categories that Cloud Vision recognizes: faces, landmarks, logos and more. Below is the code that extends our previous example with the ability to detect faces. You can play around with it and plug in each of these categories as additional ways to anonymize your data.
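A sketch along the same lines, again assuming google-cloud-vision and Pillow (the function name is our own):

```python
import io


def mask_faces(image_path, output_path):
    """Detect faces with the Vision API and draw a black box over each one."""
    # Third-party imports kept local so the module loads without them installed.
    from google.cloud import vision
    from PIL import Image, ImageDraw

    client = vision.ImageAnnotatorClient()
    with io.open(image_path, 'rb') as image_file:
        content = image_file.read()

    # Same structure as the license plate example, with face_detection
    # swapped in for text_detection.
    response = client.face_detection(image=vision.Image(content=content))

    image = Image.open(image_path)
    draw = ImageDraw.Draw(image)
    for face in response.face_annotations:
        box = [(v.x, v.y) for v in face.bounding_poly.vertices]
        draw.polygon(box, fill='black')
    image.save(output_path)
```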

Normal people having fun
Normal people having fun in private

This code looks almost the same as the one above! How cool is that? It’s incredibly easy to do: you can create as many of these functions as you want (or even combine them) and run your pictures through them for a complete anonymization pass.

Anonymizing store receipts: Cloud Vision and Cloud Natural Language

In the license plate example, we used Cloud Vision to obfuscate license plates simply by detecting and masking all text in an image. What if we want to mask only certain types of text? Enter Cloud Natural Language. The Natural Language API provides out-of-the-box entity recognition, sentiment analysis and even syntax analysis, and currently supports ten major languages.

Suppose we have a dataset of store receipt scans, and we want to mask out the names of customers or staff members for privacy reasons. Simply detecting and masking all text would throw away all the other information on the receipt. A better alternative is to first extract all text from the scan with the Vision API, and then use entity recognition to detect which pieces of text are actually names.

Cloud Natural Language offers semantic text analysis and entity recognition out of the box. For our store receipt, the API can help us to recognize the presence of a name in the document. Combined with Cloud Vision, it’s easy to build a pipeline that masks names in images. https://cloud.google.com/natural-language/

Once we have identified the names, it’s straightforward to mask them out using the code from above:

Left: our scanned store receipt. Right: with Cloud Vision and Cloud Natural Language, we can anonymize the receipt with just a couple of lines of code without losing any other information.

The main difference from the previous examples in terms of implementation is the additional call to Cloud Natural Language. In practice, this means just two extra lines of code:
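A sketch of such a pipeline, assuming the google-cloud-vision and google-cloud-language packages plus Pillow (the helper and function names are our own):

```python
import io


def name_tokens(person_names):
    """Split full names ('John Doe') into the word-level tokens that the
    Vision API returns as separate text annotations."""
    return {token for name in person_names for token in name.split()}


def mask_names(image_path, output_path):
    """Extract text with Cloud Vision, find person names with Cloud Natural
    Language, and black out only the matching words."""
    # Third-party imports kept local so the helper above stays importable.
    from google.cloud import language_v1, vision
    from PIL import Image, ImageDraw

    # Step 1: OCR the receipt with the Vision API.
    vision_client = vision.ImageAnnotatorClient()
    with io.open(image_path, 'rb') as image_file:
        content = image_file.read()
    annotations = vision_client.text_detection(
        image=vision.Image(content=content)).text_annotations
    full_text = annotations[0].description if annotations else ''

    # Step 2: the two extra lines -- ask Cloud Natural Language which
    # entities in the extracted text are persons.
    nl_client = language_v1.LanguageServiceClient()
    entities = nl_client.analyze_entities(document=language_v1.Document(
        content=full_text, type_=language_v1.Document.Type.PLAIN_TEXT)).entities

    persons = name_tokens(e.name for e in entities
                          if e.type_ == language_v1.Entity.Type.PERSON)

    # Step 3: mask only the words that belong to a detected name.
    image = Image.open(image_path)
    draw = ImageDraw.Draw(image)
    for annotation in annotations[1:]:
        if annotation.description in persons:
            box = [(v.x, v.y) for v in annotation.bounding_poly.vertices]
            draw.polygon(box, fill='black')
    image.save(output_path)
```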

Cloud Speech and Cloud Video
In addition to the vision and natural language services, GCP also offers APIs for speech-to-text and for video analysis and annotation. While we won’t cover these services here, they can obviously also be used to identify and mask sensitive data. Just one example is automatically extracting and masking names from speech samples (for instance, customer service phone calls) by first converting the audio to text with Cloud Speech and then calling Cloud Natural Language for entity recognition.
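That chain could be sketched like this, assuming the google-cloud-speech and google-cloud-language packages and a 16 kHz LINEAR16 WAV file in Cloud Storage (the function name and audio settings are our own illustrative assumptions):

```python
def person_names_in_audio(audio_uri):
    """Transcribe an audio file with Cloud Speech, then use Cloud Natural
    Language entity recognition to return the person names it mentions."""
    from google.cloud import language_v1, speech

    # Step 1: speech-to-text on an audio file stored in Cloud Storage.
    speech_client = speech.SpeechClient()
    response = speech_client.recognize(
        config=speech.RecognitionConfig(
            encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
            sample_rate_hertz=16000,
            language_code='en-US'),
        audio=speech.RecognitionAudio(uri=audio_uri))
    transcript = ' '.join(result.alternatives[0].transcript
                          for result in response.results)

    # Step 2: entity recognition on the transcript.
    nl_client = language_v1.LanguageServiceClient()
    entities = nl_client.analyze_entities(document=language_v1.Document(
        content=transcript, type_=language_v1.Document.Type.PLAIN_TEXT)).entities
    return [e.name for e in entities
            if e.type_ == language_v1.Entity.Type.PERSON]
```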

Cloud Data Loss Prevention
One additional option is GCP’s Data Loss Prevention (DLP) API. It is mainly aimed at text data and allows you to detect and redact sensitive data such as credit card numbers, phone numbers and names. Furthermore, it can handle text streams, so data can be redacted as it comes in, before it is ever written to disk. That makes it well suited for real-time applications.
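For text, a de-identification call could be sketched as follows, assuming the google-cloud-dlp package (the function name and the chosen info types are our own):

```python
def redact_text(project_id, text):
    """Ask the DLP API to replace names, phone numbers and credit card
    numbers in a piece of text with their info-type labels."""
    from google.cloud import dlp_v2

    client = dlp_v2.DlpServiceClient()
    response = client.deidentify_content(request={
        'parent': f'projects/{project_id}',
        'item': {'value': text},
        # Which kinds of sensitive data to look for...
        'inspect_config': {'info_types': [
            {'name': 'PERSON_NAME'},
            {'name': 'PHONE_NUMBER'},
            {'name': 'CREDIT_CARD_NUMBER'}]},
        # ...and how to transform them: replace each match with its info type,
        # e.g. 'John' becomes '[PERSON_NAME]'.
        'deidentify_config': {'info_type_transformations': {'transformations': [
            {'primitive_transformation': {'replace_with_info_type_config': {}}}]}},
    })
    return response.item.value
```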

But is my data safe when sending it to Google Cloud?
Yes. Sending data to GCP services is done via HTTPS, ensuring security through a TLS connection, which means that you can be sure that you are sending data to the correct receiver, and that your data is encrypted in transit. More generally, Google Cloud Platform by default encrypts all data before writing it to disk. In addition, the encryption keys for each chunk of data are themselves encrypted with a set of master keys, and GCP offers various ways to manage these keys, ranging from having them managed automatically to the possibility to have the master keys managed by the customer and kept on premises.

At ML6, we have extensive experience with integrating machine learning in software products, be it by using Google Cloud API’s or by building custom solutions to tackle more complex problems. Privacy protection ranks high on our list of priorities for any project, and we believe we are in pole position to tackle complex ML problems in the future.

To read more about the GCP machine learning APIs, refer to this blogpost where we discuss how we used Cloud Vision to digitize more than one million catalogue cards for Ghent University.
