Getting Started with Google’s Data Loss Prevention API in Python

Stefan Gouyet
Jul 13, 2020 · 5 min read

Cloud Data Loss Prevention (DLP) is one of the many cloud security products that Google has to offer, allowing users to mask personally identifiable information (PII) in their data.

Google’s DLP can be used via the GCP console or the API; for the purpose of this article, I will focus on the latter.

If your data includes PII such as email addresses, DLP offers several ways to mask this information. For example, we can replace each character of a sensitive value with a masking character such as a hashtag.

Alternatively, we can mask the information by replacing it with its [information-type] label, such as [EMAIL_ADDRESS].
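The original article showed these two styles as screenshots. As a rough local illustration (this is a regex stand-in for e-mail addresses only, not the DLP API), the two output shapes look like this:

```python
import re

# Simplified e-mail pattern, for illustration only.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.\w+")

def mask_with_character(text, masking_character="#"):
    """Replace every character of each detected e-mail with a mask character."""
    return EMAIL_RE.sub(lambda m: masking_character * len(m.group()), text)

def replace_with_info_type(text, info_type="EMAIL_ADDRESS"):
    """Replace each detected e-mail with an info-type label."""
    return EMAIL_RE.sub("[{}]".format(info_type), text)

sample = "Contact me at jane.doe@example.com today."
print(mask_with_character(sample))     # the address becomes a run of '#'
print(replace_with_info_type(sample))  # the address becomes [EMAIL_ADDRESS]
```

The real API performs the detection for you across many info types; this sketch only mimics what the masked output looks like.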

Let’s see how we can implement this with Python and the DLP API.

Step 1: Authenticate

Let’s first authenticate by providing our GCP credentials. An easy way to do this is to create a service account and point the GOOGLE_APPLICATION_CREDENTIALS environment variable at its key file (using Python’s os module).

import os

os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'data-loss-prevention-test-74b082472d34.json'
print('Credentials from environ: {}'.format(os.environ.get('GOOGLE_APPLICATION_CREDENTIALS')))

Alternatively, if you already have the gcloud command-line interface up and running, the following command will allow you to authenticate:

gcloud auth application-default login

Step 2: Consult Google’s documentation and find the applicable DLP function

Now it’s time to use the DLP API. The following Python function comes from Google’s documentation. It uses the google.cloud.dlp client library, reads a string of text (the input_str parameter), and masks any matches of the given info_types.

The one change I made to Google’s sample is on the final line: the function returns response.item.value instead of only printing it.

Here is the function, with that edit:

def deidentify_with_mask(
    project, input_str, info_types, masking_character=None, number_to_mask=0
):
    """Uses the Data Loss Prevention API to deidentify sensitive data in a
    string by masking it with a character.
    Args:
        project: The Google Cloud project id to use as a parent resource.
        input_str: The string to deidentify (will be treated as text).
        info_types: A list of strings representing info types to look for.
        masking_character: The character to mask matching sensitive data with.
        number_to_mask: The maximum number of sensitive characters to mask in
            a match. If omitted or set to zero, the API will default to no
            maximum.
    Returns:
        str: The deidentified text.
    """
    # Import the client library
    import google.cloud.dlp

    # Instantiate a client
    dlp = google.cloud.dlp_v2.DlpServiceClient()

    # Convert the project id into a full resource id.
    parent = dlp.project_path(project)

    # Construct inspect configuration dictionary
    inspect_config = {
        "info_types": [{"name": info_type} for info_type in info_types]
    }

    # Construct deidentify configuration dictionary
    deidentify_config = {
        "info_type_transformations": {
            "transformations": [
                {
                    "primitive_transformation": {
                        "character_mask_config": {
                            "masking_character": masking_character,
                            "number_to_mask": number_to_mask,
                        }
                    }
                }
            ]
        }
    }

    # Construct item
    item = {"value": input_str}

    # Call the API
    response = dlp.deidentify_content(
        parent,
        inspect_config=inspect_config,
        deidentify_config=deidentify_config,
        item=item,
    )

    # Print out the results.
    print(response.item.value)

    return response.item.value

The info_types parameter is where you specify which types of PII you would like to mask. Google has a long list of possible infoTypes, from generic ones such as PERSON_NAME and EMAIL_ADDRESS to more specific ones such as country-specific passport numbers (e.g. FRANCE_PASSPORT).

For the purposes of this tutorial, my info_types cover names, street addresses, e-mail addresses, and phone numbers.

When we run the deidentify_with_mask function on a string that includes sensitive information, the matched values are returned masked.
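The original screenshots are not reproduced here, so the exact list below is an assumption inferred from the data used in this tutorial (names, addresses, e-mails, and phone numbers):

```python
# Assumed info_types, inferred from the dataset's columns; the original
# article showed the actual list in a screenshot.
info_types = ["PERSON_NAME", "STREET_ADDRESS", "EMAIL_ADDRESS", "PHONE_NUMBER"]

# This is the inspect configuration the function builds from that list.
inspect_config = {
    "info_types": [{"name": info_type} for info_type in info_types]
}

# Calling the function requires valid credentials and a real project id:
# masked = deidentify_with_mask(
#     "my-project-id",
#     "Hi, I'm John Smith and my email is john@example.com",
#     info_types,
#     masking_character="#",
# )
```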

Step 3: Run DLP function on PII data

Now that we are correctly connecting to the DLP API, we can run it on a dataset that contains sensitive data.

Let’s read in some data for DLP to mask:

import pandas as pd

df = pd.read_csv('test_pii_data.csv')

FYI: this dataset is a randomly generated set of names, phone numbers, and e-mail addresses, combined with a list of addresses of Fortune 500 headquarters (DLP will not recognize or mask fake addresses, so randomly generated locations would not work for this demonstration).

As we can see above, the Unmasked_Text column contains our name, address, e-mail, and phone number data, concatenated together to imitate actual text data. Now let’s apply our DLP function to the dataset, passing a project id and the info_types list along with each row’s text:

df['Content_Masked'] = df.apply(lambda row: deidentify_with_mask(project, row['Unmasked_Text'], info_types), axis=1)

We are returned with a dataset in which a new column, Content_Masked, contains our original text data with all sensitive information masked.
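Since the masked output appeared as a screenshot in the original article, here is a self-contained sketch of the same row-wise pattern, using a stand-in regex masker in place of the API call so it runs without credentials (the data and function name are hypothetical):

```python
import re
import pandas as pd

def mask_emails(text, masking_character="#"):
    """Stand-in for deidentify_with_mask: masks only e-mail addresses,
    locally, so no API credentials are needed."""
    return re.sub(
        r"[\w.+-]+@[\w-]+\.\w+",
        lambda m: masking_character * len(m.group()),
        text,
    )

df = pd.DataFrame(
    {"Unmasked_Text": ["Reach Jane at jane@example.com today.", "No PII here."]}
)

# Same apply pattern as above, one API-free call per row.
df["Content_Masked"] = df.apply(
    lambda row: mask_emails(row["Unmasked_Text"]), axis=1
)
print(df["Content_Masked"].tolist())
```

With the real deidentify_with_mask, each row would instead trigger one deidentifyContent API call, so expect this to be slow (and billable) on large datasets.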

That’s how easy it is to mask your PII with Google’s DLP API. The results are overall very good, with the data’s sensitive information correctly masked regardless of the sentence structure.

In practice, the generic infoTypes can sometimes miss PII, including less conventional names (e.g. Tiger) or misspelled addresses (e.g. 405 Lexingtone Ave). This can be mitigated with custom infoTypes, which let users tune DLP for their use case.

I hope you enjoyed this article; feel free to reach out with any questions. The repo can be found on GitHub.
