Getting Started with Google’s Data Loss Prevention API in Python

Stefan Gouyet · Published in Analytics Vidhya · Jul 13, 2020

Cloud Data Loss Prevention (DLP) is one of the many cloud security products that Google has to offer, allowing users to mask personally identifiable information (PII) in their data.

Google’s DLP can be used via the GCP console or the API; for the purpose of this article, I will focus on the latter.

If your data includes PII such as email addresses, DLP offers several ways to mask this information. For example, we can replace each character of the PII with a masking character such as #. Alternatively, we can replace the detected value with its [information-type] label.
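As an illustration (the strings below are my own mock examples, not output captured from the DLP console), the two styles look like this:

```python
original = "Please reach me at jane.doe@example.com."

# Character masking: each character of the detected value is replaced.
masked = "Please reach me at " + "#" * len("jane.doe@example.com") + "."

# Info-type replacement: the detected value is swapped for its label.
replaced = "Please reach me at [EMAIL_ADDRESS]."

print(masked)    # → Please reach me at ####################.
print(replaced)  # → Please reach me at [EMAIL_ADDRESS].
```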

Let’s see how we can implement this with Python and the DLP API.

Step 1: Authenticate

Let’s first authenticate by providing our GCP credentials. An easy way to do this is to create a service account and point the GOOGLE_APPLICATION_CREDENTIALS environment variable at its key file (using Python’s os module).

import os

os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'data-loss-prevention-test-74b082472d34.json'
print('Credentials from environ: {}'.format(os.environ.get('GOOGLE_APPLICATION_CREDENTIALS')))

Alternatively, if you already have the gcloud command-line interface up and running, the following command will allow you to authenticate:

gcloud auth application-default login

Step 2: Consult Google’s documentation and find the applicable DLP function

Now it’s time to use the DLP API. The following Python function comes from Google’s documentation. It uses the google.cloud.dlp library, reads a string of text (the input_str parameter), and prints the text with any matches of the given info_types masked.

The one change I made is on the function’s final line: it returns response.item.value instead of only printing it. Here is the function, with that edit:

def deidentify_with_mask(
    project, input_str, info_types, masking_character=None, number_to_mask=0
):
    """Uses the Data Loss Prevention API to deidentify sensitive data in a
    string by masking it with a character.
    Args:
        project: The Google Cloud project id to use as a parent resource.
        input_str: The string to deidentify (will be treated as text).
        info_types: A list of strings representing infoTypes to look for.
        masking_character: The character to mask matching sensitive data with.
        number_to_mask: The maximum number of sensitive characters to mask in
            a match. If omitted or set to zero, the API will default to no
            maximum.
    Returns:
        The deidentified string (response.item.value).
    """
    # Import the client library
    import google.cloud.dlp

    # Instantiate a client
    dlp = google.cloud.dlp_v2.DlpServiceClient()

    # Convert the project id into a full resource id.
    parent = dlp.project_path(project)

    # Construct inspect configuration dictionary
    inspect_config = {
        "info_types": [{"name": info_type} for info_type in info_types]
    }

    # Construct deidentify configuration dictionary
    deidentify_config = {
        "info_type_transformations": {
            "transformations": [
                {
                    "primitive_transformation": {
                        "character_mask_config": {
                            "masking_character": masking_character,
                            "number_to_mask": number_to_mask,
                        }
                    }
                }
            ]
        }
    }

    # Construct item
    item = {"value": input_str}

    # Call the API
    response = dlp.deidentify_content(
        parent,
        inspect_config=inspect_config,
        deidentify_config=deidentify_config,
        item=item,
    )

    # Print out the results.
    print(response.item.value)

    return response.item.value
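A note on library versions: the sample above targets older releases of google-cloud-dlp (pre-2.0), where DlpServiceClient exposed project_path and deidentify_content accepted the parent positionally. In 2.x and later, the parent path is built by hand and the whole call is bundled into a single request dictionary. A minimal sketch of the newer calling convention (the project id and example strings are placeholders; the commented-out lines are what would actually hit the API):

```python
project = "my-gcp-project"  # hypothetical project id

# In google-cloud-dlp 2.x the parent resource path is a plain string.
parent = f"projects/{project}"

# The whole call is bundled into one request dictionary.
request = {
    "parent": parent,
    "inspect_config": {"info_types": [{"name": "EMAIL_ADDRESS"}]},
    "deidentify_config": {
        "info_type_transformations": {
            "transformations": [
                {
                    "primitive_transformation": {
                        "character_mask_config": {"masking_character": "#"}
                    }
                }
            ]
        }
    },
    "item": {"value": "My email is jane.doe@example.com"},
}

# With credentials configured, the call would be:
# import google.cloud.dlp_v2
# dlp = google.cloud.dlp_v2.DlpServiceClient()
# response = dlp.deidentify_content(request=request)
# print(response.item.value)
```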

The info_types parameter is where you specify which types of PII data you would like to mask. Google has a long list of possible infoTypes you can mask, from generic ones such as PERSON_NAME and EMAIL_ADDRESS to more specific ones such as country-specific passport numbers (e.g. FRANCE_PASSPORT).

For the purposes of this tutorial, my info_types are the following:
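The original screenshot listing them is not reproduced here; judging by which fields get masked in the output below, they were presumably something like the following (the exact list is my assumption):

```python
# Assumed info_types, inferred from the masked fields in the demo output.
info_types = ["PERSON_NAME", "PHONE_NUMBER", "STREET_ADDRESS"]

# This is the inspect_config the function builds from that list.
inspect_config = {"info_types": [{"name": t} for t in info_types]}
```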

When we run the deidentify_with_mask function on a string that includes sensitive information, the person’s name, phone number, and street address come back masked.

Step 3: Run DLP function on PII data

Now that we are correctly connecting to the DLP API, we can run it on a dataset that contains sensitive data.

Let’s read in some data for DLP to mask:

import pandas as pd

df = pd.read_csv('test_pii_data.csv')

This dataset consists of randomly generated names, phone numbers, and e-mail addresses, combined with a list of Fortune 500 headquarters addresses (DLP will not recognize or mask fake addresses, so randomly generated locations would not work for this demonstration).

The Content column contains our name, address, e-mail, and phone number data, concatenated together to imitate actual text data. Now let’s apply our DLP function to the dataset:

df['Content_Masked'] = df.apply(lambda row: deidentify_with_mask(project, row['Content'], info_types), axis=1)

We are returned with a dataset in which a new column, Content_Masked, contains our original text data, with all sensitive information masked.
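Because the real call needs GCP credentials, here is a credential-free sketch of the same apply pattern, with a stand-in masker (redact_digits is my own placeholder, not a DLP function) so the DataFrame mechanics can be seen end to end:

```python
import re

import pandas as pd


# Hypothetical stand-in for deidentify_with_mask: masks digits only.
def redact_digits(text):
    return re.sub(r"\d", "#", text)


df = pd.DataFrame(
    {"Content": ["Call Jane at 555-123-4567.", "Reach Bob at 555-987-6543."]}
)

# Same row-wise pattern as the DLP version.
df["Content_Masked"] = df.apply(lambda row: redact_digits(row["Content"]), axis=1)

print(df["Content_Masked"].tolist())
# → ['Call Jane at ###-###-####.', 'Reach Bob at ###-###-####.']
```

(For a single column, df["Content"].apply(redact_digits) would be simpler, but the row-wise form mirrors the snippet above.)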

That’s how easy it is to mask your PII with Google’s DLP API. The results are overall very good, with the data’s sensitive information correctly masked regardless of the sentence structure.

In practice, the generic infoTypes can sometimes miss PII, including less conventional names (e.g. Tiger) or misspelled addresses (e.g. 405 Lexingtone Ave). This can be mitigated with custom infoTypes, which let users tune DLP for their use case.
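As a sketch of what that looks like (EMPLOYEE_ID and its regex are invented for illustration), a regex-based custom infoType is declared alongside the built-in ones in the inspect configuration:

```python
# Hypothetical regex-based custom infoType for an internal ID format,
# declared next to a built-in detector in the same inspect_config.
inspect_config = {
    "info_types": [{"name": "PERSON_NAME"}],
    "custom_info_types": [
        {
            "info_type": {"name": "EMPLOYEE_ID"},
            "regex": {"pattern": r"EMP-\d{6}"},
        }
    ],
}
```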

I hope you enjoyed this article; please let me know if you have any questions. The repo can be found on GitHub.

This post was republished on my personal blog on September 9, 2020.

Stefan Gouyet

Frontend Engineer | Cloud enthusiast | Washington, D.C.