De-identifying data using custom info types with DLP on Google Cloud

Paul Balm
Google Cloud - Community
Oct 1, 2022

Google Cloud offers a service to find and de-identify sensitive data, through masking, encryption, or other transformations. This service is called Data Loss Prevention, or DLP. At the time of writing, it has 158 types of sensitive data, or “info types”, predefined. These range from simple things like genders, to Spanish national identity numbers, Japanese bank account numbers, and US vehicle identification numbers. But what if you have a list of items that are not covered by any of these? Like the schools in your country, or perhaps professions, or something else… A dictionary of a few words, or a massive one, or a regular expression maybe? You need a custom info type. This post will describe how that works, but first, the basics.

Data Loss Prevention Basics

A DLP job always has an “inspection” part, with a corresponding “inspection config”. Optionally, it can also act on the findings from the inspection, for example by encrypting them. This second part is the “de-identification”.

The inspection config will specify the info types to be identified, and the de-identification will define the actions to be applied where necessary. For example, we could identify bank account numbers and encrypt them (reversible), or hash them (not reversible), or mask them by overwriting them with a certain character like “X”. There are many other options too, for example, dates can all be shifted by some amount of time, to preserve the timeline of events in the dataset.
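To make the two halves concrete, here is a minimal sketch of what an inspection config and a de-identification config could look like as plain Python dictionaries, in the shape the DLP API accepts. The helper names (`mask_with`, `hash_with_transient_key`, `transform`) and the key name `demo-hash-key` are my own inventions for illustration, and nothing here calls the service. I am assuming the predefined info type name `JAPAN_BANK_ACCOUNT`; check the info type reference for the exact names you need.

```python
# Sketch only: builds request fragments as plain dicts, no API calls.

def mask_with(char: str = "X") -> dict:
    """Overwrite every character of a finding with `char`."""
    return {"character_mask_config": {"masking_character": char}}


def hash_with_transient_key(key_name: str) -> dict:
    """Hash a finding (not reversible).

    A transient key only lives for the duration of one request, so it is
    fine for a demo but not for hashes that must be stable over time.
    """
    return {"crypto_hash_config": {"crypto_key": {"transient": {"name": key_name}}}}


def transform(info_type_names: list[str], primitive: dict) -> dict:
    """Apply one primitive transformation to the listed info types."""
    return {
        "info_types": [{"name": name} for name in info_type_names],
        "primitive_transformation": primitive,
    }


# What to look for.
inspect_config = {"info_types": [{"name": "JAPAN_BANK_ACCOUNT"}]}

# What to do with the findings: pick one of the options.
mask_accounts = {
    "info_type_transformations": {
        "transformations": [transform(["JAPAN_BANK_ACCOUNT"], mask_with("X"))]
    }
}
hash_accounts = {
    "info_type_transformations": {
        "transformations": [
            transform(["JAPAN_BANK_ACCOUNT"], hash_with_transient_key("demo-hash-key"))
        ]
    }
}
```

Swapping one primitive transformation for another leaves the inspection config untouched, which is the point of keeping the two parts separate.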

In the diagram below we show three predefined info types. The inspection job only looks for two of them, the ID number and the vehicle ID. Reducing the number of info types reduces the latency of the inspection jobs. In the de-identification step, we only apply a transformation to the ID number; for example, we could hash it. We can also define inspection and de-identification templates to make the configuration of different jobs easier to manage, but this does not change the concepts.

Info types can be used by an inspection job. During the optional de-identification step, transformations can be applied to findings based on some of the info types from the inspection job.
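The scoping in the diagram can be sketched as follows. I am using `SPAIN_NIF_NUMBER` and `VEHICLE_IDENTIFICATION_NUMBER` as stand-ins for the “ID number” and “vehicle ID” (treat those names as assumptions), and `inspected_but_untouched` is a hypothetical helper that only reasons about the dictionaries locally.

```python
# Sketch only: the job inspects two info types but transforms just one.

inspect_config = {
    "info_types": [
        {"name": "SPAIN_NIF_NUMBER"},
        {"name": "VEHICLE_IDENTIFICATION_NUMBER"},
    ]
}

deidentify_config = {
    "info_type_transformations": {
        "transformations": [
            {
                "info_types": [{"name": "SPAIN_NIF_NUMBER"}],
                "primitive_transformation": {
                    "crypto_hash_config": {
                        "crypto_key": {"transient": {"name": "demo-hash-key"}}
                    }
                },
            }
        ]
    }
}


def inspected_but_untouched(inspect_cfg: dict, deid_cfg: dict) -> set[str]:
    """Info types that are reported as findings but left unchanged."""
    inspected = {t["name"] for t in inspect_cfg.get("info_types", [])}
    transformed: set[str] = set()
    for tr in deid_cfg["info_type_transformations"]["transformations"]:
        names = {t["name"] for t in tr.get("info_types", [])}
        if not names:
            # A transformation without an info type list applies to
            # every finding, so nothing is left untouched.
            return set()
        transformed |= names
    return inspected - transformed


untouched = inspected_but_untouched(inspect_config, deidentify_config)
```

Here `untouched` contains only the vehicle ID: it is still found by the inspection, but the de-identification leaves it as it is.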

Predefined info types like these have an ID, which we can see, and an implementation, which is not exposed. We do not know the algorithm used to recognise Japanese bank account numbers, and we can’t find out. We simply refer to the predefined info type by its ID.

Introducing Custom Info Types

If we have a need that none of the predefined info types can meet, we may need to define an info type ourselves. There are a number of different custom types, but basically there is one based on regular expressions and a couple more based on dictionaries. The dictionary can be a list of words that we enter into the info type itself, a text file on Google Cloud Storage, or, if we have a very large dictionary, a table stored in BigQuery.
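As a rough sketch, the simpler kinds of custom info type could be written like this inside an inspect config. The info type names, the regular expression, and the bucket path are all made up for illustration. DLP regular expressions use RE2 syntax; the small `looks_like_booking_reference` helper uses Python’s `re` module purely as a local sanity check of the pattern, not as a substitute for the service.

```python
import re

# Hypothetical pattern: six upper-case letters or digits as one word.
BOOKING_PATTERN = r"\b[A-Z0-9]{6}\b"

custom_info_types = [
    # 1. Regular expression.
    {
        "info_type": {"name": "booking-reference"},
        "regex": {"pattern": BOOKING_PATTERN},
    },
    # 2. Dictionary entered directly as a word list.
    {
        "info_type": {"name": "study-level"},
        "dictionary": {"word_list": {"words": ["high school", "college"]}},
    },
    # 3. Dictionary kept in a text file on Cloud Storage
    #    (placeholder path, one term per line).
    {
        "info_type": {"name": "profession"},
        "dictionary": {
            "cloud_storage_path": {"path": "gs://YOUR-BUCKET/professions.txt"}
        },
    },
]

inspect_config = {"custom_info_types": custom_info_types}


def looks_like_booking_reference(text: str) -> bool:
    """Local check of the pattern only; DLP itself is not involved."""
    return re.search(BOOKING_PATTERN, text) is not None
```

The very large, BigQuery-backed dictionary is not defined inline like this; it is created as a stored info type.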

Note that here we are discussing how the different info types are implemented. This is an important difference from the predefined info types, whose implementation is hidden.

Custom info types can be defined as we create the inspection job, or we can store the custom info type definition separately as a “stored info type”, so that we can reuse it in another inspection job. Let’s say we want to mask information in our data about the schools that people have attended and the level at which they have studied, which can only be “high school” or “college”, for simplicity’s sake. The list of schools in the country is a big dictionary, though, and one that we may want to reuse. So let’s create a stored info type for all the schools, and define the study level within the inspection job.

A mix of predefined and custom info types (marked by a star), that is used in an inspection job and a de-identification process.

In the inspection job, we have first added a custom info type for the level of studies, based on a word list of just two words, and we have not stored it externally.

The info type to identify schools is stored outside of the inspection config. The stored info type is assigned a name for easy identification, it gets a fully qualified resource ID, and it has an implementation: In this case, a dictionary on BigQuery. When we include the school info type in our inspect config, we are free to give it the name that we want. I have chosen “school-name” here, to distinguish from the stored info type name “school”. The inspect config refers back to the stored type through its fully qualified resource ID, not by its name.
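Here is a sketch of how the two pieces relate, again as plain dictionaries. The BigQuery dataset, table, and column names and the output bucket are placeholders that I made up; `stored_info_type_name` is a hypothetical helper that simply builds the fully qualified resource ID.

```python
# Sketch only: a stored info type backed by BigQuery, and an inspect
# config that refers to it by resource ID.

PROJECT_ID = "YOUR-PROJECT-ID"


def stored_info_type_name(project_id: str, stored_id: str) -> str:
    """Fully qualified resource ID of a stored info type."""
    return f"projects/{project_id}/locations/global/storedInfoTypes/{stored_id}"


# The stored info type: a name plus an implementation.
stored_info_type_config = {
    "display_name": "school",
    "large_custom_dictionary": {
        # Where DLP writes the dictionary files it generates (placeholder).
        "output_path": {"path": "gs://YOUR-BUCKET/dlp-dictionaries/"},
        # The column that holds the school names (placeholder names).
        "big_query_field": {
            "table": {
                "project_id": PROJECT_ID,
                "dataset_id": "reference_data",
                "table_id": "schools",
            },
            "field": {"name": "school_name"},
        },
    },
}

# The inspect config: the inline study-level type plus a reference to the
# stored type. "school-name" is the name we choose for the findings; the
# link to the stored type is the resource ID, not the display name.
inspect_config = {
    "custom_info_types": [
        {
            "info_type": {"name": "study-level"},
            "dictionary": {"word_list": {"words": ["high school", "college"]}},
        },
        {
            "info_type": {"name": "school-name"},
            "stored_type": {"name": stored_info_type_name(PROJECT_ID, "school")},
        },
    ]
}
```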

The custom info types in the inspect config have another attribute that doesn’t appear in the diagram: a likelihood. Findings based on predefined info types are assigned a likelihood depending on, for example, how likely it is that a series of numbers refers to a Japanese bank account. For custom info types, we can specify the likelihood that their findings are assigned, ranging from “very unlikely” to “very likely”. At the level of the job, we can then specify that we only want to see findings that are at least “likely”, for example. If we set the findings from our custom info type to “unlikely”, they will be filtered out. This kind of tweaking can reduce noise, but you can also save all findings to BigQuery and do any filtering there.
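The filtering logic can be illustrated with a small local simulation. The `likelihood` and `min_likelihood` fields are part of the inspect config; the functions `passes` and `surviving_custom_types` are hypothetical helpers that only mimic the comparison, and they ignore anything else that can adjust a likelihood (such as detection rules). I am also assuming that a custom info type without an explicit likelihood defaults to “very likely”.

```python
# Likelihood levels, from lowest to highest.
LIKELIHOODS = ["VERY_UNLIKELY", "UNLIKELY", "POSSIBLE", "LIKELY", "VERY_LIKELY"]

inspect_config = {
    # Job level: only report findings that are at least LIKELY.
    "min_likelihood": "LIKELY",
    "custom_info_types": [
        {
            "info_type": {"name": "study-level"},
            "likelihood": "UNLIKELY",  # will be filtered out
            "dictionary": {"word_list": {"words": ["high school", "college"]}},
        },
        {
            "info_type": {"name": "school-name"},
            "likelihood": "VERY_LIKELY",  # will be reported
            "stored_type": {
                "name": "projects/YOUR-PROJECT-ID/locations/global/storedInfoTypes/school"
            },
        },
    ],
}


def passes(finding_likelihood: str, min_likelihood: str) -> bool:
    """True if a finding is at least as likely as the job-level minimum."""
    return LIKELIHOODS.index(finding_likelihood) >= LIKELIHOODS.index(min_likelihood)


def surviving_custom_types(config: dict) -> list[str]:
    """Names of custom info types whose findings pass the minimum."""
    minimum = config["min_likelihood"]
    return [
        c["info_type"]["name"]
        for c in config["custom_info_types"]
        if passes(c.get("likelihood", "VERY_LIKELY"), minimum)
    ]


reported = surviving_custom_types(inspect_config)
```

With these settings, only the school findings make it through; the study-level findings are dropped because “unlikely” is below the “likely” threshold.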

Those are the concepts. Let’s see how this is done in the Google Cloud Console!

Demo Time

We’re going to create a stored info type that detects school names based on a word list, define an inspection template using this info type, and a de-identification template to redact the schools from text.

First, we’ll hop over to the stored info types section of the DLP configuration for your project. Hit the “Create Stored InfoType” button.

The stored info types section of the DLP Configuration in the Google Cloud Console.

We said we’re going to create a stored info type based on a word list, which in this demo saves us the trouble of having to create an external dictionary. So select “List of words or phrases” for the type and enter “school” for the InfoType ID. We’ll call this info type “school” and the list of schools that we’ll recognise is going to be limited to “harvard”, “yale”, and “brown”. Note that matching is case-insensitive, so this will recognise “Harvard” regardless of which letters are capitals and which are not. Create the info type.

Creating a stored infotype based on a word list.

If you go back to the stored infotypes section of the DLP configuration, you should now see an infotype called “school” there. You may want to click on it and copy the ID of the infotype, because we’ll need it later. It will be something like “projects/YOUR-PROJECT-ID/locations/global/storedInfoTypes/school”, with your project ID inserted at the right location of course.
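If you prefer code over clicking, the same stored info type could be created with the Python client library (`google-cloud-dlp`). This is a sketch under assumptions, not a record of what the console does behind the scenes: the function names are mine, and `create_school_info_type` needs the library installed and credentials for a project with the DLP API enabled.

```python
SCHOOLS = ["harvard", "yale", "brown"]


def stored_info_type_request(project_id: str) -> dict:
    """Build the request for a word-list based stored info type."""
    return {
        "parent": f"projects/{project_id}/locations/global",
        "stored_info_type_id": "school",
        "config": {
            "display_name": "school",
            "dictionary": {"word_list": {"words": SCHOOLS}},
        },
    }


def create_school_info_type(project_id: str) -> str:
    """Create the stored info type and return its resource ID.

    Not called here: requires google-cloud-dlp and valid credentials.
    """
    from google.cloud import dlp_v2

    client = dlp_v2.DlpServiceClient()
    stored = client.create_stored_info_type(
        request=stored_info_type_request(project_id)
    )
    return stored.name
```

Calling `create_school_info_type("YOUR-PROJECT-ID")` should return the same kind of ID that the console shows, ending in “storedInfoTypes/school”.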

Now let’s create an inspection template, which we can later use to quickly create inspection jobs that look for school names. In the inspection templates section of the DLP configuration, click “create template”.

Under step 1 of the template creation process, we’ll just set the template ID to “find-schools”. Under step 2, we click “manage infotypes” and in the side panel that opens, we choose “custom”. There are no types listed there, so let’s add our stored infotype to this template. Click “Add custom infotype”, select “Stored InfoType” for the type, use “school-name” to name it, and provide the ID of the stored infotype that we created earlier (remember to replace “YOUR-PROJECT-ID” with your actual project ID).

Adding our custom infotype to the inspection template.

Save the infotype and create the template. You should now see your inspection template configuration, as below.

The “find-schools” inspection template with a custom stored infotype.
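For reference, a programmatic equivalent of this template might look like the sketch below, using the Python client library. The function names are hypothetical, and `create_find_schools_template` is only defined, not called, because it needs `google-cloud-dlp` and credentials.

```python
def inspect_template_request(project_id: str) -> dict:
    """Build the request for the "find-schools" inspection template."""
    parent = f"projects/{project_id}/locations/global"
    return {
        "parent": parent,
        "template_id": "find-schools",
        "inspect_template": {
            "display_name": "find-schools",
            "inspect_config": {
                "custom_info_types": [
                    {
                        # The name we give the findings in this template.
                        "info_type": {"name": "school-name"},
                        # The stored info type, referenced by resource ID.
                        "stored_type": {"name": f"{parent}/storedInfoTypes/school"},
                    }
                ]
            },
        },
    }


def create_find_schools_template(project_id: str) -> str:
    """Create the template and return its resource name.

    Not called here: requires google-cloud-dlp and valid credentials.
    """
    from google.cloud import dlp_v2

    client = dlp_v2.DlpServiceClient()
    template = client.create_inspect_template(
        request=inspect_template_request(project_id)
    )
    return template.name
```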

Now finally, we’re going to create a de-identification template, and test it, to see if it works. Head over to the de-identification template section of the DLP configuration and hit “Create Template” there.

In the “Define template” step, we’ll use “redact-schools” as the template ID. Under step 2, “Configure de-identification”, we’ll choose “replace with infotype name” as the method. We could just apply this rule to all the infotypes from the inspection template, but let’s make this explicit. Select “specify infotypes”, and “manage infotypes”. A side panel opens where we can select any of the built-in infotypes, or we can select “name-only”. Click that, and “add infotype name”. Here, we have to add the name that we gave this infotype in the inspection template, “school-name”.

Adding the custom infotype to the de-identification template.

Click “done” and make sure that the “school-name” infotype is now listed. Click “create” to create the template.

Details of the de-identification template.
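The de-identification template can be sketched in the same way. “Replace with infotype name” corresponds to the `replace_with_info_type_config` transformation, which takes no parameters. As before, the function names are my own and the function that talks to the service is only defined, not called.

```python
def deidentify_template_request(project_id: str) -> dict:
    """Build the request for the "redact-schools" de-identification template."""
    return {
        "parent": f"projects/{project_id}/locations/global",
        "template_id": "redact-schools",
        "deidentify_template": {
            "display_name": "redact-schools",
            "deidentify_config": {
                "info_type_transformations": {
                    "transformations": [
                        {
                            # Must match the name used in the inspection template.
                            "info_types": [{"name": "school-name"}],
                            "primitive_transformation": {
                                "replace_with_info_type_config": {}
                            },
                        }
                    ]
                }
            },
        },
    }


def create_redact_schools_template(project_id: str) -> str:
    """Create the template and return its resource name.

    Not called here: requires google-cloud-dlp and valid credentials.
    """
    from google.cloud import dlp_v2

    client = dlp_v2.DlpServiceClient()
    template = client.create_deidentify_template(
        request=deidentify_template_request(project_id)
    )
    return template.name
```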

Now we finally arrive at the moment of truth. Hit “test”, and see if this works. Note that the “test” page specifies that any text you enter will be scanned with a default list of info types. Logically, our new custom info type is not part of this list, so we will have to provide our inspection template for this to work. Enter some text to see if it works!

Testing the de-identification template.
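The same test can be run from code by passing both template names to a de-identify request. The sketch below assumes the two templates from this demo exist in your project. Because that call needs credentials, I have also added `local_redact`, a purely local approximation that mimics the expected result with a case-insensitive regular expression. It is not how DLP matches dictionaries internally, just a way to see what the output should look like.

```python
import re

SCHOOLS = ["harvard", "yale", "brown"]


def deidentify_request(project_id: str, text: str) -> dict:
    """Build a de-identify request that uses both templates from the demo."""
    parent = f"projects/{project_id}/locations/global"
    return {
        "parent": parent,
        "inspect_template_name": f"{parent}/inspectTemplates/find-schools",
        "deidentify_template_name": f"{parent}/deidentifyTemplates/redact-schools",
        "item": {"value": text},
    }


def redact_with_dlp(project_id: str, text: str) -> str:
    """Send the text to DLP and return the de-identified version.

    Not called here: requires google-cloud-dlp and valid credentials.
    """
    from google.cloud import dlp_v2

    client = dlp_v2.DlpServiceClient()
    response = client.deidentify_content(request=deidentify_request(project_id, text))
    return response.item.value


def local_redact(text: str) -> str:
    """Local approximation: replace whole-word matches, ignoring case."""
    pattern = r"\b(" + "|".join(re.escape(s) for s in SCHOOLS) + r")\b"
    return re.sub(pattern, "[school-name]", text, flags=re.IGNORECASE)


sample = "I studied at Harvard and later at YALE."
redacted = local_redact(sample)
# redacted == "I studied at [school-name] and later at [school-name]."
```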

Using custom info types gives you great freedom to scan for and de-identify very domain-specific information, such as local school names or departments. With regular expressions, you can also detect typical identifiers for certain use cases, such as ticket numbers or booking references in the airline industry.

I’m a Strategic Cloud Engineer with Google Cloud, based in Madrid, Spain. I focus on Data & Analytics and ML Ops.