Anonymize Results in Azure Cognitive Search

Published in

Microsoft Azure

6 min read · Nov 3, 2021

Azure Cognitive Search is a cloud search service that gives developers infrastructure, APIs, and tools for building a rich search experience. Being able to search over your data is a basic need today, and as time progresses, search mechanisms have to meet more complicated needs.

One of the issues with search mechanisms is the data itself. Not all data can be retained, saved, and searched over: data can easily contain personally identifiable information (PII), which can cause security issues or GDPR violations.

So how can we avoid this and protect our data while still offering search options over it?

Well, there is more than one option:

Option 1:

Process the data before inserting it as search data. Wherever the data comes from, we need to process it before it is inserted into the search DB. This method works, but it adds another hop the data has to go through, which means another place where it can fail and another place where we need to worry about data consistency.

Disadvantages:

1. This method can cause data loss.

2. It requires maintaining another process.

Option 2:

Use the indexing mechanism of Cognitive Search. For search to be effective, a proper indexing process needs to run over existing and new data. While the indexing process runs, we can manipulate the data.

Azure Cognitive Search gives us two options for manipulating the data while indexing.

Built-in power skills:

Use the Azure Cognitive Search built-in PII detection skill. Azure Cognitive Search lets you process the data while it is being indexed, which means you won’t need another hop or another process to maintain; the search mechanism takes care of it on its own.

For more flexibility and agility over the search, we can use customized power skills.

Customized power skills:

Power Skills are a collection of useful functions to be deployed as custom skills for Azure Cognitive Search. You can find working code samples and details on how to build your own power skill in the azure-search-power-skills repository.

We used the customized power skills repository to add Presidio as a new customized skill. Presidio is an open-source tool to recognize, analyze and anonymize personally identifiable information (PII). Using trained ML models, Presidio was built to ensure sensitive text is properly managed and governed.

Using a customized power skill gives us control over the skill itself: we can customize it and add more models or functionality if we like. The sample repository uses either Azure Functions or Docker containers to deploy a power skill.

We based the Presidio power skill on the Python FastAPI power skill and used a Docker container, App Service, and Terraform to deploy it. The skill in the azure-search-power-skills repository wraps the Presidio Python package with FastAPI and is deployed as a Docker container.

So how does it work?

After creating an Azure search resource, we have several configurations to define in order for the Presidio anonymizer to work on our data while indexing.

First things first, we will need to deploy Presidio as a Docker container to an app service. A full guide can be found here on how to build, push and deploy Presidio as an Azure web app.

Once the application is up and running, your resource group should look something like this:

To make this all work we have many components:

Nice drawing but what does it mean? We will explain it step by step.

1. Data source: where the search data and results come from.
2. Search service, which contains the skillset, index, and indexer. Together they let us plug in the power skill, index the search data, and get the anonymized results.
3. App service with Presidio running on it.

A full guide on how to define all of them can be found in the readme file.

So how do they all play together?

Let's start with the data source, where our search data comes from, by selecting one of the following options:

In this case we used Azure blob storage with text files:
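As a rough sketch, a blob data source can also be described as a JSON body for the Search REST API instead of being created in the portal. The names `presidio-demo-blobs` and `documents` and the placeholder connection string are assumptions made for illustration; they are not values from this walkthrough.

```python
import json

# Hypothetical names; replace them with your own storage account and container.
data_source = {
    "name": "presidio-demo-blobs",
    "type": "azureblob",  # data source type for Azure Blob Storage
    "credentials": {"connectionString": "<storage-connection-string>"},
    "container": {"name": "documents"},
}

# JSON body that could be sent when creating the data source on the search service.
print(json.dumps(data_source, indent=2))
```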

One of the text files contained the name Buzz Hargrove, which is PII. Our goal is to prevent the name from being indexed and appearing in search results:

Cool, so we have the application and the data source is connected. Now we need to connect the search service to the application. This is done by defining a skillset:
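For readers who prefer a definition over a screenshot, here is a minimal sketch of what such a skillset might look like as JSON, built in Python. It assumes the web app is registered as a custom Web API skill; the skillset name, the host placeholder, the input and output name `text`, and the target name `anonymized_text` are all hypothetical choices and may differ from the sample's actual definition.

```python
import json

# Hypothetical host; /api/extraction is the route used in this article.
SKILL_URI = "https://<your-web-app>.azurewebsites.net/api/extraction"

skillset = {
    "name": "presidio-skillset",  # hypothetical name
    "description": "Anonymize PII with a Presidio custom skill",
    "skills": [
        {
            "@odata.type": "#Microsoft.Skills.Custom.WebApiSkill",
            "uri": SKILL_URI,
            "httpMethod": "POST",
            "context": "/document",
            # Send the raw blob text to the skill.
            "inputs": [{"name": "text", "source": "/document/content"}],
            # Expose the skill's result as /document/anonymized_text.
            "outputs": [{"name": "text", "targetName": "anonymized_text"}],
        }
    ],
}

print(json.dumps(skillset, indent=2))
```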

We define the URI of the web app:

POST /api/extraction
{
   "text": TEXT_TO_ANONYMIZE
}

This method receives the text, analyzes and anonymizes it, and returns the anonymized text:
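The request shown above is simplified. When Azure Cognitive Search calls a custom Web API skill, it wraps the records in a `values` array, each with a `recordId` and a `data` object, and expects the same envelope back. The sketch below illustrates that round trip under some assumptions: `handle_skill_request` and `mask_known_names` are hypothetical helpers, and the name masking is a hard-coded stand-in for Presidio so the example runs without extra packages. It is not the code from the sample repository.

```python
import re
from typing import Callable

# Stand-in for Presidio: masks a hard-coded list of names.
# A real deployment would call Presidio's analyzer and anonymizer here instead.
KNOWN_NAMES = ["Buzz Hargrove"]


def mask_known_names(text: str) -> str:
    pattern = "|".join(re.escape(name) for name in KNOWN_NAMES)
    return re.sub(pattern, "<PERSON>", text)


def handle_skill_request(
    body: dict, anonymize: Callable[[str], str] = mask_known_names
) -> dict:
    """Apply `anonymize` to every record of a custom-skill request body."""
    results = []
    for record in body.get("values", []):
        text = record.get("data", {}).get("text", "")
        results.append(
            {
                "recordId": record["recordId"],  # must echo the incoming id
                "data": {"text": anonymize(text)},
                "errors": [],
                "warnings": [],
            }
        )
    return {"values": results}


request_body = {
    "values": [
        {"recordId": "1", "data": {"text": "Buzz Hargrove met the Deputy Minister."}}
    ]
}
response_body = handle_skill_request(request_body)
print(response_body["values"][0]["data"]["text"])
```

Passing the anonymizer in as a callable keeps the envelope handling separate from the PII logic, so the stand-in can be swapped for a Presidio-backed function without touching the rest.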

With the skillset defined, the next step is to create the index. The index makes sure the specific field we want is indexed. In our case it is ‘content’:
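A minimal index definition with a searchable ‘content’ field might look like the sketch below. The index name `presidio-index` and the key field `id` are assumptions for illustration; only the ‘content’ field comes from this walkthrough.

```python
import json

index = {
    "name": "presidio-index",  # hypothetical name
    "fields": [
        # Every index needs exactly one key field of type Edm.String.
        {"name": "id", "type": "Edm.String", "key": True, "searchable": False},
        # The field that will hold the text and be searched over.
        {"name": "content", "type": "Edm.String", "searchable": True, "retrievable": True},
    ],
}

print(json.dumps(index, indent=2))
```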

As the finishing touch, we define the indexer. The indexer connects the index to the skillset and lets us run the process on a schedule or on demand:
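The indexer ties the pieces together by name. The following sketch shows one possible definition; every name in it (`presidio-indexer`, `presidio-demo-blobs`, `presidio-index`, `presidio-skillset`, `anonymized_text`) is hypothetical, and the two-hour schedule is an arbitrary example. The output field mapping assumes the skill's result is exposed at `/document/anonymized_text` and should be written to the ‘content’ field; the sample's actual mapping may differ.

```python
import json

indexer = {
    "name": "presidio-indexer",
    "dataSourceName": "presidio-demo-blobs",
    "targetIndexName": "presidio-index",
    "skillsetName": "presidio-skillset",
    # ISO 8601 duration: run every two hours.
    "schedule": {"interval": "PT2H"},
    # Use the encoded blob path as the document key.
    "fieldMappings": [
        {
            "sourceFieldName": "metadata_storage_path",
            "targetFieldName": "id",
            "mappingFunction": {"name": "base64Encode"},
        }
    ],
    # Write the skill output, not the raw blob text, into the index field.
    "outputFieldMappings": [
        {"sourceFieldName": "/document/anonymized_text", "targetFieldName": "content"}
    ],
}

print(json.dumps(indexer, indent=2))
```

Besides the schedule, an indexer can be triggered on demand from the portal or through the REST API's run operation.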

Once the indexer has run, the text is anonymized, indexed, and inserted into the search engine. We can view the results:

When we search for “Buzz Hargrove”, success! Nothing is found:

Just to make sure search does work, we search for the word “Deputy”, and the document is returned.
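The same two checks can be run outside the portal with a GET request against the index. The sketch below only builds the query URLs; `build_search_url`, the service name `my-search-service`, and the index name `presidio-index` are hypothetical, and a real request would also need an `api-key` header.

```python
from urllib.parse import urlencode


def build_search_url(
    service: str, index: str, query: str, api_version: str = "2020-06-30"
) -> str:
    """Build a simple search URL for an Azure Cognitive Search index."""
    params = urlencode({"api-version": api_version, "search": query})
    return f"https://{service}.search.windows.net/indexes/{index}/docs?{params}"


# Phrase query for the name that should no longer be found.
name_url = build_search_url("my-search-service", "presidio-index", '"Buzz Hargrove"')
# Control query for a word that should still match.
control_url = build_search_url("my-search-service", "presidio-index", "Deputy")

print(name_url)
print(control_url)
```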

To sum up, Azure Search Power Skills give you a very powerful tool. It is easy to create your own Power Skill or use an existing one to customize your search data, including filtering, anonymizing, and converting it to your needs, without additional processes or overhead.

Written by Shiran Rubin