Anonymize Results in Azure Cognitive Search

Shiran Rubin
Published in Microsoft Azure
6 min read · Nov 3, 2021

Azure Cognitive Search is a cloud search service that gives developers infrastructure, APIs, and tools for building a rich search experience. Being able to search over your data is a basic need today, and as time progresses, search mechanisms have to meet more complicated needs.

One of the issues with search mechanisms is the data itself. Not all data can be retained, saved and searched over. Data can easily contain personally identifiable information (PII), which can cause security issues or GDPR violations.

So how can we avoid this and protect our data while still offering search options over it?

Well, there is more than one option for this issue:

Option 1:

Process the data before inserting it as search data. Wherever the data comes from, we need to process it before it is inserted into the search DB. This method works, but it adds another hop for the data to go through, which means another place where it can fail and another place where we need to worry about data consistency.

Disadvantages:

  1. This method can cause data loss.
  2. It requires maintaining another process.

Option 2:

Use the indexing mechanism of Cognitive Search. To have an effective search, a proper indexing process needs to run over existing and new data. While the indexing process runs, we can manipulate the data.

Azure Cognitive Search gives us two options for manipulating the data while indexing.

Built-in power skills:

Use the Azure Cognitive Search built-in PII detection skill. Azure Cognitive Search lets you process the data while it is being indexed, so you don't need another hop or process to maintain the data. The search mechanism takes care of it on its own.

For more flexibility and agility over the search, we can use customized power skills.

Customized power skills:

Power Skills are a collection of useful functions to be deployed as custom skills for Azure Cognitive Search. You can find working code samples and details on how to build your own power skill in the azure-search-power-skills repository.

We used the customized power skills repository to add Presidio as a new customized skill. Presidio is an open-source tool to recognize, analyze and anonymize personally identifiable information (PII). Using trained ML models, Presidio was built to ensure sensitive text is properly managed and governed.

Using a customized power skill gives us control over the skill itself: we can customize it and add more models or functionality if we like. The sample repository uses either Azure Functions or Docker containers to deploy a power skill.

We based the Presidio power skill on the Python FastAPI power skill and used a Docker container, App Service and Terraform to deploy it. The skill in the azure-search-power-skills repository uses the Presidio Python package, wraps it with FastAPI, and is deployed as a Docker container.

So how does it work?

After creating an Azure search resource, we have several configurations to define in order for the Presidio anonymizer to work on our data while indexing.

First things first, we will need to deploy Presidio as a Docker container to an app service. A full guide can be found here on how to build, push and deploy Presidio as an Azure web app.

Once the application is up and running, your resource group should look something like this:

Azure resources

To make this all work we have many components:

High level diagram

Nice drawing but what does it mean? We will explain it step by step.

  1. Data source: where the search data and results will come from.
  2. Search service, which contains a skillset, an index and an indexer. Together they let us plug in the power skill, index the search data and get the anonymized results.
  3. App service with Presidio running on it.

A full guide on how to define all of them can be found in the readme file.

So how do they all play together?

Let's start with the data source, where our search data comes from. We select one of the following options:

Data source input options

In this case we used Azure blob storage with text files:

Data source example
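For readers who prefer the REST API over the portal, a blob data source can be described with a small JSON payload. The sketch below only builds that payload and does not call Azure. The names `pii-demo-blobs` and `documents` and the connection string placeholder are assumptions for illustration, not values from the setup described here.

```python
import json


def build_blob_data_source(name: str, connection_string: str, container: str) -> dict:
    """Return a data source definition pointing at an Azure Blob Storage container."""
    return {
        "name": name,
        "type": "azureblob",  # data source type for Azure Blob Storage
        "credentials": {"connectionString": connection_string},
        "container": {"name": container},
    }


# Hypothetical names; replace with your own resources.
data_source = build_blob_data_source(
    name="pii-demo-blobs",
    connection_string="<storage-connection-string>",
    container="documents",
)
print(json.dumps(data_source, indent=2))
```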

One of the text examples contained the name Buzz Hargrove, which is PII. Our goal is to prevent the name from being indexed and appearing in the search:

PII text example for indexing

Cool, so we have the application and the data source connected. Now we need to connect the search to the application. This is done by defining a skillset:

Define search Skillset
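As a rough illustration of what sits behind such a skillset, here is a sketch that builds a skillset definition with a single custom Web API skill. The skillset name, the host name in the URI, the output name `text` and the target name `anonymized_text` are assumptions. Check the readme of the power skill for the values it actually expects.

```python
import json


def build_skillset(name: str, skill_uri: str) -> dict:
    """Return a skillset with one custom Web API skill that anonymizes text."""
    return {
        "name": name,
        "description": "Anonymize PII with a custom Web API skill",
        "skills": [
            {
                "@odata.type": "#Microsoft.Skills.Custom.WebApiSkill",
                "uri": skill_uri,
                "httpMethod": "POST",
                "timeout": "PT30S",
                "batchSize": 1,
                "context": "/document",
                # Send the raw document content to the skill ...
                "inputs": [{"name": "text", "source": "/document/content"}],
                # ... and store what it returns under an assumed node name.
                "outputs": [{"name": "text", "targetName": "anonymized_text"}],
            }
        ],
    }


# Hypothetical web app host; the path matches the endpoint used in this article.
skillset = build_skillset(
    name="presidio-skillset",
    skill_uri="https://<your-web-app>.azurewebsites.net/api/extraction",
)
print(json.dumps(skillset, indent=2))
```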

We define the URI of the web app:

POST /api/extraction
{
"text": TEXT_TO_ANONYMIZE
}

This method receives the text, analyzes and anonymizes it, and returns the anonymized text:

Main logic for Presidio
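On the wire, Azure Cognitive Search calls a custom Web API skill with a batch of records under a `values` array and expects one result per `recordId` in the response. The sketch below shows that envelope handling with a pluggable anonymizer. `fake_anonymize` is a hard-coded stand-in so the example runs without extra packages. It is not Presidio and not the code of the actual power skill.

```python
import re
from typing import Callable


def fake_anonymize(text: str) -> str:
    """Stand-in for a real anonymizer: masks one hard-coded name."""
    return re.sub(r"Buzz Hargrove", "<PERSON>", text)


def handle_skill_request(body: dict, anonymize: Callable[[str], str]) -> dict:
    """Apply `anonymize` to every record and return the custom skill response."""
    results = []
    for record in body.get("values", []):
        entry = {"recordId": record["recordId"], "data": {}, "errors": [], "warnings": []}
        try:
            entry["data"]["text"] = anonymize(record["data"]["text"])
        except Exception as exc:  # report the failure for this record only
            entry["errors"].append({"message": str(exc)})
        results.append(entry)
    return {"values": results}


request_body = {
    "values": [{"recordId": "0", "data": {"text": "Buzz Hargrove, Deputy"}}]
}
response = handle_skill_request(request_body, fake_anonymize)
print(response["values"][0]["data"]["text"])
```

In a real deployment, the `anonymize` argument would be a function that wraps the Presidio analyzer and anonymizer. Keeping it as a parameter makes the envelope logic easy to test on its own.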

We defined the skillset, and the next step is to create the index. The index makes sure the specific field we want is indexed. In our case it is ‘content’:

Define search index
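A minimal index for this scenario could look like the sketch below: one key field plus the searchable ‘content’ field. The index name and the key field name `id` are assumptions. Only ‘content’ comes from the setup described here.

```python
import json


def build_index(name: str) -> dict:
    """Return an index definition with a key field and a searchable content field."""
    return {
        "name": name,
        "fields": [
            # Every index needs exactly one key field of type Edm.String.
            {"name": "id", "type": "Edm.String", "key": True, "searchable": False},
            # The field that will hold the anonymized text.
            {"name": "content", "type": "Edm.String", "searchable": True, "retrievable": True},
        ],
    }


index = build_index("pii-demo-index")  # hypothetical index name
print(json.dumps(index, indent=2))
```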

As a finishing touch, we define the indexer. The indexer connects the index to the skillset and lets us run the process on a schedule or immediately:

Define search indexer
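The indexer definition ties the other pieces together by name. The sketch below assumes the skill writes its output to a hypothetical `/document/anonymized_text` node and maps that node to the ‘content’ field of the index. All resource names are placeholders, and the mapping is illustrative rather than copied from the power skill readme.

```python
import json
from typing import Optional


def build_indexer(
    name: str,
    data_source: str,
    index: str,
    skillset: str,
    interval: Optional[str] = None,
) -> dict:
    """Return an indexer definition; `interval` is an ISO 8601 duration such as 'PT2H'."""
    indexer = {
        "name": name,
        "dataSourceName": data_source,
        "targetIndexName": index,
        "skillsetName": skillset,
        # Route the enriched (anonymized) text into the index field.
        "outputFieldMappings": [
            {"sourceFieldName": "/document/anonymized_text", "targetFieldName": "content"}
        ],
    }
    if interval is not None:
        indexer["schedule"] = {"interval": interval}  # omit to run on demand only
    return indexer


indexer = build_indexer(
    "pii-demo-indexer", "pii-demo-blobs", "pii-demo-index", "presidio-skillset", interval="PT2H"
)
print(json.dumps(indexer, indent=2))
```

Leaving out `interval` produces an indexer without a schedule, which can then be triggered manually when needed.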

Once the indexer has run, the text is anonymized, indexed and inserted into the search engine. We can view the results:

All search results

When we try to search for “Buzz Hargrove”, success! Nothing is found:

Empty result for Buzz Hargrove

Just to make sure the search does work, we search for the word “Deputy”, and it is returned:

“Deputy” return search results
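The same two checks can be scripted against the search REST API instead of the portal. The sketch below only builds the requests. The service name, index name and key are placeholders, and `2020-06-30` is one API version that supports this call. Sending a request requires a real service, so that line is left commented out.

```python
import json
import urllib.request

API_VERSION = "2020-06-30"


def build_search_request(service: str, index: str, api_key: str, query: str) -> urllib.request.Request:
    """Build a POST request for the 'search documents' endpoint of an index."""
    url = (
        f"https://{service}.search.windows.net/indexes/{index}"
        f"/docs/search?api-version={API_VERSION}"
    )
    body = json.dumps({"search": query, "select": "content"}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json", "api-key": api_key},
    )


# Hypothetical service and index names; a quoted query asks for the exact phrase.
pii_request = build_search_request("my-search-service", "pii-demo-index", "<query-key>", '"Buzz Hargrove"')
control_request = build_search_request("my-search-service", "pii-demo-index", "<query-key>", "Deputy")
print(pii_request.full_url)

# To run against a real service:
# with urllib.request.urlopen(pii_request) as resp:
#     print(json.load(resp)["value"])  # expected to be empty if anonymization worked
```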

To sum up, Azure Search Power Skills give you a very powerful tool. It is easy to create your own power skill or use an existing one to customize your search data, including filtering, anonymizing and converting it to your needs, without additional processes or overhead.
