Anonymize PII Data in Spark using Presidio (ML Based)

Balamurugan Balakreshnan
Published in Analytics Vidhya
2 min read · Sep 1, 2021

Anonymizing text that contains PII using Azure Databricks

Use Case

  • Ability to anonymize PII data in a dataset
  • Useful for data engineering
  • Useful for machine learning

Prerequisites

  • Azure Account
  • Azure Storage account
  • Azure Databricks
  • Install the following libraries:
presidio-analyzer
presidio-anonymizer
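
Before running the notebook, it can help to verify that both libraries are importable on the cluster. The following is a minimal sketch, not part of the original article; the helper name `missing_packages` is hypothetical. Note that the import names use underscores while the pip package names use hyphens. By default the Presidio analyzer also relies on a spaCy English model, which this check does not cover.

```python
from importlib.util import find_spec

def missing_packages(module_names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in module_names if find_spec(name) is None]

# Import names for the two Presidio packages listed above.
required = ["presidio_analyzer", "presidio_anonymizer"]
missing = missing_packages(required)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All Presidio modules are importable.")
```

If anything is reported missing, install the corresponding library on the cluster before continuing.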

Reference

Code in Spark

  • Confirm that the Presidio libraries listed above are installed
  • Now let's write the code
  • Bring in all the imports
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities.engine import OperatorConfig
from pyspark.sql.types import StringType
from pyspark.sql.functions import input_file_name, regexp_replace
from pyspark.sql.functions import col, pandas_udf
import pandas as pd
import os
  • Load the sample Titanic data set
df = spark.sql("Select * from default.titanictbl")
  • Now display the data
display(df)
  • Initialize the analyzer and anonymizer engines, and broadcast them to the workers
analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()
broadcasted_analyzer = sc.broadcast(analyzer)
broadcasted_anonymizer = sc.broadcast(anonymizer)
  • Create the UDF to anonymize
def anonymize_text(text: str) -> str:
    # Retrieve the engines that were broadcast from the driver
    analyzer = broadcasted_analyzer.value
    anonymizer = broadcasted_anonymizer.value
    # Detect PII entities, then replace each one with a fixed placeholder
    analyzer_results = analyzer.analyze(text=text, language="en")
    anonymized_results = anonymizer.anonymize(
        text=text,
        analyzer_results=analyzer_results,
        operators={"DEFAULT": OperatorConfig("replace", {"new_value": "<ANONYMIZED>"})},
    )
    return anonymized_results.text

def anonymize_series(s: pd.Series) -> pd.Series:
    return s.apply(anonymize_text)

# Define the function as a pandas UDF
anonymize = pandas_udf(anonymize_series, returnType=StringType())
# Column to anonymize
anonymized_column = "Name"
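
To make the effect of the "replace" operator concrete, here is a conceptual sketch in plain Python. It is not Presidio's implementation: the function `replace_spans` and the `(start, end)` tuples are hypothetical stand-ins for the analyzer results, and the sketch assumes the detected spans do not overlap.

```python
from typing import List, Tuple

def replace_spans(text: str, spans: List[Tuple[int, int]],
                  new_value: str = "<ANONYMIZED>") -> str:
    """Replace each (start, end) character span in text with new_value."""
    # Work from right to left so earlier offsets stay valid after each replacement.
    for start, end in sorted(spans, reverse=True):
        text = text[:start] + new_value + text[end:]
    return text

# Hypothetical detection: characters 12..23 hold a person's name.
name = "Braund, Mr. Owen Harris"
print(replace_spans(name, [(12, 23)]))  # prints "Braund, Mr. <ANONYMIZED>"
```

In the real pipeline the spans come from the analyzer, and the anonymizer applies the configured operator to each detected entity.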
  • Now anonymize the data
# Apply the UDF to the selected column
anonymized_df = df.withColumn(
    anonymized_column, anonymize(col(anonymized_column))
)
display(anonymized_df)
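
The Name column works well here, but other columns may contain nulls or empty strings, which the anonymizing function may not handle. One defensive option, an assumption that is not part of the original notebook, is to wrap the function so that such values pass through untouched. The names `null_safe` and `fake_anonymize` below are hypothetical; `fake_anonymize` only stands in for `anonymize_text` so the sketch runs without Presidio.

```python
from typing import Callable, Optional

def null_safe(fn: Callable[[str], str]) -> Callable[[Optional[str]], Optional[str]]:
    """Wrap fn so that None, non-string, and blank inputs are returned unchanged."""
    def wrapper(value):
        if not isinstance(value, str) or not value.strip():
            return value
        return fn(value)
    return wrapper

# Stand-in for anonymize_text so this sketch runs without Presidio installed.
def fake_anonymize(text: str) -> str:
    return "<ANONYMIZED>"

safe_anonymize = null_safe(fake_anonymize)
print(safe_anonymize("Owen Harris"))
print(safe_anonymize(None))
```

In the notebook, the equivalent change would be to call `s.apply(null_safe(anonymize_text))` inside `anonymize_series`.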

Originally published at https://github.com.
