Automating data redaction from PDF files using GCP

Published in Google Cloud - Community, Oct 31, 2023

PDF Redaction

PDF files have become an indispensable part of our daily lives and the digital world. Whether it’s a document, a resume, a report, or a simple invoice, PDFs offer a standardized and secure way to share information. The popularity of PDF files has grown exponentially in recent years, and millions are created each day.

At the same time, data privacy and security have become a critical concern for companies as they store and manage increasing amounts of sensitive and personally identifiable information (PII). From financial records to health information, the amount of data that organizations need to protect is growing rapidly, and the consequences of a data breach can be severe.

That is why redacting sensitive information is crucial for protecting privacy and complying with data protection regulations such as the General Data Protection Regulation (GDPR) and the Health Insurance Portability and Accountability Act (HIPAA), to name just two. These regulations require that sensitive information be securely stored and protected from unauthorized access and accidental exposure.

In this guide, we will show you how to leverage Google Cloud Platform to automate the redaction and masking of sensitive data from PDF files. This solution leans on serverless offerings from Google Cloud to support easy automation and scalability while keeping operational overhead to a minimum.

The Challenge

Redacting PDF documents isn’t a straightforward task. Off the top of my head, a few steps come to mind when tackling it:

Understand the document: Decode the PDF document and extract its text using an OCR library or the like. PDFs can be made of text, but they may also contain images, which in turn might contain text as well.
Identify sensitive text (PII, PHI, credit cards, or whatever fields you are looking for): Some fields can be easily spotted using regular expressions (e.g. emails, credit cards, phone numbers, IPs), but others are simply harder to find or identify (think about people’s names: no regular expression will help you with that).
Redact the sensitive text: Once we’ve found the piece of text that needs to be redacted, we need to be able to mask it out or replace it with placeholder text.
Rebuild the PDF document: Once we’ve redacted all sensitive information we need to be able to rebuild the PDF document while honouring its structure, layout, and format.
Automation: We obviously don’t want to do all of the above by hand; we want a solution that automates the process for us with little to no human intervention.
Scalability: We need to ensure that we can handle more than a dozen PDF files; ideally, we’d be able to run it on millions of documents at the same time.
(Bonus) Make it generic: Ideally, we would like this solution to support a wide range of data types and fields and different regionalizations (US, Europe, etc.), and also to let users apply traditional rules or regular expressions. For instance, company X might be interested in redacting SSNs, while other use cases might just need to redact all company emails, which can be covered with a simple regular expression (for example, .+@company\.com).
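To make the point about regular expressions concrete, here is a minimal Python sketch using only the standard `re` module. The patterns are simplified illustrations (they are assumptions, not the detectors used by this solution or by DLP), and `company.com` is the example domain from the text. Note how the email and the IP address are caught while the person's name is not.

```python
import re

# Simplified illustrative patterns; real-world detectors need to be stricter.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
COMPANY_EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@company\.com")
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def mask(text: str, pattern: re.Pattern, placeholder: str) -> str:
    """Replace every match of `pattern` in `text` with `placeholder`."""
    return pattern.sub(placeholder, text)


sample = "Contact Jane Doe at jane.doe@company.com from 10.0.0.12."
masked = mask(mask(sample, EMAIL, "[EMAIL]"), IPV4, "[IP]")
print(masked)  # Contact Jane Doe at [EMAIL] from [IP].
```

The name "Jane Doe" survives untouched, which is exactly the kind of field that calls for a detection service rather than a hand-written pattern.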

All of the above sounds like a daunting task, even more so if I tell you that, in our case, we needed to redact more than 100,000 documents in less than 3 months!

The Solution

This is how we solved it. Take a look at the following architecture diagram.

Reference Architecture using Cloud Run, Workflows, DLP, GCS, and BigQuery

Let’s quickly unpack it:

Eventarc to automatically trigger the execution of a Workflow when a file is uploaded to the GCS bucket.
Cloud Workflows to orchestrate the redaction process and provide resiliency, retries, and scalability.
Data Loss Prevention (DLP) to redact sensitive information leveraging DLP Templates.
Cloud Run to execute each step and manipulate the PDF files.
Cloud Storage to store the PDF files to be redacted and the results and intermediate files.
BigQuery to save metadata about the redacted files for further analysis and auditing purposes.

How does it work?

The redaction workflow consists of the following steps:

Split the PDF into single pages, convert pages into images, and store them in a working GCS bucket.
Redact each image using DLP Image Redact API.
Assemble the redacted pages into the final PDF.
Write the redacted PDF to the output GCS bucket.
Write metadata to BigQuery for audit and analytical purposes.
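The image redaction step is the heart of the workflow. The sketch below shows roughly what a single-page redaction call could look like with the `google-cloud-dlp` Python client. It is not the repository's code: the helper names, the `demo-project` ID, and the chosen info types are assumptions for illustration, and it passes an inline `inspect_config` rather than a DLP template.

```python
from typing import Any, Dict, List

# Numeric value of ByteContentItem.BytesType.IMAGE_PNG in the DLP API.
IMAGE_PNG = 3


def build_redact_request(project_id: str, image_png: bytes,
                         info_types: List[str]) -> Dict[str, Any]:
    """Build the request for DlpServiceClient.redact_image (hypothetical helper)."""
    return {
        "parent": f"projects/{project_id}/locations/global",
        "inspect_config": {"info_types": [{"name": n} for n in info_types]},
        "byte_item": {"type_": IMAGE_PNG, "data": image_png},
        # Ask DLP to return the findings alongside the redacted image.
        "include_findings": True,
    }


def redact_page(project_id: str, image_png: bytes, info_types: List[str]) -> bytes:
    """Send one page image to DLP and return the redacted image bytes."""
    # Imported lazily so the request builder works without the client installed.
    from google.cloud import dlp_v2

    client = dlp_v2.DlpServiceClient()
    response = client.redact_image(
        request=build_redact_request(project_id, image_png, info_types)
    )
    return response.redacted_image


# Placeholder bytes and project ID, used only to show the request shape.
request = build_redact_request(
    "demo-project", b"<png bytes>", ["EMAIL_ADDRESS", "PHONE_NUMBER", "PERSON_NAME"]
)
print(request["parent"])
```

Calling `redact_page` requires valid credentials and the DLP API enabled on the project; `build_redact_request` can be inspected locally without either.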

More about DLP

DLP is a cloud-based service that provides advanced data scanning and detection capabilities to identify and classify sensitive information, such as personally identifiable information (PII), financial information, and confidential business data. DLP offers several features, including data discovery, data redaction, data masking, data transformation, and integration with other security tools for a comprehensive data protection solution.

DLP offers 150+ built-in Info Types. Info Types are predefined categories of sensitive information such as names, addresses, social security numbers, credit card numbers, etc. They are used to identify and classify sensitive information within a dataset.

There are two types of Info Types in DLP:

Built-in Info Types: These are predefined Info Types provided by GCP DLP that can be used without the need for customization. Examples include names, addresses, phone numbers, email addresses, etc.
Custom Info Types: These Info Types are defined and created by the user to meet their specific needs. Custom Info Types can be used to identify specific types of sensitive information that are not covered by the built-in Info Types. Examples are regular expressions and custom dictionaries (lists of words known upfront that Cloud DLP matches on).
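As an illustration of the two kinds of custom Info Types mentioned above, the following sketch builds them as plain Python dictionaries in the shape the DLP API uses for `custom_info_types` (`regex.pattern` and `dictionary.word_list.words`). The Info Type names, the pattern, and the word list are made-up examples.

```python
import re

# Hypothetical custom info type backed by a regular expression.
company_email = {
    "info_type": {"name": "COMPANY_EMAIL"},
    "regex": {"pattern": r"[A-Za-z0-9._%+-]+@company\.com"},
}

# Hypothetical custom info type backed by a dictionary of known words.
project_codenames = {
    "info_type": {"name": "PROJECT_CODENAME"},
    "dictionary": {"word_list": {"words": ["Bluebird", "Nightingale"]}},
}

# An inspect configuration can mix built-in and custom info types.
inspect_config = {
    "info_types": [{"name": "EMAIL_ADDRESS"}, {"name": "PHONE_NUMBER"}],
    "custom_info_types": [company_email, project_codenames],
}

# Quick local check of the pattern before handing it to DLP.
pattern = company_email["regex"]["pattern"]
print(bool(re.fullmatch(pattern, "jane.doe@company.com")))
```

DLP evaluates regular expressions with RE2 syntax, so patterns should stick to features RE2 supports (no lookarounds or backreferences, for instance).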

One of the nice things about this solution is that it leverages DLP Templates. DLP Templates allow you to group and standardize the list of INFO_TYPES that you want to redact: you can edit an existing template or create a new one. By changing the value of the dlp_template field in the workflow config file, you can use your own DLP template to redact the fields that are specific to your use case.
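For readers who prefer to create their own template programmatically, here is a rough sketch using the `create_inspect_template` method of the `google-cloud-dlp` client. The project ID, template ID, display name, and helper names are placeholders, and the exact value format expected by the workflow's dlp_template field should be checked in the repository.

```python
from typing import Any, Dict, List


def build_template_request(project_id: str, template_id: str,
                           info_types: List[str]) -> Dict[str, Any]:
    """Build a create_inspect_template request (hypothetical helper)."""
    return {
        "parent": f"projects/{project_id}/locations/global",
        "template_id": template_id,
        "inspect_template": {
            "display_name": "PDF redaction template",
            "inspect_config": {
                "info_types": [{"name": n} for n in info_types],
            },
        },
    }


def create_template(project_id: str, template_id: str,
                    info_types: List[str]) -> str:
    """Create the template in DLP and return its full resource name."""
    # Imported lazily so the request builder works without the client installed.
    from google.cloud import dlp_v2

    client = dlp_v2.DlpServiceClient()
    template = client.create_inspect_template(
        request=build_template_request(project_id, template_id, info_types)
    )
    return template.name


# Placeholder identifiers, used only to show the request shape.
req = build_template_request(
    "demo-project", "pdf-redaction", ["US_SOCIAL_SECURITY_NUMBER"]
)
print(req["inspect_template"]["inspect_config"])
```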

Deploying the application

We have open sourced the solution and it is available in our Google Cloud GitHub repository:

GitHub - GoogleCloudPlatform/dlp-pdf-redaction (https://github.com/GoogleCloudPlatform/dlp-pdf-redaction)

You can follow these simple steps to deploy it on your own GCP account:

Note: The following steps should be executed in Cloud Shell in the Google Cloud Console.

1. Create a project in GCP and enable billing. Follow the steps in this guide.

2. Run the following commands on your Cloud Shell console (or your favorite terminal)

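# Clone the repository and move into it
git clone https://github.com/GoogleCloudPlatform/dlp-pdf-redaction
cd dlp-pdf-redaction
# PROJECT_ID must already hold the ID of the project created in step 1
export TF_VAR_project_id=$PROJECT_ID
# Initialize Terraform and deploy the infrastructure
terraform -chdir=terraform init
terraform -chdir=terraform apply -auto-approve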

3. Take note of Terraform Outputs

terraform -chdir=terraform output

4. Test it out!
Use the command below to upload the test PDF file into the input_bucket (you can also manually add any .pdf file to the Input Bucket gs://pdf-input-bucket-xxxx).
Alternatively, you can upload files into the input_bucket by dragging and dropping them in the web console.

gsutil cp ./test_file.pdf [INPUT_BUCKET_FROM_OUTPUT e.g. gs://pdf-input-bucket-xxxx]

5. After a few seconds, you should see a redacted PDF file in the output_bucket.

Sample input and output files where Email, Phone, and Name were redacted automatically

6. If you are curious about what happens behind the scenes, try the following:

Check out the redacted file in the output_bucket.

gsutil ls [OUTPUT_BUCKET_FROM_OUTPUT e.g. gs://pdf-output-bucket-xxxx]

Download the redacted PDF file, open it with your preferred PDF reader, and search for text in it. See how the PDF is still searchable!
Look into Cloud Workflows in the GCP web console. You will see that a workflow execution was triggered when you uploaded the file to GCS.
Explore the pdf_redaction_xxxx dataset in BigQuery and check out the metadata that was inserted into the findings table.

Some considerations

Size explosion: The solution converts each page into an image and runs it through the DLP image redaction API. As a result, a simple text-based PDF can grow significantly in size.
On the flip side, the PDF will retain all its structure and will still be searchable and selectable.
Text will be converted into images: As explained above, all pages are converted into images, redacted, and assembled back together. This might not be the best solution for some use cases. The DPI and quality of the images can be adjusted through a setting in the workflow.
PDF compatibility (relies on third-party libraries): Third-party Python libraries are used to parse and convert the PDF into images and to assemble it back together. There is no guarantee that all PDF documents will be properly parsed and processed by these libraries. Issues have been reported with signed DocuSign PDFs.
Not production ready: This solution is provided as a demonstration of what is possible, but it is not production ready as-is. Before deploying it for production workloads, please ensure that you have added proper error handling, retries, throttling, etc.
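To get a feel for the size trade-off described above, the following back-of-the-envelope sketch estimates the raw (uncompressed) size of one rasterized page. It assumes a US Letter page (8.5 x 11 inches) and 24-bit RGB pixels; actual files are smaller once the images are compressed, and the DPI values are only examples, not the solution's defaults.

```python
def raw_page_bytes(dpi: int, width_in: float = 8.5, height_in: float = 11.0,
                   bytes_per_pixel: int = 3) -> int:
    """Uncompressed size in bytes of one page rendered at `dpi`."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bytes_per_pixel


for dpi in (150, 300):
    print(dpi, raw_page_bytes(dpi))
# 150 6311250
# 300 25245000
```

Doubling the DPI quadruples the pixel count, which is why the DPI setting is the first knob to reach for when output files get too large.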

Conclusion

In conclusion, automating the redaction of sensitive information from PDF documents is not an easy task. However, with the help of Google Cloud’s serverless ecosystem, we were able to solve the challenge efficiently. We leveraged services such as Data Loss Prevention (DLP), Cloud Run, Cloud Storage, and BigQuery to create a solution that supports a wide range of data types and fields, lets users specify regular expressions and lists of words to look for, and handles different regionalizations. More importantly, it scales to hundreds of thousands of documents with minimal operational overhead.

The solution is now available as open-source code in the Google Cloud GitHub repository. The code is provided as-is, but we believe that with the right setup, it can help you redact sensitive information from your PDF documents with ease and at scale.

Credits

Special mention to the co-developer of this solution, and my dearest colleague, Grace Hoogendoorn. Grace was crucial in the design, development, implementation, and open sourcing of this project.