Analyze and Sanitize Gigabytes of Text Corpus for Privacy Vulnerability in Minutes

Yash
Published in ThirdAI Blog
2 min read · Jun 9, 2024

Introduction

Generative AI is revolutionizing enterprises by enabling them to extract value from vast amounts of raw, unstructured text, which accounts for an estimated 80% of total data volumes. However, before running Generative AI tools such as ChatGPT on these text corpora, it is crucial to ensure that sensitive, proprietary, and business-critical information remains secure and is not exposed outside the enterprise environment.

Additionally, many applications generate substantial amounts of textual data in real time. If text is processed more slowly than it is generated, a backlog of unsanitized text builds up and the system never catches up. Existing popular cloud solutions are often prohibitively expensive and slow for analyzing and sanitizing gigabytes (or terabytes) of text in these scenarios.

In this blog, we will examine popular solutions for redacting unstructured text and compare them with the capabilities of the ThirdAI Platform for the same purpose.

What is PII Redaction?

PII redaction involves identifying and removing sensitive information from documents and datasets to prevent unauthorized access and protect individual privacy. This includes details like names, addresses, social security numbers, phone numbers, and more. Effective PII redaction is crucial for compliance with data protection regulations such as GDPR and CCPA, and for maintaining the trust of clients and users.
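To make the idea concrete, here is a minimal sketch of redaction using regular expressions. It is not how ThirdAI, AWS Comprehend, or Azure PII work internally; the patterns, the `PII_PATTERNS` table, and the `redact` function are hypothetical names chosen for illustration, and they only cover a few rigidly formatted identifiers.

```python
import re

# Illustrative patterns only (an assumption for this sketch). Real PII detection
# needs far broader coverage, such as names and addresses, which regexes
# cannot reliably capture.
PII_PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}


def redact(text: str) -> str:
    """Replace every match of each pattern with a [LABEL] placeholder."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


sample = "Call 555-123-4567 or mail jane.doe@example.com, SSN 123-45-6789."
print(redact(sample))
```

Pattern matching like this misses free-form entities such as names and street addresses, which is why production services rely on trained models rather than regexes alone.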

Comparison of Solutions on the Market

The following table compares the performance of ThirdAI’s PII Service against AWS Comprehend and Azure PII. To stress test these systems, our test bed consists of 5GB of raw text files stored in an S3 bucket, containing 40 million text chunks and 830 million tokens, with each token averaging 5 characters. We tabulate the end-to-end throughput of sanitizing the entire raw corpus (including read and write time), along with the associated costs of these services.
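The test-bed figures above can be sanity-checked with a little arithmetic. The sketch below assumes decimal gigabytes and one byte per character, neither of which is stated in the post; the variable names are ours.

```python
# Figures quoted for the test bed.
corpus_bytes = 5 * 10**9       # 5 GB (assumption: decimal gigabytes)
num_chunks = 40_000_000        # 40 million text chunks
num_tokens = 830_000_000       # 830 million tokens
chars_per_token = 5            # average token length

tokens_per_chunk = num_tokens / num_chunks      # average tokens in one chunk
bytes_per_chunk = corpus_bytes / num_chunks     # average chunk size in bytes
token_chars = num_tokens * chars_per_token      # characters inside tokens
token_share = token_chars / corpus_bytes        # assumption: 1 byte per character

print(f"{tokens_per_chunk:.2f} tokens per chunk")
print(f"{bytes_per_chunk:.0f} bytes per chunk")
print(f"{token_share:.0%} of the corpus is token characters")
```

Under these assumptions the chunks are short, about 21 tokens or 125 bytes each, and token characters account for roughly 83% of the corpus; the rest would plausibly be whitespace and separators.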

For AWS Comprehend, we submitted a PII detection job directly through the UI and recorded the completion time. For Azure PII, which lacks a comparable UI option, we used the Python API, submitting requests of at most 5000 characters each, and recorded the completion times. This comparison helps us evaluate the efficiency and cost-effectiveness of ThirdAI’s PII Service against the other offerings.
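A per-request character limit means the corpus has to be cut into pieces before it is sent. The sketch below shows one way to do that while preferring to break at whitespace, so that a name or number is less likely to be split across two requests. The function `split_for_requests` is a hypothetical helper, not part of any Azure SDK, and it is not necessarily how the benchmark prepared its requests.

```python
def split_for_requests(text: str, limit: int = 5000) -> list[str]:
    """Split text into pieces of at most `limit` characters,
    preferring to break at whitespace so tokens are not cut in half."""
    pieces = []
    start = 0
    while start < len(text):
        end = min(start + limit, len(text))
        if end < len(text):
            # Look for the last space or newline inside the current window.
            cut = max(text.rfind(" ", start, end), text.rfind("\n", start, end))
            if cut > start:
                end = cut + 1  # keep the whitespace with the left-hand piece
        pieces.append(text[start:end])
        start = end
    return pieces


document = "word " * 3000  # 15,000 characters of dummy text
pieces = split_for_requests(document)
print(len(pieces), max(len(p) for p in pieces))
```

If each request carried exactly one 5000-character piece, a 5 GB corpus (assuming decimal gigabytes and one byte per character) would need at least 5,000,000,000 / 5000 = 1,000,000 requests, so per-request overhead can dominate the end-to-end time.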

Cost and Throughput. ThirdAI scales horizontally with the number of machines, allowing you to achieve faster processing times without increasing costs.

Conclusion

The table clearly demonstrates that ThirdAI is significantly cheaper and faster, by several orders of magnitude. It enables real-time text processing at scale, making it a viable solution for large-scale and high-speed data processing applications. In contrast, the latency of existing solutions is prohibitive for such demanding tasks.

For more information, contact us at https://www.thirdai.com/contact/

Important Links

Here is the link to the bash script to try out the service.
