Automated PII Catalog powered by DataHub for real-time sensitive data observability

And how it helps Privacy Teams, Data Engineers, and Data Owners solve compliance gaps

Hemant Kumar
Borneo
7 min read · Feb 17, 2022


Achieve data governance at scale with Borneo and Acryl Data (DataHub)’s joint “Automated Governance Catalog”. It is a DataOps-led governance solution for fast-moving data teams that combines real-time privacy observability with metadata-orchestrated business workflows.

To achieve Governance at scale in the modern data stack, there are a few fundamental prerequisites:

  • Understanding what data lives where along with lineage information about how the data got there
  • Semantic categorization of data based on Business terminology and classification based on Governance/Compliance requirements
  • Automated data management using semantic labels with appropriate human-in-the-loop workflows

Given the scale and diversity of the modern data stack, categorizing data through a manual process of tagging certain datasets does not scale. Data pipelines are constantly evolving, replicating and morphing data into different formats and locations. A manual, tedious process of trying to keep up with all the changes simply burdens the team and creates a huge risk of major datasets going out of compliance.

A quick data catalog primer

A modern data catalog builds a complete data graph by collecting technical, business and operational metadata across many data sources, including APIs, datasets, pipelines, dashboards, AI models, and features. Acryl Data, powered by the open-source project DataHub, enables three main use cases:

  1. Data discovery
  2. Automated data governance
  3. Data Observability

How is DataHub used by Privacy Engineers?

  1. Understand the privacy structure and remain compliant with the ever-changing data privacy regulations.
  2. Facilitate data discovery, understanding and reuse through features like business glossary.
  3. Support data sharing and collaboration.

What is the missing piece?

The real work begins after all those thousands of resources have been discovered by the Data Catalog system in your account. The process of manually inspecting the dataset’s metadata and tagging it with standardized terms from a business glossary is not only inefficient but can also be inaccurate. Automation is key here with the appropriate human-in-the-loop workflows for approvals.

Current Process and Problems

  • The current process relies on someone manually looking at the metadata and tagging it with a term defined in the catalog system.
  • In many cases, this takes many hundreds of hours of effort in a mid- to large-sized company.
  • Looking at just the metadata is NOT enough. The field name could be as generic as event_data and it might contain millions of email addresses or even credit card numbers (Developers love logging stuff, don’t they?).
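
To make the last point concrete, here is a minimal Python sketch that contrasts a name-based check with a content-based one. The regular expressions, the `NAME_HINTS` list, and the helper names are illustrative assumptions. They are not Borneo's actual detectors, which this post describes only as machine-learning based.

```python
import re

# Hypothetical, deliberately simple detectors; real classifiers are more robust.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
CARD_RE = re.compile(r"\b(?:\d[ -]?){13,16}\b")

# Hypothetical substrings a metadata-only tagger might look for in field names.
NAME_HINTS = ("email", "card", "ssn", "phone")


def luhn_ok(digits: str) -> bool:
    """Luhn checksum, used to cut down false positives on card-like numbers."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def flagged_by_name(field_name: str) -> bool:
    """Metadata-only check: looks at the field name and nothing else."""
    return any(hint in field_name.lower() for hint in NAME_HINTS)


def detect_in_values(values):
    """Content check: returns the set of info-types seen in the sampled values."""
    found = set()
    for value in values:
        if EMAIL_RE.search(value):
            found.add("EMAIL")
        for match in CARD_RE.finditer(value):
            digits = re.sub(r"\D", "", match.group())
            if 13 <= len(digits) <= 16 and luhn_ok(digits):
                found.add("CREDIT_CARD")
    return found


sample = [
    "login ok for jane.doe@example.com",
    "payment attempted with 4111 1111 1111 1111",
]
print(flagged_by_name("event_data"))      # the name gives nothing away
print(sorted(detect_in_values(sample)))   # the values do
```

The name check passes `event_data` as harmless, while the content check reports both an email address and a card number in the same field.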

What is Borneo’s Data Discovery?

In simple words, Borneo looks at the underlying data in the datasets and detects all the sensitive info-types (or even custom ones specific to your company). Borneo automatically inspects the actual data present in RDS, Presto tables, Redshift, S3 buckets and many other sources, using Machine Learning and some voodoo magic to identify the info-types present in those resources.

Using these results, you can quickly get a full picture of your data and its types. Is there a public S3 bucket containing Credit Card Numbers or SSNs? That is a huge red flag, and Borneo will tell you about it. But that’s more of a side quest for the scope of this post; you can pursue it here.

Detecting Sensitive Data Across multiple SaaS and Cloud resources

Now, what is the Automated Governance Catalog?

Right now, we have these two separate services:

  • Borneo: Scans for sensitive data present in your cloud resources.
  • Acryl Data: Scans your cloud infrastructure’s metadata and makes it available in a catalog that you can manually tag with terms.

Suppose these two services interoperated seamlessly, providing you with a way to automatically tag every resource in the catalog with the sensitive data found in it, using business glossary terms defined in the Data Catalog. Further, humans should be able to spot-check and approve the automated proposals.

That’s what we did.

Whatever Borneo finds in its real-time and scheduled data scans will now be automatically pushed to Acryl, saving you hundreds of hours of manual work.
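
As a rough sketch of this hand-off, the snippet below turns scan findings into glossary-term proposals that wait for human approval. The `Finding` shape, the `TERM_FOR_INFO_TYPE` mapping, the glossary term names, and the `PENDING_APPROVAL` status are all assumptions made for illustration, since this post does not describe the integration's actual payloads or API. Only the URN strings follow the layout used by open-source DataHub for datasets and glossary terms.

```python
from dataclasses import dataclass

# Hypothetical mapping from a detected info-type to a business glossary term.
TERM_FOR_INFO_TYPE = {
    "EMAIL": "Privacy.EmailAddress",
    "CREDIT_CARD": "Privacy.CreditCardNumber",
}


@dataclass(frozen=True)
class Finding:
    """One scanner result: an info-type seen in a field of a table (assumed shape)."""
    platform: str
    table: str
    field: str
    info_type: str


def dataset_urn(platform: str, name: str, env: str = "PROD") -> str:
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


def term_urn(term: str) -> str:
    return f"urn:li:glossaryTerm:{term}"


def to_proposals(findings):
    """Convert findings into de-duplicated term proposals awaiting review."""
    proposals = []
    seen = set()
    for f in findings:
        term = TERM_FOR_INFO_TYPE.get(f.info_type)
        if term is None:
            continue  # no glossary term defined for this info-type
        key = (f.platform, f.table, f.field, term)
        if key in seen:
            continue  # the same finding can recur across scans
        seen.add(key)
        proposals.append({
            "dataset": dataset_urn(f.platform, f.table),
            "field": f.field,
            "term": term_urn(term),
            "status": "PENDING_APPROVAL",  # hypothetical human-in-the-loop state
        })
    return proposals


findings = [
    Finding("hive", "logging_events", "event_data", "EMAIL"),
    Finding("hive", "logging_events", "event_data", "EMAIL"),
    Finding("hive", "logging_events", "event_data", "CREDIT_CARD"),
]
for proposal in to_proposals(findings):
    print(proposal)
```

Keeping the proposals as plain records with an explicit status leaves room for the spot-check and approval step before any term is attached.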

Benefits of a Smart Catalog

A catalog based on metadata inspection alone will result in missing 33% of sensitive information

  • You don’t have to rely on the metadata to identify what kind of data is present in the system.
  • It’s all automated and happens in real time; no manual labor is required to go through vague field names and tag the resources with the correct terms. We do that even when you sleep.
  • Reduced scope for error: a manual process is tedious, and it’s easy to miss unstructured fields such as a column of type Raw Text or JSON.
  • No ongoing work is required for resources newly added to the system; they’re automatically detected by both Acryl and Borneo and are tagged as soon as anything is detected.

Sensitive privacy-related info-types detected by Borneo

What differentiates us from others?

Open ecosystem and DataHub community-led product development

  • Having complete control over your metadata by avoiding vendor lock-in is important. It will always be possible to extract core metadata from the Acryl offering into open-source DataHub.
  • Data Discovery, Data Governance, Data Observability are not solved by software alone. The large DataHub community is generating best practices about how to improve ownership, how to achieve compliance outcomes etc. and the product is continuously evolving to reflect the learnings of a large community of data practitioners.

Compliance monitoring and active data management

  • Acryl DataHub allows defining compliance constraints at the dataset or column level (e.g., mandatory presence of glossary terms from a certain compliance taxonomy). Whether the constraints are met gives you a simple test of whether a dataset is in compliance.
  • Human-in-the-loop approval workflows make it easy to plug in ML-based classifiers with sufficient safety.
  • Metadata analytics allow you to monitor datasets that are out of compliance at the domain, platform, or team level.
  • Automated actions can be triggered in response to key events (e.g., a PII term was attached to a column, or a schema changed) to perform activities like retention, GDPR deletion etc.
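
The constraint and event ideas in this list can be pictured with a small sketch. Everything below is hypothetical: the `Classification.` and `Privacy.` taxonomy prefixes, the event dictionary shape, and the handler registry are invented for illustration and are not Acryl DataHub's actual constraint or actions API.

```python
from typing import Callable, Dict, List

COMPLIANCE_PREFIX = "Classification."  # hypothetical compliance taxonomy root


def columns_missing_term(columns: Dict[str, List[str]],
                         prefix: str = COMPLIANCE_PREFIX) -> List[str]:
    """Return the columns that carry no glossary term from the taxonomy."""
    return sorted(
        name for name, terms in columns.items()
        if not any(term.startswith(prefix) for term in terms)
    )


def is_compliant(columns: Dict[str, List[str]]) -> bool:
    """The constraint is met when every column has at least one taxonomy term."""
    return not columns_missing_term(columns)


# Minimal event dispatch: handlers register for an event type by name.
HANDLERS: Dict[str, List[Callable[[dict], str]]] = {}


def on(event_type: str):
    def register(fn):
        HANDLERS.setdefault(event_type, []).append(fn)
        return fn
    return register


@on("TERM_ATTACHED")
def queue_retention_review(event: dict) -> str:
    # Hypothetical action: react only to terms from a privacy glossary.
    if event["term"].startswith("Privacy."):
        return f"retention review queued for {event['dataset']}.{event['column']}"
    return "no action"


def dispatch(event: dict) -> List[str]:
    return [handler(event) for handler in HANDLERS.get(event["type"], [])]


columns = {
    "event_id": ["Classification.Internal"],
    "event_data": [],
}
print(columns_missing_term(columns))
print(is_compliant(columns))
print(dispatch({
    "type": "TERM_ATTACHED",
    "dataset": "logging_events",
    "column": "event_data",
    "term": "Privacy.EmailAddress",
}))
```

The same pass/fail check can be aggregated per domain, platform, or team to produce the kind of out-of-compliance monitoring the third bullet describes.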

Proven platform capabilities of DataHub architecture

Want to see it in Action?

Let’s take this Hive table and see what it looks like in the catalog without the Borneo integration.

A table containing log events on Acryl Dashboard prior to Borneo Integration

Notice that there are no terms right now in the rightmost column, and you’d have to manually tag this table with the correct terms by looking at the field names.

What tags would you choose for the event_data field? Looks like a generic field that could contain a JSON object, which could be anything, correct? This scenario occurs more and more when you’re dealing with Big Data and have multiple streams pouring into a single dataset.

Let’s take a look at the data in this table.

The raw data in the table logging_events

Notice something? In the JSON value of the field there’s a nested object called user containing really sensitive information. This nested object might not be present on all the rows of this table and a person checking this data manually could easily miss it, but Borneo won’t.
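
A minimal sketch of why row-by-row content inspection catches this case: the function below parses each row's JSON, walks every nested value, and records the path of each string that looks like an email address. The sample rows, the layout of the `user` object, and the single regex detector are invented for illustration. Borneo's actual scanning logic is not described in this post.

```python
import json
import re

# Hypothetical, simplified email detector.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def walk(node, path=""):
    """Yield (path, value) for every string leaf in a parsed JSON value."""
    if isinstance(node, dict):
        for key, child in node.items():
            yield from walk(child, f"{path}.{key}" if path else key)
    elif isinstance(node, list):
        for child in node:
            yield from walk(child, f"{path}[]")
    elif isinstance(node, str):
        yield path, node


def sensitive_paths(rows):
    """Count, per JSON path, how many string values look like email addresses."""
    hits = {}
    for raw in rows:
        try:
            doc = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip rows that are not valid JSON
        for path, value in walk(doc):
            if EMAIL_RE.search(value):
                hits[path] = hits.get(path, 0) + 1
    return hits


# Only one of the three rows carries the nested user object.
rows = [
    '{"action": "page_view", "page": "/home"}',
    '{"action": "signup", "user": {"name": "Jane", "email": "jane@example.com"}}',
    '{"action": "page_view", "page": "/pricing"}',
]
print(sensitive_paths(rows))  # {'user.email': 1}
```

A reviewer who samples only the first row would see nothing sensitive, while a scan over all rows surfaces the `user.email` path.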

Now let’s take a look at how the catalog looks after the integration with Borneo.

The table automatically tagged with relevant terms after integrating Borneo

As is evident from this image, Borneo has discovered the correct terms for this table and tagged each field with its corresponding terms automatically.

Borneo and Acryl Data’s Smart Data Catalog

Ready for a test drive? Reach out to us for a demo!
