Gaining insights relevant to the governance of unstructured data

Published in IBM Data Science in Practice, May 25, 2021

Are you a data steward in your enterprise who needs better insights into your unstructured data? Do you need analytics relevant to the governance of your unstructured data?

All of this is available with Watson Knowledge Catalog in IBM Cloud Pak for Data as a Service. In this blog post, I use an example to show how the newly available analytics capability works and how it can benefit you.

Watson Knowledge Catalog (WKC) is well-known for profiling and curating structured data. Profiling is the enrichment of data with metadata, which helps data stewards gain more insight into and a better understanding of the data. Well-curated data is also much easier to find and use wherever it is needed. Profiling for governance purposes is now also available in WKC for unstructured data.

Think about this use case: a data steward has significant documents that they want to share with the business users by using a data catalog. Before they can do that, they have to ensure that such content is free of any Personally Identifiable Information (PII). The content they want to share can be reference reports, annual reports, or client reference descriptions.

In the past, a lot of manual effort was required to identify and read through the documents to check for PII. With the new capabilities in Watson Knowledge Catalog, such an analysis runs out of the box and only the results need reviewing.

We call this process profiling. Profiling results are available for each individual document, and the capability comes out of the box with more than twenty so-called data classes for governance. In WKC, a data class defines the logic or expression used to identify and classify the type of data in a document. The predefined data classes for unstructured data focus on PII detection in many languages, covering all EU countries as well as Great Britain, Switzerland, the USA, China, and Japan.
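To make the idea of a data class as "logic or expression" more concrete, here is a minimal Python sketch. It is not how WKC implements its data classes: the `DataClass` type, the `classify` function, and the two regular expressions are hypothetical and only illustrate the general concept of pairing a class name with a detection expression.

```python
import re
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class DataClass:
    """Hypothetical data class: a name plus the expression that detects it."""
    name: str
    pattern: "re.Pattern[str]"


# Illustrative patterns only; real PII detection is considerably more involved.
DATA_CLASSES = [
    DataClass("Email Address",
              re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")),
    DataClass("US Social Security Number",
              re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
]


def classify(text: str) -> Dict[str, List[str]]:
    """Return, per data class, the values in the text that matched it."""
    hits: Dict[str, List[str]] = {}
    for data_class in DATA_CLASSES:
        values = data_class.pattern.findall(text)
        if values:
            hits[data_class.name] = values
    return hits
```

Calling `classify` on a document's text returns only the data classes that matched, together with the matching values. An empty result means none of these illustrative expressions found anything.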

To leverage these new profiling capabilities, the data steward imports documents as data assets into a project in Watson Knowledge Catalog. In the project, those data assets are automatically profiled, and the data steward reviews the results per data asset.

Screenshot: Profiling result review screen, showing the data classes present, how often each was assigned, and the frequency distribution of the values that contributed to the classification.

The data steward sees the profiling results, including which data classes are present, how often each was assigned, and the frequency distribution of the values that contributed to each classification. With this information, they can decide whether the document is free of PII or contains information that must not be shared.
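The statistics described above can be pictured with a short Python sketch. It assumes the detections are already available as `(data class, value)` pairs. The functions `summarize` and `is_shareable` and the idea of a set of blocked classes are hypothetical illustrations, not part of the WKC product or its API.

```python
from collections import Counter
from typing import Dict, Iterable, Set, Tuple


def summarize(matches: Iterable[Tuple[str, str]]):
    """Count assignments per data class and build a value frequency
    distribution for each class from (data_class, value) pairs."""
    class_counts: Counter = Counter()
    value_counts: Dict[str, Counter] = {}
    for name, value in matches:
        class_counts[name] += 1
        value_counts.setdefault(name, Counter())[value] += 1
    return class_counts, value_counts


def is_shareable(class_counts: Counter, blocked_classes: Set[str]) -> bool:
    """A document is shareable only if no blocked data class was assigned."""
    return not any(class_counts[name] > 0 for name in blocked_classes)
```

In this sketch the final decision is reduced to a simple rule. In practice, the data steward reviews the results and makes that call.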

The data steward can then publish any eligible document to the data catalog to make it available to business users.

Screenshot: Publishing a PII-free document to the data catalog.

Documents in the data catalog can be enhanced with related business terms. Business terms are another set of metadata assigned to the document as part of document curation. This additional metadata helps business users find relevant data by using business language. Business users might enhance a document further with ratings, comments, or a classification to help others easily find relevant content.

Screenshot: Curating a published document in the data catalog with additional metadata such as business terms or comments.

When data stewards follow this procedure and use these capabilities in WKC, they can keep their minds on the details of data governance while still ensuring that only PII-free content is added to the data catalog. They can focus on business value and curation rather than on time-consuming tools or manual content classification, which carries a high likelihood of missing PII and potentially harming the enterprise.

Get started here: https://www.ibm.com/cloud/watson-knowledge-catalog

Find additional information and details about profiling in the product documentation:
https://dataplatform.cloud.ibm.com/docs/content/wsj/getting-started/profile.html

Written by Michael Baessler