Trusted Insights from Data Using Data Governance

Sanjit Chakraborty
Published in Cloud Pak for Data
May 15, 2020

Data governance is often equated with controlling and protecting data, but it is also about enablement and externalizing insights from data. In today's world, data governance is mandatory for a fast-growing, highly competitive enterprise. Data can improve performance, create value, enhance competitiveness, and cut costs. As organizations rapidly capture massive amounts of data, they need a mechanism to maximize its value, control the risks, and reduce the cost of managing it. The concept of governing data derives from this perspective: establishing the processes and responsibilities that ensure the quality and security of the data used across the organization.

The definition of data governance is still evolving. One can describe this discipline as a facilitator that lets managers take control of all aspects of their data resources. Data governance refers to the overall management of the availability, usability, integrity, and security of the data used in an organization. You can think of it as a collection of processes, roles, policies, standards, and metrics that ensure the effective and efficient use of data.

Establishing the fundamentals of effective data governance depends on knowing where data can be found, who can access it, and how it is being used. IBM Cloud Pak for Data (CPD) can help you implement the processes, roles, policies, standards, and metrics associated with data governance and achieve overall management of data across the organization.

In this blog I will introduce a tutorial on IBM Watson Knowledge Catalog within CPD that lets you explore different data governance features.

Knowledge Catalog

Watson Knowledge Catalog provides a secure enterprise catalog management platform that is supported by a data governance framework. A catalog connects data and knowledge with the people who need to use it. The data governance framework ensures that data access and data quality are compliant with your business rules and standards. It helps your data users quickly find, curate, categorize and share data, analytical models and their relationships with other members of your organization. It serves as a single source of truth for data engineers, data stewards, data scientists and business analysts to shop for data they can trust, accelerating the implementation and value of DataOps for your organization. With active policy management, it helps your organization protect and govern data, so it’s ready for AI at scale.

Discover and Catalog Data Assets using Auto Discovery

When you add or update a connection in a catalog or project, you can discover assets from the connection. All user tables and views accessible from the connection are added as data assets to the project that you select. From the project, you can evaluate each data asset and publish the ones you want to the catalog. The discovery operation can be run manually any number of times, or it can be triggered automatically. You can discover and catalog data assets that are structured as well as unstructured. When you add discovered assets to a catalog, each asset is automatically assigned tags, and data assets can be published to the knowledge catalog.
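The evaluate-and-publish step can also be scripted. The Python sketch below is a minimal illustration against the platform's REST interface using the requests library; the host name, endpoint paths, and payload fields are assumptions for illustration and should be verified against the Watson Data API documentation for your CPD version.

import requests

CPD_HOST = "https://cpd.example.com"  # assumption: your CPD cluster URL
HEADERS = {
    "Authorization": "Bearer <token>",  # bearer token from your CPD login
    "Content-Type": "application/json",
}
PROJECT_ID = "<project-id>"  # project that received the discovered assets
CATALOG_ID = "<catalog-id>"  # governed catalog to publish into

# Illustrative: search the project for discovered data assets
resp = requests.post(
    f"{CPD_HOST}/v2/asset_types/data_asset/search",
    params={"project_id": PROJECT_ID},
    headers=HEADERS,
    json={"query": "*:*"},
)
resp.raise_for_status()

# Illustrative: publish each evaluated asset to the catalog
for asset in resp.json().get("results", []):
    asset_id = asset["metadata"]["asset_id"]
    requests.post(
        f"{CPD_HOST}/v2/assets/{asset_id}/publish",
        params={"project_id": PROJECT_ID},
        headers=HEADERS,
        json={"catalog_id": CATALOG_ID},
    ).raise_for_status()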

Discover Assets using Quick Scan

A quick scan can be a great help when you don't know your data very well and want to analyze large datasets to get a general overview of their quality. Quick scan performs three operations: column analysis, data quality analysis, and automatic term assignment. The column analysis examines the properties and characteristics of the columns in the dataset and finds matching data classifications. The data quality analysis identifies common data quality problems and computes a data quality score for datasets and columns. Terms are assigned to discovered assets based on name similarity and data classification. Optionally, term assignment can be managed by a machine learning model for more accurate results.

When you run a quick scan, only a sample of the dataset is analyzed, and assets aren't added to the default catalog. By default, the sample size is 1,000 records, but this is configurable. You can edit the discovered term assignments when reviewing the scan results. To add the discovered datasets to the default catalog, you must approve them. They are then loaded into the workspace, where you can run further analysis, edit the results, and publish the analysis results to the default catalog.
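To make the idea of a column-level quality score concrete, here is a small, self-contained Python sketch. It is not the quick scan implementation; it simply scores each column of a 1,000-record sample by its share of present, non-blank values, and the file name is made up for illustration.

import pandas as pd

def column_quality_scores(df: pd.DataFrame) -> pd.Series:
    """Toy quality score: fraction of values that are present and non-blank."""
    non_null = df.notna()
    non_blank = df.astype(str).apply(lambda col: col.str.strip() != "")
    return (non_null & non_blank).mean()

# Analyze only a sample, mirroring quick scan's default of 1,000 records
sample = pd.read_csv("customers.csv").head(1000)
scores = column_quality_scores(sample)
print(scores.sort_values())              # lowest-quality columns first
print("dataset score:", scores.mean())   # one aggregate score for the dataset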

Implement Business Glossary

Cloud Pak for Data enables you to structure your enterprise information in a logical way, discover relationships between assets, and keep your data up to date. You can import an existing glossary with categories, terms, information governance policies, and rules. You can maintain glossary assets outside of the catalog and import them from a CSV file. This file can be generated from another software application, such as a spreadsheet program, or you can import a CSV file that you originally exported from IBM InfoSphere Information Governance Catalog.
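The exact CSV columns depend on your CPD version and the asset type being imported, but a glossary terms file generally looks something like the one produced by this Python sketch. The column names and sample terms here are illustrative assumptions, not the canonical import schema.

import csv

# Illustrative glossary terms; column names are assumptions, not the canonical schema
rows = [
    {"Name": "Customer ID", "Category": "Customer Data",
     "Description": "Unique identifier assigned to each customer", "Status": "Published"},
    {"Name": "Annual Revenue", "Category": "Finance",
     "Description": "Total revenue recognized in a fiscal year", "Status": "Draft"},
]

with open("glossary_terms.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["Name", "Category", "Description", "Status"])
    writer.writeheader()
    writer.writerows(rows)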

Discovering assets adds data to the default catalog. During discovery, the data is imported, analyzed, and classified. Earlier you ran a quick scan; in this task you will re-run asset discovery so the data can be imported, analyzed, and classified according to the glossary you imported or created earlier.

Data Asset Sentiments

As data assets are cataloged, they are automatically profiled and classified so data consumers can better understand their content. The assets can then be enriched using Knowledge Catalog's social capabilities, such as ratings and reviews.

Shop Data

Leverage Knowledge Catalog's intelligent Shop for Data experience, an AI-powered search-and-suggest capability that guides you to the most relevant assets in the catalog based on its understanding of the relationships between assets, the usage of those assets, and the social connections between the users of those assets.

You will also use the Filter section of the Knowledge Catalog, which is automatically built and organized by asset type and tag as you catalog assets. Tagging is essential when cataloging assets; it makes it much easier for consumers to search for and find what they are looking for.

You can easily search for data using the suggestion categories to find relevant data. These categories are automatically populated by Knowledge Catalog as you catalog, curate, and enrich data assets. For example, Highly Rated displays previously reviewed assets. The sketch below illustrates the idea behind this kind of filtering.
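This small Python sketch is a conceptual illustration of the tag, asset-type, and rating filtering described above; the asset records are made up and do not reflect a real Knowledge Catalog payload.

# Conceptual sketch of tag/type/rating filtering; records are illustrative only
assets = [
    {"name": "CUSTOMERS", "type": "data_asset", "tags": ["customer", "pii"], "rating": 4.5},
    {"name": "churn_model", "type": "model", "tags": ["customer", "ml"], "rating": 4.8},
    {"name": "SALES_2019", "type": "data_asset", "tags": ["finance"], "rating": 3.9},
]

def shop(assets, asset_type=None, tag=None, min_rating=None):
    """Narrow results the way the catalog's Filter panel does."""
    results = assets
    if asset_type:
        results = [a for a in results if a["type"] == asset_type]
    if tag:
        results = [a for a in results if tag in a["tags"]]
    if min_rating:
        results = [a for a in results if a["rating"] >= min_rating]
    return results

print(shop(assets, asset_type="data_asset", tag="customer"))  # type + tag filter
print(shop(assets, min_rating=4.0))                           # "Highly Rated" style view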

Refine Data

Data cleaning, or shaping, is a key requirement of successful data analysis, and it generally takes a significant share of the time in the data analysis process. Data scientists often have to wait on data engineers during this preparation period. Data cleaning detects and removes errors and inconsistencies from data in order to improve its quality, while operations such as filtering, sorting, combining, and removing columns give the data the shape needed for analysis. Data cleaning therefore plays a major role in decision making and data analysis.

The Data Refinery tool in CPD aims to reduce the pain associated with creating good quality data. The tool has an intuitive user interface and templates backed by powerful operations for shaping and cleaning data. It's a self-service data cleaning and shaping tool that helps data scientists and business analysts prepare data on the fly and use it for analysis and modeling. Data Refinery also provides metrics and data visualizations that aid in every step of the process.
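Data Refinery applies its operations through the UI, but the transformations map closely to familiar dataframe operations. The pandas sketch below is an analogy to those steps, not Data Refinery itself; the file and column names are made up for illustration.

import pandas as pd

# Illustrative cleanup mirroring typical shape-and-clean steps
df = pd.read_csv("raw_sales.csv")

df = df.drop(columns=["internal_notes"])             # remove columns not needed for analysis
df = df[df["amount"] > 0]                            # filter out invalid rows
df["region"] = df["region"].str.strip().str.title()  # fix inconsistent text values
df = df.drop_duplicates(subset=["order_id"])         # drop duplicate records
df = df.sort_values("order_date")                    # sort for downstream analysis

# Combine with a lookup table, then save the refined output
regions = pd.read_csv("region_lookup.csv")
df = df.merge(regions, on="region", how="left")
df.to_csv("sales_refined.csv", index=False)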

One can use the following materials to quickly get started with Watson Knowledge Catalog within Cloud Pak for Data:

Tutorial: Use of Data Governance

This tutorial will give you a jump start on managing data governance within CPD. There are more features in Watson Knowledge Catalog that can help you with data control, enablement, and externalizing insights from data.


Sanjit Chakraborty

Sanjit enjoys building solutions that incorporate business intelligence, predictive and optimization components to solve complex real-world problems.