Illuminating Dark Data III — Automated Discovery

Namit Kabra
IBM Data Science in Practice
7 min read · Jun 2, 2021

This blog is written in collaboration with Albert Maier, STSM, Chief Architect, Information Governance and Gowreeswari Bongu, Test Specialist.

In the previous post, we discussed Quick Scan, which lets a user examine millions of columns of data in a shallow manner. Here are the use cases where a tool like Quick Scan fits:

1. An initial, one-time risk assessment of PII data for larger information system landscapes. For example, if you don’t know whether you have PII risks in your data sources, you can run Quick Scan to identify which columns appear to contain PII data.

2. Quick identification of interesting data containers. For example, if you have a large number of unknown tables in various schemas, you can run a Quick Scan analysis on each schema to get an overview of each table and column. Thus, you can:

a. Use the proposed data class and business term assignments to check which tables and schemas are relevant and suited to be shared with a broader group of people in a data catalog

b. Check whether certain tables are not relevant (they might be empty, contain only data that is not relevant for the business, etc.)

c. Use the quality score to determine whether the quality of the data seems too poor to share without further investigation and improvement

d. Look at the column metadata found for each table and analyze whether the metadata is correct and helpful for consumers

In this post, we will discuss the opposite end of the spectrum. For some use cases, there is a need for a very deep investigation of a more limited number of data elements that an enterprise would define as critical to their business. For any business, the number of critical data elements is much smaller than the large amount of data that a business would typically store. Automated Discovery offers the features needed for a deep analysis and investigation of these critical data elements within an enterprise.

Screenshot of a dashboard for online customer data showing time series over a seven day period and a list of columns on the side.
Figure 1: Illuminating Dark Data using Automated Discovery

Automated Discovery:
Automated Discovery provides detailed analysis results for all assets in the scope of an Automated Discovery run. This type of discovery is suitable for smaller numbers of tables and files, and it performs a deeper scan than Quick Scan. To use it, you select one or several schemas or individual data assets, then the analysis options, and finally a target data quality project that controls the depth of analysis. After you initiate the scan, Automated Discovery automatically imports all the selected schemas, tables, and columns directly into the data catalog. The corresponding data analysis results and proposed data class and business term assignments are pushed into the data quality project, where you can review and edit them. Curation changes can then be pushed back into the data catalog.

When running an Automated Discovery job, a typical scope is about 200 tables and 10,000 columns (assuming an average of 50 columns per table). This is much smaller than the typical Quick Scan scope of 20,000 tables and 1,000,000 columns. Given the in-depth analysis offered by Automated Discovery, it is reasonable to assume that most of the data in a large enterprise data system landscape with many petabytes of data does not need to be scanned to this depth. The “full analysis” offered by Automated Discovery includes the following:

• A data classification against hundreds of data classes that might comprise complex custom data classes, as well as checks against very large reference data sets

• A variety of term assignment services that compare data and its metadata against a business terminology that could consist of 100,000 business terms

• A data quality assessment that might have a very fine granularity, as well as checking tens of data quality dimensions and hundreds of data rules
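To make the data classification step concrete, here is a minimal sketch of pattern-based classification. The data classes and the 80% match threshold below are illustrative assumptions; the real classifiers are far richer and also check values against large reference data sets.

```python
import re

# Hypothetical, simplified data classes for illustration only.
DATA_CLASSES = {
    "US_SSN": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    "EMAIL": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "US_ZIP": re.compile(r"^\d{5}(-\d{4})?$"),
}

def classify_column(values, threshold=0.8):
    """Propose a data class when enough non-null values match its pattern."""
    non_null = [v for v in values if v]
    if not non_null:
        return None
    for name, pattern in DATA_CLASSES.items():
        hits = sum(1 for v in non_null if pattern.match(v))
        if hits / len(non_null) >= threshold:
            return name
    return None

print(classify_column(["123-45-6789", "987-65-4321", None]))  # -> US_SSN
```

The threshold allows a class to be proposed even when a column contains some nulls or outliers, which mirrors why proposed assignments still need human review in the data quality project.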

Automated Discovery is a resource-intensive process, and this can impact other people working on the system. A successful strategy for using Automated Discovery is to work iteratively with smaller scopes. Doing this also allows the machine learning (ML) in the system to gain insights from early results while you work with results that are already complete.

Additionally, before using Automated Discovery, it is highly recommended to do a quick investigation (using Quick Scan) to identify the data sets and columns that have critical business value, so that you can focus Automated Discovery on those.

Features of Automated Discovery:

· Automated Discovery provides detailed analysis which includes data quality score, suggested data classes and business terms, data types, formats, frequency distributions, detailed column analysis, and more.

· Any assets discovered by Automated Discovery are automatically persisted into the data catalog to make people aware of new data assets. At the same time, these assets are added to a data quality project. Analysis results from an Automated Discovery run are available only in the data quality project, where they can be reviewed and edited before being made available to catalog users via a “publish to catalog” operation.

· If you rerun Automated Discovery multiple times on the same data source, Automated Discovery checks for deltas on the technical metadata such as new tables, or added or dropped columns, synchronizes these changes into the data catalog, and can restrict re-analysis to changed assets.
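The delta detection on reruns can be illustrated by diffing two snapshots of technical metadata. The sketch below shows the general technique on a `{table: set(columns)}` snapshot, not the actual implementation:

```python
def schema_delta(previous, current):
    """Compare two {table: set(columns)} snapshots of technical metadata."""
    prev_tables, curr_tables = set(previous), set(current)
    delta = {
        "new_tables": sorted(curr_tables - prev_tables),
        "dropped_tables": sorted(prev_tables - curr_tables),
        "added_columns": {},
        "dropped_columns": {},
    }
    # For tables present in both snapshots, diff their column sets.
    for table in prev_tables & curr_tables:
        added = current[table] - previous[table]
        dropped = previous[table] - current[table]
        if added:
            delta["added_columns"][table] = sorted(added)
        if dropped:
            delta["dropped_columns"][table] = sorted(dropped)
    return delta

before = {"ACCOUNT_HOLDERS": {"CLIENT_ID", "ZIP"}}
after = {"ACCOUNT_HOLDERS": {"CLIENT_ID", "ZIP", "EMAIL"}, "LOANS": {"LOAN_ID"}}
print(schema_delta(before, after))
```

Restricting re-analysis to the assets named in such a delta is what keeps repeated runs cheap compared to a full rescan.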

· If published results change because of parallel activities in the system, e.g. if published term assignments are changed by another project, these changes are automatically considered in the curation process.

· You can run Automated Discovery for Db2, Db2 Warehouse on Cloud, HDFS, Hive, Microsoft SQL Server, MongoDB, Oracle, PostgreSQL, and Teradata data sources connected through a JDBC connector. In addition, you can run it for Amazon S3, Greenplum, Netezza, and Snowflake data sources that you connect to through connections created via metadata import (if metadata import is enabled).

Using Automated Discovery

Here is how you trigger an Automated Discovery job:

screen shot of the start screen for an automated discovery job with the following fields: connection, discovery root, and project, followed by a list of discovery options with checkboxes. On this list is analyze columns, analyze data quality, assign terms, publish results to catalog, use data sampling. Use data sampling asks for a size of records to include and the type of method to be used for sampling.
Figure 2: Triggering Automated discovery job
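The options shown in Figure 2 map naturally onto a simple job definition. The payload below is a hypothetical sketch of such a definition; the field names and values are illustrative assumptions, not the actual WKC API schema:

```python
# Hypothetical job definition mirroring the fields in Figure 2.
# All names here are illustrative, not a real WKC API payload.
discovery_job = {
    "connection": "BANK_DB2",        # assumed connection name
    "discovery_root": "/BANK1",      # schema or folder to scan
    "project": "Bank Data Quality",  # target data quality project
    "options": {
        "analyze_columns": True,
        "analyze_data_quality": True,
        "assign_terms": True,
        "publish_results_to_catalog": False,  # review in the project first
        "use_data_sampling": True,
        # Sampling settings only matter when use_data_sampling is True.
        "sample": {"size": 1000, "method": "random"},
    },
}

print(sorted(discovery_job["options"]))
```

Leaving `publish_results_to_catalog` off reflects the workflow described earlier: results land in the data quality project for curation before being published.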

On triggering, the Automated Discovery job enters a running state. The screenshot below shows that the metadata import has finished for BANK1 and is still running for BANK2.

screenshot showing progress of an Automated Discovery job. In the list of assets, it shows one has been imported and the second asset import is in progress.
Figure 3: Metadata import for selected schema

After import, the analysis starts as shown below:

a screenshot showing the running analysis status on a list of assets in an automated discovery job
Figure 4: Running analysis on the schema

When the analysis is complete, the job shows a finished state.

a screenshot showing the finished state of analysis on a list of assets in an automated discovery job
Figure 5: Analysis finished in seven minutes

Once the job is done, you can examine the results in more detail to see which schemas and assets are present. Let’s click on BANK3 to see the summary results at the schema level.

screenshot showing results of an automated discovery job in a schema titled BANK3. In three tables, Account_holders, checking_accounts, and savings_accounts, it looked at data quality, found terms, gives the date it was last analyzed and gives an option of downloading the asset.
Figure 6: Schema level summary

You can click on a dataset name to get a summary of that dataset.

screenshot of the summary of the dataset after an automated discovery job, showing published terms with assigned terms and rejected terms, and changes to be published with new assigned terms, suggested terms, new rejected terms and published terms to be rejected.
Figure 7: Summary of a dataset

Additionally, you can view the dataset-level details, edit the terms and data classes, and even publish the dataset from here.

a screenshot of a list of details on a dataset after an automated discovery job.
Figure 8: Details of dataset

You can view the project-level details by clicking on Project details, shown in Figure 5:

Figure 9: Project level details

The image below shows a dashboard of a project after an Automated Discovery job has been run on a dataset.

a screenshot of a dashboard on a project after an automated discovery job has been run, showing recent analytics such as data quality threshold, built run status, data quality score distributions, and more.
Figure 10: Dashboard

Even within these analytics, further drill-downs are available. Here, you can see the rules that flagged quality violations on the data set.

a screenshot of data quality dimension results on a column — showing rule violations, suspect values, values out of range, and inconsistencies.
Figure 11: Column level details

Drilling down even further, you can see which values violated the rules.

screenshot of a list of which data rows were in violation of rules found with automated discovery with the fields nor_years_cli, zip, account_type, account_id, marital_status, online_access, and more.
Figure 12: Rule Violations
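Conceptually, a rule-violation report like the one in Figure 12 is produced by evaluating each rule against every row. The sketch below is a simplified illustration, with hypothetical rules loosely modelled on the columns named in that screenshot:

```python
# Hypothetical data rules, loosely modelled on the columns in Figure 12.
RULES = {
    "zip": lambda v: v is not None and str(v).isdigit() and len(str(v)) == 5,
    "account_type": lambda v: v in {"CHECKING", "SAVINGS"},
    "online_access": lambda v: v in {"Y", "N"},
}

def find_violations(rows):
    """Return (row_index, column, value) for every rule a row breaks."""
    violations = []
    for i, row in enumerate(rows):
        for column, rule in RULES.items():
            if column in row and not rule(row[column]):
                violations.append((i, column, row[column]))
    return violations

rows = [
    {"zip": "10001", "account_type": "CHECKING", "online_access": "Y"},
    {"zip": "1000", "account_type": "BROKERAGE", "online_access": "Y"},
]
print(find_violations(rows))  # two violations, both in the second row
```

Reporting the row, column, and offending value together is what makes a drill-down like Figure 12 actionable for remediation.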

You can see which rules were defined for use in this iteration of Automated Discovery.

screenshot of rules which were used in an automated discovery job for a column titled client_id
Figure 13: Rules defined

You can also see which Data Classes were assigned and Data Types discovered.

a screenshot of results from an automated discovery job for a column called client_id, which found no data classes and Boolean data type.
Figure 14: Data classes and data types identified

Finally, Automated Discovery has a frequency distribution where you can see the number of records and the cardinality of their distinct values.

a screenshot showing the frequency distribution of records and cardinality of values discovered during an automated discovery job for a column titled client_id
Figure 15: Frequency Distribution
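The kind of frequency distribution shown in Figure 15 boils down to counting values and distinct values per column. A minimal sketch:

```python
from collections import Counter

def frequency_profile(values):
    """Summarize a column: record count, cardinality, and most common values."""
    freq = Counter(values)
    return {
        "records": len(values),
        "cardinality": len(freq),       # number of distinct values (incl. nulls)
        "top_values": freq.most_common(3),
    }

profile = frequency_profile(["Y", "N", "Y", "Y", None, "N"])
print(profile)  # 6 records, 3 distinct values, 'Y' most frequent
```

A column whose cardinality equals its record count is a candidate key, while very low cardinality suggests a code or flag column; that is why the distribution is useful when reviewing proposed data classes.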

Conclusion

Automated Discovery is the tool of choice for detailed analysis of a medium-sized database, folder, or schema. It is designed for repeated use because it detects deltas and synchronizes changes across assets and projects. This is the discovery you should use once you have identified the areas of your data sources that require deeper investigation, which you can do via Quick Scan or interviews with the data source owners.

You will have noticed how easily a user can navigate through simple screens to get a comprehensive picture of the dark data in the enterprise. This helped make Watson Knowledge Catalog an iF Design Award winner in 2021. iF Design Foundation awards are some of the most coveted in the world. WKC joins companies like BMW, Braun, Samsung, Apple, and others that differentiate their products, services, and employee experiences through good design. For more details, read here.

screenshot of Watson Knowledge Catalog with an image of an iF design award
Figure 16: WKC, winner of coveted iF design award

This wraps up our series on illuminating an organization’s dark data. If your organization faces data discovery challenges, the tools available within Watson Knowledge Catalog can help you find actionable insights within the dark deep lake of data. If you would like to learn more about WKC enterprise data governance, feel free to investigate the documentation.



Namit Kabra is a Software Developer for the IBM Cloud and Cognitive Software. For more, visit his personal website: https://namitkabra.wordpress.com/about/