Illuminating Dark Data III — Automated Discovery
This blog was written in collaboration with Albert Maier (STSM, Chief Architect, Information Governance) and Gowreeswari Bongu (Test Specialist).
In the previous post, we discussed Quick Scan, which lets a user examine millions of columns of data in a shallow manner. Here are the use cases where a tool like Quick Scan fits:
1. An initial, one-time risk assessment for PII data across larger information system landscapes. For example, if you don’t know whether you have PII risks in your data sources, you can run Quick Scan to identify which columns appear to contain PII data.
2. Quick identification of interesting data containers. For example, if you have a large number of unknown tables in various schemas, you can run a Quick Scan analysis on each schema to get an overview of every table and column. Thus, you can:
a. Use the proposed data class and business term assignments to check which tables and schemas are relevant and suited to be shared with a broader group of people in a data catalog
b. Check whether certain tables are not relevant (they might be empty, contain only data that is not relevant for the business, etc.)
c. Use the quality score to determine whether the data quality is too poor to share the data without further investigation and improvement
d. Look at the column metadata found for each table and analyze whether the metadata is correct and helpful for consumers
In this post, we will discuss the opposite end of the spectrum. Some use cases call for a very deep investigation of a more limited number of data elements that an enterprise defines as critical to its business. For any business, the set of critical data elements is much smaller than the total volume of data it typically stores. Automated Discovery offers the features needed for a deep analysis and investigation of these critical data elements within an enterprise.
Automated Discovery:
Automated Discovery provides detailed analysis results for all assets in the scope of an Automated Discovery run. This type of discovery is suitable for smaller numbers of tables and files, and it performs a deeper scan than Quick Scan. To run Automated Discovery, a user selects one or several schemas or individual data assets, chooses analysis options, and finally picks a target data quality project that controls the depth of analysis. After you initiate the scan, Automated Discovery automatically imports all the selected schemas, tables, and columns directly into the data catalog. The corresponding data analysis results and proposed data class and business term assignments are pushed into the data quality project, where you can review and edit them. Curation changes can then be pushed into the data catalog.
When running an Automated Discovery job, a typical scope is about 200 tables and 10,000 columns (assuming an average of 50 columns per table). This is much smaller than the typical Quick Scan scope of 20,000 tables and 1,000,000 columns. Given the in-depth analysis offered by Automated Discovery, it is reasonable to assume that most of the data in a large enterprise system landscape with many petabytes of data does not need to be scanned to this depth. The “full analysis” offered by Automated Discovery includes the following:
• A data classification against hundreds of data classes that might comprise complex custom data classes, as well as checks against very large reference data sets
• A variety of term assignment services that compare data and its metadata against a business terminology that could comprise 100,000 business terms
• A data quality assessment that might have a very fine granularity, as well as checking tens of data quality dimensions and hundreds of data rules
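To make the first of these analyses concrete, here is a minimal, hypothetical sketch of pattern-based data classification: a column is assigned a data class when a high enough fraction of its values matches that class's pattern. The class names, patterns, and threshold below are illustrative assumptions, not the WKC implementation, which also checks values against large reference data sets.

```python
import re

# Hypothetical data classes, each defined by a regular expression.
# A real catalog ships hundreds of classes plus reference-data checks;
# this is only an illustration of the matching idea.
DATA_CLASSES = {
    "EMAIL": re.compile(r"^[\w.+-]+@[\w-]+\.[\w.]+$"),
    "US_PHONE": re.compile(r"^\d{3}-\d{3}-\d{4}$"),
}

def classify_column(values, threshold=0.8):
    """Return the data class whose pattern matches at least `threshold`
    of the non-empty values in the column, or None if no class fits."""
    values = [v for v in values if v]
    if not values:
        return None
    for name, pattern in DATA_CLASSES.items():
        hits = sum(1 for v in values if pattern.match(v))
        if hits / len(values) >= threshold:
            return name
    return None

print(classify_column(["alice@example.com", "bob@ibm.com"]))  # EMAIL
```

The threshold keeps a column with a few stray values (such as “n/a” placeholders) from losing its classification, which is one reason sampled value-level analysis is more reliable than matching on column names alone.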
Automated Discovery is a resource-intensive process, and this can impact other people working on the system. A successful strategy for using Automated Discovery is to work iteratively with smaller scopes. This also allows the machine learning (ML) in the system to gain insights from early results while you work with results that are already complete.
Additionally, before using Automated Discovery, it is highly recommended that you do a quick investigation (using Quick Scan) to identify the data sets and columns that have critical business value, so that you can focus Automated Discovery on those.
Features of Automated Discovery:
• Automated Discovery provides detailed analysis that includes the data quality score, suggested data classes and business terms, data types, formats, frequency distributions, detailed column analysis, and more.
• Any assets discovered by Automated Discovery are automatically persisted into the data catalog to make people aware of new data assets. At the same time, these assets are added to a data quality project. Analysis results from an Automated Discovery run are available only in the data quality project, where they can be reviewed and edited before being made available to catalog users via a “publish to catalog” operation.
• If you rerun Automated Discovery on the same data source, it checks for deltas in the technical metadata, such as new tables or added or dropped columns, synchronizes these changes into the data catalog, and can restrict re-analysis to the changed assets.
• If published results change because of parallel activities in the system, for example if published term assignments are changed by another project, these changes are automatically considered in the curation process.
• You can run Automated Discovery for Db2, Db2 Warehouse on Cloud, HDFS, Hive, Microsoft SQL Server, MongoDB, Oracle, PostgreSQL, and Teradata data sources connected through a JDBC connector. In addition, you can run it for Amazon S3, Greenplum, Netezza, and Snowflake data sources that you connect to through connections created via metadata import, provided metadata import is enabled.
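The delta detection mentioned above can be sketched as a comparison of two technical-metadata snapshots. The snapshot shape (a mapping from table name to its set of column names) and the function below are illustrative assumptions, not the product's internal representation:

```python
def metadata_delta(old, new):
    """Compare two technical-metadata snapshots, each a dict mapping
    table name -> set of column names. Returns the new tables, dropped
    tables, and per-table column changes that an incremental rerun
    would need to re-analyze."""
    delta = {
        "new_tables": sorted(set(new) - set(old)),
        "dropped_tables": sorted(set(old) - set(new)),
        "changed_tables": {},
    }
    # Tables present in both snapshots: diff their column sets.
    for table in set(old) & set(new):
        added = sorted(new[table] - old[table])
        dropped = sorted(old[table] - new[table])
        if added or dropped:
            delta["changed_tables"][table] = {"added": added, "dropped": dropped}
    return delta
```

Restricting re-analysis to `new_tables` and `changed_tables` is what makes repeated runs on the same source much cheaper than the initial run.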
Using Automated Discovery
Here is how you trigger an Automated Discovery job:
Once triggered, the Automated Discovery job enters a running state. The screenshot below shows that the metadata import has completed for BANK1 and is still running for BANK2.
After import, the analysis starts as shown below:
When the job completes, it shows a finished state.
Once the job is done, you can examine it in further detail to see which schemas and assets are present. Let’s click BANK3 to get the summary results for the assets in that schema.
You can click the dataset name to get a summary of the dataset.
From here, you can also view the dataset-level details, edit the terms and data classes, and even publish the dataset.
You can view the project-level details by clicking Project details (Figure 5):
The image below shows a dashboard of a project after an Automated Discovery job has been run on a dataset.
Within these analytics, further drill-downs are available. Here, you can see the rules that flagged quality violations on the dataset.
Drilling down even further, you can see which values violated the rules.
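As an illustration of this kind of drill-down (and not the WKC rule engine itself), a data rule can be modeled as a predicate over a column, and a violation report as the list of values for which the predicate fails. The rule names, columns, and thresholds below are invented for the example:

```python
def find_violations(rows, rules):
    """Evaluate data rules against each row and collect the offending
    values per rule. `rules` maps a rule name to a (column, predicate)
    pair, where the predicate must hold for the row to pass."""
    violations = {name: [] for name in rules}
    for row in rows:
        for name, (column, predicate) in rules.items():
            value = row.get(column)
            if not predicate(value):
                violations[name].append(value)
    return violations

# Hypothetical rules and data for demonstration.
rules = {
    "age_in_range": ("age", lambda v: v is not None and 0 <= v <= 120),
    "name_present": ("name", lambda v: bool(v)),
}
rows = [
    {"name": "Ann", "age": 34},
    {"name": "", "age": 250},  # violates both rules
]
```

A quality score can then be derived from the ratio of passing checks to total checks, which is the kind of aggregate the project dashboard surfaces before you drill down to the individual values.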
You can see which rules were defined for use in this iteration of Automated Discovery.
You can also see which Data Classes were assigned and Data Types discovered.
Finally, Automated Discovery provides a frequency distribution view, where you can see the number of records and the cardinality of the distinct values in each column.
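The record count, cardinality, and frequency distribution described above are straightforward to compute for a single column; here is a minimal sketch (the profile keys are my own naming, not the product's):

```python
from collections import Counter

def column_profile(values):
    """Profile one column: total record count, distinct-value
    cardinality, and the frequency of each value (most common first)."""
    freq = Counter(values)
    return {
        "records": len(values),
        "cardinality": len(freq),
        "frequency": freq.most_common(),
    }

profile = column_profile(["DE", "US", "US", "DE", "FR"])
```

A cardinality close to the record count suggests a key or free-text column, while a low cardinality suggests a code or category column, which is why this view is useful when reviewing proposed data classes.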
Conclusion
Automated Discovery is the tool of choice for detailed analysis of a medium-sized database, folder, or schema. It is designed for repeated use because it detects deltas and synchronizes changes across assets and projects. This is the discovery you should use once you have already identified the areas of your data sources requiring deeper investigation, which you can discover via Quick Scan or interviews with the data source owners.
You will have noticed how easily a user can navigate through simple screens to get a comprehensive picture of the dark data in the enterprise. This helped make Watson Knowledge Catalog an iF Design Award winner in 2021. iF Design Foundation awards are among the most coveted in the world, and WKC joins companies like BMW, Braun, Samsung, and Apple that differentiate their products, services, and employee experiences through good design. For more details, read here.
This wraps up our series on illuminating an organization’s dark data. If your organization faces data discovery challenges, the tools available within Watson Knowledge Catalog can help you find actionable insights within the deep, dark lake of data. If you would like to learn more about WKC enterprise data governance, feel free to investigate the documentation.