Illuminate Dark Data — Part II (Quick Scan)

Namit Kabra
IBM Data Science in Practice
6 min read · May 7, 2021

This blog is written in collaboration with Albert Maier, STSM, Chief Architect, Information Governance.

A data lake is a set of one or more data repositories created to support data discovery, analytics, ad hoc investigations, and reporting. Its highly scalable environment supports extremely large data volumes, collecting petabytes of structured, semi-structured, and unstructured data in its native format from a variety of sources, including Internet of Things (IoT) devices and social media. Data lakes provide the content for machine learning and real-time advanced analytics in a collaborative environment.

a horizontal stream of green symbols in a black space
Figure 1: Without appropriate governance, data lakes quickly turn into unmanageable data swamps

But without appropriate governance or quality controls, data lakes can quickly turn into unmanageable data swamps. Issues with the data’s quality, appropriate use, reliability, and sensitivity can arise, with consequences ranging from an inability to act on the data to outright regulatory penalties. A business that does not properly govern its sensitive data can face penalties of up to 20 million euros or 4% of its worldwide annual turnover, whichever is higher, under the EU’s General Data Protection Regulation (GDPR).
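To make the size of that exposure concrete, here is a minimal sketch of the penalty ceiling arithmetic. Under the GDPR's highest tier, the cap is the greater of the two figures. The function name and the sample turnover values are illustrative only, and actual fines are set case by case by regulators.

```python
def max_gdpr_fine_eur(annual_turnover_eur):
    """Return the upper bound of a top-tier GDPR fine, in euros.

    The ceiling is the greater of a fixed EUR 20 million or 4% of
    worldwide annual turnover. This is only the cap, not a prediction
    of the fine a regulator would actually impose.
    """
    fixed_cap = 20_000_000
    turnover_cap = annual_turnover_eur * 4 / 100
    return max(fixed_cap, turnover_cap)


# A company with EUR 100 million turnover is bounded by the fixed cap,
# while one with EUR 1 billion turnover is bounded by the 4% figure.
small_company_cap = max_gdpr_fine_eur(100_000_000)
large_company_cap = max_gdpr_fine_eur(1_000_000_000)
```

For smaller companies the fixed 20 million euro figure dominates, so the exposure does not shrink in proportion to revenue.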

In the first post, we gave an overview of data discovery within the IBM Watson Knowledge Catalog (WKC). This post is dedicated to the Quick Scan (QS) data discovery technology. Quick Scan is extremely fast and built for a shallow analysis of millions of data elements. Here we will give a brief overview of Quick Scan, its features and the data sources it supports, and then walk through how to use it.

Introducing Quick Scan

a pair of person’s hands typing on the keyboard of a laptop with a dashboard on the screen. Next to the laptop are a cup and two writing instruments
Figure 2: Quick Scan helps an enterprise get a quick feel of the dark data

Quick Scan helps an enterprise get a feel for the data that exists in its data lakes, both the nature of the data (“classification”) and its quality. One critical challenge many companies face is understanding the risks to the sensitive data they may be storing, which carry both legal consequences and potential violations of consumer trust. If an organization doesn’t know where it stands with respect to Personally Identifiable Information (PII), it can run Quick Scan to identify which data elements appear to contain PII. If an enterprise has, for example, a large number of unknown tables in various schemas, it can run a Quick Scan analysis on each schema to see an analysis of each table and column.
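The idea of flagging columns that "appear to contain" PII can be illustrated with a small sketch. This is not Quick Scan's algorithm, which this post does not describe. It is a simplified stand-in that matches a sample of column values against regular expressions. The data class names, the patterns, and the 0.8 threshold are all assumptions chosen for illustration.

```python
import re

# Hypothetical data classes, each defined by one regular expression.
# Real data classes are richer than a single pattern.
DATA_CLASSES = {
    "Email Address": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    "US Social Security Number": re.compile(r"\d{3}-\d{2}-\d{4}"),
}


def suggest_data_class(sample_values, threshold=0.8):
    """Suggest a data class for a column based on a sample of its values.

    Returns the name of the class whose pattern matches the largest
    share of non-empty values, or None if no class reaches `threshold`.
    """
    values = [v for v in sample_values if v]  # skip None and empty strings
    if not values:
        return None

    best_name, best_ratio = None, 0.0
    for name, pattern in DATA_CLASSES.items():
        matches = sum(1 for v in values if pattern.fullmatch(v))
        ratio = matches / len(values)
        if ratio > best_ratio:
            best_name, best_ratio = name, ratio

    return best_name if best_ratio >= threshold else None


# Example: a sampled column that looks like it holds email addresses.
email_column = ["a@b.com", "c@d.org", None, "e@f.net"]
suggestion = suggest_data_class(email_column)
```

Working on a sample rather than on every row is what keeps this kind of analysis shallow and fast. The cost is that a suggestion is only a hint that a person should review.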

While running QS, a company can make use of data classes and business term arrangements defined earlier in a governance process, mapping them to the unknown data discovered during the scan. The data is also given a quality score, which allows a business to determine whether the quality of the data is good enough for sharing and wider usage or whether it should be improved before any sharing is possible. Quick Scan also gives a quick view of column metadata that might be helpful for consumers. Crucially for any sensitive data, QS allows teams to quickly ascertain whether data is suitable for sharing with a broader group of people in a data catalog.
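As a rough illustration of how a quality score can gate sharing, the sketch below scores a column by completeness alone and compares the result with a minimum. How Quick Scan actually computes its quality score is not covered in this post, so the completeness metric, the function names, and the 0.9 minimum are assumptions made for the example.

```python
def completeness_score(values):
    """Return the fraction of values that are present (not None or blank).

    A deliberately simple stand-in for a data quality score. A real
    score would typically combine several quality dimensions.
    """
    if not values:
        return 0.0
    present = sum(1 for v in values if v is not None and str(v).strip() != "")
    return present / len(values)


def ready_to_share(score, minimum=0.9):
    """Decide whether a column meets the (hypothetical) sharing minimum."""
    return score >= minimum


# Example: half of the sampled values are missing, so the column
# falls below the minimum and should be improved before sharing.
column_sample = ["Berlin", None, "Pune", ""]
score = completeness_score(column_sample)
shareable = ready_to_share(score)
```

Keeping the threshold as a parameter lets a governance team tighten or relax the bar per catalog instead of hard-coding one rule.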

Features of Quick Scan include:

· Providing a very fast initial analysis that includes data quality scoring, suggested data classes and business term assignments
· Performing asset import in a separate sandbox
· Efficient filtering and reviewing of analysis results
· Allowing manual changes to proposed business term assignments
· Sharing and “publishing” of assets into arbitrary catalogs

QS supports a large number of structured data sources, including:

· DB2
· DB2 Warehouse on Cloud
· Hive
· Microsoft SQL Server
· MongoDB
· Oracle
· PostgreSQL
· Teradata

Example Workflow with Quick Scan

Below, we will give an example of a user workflow for QS:

1. First, from within Watson Knowledge Catalog, the user will choose Quick Scan from the Data Discovery menu.

screenshot of quick scan results under data discovery in Cloud Pak for Data. The summary tab is active showing the number of jobs in progress.

2. Second, choose New Discovery Job → Quick Scan.

screenshot of “Quick Scan job” in Cloud Pak for Data, showing the fields of Connection, Discovery Root, and a Section called Discovery options, and another section called Staging Area

3. Next, fill in the required fields:

You may choose the Connection from a previously created Global

screenshot of a Quick Scan job page in IBM Cloud Pak for Data with fields filled in and options “Analyze data quality” and “use machine learning to assign terms” selected.

4. Then, to begin the discovery process, click the “Discover” button. A notification confirms that the discovery operation has started, and the following page is displayed.

screenshot of the Quick Scan results pages in Cloud Pak for Data. The Pending analysis tab is active with a left hand nav section with status and filter options. There is a list of jobs in the main part of the page

5. After this, click on the refresh button to update the page.

6. Once the QS job is completed, it moves to the “Action required” tab. The latest runs appear at the top, but the user can also search for a job by its Job ID.

screenshot of quick scan results in Cloud Pak for Data. The action required tab is active with a left hand section with status and filters options and the main section of the page consists of a list of jobs. No jobs are selected.

7. Now, the user can click on the Job ID, and then click on View Results to see the results.

screenshot of quick scan results in Cloud Pak for Data. The action required tab is active with a left hand section with status and filters options and the main section of the page consists of a list of jobs. One job is selected.

8. Now, to see the columns and tables, the user should click on the Explore assets button.

screenshot of the results for one quick scan job showing business term assignment and data class assignment

9. If the user clicks on the pencil icon, the user can view the Assigned Terms and Suggested Terms.

screenshot of assets for a quick scan job

10. Next, the user can select Table from the list of Asset Types to view the discovered tables and then view their details.

screenshot of assets in quick scan job showing discovered tables

11. The user can select one of the tables to see the summary result for that table. They can also edit Business Term assignments etc. by clicking on the pencil buttons.

a screenshot of details of a table in a job using quick scan

12. Alternatively, the user can press “View details” to get the detailed results for the table, including the column-level results. These results are published to the catalog when the user chooses Publish.

screenshot of a detailed view of a table showing columns and the results of the quick scan analysis with the table

13. Once a user has reviewed the results, they can publish the assets to the specified catalogs. By selecting the Table from the list of Asset types, the user can view Tables and then select them to Publish to a WKC catalog.

a screenshot of “discovered tables” after a quick scan job with the “publish assets” button chosen

14. The user can then click on “Publish” to publish the Assets to the selected catalog(s).

a screenshot of the overlay popup where a user can select which assets within a table to publish to a catalog

15. The user is then able to navigate to the Catalog to see the published tables and review the profiling results.

a screenshot of a table within a data catalog in Cloud Pak for Data

16. Finally, the user can drill down into a table and, if necessary, edit it, delete the data profiling results, or re-run data profiling.

a screenshot of a drilldown into a column of a table in the data catalog

Conclusion

As you can see from the walkthrough above, Quick Scan is an intuitive and fast way for companies to gain valuable insights into their large amounts of dark data. Quick Scan can take as little as two minutes to run, and those two minutes can save companies millions of dollars in potential legal fines. We encourage you to learn more about Quick Scan and automated discovery in Watson Knowledge Catalog by visiting https://www.ibm.com/cloud/watson-knowledge-catalog

Sources of pictures used in the blog:
Figure 1: Photo by Markus Spiske on Unsplash
Figure 2: Photo by Myriam Jessier on Unsplash

Namit Kabra is a Software Developer for IBM Cloud and Cognitive Software. For more, visit his personal website: https://namitkabra.wordpress.com/about/