Illuminate Dark Data — Part II (Quick Scan)
This blog is written in collaboration with Albert Maier, STSM, Chief Architect, Information Governance.
A data lake is a set of one or more data repositories created to support data discovery, analytics, ad hoc investigations, and reporting. Its highly scalable environment supports extremely large data volumes, collecting petabytes of structured, semi-structured, and unstructured data in its native format from a variety of sources, including Internet of Things (IoT) devices and social media. Data lakes provide the content for machine learning and real-time advanced analytics in a collaborative environment.
But without appropriate governance and quality controls, data lakes can quickly turn into unmanageable data swamps. Issues with the data’s quality, appropriate use, reliability, and sensitivity can arise, with consequences ranging from an inability to act on the data to outright regulatory penalties. A company that does not properly govern its sensitive data can face penalties of up to 20 million euros or 4% of its worldwide annual turnover, whichever is higher, under the EU’s General Data Protection Regulation (GDPR).
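To make the size of that exposure concrete, here is a minimal sketch of the upper bound on a GDPR fine in its highest tier: the greater of a flat 20 million euros or 4% of worldwide annual turnover. The function name and the sample turnover figures are ours, chosen purely for illustration. The real fine in any given case is set by the supervisory authority and can be far below this ceiling.

```python
# Illustrative only: upper bound of a GDPR fine in the highest tier.
# The function name and example figures are assumptions for this sketch.

FLAT_CAP_EUR = 20_000_000      # flat ceiling of 20 million euros
TURNOVER_PERCENT = 4           # 4% of worldwide annual turnover


def gdpr_max_fine_eur(annual_turnover_eur: int) -> int:
    """Return the larger of the flat cap and 4% of annual turnover (whole euros)."""
    if annual_turnover_eur < 0:
        raise ValueError("turnover must not be negative")
    turnover_based = annual_turnover_eur * TURNOVER_PERCENT // 100
    return max(FLAT_CAP_EUR, turnover_based)


if __name__ == "__main__":
    # A company with 100 million euros turnover: 4% is 4 million, so the flat cap applies.
    print(gdpr_max_fine_eur(100_000_000))    # 20000000
    # A company with 2 billion euros turnover: 4% is 80 million, which exceeds the flat cap.
    print(gdpr_max_fine_eur(2_000_000_000))  # 80000000
```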
In the first post, we gave an overview of data discovery within the IBM Watson Knowledge Catalog (WKC). This post is dedicated to the Quick Scan (QS) data discovery technology. Quick Scan is extremely fast and built for a shallow analysis of millions of data elements. Here we will give a brief overview of Quick Scan, its features and the data sources it supports, and then walk through how to use it.
Introducing Quick Scan
Quick Scan helps an enterprise get a feel for the data in its data lakes, both the nature of the data (“classification”) and its quality. A critical challenge for many companies is understanding the risks attached to the sensitive data they may be storing, given both the legal consequences and the potential violation of consumer trust. If an organization does not know where it holds Personally Identifiable Information (PII), it can run Quick Scan to identify which data elements appear to contain PII. If an enterprise has, for example, a large number of unknown tables in various schemas, it can run a Quick Scan analysis on each schema to see an analysis of each table and column.
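To give a feel for what a shallow, sample-based classification looks like in principle, here is a small sketch that suggests a data class for each column by matching a sample of its values against patterns. This is not how Quick Scan is implemented. The data class names, the regular expressions, the sample size, and the 80% threshold are all assumptions made for this example.

```python
import re
from typing import Dict, List, Optional

# Hypothetical data classes; a real governance catalog defines its own.
DATA_CLASS_PATTERNS: Dict[str, "re.Pattern[str]"] = {
    "Email Address": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    "US Social Security Number": re.compile(r"\d{3}-\d{2}-\d{4}"),
}


def suggest_data_class(values: List[object],
                       sample_size: int = 100,
                       threshold: float = 0.8) -> Optional[str]:
    """Suggest a data class if enough sampled, non-empty values match its pattern."""
    # Shallow analysis: look only at the first `sample_size` values.
    sample = [str(v).strip() for v in values[:sample_size]
              if v is not None and str(v).strip() != ""]
    if not sample:
        return None
    best_class, best_share = None, 0.0
    for name, pattern in DATA_CLASS_PATTERNS.items():
        share = sum(1 for v in sample if pattern.fullmatch(v)) / len(sample)
        if share > best_share:
            best_class, best_share = name, share
    return best_class if best_share >= threshold else None


def scan_table(columns: Dict[str, List[object]]) -> Dict[str, Optional[str]]:
    """Return a suggested data class (or None) for every column of a table."""
    return {name: suggest_data_class(vals) for name, vals in columns.items()}


if __name__ == "__main__":
    table = {
        "contact": ["a@example.com", "b@example.org", None],
        "ssn": ["123-45-6789", "987-65-4321"],
        "note": ["hello", "world"],
    }
    for column, data_class in scan_table(table).items():
        print(column, "->", data_class)
```

Columns flagged with a class such as an email address or a social security number are the ones a reviewer would look at first when deciding what may contain PII.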
While running QS, a company can make use of the data classes and business terms defined earlier in its governance process, mapping them to the unknown data discovered during the scan. The data is also given a quality score, which lets the business determine whether the data is good enough for sharing and wider usage or whether it should be improved first. Quick Scan also gives a quick view of column metadata that might be helpful for consumers. Crucially for any sensitive data, QS allows teams to quickly ascertain whether data is suitable for sharing with a broader group of people in a data catalog.
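One simple way to think about a quality score is as the share of values in a column that are present and conform to the expected format. The sketch below follows that idea. It is our own simplification, not the scoring formula that Quick Scan uses, and the 90% sharing threshold is a hypothetical policy value.

```python
import re
from typing import Iterable, Optional


def column_quality_score(values: Iterable[object],
                         expected: Optional["re.Pattern[str]"] = None) -> float:
    """Percentage of values that are non-empty and, if a pattern is given, match it."""
    values = list(values)
    if not values:
        return 0.0
    good = 0
    for value in values:
        text = "" if value is None else str(value).strip()
        if text == "":
            continue  # missing value counts against the score
        if expected is not None and not expected.fullmatch(text):
            continue  # value violates the expected format
        good += 1
    return 100.0 * good / len(values)


def ready_to_share(score: float, minimum: float = 90.0) -> bool:
    """Hypothetical policy: share a column only if its score reaches the minimum."""
    return score >= minimum


if __name__ == "__main__":
    five_digit_code = re.compile(r"\d{5}")
    score = column_quality_score(["12345", "9876", None, "54321"], five_digit_code)
    print(score)                  # 50.0 (two of four values are present and valid)
    print(ready_to_share(score))  # False
```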
Features of Quick Scan include:
· Providing a very fast initial analysis that includes data quality scoring, suggested data classes and business term assignments
· Performing asset import in a separate sandbox
· Efficient filtering and reviewing of analysis results
· Allowing manual changes to proposed business term assignments
· Sharing and “publishing” of assets into arbitrary catalogs
QS supports a large number of structured data sources, including:
· DB2
· DB2 Warehouse on Cloud
· Hive
· Microsoft SQL Server
· MongoDB
· Oracle
· PostgreSQL
· Teradata
Example Workflow with Quick Scan
Below, we will give an example of a user workflow for QS:
1. First, from within Watson Knowledge Catalog, the user chooses Quick Scan from the Data Discovery menu.
2. Second, choose New Discovery Job → Quick Scan.
3. Next, fill in the required fields:
You may choose the Connection from a previously created Global
4. Then, to begin the discovery process, click the “Discover” button. A notification confirms that the discovery operation has started, and the page listing discovery jobs is displayed.
5. After this, click on the refresh button to update the page.
6. Once the QS job is completed, it moves to the “Action required” tab. The latest runs show up at the top, but the user can also search for a job by its Job ID.
7. Now, the user can click on the Job ID, and then click on View Results to see the results.
8. Now, to see the columns and tables, the user should click on the Explore assets button.
9. If the user clicks on the pencil icon, the user can view the Assigned Terms and Suggested Terms.
10. Next, the user can select Table from the list of Asset Types to view the discovered tables, and then select “View details”.
11. The user can select one of the tables to see the summary result for that table. They can also edit business term assignments and other properties by clicking the pencil buttons.
12. Alternatively, the user can press “View details” to get the detailed results for the table, including the column-level results. These results are published to the catalog when the user clicks Publish.
13. Once a user has reviewed the results, they can publish the assets to the specified catalogs. By selecting Table from the list of Asset Types, the user can view the tables and then select the ones to publish to a WKC catalog.
14. The user can then click on “Publish” to publish the Assets to the selected catalog(s).
15. The user is then able to navigate to the Catalog to see the published tables and review the profiling results.
16. Finally, the user can drill down into a table and, if necessary, edit it, delete the data profiling results, or re-run data profiling.
Conclusion
As you can see from the walkthrough above, Quick Scan is an intuitive and fast way for companies to gain valuable insights into their large amounts of dark data. Quick Scan can take as little as two minutes to run, and those two minutes can save companies millions of dollars in potential legal fines. We encourage you to learn more about Quick Scan and automated discovery in Watson Knowledge Catalog by visiting https://www.ibm.com/cloud/watson-knowledge-catalog
Sources of pictures used in the blog:
Figure 1: Photo by Markus Spiske on Unsplash
Figure 2: Photo by Myriam Jessier on Unsplash