Learn IBM Watson Discovery During the COVID-19 Pandemic

Truong Bui
Published in Voice Tech Podcast
Mar 30, 2020

Being in quarantine can be challenging, but it is also a chance to spend the extra time on self-development and sharpen our skills as software engineers.

I built a small web application that integrates IBM Watson Discovery. In this article, I focus on the fundamentals of how Discovery processes data, based on what I have learned so far. (The GitLab link to the project is at the end of the article.)

Intro to IBM Watson Discovery

Imagine your boss gives you a ton of documents (JSON, HTML, PDF, Word, and so on) and tells you to find the insights hiding in them. What are you going to do? Read and analyze every document by hand, right? Trust me, that quickly becomes overwhelming.

Fortunately, IBM Watson Discovery was created to help us go through situations like this. IBM Watson Discovery is a cognitive search and content analytics engine that you can use to identify patterns, trends and actionable insights from structured and unstructured data (JSON, HTML, PDF, Word, and more) with speed and accuracy to drive better decision-making.

Discovery takes care of most of the hard work for you. You create an instance of Discovery, then upload your data into a collection (think of a collection as a box where Discovery stores your data). Discovery automatically ingests the data (converts, enriches, cleans, and normalizes it) and stores it so that it can be queried for actionable insights. After that, you can easily search and query the original data for whatever you need. Cool, right?

The Automated Data Pipeline

Alright! So let’s dive into what I learned.


Ingestion

Ingestion is the process in which Discovery converts, enriches, cleans, and normalizes your structured and unstructured data. It starts automatically once the data has been successfully uploaded and persisted in Discovery.

  1. Convert, Clean, Normalize: Once the data has been uploaded, Discovery first converts PDF and Microsoft Word documents to HTML, then converts the HTML to JSON.
  2. Enrich: With its built-in natural language processing (NLP) capabilities, Discovery can extract enrichments from a wide range of document types, such as JSON, HTML, PDF, and Microsoft Word. The following table shows the key enrichments.

Once ingestion is finished, the converted, enriched, and normalized data is saved in collections, and you can immediately start running queries to get insights from the original data. As I said above, Discovery handles most of the hard work for you, so there is not much more to say about the Storage and Query parts. I’m going to move on to the Training part.
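Before moving on, here is a toy model of the convert, enrich, store, and query flow described above. It is only a sketch to make the steps concrete: it does not call Discovery and is not how Discovery is implemented. The function names (`convert`, `enrich`, `ingest`, `query`), the `enriched_keywords` field, and the word-count "enrichment" are all my own assumptions for illustration; the real service uses NLP models, not word counts.

```python
import json
import re
from collections import Counter
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the <title> and the remaining text of an HTML document."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.chunks = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        else:
            self.chunks.append(data)


def convert(html):
    """Convert/normalize step: HTML in, JSON-style dict out, whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(" ".join(parser.chunks).split())
    return {"title": parser.title.strip(), "text": text}


# Hypothetical stopword list, just enough for the example.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for", "on"}


def enrich(doc, top_n=3):
    """Enrich step (toy stand-in): tag the document with its most frequent words."""
    words = [w for w in re.findall(r"[a-z]+", doc["text"].lower())
             if w not in STOPWORDS]
    keywords = [w for w, _ in Counter(words).most_common(top_n)]
    return {**doc, "enriched_keywords": keywords}


def ingest(collection, html):
    """Run convert and enrich, then store the result in the collection."""
    collection.append(enrich(convert(html)))


def query(collection, term):
    """Return every stored document whose text contains the term."""
    term = term.lower()
    return [doc for doc in collection if term in doc["text"].lower()]


# Usage: a plain list plays the role of a collection.
collection = []
ingest(collection, "<html><head><title>Audit report</title></head>"
                   "<body><p>Audit   findings for the audit team.</p></body></html>")
ingest(collection, "<html><head><title>Lunch menu</title></head>"
                   "<body><p>Soup and salad.</p></body></html>")
hits = query(collection, "audit")
print(json.dumps(hits, indent=2))
```

The point of the sketch is the shape of the pipeline: every document goes through the same convert and enrich steps before it is stored, so queries always run against the normalized JSON rather than the original file.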

Training

Discovery is powerful, but it cannot do everything. For example, you run a simple query and Discovery returns a large result set, but not all of the results contain relevant information. So you need to fine-tune the result set and teach Discovery which results are relevant and which are not. This process is called training.

To train Discovery, you must provide the following:

  1. Training queries: a set of natural-language queries that are representative of the queries your users enter.
  2. Ratings: labels that indicate which results for each query are relevant and which are not.

You need to provide at least 49 unique training queries. Discovery provides feedback if it needs more queries so that it can train.

You start the training process by adding training queries to Discovery and rating the results as either Relevant or Not relevant. After you are finished, it takes about 30 minutes for the system to complete the training and reflect it in the results, so if you search again after that, you should see better results.
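Since it is easy to lose track of how many queries you have rated, a small local check can help before you start training. The sketch below is my own helper, not part of Discovery or its SDK: `TrainingQuery` and `check_training_set` are hypothetical names, and the only rule taken from this article is the minimum of 49 unique queries. Treating queries that differ only in case or surrounding whitespace as duplicates is also my assumption.

```python
from dataclasses import dataclass, field

MIN_UNIQUE_QUERIES = 49  # minimum number of unique training queries stated above


@dataclass
class TrainingQuery:
    text: str
    # Maps a document id to True (Relevant) or False (Not relevant).
    ratings: dict = field(default_factory=dict)


def check_training_set(queries):
    """Return a list of problems; an empty list means the set looks ready."""
    problems = []
    # Assumption: case and surrounding whitespace do not make a query unique.
    unique = {q.text.strip().lower() for q in queries}
    if len(unique) < MIN_UNIQUE_QUERIES:
        missing = MIN_UNIQUE_QUERIES - len(unique)
        problems.append(f"need {missing} more unique queries")
    for q in queries:
        if not q.ratings:
            problems.append(f"query {q.text!r} has no rated results")
    return problems


# Usage: 48 rated queries plus one that was added but never rated.
queries = [
    TrainingQuery(f"audit finding {i}", {"doc-1": True, "doc-2": False})
    for i in range(48)
]
queries.append(TrainingQuery("unrated query"))
print(check_training_set(queries))
```

Here the set reaches 49 unique queries, so the only problem reported is the query that has no ratings yet.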

After reading this post, you may well ask me, “Why did you describe an IBM service so briefly?” I’m only covering what I learned from building my own project, and I tried to describe it as clearly as possible. There is a ton of documentation on the IBM website, so take a look if you want to learn more about Discovery.

Have a look at my project https://gitlab.com/buingoctruong1508/auditcheck

Thank you all for reading!
