How to Build Entity Recognition Models in a Jiffy using Watson Discovery (Part 1 of 3)

This article aims to provide a step by step guide to understanding how Watson Discovery can help business analysts jump-start entity recognition model building with zero coding.

Kunal Sawarkar
IBM Data Science in Practice

--

Written by Kunal Sawarkar & Jade Zhou

This is the first in a three-part series. Part 2 will focus on how Watson Knowledge Studio empowers subject matter experts with self-service, domain-aware entity extraction capabilities. Part 3 will discuss how data scientists can up their game with sophisticated yet efficient handling of NLP issues while exploiting the integration of Watson Discovery with Jupyter Notebooks.

The Entity Problem

Identifying entities in unstructured text can be challenging. A piece of text might mention a company, a person, or a location, and it may be further complicated by references that are specific to a particular domain. For example, a company name may refer to a vendor, a supplier, or a broker depending on the context of a given piece of text. Who sent the text, and to whom a mentioned action refers, can tell us what relationship one entity shares with another. Further still, a single entity may appear in different forms while carrying the same intended meaning.

Let us start tackling this problem with a real-life email example. We need to understand the content and context of the email so as to extract the relevant entities and then take some action(s). Company names would be easy to extract if they always followed a fixed pattern. In reality, people refer to the same bank in slightly different ways: common forms such as BOA or BOFA, and less common variants like BNKA. So, is it possible to train a model that recognizes all these variations and then automatically routes emails to the person(s) responsible? What's more, a given email may carry a particular intent; can we recognize such relationships and intents and then take appropriate actions?

How can one solve all these Natural Language Understanding problems? Coding a dedicated NLP solution in Python might be one option if you're an experienced data scientist — but that doesn't help a business analyst who cannot or does not want to code. Analysts tend to be domain experts, not data scientists. The head of a business unit may want to quickly know if there is value in a particular data set, and then extract it, before investing too many resources. It's a bit like mining for gold: if you get the location wrong you will waste a lot of time and effort digging.

How can business analysts and domain experts interact with data scientists on such NLP problems more easily and efficiently? How can we use knowledge and entity extraction algorithms, and transfer that learning to a model or a set of models?

Watson Discovery can help.

Please note: entity extraction is not the only function that Watson Discovery can perform, but to familiarize readers with the tool we will start with a simpler use case.

Why Discovery? Because “Speed is the New Currency”

The value of Watson Discovery is not just that it can build entity models; data scientists can do that by coding in Python. Watson Discovery enables business users, SMEs, analysts, and anyone not trained in Python to efficiently and accurately extract value from their data. Data scientists can accelerate their time to value using pre-trained Natural Language Understanding (NLU) models and APIs. Models can be further enhanced through Watson Studio and integrated with open source libraries (NLTK, SciPy) or custom machine learning models using Jupyter notebooks.

Think of this step-by-step guide as rungs on a ladder that progress the reader through many of the complexities associated with entity recognition problems.

Part I — Using out-of-the-box services like NLU in Watson Discovery for quickly building Entity Extraction models

Part II — Enhancing the base model with advanced custom annotations created by domain experts and curated dictionaries, using Watson Knowledge Studio

Part III — Extending the model by integrating custom machine learning or Python code, using Watson Studio via Jupyter notebooks

Watson Discovery is offered both on IBM Cloud and as part of IBM Cloud Pak for Data (ICP4D). This guide aims to cover both platforms, with call-outs for any special actions needed on ICP4D.

Part I — Using NLU in Watson Discovery for quickly building Entity Extraction model

Above is the Node.js app that shows the output of our email extraction problem. The left-side pane shows the entities extracted from this email set. As you can see, it has extracted not just people's names but also company names. The most interesting aspect of the model is its ability to recognize and associate acronyms of names.

But how much time, Python coding, or model-training CPU was needed to get this result?

The answer is ZERO. How? Read on.

Data

For this problem we used the Enron email dataset from https://www.cs.cmu.edu/~enron/, which is legally available for academic and industry purposes.

Currently, the native email format is not directly supported by Watson Discovery. We can convert emails using any open source .eml parser into one of the supported formats, including .txt. In this case we parsed the emails into .doc format and then used them for subsequent analysis. Some email vendors also provide parsers that allow emails to be exported directly in .txt format.
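The article does not specify which parser was used; as a minimal sketch, the Python standard library's `email` module is enough to turn a raw .eml message into plain text that Discovery can ingest. The message content below is illustrative.

```python
# Sketch: convert a raw RFC 822 email (.eml contents) into plain text
# suitable for Watson Discovery ingestion, using only the standard library.
from email import message_from_string

def eml_to_text(raw_eml: str) -> str:
    """Extract the subject and plain-text body from a raw email message."""
    msg = message_from_string(raw_eml)
    parts = []
    if msg.is_multipart():
        # Keep only the text/plain parts; skip attachments and HTML.
        for part in msg.walk():
            if part.get_content_type() == "text/plain":
                parts.append(part.get_payload(decode=True).decode(errors="replace"))
    else:
        parts.append(msg.get_payload())
    subject = msg.get("Subject", "")
    return f"{subject}\n\n" + "\n".join(parts)
```

The resulting string can be written out as a .txt file, one per email, before importing the collection.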

1. Create the Instance (Only For ICP4D version)

For the ICP4D version of Watson Discovery, users need to first create an instance on the cluster. Users who intend to work on a project need to be given access to this instance. After the instance is created, launch the tool and begin importing documents for text processing.

2. Import the collection

The tool allows users to import various document formats such as .doc, .ppt, .pdf, .txt, etc. Users can create a new collection for a new project and import documents into it.

For this project, we imported the emails into a collection. Once done, you can open the collection for further processing.

3. Analyze Imported Documents

Once documents are imported, the user can add various out-of-the-box enrichments to the input text. For example, four enrichments are added to the above collection on IBM Cloud:

  • Entity Extraction
  • Concept Tagging
  • Sentiment Analysis
  • Category Classification

The entity extraction enrichment in this case is based on NLU (Natural Language Understanding). It is trained on millions of web archives and news articles and can extract a rich set of entity types. The extracted entities are mapped to DBpedia, which contains one of the largest dictionaries of real people, companies, locations, and other entities. Full API details can be found at https://cloud.ibm.com/apidocs/natural-language-understanding
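To make the enrichment output concrete, here is a sketch of post-processing the `entities` portion of an NLU `/analyze` response. The sample response below is hand-written to mirror the documented shape (each entity carries a `type`, `text`, and `relevance` score); it is not an actual API result, and the names in it are illustrative.

```python
# Sketch: flatten the "entities" array of an NLU /analyze response into
# (type, text, relevance) tuples, dropping low-relevance hits.
def top_entities(nlu_response: dict, min_relevance: float = 0.5):
    """Return entities at or above a relevance threshold, highest first."""
    entities = nlu_response.get("entities", [])
    kept = [
        (e["type"], e["text"], e["relevance"])
        for e in entities
        if e.get("relevance", 0.0) >= min_relevance
    ]
    return sorted(kept, key=lambda t: t[2], reverse=True)

# Hand-written sample mirroring the NLU entities response shape.
sample = {
    "entities": [
        {"type": "Company", "text": "Bank of America", "relevance": 0.91},
        {"type": "Person", "text": "Jade Zhou", "relevance": 0.78},
        {"type": "Location", "text": "Houston", "relevance": 0.33},
    ]
}
```

With the sample above, only the company and the person survive the default 0.5 threshold.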

4. Extract Entities

It is possible to visualize all the extracted entities using a query language. For example, to see extracted entities of type "Company", a visual query can be built as shown below.
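The same query the tool builds visually can also be expressed programmatically. The sketch below assembles Discovery Query Language parameters as a plain dict, so it can be passed to any HTTP client or SDK; the enrichment field path `enriched_text.entities` is the common default, but adjust it to match your collection's configuration.

```python
# Sketch: Discovery query parameters that surface entities of type "Company",
# using Discovery Query Language filter and aggregation syntax.
def company_entity_query(count: int = 10) -> dict:
    return {
        # Keep only documents containing a Company-typed entity.
        "filter": 'enriched_text.entities.type::"Company"',
        # Aggregate the most frequent company entity mentions.
        "aggregation": f"term(enriched_text.entities.text,count:{count})",
        "count": count,
    }
```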

The above output in JSON format can be passed to any downstream app for visualization or further processing of the results. The relevance score in the query results uses a TextRank algorithm to gauge how relevant a particular entity really is within a document. This is different from the Relevancy Training that we will see next. More info can be found at https://developer.ibm.com/dwblog/2017/watson-nlu-natural-language-understanding-metadata-concepts-categories/
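As one example of such downstream processing, the email routing scenario from earlier can be sketched as a small function over the extracted entities. The routing table and the fallback queue name here are hypothetical.

```python
# Sketch: route an email to an owner based on its highest-relevance
# Company entity from the Discovery/NLU JSON output.
def route_email(entities: list, routing_table: dict, default: str = "triage") -> str:
    """Pick the owner for the most relevant Company entity, else a default queue."""
    companies = [e for e in entities if e.get("type") == "Company"]
    if not companies:
        return default
    best = max(companies, key=lambda e: e.get("relevance", 0.0))
    return routing_table.get(best["text"], default)
```

Because the acronym variants (BOA, BOFA) are resolved to one canonical entity by the model, a single routing-table entry per company is enough.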

5. Relevancy Training

One of the key aspects of entity extraction is verifying how relevant the results are compared to expectations. This can differ not just from domain to domain or problem to problem, but also from query to query. Models are also expected to learn continuously from relevance feedback. One easy way to achieve this is "Relevancy Training", whereby users pass their feedback to the model.
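Feedback can be supplied through the tool or via the API. As a sketch, the payload for Discovery's training-data endpoint pairs a natural language query with relevance judgments on documents; the document IDs below are placeholders, and relevance is an integer score where higher means more relevant for that query.

```python
# Sketch: build a relevancy-training payload for Watson Discovery.
def training_example(query: str, judgments: dict) -> dict:
    """judgments maps document_id -> relevance score (e.g. 0 or 10)."""
    return {
        "natural_language_query": query,
        "examples": [
            {"document_id": doc_id, "relevance": score}
            for doc_id, score in judgments.items()
        ],
    }
```

Submitting enough of these examples lets Discovery re-rank query results toward what users actually judged relevant.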

6. AND THAT’S IT. WE ARE DONE!

That’s how simple it is to extract entities using Watson Discovery. Any app, like the Node.js app mentioned above, can now consume the results and make them available for wider use. This functionality can solve 70–80% of modelling problems right away and yields a consumable model in a jiffy, which becomes a baseline for validating the business efficiency of the proposed data science project.

7. Custom Enrichment (Optional)

Users can further enrich documents visually by providing custom annotations. For example, users can open an email, create a new field for Entity=Company, and then visually identify the text corresponding to the company entity. Once the enrichment is submitted, it can be used to query content as shown above in step 4.

Watson Discovery also provides capabilities for “Smart Document Understanding” and OCR reading. Once enriched, you can reprocess your collection of documents for the changes to take effect.


Watson Discovery Under the Hood

Watson Discovery’s architecture is open and scalable. Each workspace collection creates shards for processing queries, and these can be extended by adding more capacity to the underlying instance. The different pieces (Watson Discovery, Watson Knowledge Studio, and NLU) are connected by an API framework, making them easier to consume and embed.

What does Watson Discovery Offer On-Prem?

A big concern for enterprise customers is that some data should never go beyond their firewall. This rules out most public cloud options for building entity extraction models. Watson Discovery on Cloud Pak for Data not only provides the option to build them on-prem, but also indemnifies the development process using IBM-vetted libraries and packages.

Watson Discovery on Cloud Pak for Data includes:

  • Smart Document Understanding: visually annotate and enrich documents, including image formats, using OCR
  • The ability to identify entities in customer conversations using Watson Assistant
  • Integration with Watson Knowledge Studio for domain-specific problems

In the next part we will see how Watson Knowledge Studio empowers subject matter experts (SMEs) and analysts with self-service, domain-aware entity extraction capabilities. This helps enhance the baseline entity extraction model for specific use cases in industries like healthcare, mining, or finance by providing custom dictionaries, human annotations, and training of a domain-specific machine learning model.

Project Github Link:

https://github.com/greenorange1994/EmailRoutingByWatsonDiscovery

Kunal Sawarkar
IBM Data Science in Practice

Distinguished Engg- Gen AI & Chief Data Scientist@IBM. Angel Investor. Author #RockClimbing #Harvard. “We are all just stories in the end, just make a good one"