How to Finetune Entity Recognition Models using Watson Knowledge Studio (Part 2 of 3)

Kunal Sawarkar
IBM Data Science in Practice
6 min read · Oct 16, 2019


A step-by-step guide to how Watson Knowledge Studio empowers subject matter experts with self-service, domain-aware entity extraction capabilities.

This is the second story in a three-part series on this topic. See the previous article on how Watson Discovery can help business analysts jump-start entity recognition model building with zero coding. See the next article on how data scientists can raise their game with sophisticated yet efficient handling of NLP issues while exploiting the integration of Watson Discovery with Jupyter Notebooks.

Written by Kunal Sawarkar & Jade Zhou

Part II — Enhancing the baseline Entity Extraction model using Watson Knowledge Studio with domain-specific annotations

Watson Knowledge Studio (WKS) offers rich capabilities for complex entity recognition, extraction, and relationship understanding problems. Some of these problems involve vocabulary that is specific to particular domains and problem areas. For example, the word “model” may refer to a car in the auto industry, an algorithm in data science, a person in the fashion industry, and a prototype in manufacturing. Watson Knowledge Studio provides hints and tools to capture this knowledge and extract entities that are highly pertinent to a given industry; the medical and banking domains, for instance, are full of such highly specific terms.

1. Install Watson Knowledge Studio and Create an Instance

Watson Knowledge Studio is available as a separate add-on for IBM Cloud Pak for Data. To avoid jumping across different clusters, WKS should be installed on the same cluster where Watson Discovery is located. If it is installed on a different cluster, trained models can be exported from WKS as zip files. To launch WKS, create an instance just as we did for Watson Discovery, and have the admin add the users who are collaborating on the project.

2. Create a Workspace with Tokenization

Once an instance is created, the next task is to make a workspace for documents and import them. At this point you choose which tokenization system to use. The “Default” tokenizer is based on machine learning: it is built on statistical learning over the language of the source documents, and it finds tokens that capture the more natural and nuanced patterns of language. The other option is a “Dictionary” tokenizer based on linguistic patterns. Once selected, the tokenizer cannot be changed.

3. Provide A Domain Specific Entity Type System

A “Type System” defines the items specific to your content domain that you want to label with annotations. The type system controls how content can be annotated by defining the types of entities that can be labeled and how relationships among different entities can be labeled. You can define your own type system with input from subject matter experts. Many industry domains, such as metallurgy, geology, market intelligence, life science, and electronic health records, publish dictionaries or ontologies of domain-specific terminology. Consider referencing these resources to get an idea of the entity types you might want to define in your own type system.

In WKS, a type system can be created from scratch or an existing one can be uploaded. A sample type system based on KLUE is available and can be imported from here.

KLUE stands for Knowledge from Language Understanding and Extraction and was derived by IBM Research from the analysis of collections of news articles. For a general-purpose problem like ours, it provides an excellent set of entity types from which to build an extraction model.
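To make this concrete, here is a minimal sketch of assembling a type system file programmatically. The field names below are illustrative assumptions, not the exact WKS export schema; start from a file exported by WKS (such as the KLUE sample) as the authoritative template.

```python
import json

# A minimal, hypothetical type-system sketch. The field names are
# assumptions for illustration; the exact schema of a WKS type-system
# file may differ, so compare against an exported sample such as KLUE.
type_system = {
    "name": "email-routing-types",
    "entityTypes": [
        {"label": "ORGANIZATION"},  # parent type from KLUE
        {"label": "COMPANY"},       # subtype we care about for routing
        {"label": "PERSON"},
    ],
}

with open("type_system.json", "w") as f:
    json.dump(type_system, f, indent=2)
```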

4. Create a Custom Dictionary

Sometimes, unique abbreviations exist that are used by only one company, one department, or even one person to refer to an entity. For example, someone may refer to a bank that begins with the letter “A” as BAC. In such cases, SMEs can create their own custom dictionaries.
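As a sketch, such a dictionary can be prepared as a small CSV file before uploading it to WKS. The column layout below is a hedged assumption for illustration; verify the exact format WKS expects against the product documentation.

```python
import csv

# Hypothetical dictionary entries mapping in-house abbreviations
# ("surface" forms) to a canonical company name ("lemma"). The column
# layout is an assumption; check the WKS docs for the exact CSV format.
entries = [
    {"lemma": "Bank A Corp", "surface": "BAC"},
    {"lemma": "Bank A Corp", "surface": "Bank A"},
]

with open("company_dictionary.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["lemma", "surface"])
    writer.writeheader()
    writer.writerows(entries)
```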

5. Use Pre-Annotation Services

Now that your documents are ready for training in a specific domain, the next task is to provide a “ground truth,” whereby SMEs annotate a set of documents to build a training set for model building. This can be achieved using a pre-annotation method based on the custom dictionary, which simply looks for the text patterns defined in the previous step.

However, an SME or analyst also has the option to use pre-annotation services like Natural Language Understanding (NLU). This is an excellent choice for generic entity extraction, as the model is trained on millions of news articles.

In the above example, which is based on the type system we imported, WKS provides the option to annotate entities automatically. In our use case, since we want to identify company names in emails, we chose the NLU pre-annotation service for ORGANIZATION and, under it, COMPANY.
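Outside the WKS UI, you can sanity-check what generic NLU entity extraction returns for a sample email before committing to it as a pre-annotator. This sketch uses the ibm-watson Python SDK; the API key, service URL, and sample text are placeholders you would supply from your own NLU instance.

```python
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator
from ibm_watson import NaturalLanguageUnderstandingV1
from ibm_watson.natural_language_understanding_v1 import EntitiesOptions, Features

# Placeholders: supply credentials from your own NLU service instance.
authenticator = IAMAuthenticator("YOUR_API_KEY")
nlu = NaturalLanguageUnderstandingV1(version="2019-07-12", authenticator=authenticator)
nlu.set_service_url("YOUR_SERVICE_URL")

# Ask the out-of-the-box NLU model for entities in a sample email body.
response = nlu.analyze(
    text="Hi team, please forward the BAC mortgage query to underwriting.",
    features=Features(entities=EntitiesOptions(mentions=True)),
).get_result()

for entity in response["entities"]:
    print(entity["type"], entity["text"])
```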

6. SMEs Create a Ground Truth

If needed, an analyst can create a ground truth by manually annotating the document entities. This step is optional and is only needed when pre-annotation is not sufficient.

This is expected to be done on a very small number of documents (dozens) so that WKS can build a model from it.

Once documents are provided with a ground truth, the collection is ready for the machine learning model.

7. Build the Machine Learning Model

The machine learning model is built on a small set of documents with pre-annotations and manual annotations as ground truth, and it provides a full-fledged entity extraction model suitable for domain-specific problems.

For large document sets where no pre-annotation is possible, users can annotate a small number of documents as ground truth, build a machine learning model from them, and then use that model to pre-annotate the remaining document sets. This process can significantly speed up annotation. The option is only available once a machine learning model has been created in WKS.

If adding a new document set, users can run the machine learning annotator created previously (named “Training” above) to pre-annotate the new documents. If the new documents are similar to those originally used to train the annotator, this is probably the best choice for pre-annotation.
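Once the custom model is trained and deployed to an NLU instance, the same SDK call shown earlier can be pointed at it by passing the model ID, so extraction uses your domain-specific annotator instead of the generic one. The credentials and model ID below are placeholders.

```python
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator
from ibm_watson import NaturalLanguageUnderstandingV1
from ibm_watson.natural_language_understanding_v1 import EntitiesOptions, Features

# Placeholders: your NLU credentials and the ID of the WKS model you
# deployed to that NLU instance.
authenticator = IAMAuthenticator("YOUR_API_KEY")
nlu = NaturalLanguageUnderstandingV1(version="2019-07-12", authenticator=authenticator)
nlu.set_service_url("YOUR_SERVICE_URL")

# Route extraction through the custom WKS model via the `model` option.
response = nlu.analyze(
    text="Please route this claim to BAC's servicing team.",
    features=Features(entities=EntitiesOptions(model="YOUR_CUSTOM_MODEL_ID")),
).get_result()

for entity in response["entities"]:
    print(entity["type"], entity["text"])
```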

In the next part of this series, we will see how an enhanced model built in Watson Discovery or Watson Knowledge Studio can be embedded in business applications. We will also look at how results can be consumed in a Jupyter notebook and integrated with other NLP libraries.

Project Github Link:

https://github.com/greenorange1994/EmailRoutingByWatsonDiscovery
