Accelerating data extraction from complex business documents with IBM watsonx.ai and Watson Discovery

Daiki Tsuzuku
Dec 8, 2023


An annual report is a comprehensive document that provides a detailed overview of a company’s financial performance, operations, and achievements over a year. It is one of the essential tools that investors, stakeholders, and decision-makers use to assess a company’s business. However, it is usually a large document (for example, IBM’s annual report runs to more than 100 pages!) and its format varies from company to company. This makes extracting valuable information by hand not only time-consuming but also prone to errors and bias. Automating the extraction reduces cost, time, and mistakes, which can contribute to high-value business outcomes.

Why you need to use Watson Discovery

IBM Watson® Discovery can help you and your employees automatically extract information buried in large and complex documents such as annual reports. It can extract text from large and complex PDF documents using advanced OCR technology, extract meaningful information such as entities from that text using advanced NLP/NLU, and visualize the extracted entities in a preview of the original PDF document.

Moreover, Watson Discovery’s entity extraction lets you create and use your own custom entity extractor. You can prepare labeled data efficiently with a low-code/no-code approach and train a machine learning model to extract the entities that fit your business scenario. You can retrain the extractor after adding or updating labeled data, so it keeps improving and adapts to changes in your business environment.

You can learn how to use Watson Discovery’s entity extraction here:

https://cloud.ibm.com/docs/discovery-data?topic=discovery-data-entity-extractor

You can learn more about the value that Watson Discovery’s entity extraction can provide in this article:

https://kait-arnold.medium.com/using-ai-powered-watson-discovery-to-accelerate-your-manual-repetitive-financial-processes-92e0050bbb4a

Why and how you use watsonx.ai

Watson Discovery already provides a built-in entity extractor for general proper nouns, and you can also create and use your own entity extractor for specific cases. However, the recent rise of large language models, or foundation models, brings forth another scenario, where using your favorite pre-trained or fine-tuned large language model has become table stakes. These models have a remarkable ability to understand language, draw on common knowledge, and generalize. You can perform various tasks with them without any training, or fine-tune them to your specific task with less data.

IBM has introduced watsonx to provide self-service access to high-quality, trustworthy data, enabling users to collaborate on a single platform where they can build and refine foundation models.

watsonx.ai currently provides several types of foundation models, such as Granite models and Slate models (https://www.ibm.com/blog/introducing-the-technology-behind-watsonx-ai/). In this blog, we will show you how to leverage watsonx.ai for entity extraction, both with generative models through prompt engineering and with fine-tuned extractive models.
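As a rough illustration of the prompt-engineering route, the sketch below builds a few-shot prompt for entity extraction and parses the model’s reply. The prompt layout, the "type: value" output convention, the made-up example sentence, and the function names build_prompt and parse_entities are all assumptions for this example, not a format prescribed by watsonx.ai. The call to the model itself is deliberately left out.

```python
from typing import Dict, List, Tuple

# (sentence, [(entity_type, entity_text), ...]); a hypothetical example format
Example = Tuple[str, List[Tuple[str, str]]]


def build_prompt(examples: List[Example], text: str) -> str:
    """Assemble a few-shot prompt (hypothetical layout) for entity extraction."""
    parts = ["Extract the financial entities from the input as 'type: value' lines.", ""]
    for sentence, entities in examples:
        parts.append(f"Input: {sentence}")
        parts.append("Output:")
        parts.extend(f"{etype}: {value}" for etype, value in entities)
        parts.append("")
    # The final block is left open so the model completes the output.
    parts.append(f"Input: {text}")
    parts.append("Output:")
    return "\n".join(parts)


def parse_entities(generated: str) -> List[Dict[str, str]]:
    """Parse 'type: value' lines; lines without a colon are ignored."""
    entities = []
    for line in generated.splitlines():
        etype, sep, value = line.partition(":")
        if sep and etype.strip() and value.strip():
            entities.append({"type": etype.strip(), "text": value.strip()})
    return entities


# Illustrative, made-up few-shot example
examples = [
    ("Revenue was $10 million in 2022.", [("metric", "Revenue"), ("value", "$10 million")]),
]
prompt = build_prompt(examples, "Net income rose to $5 million.")
```

The prompt string would then be sent to a generative model of your choice, and parse_entities would be applied to whatever text comes back.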

How to fine-tune and use watsonx.ai Slate model

So, can you not use Watson Discovery in that scenario? Don’t worry, there is good news. Watson Discovery has released two new features (https://cloud.ibm.com/docs/discovery-data?topic=discovery-data-release-notes#discovery-4october2023):

  1. Export labeled data for an entity extractor
  2. External enrichment feature to annotate documents with a model of your choice

With these two features, you can prepare labeled data to train a watsonx.ai Slate model with less effort and use a fine-tuned model as a Watson Discovery enrichment to extract entities.

How do we use these features to extract information from an annual report? I’ll show that step by step.

  1. Prepare your labeled data in your Watson Discovery entity extractor workspace. You can use a collection of financial statements as the data to label.
  2. Download the labeled data from the workspace.
  3. Upload the labeled data to your Watson Studio Jupyter Notebook, then fine-tune and deploy a Slate model.
    (Check this article to learn more about fine-tuning a watsonx.ai model: https://medium.com/@alex.lang/fair-is-fast-and-fast-is-fair-ibm-slate-foundation-models-for-nlp-3508412a4b04. Once this becomes available in the public cloud, you can do the same thing there.)
  4. Implement your webhook application to get the results predicted by your fine-tuned Slate model.
  5. Create a webhook enrichment using Watson Discovery’s create enrichment API: https://cloud.ibm.com/apidocs/discovery-data#createenrichment
    curl -X POST {auth} \
      --header 'Content-Type: multipart/form-data' \
      --form 'enrichment={"name":"my-first-webhook-enrichment",
        "type":"webhook",
        "options":{"url":"{your_code_engine_app_domain}/webhook",
          "headers":[
            {
              "name": "Authorization",
              "value": "Bearer {SCORING_API_TOKEN}"
            }
          ],
          "location_encoding":"utf-32"}}' \
      '{url}/v2/projects/{project_id}/enrichments?version=2023-03-31'

  6. Apply the webhook enrichment to your collection and click “Apply changes and reprocess” in Watson Discovery.

  7. Visualize the result in the context of the original annual report PDF document uploaded into Watson Discovery. (Reference)
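If you would rather create the enrichment from a notebook than from the command line, the enrichment definition passed to curl in step 5 can be assembled in Python. This is a minimal sketch that only builds the JSON for the enrichment form field. The function name build_webhook_enrichment and the example URL and token are hypothetical, and the HTTP request itself (authentication, service URL, project ID) is intentionally not shown.

```python
import json


def build_webhook_enrichment(name: str, webhook_url: str, token: str,
                             location_encoding: str = "utf-32") -> str:
    """Return the JSON for the `enrichment` form field, mirroring the curl command in step 5."""
    definition = {
        "name": name,
        "type": "webhook",
        "options": {
            "url": webhook_url,
            # Header that Watson Discovery sends along when it calls the webhook
            "headers": [{"name": "Authorization", "value": f"Bearer {token}"}],
            "location_encoding": location_encoding,
        },
    }
    return json.dumps(definition)


# Placeholder values for illustration only
enrichment_json = build_webhook_enrichment(
    "my-first-webhook-enrichment", "https://example.com/webhook", "TOKEN"
)
```

Building the definition as a dictionary and serializing it with json.dumps avoids the quoting mistakes that are easy to make when hand-writing JSON inside a shell string.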

As you can see, the webhook enrichment (with your fine-tuned watsonx.ai Slate model under the hood) has extracted the financial outcomes and their values successfully. Sweet!

As a review of what we have done so far, I’d like to share two images:

Export the labeled data created in Watson Discovery, upload the data to Watson Studio, fine-tune a Slate model on the data, deploy the Slate model as a Watson Machine Learning deployment, and implement and deploy your webhook application, which will act as a front end for your Watson Machine Learning deployment.

This covers steps 1 to 4: preparing your webhook application.

Create a webhook enrichment that points to your webhook application, apply the enrichment to your collection, and start processing documents. Watson Discovery will then enrich your documents with your webhook application.

This covers steps 5 to 7: using your webhook application.

Difficult? Don’t worry. We provide a sample Jupyter notebook and webhook application as a tutorial:

https://github.com/watson-developer-cloud/doc-tutorial-downloads/tree/master/discovery-data/webhook-enrichment-sample/slate

You can reuse them to learn about or develop your own webhook enrichment.

What else you can do with webhook enrichment

You can perform various tasks with a webhook enrichment: not only extracting entities but also assigning labels at the sentence or document level.

Moreover, you can also leverage generative AI to extract entities in a webhook enrichment.

Check this repo to learn what you can do with this powerful feature and how:

https://github.com/watson-developer-cloud/doc-tutorial-downloads/tree/master/discovery-data/webhook-enrichment-sample

  • “regex” provides the simplest sample of how to implement your own webhook application.
  • “granite” provides a sample of how to integrate generative AI for an entity extraction scenario.
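To give a feel for the kind of logic a regex-based webhook application might contain, here is a minimal sketch of the extraction step only. It is not taken from the sample repo: the pattern, the function name extract_amounts, the sample sentence, and the shape of the returned dictionaries are all assumptions. The exact payload that a webhook must return to Watson Discovery is defined by the product documentation and the samples above. One detail worth noting: Python’s re module reports match positions as Unicode code point indices, which should line up with the "utf-32" location encoding used in the curl command earlier, assuming Discovery interprets offsets in that encoding.

```python
import re
from typing import Dict, List

# Hypothetical pattern: dollar amounts such as "$60.5 billion" or "$1,234 million"
AMOUNT = re.compile(r"\$\d+(?:,\d{3})*(?:\.\d+)?(?:\s(?:million|billion))?")


def extract_amounts(text: str) -> List[Dict[str, object]]:
    """Return one dict per matched amount, with code point offsets into `text`."""
    return [
        {
            "entity_type": "amount",    # label chosen for this sketch
            "entity_text": m.group(),
            "begin": m.start(),         # inclusive start offset
            "end": m.end(),             # exclusive end offset
        }
        for m in AMOUNT.finditer(text)
    ]


# Made-up sentence for illustration
sample = "Revenue was $60.5 billion, up from $1,234 million."
amounts = extract_amounts(sample)
```

Each returned offset pair can be checked by slicing: sample[begin:end] gives back the entity text, which is the property a highlight in the document preview would rely on.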

Let’s accelerate your business at unprecedented speed with Watson Discovery and watsonx.ai!
