Using IBM Watson Discovery’s New Document Segmentation Feature

Published in IBM watsonx Assistant on May 14, 2018

The IBM Watson Discovery (WDS) product team regularly consults with developers and reviews feedback to make sure our products meet what you, our clients, genuinely need. Document Segmentation is one of the features you asked us to implement. Last year, in response to that feedback, the team released a beta version of Document Segmentation. We know you have been waiting patiently for the full version since October 2017, and it is now available. The beta has been a success since its launch, and the generally available version removes some of the beta’s limitations. For example, you can now extract PDF, HTML, Microsoft Word, and custom metadata, and split a document into as many as 250 segments, up from the 50 allowed in the beta.

What is Document Segmentation?

Let’s assume that you have to upload an entire architectural manual into Watson Discovery to be ingested and enriched. A manual like this is likely to contain several levels of headings and subheadings. In HTML, headings are defined with the <h1> to <h6> tags. Document Segmentation divides a document at the heading levels you select: each time one of the selected tags is detected, a new segment begins, and each segment is enriched separately. If you select <h1> and <h2>, for example, the content under an <h1> heading ends up in a different segment from the content under the <h2> heading that follows it.
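To make the idea concrete, here is a minimal Python sketch of heading-based splitting. It is an illustration only and not how Watson Discovery implements the feature: the class HeadingSegmenter, the function segment_html, and the sample manual are all hypothetical names invented for this example, and the sketch uses nothing but the Python standard library.

```python
from html.parser import HTMLParser


class HeadingSegmenter(HTMLParser):
    """Collect text into segments, starting a new one at each selected heading tag."""

    def __init__(self, split_tags):
        super().__init__()
        self.split_tags = set(split_tags)
        self.segments = []   # finished segments, each a list of text fragments
        self._current = []   # fragments of the segment being built

    def handle_starttag(self, tag, attrs):
        # A selected heading closes the current segment (if any) and opens a new one.
        if tag in self.split_tags and self._current:
            self.segments.append(self._current)
            self._current = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self._current.append(text)

    def close(self):
        super().close()
        # Flush whatever text remains after the last heading.
        if self._current:
            self.segments.append(self._current)
            self._current = []


def segment_html(html, split_tags=("h1", "h2")):
    """Return one plain-text string per segment."""
    parser = HeadingSegmenter(split_tags)
    parser.feed(html)
    parser.close()
    return [" ".join(parts) for parts in parser.segments]


# Hypothetical sample content for illustration.
manual = (
    "<h1>Walls</h1><p>Load-bearing walls.</p>"
    "<h2>Materials</h2><p>Brick and stone.</p>"
    "<h1>Roofs</h1><p>Pitched roofs.</p>"
)

for number, segment in enumerate(segment_html(manual, ("h1", "h2")), start=1):
    print(number, segment)
```

Splitting on both <h1> and <h2> yields three segments for this sample, while splitting on <h1> alone yields two, because the "Materials" subsection then stays with its parent "Walls" section.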

Why do I need to split my documents?

Sometimes, documents are big! They are big for a reason (the information needs to be stored together), but their size makes it harder to find quick answers to questions. These documents could be user manuals, frequently asked questions, catalogs, and more. In them, the information is typically organized into sections that each cover a specific topic. Those topic-level sections are very useful in an application designed to get the user to an answer quickly. That is the motivation for segmenting documents, either manually or with the newly released Document Segmentation feature, so that they are ingested, enriched, and analyzed in their logical parts.

The Document Segmentation feature splits an unstructured document into useful chunks that are then enriched and stored as individual searchable results. It can also improve result ranking when you perform relevancy training (specifically for natural language queries), because the training is performed on focused, topic-specific segments instead of on the entire general document.

Use Case — Company XYZ wants to ingest their company’s policy documentation

Company XYZ has multiple policy documents, each with over 100 separate clauses. It has decided to use IBM Watson Discovery to ingest, enrich, and query those documents. XYZ has two options: employ someone to manually save each clause as a separate document so that the clauses can be uploaded to WDS individually, or use the Document Segmentation feature to segment the original documents based on their headings. XYZ chooses to try the Document Segmentation feature (good choice).

The company’s policies are saved as Word documents. (The system would also be able to split them if they were saved as PDF or HTML files.) To ingest the documents, XYZ uploads them to its Discovery instance. Once they are uploaded, WDS converts each document to HTML. With Document Segmentation enabled, each document is then broken down into smaller chunks according to the configuration: a new segment starts each time one of the specified <h> tags is detected. XYZ can either change the <h> tags in its documents or change the configuration rules that determine how WDS splits the documents and finds their <h> tags. Once a document is broken down, each segment is treated as a separate document that is enriched and indexed on its own.
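As a rough sketch of what XYZ’s configuration change might look like, the Python snippet below adds a segmentation block to a configuration dictionary and prints it as JSON. The field names conversions, segment, enabled, and selector_tags are assumptions based on our recollection of the Discovery configuration format and should be checked against the current API reference. The helper with_segmentation and the configuration name "policy-docs" are hypothetical and exist only for this example.

```python
import copy
import json


def with_segmentation(config, selector_tags=("h1", "h2")):
    """Return a copy of a configuration dict with segmentation switched on.

    The key names below are assumed, not confirmed against the API reference.
    """
    updated = copy.deepcopy(config)  # leave the caller's dict untouched
    conversions = updated.setdefault("conversions", {})
    conversions["segment"] = {
        "enabled": True,
        # Heading tags at which a new segment should start.
        "selector_tags": list(selector_tags),
    }
    return updated


# Hypothetical starting configuration for XYZ's policy collection.
base_config = {"name": "policy-docs", "conversions": {}}

print(json.dumps(with_segmentation(base_config, ["h1", "h2"]), indent=2))
```

Building the fragment in code rather than editing JSON by hand makes it easy to try different heading levels and compare the resulting segments.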

Use Case — My company’s feedback system

You have a company feedback form that your customers and clients use to let you know what they think about your product. As a company, you want to review these comments to understand more about what your customers and users are saying. You can upload your Word document(s) into WDS and query them. During configuration, you can use the split function to divide each document into two or more chunks. After the split, each chunk is enriched separately. Without splitting, enrichment runs on the entire document. This might work in some cases, but it can also lead to false conclusions, for example when positive and negative comments in the same document cancel each other out.
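The toy Python example below shows how enriching a whole document can hide what its parts say. It stands in for real enrichment with a deliberately naive word-count score. The word lists, the function toy_sentiment, and the sample comments are all invented for illustration and have nothing to do with how Watson Discovery computes sentiment.

```python
# Tiny, made-up word lists; a real enrichment service is far more sophisticated.
POSITIVE = {"great", "love", "fast"}
NEGATIVE = {"slow", "broken", "confusing"}


def toy_sentiment(text):
    """Score text as (number of positive words) minus (number of negative words)."""
    words = [word.strip(".,!?").lower() for word in text.split()]
    positives = sum(word in POSITIVE for word in words)
    negatives = sum(word in NEGATIVE for word in words)
    return positives - negatives


# Two hypothetical feedback comments stored in the same document.
comments = [
    "Love the new dashboard, it is great and fast.",
    "Export is broken and the settings page is confusing.",
]

whole_document_score = toy_sentiment(" ".join(comments))
per_segment_scores = [toy_sentiment(comment) for comment in comments]

print("whole document:", whole_document_score)
print("per segment:", per_segment_scores)
```

Scored as one document, the feedback looks mildly positive overall. Scored per segment, it becomes clear that one comment is strongly positive and the other is negative, which is the kind of detail that splitting before enrichment preserves.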

Give it a shot!

If you have not tried the new Document Segmentation feature, now is a good time to check it out and see what it can do. There are many use cases and benefits, and we want you to try it. If you have any suggestions or feedback, please submit them via our Ideas Portal.

Originally published at developer.ibm.com on May 14, 2018.