Organizations are challenged every day to surface insights and capture information quickly. While tools like IBM Cognos and IBM SPSS are well equipped to handle structured data, organizations struggle when it comes to unstructured data that lives in PDFs, Word documents, PowerPoint presentations, and Excel worksheets, to name a few. With analyst firms such as Gartner estimating that 80% of the world’s data is unstructured, the question becomes: how can an organization best find actionable insights in this sea of unstructured data?
A technique most commonly used to uncover these insights is Natural Language Processing (NLP).
If you have tried to perform NLP on many text documents, you might have realized that annotating them is often laborious and time-consuming. In the quest to capture relevant information, it becomes important to clean up the documents and eliminate unnecessary information. Documents may have headers and footers that contain no important information. They can also be riddled with other features, like figures and tables, that are difficult for traditional text parsers to identify and make sense of. Sometimes it is useful to split a document into sections, but there is no easy way to divide it semantically without having someone manually go through the entire document.
How IBM Watson Discovery Can Help
IBM’s Watson Discovery is an AI-powered search and text analytics tool that enables you to receive targeted and specific answers from your data, all while keeping it private and secure on the IBM Cloud or any other cloud or on-premises environment.
Smart Document Understanding (SDU) enables you to identify and extract custom fields in your documents uploaded to Discovery. You can use SDU to annotate fields in your documents, and as you do so, it automatically learns from these annotations and starts applying them across your document collection. Even better, SDU allows you to customize how your documents are indexed into Discovery, which in turn will improve the answers that your application returns. It is leaps ahead of having to manually split up or extract information from each page of a document, which could take hours for something like a textbook or many days for a large collection of documents.
Getting Started with Smart Document Understanding
To get started with SDU, you need to create an IBM Cloud account and set up a Watson Discovery instance.
After uploading the document, choose the ‘Configure Data’ option to launch Smart Document Understanding.
Under the ‘Identify Fields’ tab, you can tag fields based on the ones described to the right.
SDU can be used to annotate a variety of documents. For example, the textbook below has been annotated using a few of the built-in fields. You simply highlight the text boxes with the color corresponding to the field label.
Below are a couple of additional examples showing how more complex features, like tables within the text, can be annotated using SDU as well. Discovery not only distinguishes between features like text and tables, it also maintains the structure of the annotated table. This means that when you query for data in this technical manual, for example, you’ll be able to associate cell text with the column header it belongs to. This capability could allow you to answer questions like “what are the minimum processing requirements for a single computer installation?”, whereas many text parsers will only see the table as a single stream of text.
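To illustrate why keeping the table structure matters, here is a minimal Python sketch. The table representation, the function names, and the cell values are all made up for illustration; this is not Discovery's actual output format. It only contrasts a flattened stream of text with a structure in which each cell stays tied to its column header.

```python
# Hypothetical table representation with made-up values.
# This is NOT the format Discovery returns; it only illustrates the idea.
TABLE = {
    "headers": ["Installation type", "Minimum processing requirements", "Minimum memory"],
    "rows": [
        ["Single computer", "2 GHz dual-core", "8 GB"],
        ["Distributed", "3 GHz quad-core", "16 GB"],
    ],
}


def flatten(table):
    """What a naive text parser sees: one stream of text with no structure."""
    cells = list(table["headers"])
    for row in table["rows"]:
        cells.extend(row)
    return " ".join(cells)


def lookup(table, row_label, column_header):
    """Return the cell under column_header in the row whose first cell is row_label."""
    column = table["headers"].index(column_header)
    for row in table["rows"]:
        if row[0] == row_label:
            return row[column]
    return None  # no row with that label


if __name__ == "__main__":
    # The flattened text contains the answer but not which header it belongs to.
    print(flatten(TABLE))
    # With structure preserved, the question maps to a row and a column.
    print(lookup(TABLE, "Single computer", "Minimum processing requirements"))
```

With the structure intact, the question becomes a row and column lookup; with the flattened text, the same value is present but nothing says which header or row it belongs to.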
As you annotate the pages, Watson learns and predicts the annotations in real time!
The SDU model continues training on the new annotations as you provide them, and the pages that follow are marked up with its live predictions. This is really amazing, because you’ll be able to see the model improving as you complete additional training, and when the model is performing at the level that you expect, you know you can stop training. This is in contrast to traditional systems where you have to first gather training data, then submit that training data for model creation, wait for the model to complete training, and then analyze the results. Having all of this performed for you in real time can reduce the time required by an order of magnitude. The trained machine learning models can be exported and used on any similar documents in other Discovery collections.
SDU also has the option to create custom fields. If your document has specific attributes, such as ‘Code’ or ‘References’, that you want to mark, you can create custom fields to identify them.
Managing Where SDU is Applied
Once the entire document is annotated, you can switch to the ‘Manage Fields’ tab to pick the fields you’re interested in.
Here you can select which fields you want to include in your indexed data. This means you can remove things like headers, footers, or anything else that is not beneficial to include in your documents, thereby improving the quality of the query results. The same approach can be used to focus on text in certain sections.
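The effect of choosing which fields to index can be sketched in a few lines of Python. This is an illustration only: the list of (label, text) blocks and the function name are assumptions made for the example, and in Discovery the selection is made in the ‘Manage Fields’ tab rather than in code.

```python
def index_text(blocks, included_fields):
    """Keep only the text of blocks whose field label was selected for indexing."""
    return [text for label, text in blocks if label in included_fields]


# Hypothetical annotated page: each block is (field label, text).
PAGE = [
    ("header", "ACME Installation Guide"),
    ("title", "Getting started"),
    ("text", "Unpack the device and connect the power cable."),
    ("footer", "Page 3 of 120"),
]

if __name__ == "__main__":
    # Headers and footers are left out, so they cannot show up in query results.
    for line in index_text(PAGE, {"title", "text"}):
        print(line)
```

Leaving ‘header’ and ‘footer’ out of the selection means repeated boilerplate never reaches the index.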
SDU can also be used to split up documents. Instead of the monotonous task of manually dividing up the document, SDU can be used to segment the document semantically based on headers, subtitles or any other fields. Instead of simply dividing the document based on page numbers, SDU divides it into sections that make sense.
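Splitting on a field can be pictured with the following minimal sketch. It assumes a document is available as an ordered list of (label, text) blocks, which is a simplification made for this example and not how Discovery stores documents. Every block carrying the split field starts a new segment.

```python
def split_on_field(blocks, split_field):
    """Group ordered (label, text) blocks into segments.

    A new segment starts at every block whose label equals split_field.
    Blocks before the first such block form their own leading segment.
    """
    segments = []
    current = []
    for label, text in blocks:
        if label == split_field and current:
            segments.append(current)
            current = []
        current.append((label, text))
    if current:
        segments.append(current)
    return segments


# Hypothetical annotated document.
DOCUMENT = [
    ("title", "Employee handbook"),
    ("subtitle", "Working hours"),
    ("text", "Core hours are defined by each team."),
    ("subtitle", "Time off"),
    ("text", "Requests are submitted through the portal."),
]

if __name__ == "__main__":
    for number, segment in enumerate(split_on_field(DOCUMENT, "subtitle"), start=1):
        print(number, [text for _, text in segment])
```

Each segment keeps a subtitle together with the text that follows it, which is the sense in which the split is semantic rather than page-based.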
Once you’ve selected the fields to index and split on, you can click “Apply changes to collection” and all new documents uploaded to your collection will have the SDU model applied to them during ingestion. Note that you will need to re-upload your existing documents for SDU to be applied to them as well.
The training that SDU does is based on the visual aspects of the features rather than the text. It looks at features such as font, size, and boldface, and at ordering: for example, you would expect the Title to appear before the Subtitle. It also trains on the ‘boxes’ you annotate, using the topological features of each ‘box’, such as its size, its position within the page, and the features of neighboring boxes.
It’s critically important to remember that SDU only uses the visual features of the document to train on. In other words, the textual context is not fed into the model, so you can’t label sections of the text based on the text itself, but only on things like font size, color, position in relation to other fields, etc.
On a page like this, it would intuitively make sense to mark ‘Philosophy and team structure’ as the ‘Title’ and then create a custom field titled ‘Philosophy’ and mark the section underneath with this custom field. However, this would break our rule of not using textual context. SDU understands that ‘Philosophy and team structure’ is a title based on the font, size, and relation to other fields, but it does not carry the textual context forward. So, the section beneath it can’t be specifically labeled as a ‘Philosophy’ section, but would rather just be labeled as ‘text’.
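The visual-only rule can be made concrete with a small Python sketch. The exact features and model that SDU uses are not described here, so the `Box` fields and the feature list below are assumptions chosen to mirror the properties mentioned above (size, boldface, position, and a neighboring box). The point is that the text itself never enters the feature vector.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Box:
    """Hypothetical annotated box; field names are illustrative only."""
    text: str
    font_size: float
    bold: bool
    x: float
    y: float
    width: float
    height: float


def visual_features(box: Box, previous: Optional[Box] = None) -> list:
    """Build a feature vector from layout properties only.

    box.text is deliberately ignored, mirroring the rule that textual
    context is not fed into the model.
    """
    previous_font_size = previous.font_size if previous is not None else 0.0
    return [
        box.font_size,
        float(box.bold),
        box.x,
        box.y,
        box.width,
        box.height,
        previous_font_size,  # one feature taken from the neighboring box
    ]


if __name__ == "__main__":
    title = Box("Philosophy and team structure", 24.0, True, 50.0, 40.0, 400.0, 30.0)
    body_a = Box("Our philosophy is ...", 11.0, False, 50.0, 90.0, 400.0, 200.0)
    body_b = Box("Our pricing is ...", 11.0, False, 50.0, 90.0, 400.0, 200.0)
    # Same layout, different wording: the two vectors are identical.
    print(visual_features(body_a, title) == visual_features(body_b, title))
```

Two body sections with the same layout produce the same vector whatever they say, which is why a section cannot be labeled ‘Philosophy’ on the basis of its wording.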
How SDU is Being Used
In many cases Discovery gets used in conjunction with IBM Watson Assistant to help you quickly populate your Assistant with potential answers from a FAQ or knowledge base or to act as a fallback for complex questions (that aren’t defined in the Assistant dialog), which are sometimes referred to as “long tail” questions. Instead of manually building dialog and responses to every possible question, documents are loaded into Discovery and can be queried using a natural language question to retrieve documents and passages that might answer the question. SDU can take this one step further by breaking large documents down into smaller chunks, while still keeping content logically grouped together, thereby improving the specificity of the results. Additionally, because SDU applies field labels to the text within your documents, you can display the content to end-users with rich formatting as it was originally intended.
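The fallback pattern described above can be sketched as follows. Nothing here calls the real Watson Assistant or Discovery APIs: `classify` and `search` are stand-in callables supplied by the caller, and the confidence threshold of 0.5 is an arbitrary assumption for the example.

```python
def route(question, classify, search, threshold=0.5):
    """Answer from dialog when the intent is confident, else fall back to search.

    classify(question) must return (intent_name, confidence between 0 and 1).
    search(question) must return a list of passages.
    """
    intent, confidence = classify(question)
    if confidence >= threshold:
        return ("dialog", intent)
    return ("search", search(question))


# Stand-ins for illustration only; a real solution would call the services.
def fake_classify(question):
    if "opening hours" in question.lower():
        return ("store_hours", 0.9)
    return ("unknown", 0.1)


def fake_search(question):
    return ["Passage retrieved for: " + question]


if __name__ == "__main__":
    print(route("What are your opening hours?", fake_classify, fake_search))
    print(route("How do I recalibrate the sensor after a firmware update?",
                fake_classify, fake_search))
```

Only the “long tail” question reaches the search branch, so the dialog stays small while less common questions still get an answer drawn from the documents.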
We’ve helped clients use SDU for a wide variety of use cases. Below are just a few examples:
- A large retail chain was building a chat assistant using Watson Assistant and wanted to use Discovery to answer FAQs instead of building additional dialog. This was a design move to reduce the time required to modify FAQs and implement new ones. In order to do this, they used SDU to annotate ‘question’ and ‘answer’ fields within their FAQ documents and split on the ‘question’ field. This effectively created separate documents within Discovery for each question/answer pair, and made them very simple to query and get relevant results from in the broader chat solution.
- An automobile manufacturer was preparing their employees to return to work amidst the ongoing health crisis and had prepared documents detailing policies and practices to follow. To make it easier for their employees to get answers to their questions, they wanted to expose this information through their chatbot using Watson Assistant. Using SDU, these documents were chunked into much smaller documents, which not only made it easy to get more targeted answers, but also made it possible to perform relevancy training with Discovery to improve the relevance of the natural language query results.
- A large technology company had been manually pulling data from hundreds of lengthy documents and putting it into a spreadsheet format, and they needed a solution to automate this process. With SDU, the content within these documents was labeled and split into hundreds of new documents, and then a microservice called the Discovery API to get the content and translate it into the required spreadsheet format. This process previously took a person hours to complete manually, but with Discovery it was reduced to less than a minute.
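As a rough illustration of the last example, the translation step of such a microservice might look like the sketch below. It assumes the split documents have already been retrieved and are available as plain dictionaries keyed by field label; the field names are hypothetical, and the call to the Discovery API is deliberately left out.

```python
import csv
import io


def to_csv(documents, columns):
    """Write one CSV row per document, keeping only the requested columns."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=columns,
        extrasaction="ignore",  # drop fields that are not requested
        lineterminator="\n",
    )
    writer.writeheader()
    for document in documents:
        # Missing fields become empty cells instead of raising an error.
        writer.writerow({column: document.get(column, "") for column in columns})
    return buffer.getvalue()


# Hypothetical documents, as if produced by splitting on a 'question' field.
DOCS = [
    {"question": "How do I reset my password?", "answer": "Use the reset link.", "page": 4},
    {"question": "Where is my order?", "answer": "Check the tracking page."},
]

if __name__ == "__main__":
    print(to_csv(DOCS, ["question", "answer"]))
```

Because each split document carries its field labels, mapping labels to spreadsheet columns is a simple lookup.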
If you’re interested in consulting with IBM on your AI projects, you can learn more about and sign up for an engagement with the IBM Garage or IBM Data and AI Expert Labs & Learning. The team of experts at IBM is experienced in implementing Watson for a variety of industries and use cases and would be happy to help with yours.