Convert Documents into Knowledge Graph

Pranav Tondgaonkar
5 min read · Nov 23, 2021

Knowledge graphs (KGs) have quickly become one of the most popular tools for modeling the relationships between entities in a wide range of domains. Their ability to represent information as a graph and to derive inferences from it is one of the key reasons for their prominence in information retrieval and representation.

At Recobo, we provide Cognitive Search for your unstructured enterprise documents: a highly accurate, intelligent search service powered by machine learning. By combining KGs with natural language processing and machine learning, we hope to improve existing search results. In this article, we take a look at KGs and walk through an interesting use case.

What is a Knowledge Graph?

A knowledge graph is a way of representing information in the form of a graph, with nodes representing entities and edges representing the relationships between them. Take a look at the following image:

Here, Node A and Node B are two distinct entities. These nodes are linked together by an edge that represents the relationship between them. This is the smallest knowledge graph we can create, also known as a triple. A KG provides a representation that is simple to comprehend for both humans and machines. It is also dynamic, in the sense that new inferences can be drawn from it and it can be remodeled as new data is added over time.
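To make the idea of a triple concrete, here is a minimal Python sketch. The `Triple` class, the `build_index` helper and the relation label `related to` are illustrative names chosen for this example, not part of any library or of our pipeline.

```python
from collections import defaultdict
from typing import NamedTuple


class Triple(NamedTuple):
    """One edge of the graph: (head entity) -[relation]-> (tail entity)."""
    head: str
    relation: str
    tail: str


def build_index(triples):
    """Group outgoing edges by head entity so a node's relations are easy to look up."""
    index = defaultdict(list)
    for triple in triples:
        index[triple.head].append((triple.relation, triple.tail))
    return dict(index)


# The smallest possible knowledge graph: a single triple.
triples = [Triple("Node A", "related to", "Node B")]
index = build_index(triples)

# The graph is dynamic: adding data is just adding more triples.
triples.append(Triple("Node B", "related to", "Node C"))
bigger_index = build_index(triples)
```

A list of triples is enough to describe the whole graph, and the index is simply a convenient view of it for lookups.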

How is knowledge represented in a graph?

We use Neo4j, an open source graph database management system, to store our knowledge graph. If Node A is ‘Barack Obama’ and Node B is ‘USA’, then the relationship between the nodes would very likely be ‘president of’, and the graph would look like this:

A node can have multiple relations. The node ‘Barack Obama’ might be related to another node through a different relation, for example ‘was born in’.
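As a rough sketch of how such triples could be written to Neo4j, the snippet below builds Cypher `MERGE` statements in Python. The `Entity` label, the helper names and the ‘Hawaii’ node are assumptions made for illustration; they are not our production schema. Cypher does not allow a relationship type to be passed as a query parameter, so the sketch sanitizes the relation text and places it in the query string, while the node names travel as parameters.

```python
import re


def relation_type(label: str) -> str:
    """Turn a free-text relation such as 'president of' into a Cypher-safe type."""
    cleaned = re.sub(r"[^0-9A-Za-z]+", "_", label).strip("_").upper()
    if not cleaned:
        raise ValueError(f"cannot derive a relationship type from {label!r}")
    if cleaned[0].isdigit():
        # Unquoted Cypher identifiers may not start with a digit.
        cleaned = "REL_" + cleaned
    return cleaned


def merge_statement(head: str, relation: str, tail: str):
    """Return (query, parameters) that create the triple if it does not exist yet."""
    query = (
        "MERGE (a:Entity {name: $head}) "
        "MERGE (b:Entity {name: $tail}) "
        f"MERGE (a)-[:{relation_type(relation)}]->(b)"
    )
    return query, {"head": head, "tail": tail}


# One node, two different relations to two different nodes.
statements = [
    merge_statement("Barack Obama", "president of", "USA"),
    merge_statement("Barack Obama", "was born in", "Hawaii"),  # illustrative node
]
```

With the official Neo4j Python driver, each pair could then be executed with something like `session.run(query, parameters)`. Using `MERGE` rather than `CREATE` keeps ‘Barack Obama’ as a single node no matter how many relations are added.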

Populating the Knowledge Graph

Pipeline for populating a KG

Our initial focus is on extracting content from form-like PDF documents for chemical products to populate the KG, as many of our customers have a significant number of semi-structured documents such as safety data sheets. These documents contain a lot of information about the product, e.g. Handling and Storage, Physical and Chemical Properties, etc. An example of such a document:

The most important parts of a knowledge graph are the nodes and the relations between them. Since the text in these documents is not running (continuous) text, typical triple extraction techniques based on subject-predicate-object sentence structure are ineffective. Instead, the information in these form-based PDFs comes as key-value pairs. For example, in the document image above, the key is ‘Colour’ and the value is what appears next to it. As a result, a specialized approach to entity extraction is required.

We have trained a form extraction model that uses BERT, together with Azure OCR, to extract text, tables and key-value pairs from unstructured data. If you want to read more about this model, please read this article.

The key-value pairs are extracted from documents using the form extraction model.

The prediction output of the form extraction model is used as a data source to populate KG.

Information Representation in Knowledge Graph

This use case involves combining a KG with deep learning to improve search capabilities, as a KG represents information in an intuitive manner, similar to how people link concepts to one another. The interlinking of documents makes it much easier to identify relationships and execute navigational queries. The output of the form extraction model is preprocessed before it is loaded into Neo4j. The KG is then populated with the product name as one node and each property name, together with its value, as another node.
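The preprocessing step could look roughly like the sketch below. The input format (a list of key-value tuples), the cleaning rules, the `Product` and `Property` labels and the `HAS_PROPERTY` relationship are all assumptions made for illustration, and the sample values are made up rather than taken from a real data sheet.

```python
def normalize(text: str) -> str:
    """Collapse whitespace and trim trailing colons left over from form labels."""
    return " ".join(text.split()).rstrip(":").strip()


def to_graph_rows(product_name, key_values):
    """Turn extracted (key, value) pairs into product -> property rows.

    Pairs with an empty key or value are skipped, and exact duplicates are dropped.
    """
    product = normalize(product_name)
    rows, seen = [], set()
    for key, value in key_values:
        name, val = normalize(key), normalize(value)
        if not name or not val or (name, val) in seen:
            continue
        seen.add((name, val))
        rows.append({"product": product, "property": name, "value": val})
    return rows


# Hypothetical Cypher for loading one row: the product is one node,
# the property name together with its value is the other node.
LOAD_QUERY = (
    "MERGE (p:Product {name: $product}) "
    "MERGE (q:Property {name: $property, value: $value}) "
    "MERGE (p)-[:HAS_PROPERTY]->(q)"
)

# Made-up extraction output, for illustration only.
extracted = [("Colour:", "  white "), ("Form", "powder"), ("Odour", ""), ("Form", "powder")]
rows = to_graph_rows("Example Product", extracted)
```

Each row can be used directly as the parameter map for `LOAD_QUERY`, so the loading code does not need to know anything about the document the row came from.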

Knowledge graph for answering questions

Answering questions is an interesting use of a KG, because querying a graph is often easier than querying a traditional relational database, where retrieving a specific value can require many joins. A specific question such as ‘What is the form of Cavipor® T0?’ should yield an appropriate answer from the pool of data.
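A minimal sketch of this idea, assuming questions follow the fixed template ‘What is the X of Y?’: the question is split into a property and a product, which are then looked up in the graph. The regular expression, the function names, the `Product`/`Property`/`HAS_PROPERTY` schema in the Cypher string and the in-memory `graph` dictionary are illustrative assumptions, and the stored value is a placeholder rather than real product data. A production system would need a far more robust way of understanding questions.

```python
import re

QUESTION = re.compile(
    r"^what is the (?P<property>.+?) of (?P<product>.+?)\??$", re.IGNORECASE
)

# Hypothetical Cypher equivalent of the lookup: one hop, no joins.
ANSWER_QUERY = (
    "MATCH (p:Product {name: $product})-[:HAS_PROPERTY]->(q:Property) "
    "WHERE toLower(q.name) = toLower($property) "
    "RETURN q.value"
)


def parse_question(question: str):
    """Extract (property, product) from a 'What is the X of Y?' question, else None."""
    match = QUESTION.match(question.strip())
    if match is None:
        return None
    return match.group("property").strip(), match.group("product").strip()


def answer(question, graph):
    """Look the property up in a {product: {property: value}} mapping."""
    parsed = parse_question(question)
    if parsed is None:
        return None
    prop, product = parsed
    properties = {k.lower(): v for k, v in graph.get(product, {}).items()}
    return properties.get(prop.lower())


# Placeholder value: the real value comes from the product's data sheet.
graph = {"Cavipor® T0": {"Form": "<value from the data sheet>"}}
parsed = parse_question("What is the form of Cavipor® T0?")
result = answer("What is the form of Cavipor® T0?", graph)
```

The dictionary lookup stands in for the database here. The same parsed pair could instead be passed as parameters to `ANSWER_QUERY`, which follows a single relationship from the product node to the matching property node.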

The KG for one product is shown below, with all of its properties and their values connected as nodes.

End Notes

In this article, we saw a pipeline that extracts information from documents and builds a knowledge graph from it. A KG can also serve as a base for combining data from many sources or documents, making it possible to aggregate an answer from multiple sources. Beyond form-like PDFs, this pipeline could be extended to handle other types of PDF documents.

Please feel free to visit our website:

HOME | Recobo

Our Documentation for services provided by Recobo:

Overview | RECOBO
