Knowledge Graphs: How to Build One from Text

Ritika Kumari
6 min read · Feb 5, 2023


First things first, have you heard of Knowledge Graphs before? Not asking if you know what they are, just whether you overheard someone talking about them or casually read the term somewhere. If yes, amazing! If not, that's also amazing, because we will go through it in as demystified a way as possible.

We use Google almost every day. Have you ever wondered how Google Search provides such accurate information from such a vast amount of data? The term Knowledge Graph (KG) has been around since the late 1980s, but it really picked up pace in 2012 when Google introduced its KG built on DBpedia and Freebase (both graph-based repositories developed on related concepts). Following that, other multinationals such as Facebook, LinkedIn, Airbnb, Microsoft and Amazon also developed their own KGs.

Enough about the history. We still don't know what it actually is. A knowledge graph is nothing but an interconnected web of entities with some relationship between them. These entities can be any object: a place, an event, a situation or even a concept. This information is usually stored in a graph database and visualised as a graph structure, hence the name "Knowledge Graph". It consists of three main components: nodes (the entities), edges (the relationships) and node labels. Two entities connected by a relationship form what we call a "triplet". This is nicely represented in Fig.1, where in the triplet "Alice visited Eiffel Tower", Alice and Eiffel Tower are the entities, Person and Place are their labels, and visited is the relationship between them.

Fig.1 Illustration of a Knowledge Graph
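
To make this concrete, here is a tiny sketch that stores the Fig.1 triplet as a labelled graph in Python. networkx is used here purely as an illustrative stand-in for a real graph database; the node and edge attribute names are my own choices, not a fixed convention.

```python
# Store the triplet (Alice) -[visited]-> (Eiffel Tower) in a directed graph.
import networkx as nx

kg = nx.DiGraph()
kg.add_node("Alice", label="Person")
kg.add_node("Eiffel Tower", label="Place")
kg.add_edge("Alice", "Eiffel Tower", relationship="visited")

# The (head, relation, tail) triplet can be read straight back out.
for head, tail, data in kg.edges(data=True):
    print((head, data["relationship"], tail))  # ('Alice', 'visited', 'Eiffel Tower')
```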

Since we have the terminologies sorted now, let's move to the meat of the matter: building a KG. As we discussed earlier, we primarily need only three pieces of information to build a KG: entities, relationships and labels. This information can be extracted easily using regex if the data has some structure, for example tables, HTML pages, etc. But it gets complicated when the data is natural language text. Machines are not equipped to understand natural language, and that is where Natural Language Processing (NLP) comes in. We will break our problem down into smaller bits and use different NLP techniques to extract the required pieces of information.

1. Entity Extraction

Entities are the most fundamental part of a knowledge graph. If you have data in some structured form, then extracting entities is just a query away. But if the data is unstructured, such as natural language text, then we need to do some preprocessing and text analysis to pull out the entities. We will talk about these steps in detail.
Fig.2 Entity Extraction Pipeline

In natural language text, there can be many expressions that refer to the same entity. For example, in Fig.3 the entity "Ana" is referred to again in the second part of the sentence as "she". Similarly, the entity "Natural Language Processing" is later referred to as "it". The process of finding all expressions that refer to the same entity is called Coreference Resolution. There are several libraries and frameworks that handle this, such as spaCy (with the neuralcoref extension), Stanford CoreNLP, AllenNLP and Hugging Face models.

Fig.3 Example of Coreference Resolution
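
Below is a minimal sketch of this step using AllenNLP's pretrained SpanBERT coreference model. The model URL, example sentence and printing logic are illustrative assumptions, not part of the figure.

```python
# Coreference resolution sketch (pip install allennlp allennlp-models).
from allennlp.predictors.predictor import Predictor

MODEL_URL = (
    "https://storage.googleapis.com/allennlp-public-models/"
    "coref-spanbert-large-2021.03.10.tar.gz"
)
predictor = Predictor.from_path(MODEL_URL)

text = ("Ana is learning Natural Language Processing "
        "because she finds it fascinating.")

result = predictor.predict(document=text)
tokens = result["document"]

# Each cluster groups token spans that refer to the same entity,
# e.g. ["Ana", "she"] and ["Natural Language Processing", "it"].
for cluster in result["clusters"]:
    print([" ".join(tokens[start:end + 1]) for start, end in cluster])

# coref_resolved() rewrites the text with pronouns replaced by their
# antecedents, which is a convenient input for entity extraction.
print(predictor.coref_resolved(text))
```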

In order to extract entities from each sentence, we break the resultant text from the previous step into sentences, which is referred to as sentence tokenisation. After these preprocessing steps, there are three main techniques for entity extraction, as shown in Fig.2 (a combined sketch of all three follows the list):

  • Named Entity Recognition — classifying entities into predefined classes. spaCy has an excellent implementation of this.
  • Dependency Parsing — parsing sentences to extract the subject and object
  • Topic Extraction — extracting keywords or key phrases from text. This technique doesn't require sentence tokenisation as a preprocessing step. It can be achieved using Term Frequency–Inverse Document Frequency (TF-IDF), where we quantify the words in a sentence and extract the important phrases based on their TF-IDF scores.
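
Here is a short combined sketch of these three techniques, using spaCy for NER and dependency parsing and scikit-learn for TF-IDF. The example sentences and parameters are illustrative assumptions.

```python
# Entity extraction sketch: NER, dependency parsing and TF-IDF keywords.
# (pip install spacy scikit-learn && python -m spacy download en_core_web_sm)
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer

nlp = spacy.load("en_core_web_sm")
doc = nlp("Alice visited the Eiffel Tower. Alice lives in Paris.")

# 1. Named entity recognition: entities classified into predefined labels.
for ent in doc.ents:
    print(ent.text, ent.label_)          # e.g. "Alice" PERSON, "Paris" GPE

# 2. Dependency parsing: subject and object of each sentence.
for sent in doc.sents:
    subjects = [t.text for t in sent if t.dep_ in ("nsubj", "nsubjpass")]
    objects = [t.text for t in sent if t.dep_ in ("dobj", "pobj")]
    print(subjects, objects)

# 3. Topic/keyword extraction: rank terms in each sentence by TF-IDF score.
corpus = [sent.text for sent in doc.sents]
vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words="english")
tfidf = vectorizer.fit_transform(corpus)
terms = vectorizer.get_feature_names_out()
for row in tfidf.toarray():
    print(sorted(zip(terms, row), key=lambda x: -x[1])[:3])
```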

Then there’s a post processing step to identify and replace duplicate entities which are under different names. For example: New York and NYC. These kinds of disambiguation create anomalies in the Knowledge Graph. This kind of entity disambiguation can be solved by training a Word2Vec model and perform a K Nearest Neighbour (KNN) search to get the entities which lie very close in the latent space. Disambiguated entities can then be identified using some rule based approach like checking the first few words or the initials.​

Fig.4 Entity Disambiguation
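
Here is a hedged sketch of that idea: train Word2Vec on lists of co-occurring entities, then run a nearest-neighbour query to surface candidate duplicates. The toy entity lists and hyperparameters below are purely illustrative.

```python
# Entity disambiguation sketch with gensim's Word2Vec (pip install gensim).
from gensim.models import Word2Vec

# Each "sentence" is the list of entities extracted from one document.
entity_corpus = [
    ["new_york", "statue_of_liberty", "manhattan"],
    ["nyc", "manhattan", "broadway"],
    ["paris", "eiffel_tower", "louvre"],
]

model = Word2Vec(entity_corpus, vector_size=50, window=3, min_count=1, epochs=200)

# Nearest neighbours in the latent space are merge candidates,
# e.g. look for "nyc" among the neighbours of "new_york".
for candidate, score in model.wv.most_similar("new_york", topn=3):
    # A rule-based check (shared initials, string overlap) can confirm the merge.
    print(candidate, round(score, 3))
```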

2. Relationship Extraction

Relationships are the links that connect two entities. But unlike entities, relationships are difficult to extract with NLP techniques. Even if we manage to extract relationships with techniques like dependency parsing, the problem is that this produces a very large set of unique relationships. With so many relationships compared to entities, the Knowledge Graph built out of them will be very sparse. The sparsity of the graph makes it difficult to run algorithms like graph completion and graph traversal, and as a result it will perform very poorly on downstream tasks. To avoid this problem, there has to be some categorisation of relationships so that we end up with a definite number of relationship classes. While working on this problem, I came across the SelfORE paper, which has a precise solution for it.

Fig.5 Relationship Extraction technique (src. SelfORE paper)

The output from the entity extraction module is fed into a BERT language model for sequence classification. The entities in the text are annotated with E1 and E2 markers, as shown in the flowchart. The output of this model is a sentence encoding that gives more weight to the two marked entities. This sentence embedding is effectively the relation embedding between the two entities. These embeddings are then fed into an adaptive clustering model that uses K-means as its base clustering technique. The labels generated for each sentence by the clustering are then used to train and refine the BERT sequence-classification model; the sentence embeddings are extracted again, and the process repeats until the loop ends. To visualise the relation embeddings, Fig.6 shows the relationship class vectors in 3D space after each iteration. It is clearly visible how the relation embeddings get refined after each iteration.

Fig.6 Relationship vector representation in 3D
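
To ground the idea, here is a highly simplified sketch of one iteration of that loop using Hugging Face Transformers and scikit-learn. The [E1]/[E2] marker tokens, model name and toy sentences are assumptions, and plain K-means stands in for the paper's adaptive clustering.

```python
# One round of "encode entity-marked sentences -> cluster -> pseudo-labels".
# (pip install transformers torch scikit-learn)
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.cluster import KMeans

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentences = [
    "[E1] Alice [/E1] visited the [E2] Eiffel Tower [/E2] last year.",
    "[E1] Bob [/E1] works at [E2] Google [/E2] in Zurich.",
    "[E1] Carol [/E1] travelled to [E2] Rome [/E2] in 2019.",
]
# Register the entity markers as extra tokens so they are not split apart.
tokenizer.add_tokens(["[E1]", "[/E1]", "[E2]", "[/E2]"])
model.resize_token_embeddings(len(tokenizer))

with torch.no_grad():
    batch = tokenizer(sentences, padding=True, return_tensors="pt")
    # Use the [CLS] vector as a stand-in for the relation embedding.
    embeddings = model(**batch).last_hidden_state[:, 0, :].numpy()

# Cluster the relation embeddings; the cluster ids act as pseudo-labels
# that would be used to fine-tune the encoder before the loop repeats.
pseudo_labels = KMeans(n_clusters=2, n_init=10).fit_predict(embeddings)
print(pseudo_labels)
```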

3. Graph Completion

Graph Completion is basically predicting the missing links (relationships) in the Knowledge Graph. This is where Knowledge Graph embeddings come in. They aid link prediction by providing a generalisable context about the overall KG, which can then be used to infer missing relations.

Almost all KG embedding models are built using these three steps:

  • Encoding each node into a vector
  • Defining a scoring function
  • Optimizing the scoring function.
Fig.7 KG embedding representation

Relationships are represented as translations in the embedding space. The scoring function is defined such that the embedding of the head entity "h", translated by a vector that depends on the relationship "r", should be close to the embedding of the tail entity "t" (in TransE-style models, h + r ≈ t). The model then optimises this scoring function. For training, since we only have the true triplets, we need to create negative samples by corrupting the true triplets, for example by replacing the head or tail with a random entity.
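
As a concrete illustration, here is a minimal TransE-style sketch in PyTorch with a margin ranking loss and randomly corrupted tails. The toy triplets, entity ids and hyperparameters are assumptions for the example only.

```python
# TransE-style KG embedding sketch: score(h, r, t) = ||h + r - t||,
# trained so true triplets score lower (closer) than corrupted ones.
import torch
import torch.nn as nn

class TransE(nn.Module):
    def __init__(self, n_entities, n_relations, dim=50):
        super().__init__()
        self.ent = nn.Embedding(n_entities, dim)
        self.rel = nn.Embedding(n_relations, dim)

    def score(self, h, r, t):
        # Lower distance = more plausible triplet.
        return (self.ent(h) + self.rel(r) - self.ent(t)).norm(p=2, dim=-1)

# Toy triplets of (head_id, relation_id, tail_id), e.g. (Alice, visited, EiffelTower).
triplets = torch.tensor([[0, 0, 1], [2, 0, 3], [0, 1, 4]])
model = TransE(n_entities=5, n_relations=2)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
loss_fn = nn.MarginRankingLoss(margin=1.0)

for _ in range(100):
    h, r, t = triplets[:, 0], triplets[:, 1], triplets[:, 2]
    # Negative sampling: corrupt the tail with a random entity.
    t_neg = torch.randint(0, 5, t.shape)
    pos, neg = model.score(h, r, t), model.score(h, r, t_neg)
    # Push negative distances to exceed positive distances by the margin.
    loss = loss_fn(neg, pos, torch.ones_like(pos))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```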

Awesome! That is it. That is all you need to build a basic Knowledge Graph. Knowledge Graphs are a huge area with a variety of implementations and applications depending on the kind of data we are dealing with. This was just an overview of how to build a KG from natural language text in the simplest way possible. There's a lot more to talk about once we go into the nitty-gritty details of these methods.
