Knowledge Graph Creation: Part I

How to construct a knowledge graph?

Selen Parlar
Analytics Vidhya
4 min read · Dec 11, 2019


In the previous story, we gave a gentle introduction to knowledge graphs and gained some intuition about them. This post consists of two parts: in the first part, we will do some NLP and extract information from unstructured data using spaCy, and in the second part we will construct our knowledge graph using this information.

Natural Language Processing With spaCy

Natural Language Processing (NLP) is a subfield of Artificial Intelligence that tries to connect computers and human languages. spaCy is a free, open-source library for NLP in Python with many built-in capabilities. Processing unstructured text and deriving insights from it is an important task, and spaCy is commonly used for that purpose.

Installation

One can easily install spaCy using the Python package manager pip.
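The install itself is a single shell command, pip install spacy. As a small sketch, the Python snippet below checks whether the package is importable in the current environment; the helper name is_installed is only illustrative and not part of spaCy.

```python
import importlib.util


def is_installed(package: str) -> bool:
    """Return True if `package` can be imported in this environment."""
    return importlib.util.find_spec(package) is not None


# Install from a shell first:  pip install spacy
print("spaCy installed:", is_installed("spacy"))
```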

Download Models and Data

spaCy offers several types of models. Let’s download one of them and then load it. We will store the loaded model in an nlp object, which is a language pipeline instance created from the en_core_web_sm package.

Using spaCy

Let’s read a text using spaCy and store it in a doc object, which is a container for accessing linguistic annotations.

Tokenization

Tokenization is a process that allows us to identify the basic units in our text. Let’s extract the basic units, namely tokens, for the given doc:

Entity Extraction

We can also analyze the syntactic structure of a sentence. A part of speech (POS) is a grammatical category that describes how a particular word is used in a sentence. Traditional English grammar distinguishes eight parts of speech:

  1. Noun
  2. Pronoun
  3. Adjective
  4. Verb
  5. Adverb
  6. Preposition
  7. Conjunction
  8. Interjection

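To extract the entities from the text, let’s get the POS tags. spaCy exposes them through each token’s pos_ attribute, using the Universal POS tag set (for example NOUN, PROPN, VERB), which is finer-grained than the eight classes above.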

Relation Extraction

POS tags alone are not sufficient in many cases, so we need further analysis such as dependency parsing. Dependency parsing represents the grammatical structure of a sentence as a set of directed relations between head words and their dependents. Now, let’s extract the dependency relations among entities:

Visualization

Visualization is also possible using spaCy’s built-in visualizer called displaCy. Let’s see the POS tags of the given text:

Once displaCy’s server is running, you can see the visualization by opening http://127.0.0.1:5000 in your browser.

Sentence Segmentation

Texts generally consist of several sentences, and sentence detection is an important feature for dividing a text into linguistically meaningful units. Let’s extract the sentences of a text about Ada Lovelace:

As the example above shows, spaCy correctly identifies the sentences in English text. Note that the en_core_web_sm pipeline derives sentence boundaries from the dependency parse rather than by simply treating every full stop (.) as a delimiter.

All in all…

Now we know how to perform some basic NLP tasks such as tokenization and sentence segmentation. We also know how to get the entities and the relations between them. We will use this knowledge in the second part of this story, where each entity will be represented as a node and the relations between entities will be the edges of our knowledge graph. Let’s see how! Stay tuned!
