Knowledge Graph Creation: Part I

How to construct a knowledge graph?

Published in Analytics Vidhya, Dec 11, 2019 (4 min read)

In the previous story, we gave a gentle introduction to knowledge graphs and built some intuition about them. This story consists of two parts: in the first part, we will do some NLP and extract information from unstructured data using spaCy, and in the second part we will construct our knowledge graph from that information.

Natural Language Processing With spaCy

Natural Language Processing (NLP) is a subfield of Artificial Intelligence that tries to connect computers and human languages. spaCy is a free and open-source library for NLP in Python, and it has a lot of built-in capabilities. Processing unstructured text and deriving insights from it is important, and spaCy is commonly used for that purpose.

Installation

One can easily install spaCy using the Python package manager pip.

$ pip install spacy

Download Models and Data

spaCy offers several types of models. Let’s download one of them and then load it. We will store the model in an nlp object, which is a language model instance loaded from en_core_web_sm.

$ python -m spacy download en_core_web_sm

>>> import spacy
>>> nlp = spacy.load('en_core_web_sm')

Using spaCy

Let’s process a text with spaCy and store the result in a doc object, which is a container for accessing linguistic annotations.

>>> doc = nlp("This story is about Natural Language Processing using spaCy and creating knowledge base.")

Tokenization

Tokenization is a process that allows us to identify the basic units in our text. Let’s extract the basic units, namely tokens, for the given doc:

>>> print([token.text for token in doc])
['This', 'story', 'is', 'about', 'Natural', 'Language', 'Processing', 'using', 'spaCy', 'and', 'creating', 'knowledge', 'base', '.']
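
To get a feel for what a tokenizer does, here is a rough approximation that uses only Python's re module. It is not spaCy's tokenizer, which applies language-specific rules and exceptions; the pattern and the name rough_tokenize are assumptions made for illustration. For this particular sentence it happens to produce the same tokens.

```python
import re


def rough_tokenize(text):
    """Split text into runs of word characters and single punctuation marks.

    Illustrative only: spaCy's tokenizer handles contractions, URLs,
    abbreviations and other special cases that this pattern does not.
    """
    return re.findall(r"\w+|[^\w\s]", text)


sentence = ("This story is about Natural Language Processing "
            "using spaCy and creating knowledge base.")
tokens = rough_tokenize(sentence)
print(tokens)
```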

Entity Extraction

We can also perform further analysis, such as getting the syntactic structure of a sentence. A part of speech (POS) is a grammatical role that explains how a particular word is used in a sentence. Traditional English grammar lists eight parts of speech:

Noun
Pronoun
Adjective
Verb
Adverb
Preposition
Conjunction
Interjection

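To extract the entities from the text, let’s get the POS tags. Note that spaCy’s token.pos_ uses the finer-grained Universal POS tag set, so the output contains tags such as DET, AUX, PROPN and PUNCT in addition to the categories listed above.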

>>> for token in doc:
...     print(token.text, "-->", token.pos_)
This --> DET
story --> NOUN
is --> AUX
about --> ADP
Natural --> PROPN
Language --> PROPN
Processing --> PROPN
using --> VERB
spaCy --> NOUN
and --> CCONJ
creating --> VERB
knowledge --> NOUN
base --> NOUN
. --> PUNCT
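
One simple way to go from POS tags to candidate entities is to merge consecutive NOUN and PROPN tokens into a single phrase. The sketch below does this on the (token, tag) pairs printed above, so it runs without spaCy. The function name candidate_entities and the choice of tags are assumptions for illustration, not a spaCy API.

```python
# (token, POS tag) pairs copied from the output above.
tagged = [
    ("This", "DET"), ("story", "NOUN"), ("is", "AUX"), ("about", "ADP"),
    ("Natural", "PROPN"), ("Language", "PROPN"), ("Processing", "PROPN"),
    ("using", "VERB"), ("spaCy", "NOUN"), ("and", "CCONJ"),
    ("creating", "VERB"), ("knowledge", "NOUN"), ("base", "NOUN"),
    (".", "PUNCT"),
]


def candidate_entities(pairs, entity_tags=("NOUN", "PROPN")):
    """Merge runs of consecutive noun-like tokens into candidate entities."""
    entities, current = [], []
    for text, tag in pairs:
        if tag in entity_tags:
            current.append(text)
        elif current:
            # A non-noun token closes the current run.
            entities.append(" ".join(current))
            current = []
    if current:  # flush a run that reaches the end of the input
        entities.append(" ".join(current))
    return entities


entities = candidate_entities(tagged)
print(entities)
# ['story', 'Natural Language Processing', 'spaCy', 'knowledge base']
```

With a real doc, the same pairs could be built as [(token.text, token.pos_) for token in doc].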

Relation Extraction

POS tags alone are not sufficient in many cases, which call for further analysis such as dependency parsing. Dependency parsing extracts the dependency parse of a sentence, which represents its grammatical structure as relations between words. Now, let’s extract the dependency relations among entities:

>>> for token in doc:
...     print(token.text, "-->", token.dep_)
This --> det
story --> nsubj
is --> ROOT
about --> prep
Natural --> compound
Language --> compound
Processing --> pobj
using --> advcl
spaCy --> dobj
and --> cc
creating --> conj
knowledge --> compound
base --> dobj
. --> punct
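
The dependency labels already hint at which tokens could play which role in a relation: nsubj marks a subject, ROOT the main predicate, and dobj and pobj mark objects. The sketch below groups the (token, label) pairs printed above by those roles. It is plain Python, and the role names and the mapping are assumptions for illustration. It also ignores which head each token attaches to (available as token.head in spaCy), so a real relation extractor would need more than this.

```python
# (token, dependency label) pairs copied from the output above.
parsed = [
    ("This", "det"), ("story", "nsubj"), ("is", "ROOT"), ("about", "prep"),
    ("Natural", "compound"), ("Language", "compound"), ("Processing", "pobj"),
    ("using", "advcl"), ("spaCy", "dobj"), ("and", "cc"),
    ("creating", "conj"), ("knowledge", "compound"), ("base", "dobj"),
    (".", "punct"),
]

# Assumed mapping from dependency label to a coarse role in a relation.
ROLE_OF_LABEL = {
    "nsubj": "subject",
    "ROOT": "predicate",
    "dobj": "object",
    "pobj": "object",
}


def group_by_role(pairs):
    """Collect tokens whose dependency label maps to a known role."""
    roles = {"subject": [], "predicate": [], "object": []}
    for text, label in pairs:
        role = ROLE_OF_LABEL.get(label)
        if role is not None:
            roles[role].append(text)
    return roles


roles = group_by_role(parsed)
print(roles)
# {'subject': ['story'], 'predicate': ['is'], 'object': ['Processing', 'spaCy', 'base']}
```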

Visualization

Visualization is also possible with displaCy, spaCy’s built-in visualizer. Let’s see the dependency parse of a short text, annotated with its POS tags:

>>> from spacy import displacy
>>> about_interest_text = ('Let\'s create knowledge graph with ...spaCy.')
>>> about_interest_doc = nlp(about_interest_text)
>>> displacy.serve(about_interest_doc, style='dep')

You can see the visualization by opening http://127.0.0.1:5000 in your browser:

Sentence Segmentation

Generally, texts consist of several sentences, and sentence detection is an important feature that divides a text into linguistically meaningful units. Let’s extract the sentences of a text about Ada Lovelace:

>>> text_ada = ('Ada Lovelace was an English mathematician and'
...             ' writer, chiefly known for her work on'
...             ' mechanical general-purpose computer, the'
...             ' Analytical Engine. She was the first to'
...             ' recognise that the machine had applications'
...             ' beyond pure calculation, and published the'
...             ' first algorithm intended to be carried out'
...             ' by such a machine. As a result, she is'
...             ' sometimes regarded as the first to recognise'
...             ' the full potential of a computing machine and'
...             ' one of the first computer programmers.')
>>> doc_ada = nlp(text_ada)
>>> sentences = list(doc_ada.sents)
>>> print('Number of sentences: ' + str(len(sentences)))
Number of sentences: 3
>>> for sentence in sentences:
...     print('-' + str(sentence))
-Ada Lovelace was an English mathematician and writer, chiefly known for her work on mechanical general-purpose computer, the Analytical Engine.
-She was the first to recognise that the machine had applications beyond pure calculation, and published the first algorithm intended to be carried out by such a machine.
-As a result, she is sometimes regarded as the first to recognise the full potential of a computing machine and one of the first computer programmers.

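As the example above shows, spaCy correctly identifies the sentences in this English text. With a model such as en_core_web_sm, sentence boundaries come by default from the dependency parse rather than from simply splitting on full stops (.), although here every sentence does end with one.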
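
For comparison, a naive splitter that breaks after every full stop, question mark or exclamation mark followed by whitespace works on simple text but is easily fooled by abbreviations. This is a plain-Python sketch, not spaCy's method; naive_sentences and both sample strings (including the condensed Ada Lovelace text) are made up for illustration.

```python
import re


def naive_sentences(text):
    """Split after '.', '!' or '?' when followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", text.strip())


# Condensed, hand-written variant of the Ada Lovelace text.
short_ada = ("Ada Lovelace was an English mathematician and writer. "
             "She published the first algorithm. "
             "She is regarded as one of the first computer programmers.")
print(len(naive_sentences(short_ada)))  # 3

# An abbreviation is wrongly treated as the end of a sentence.
print(naive_sentences("Dr. Smith arrived. He sat down."))
# ['Dr.', 'Smith arrived.', 'He sat down.']
```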

All in all…

Now we know how to perform some basic NLP tasks such as tokenization and sentence segmentation. We also know enough about how to get the entities and the relations between them. We will use this knowledge in the second part of this story, where each entity will be represented as a node and the relations between entities will be the edges of our knowledge graph. Let’s see how! Stay tuned!
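
As a small preview of the nodes-and-edges idea, a knowledge graph can be held in plain Python as a mapping from each node to its outgoing (relation, node) edges. The triples below are hand-written from the Ada Lovelace text for illustration, not produced by spaCy, and build_graph is a hypothetical helper rather than the approach used in the second part.

```python
# Hand-written (subject, relation, object) triples, for illustration only.
triples = [
    ("Ada Lovelace", "was", "mathematician"),
    ("Ada Lovelace", "published", "algorithm"),
]


def build_graph(triples):
    """Return a dict mapping each node to a list of (relation, node) edges."""
    graph = {}
    for subject, relation, obj in triples:
        graph.setdefault(subject, []).append((relation, obj))
        graph.setdefault(obj, [])  # make sure the object is a node as well
    return graph


graph = build_graph(triples)
for node, edges in graph.items():
    print(node, "-->", edges)
```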

Written by Selen Parlar