Python NLP Tutorial: Information Extraction and Knowledge Graphs
This article was originally published on the Programmer Backpack blog. Make sure to visit the blog if you want to read more stories like this one.
NLP tutorial for Information Extraction and building a Knowledge Graph in Python and spaCy.
In a previous article, we discussed Natural Language Processing and the various tools we have to quickly get our hands dirty in this field. This post will be about trying out spaCy, one of the most wonderful tools we have for NLP tasks in Python.
Today’s objective is to get us acquainted with spaCy and NLP. We will write some code to build a small knowledge graph that will contain structured information extracted from unstructured text. The entire code for the project can be found at the end of this article.
Information extraction and knowledge graphs
Information extraction is the technique of extracting structured information from unstructured text. This means taking a raw text (say, an article) and processing it in such a way that we can extract information from it in a format that a computer understands and can use. This is a very difficult problem in NLP because human language is so complex and many words can have a different meaning when we put them in a different context.
A knowledge graph is a way of storing data that results from an information extraction task. Many basic implementations of knowledge graphs make use of a concept called a triple: a set of three items (a subject, a predicate and an object) that we can use to store information about something. Let's take these sentences as an example:
London is the capital of England. Westminster is located in London.
After some basic processing, which we will see later, we would get 2 triples like this:
(London, be capital, England), (Westminster, locate, London)
So in this example we have three unique entities (London, England and Westminster) and two relations (be capital, locate). To build a knowledge graph, we only need to associate the entities with nodes in the graph and connect them through their relations, and we will get something like this:
So what is the use case for this? Imagine we ask a computer where Westminster is located, or what the capital of England is. We would only have to search the graph for the right relation and extract the correct entities on both sides of it. This is a simple use case for a very basic example, but knowledge graphs are used quite a lot today in many Machine Learning tasks by some of the biggest companies (you know about the Google Knowledge Graph, right?).
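To make this lookup concrete, here's a tiny sketch using networkx (the same library we will use later to draw the graph). Like the implementation further down, it stores each relation as an intermediate node sitting between its two entities:

import networkx as nx

G = nx.Graph()
for subject, relation, obj in [("London", "be capital", "England"),
                               ("Westminster", "locate", "London")]:
    # Each relation becomes a node connected to both of its entities
    G.add_edge(subject, relation)
    G.add_edge(relation, obj)

# "Where is Westminster located?" -> look at the other side of "locate"
print([n for n in G.neighbors("locate") if n != "Westminster"])  # ['London']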
Getting our hands dirty
We will write together a very basic implementation of a small knowledge graph. All the code is available on Github if you want to check it out (feel free to star it so that I know you've found it useful in any way).
We will use these sentences for our knowledge graph. Some of the text is taken from Wikipedia, while some of it was manually added.
London is the capital and largest city of England and the United Kingdom. Standing on the River Thames in the south-east of England, at the head of its 50-mile (80 km) estuary leading to the North Sea, London has been a major settlement for two millennia. Londinium was founded by the Romans. The City of London, London’s ancient core − an area of just 1.12 square miles (2.9 km2) and colloquially known as the Square Mile − retains boundaries that follow closely its medieval limits. The City of Westminster is also an Inner London borough holding city status. Greater London is governed by the Mayor of London and the London Assembly. London is located in the southeast of England. Westminster is located in London. London is the biggest city in Britain. London has a population of 7,172,036.
First, we need to install spaCy:
pip install spacy
After the installation, we need to download spaCy’s model for the English language. This is as simple as it gets:
python3 -m spacy download en_core_web_sm
Now you can open up your favourite editor and we are good to go. Here’s what we are going to do:
- Import our dependencies
- Use spaCy to split the text into sentences (a short sketch of these first steps follows this list)
- For each sentence, use spaCy to figure out what kind of word every word in that sentence is: a subject, an object, a predicate and so on
- Use the information from above to figure out where in the triple we should put a word
- Finally build the triples
- Build and show the knowledge graph
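Before we dive into the extraction logic, here's a minimal sketch of those first steps, assuming the en_core_web_sm model we downloaded above (the variable names are my own):

import spacy

# Load the small English model we downloaded earlier
nlp = spacy.load("en_core_web_sm")

text = "London is the capital and largest city of England and the United Kingdom. Westminster is located in London."

# spaCy segments the document into sentences for us
doc = nlp(text)
for sentence in doc.sents:
    print(sentence.text)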
Every sentence is split into tokens, meaning every word or punctuation mark is taken by spaCy and we can use that to figure out what kind of word we have. During processing, spaCy attaches a dependency label to every token, exposed through the dep_ attribute, so that we know whether a word is a subject, an object and so on.
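For example, this small snippet (my own wiring) prints every token of a sentence together with its dependency label:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("London is the capital and largest city of England and the United Kingdom.")
for token in doc:
    print(token.text, "->", token.dep_)

For the first sentence in our text we will get: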
London -> nsubj
is -> ROOT
the -> det
capital -> attr
and -> cc
largest -> amod
city -> conj
of -> prep
England -> pobj
and -> cc
the -> det
United -> compound
Kingdom -> conj
. -> punct
I honestly recommend checking out the meaning of these tags in spaCy’s documentation, because there are so many of them and they are different for every language.
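As a shortcut, spaCy can also explain any of these labels right from code via spacy.explain:

import spacy

# spaCy ships with a glossary of its annotation labels
print(spacy.explain("nsubj"))  # nominal subject
print(spacy.explain("pobj"))   # object of preposition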
For the simplicity of this example, I’ve chosen to extract a single triple from every sentence. Moreover, I am choosing only a few tags to enter the triples, but you can choose a lot more depending on what you want to achieve. This is, for example, what I chose:
# Tags I've chosen for relations
relationDeps = ["ROOT", "adj", "attr", "agent", "amod"]
# Tags I've chosen for entities (subjects and objects)
entityDeps = ["compound", "prep", "conj", "mod"]
Obviously, for a given triple, relations and entities can be built from more than one word. We want to capture information like (London, be capital, England), not just (London, be, England), which has a totally different meaning.
With all of that said, this is the main skeleton of the algorithm. You can see it calls the helper methods sketched above, but even on its own it gives the basic idea:
def processSubjectObjectPairs(tokens):
    subject = ''
    object = ''
    relation = ''
    subjectConstruction = ''
    objectConstruction = ''
    for token in tokens:
        printToken(token)
        # Skip punctuation entirely
        if "punct" in token.dep_:
            continue
        # Lemmatized words make up the relation
        if isRelationCandidate(token):
            relation = appendChunk(relation, token.lemma_)
        # Words that build up a multi-word entity
        if isConstructionCandidate(token):
            if subjectConstruction:
                subjectConstruction = appendChunk(subjectConstruction, token.text)
            if objectConstruction:
                objectConstruction = appendChunk(objectConstruction, token.text)
        # A subject or object closes its pending construction
        if "subj" in token.dep_:
            subject = appendChunk(subject, token.text)
            subject = appendChunk(subjectConstruction, subject)
            subjectConstruction = ''
        if "obj" in token.dep_:
            object = appendChunk(object, token.text)
            object = appendChunk(objectConstruction, object)
            objectConstruction = ''
    return (subject.strip(), relation.strip(), object.strip())
So what we are doing is basically taking every token (meaning every word and every punctuation mark) and putting it in a category. Once we reach the end of a sentence, we strip the whitespace that might have remained, and we have obtained a triple. We can then use the networkx and matplotlib (pyplot) libraries to build the knowledge graph. Here is the code for displaying the graph:
import networkx as nx
import matplotlib.pyplot as plt

def printGraph(triples):
    G = nx.Graph()
    for triple in triples:
        # Entities and relations all become nodes...
        G.add_node(triple[0])
        G.add_node(triple[1])
        G.add_node(triple[2])
        # ...with each relation connected to its two entities
        G.add_edge(triple[0], triple[1])
        G.add_edge(triple[1], triple[2])
    pos = nx.spring_layout(G)
    plt.figure()
    nx.draw(G, pos, edge_color='black', width=1, linewidths=1,
            node_size=500, node_color='seagreen', alpha=0.9,
            labels={node: node for node in G.nodes()})
    plt.axis('off')
    plt.show()
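To tie everything together, here's a hedged sketch of the top-level wiring (my own arrangement; the code in the repository may differ slightly). It splits the text into sentences, extracts one triple per sentence and draws the resulting graph:

import spacy

nlp = spacy.load("en_core_web_sm")

def processSentence(sentence):
    # Run one sentence through the pipeline and extract its triple
    tokens = nlp(sentence)
    return processSubjectObjectPairs(tokens)

if __name__ == "__main__":
    text = "London is the capital and largest city of England and the United Kingdom. Westminster is located in London."
    sentences = [sentence.text.strip() for sentence in nlp(text).sents]
    triples = [processSentence(sentence) for sentence in sentences]
    printGraph(triples)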
Depending on the information you are trying to capture, it is possible that you will obtain not one graph, but several disconnected graphs of various sizes. This is of course less likely to happen with huge amounts of complete data, but for the example I chose it happened that we obtained four different graphs. Here is the knowledge graph we were able to build with our code.
Results analysis
The knowledge graph we obtained is admittedly small and basic, but that is because we used a very small amount of data and a basic implementation. For more advanced purposes, I recommend using as much data as you can and trying to enrich the knowledge graph with other NLP techniques. Given the fragment I fed into the knowledge graph, we can see it is able to answer a few questions like:
- What is the capital of England?
- Where is London located?
- Who is London governed by?
- What is the population of London?
- Who was London founded by?
And this is from only one paragraph of text and 99 lines of Python code. So just imagine the possibilities!
Conclusion
Today we learned about knowledge graphs and we managed to build a small, basic one in Python with spaCy. We saw some good results and also noticed areas we can improve. NLP is a fun domain, and we can build applications that help us in our day-to-day activities. The code for the whole project can be found on Github. Please be sure to star the project if you think it is helpful for you.
Interested in more? Follow me on Twitter at @b_dmarius and I’ll post there every new article.