Auto-Generated Knowledge Graphs

Utilize an ensemble of web scraping bots, computational linguistics, natural language processing algorithms and graph theory.

Chris Thornton
Towards Data Science
4 min read · Feb 9, 2020


Knowledge graphs are a data science tool for working with interconnected entities (people, organizations, places, events, etc.). Entities are the nodes, connected to one another via edges. A knowledge graph consists of these entity pairs, which can be traversed to uncover meaningful connections in unstructured data.

Graph databases have inherent drawbacks, one being the manual effort required to construct them. In this article I will discuss my research into, and implementations of, automatic generation using web scraping bots, computational linguistics, natural language processing (NLP) algorithms, and graph theory (with Python code provided).

Web Scraping

The first step in constructing a knowledge graph is to gather your sources. One document may be enough for some purposes, but if you want to go deeper and crawl the web for more information there are multiple ways to achieve this using web scraping. Wikipedia is a decent starting point, as the site functions as a user-generated content database with citations to mostly reliable secondary sources, which vet data from primary sources.

Side Note: Always check your sources. Believe it or not, not all information on the internet is true! For a heuristic-based solution, cross-reference other sites or use SEO metrics as a proxy for trust signals.

I will avoid screen scraping wherever possible by using a direct Python wrapper for the Wikipedia API.

The following function searches Wikipedia for a given topic and extracts information from the target page and its internal links.

Let’s test this function on the topic: “Financial crisis of 2007–08”

wiki_data = wiki_scrape('Financial crisis of 2007–08')

Output:

Wikipedia pages scraped: 798

If you want to extract a single page use the below function:

Computational Linguistics & NLP Algorithms

Knowledge graphs can be constructed automatically from text using part-of-speech tagging and dependency parsing. Extracting entity pairs from grammatical patterns is fast and scales to large amounts of text with the NLP library spaCy.

The following function defines entity pairs as entities/noun chunks with subject-object dependencies connected by a root verb. Other rules of thumb can be used to produce different types of connections. This kind of connection is referred to as a subject-predicate-object triple.
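A simplified sketch of that rule using spaCy's dependency labels (this is an illustration of the pattern, not the article's original gist, which applies more elaborate matching):

```python
import spacy
import pandas as pd

def get_entity_pairs(text, model='en_core_web_sm'):
    """Extract (subject, relation, object) triples from raw text.

    For each sentence, take a noun chunk in a subject dependency and
    one in an object dependency, linked by the sentence's root verb.
    """
    nlp = spacy.load(model)
    doc = nlp(text)
    pairs = []
    for sent in doc.sents:
        subj = obj = None
        for chunk in sent.noun_chunks:
            if chunk.root.dep_ in ('nsubj', 'nsubjpass') and subj is None:
                subj = chunk
            elif chunk.root.dep_ in ('dobj', 'pobj', 'attr') and obj is None:
                obj = chunk
        if subj is not None and obj is not None:
            pairs.append({'subject': subj.text,
                          'relation': sent.root.lemma_,
                          'object': obj.text})
    df = pd.DataFrame(pairs)
    print(f'Entity pairs extracted: {len(df)}')
    return df
```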

Call the function on the main topic page:

pairs = get_entity_pairs(wiki_data.loc[0,'text'])

Output:

Entity pairs extracted: 71

Coreference resolution significantly improves entity pair extraction by normalizing the text, removing redundancies, and assigning entities to pronouns (see my separate article on coreference resolution).

It may also be worthwhile to train a custom entity-recognition model if your use case is domain-specific (healthcare, legal, scientific).

Graph Theory

Next, let's draw the network using the NetworkX library. I will create a directed multigraph with nodes sized in proportion to their degree centrality.
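A sketch of such a drawing helper, assuming `pairs` is a DataFrame with subject/relation/object columns as produced above (the styling choices are my own):

```python
import networkx as nx
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; drop this in a notebook
import matplotlib.pyplot as plt

def draw_kg(pairs):
    """Draw a directed multigraph of triples, sizing nodes
    in proportion to their degree centrality."""
    g = nx.from_pandas_edgelist(pairs, 'subject', 'object',
                                edge_attr='relation',
                                create_using=nx.MultiDiGraph())
    centrality = nx.degree_centrality(g)
    sizes = [2000 * centrality[node] + 100 for node in g]
    pos = nx.spring_layout(g, k=0.8, seed=42)
    plt.figure(figsize=(12, 8))
    nx.draw(g, pos, with_labels=True, node_size=sizes,
            node_color='skyblue', edge_color='gray', font_size=8)
    nx.draw_networkx_edge_labels(
        g, pos, font_size=7,
        edge_labels={(u, v): d['relation']
                     for u, v, d in g.edges(data=True)})
    plt.show()
    return g
```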

draw_kg(pairs)

If the drawn graph becomes unintelligible, we can increase the figure size or filter/query the graph down to a node of interest.
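A sketch of such a filter, keeping only the triples that mention a given node before drawing (a self-contained illustration matching the call below; plotting details are my assumption):

```python
import networkx as nx
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; drop this in a notebook
import matplotlib.pyplot as plt

def filter_graph(pairs, node):
    """Draw only the triples in which `node` appears
    as subject or object."""
    subset = pairs[(pairs['subject'] == node) | (pairs['object'] == node)]
    g = nx.from_pandas_edgelist(subset, 'subject', 'object',
                                edge_attr='relation',
                                create_using=nx.MultiDiGraph())
    pos = nx.spring_layout(g, seed=42)
    plt.figure(figsize=(10, 6))
    nx.draw(g, pos, with_labels=True, node_color='skyblue', font_size=8)
    plt.show()
    return g
```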

filter_graph(pairs, 'Congress')

Knowledge Graphs at Scale

To make effective use of the entire corpus of ~800 Wikipedia pages on our topic, use the columns created by the wiki_scrape function to add properties to each node; you can then track which pages and categories each node appears in.

I recommend parallelizing the scraping and parsing steps to reduce execution time.
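Since page fetching is I/O-bound, a thread pool from the standard library works well. A sketch, where `fetch_one` stands in for whatever single-page scraper you use:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_all(fetch_one, titles, max_workers=16):
    """Apply a single-page fetcher across many titles concurrently.

    Returns results in input order, dropping failures (None).
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(fetch_one, titles))
    return [r for r in results if r is not None]
```

For the CPU-bound spaCy parsing step, a ProcessPoolExecutor (or spaCy's own nlp.pipe batching) is the better fit.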

Knowledge graphs on a large scale are at the frontier of AI research. Alas, real-world knowledge is not structured neatly into a schema but rather unstructured, messy, and organic.
