Turn a Harry Potter Book into a Knowledge Graph

Learn how to combine Selenium and SpaCy to create a Neo4j knowledge graph of the Harry Potter universe

Agenda

  1. Scrape Harry Potter fandom page
  2. Preprocess book text (Co-reference resolution)
  3. Entity recognition with SpaCy’s rule-based matching
  4. Infer relationships between characters
  5. Store results to Neo4j graph database

Harry Potter Fandom Page Scraping

We will use Selenium for web scraping, and we will begin by scraping the characters that appear in Harry Potter and the Philosopher’s Stone. The list of characters by chapter is available under the CC-BY-SA license, so we can reuse it without worrying about copyright infringement.
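The scraping step could look roughly like the sketch below. The function names, the default `css_selector`, and the rule of skipping namespaced links such as `File:` pages are assumptions made for illustration, not the exact code behind this post. The address of the character list is left as a parameter.

```python
from urllib.parse import urljoin, urlparse


def is_wiki_article(url):
    """Keep article links and skip namespaced pages such as File: or Category:."""
    path = urlparse(url).path
    return path.startswith("/wiki/") and ":" not in path


def unique_characters(links, base_url):
    """Turn (title, href) pairs into de-duplicated {"name", "url"} records."""
    seen = set()
    characters = []
    for title, href in links:
        if not title or not href:
            continue
        url = urljoin(base_url, href)
        if is_wiki_article(url) and url not in seen:
            seen.add(url)
            characters.append({"name": title.strip(), "url": url})
    return characters


def scrape_character_list(list_url, css_selector="table a"):
    """Open the character list in a browser and collect the linked characters.

    Selenium is imported here so the helpers above stay usable without a browser.
    The default selector is a guess; inspect the page to find the right one.
    """
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    try:
        driver.get(list_url)
        anchors = driver.find_elements(By.CSS_SELECTOR, css_selector)
        links = [(a.get_attribute("title"), a.get_attribute("href")) for a in anchors]
    finally:
        driver.quit()
    return unique_characters(links, list_url)
```

Keeping the filtering in plain functions means it can be checked on a handful of hand-written links before a browser is ever started.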

Scrape the list of characters from the HP fandom site
Enrich the character details by scraping each character’s page. For each character, we collect:
  • name
  • url
  • aliases
  • nationality
  • blood-type
  • gender
  • species
  • house
  • loyalty
  • family
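To make the import step concrete, here is a hedged sketch of how the scraped attributes could be written to Neo4j. The `Character` and `House` labels, the `BELONGS_TO` relationship type, the `blood_type` property name, and the function names are all assumptions for illustration. Loyalty and family are left out to keep the example short.

```python
# Scalar attributes copied onto the node; names are illustrative assumptions.
SCALAR_KEYS = ("name", "url", "nationality", "blood_type", "gender", "species", "house")

STORE_CHARACTER_QUERY = """
UNWIND $characters AS row
MERGE (c:Character {url: row.url})
SET c.name = row.name,
    c.aliases = row.aliases,
    c.nationality = row.nationality,
    c.blood_type = row.blood_type,
    c.gender = row.gender,
    c.species = row.species
FOREACH (house IN CASE WHEN row.house IS NULL THEN [] ELSE [row.house] END |
    MERGE (h:House {name: house})
    MERGE (c)-[:BELONGS_TO]->(h))
"""


def to_row(scraped):
    """Normalise one scraped record so every key the query reads is present."""
    row = {key: scraped.get(key) for key in SCALAR_KEYS}
    aliases = scraped.get("aliases") or []
    if isinstance(aliases, str):
        aliases = [aliases]
    row["aliases"] = [alias.strip() for alias in aliases if alias.strip()]
    return row


def store_characters(uri, user, password, scraped_records):
    """Send all characters that have a URL to Neo4j in a single query."""
    from neo4j import GraphDatabase  # imported lazily; needs the neo4j driver package

    rows = [to_row(record) for record in scraped_records if record.get("url")]
    with GraphDatabase.driver(uri, auth=(user, password)) as driver:
        with driver.session() as session:
            session.run(STORE_CHARACTER_QUERY, characters=rows).consume()
```

The `FOREACH` over a zero- or one-element list is a common Cypher idiom for creating the house relationship only when a house was actually scraped.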
Store character information to Neo4j
Example subgraph of Hermione Granger. Image by the author with Neo4j Bloom.

Text Preprocessing

First of all, we have to get our hands on the text of the book. I found a GitHub repository that contains the text of the first four Harry Potter books.
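Reading the file can be as simple as the sketch below. It uses only the standard library; the function names are illustrative, and the raw file URL is a parameter because the repository is not named here. Collapsing whitespace is an assumed cleanup step that makes later word-distance calculations easier to reason about.

```python
import re
from urllib.request import urlopen


def fetch_text(raw_url, encoding="utf-8"):
    """Download a plain-text file, for example a raw file served by GitHub."""
    with urlopen(raw_url) as response:
        return response.read().decode(encoding)


def normalize_whitespace(text):
    """Collapse line breaks and runs of spaces into single spaces."""
    return re.sub(r"\s+", " ", text).strip()
```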

Read a text from a file on GitHub

Entity Recognition with SpaCy’s Rule-Based Matching

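First, I wanted to be cool and use a Named Entity Recognition model. I tried models from SpaCy, HuggingFace, Flair, and even Stanford NLP. In the end, I settled on SpaCy’s rule-based matching, using the characters we scraped as predefined entities.

For each character, the matcher patterns cover three forms of the name:

  • Full name: Albus Dumbledore
  • First name: Albus
  • Last name: Dumbledore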
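The three name forms above can be turned into matcher patterns along the following lines. This is a sketch under assumptions: the function names are hypothetical, a blank English pipeline is used because only the tokenizer is needed, and matching is done on lowercased tokens. Overlapping matches are resolved with `spacy.util.filter_spans`, which prefers the longest span.

```python
def name_variants(full_name):
    """Return the full name plus first and last name for multi-word names."""
    parts = full_name.split()
    variants = [" ".join(parts)]
    if len(parts) > 1:
        variants += [parts[0], parts[-1]]
    return variants


def find_characters(text, full_names):
    """Return (character, start_token, end_token) for every non-overlapping match."""
    import spacy  # imported lazily; requires spaCy v3
    from spacy.matcher import Matcher
    from spacy.util import filter_spans

    nlp = spacy.blank("en")
    matcher = Matcher(nlp.vocab)
    for full_name in full_names:
        # Tokenise each variant with spaCy so patterns line up with the doc's tokens.
        patterns = [
            [{"LOWER": token.lower_} for token in nlp.make_doc(variant)]
            for variant in name_variants(full_name)
        ]
        matcher.add(full_name, patterns)

    doc = nlp(text)
    # as_spans=True labels each span with the key passed to matcher.add.
    spans = filter_spans(matcher(doc, as_spans=True))
    return [(span.label_, span.start, span.end) for span in spans]
```

With this setup, "Albus Dumbledore" in the text yields one match for the full name rather than separate matches for "Albus" and "Dumbledore". A last name shared by several characters still matches more than one key, so it needs a separate disambiguation step that this sketch does not attempt.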
Matcher pattern results when longer entities are prioritized.
Disambiguate when multiple options are available for a single entity.

Infer Relationships Between Characters

We are finished with the hard part. Inferring relationships between characters is very simple. First, we need to define the distance threshold, that is, how close together in the text two characters must be mentioned for it to count as an interaction between them. As mentioned, we will use the same distance threshold as was used in the Game of Thrones (GoT) extraction.
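A minimal sketch of the co-occurrence count, assuming each mention has already been reduced to a `(character, token_index)` pair and that distance is measured in tokens. The function name is hypothetical, and the threshold is a parameter rather than a fixed value.

```python
from collections import Counter


def count_interactions(mentions, threshold):
    """Count character pairs whose mentions lie within `threshold` tokens.

    mentions: iterable of (character, token_index) pairs.
    Returns a Counter keyed by alphabetically sorted (character, character) tuples.
    """
    ordered = sorted(mentions, key=lambda mention: mention[1])
    counts = Counter()
    for i, (char_a, pos_a) in enumerate(ordered):
        for char_b, pos_b in ordered[i + 1:]:
            if pos_b - pos_a > threshold:
                break  # mentions are sorted, so later ones are even further away
            if char_a != char_b:
                counts[tuple(sorted((char_a, char_b)))] += 1
    return counts


# Illustrative input; the threshold of 14 is only an example value.
mentions = [("Harry", 0), ("Ron", 5), ("Hermione", 30), ("Harry", 33)]
interactions = count_interactions(mentions, threshold=14)
```

Sorting each pair alphabetically keeps the relationship undirected, so Harry with Ron and Ron with Harry add to the same counter.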

Count the number of interactions between characters based on the text co-occurrence.

Store Results to Neo4j Graph Database

We have extracted the interaction network between characters, and the only thing left is to store the results in a graph database. The import query is very straightforward as we are dealing with a monopartite network, that is, a network with a single type of node (characters).
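The import could be sketched as follows. The `INTERACTS` relationship type, the `weight` property, matching characters by `name`, and the function names are assumptions for illustration; adapt the key to whatever uniquely identifies a character in your graph.

```python
STORE_INTERACTIONS_QUERY = """
UNWIND $rows AS row
MERGE (s:Character {name: row.source})
MERGE (t:Character {name: row.target})
MERGE (s)-[r:INTERACTS]-(t)
SET r.weight = row.weight
"""


def interaction_rows(counts):
    """Convert {(source, target): weight} into a list of query parameter rows."""
    return [
        {"source": source, "target": target, "weight": weight}
        for (source, target), weight in sorted(counts.items())
    ]


def store_interactions(uri, user, password, counts):
    """Write the whole interaction network to Neo4j in a single query."""
    from neo4j import GraphDatabase  # imported lazily; needs the neo4j driver package

    with GraphDatabase.driver(uri, auth=(user, password)) as driver:
        with driver.session() as session:
            session.run(STORE_INTERACTIONS_QUERY, rows=interaction_rows(counts)).consume()
```

The undirected `MERGE` pattern avoids creating a second relationship when the same pair shows up in the opposite order.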

Store interaction network to Neo4j.
Interaction network visualized with NEuler. Image by the author.

Conclusion

I am quite proud of how well the rule-based matching based on predefined entities turned out.

Tomaz Bratanic

Data explorer. Turn everything into a graph. Author of Graph Algorithms for Data Science, published by Manning. http://mng.bz/GGVN