Natural Language Processing

BookNLP: Analyze Your Favorite Books with NLP

Extract character names, quotes, locations, and more in a few lines of code.

Benedict Neo
bitgrit Data Science Publication

--

Photo by Laura Kapfer on Unsplash

I’m a big fan of reading literature, and the most recent novel I read cover-to-cover was Crime and Punishment by Fyodor Dostoevsky.

There were many characters and a lot to analyze in that book, and I wished there were a tool to make the work easier.

While browsing through Twitter, I came across a Python library, BookNLP, which is a natural language processing (NLP) pipeline for books that leverages transformer models.

In this article, we’ll see how to use BookNLP to analyze the book Crime and Punishment in a few lines of code.

Before we start coding, let’s discuss what BookNLP can do.

BookNLP

Based on the documentation:

BookNLP is a natural language processing pipeline that scales to books and other long documents (in English), including:

  • Part-of-speech tagging
  • Dependency parsing
  • Entity recognition
  • Character name clustering (e.g., “Tom”, “Tom Sawyer”, “Mr. Sawyer”, “Thomas Sawyer” -> TOM_SAWYER) and coreference resolution
  • Quotation speaker identification
  • Supersense tagging (e.g., “animal”, “artifact”, “body”, “cognition”, etc.)
  • Event tagging
  • Referential gender inference (TOM_SAWYER -> he/him/his)

The point about it scaling to books is important because most language models do not perform well with larger documents.

The recurrent neural network (RNN), which used to be the state-of-the-art architecture for text problems, has trouble remembering words that appeared much earlier in a document.

Newer transformer-based models like BERT have larger memory and are bidirectional, but their maximum input size is only 512 tokens. For longer documents, other solutions have to be considered; this is where BookNLP comes in.

The problems BookNLP solves

  • Character identification — uses clustering and coreference resolution to correctly assign names to the same identifier, i.e., Harry, Harry Potter, and Mr. Harry Potter are all the same person.
  • Gender inference — infers each character’s referential gender from the pronouns used to refer to them in the book, including non-binary pronouns.
  • Quotation speaker identification — correctly links characters to dialogues with a fair degree of accuracy.
  • Event tagging — tags events around key actions using triple extraction, i.e., extracting three pieces of information: (actor, action, recipient).

After that brief introduction to BookNLP, it’s time to use the library!

As always, you can find the code here:

Get the book text

To get the text for BookNLP to analyze, we’ll use the website Project Gutenberg and download the ebook Crime and Punishment.

Click here to download the text file.

After that, place it in the same directory as your Jupyter notebook.
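If you prefer to fetch the file programmatically, a short script like the one below should work. The ebook number (2554) and the exact URL are assumptions based on Project Gutenberg’s usual URL scheme, so double-check them against the book’s Gutenberg page.

```python
import urllib.request

# Ebook #2554 is Crime and Punishment on Project Gutenberg
# (URL scheme assumed; verify on the book's page).
url = "https://www.gutenberg.org/files/2554/2554-0.txt"
urllib.request.urlretrieve(url, "crime_and_punishment.txt")
```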

Install dependencies

If you’re using an M1 Mac, follow this guide.

We start by installing BookNLP with pip.
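In a Jupyter notebook, that’s a single cell:

```python
!pip install booknlp
```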

Import library

After importing BookNLP, we also bring in pandas for data processing, json for loading JSON data, and pprint for output formatting.
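The imports look like this (the BookNLP import path follows the library’s README):

```python
from booknlp.booknlp import BookNLP

import json
from pprint import pprint

import pandas as pd
```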

Define parameters and create the model

To create the BookNLP pipeline, we need to provide a dictionary containing the model parameters to the BookNLP class.

The dictionary should contain two keys:

  • pipeline — the elements you want to include in the analysis
  • model — the size of the model

We are running the full BookNLP pipeline by including all the elements.

We also chose the big model, which is larger and more accurate. It’s suited to GPUs and multi-core machines, so if you want the computation to run faster on your personal computer, change it to small.
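Putting that together, following the usage shown in the BookNLP README:

```python
model_params = {
    # run the full pipeline
    "pipeline": "entity,quote,supersense,event,coref",
    # "big" is more accurate; switch to "small" for a faster CPU-only run
    "model": "big",
}

booknlp = BookNLP("en", model_params)
```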

After running this chunk of code, you should see a similar output as shown below.

What we’re doing is downloading the BERT models that are necessary for the configured pipeline.

output

Downloading the models here took ~42 seconds.

Run BookNLP

Now that we have the models downloaded, we have to set the names for three things before we can run BookNLP.

  • input_file — the filename/location of the text file you downloaded earlier
  • output_directory — the directory you want BookNLP to dump all the output files to
  • book_id — the basis for how the output files are named

Here we define those names and pass them to the .process function to begin processing the text!
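The names below are just the ones used in this article:

```python
input_file = "crime_and_punishment.txt"     # the Gutenberg text downloaded earlier
output_directory = "crime_and_punishment/"  # where BookNLP writes its output files
book_id = "crime_and_punishment"            # prefix used to name the output files

booknlp.process(input_file, output_directory, book_id)
```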

This chunk of code can take some time to run. As you can see above, it took a little over an hour to finish running on the Google Colab CPU.

It also breaks down how long each specific element took. Here attribution (speaker attribution) took the longest, with coref (coreference resolution) coming in second.

Analyzing the output files


Download the output files here if you don’t want to run the code yourself.

Here’s the output file structure you should have if you run the previous code.

📁 crime_and_punishment
┣━━crime_and_punishment.book
┣━━crime_and_punishment.book.html
┣━━crime_and_punishment.entities
┣━━crime_and_punishment.quotes
┣━━crime_and_punishment.supersense
┗━━crime_and_punishment.tokens

Let’s go over what each of them is and what they contain.

Head over to the documentation and this book for a more detailed breakdown of the columns.

book file

This is a large JSON file that contains information about all characters mentioned more than once in the book, including their references, gender, actions for which they are the agent and patient, objects they possess, and modifiers.

Based on our .book file, we have 1474 characters in total!
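As a minimal sketch, assuming the .book file sits in the output directory created above and has a top-level "characters" list, you can count them like this:

```python
import json

# Load the .book JSON produced by BookNLP
with open("crime_and_punishment/crime_and_punishment.book") as f:
    book_data = json.load(f)

print(len(book_data["characters"]))  # 1474 in our run
```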

Looking at the first character, here is the information included for them (more about them here).

Later in the article, we’ll utilize this file to extract more insights into the characters using the agent and patient data.

More details about them are here.

book.html file

This HTML file contains:

  • a list of named characters and major entity categories
  • the full text of the book along with annotations for entities, coreference, and speaker attribution

Here is the list of named characters.

Notice that Russian names follow unique conventions (formal names, patronymics, and diminutives), which makes it much more difficult for BookNLP to identify the right reference.

There are also the major entity categories.

Here are some of them from our output file:

  • People (PER): Raskolnikov, the old woman, Mother, Sonia
  • Facilities (FAC): the Palais de Cristal, the room
  • Geo-political entities (GPE): Petersburg, the town
  • Locations (LOC): the United States, the Neva, the world
  • Vehicles (VEH): a cab, the train, the carriage
  • Organizations (ORG): the Foundation, the university

Here is an example of the quotation annotations.

output from crime_and_punishment.book.html

entities file

This represents the typed entities within the document (e.g., people and places), along with their coreference.

  • COREF is a unique identifier for a person and is also used in the .quotes file to link the speaker with the block of text. Coreference resolution is one of the more challenging problems in NLP, and accuracy is below 70%.
  • prop tells you whether the mention is NOM (nominal), PROP (proper), or PRON (pronoun)
  • cat shows you the major entity categories, as seen in the book.html file
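The .entities file is tab-separated, so it loads straight into a pandas DataFrame; the .quotes and .supersense files below can be loaded the same way:

```python
import pandas as pd

# Each row is one entity mention with its COREF id, span, prop, and cat
entities = pd.read_csv(
    "crime_and_punishment/crime_and_punishment.entities", sep="\t"
)
entities.head()
```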

quotes file

This file contains information on all the quotes in Crime and Punishment and their corresponding speakers.

supersense file

This stores information from supersense tagging.

Supersense tagging provides coarse semantic information for a sentence by tagging spans with 41 lexical semantic categories drawn from WordNet, spanning both nouns (including plant, animal, food, feeling, and artifact) and verbs (including cognition, communication, motion, etc.)

tokens file

This file encodes core word-level information, with one token of the book per line and important information about each token.

  • word — the raw text of the word
  • lemma — the root form of the word
  • POS_tag — spaCy’s part-of-speech tag
  • dependency_relation — spaCy’s dependency (dep) tag
  • event — tells you whether or not a token triggers an EVENT
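The .tokens file is also tab-separated; passing quoting=csv.QUOTE_NONE is a defensive choice that keeps pandas from misreading quotation marks in the raw text column:

```python
import csv
import pandas as pd

tokens = pd.read_csv(
    "crime_and_punishment/crime_and_punishment.tokens",
    sep="\t",
    quoting=csv.QUOTE_NONE,  # treat " in the text as a normal character
)
tokens.head()
```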

Those were all the output files! Let’s see an example of what we can do with this data.

Character analysis

This example is taken from the book: Introduction to BookNLP.

We’ll analyze characters in a verb-centric manner: we’ll take a verb, e.g., “struck”, and see which characters were the agents or patients of that verb.

If you’re not familiar with those concepts:

  • agent — the doer of the action
  • patient — the recipient of the action

I won’t include the code for the function here; you can find the code here.

First, we load the book data and then create the character data with the custom function.

This creates a new dictionary with the same information the book data had, but the keys are the characters and the values are the information about each character.
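While the tutorial’s exact function isn’t reproduced here, a minimal sketch of the idea might look like the code below. It assumes the .book structure described earlier: a "characters" list where each character has "mentions" (with "proper" names) and "agent"/"patient" lists of {"w": word} entries. Treat those field names, and the character_data name itself, as assumptions to verify against your own output.

```python
import json

with open("crime_and_punishment/crime_and_punishment.book") as f:
    book_data = json.load(f)

character_data = {}
for character in book_data["characters"]:
    proper_names = character.get("mentions", {}).get("proper", [])
    if not proper_names:
        continue  # skip characters with no proper name
    name = proper_names[0]["n"]  # most frequent proper-name mention
    character_data[name] = {
        "agent": [entry["w"] for entry in character.get("agent", [])],
        "patient": [entry["w"] for entry in character.get("patient", [])],
    }
```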

Here are some of the characters in our dictionary.

Next, we use the find_verb_usage function to create a new dictionary where the keys are “agent” and “patient”, and the values are all the verbs in the text and the characters involved.

Here are some of the verbs in agent.

Now let’s have some fun and see what characters are related to the verbs — “died”, “love”, and “lies”.
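Assuming the result of find_verb_usage is stored in a variable (call it verb_usage) with the shape described above, the lookup is a simple dictionary access:

```python
# Hypothetical lookup; verb_usage["agent"] is assumed to map
# each verb to the characters who performed it.
for verb in ["died", "love", "lies"]:
    print(verb, "->", verb_usage["agent"].get(verb, []))
```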

Now let's look at the verbs in patient.

And similarly, see what characters are related to the verbs — “stabbed”, “betray”, and “struck”.

If you’re interested, another chapter in the book shows how you can analyze the events in the .tokens file.

That’s all for this article! I hope you found this library interesting and will try it out on your favorite book!

Thanks for reading!

Links

Like this article? Here are three articles you may like:

Be sure to follow the bitgrit Data Science Publication to keep updated!

Want to discuss the latest developments in Data Science and AI with other data scientists? Join our discord server!

Follow Bitgrit below to stay updated on workshops and upcoming competitions!

Discord | Website | Twitter | LinkedIn | Instagram | Facebook | YouTube
