From Pages to Networks: A Graph Theory Analysis of ‘Little Women’

Ria Kulkarni
10 min read · Jun 24, 2024


When I was in my fifth semester at college, I picked an elective I was initially not very keen on: Graph Theory and its Applications. Mathematics had never been my strong suit, and this course promised to be particularly challenging. Yet, seeing most of my peers sign up, I decided to follow suit.

As the semester progressed, I struggled to keep up. Correlating the concepts I learned in class to the real world proved to be difficult. Then, midway through this course that I was starting to regret, I was informed that we were required to submit a project analyzing the social relationships between characters in a book using graph theory.

Books. Now, that’s something I could do. As an avid reader, I realized this project would let me bridge the gap between my passion for literature and a course I had, so far, not enjoyed very much.

Now, you can choose your own novel and download it from whatever source you want, but I chose ‘Little Women’ for some very specific reasons (though indeed, I do love this book). The book is set in the early 1860s and follows the four March sisters on their journey into womanhood; it has a variety of supporting characters that enrich the social dynamics to be analyzed. It also spans several years, which adds temporal depth to the analysis by modelling evolving relationships.

Lastly, since it is now in the public domain, I downloaded a .txt file of the book from Project Gutenberg (https://www.gutenberg.org/ebooks/514) to use as the dataset.

Code Walkthrough

At its core, this project aims to uncover the web of relationships between characters in ‘Little Women’ by transforming the narrative into a mathematical structure known as a graph. In this context, characters are modelled as nodes and their interactions form edges, creating a complex network that represents the social fabric of the novel.
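To make the nodes-and-edges model concrete before diving into the full pipeline, here is a minimal sketch (with hypothetical co-occurrences, not results from the novel): each character becomes a node, and each time two characters appear together an edge is created or its weight increased.

```python
import networkx as nx

G = nx.Graph()
# Two characters appearing together in a paragraph creates a weighted edge
G.add_edge("Jo", "Laurie", weight=1)
G.add_edge("Jo", "Beth", weight=1)
# A second co-occurrence strengthens the existing edge rather than duplicating it
G["Jo"]["Laurie"]["weight"] += 1

print(G.number_of_nodes(), G.number_of_edges())  # 3 2
print(G["Jo"]["Laurie"]["weight"])               # 2
```

Edge weights therefore encode how often a pair shares the page, which is the signal all the later centrality and community analysis builds on.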

First, let’s import all the necessary Python libraries.

import networkx as nx
import matplotlib.pyplot as plt
import pandas as pd
import math
%matplotlib inline
import re
import nltk
nltk.download('punkt')
from nltk.tokenize import sent_tokenize
from networkx.algorithms import centrality
from networkx.algorithms import community

Then, we’ll open the text file in our working directory and print it to verify that the full book has loaded.

file_name = 'littlewomen.txt'
with open(file_name, 'r', encoding='utf-8') as file:
    text = file.read()
print(text)
1. Preprocessing:

Before we move on to the real nitty-gritty parts, let’s perform some basic preprocessing to ensure that there is no noisy data that will disrupt the social graph modelling such as headings or other texts that Project Gutenberg might have inserted.

def preprocess(text):
    header_pattern = re.compile(r'\*\*\* START OF (THE|THIS) PROJECT GUTENBERG EBOOK .* \*\*\*|\*\*\* END OF (THE|THIS) PROJECT GUTENBERG EBOOK .* \*\*\*|This eBook is for the use of anyone anywhere in the United States and most other parts of the world at no cost and with almost no restrictions whatsoever.')
    copyright_pattern = re.compile(r'\(C\) \d{4} .* Project Gutenberg')
    text = re.sub(header_pattern, '', text)
    text = re.sub(copyright_pattern, '', text)
    start_pattern = re.compile(r'Updated editions will replace the previous one—the old editions will.*?START: FULL LICENSE', re.DOTALL)
    end_pattern = re.compile(r'Section 5. General Information About Project Gutenberg™ electronic works', re.DOTALL)
    start_match = start_pattern.search(text)
    end_match = end_pattern.search(text)
    if start_match and end_match:
        text = text[:start_match.start()] + text[end_match.end():]
    header_pattern2 = re.compile(r'The Project Gutenberg eBook of Little Women.*?Language: English', re.DOTALL)
    header_match = header_pattern2.search(text)
    if header_match:
        text = text[header_match.end():]
    return text

text = preprocess(text) # print and check

2. Extracting Character Names & Social Graph:

There are about 25 total characters that impact the trajectory of the narrative, and I have listed them in my Python notebook. Naturally, the next step is extracting these character names and mapping each nickname and alias to a single canonical name.

def extract(text):
    pattern = re.compile(r'\b(?:Mr\.|Mrs\.|Miss|Dr\.) [A-Z][a-z]+\b')
    names = set(re.findall(pattern, text))
    return names

names = extract(text)
print("List of Names: ", names)

character_list = [
"Meg", "Jo", "Beth", "Amy", "Laurie",
"Marmee", "Bhaer", "Mr. March", "Mr. Brooke", "Mr. Laurence",
"Hannah", "Aunt March", "Daisy", "Demi", "Mrs. Kirke", "Kate Vaughn",
"Sallie Gardiner", "Aunt Carrol", "Florence", "Fred Vaughn", "Esther",
"Annie Moffat", "Ned Moffat", "Frank Vaughn", "Grace Vaughn", "Dr. Bangs",
"The Hummels"
]

equivalent_characters = {
'Miss March': 'Jo',
'Miss Josephine':'Jo',
'Miss Jo':'Jo',
'Miss Amy': 'Amy',
'Miss Beth': 'Beth',
'Mr. Laurie': 'Laurie',
'Mr. Laurence': 'Laurie',
'Mr. Bhaer': 'Bhaer',
'Frederick Bhaer':'Bhaer',
}

character_list = [equivalent_characters.get(char, char) for char in character_list]

sentences = sent_tokenize(text)

def extract_characters_from_paragraph(paragraph):
    return [char for char in character_list if char in paragraph]

Once we’ve stored the names of the characters in a list, we can now extract a social graph from the text.

from itertools import combinations

G = nx.Graph()
for paragraph in text.split('\n\n'):
    chars_in_paragraph = extract_characters_from_paragraph(paragraph)
    # combinations() yields each unordered pair once, so a pair's weight
    # is incremented exactly once per shared paragraph
    for char1, char2 in combinations(chars_in_paragraph, 2):
        if not G.has_edge(char1, char2):
            G.add_edge(char1, char2, weight=1)
        else:
            G[char1][char2]['weight'] += 1

plt.figure(figsize=(14, 12))
pos = nx.spring_layout(G, seed=42)
nx.draw(G, pos, with_labels=True, font_size=8, node_color='skyblue', node_size=800,
font_color='black', font_weight='bold', edge_color='gray', linewidths=0.5)
plt.title("Social Graph Visualization")
plt.show()

Running this code block produces the social graph, constructed using the co-occurrence approach with the NetworkX library. The code iterates through paragraphs of the text, identifying the characters present in each. For every pair of characters appearing in the same paragraph, an edge is added to the graph, or its weight is increased if the edge already exists. This method assumes that characters mentioned together are likely interacting or related in some way.

The resulting graph is then visualized using matplotlib, with characters as nodes and their co-occurrences as weighted edges. The spring layout algorithm is used to position nodes, creating a visual representation where more frequently co-occurring characters are placed closer together. This approach effectively captures the network of relationships in the novel, with central characters having more connections and peripheral characters appearing at the edges of the graph.

Refer to my Python notebook for implementations of other types of social graphs (Erdos-Renyi, Small World, Preferential Attachment).

3. Character Centrality Analysis:

No analysis is complete without centrality, i.e., a measure of how important a character is to the narrative. Let us now identify the top 4 protagonists and discuss the different centrality measures used to arrive at that conclusion.

degree_centrality = nx.degree_centrality(G)
top4 = sorted(degree_centrality, key=degree_centrality.get, reverse=True)[:4]

print("Top 4 Protagonists: ", top4)

plt.figure(figsize=(12, 10))
nx.draw(G, pos, with_labels=True, font_size=8, node_color='skyblue', node_size=800,
font_color='black', font_weight='bold', edge_color='gray', linewidths=0.5)

nx.draw_networkx_nodes(G, pos, nodelist=top4, node_color='orange', node_size=800)

plt.title("Social Graph Visualization with Top 4 Protagonists")
plt.show()

The resulting graph clearly shows that our top 4 protagonists are: Jo, Meg, Beth and Laurie. We now calculate the centralities for each.

main_protagonists = ['Meg', 'Jo', 'Beth', 'Laurie']

degree_centrality = centrality.degree_centrality(G)
betweenness_centrality = centrality.betweenness_centrality(G)
closeness_centrality = centrality.closeness_centrality(G)
pagerank_centrality = nx.pagerank(G)

for protagonist in main_protagonists:
    print(f"Centrality measures for {protagonist}:")
    print(f"Degree Centrality: {degree_centrality.get(protagonist, 0)}")
    print(f"Betweenness Centrality: {betweenness_centrality.get(protagonist, 0)}")
    print(f"Closeness Centrality: {closeness_centrality.get(protagonist, 0)}")
    print("--------------------------")

The protagonists’ centrality measures offer insights into their roles and significance within the social network depicted in the novel.

From the above graphs, it is evident that the top 4 protagonists are Jo March, Beth March, Laurie Laurence, and Meg March. The fourth March sister (Amy) appears less influential due to her lower co-occurrence rate with the other characters. Jo March stands out as a central character due to her high degree centrality, reflecting her extensive connections with other characters. Her notable closeness centrality suggests that she is closely linked to various characters in the network.

Beth, Meg, and Laurie have similar centrality measures, suggesting comparable importance in the social structure.
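To build intuition for why these measures can disagree, here is a toy illustration on synthetic graphs (not the novel’s data): the hub of a star maximizes degree centrality, while the middle node of a path maximizes betweenness because it lies on every shortest path between the two halves.

```python
import networkx as nx

# Star: node 0 is connected to nodes 1..4, so it has the highest degree.
star = nx.star_graph(4)
deg = nx.degree_centrality(star)
print(max(deg, key=deg.get))   # 0, the hub

# Path: 0-1-2-3-4. Node 2 sits on every shortest path between the two halves,
# so it has the highest betweenness despite having only two neighbours.
path = nx.path_graph(5)
btw = nx.betweenness_centrality(path)
print(max(btw, key=btw.get))   # 2, the middle node
```

A character can therefore rank highly on one measure and modestly on another, which is why it pays to report several.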

4. Ego Networks and Local Clustering:

Ego networks are a way of looking at a larger network from the perspective of a single individual. Let us now look at the ego networks of the protagonists.

for protagonist in main_protagonists:
    plt.figure()
    ego_network = nx.ego_graph(G, protagonist)
    node_color = ['skyblue' if node != protagonist else 'red' for node in ego_network.nodes()]

    pos = nx.spring_layout(ego_network)
    nx.draw(ego_network, pos, with_labels=True, font_weight='bold', node_color=node_color, node_size=500)

    plt.title(f"Ego Network for {protagonist}")
    plt.show()

The local clustering coefficient, in the context of graphs, tells us how tightly knit the connections are around a specific node (here, a character) in a network.
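As a quick sanity check of what the coefficient measures, consider two extremes on toy graphs (separate from the novel’s data): a triangle is perfectly clustered, while the hub of a star has no clustering at all.

```python
import networkx as nx

# Triangle: every pair of a node's neighbours is itself connected.
triangle = nx.complete_graph(3)
print(nx.average_clustering(triangle))   # 1.0

# Star: the hub's neighbours share no edges with each other.
star = nx.star_graph(3)
print(nx.clustering(star, 0))            # 0.0
```

A protagonist’s ego network will fall somewhere between these extremes, and that is exactly what the next code block computes.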

for protagonist in main_protagonists:
    try:
        ego_network = nx.ego_graph(G, protagonist)
        local_clustering_coefficient = nx.average_clustering(ego_network)

        print(f"Ego network and local clustering coefficient for {protagonist}:")
        print(f"Nodes in the ego network: {ego_network.nodes()}")
        print(f"Edges in the ego network: {ego_network.edges()}")
        print(f"Local Clustering Coefficient: {local_clustering_coefficient}")
        print("--------------------------")

    except nx.NetworkXError as e:
        print(f"Error for {protagonist}: {e}")

The results of these code blocks show that each character (Meg, Jo, Beth and Laurie) has a network of friends and acquaintances (an ego network). A high clustering coefficient within an ego network suggests that the character’s friends are likely to be friends with each other as well, indicating close-knit social circles for all of them. Interestingly, Laurie has the highest clustering coefficient, meaning his friends form the most tightly connected group among the four.

5. Community Detection:

Community detection is a technique used to identify groups (communities) within networks. There are various approaches to community detection, as listed below.

  1. Clique Percolation: It focuses on finding small, fully connected groups and merging them iteratively. It’s good for large networks and overlapping communities but finding the right group size can be tricky.
  2. Girvan-Newman Algorithm: This takes the opposite approach, removing weak connections to split the network into communities. It excels at finding clear-cut communities but can be slow for massive networks and might miss overlapping ones.
  3. Louvain Modularity: It aims to optimize a score based on community quality. It’s fast and efficient for large networks but may struggle with overlapping communities and can be influenced by how the network is initially set up.

The beauty of Python is that it allows us to abstract all this logic into a single line of code. Let us now apply each algorithm to further analyze the communities in ‘Little Women’.

clique_percolation_communities = list(community.k_clique_communities(G, 3))
print("Clique Percolation Method Communities:", clique_percolation_communities)

gn_communities = list(nx.algorithms.community.girvan_newman(G))
print("Girvan Newman Communities:", gn_communities)

louvain_communities = list(community.louvain_communities(G))
print("Louvain Communities:", louvain_communities)

As you will see, the CPM algorithm has identified groups of characters that are more densely connected to each other than to the rest of the network. In the context of ‘Little Women’, these communities represent cohesive groups of characters who share significant interactions and relationships within the story.

Likewise, the Girvan-Newman algorithm has identified certain communities that consist of single characters. This tells us that these characters (Kate Vaughn, Fred Vaughn, Mrs. Kirke, Ned Moffat, Sallie Gardiner, Dr. Bangs, Aunt Carrol, Esther and Florence) are not very influential and do not have strong interconnected relationships with the other characters and protagonists of the novel, due to their low co-occurrence rates.

That completes the walkthrough of the code. Let us now perform an analysis and interpretation on the obtained results.

Find this code on my GitHub here: https://github.com/aluminates/Little-Women-Social-Graph

Realism Assessment

The social graph extracted from Little Women is a representation of the relationships between characters in a fictional narrative. Therefore, it is inherently based on fiction rather than real-world interactions.

The high local clustering coefficients observed in the ego networks suggest that, similar to real-world social networks, individuals within the same character’s social circle are likely to be interconnected. This reflects the tendency of individuals in both fictional and real-world settings to form cohesive groups.

The ego networks of different characters include a mix of relationships with various individuals, mirroring the diversity seen in real-world social networks where individuals often interact with people from different parts of their lives.

The degree centrality of characters indicates their prominence and the number of connections they have. In real social networks, individuals with high degree centrality are often influential. In ‘Little Women’, characters like Meg, Jo, Beth, and Laurie have high degree centrality, suggesting their importance in the story.

The identified communities through methods like the Clique Percolation Method, Girvan-Newman and Louvain suggest groupings of characters that share strong connections. While these communities are fictional, they capture the essence of real-world social groups and affiliations.

However, one caveat to consider is that real-world social networks are constantly evolving as people form new relationships and existing ones change over time. The social graph in a fictional work, however, is static, capturing a snapshot of the relationships at a specific point in the story.

Despite this limitation, the social graph remains a valuable tool for analyzing the social dynamics within a fictional world. Thus, we can conclude that the social graph enhances the storytelling realism by incorporating elements reminiscent of genuine social networks.
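One way to make this kind of realism check concrete is to compare a real social network’s clustering against an Erdos-Renyi random graph of the same density; real networks are typically far more clustered. Sketched here on the Zachary karate-club graph, a standard stand-in for a real social network (the novel’s graph itself depends on the earlier cells):

```python
import networkx as nx

real = nx.karate_club_graph()
n, m = real.number_of_nodes(), real.number_of_edges()
p = 2 * m / (n * (n - 1))                      # matched edge density
random_graph = nx.gnp_random_graph(n, p, seed=42)

# A random graph's clustering is roughly p; real networks sit well above that.
print(f"Real network clustering:  {nx.average_clustering(real):.3f}")
print(f"Random graph clustering:  {nx.average_clustering(random_graph):.3f}")
```

The same comparison can be run against the ‘Little Women’ graph in place of the karate-club network.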

Conclusion & Future Work

Alright, so there you have it!

While I was initially intimidated by Graph Theory, going from textbook formulas to analyzing friendships in ‘Little Women’ was when it clicked for me. Seeing Jo’s central role quantified by degree centrality, or the tight-knit connections between the March sisters visualized in their ego networks, was a revelation. Since then, I’ve gained a newfound appreciation for the subject: it’s not just abstract math, it’s a powerful tool for understanding connections.

To wrap things up, let’s do a quick rundown of what we’ve covered:

  1. We identified key characters based on their centrality in the network.
  2. We uncovered communities or cliques within the novel’s social structure using techniques like CPM, GN, etc.
  3. We analyzed the strength and nature of relationships between characters through ego networks and the local clustering coefficient.
  4. Lastly, we compared the novel’s social network to common network models in real-world systems through a realism assessment.

Future work on this project could expand in some exciting directions. For example, one could incorporate sentiment analysis to gauge the emotional tone of character interactions and implement temporal analysis to track how relationships evolve throughout the novel.
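The temporal idea can be sketched in a few lines: split the text on chapter headings and build one co-occurrence graph per segment, so an edge’s appearance and weight can be tracked across the story. The text and character names below are placeholders, not excerpts from the novel.

```python
import re
import networkx as nx

# Placeholder text standing in for the full novel
text = (
    "CHAPTER ONE\nJo and Meg talked. Beth listened.\n\n"
    "CHAPTER TWO\nJo wrote to Laurie. Laurie replied to Jo.\n\n"
)
characters = ["Jo", "Meg", "Beth", "Laurie"]

# Split on chapter headings; drop anything before the first chapter
chapters = re.split(r'CHAPTER [A-Z]+\n', text)[1:]

graphs = []
for chapter in chapters:
    G = nx.Graph()
    present = [c for c in characters if c in chapter]
    # Connect every pair of characters present in this chapter
    for i, a in enumerate(present):
        for b in present[i + 1:]:
            G.add_edge(a, b)
    graphs.append(G)

print([sorted(g.nodes()) for g in graphs])
```

Comparing the per-chapter graphs then shows which relationships appear, strengthen, or fade as the novel progresses.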

Thus, this concludes a simple computational literary analysis of one of the world’s most beloved classics, leveraging graph theory, NetworkX and NLTK. And if this inspired you to read the book again… there’s no such thing as too many times.


Ria Kulkarni

A computer science undergrad and an avid reader, trying to find her place in the world.