Zipf’s Law: Word Frequencies in Alexandre Dumas’s Works

Peter Oliver Caya
Apr 1, 2017

As a mini-project combining my exercises from Data Science at the Command Line with my studies of French, I thought I would adapt one of the script examples from the book into a quick analysis inspired by Zipf’s law.

As always, I saved the code to a GitHub repository here. It assumes that you can run bash on your machine and that you have R with the ggplot2 package installed.

Introducing Zipf’s Law

You may know Zipf’s law as the linguistic cousin of the popular “80–20 rule”: in any large body of text, a small number of words account for a disproportionately large share of all word occurrences. It is related to the Pareto distribution and to the larger family of probability distributions known as power laws; you can read more about the connection here.
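Stated a bit more precisely, the usual formulation of Zipf’s law says that if you rank the distinct words of a text from most to least frequent, the frequency f(r) of the word at rank r falls off roughly as the reciprocal of the rank:

f(r) \approx \frac{C}{r^{s}}, \qquad s \approx 1

where C is a normalizing constant. Taking logarithms, log f(r) ≈ log C − s·log r, so on log–log axes the rank–frequency relationship should look approximately like a straight line with slope close to −1. That straight-line pattern is the usual visual check for Zipf behaviour.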

Unsurprisingly, the word frequencies for the collected works of Alexandre Dumas appear to follow the predictions made by Zipf’s law! In this blog post, I’ll show you how to use bash and R to perform this simple analysis.

Step 1: Downloading the Data

Getting the data was relatively simple. To get a reasonably complete set of Alexandre Dumas’s works in plain-text format, I used Project Gutenberg. I made a list of the URLs pointing to the text version of each of the listed works by Dumas; you can get the list here.

From here, I made a shell script to download each of the text files and then append them to a single text file:


#!/usr/bin/env bash
# Read the list of Project Gutenberg URLs one line at a time and
# append the text of each book to a single compiled file.

filename=dumas.txt
while read -r line; do
  curl -s "$line" >> dumas_compiled.txt
done < "$filename"

A note: Project Gutenberg adds a header at the beginning and a footer at the end of each text, so this analysis is not a perfect representation of the author’s words, but for the purposes of this short exercise it’s good enough.

Step 2: Summarizing the Word Frequencies

Next, I converted the text to lower case, counted the occurrences of each word, and sorted by frequency. The 1,000 most frequent words are saved to the res.txt file:

cat dumas_compiled.txt | tr '[:upper:]' '[:lower:]' |   # lowercase the text
grep -oE '\w+' |         # split into one word per line
sort |                   # group identical words together
uniq -c |                # count each distinct word
sort -nr |               # sort by count, largest first
head -n 1000 > res.txt   # keep the 1,000 most frequent words

Step 3: Getting a Graphical Representation of Commonly Used Words

Finally, it’s just a matter of putting together a couple of nice graphics summarizing these word counts:

library(ggplot2)
# Read the word counts from the shell pipeline: V1 is the count, V2 the word.
words <- read.table(file = "res.txt", sep = "")
words$V2 <- as.character(words$V2)
words$V1 <- as.numeric(words$V1)
# Order by descending count, and keep words longer than five characters
# so the bar chart isn't dominated by articles and prepositions.
words <- words[order(-words$V1), ]
bigger_words <- words[which(nchar(words$V2) > 5), ]
# Distribution of the counts of the 200 most frequent words.
ggplot(data = head(words, 200)) + geom_histogram(aes(x = V1), fill = "#003300", bins = 200) +
  ggtitle("Distribution of Top 200 Words in Dumas") + xlab("Word Count") + ylab("") + theme_minimal()
# Bar chart of the 50 most frequent longer words.
ggplot(data = head(bigger_words, 50)) + geom_bar(aes(x = reorder(x = V2, V1), y = V1), fill = "#003300", stat = "identity") +
  coord_flip() + xlab("") + ylab("") +
  ggtitle("Frequencies of Top 50 Words in Dumas") +
  theme_minimal()
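
As a final sanity check on the Zipf claim itself, it’s worth plotting log frequency against log rank; under Zipf’s law the points should lie roughly on a straight line with slope near −1. The snippet below is a minimal sketch of that check, not part of the original script: it assumes the words data frame built above (already sorted by descending count) and estimates the slope with an ordinary linear fit.

# Quick visual check of Zipf's law: log10(frequency) vs. log10(rank).
# Assumes `words` from the previous block, sorted by descending count.
words$rank <- seq_len(nrow(words))
ggplot(data = words) +
  geom_point(aes(x = log10(rank), y = log10(V1)), colour = "#003300") +
  xlab("log10(rank)") + ylab("log10(frequency)") +
  ggtitle("Rank vs. Frequency in Dumas") +
  theme_minimal()
# Under Zipf's law the fitted slope should be close to -1.
zipf_fit <- lm(log10(V1) ~ log10(rank), data = words)
coef(zipf_fit)

If the points look roughly linear and the estimated slope sits near −1, that is the Zipf signature described in the introduction; some curvature at the very top and bottom of the ranking is common in practice.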
