In this article I will write a Python project that applies Zipf’s Law to word frequencies in a piece of text, specifically Bram Stoker’s Dracula.
Zipf’s Law describes a probability distribution where each frequency is the reciprocal of its rank multiplied by the highest frequency. Therefore the second highest frequency is the highest multiplied by 1/2, the third highest is the highest multiplied by 1/3 and so on.
This is best illustrated with a graph.
The Zipfian Distribution can be applied to many different areas including populations, incomes, company revenues and so on, as well as words in a piece of text as I mentioned above. The word frequencies of a single piece of text are unlikely to be a good fit — for that you really need a large and varied range of texts. However, applying it to a single piece of text is a simple way to demonstrate the distribution, and if we implement it in Python we can also see a few useful language techniques in action along the way.
For this project I will take the text of Dracula as a .txt file and count the frequencies of individual words. I’ll then create a data structure containing the top n words in descending order, together with the following pieces of information for each word:
- The word frequency
- The Zipfian fraction (1/1, 1/2, 1/3, etc.)
- The Zipfian frequency
- The difference between the actual and Zipfian frequency
- The percentage difference between the two frequencies
I’ll be using a handful of slightly lesser-known Python features which I’ll list here just to give a sneak preview, and then I’ll describe them in more detail later on.
- The string `split` method to split text into words
- The string `translate` method to remove punctuation and digits
- `collections.Counter` to count word frequencies
The project consists of the following two Python files, as well as dracula.txt which contains the full text of Bram Stoker’s novel which we’ll use as the input. This is the GitHub repository.
This is the full code of zipfslaw.py.
The core function, `generate_zipf_table`, takes a string and a `top` argument which specifies the maximum number of items to return frequencies for. (Data sets following the Zipfian Distribution will often have a long tail of very low frequencies which aren’t worth considering or trying to fit to the reciprocal of the rank.) It then generates and returns the data structure described in the bullet points above.
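Based on the description above, a minimal sketch of the core function might look like the following. It inlines the cleaning and counting steps which the article walks through individually; the dictionary keys are my own naming and won’t necessarily match the repository.

```python
import string
from collections import Counter

def generate_zipf_table(text, top):
    """Return Zipfian statistics for the `top` most frequent words in `text`."""
    # Remove punctuation and digits, lower-case, then split into words
    to_remove = string.punctuation + string.digits
    words = text.translate(str.maketrans("", "", to_remove)).lower().split()

    # (word, count) pairs for the `top` most frequent words, highest first
    frequencies = Counter(words).most_common(top)
    top_frequency = frequencies[0][1]

    zipf_table = []
    for rank, (word, frequency) in enumerate(frequencies, start=1):
        # Zipf's Law: the expected frequency is the top frequency times 1/rank
        zipf_frequency = top_frequency / rank
        zipf_table.append({
            "word": word,
            "actual_frequency": frequency,
            "zipf_fraction": 1 / rank,
            "zipf_frequency": zipf_frequency,
            "difference": frequency - zipf_frequency,
            "difference_percent": (frequency - zipf_frequency) / zipf_frequency * 100,
        })
    return zipf_table
```

Calling `generate_zipf_table(text, 3)` on a short string returns a three-item list of dictionaries, one per word, in descending order of frequency.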
A short and simple but quite interesting function. It first creates a string containing all the characters we want to remove, basically punctuation plus numbers 0 to 9. It’s not very sophisticated and I have to admit that this whole project is really only suitable for use with ASCII-only text, but it works with Dracula.
Next we use the three-argument version of `str.maketrans`. This is a static string method which creates a table of character replacement mappings. Here the first two arguments are empty strings, meaning no characters will be replaced, while the characters in the third argument are to be removed.
Finally we return the result of calling `translate` on the text with the translation table created in the previous line.
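The technique can be seen in isolation with a short snippet (a sketch of the idea, not the exact code from the repository):

```python
import string

# All the characters we want to delete: punctuation plus digits 0 to 9
to_remove = string.punctuation + string.digits

# Three-argument maketrans: the first two arguments (the replacement
# mapping) are empty, and every character in the third argument is
# mapped to None, i.e. deleted by translate
table = str.maketrans("", "", to_remove)

print("It's 1897, Jonathan!".translate(table))  # → Its  Jonathan
```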
I have gone a bit over the top with comments in this function just to make it clear how the rather less well known bits of code actually work. Firstly we split the text into a list of words and then use that list to construct a `Counter`. We then call the `most_common` method to get a sorted list of the top words.
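In miniature, the counting step works like this (a sketch with a toy sentence rather than the novel):

```python
from collections import Counter

text = "the cat and the hat and the bat"

# split() with no arguments splits on any run of whitespace
words = text.split()

# Counter builds a word -> count mapping directly from the list
counter = Counter(words)

# most_common(n) returns (word, count) pairs sorted by descending count
print(counter.most_common(2))  # → [('the', 3), ('and', 2)]
```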
We then take the list generated by the previous function and use it as the basis of a new list containing all the additional Zipfian Distribution stuff.
Within a loop through the list we calculate all the various extra values before adding them as a dictionary to a new list.
This final function simply takes the data structure from `_create_zipf_table` and prints it in a neat table. The formatting string is long and unwieldy so I have assigned it to a separate variable, not something I have to do often!
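A sketch of the printing function, assuming the dictionary keys from the sketches above (the column widths and header text are my own choices, not the repository's):

```python
def print_zipf_table(zipf_table):
    """Print the Zipf table in neat aligned columns."""
    # The formatting string is long and unwieldy, so it gets its own variable
    row_format = "| {:>4} | {:<12} | {:>10} | {:>8.4f} | {:>12.2f} | {:>12.2f} | {:>9.2f}% |"
    print("| Rank | Word         | Actual     | Fraction | Zipf freq    | Difference   | Percent    |")
    for rank, row in enumerate(zipf_table, start=1):
        print(row_format.format(rank,
                                row["word"],
                                row["actual_frequency"],
                                row["zipf_fraction"],
                                row["zipf_frequency"],
                                row["difference"],
                                row["difference_percent"]))
```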
Now we can move on to trying out the module in zipfslawtest.py.
After some very mundane code to read a text file we pass it to `generate_zipf_table` and then pass the result to the table-printing function.
Now we can run the code with this command:
This is the first part of the output.
As you can see Dracula isn’t a good fit for the Zipfian Distribution — few individual texts are.
The code in this project uses a rather naive series of reciprocals of ranks, but more sophisticated methods of calculating the Zipfian probabilities might provide a better fit. However, when counting words these formulae require a value for the number of words in the language of the text. This is of course such a vague concept that it is pretty much impossible to arrive at a suitable value.
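As an illustration of why that matters, the classic Zipf probability mass function normalises 1/rank^s by a generalised harmonic number over the vocabulary size, so you can't compute it without committing to a value for `n_words` (a sketch; the exponent s is typically close to 1 for natural language):

```python
def zipf_probability(rank, n_words, s=1.0):
    """Probability of the word at `rank` under a Zipf distribution
    over a vocabulary of `n_words` words with exponent `s`."""
    # Normalise 1/rank^s by the generalised harmonic number H(n_words, s)
    # so that the probabilities over all ranks sum to 1
    harmonic = sum(1 / k ** s for k in range(1, n_words + 1))
    return (1 / rank ** s) / harmonic

# The result depends heavily on the choice of n_words, which for a
# natural language is not well defined
print(zipf_probability(1, 1_000))
print(zipf_probability(1, 100_000))
```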
This project has been tailored to calculating the Zipfian distribution for words in a piece of text. However, as I stated above, the distribution can be applied to many types of data and often with a better fit. The code in this post could be enhanced to be more general-purpose, creating a probability distribution from a list of any data type.