# Zipf’s Law in Python

Aug 6

In this article I will build a Python project that applies Zipf’s Law to word frequencies in a piece of text, specifically Bram Stoker’s Dracula.

Zipf’s Law describes a probability distribution where each frequency is the reciprocal of its rank multiplied by the highest frequency. Therefore the second highest frequency is the highest multiplied by 1/2, the third highest is the highest multiplied by 1/3 and so on.
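A quick worked example, with a made-up top frequency purely for illustration:

```python
top_frequency = 8000  # hypothetical count for the most frequent word

# Zipf's Law predicts frequency = top_frequency * (1 / rank).
predicted = [top_frequency * (1 / rank) for rank in range(1, 5)]
print([round(f, 1) for f in predicted])  # [8000.0, 4000.0, 2666.7, 2000.0]
```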

This is best illustrated with a graph.

The Zipfian Distribution can be applied to many different areas including populations, incomes, company revenues and so on, as well as words in a piece of text as I mentioned above. The word frequencies of a single piece of text are unlikely to be a good fit — for that you really need a large and varied range of texts. However, applying it to a single piece of text is a simple way to demonstrate the distribution, and if we implement it in Python we can also see a few useful language techniques in action along the way.

# The Problem

For each of the most frequent words in the text we will calculate:

• The word frequency
• The Zipfian fraction (1/1, 1/2, 1/3 etc.)
• The Zipfian frequency
• The difference between the actual and Zipfian frequency
• The percentage difference between the two frequencies

I’ll be using a handful of slightly lesser-known Python features, listed here as a sneak preview; I’ll describe each in more detail later on.

• The string `split` method to split text into words
• `maketrans` and `translate` to remove punctuation and digits
• `collections.Counter` to count word frequencies
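A minimal demonstration of these three features working together (the sample sentence is my own, not from the book):

```python
from collections import Counter
import string

text = "The night is dark, and the night is long."

# Build a translation table that deletes punctuation and digits,
# then split the lowercased text into words on whitespace.
table = str.maketrans("", "", string.punctuation + string.digits)
words = text.translate(table).lower().split()

# Counter tallies how often each word occurs.
frequencies = Counter(words)
print(frequencies.most_common(2))  # [('the', 2), ('night', 2)]
```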

The project consists of the following two Python files, as well as dracula.txt, which contains the full text of Bram Stoker’s novel and serves as the input. This is the GitHub repository.

• zipfslaw.py
• zipfslawtest.py

This is the full code of zipfslaw.py.

## _remove_punctuation

Next we use the three-argument version of `str.maketrans`. This static string method builds a translation table of character mappings; here the first two arguments are empty strings, so no characters are replaced, while every character in the third argument is removed.

Finally we return the result of calling `translate` on the text with the translation table created in the previous line.
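The listing itself isn’t reproduced above, but a function along the lines just described might look like this (a sketch, not the article’s exact code; the name `_remove_punctuation` comes from the section heading):

```python
import string


def _remove_punctuation(text):
    # Three-argument maketrans: nothing is mapped (two empty strings),
    # and every character in the third argument is deleted.
    table = str.maketrans("", "", string.punctuation + string.digits)
    return text.translate(table)


print(_remove_punctuation("Chapter 1: Jonathan Harker's Journal."))
# Chapter  Jonathan Harkers Journal
```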

## _create_zipf_table

Looping through the list of word/frequency pairs, we calculate the various extra values for each entry and append them as a dictionary to a new list.
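Again the listing is omitted above; assuming the function receives a list of `(word, frequency)` pairs sorted highest-first (as `Counter.most_common` returns them), the calculation might look roughly like this (the dictionary keys are my own naming, not necessarily the article’s):

```python
def _create_zipf_table(frequencies):
    # frequencies: list of (word, frequency) pairs, highest first.
    top_frequency = frequencies[0][1]
    table = []
    for rank, (word, frequency) in enumerate(frequencies, start=1):
        zipf_frequency = top_frequency * (1 / rank)
        difference = frequency - zipf_frequency
        table.append({
            "word": word,
            "frequency": frequency,
            "fraction": 1 / rank,
            "zipf_frequency": zipf_frequency,
            "difference": difference,
            "difference_percent": difference / zipf_frequency * 100,
        })
    return table
```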

## print_zipf_table
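The listing for this function isn’t reproduced above, but given the columns described in The Problem section, a minimal version is simply a formatted loop over the table rows (a sketch assuming each row is a dictionary with the keys shown; the article’s exact layout may differ):

```python
def print_zipf_table(table):
    # Header row; column widths are chosen arbitrarily for alignment.
    print(f"{'Word':<14}{'Frequency':>10}{'Fraction':>10}"
          f"{'Zipf freq':>11}{'Diff':>10}{'Diff %':>9}")
    for row in table:
        print(f"{row['word']:<14}"
              f"{row['frequency']:>10}"
              f"{row['fraction']:>10.3f}"
              f"{row['zipf_frequency']:>11.1f}"
              f"{row['difference']:>10.1f}"
              f"{row['difference_percent']:>9.1f}")
```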

Now we can move on to trying out the module in zipfslawtest.py.

After some very mundane code to read a text file, we pass the text to `generate_zipf_table` and then pass the result to `print_zipf_table`.

Now we can run the code with this command:

`python3.8 zipfslawtest.py`

This is the first part of the output.

As you can see, Dracula isn’t a good fit for the Zipfian Distribution; few individual texts are.

The code in this project uses a rather naive series of reciprocals of ranks, but more sophisticated methods of calculating the Zipfian probabilities might provide a better fit. However, when counting words these formulae require a value for the number of words in the language of the text. This is of course such a vague concept that it is pretty much impossible to arrive at a suitable value.
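One widely used refinement is the normalised form of the distribution, in which the probability of the word at rank k, out of a vocabulary of N words, is (1/k^s) divided by the generalised harmonic number over all N ranks; this is exactly where the troublesome "number of words in the language" value comes in. A sketch of that formula (not part of the project's code):

```python
def zipf_probability(rank, n_words, s=1.0):
    # Probability of the word at `rank` under a normalised Zipf
    # distribution over a vocabulary of `n_words` words, with
    # exponent `s` (s = 1 gives the simple reciprocal-of-rank form).
    harmonic = sum(1 / k**s for k in range(1, n_words + 1))
    return (1 / rank**s) / harmonic
```

The probabilities over all N ranks sum to 1, which the raw reciprocals used above do not.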

This project has been tailored to calculating the Zipfian distribution for words in a piece of text. However, as I stated above, the distribution can be applied to many types of data and often with a better fit. The code in this post could be enhanced to be more general-purpose, creating a probability distribution from a list of any data type.
