Zipf’s Law in Python

Chris Webb
Aug 6 · 4 min read
Image: Wikimedia

In this article I will write a project in Python to apply Zipf’s Law to analysing word frequencies in a piece of text, specifically Bram Stoker’s Dracula.

Zipf’s Law describes a probability distribution where each frequency is the reciprocal of its rank multiplied by the highest frequency. Therefore the second highest frequency is the highest multiplied by 1/2, the third highest is the highest multiplied by 1/3 and so on.
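As a quick illustration of that definition, here is a minimal sketch that generates the ideal frequencies for the first few ranks. The function name zipfian_frequencies and the top frequency of 1200 are my own choices for the example, not part of the project code.

```python
def zipfian_frequencies(top_frequency, count):
    """Return the ideal Zipfian frequency for ranks 1..count."""
    # Rank 1 keeps the top frequency; rank n gets top_frequency * (1 / n).
    return [top_frequency / rank for rank in range(1, count + 1)]


expected = zipfian_frequencies(1200, 4)
print(expected)  # [1200.0, 600.0, 400.0, 300.0]
```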

This is best illustrated with a graph.

[Graph illustrating the Zipfian distribution]

The Zipfian Distribution can be applied to many different areas including populations, incomes, company revenues and so on, as well as words in a piece of text as I mentioned above. The word frequencies of a single piece of text are unlikely to be a good fit — for that you really need a large and varied range of texts. However, applying it to a single piece of text is a simple way to demonstrate the distribution, and if we implement it in Python we can also see a few useful language techniques in action along the way.

The Problem

For each of the most frequent words in the text, I want to calculate the following:

  • The word frequency
  • The Zipfian fraction (1/1, 1/2, 1/3, etc.)
  • The Zipfian frequency
  • The difference between the actual and Zipfian frequency
  • The percentage difference between the two frequencies

I’ll be using a handful of lesser-known Python features, which I’ll list here as a sneak preview and then describe in more detail later on.

  • The string split method to split text into words
  • maketrans and translate to remove punctuation and digits
  • collections.Counter to count word frequencies

The project consists of the following two Python files, as well as dracula.txt, which contains the full text of Bram Stoker’s novel and serves as the input. This is the GitHub repository.

  • zipfslaw.py
  • zipfslawtest.py

This is the full code of zipfslaw.py.

generate_zipf_table

_remove_punctuation

Next we use the three-argument version of str.maketrans. This is a static string method which creates a translation table of character mappings. Here the first two arguments are empty strings, so no characters are replaced, while every character in the third argument is removed.

Finally we return the result of calling translate on the text with the translation table created in the previous line.
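The two steps above can be sketched as follows. This is my own minimal version rather than the listing from zipfslaw.py, and it assumes the characters to strip are ASCII punctuation and digits taken from the string module.

```python
import string


def remove_punctuation(text):
    """Delete punctuation and digits from text, leaving everything else unchanged."""
    # Assumption: strip ASCII punctuation and the digits 0-9.
    chars_to_remove = string.punctuation + string.digits
    # Empty first two arguments: no character is replaced by another.
    # Third argument: each character listed here is deleted by translate.
    table = str.maketrans("", "", chars_to_remove)
    return text.translate(table)


print(remove_punctuation("Chapter 1: Jonathan Harker's Journal."))
```

Note that string.punctuation only covers ASCII characters, so typographic quotes and dashes in a text file would need to be added to chars_to_remove separately.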

_top_word_frequencies
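A function with this name could be built from the split method and collections.Counter mentioned earlier. The sketch below is an assumption about how that might look, not the original listing. In particular, the parameter name top and the lower-casing of the text are my own choices.

```python
from collections import Counter


def top_word_frequencies(text, top):
    """Return the `top` most common words as (word, count) pairs, most frequent first."""
    # Assumption: lower-case first so that "The" and "the" are counted together.
    words = text.lower().split()
    # most_common returns (word, count) tuples sorted by count, descending.
    return Counter(words).most_common(top)


sample = "the count and the castle and the night"
print(top_word_frequencies(sample, 2))  # [('the', 3), ('and', 2)]
```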

_create_zipf_table

Looping through the list, we calculate the extra values for each word (the Zipfian fraction, the Zipfian frequency and the two differences) and append them as a dictionary to a new list.
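Here is a minimal sketch of such a loop. The dictionary keys are hypothetical names of my own, and I have assumed the percentage difference is measured relative to the Zipfian frequency. The original code may differ on both points.

```python
def create_zipf_table(frequencies):
    """Build one dictionary per word from (word, count) pairs sorted by count, descending."""
    if not frequencies:
        return []
    table = []
    top_frequency = frequencies[0][1]
    for rank, (word, actual) in enumerate(frequencies, start=1):
        zipf_fraction = 1 / rank
        zipf_frequency = top_frequency * zipf_fraction
        difference = actual - zipf_frequency
        table.append({
            "word": word,
            "actual_frequency": actual,
            "zipf_fraction": zipf_fraction,
            "zipf_frequency": zipf_frequency,
            "difference_actual": difference,
            # Assumption: percentage is relative to the Zipfian frequency.
            "difference_percent": 100 * difference / zipf_frequency,
        })
    return table


rows = create_zipf_table([("the", 100), ("and", 60)])
print(rows[1]["zipf_frequency"], rows[1]["difference_actual"])  # 50.0 10.0
```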

print_zipf_table

Now we can move on to trying out the module in zipfslawtest.py.

After some very mundane code to read a text file, we pass its contents to generate_zipf_table and then pass the result to print_zipf_table.

Now we can run the code with this command:

python3.8 zipfslawtest.py

This is the first part of the output.

[Image: first part of the program output]

As you can see, Dracula isn’t a good fit for the Zipfian Distribution, but then few individual texts are.

The code in this project uses a rather naive series of reciprocals of ranks, but more sophisticated methods of calculating the Zipfian probabilities might provide a better fit. However, when counting words these formulae require a value for the number of words in the language of the text. This is of course such a vague concept that it is pretty much impossible to arrive at a suitable value.
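To make that dependence concrete, here is a sketch of the standard normalised form of the distribution, where the probability of rank k is (1/k^s) divided by the sum of 1/n^s for n from 1 to N. The function name and the two vocabulary sizes are purely illustrative assumptions. Nothing here comes from the project code.

```python
def zipf_probabilities(vocabulary_size, exponent=1.0):
    """Return normalised Zipf probabilities for ranks 1..vocabulary_size."""
    # Unnormalised weight for each rank: 1 / rank ** exponent.
    weights = [1 / rank ** exponent for rank in range(1, vocabulary_size + 1)]
    # Dividing by the total makes the probabilities sum to 1.
    total = sum(weights)
    return [weight / total for weight in weights]


# The probability of the top-ranked word changes with the assumed vocabulary size.
for size in (1_000, 100_000):
    top = zipf_probabilities(size)[0]
    print(f"N = {size}: P(rank 1) = {top:.4f}")
```

Because every probability is divided by a total that grows with N, the expected frequencies shift whenever the assumed vocabulary size changes, which is why that value matters so much.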

This project has been tailored to calculating the Zipfian distribution for words in a piece of text. However, as I stated above, the distribution can be applied to many types of data and often with a better fit. The code in this post could be enhanced to be more general-purpose, creating a probability distribution from a list of any data type.

Explorations in Python
