Exploring one of the most enigmatic mathematical laws through the lens of data science

Deval Shah
6 min read · Sep 15, 2017

Photo by Maciej Bledowski / Shutterstock

You might be wondering what this mysterious law could be. It exists everywhere around us, yet most of us are unaware of it.

Read the article to the end and, believe me, you will learn something really cool!

There is no rocket science required to understand this law, though the way this law appears time and again in varied independent domains is quite remarkable.

The law is immensely simple and is expressed by an elegant mathematical equation, yet it describes one of the most important recurring patterns in human history.

.

.

.

Zipf’s Law

Zipf’s Law was proposed by the linguist George Kingsley Zipf to describe a pattern that appears in language.

Technical description:

Given a large corpus of natural language, the frequency of any word is inversely proportional to its rank in the frequency table.

[Frequency is the number of times a word appears in a given sample. For example, given a text file, the frequency of a word is the number of times that word appears in the file.]

The law was proposed after this pattern was seen again and again in natural language corpora.

Let us understand what the above description is trying to say:

  1. Suppose the most frequent word in a corpus has frequency f
  2. The second most frequent word would have frequency roughly f/2
  3. The third most frequent word would have frequency roughly f/3
  4. The fourth most frequent word would have frequency roughly f/4

and so on.

In general, Frequency ∝ 1/Rank, that is, a word’s frequency is proportional to 1 divided by its rank.
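To make the list above concrete, here is a minimal Python sketch of the idealized relationship. The function name `zipf_expected_counts` and the top count of 1200 are made up for illustration; real corpora only follow these numbers approximately.

```python
def zipf_expected_counts(top_count, ranks):
    """Idealized Zipf counts for ranks 1..ranks, given the count of the top word."""
    return [top_count / rank for rank in range(1, ranks + 1)]


# Hypothetical corpus in which the most frequent word appears 1200 times.
counts = zipf_expected_counts(top_count=1200, ranks=5)
for rank, count in enumerate(counts, start=1):
    print(rank, round(count))
```

With a top count of 1200, the ideal counts for ranks 1 to 5 are 1200, 600, 400, 300 and 240.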

What is astonishing is that this law holds for almost every large natural language corpus out there, for example books and religious scriptures, and similar patterns are reported even in data such as temperature trends over past years.

If you have made it this far, you might be wondering what makes this simple mathematical law based on language patterns so special. A significant body of research shows that Zipf’s law appears almost everywhere.

Let me show you some interesting graphs that will make it clearer.

Source:http://www.wordcount.org/main.php

The above image is from wordcount.org, which holds an archive of the 86,800 most frequently used English words, ranked in order of commonness. Each word is scaled to reflect its frequency relative to the words that precede it. The larger the word, the more we use it. The smaller the word, the more uncommon it is.

Wordcount data currently comes from the British National Corpus®, a 100 million word collection of samples of written and spoken language from a wide range of sources, designed to represent an accurate cross-section of current English usage.

  1. ‘The’ is the most used English word.
  2. ‘Of’ is the second most used English word.

and so on.

Try to visualize the pattern that runs through the list.

— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —

When I came across Zipf’s Law, I decided to check for myself whether it holds true.

I wrote a Python script to run through some of the most varied text corpora I could find on the internet from reputable sources, to check whether the pattern holds in each one of them.

To my surprise, in every corpus I ran the script on to plot word frequencies, Zipf’s Law held true.
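A check of this kind can be sketched in a few lines of Python. This is not the original script, only a minimal illustration: the helper name `rank_frequency`, the simple regex tokenizer, and the tiny sample sentence are all assumptions. If Zipf’s Law holds, the product rank × count should stay roughly constant across the top ranks of a large corpus.

```python
import re
from collections import Counter


def rank_frequency(text, top_n=10):
    """Return (rank, word, count) for the top_n most frequent words in text."""
    # Crude tokenizer (an assumption): lowercase runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    return [
        (rank, word, count)
        for rank, (word, count) in enumerate(counts.most_common(top_n), start=1)
    ]


# Toy sample only; a real check needs a large corpus such as a whole book.
sample = "the cat and the dog and the bird saw the fox"
for rank, word, count in rank_frequency(sample, top_n=3):
    print(rank, word, count, rank * count)
```

To try it on a real text, read the file into a string and pass it to `rank_frequency` in place of `sample`.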

Before jumping into the visualizations: look for the relationship between frequency and rank, not for which word occurs the most. Also, notice how varied the texts used for the experiment are. All the corpora were picked randomly to avoid bias in the results.

In all the results, there are instances where a word’s frequency does not fall exactly to the value that 1/rank predicts, which is fine because the overall pattern approximates the law.
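It helps to know what drop to expect between neighbouring ranks. Under an ideal Zipf curve only the second word falls to half of the first; after that the ratio to the previous word is (rank - 1)/rank, which creeps towards 1. A small sketch, with `ratio_to_previous` as a hypothetical helper name:

```python
from fractions import Fraction


def ratio_to_previous(rank):
    """Ideal Zipf ratio f(rank) / f(rank - 1) = (rank - 1) / rank, for rank >= 2."""
    if rank < 2:
        raise ValueError("rank must be at least 2")
    return Fraction(rank - 1, rank)


for rank in range(2, 6):
    print(rank, ratio_to_previous(rank))
```

So the expected ratios for ranks 2 to 5 are 1/2, 2/3, 3/4 and 4/5, and a real corpus will wobble around these values.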

Code | Corpus used for visualizations → Link

Visualizations Time :)

1. A Cruising Voyage Round the World

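A classic of 18th-century English literature.

Notice how the frequency of each subsequent word falls off roughly in proportion to 1/rank: the second word appears about half as often as the first, the third about a third as often, and so on.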

2. Financial Crisis Inquiry Commission Report

3. The Mahabharata of Krishna-Dwaipayana Vyasa, Adi Parva. Translated into English Prose

Even a corpus not originally written in English follows the pattern closely.

4. PIRATES OF THE CARIBBEAN: DEAD MAN’S CHEST Script

5. 2nd Presidential Debate between Donald Trump and Hillary Clinton.

6. Random Tweets

7. The Adventures of Sherlock Holmes by Sir Arthur Conan Doyle

8. NBA Facebook likes

It even holds true for the number of likes given to each NBA team.

Source : https://precisionlender.com/blog/general/zipfs-laws-lenders/

These visualizations give a glimpse of how a law as simple as this can show up in almost any real-world domain.

From a statistics point of view, this law holds great importance.

How could it be that the intricate processes of normal human language production conspire to result in a frequency distribution that is so mathematically simple — perhaps “unreasonably” so (Wigner, 1960)?

This question has been a central concern of statistical language theories for the past 70 years. Derivations of Zipf’s law from more basic assumptions are numerous, both in language and in the many other areas of science where this law occurs

The principle of least effort, a theory developed in the early 1900s, was a precursor to the development of Zipf’s law.

Zipf’s Law is analogous to one of the most widely acclaimed economic principles, the ‘Pareto Principle’.

Pareto Principle

The principle states that 20% of the invested input is responsible for 80% of the results obtained. Put another way, 80% of consequences stem from 20% of the causes; this is also referred to as the “Pareto rule” or the “80/20 rule.”

This principle serves as a general reminder that the relationship between inputs and outputs is not balanced. For instance, the efforts of 20% of a corporation’s staff could drive 80% of the firm’s profits. In terms of personal time management, 80% of your work-related output could come from only 20% of your time at work. In Pareto’s case, he used the rule to explain how 80% of the wealth is controlled by 20% of the country’s population.

It shows up in almost any domain you can imagine, such as science, software, sports, crime analysis, and DNA analysis. I would highly recommend checking out the Wiki page to see how universal this distribution is.

The reason for mentioning the Pareto principle is that Zipf’s law is closely related to the Pareto distribution: roughly the first 20% of words account for roughly 80% of all word occurrences in a corpus.
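The 80/20 link can be checked numerically under an idealized assumption: every word’s frequency is exactly proportional to 1/rank. The sketch below uses a hypothetical helper `top_share` and an assumed vocabulary of 10,000 distinct words; the exact share depends on the vocabulary size, so treat the result as a rough illustration rather than a law.

```python
def top_share(vocab_size, fraction=0.2):
    """Share of all occurrences covered by the top `fraction` of ranks,
    assuming frequency is exactly proportional to 1/rank."""
    weights = [1 / rank for rank in range(1, vocab_size + 1)]
    cutoff = int(vocab_size * fraction)
    return sum(weights[:cutoff]) / sum(weights)


# Assumed vocabulary of 10,000 distinct words.
share = top_share(10_000)
print(f"Top 20% of words cover {share:.0%} of occurrences")
```

For this vocabulary size the share comes out a little above 80%. Smaller vocabularies give a lower share, which is why the 80/20 split is only approximate.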

Mathematics never fails to amaze!

If you found this article interesting in terms of content and articulation of ideas, do share it and give your suggestions.

References

[1] Zipf’s Mystery: Vsauce (great video, must watch)

[2] Wiki: Zipf’s Law

[3] https://www.thoughtco.com/principle-of-least-effort-zipfs-law-1691104

[4] https://en.wikipedia.org/wiki/Principle_of_least_effort

[5] https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4176592/

[6] https://en.wikipedia.org/wiki/Pareto_principle

[7] http://www.investopedia.com/terms/p/paretoprinciple.asp

[8] Gutenberg Corpus
