Generating Zipfian random-variables

Much like the Normal, Poisson, Bernoulli or Exponential distributions, the Zipf distribution is everywhere. I won’t bother reiterating the Wikipedia Article, but in-brief: it models the commonplace log-log relationship between the rank and frequency of events.

For instance, in an English text, ‘theoccurs 7% of the time, ‘be 4%,and3%, etc. When you plot a log-log graph of the rank of the word (the has rank 1, and has rank 3, episode has rank 2653, etc) against its frequency, you get a straight line. Thus, English words have a Zipfian distribution.

Frequency vs rank of words in various languages. Image from Wikipedia (https://en.wikipedia.org/wiki/Zipf's_law)

The distribution has two parameters:

  • s: the skew. s=0 is a uniform distribution. As s increases, high-rank items (like ‘and’) become rapidly more likely than the rare low-ranked items (like ‘cromulent’). English has a Zipfian skew of about 1.
  • N: the cardinality. The total number of items. For English words, this is 171 476 (according to the Oxford English Dictionary).

The cumulative distribution function, where x is the rank of the word, is:

Where H is a generalized harmonic-number:

Generating Zipfian Random Variables

What if we want to generate a Zipfian RV (random variable)? For instance, we may have an ordered corpus of 170 000 English words, a known skew of 1.07, and want to output some words from this corpus.

To do these we need to first create an inverse CDF (cumulative distribution function). We can use this to generate Zipfian RV samples by:

  1. Generate a uniformly-distributed sample 0≤p<1.
  2. Create a Zipfian rank k from this p with the Zipfian inverse CDF.
  3. Output the kth item from our ordered list of items.

Naively, we could just implement this CDF and inverse-CDF by calculating the Harmonic numbers’ sums. However, this involves doing enormous sums and will be very slow for large N. For N=100 million, generating a single variable can take many minutes. This appears to be the method used by the (otherwise fantastic) Apache Commons Math library.

Generating such random variables is important to me for simulating the behavior of millions of people, as part of my work as a Data Engineer at machine-learning company Featurespace (we’re hiring!) I could find no approximation on the internet, so had to find my own way of generating these numbers.

Approximating the CDF

The key to this is approximating the sums in generalized harmonic-numbers. An easy way to approximate a sum is using the Euler-Maclaurian formula. To second-order:

Rewriting as a sum, evaluating the derivatives, and plugging in f(k):

Rearranging and simplifying:

We can then use this approximation to approximate the Zipfian CDF. Notice I’ve multiplied top and bottom through by 12 — both to simplify the equation and because CPUs do division slowly.

Buts this a good approximation? I tried it over a wide range of values - here’s a few key examples, showing its absolute error compared to exact calculation.

I chose the second-order Euler-Maclaurin approximation because this gave an acceptable 0.01 error in the worst-case scenario, and excellent accuracy where I was planning to use it (high N, low s). Anyway, the error is swamped by the uncertainty in s and the fact that the real world isn’t precisely Zipfian. If accuracy at small x and s is important, you might want to fall back on exact calculation.

Here’s is the CDF in Java:

The Inverse CDF

Here’s the key part — we now need to invert the equation to get a rank from a probability. We can’t just invert the above equation — the resulting polynomial is unsolvable. Instead, we can use the Newton-Raphson method to find an x that satisfies this equation for a given p. That is, we find a x that satisfies:

using the iterative process:

As for the practicalities of doing this:

  • x0 is a bit of a guess. x0=1 or x0=N/2 are two reasonable choices.
  • Calculating powers is the bottleneck. I suggest calculating the Math.pow(x,-s-2) and then reusing it by multiplying by x. The makes the code roughly 3 times faster than computing the power each time.
  • Iteration should stop when successive steps of x are within a tolerance. I suggest 0.01. This means about 5 iterations.
  • ‘Overshoot’ is a common problem. Here, the derivative sends us off into x being negative, where the equations are undefined. If this happens, just set x to 1 for the next step.
  • Once iteration is complete, the resulting x should be floored to give an integer, eg 3.98 -> 3. In most programming langauges, just casting to an Integer will do this.
  • D can be precomputed at the beginning of the method.

The result is this Java:

Analysis of the inverse CDF code

  • How fast is it? A sample takes about 1μs to generate. So a million a second. This barely varies with N, s or p.
  • How accurate is it? For large N and small s it’s usually exact. eg For N=100000, s=0.1 and p=0.3, it gives the precise answer k=262437. For small N and large s it’s occasionally off by one. eg N=3, s=1.5 and p=0.738 gives a 2 when it should give a 1. Mean accuracy should never be less than 99%.

I want to speed it up!

Here’s some ideas:

  • Rewrite it in C.
  • Use fewer terms of the Euler-Maclaurin formula.
  • Typically, you’ll want to generate the random-variable many times with the same skew s and cardinality N. You could pre-calculate various values (eg every 0.01 between 0 and 1), cache them, and then interpolate them on demand with a cubic spline.

But what if s=1?

Ah ha! In this case many of the equations above are undefined due to the (1-s) denominator. In this scenario, the integrals need to be redone, which gives you a similar equation with a few logs in place of the powers. In the interest of space I’ll leave this as an exercise to the reader. Or, you can use s=1+Ɛ (eg 1.000001), which gives an extremely close result.