Not every long tail is power law!

Philipp Singer
Jun 16, 2013


Originally published at my old WordPress blog.

As a researcher you most likely encounter long-tailed distributions frequently. Perhaps the most prominent heavy-tailed distribution is the frequency of words in a text corpus. But also in social media research (the field I work in) you often stumble upon very skewed distributions. To give some examples: followers on Twitter [1], tags used in social bookmarking systems [2] or the number of comments on Wikipedia discussion pages [3]. While it is easy to observe that the distribution at hand might be heavy-tailed, it is often difficult to determine the exact type of distribution your data follows. One of the most frequently used candidates for explaining such heavy-tailed behavior is the power law distribution. Unfortunately, it is not uncommon for a heavy-tailed distribution to be claimed to follow a power law without this ever being checked. In this blog post I want to present the main principles of how to detect power law distributions.

To begin with, I want to define a power law distribution by the function:

y = c · x^(−α) + ϵ

Both c and α are constants characterizing the power law distribution, and ϵ is an error term dependent on x which represents the uncertainty in the observed values. The most interesting parameter is α, the scaling parameter, which determines the slope of the distribution [4,5]. Furthermore, it is important to note that real world data nearly never follows a power law distribution for all x values. Hence, it is crucial to find some minimum value xmin; then, we can say that the tail of the distribution (the values above xmin) follows a power law [5]. On a log-log scale a power law takes the form of a linear function, since log y = log c − α · log x, i.e., a straight line with slope −α [4].

Throughout the rest of this blog post I will show examples and methodological steps using Python and in particular the excellent Python library powerlaw 1.0 [6], which is a toolbox for testing whether a probability distribution fits a power law. Most of the technical details of the library are clearly described in [7] and I will also use the explanatory datasets of their work for the examples in this post. In detail, we will have a look at three different datasets:

  • Word: The frequency of word usage [8] in the book Moby Dick [9]
  • Neuron: The number of connections each neuron has in the nematode worm C. elegans [10]
  • Blackout: The number of people in the United States affected by electricity blackouts from 1984–2002 [9]

As a first step we want to visualize our data in order to get an idea of what the data looks like. There is an excellent blog post about visualizing skewed distributions by Jérôme Kunegis. To begin with, I will use a simple distribution plot on a log-log scale for this purpose; the output is the following:

[Figure: probability distributions of the Word, Neuron, and Blackout datasets on log-log axes]
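
Such a plot only takes a few lines of matplotlib; here is a minimal sketch, assuming the word frequencies have already been loaded into an array called words:

import numpy as np
import matplotlib.pyplot as plt
from collections import Counter

def plot_distribution(data, label):
    # count how often each value occurs and normalize to probabilities
    counts = Counter(data)
    x = np.array(sorted(counts))
    p = np.array([counts[v] for v in x], dtype=float)
    p /= p.sum()
    plt.loglog(x, p, marker='o', linestyle='none', label=label)

plot_distribution(words, 'Word')  # words: array of word frequencies (assumed to be loaded)
plt.xlabel('value')
plt.ylabel('p(x)')
plt.legend()
plt.show()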

A clear observation is that the Word dataset represents a highly skewed distribution. The Neuron distribution seems to be skewed as well, while the Blackout dataset does not really look like a distribution with a long tail. As we are now interested in whether the data follows a power law distribution, we could hypothesize that the Word dataset does follow a power law, the Neuron dataset might follow a power law and the Blackout data does not follow a power law distribution, given the definitions above and the idea that this plot should follow a straight line in the case of a power law. Nevertheless, we cannot clearly state this at this stage, as mentioned in the introduction of this post. Especially the right part of these figures is hard to interpret and we need to investigate this thoroughly. A more common and recommended way to visualize such data is to use the complementary cumulative distribution function (CCDF), with the idea that we want to visualize the probability that X will take a value greater than or equal to x. If we do this for our three datasets we come up with the following figures:

[Figure: CCDF plots of the Word, Neuron, and Blackout datasets on log-log axes]
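
The empirical CCDF is also easy to compute by hand; a minimal sketch, again assuming words holds the word-frequency data:

import numpy as np
import matplotlib.pyplot as plt

def plot_ccdf(data, label):
    # P(X >= x) for each sorted value of the data
    x = np.sort(np.asarray(data))
    ccdf = 1.0 - np.arange(len(x)) / float(len(x))
    plt.loglog(x, ccdf, label=label)

plot_ccdf(words, 'Word')
plt.xlabel('x')
plt.ylabel('P(X >= x)')
plt.legend()
plt.show()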

The plots are again on logarithmic axes and now show the complete range of values. Based on the definition of a power law, we can again say that the distribution might be a power law if the plots show a straight line. Hence, these plots confirm our observations from before: the Word dataset most likely follows a power law, we are unsure about the Neuron data, and we can quite confidently rule out a power law fit for the Blackout data. As we can clearly see, these plots are far more informative than the distribution plots above. Nevertheless, we still do not have clear evidence for power law behavior and still need to investigate this in detail.

The primary step is to estimate both xmin and α. As mentioned earlier, we are mostly interested in the tail of a heavy-tailed distribution, as the small values regularly do not follow a power law. One method for estimating the best value of xmin (it is also possible to set it yourself) is to create a fit for each candidate value in your data and choose the one resulting in the minimal Kolmogorov-Smirnov distance D between the corresponding fit and your data. One important thing to note is that power laws are undefined for xmin = 0. For finding α one can use the maximum likelihood method. There are also different approaches to take when working with continuous or discrete data. As the methodological details are beyond the scope of this blog post, please refer to the excellent paper by Clauset et al. [5].
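
To make the idea concrete, here is a minimal sketch of this selection procedure for continuous data (only meant to illustrate the principle; the powerlaw library additionally handles the discrete case properly):

import numpy as np

def fit_power_law_tail(data):
    # Scan candidate xmin values; for each, estimate alpha by maximum
    # likelihood and keep the fit with the smallest KS distance D [5].
    data = np.sort(np.asarray(data, dtype=float))
    best = None
    for xmin in np.unique(data)[:-1]:                  # every observed value is a candidate
        tail = data[data >= xmin]
        n = len(tail)
        alpha = 1.0 + n / np.sum(np.log(tail / xmin))  # MLE for continuous data
        empirical = np.arange(n) / float(n)            # empirical CDF of the tail
        model = 1.0 - (tail / xmin) ** (1.0 - alpha)   # CDF of the fitted power law
        D = np.max(np.abs(empirical - model))
        if best is None or D < best[0]:
            best = (D, xmin, alpha)
    return best                                        # (D, xmin, alpha)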

Using the powerlaw Python library, we can do this all in a very simple way:

import powerlaw
fit = powerlaw.Fit(data, discrete=True, xmin=1)

As mentioned, we can manually specify an xmin value and state whether the data is discrete. If we do not specify an xmin value, the best value is determined automatically. The following figures illustrate the probability density functions (PDF) of the data (red solid line), the power law fit for the best xmin (green dotted line) and the fit with a fixed xmin of 1 (blue dashed line):

[Figure: PDFs of the three datasets together with the power law fits for the best and the fixed xmin]
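
For reference, the fitted parameters and such a plot can be obtained directly from the fit object; a minimal sketch (data stands for any of the three datasets, and the plotting styles are my own rather than the ones used in the figures):

fit = powerlaw.Fit(data, discrete=True)   # xmin is determined automatically
print(fit.xmin)                           # estimated start of the power law tail
print(fit.power_law.alpha)                # estimated scaling parameter
print(fit.power_law.D)                    # Kolmogorov-Smirnov distance of the fit

ax = fit.plot_pdf(linewidth=2)                 # empirical PDF of the data
fit.power_law.plot_pdf(ax=ax, linestyle='--')  # fitted power law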

At first glance, we can see clear differences between the fixed and the best xmin fit. Furthermore, we can see that for the Word dataset a power law seems to be a very good fit, for the Neuron dataset it is a moderate fit and for the Blackout data it is not a good fit at all. We can also directly inspect the resulting xmin, α and the Kolmogorov-Smirnov distance D. Finally, we also want to investigate the concrete goodness of fit in order to conclude that a power law is a good fit for the data at hand. One way to do this is to use bootstrapping and the Kolmogorov-Smirnov test to generate p-values (please refer to [5] for a thorough description). As a second option we can compare the power law fit to fits of other distributions using log-likelihood ratios to determine which fit is better. In the literature [5,6] it is suggested to focus on the latter for various reasons. Even though this blog post is basically about the process of finding out whether a power law is a good fit for our data, it is often unnecessary to answer this specific question. Instead, we want to investigate whether a power law is the best description of our data or whether other distributions would be a better fit. Some potential candidate functions could be the exponential, lognormal or Weibull distribution. We can do this by calling the distribution_compare method of our fit object:
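
# log-likelihood ratio test: power law vs. lognormal fit of the same data
R, p = fit.distribution_compare('power_law', 'lognormal')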

R is the log-likelihood ratio between the two candidate distributions and p is the corresponding p-value. Generally, we can say that if R is positive, then the first distribution is a better fit, while if it is negative, the second one should be preferred. The p-value specifies the significance of this ratio; we can, for example, require a p-value below 0.1 for statistical significance. For demonstration, I will now compare our fit of the Word dataset to both the exponential and the lognormal distribution using a CCDF plot:

[Figure: CCDF of the Word dataset together with the power law, lognormal, and exponential fits]
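
A comparison plot like this can be drawn with the fit object's CCDF plotting helpers; a minimal sketch (the line styles and labels are my own choices, and words is again the assumed word-frequency array):

fit = powerlaw.Fit(words, discrete=True)
ax = fit.plot_ccdf(linewidth=2, label='Word data')
fit.power_law.plot_ccdf(ax=ax, linestyle='--', label='power law fit')
fit.lognormal.plot_ccdf(ax=ax, linestyle=':', label='lognormal fit')
fit.exponential.plot_ccdf(ax=ax, linestyle='-.', label='exponential fit')
ax.legend()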

We can clearly see that the exponential distribution is a much worse fit, which is also expressed by the high R value of 3025.03 with a p-value far below 0.1. By contrast, lognormal and power law seem to be equally good fits to our data, which is also expressed by a high p-value indicating that no clear distinction can be made. Actually, it is very rare that a power law is a better fit than a lognormal when working with real data. So finally, we can say with high confidence that a power law is a good fit for our Word dataset. We can proceed similarly with the other datasets.

The methods presented in this blog post should give you a clear route for identifying power law behavior in your data. Nevertheless, they also have some limitations that should be kept in mind. First of all, it can be problematic to automatically determine the xmin value. If you think about it, it could happen that the algorithm finds an xmin value so high that we get a good power law fit but only consider a small portion of the data. So it is always necessary to take a look at the resulting xmin values or, alternatively, to specify fixed values beforehand. As shown throughout this post, it is also very difficult to clearly evaluate whether the power law function is the best fit to the data. Instead, it can be of greater interest to analyze whether a power law is the best descriptor available from a list of candidate distributions. With many real-world datasets it is also very rare that the power law is the best fit, as it is often not possible to claim that it fits better than the lognormal or Weibull distribution. For a detailed discussion about this please refer to the inspiring blog post [11].

All methods and results presented in this post can be found step by step in a Python notebook. Please also take a look at the notebook provided by the authors of the powerlaw library.

[1] H. Kwak, C. Lee, H. Park, and S. Moon, “What is twitter, a social network or a news media?,” in Proceedings of the 19th international conference on world wide web, New York, NY, USA, 2010, pp. 591–600.
[2] H. Halpin, V. Robu, and H. Shepherd, “The complex dynamics of collaborative tagging,” in Proceedings of the 16th international conference on world wide web, New York, NY, USA, 2007, pp. 211–220.
[3] D. Laniado, R. Tasso, Y. Volkovich, and A. Kaltenbrunner, “When the wikipedians talk: network and tree structure of wikipedia discussion pages,” in Proceedings of the fifth international aaai conference on weblogs and social media, 2011.
[4] D. Bollen and H. Halpin, “The role of tag suggestions in folksonomies,” Corr, vol. abs/0903.1788, 2009.
[5] A. Clauset, C. R. Shalizi, and M. E. J. Newman, “Power-law distributions in empirical data,” SIAM Review, vol. 51, no. 4, pp. 661–703, 2009.
[6] J. Alstott, “powerlaw 1.0,” https://pypi.python.org/pypi/powerlaw, 2013.
[7] J. Alstott, E. Bullmore, and D. Plenz, “powerlaw: a Python package for analysis of heavy-tailed distributions,” 2013.
[8] G. K. Zipf, The psycho-biology of language: an introduction to dynamic philology, Houghton Mifflin Company, 1935.
[9] M. E. J. Newman, “Power laws, Pareto distributions and Zipf’s law,” Contemporary Physics, vol. 46, no. 5, pp. 323–351, 2005.
[10] E. K. Towlson, P. E. Vértes, S. E. Ahnert, W. R. Schafer, and E. T. Bullmore, “The rich club of the C. elegans neuronal connectome,” Journal of Neuroscience, vol. 33, no. 15, pp. 6380–6387, 2013.
[11] C. Shalizi, “So you think you have a power law — well isn’t that special?,” http://vserver1.cscs.lsa.umich.edu/~crshalizi/weblog/491.html, 2007.


Philipp Singer

Data Scientist at UNIQA Insurance Group, PhD in CS, passionate about machine learning, statistics, data mining, programming, blockchain, and many other fields.