Conditions Are Power-Law Distributed: An Example

Kent Beck
Feb 23, 2019

I observed to Mike Hill that conditions in code are power-law distributed. That is, there is one condition (if XXX) that is used more than any other, two that are used half as much, and so on, until you have lots of conditions that are used only once.

I wanted to double check myself, since this is a pretty science-y observation. I did. I was right. (Dang, that’s not very science-y. Let me try again.) The prediction I made matched a new observation.

I’ll walk you through how I got the data (it’s a long Unix command line — feel free to tell me how to do it beautifully with R or something).

First, we want to extract the if statements from our codebase (I picked Ansible for no good reason except that I had it around). This will do it:

> grep -R --include='*.py' 'if ' .
./packaging/release/tests/ if isinstance(expected, type):
./packaging/release/changelogs/ if argcomplete:
./packaging/release/changelogs/ if args.verbose > 2:
./packaging/release/changelogs/ elif args.verbose > 1:

This does indeed give us a bunch of if statements. Now we need to strip out the extraneous portions, the “if” at the beginning and the “:” at the end (this being Python).

> grep -R --include='*.py' 'if ' . | perl -nle 'print $1 if /.*if (.*):/'
isinstance(expected, type)
args.verbose > 2
args.verbose > 1
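
For readers who would rather stay in one language, here is a minimal Python sketch of the same extraction step. It reuses the regular expression from the Perl one-liner; the function name and the sample lines are made up for illustration and are not part of the original pipeline.

```python
import re

# Same pattern as the Perl one-liner: greedy match up to the last "if ",
# then capture everything up to the last ":" on the line.
CONDITION = re.compile(r".*if (.*):")


def extract_condition(line):
    """Return the condition text from one source line, or None if no match."""
    match = CONDITION.match(line)
    return match.group(1) if match else None


# Hypothetical input lines, shaped like the grep output above.
sample_lines = [
    "    if isinstance(expected, type):",
    "    if args.verbose > 2:",
    "    elif args.verbose > 1:",
    "    return value",
]

conditions = [c for c in map(extract_condition, sample_lines) if c is not None]
for condition in conditions:
    print(condition)
```

Like the grep pattern, this treats an elif line as just another condition, and it is a plain text match rather than a real parse of the Python source.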

Now we have just the conditions. How many of each are there? First sort them, then pass them through uniq -c to count them.

> grep -R --include='*.py' 'if ' . | perl -nle 'print $1 if /.*if (.*):/' | sort | uniq -c
1 " " in v
2 " at '^' position" in err
1 "#LogService.ClearLog" in _data[u"Actions"]
4 "%s" not in validate

Sort these numerically in reverse order and we can see the heavy hitters.

> grep -R --include='*.py' 'if ' . | perl -nle 'print $1 if /.*if (.*):/' | sort | uniq -c | sort -n -r
2332 __name__ == '__main__'
682 'message' in response
645 state == 'present'
644 not module.check_mode
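
The "sort | uniq -c | sort -n -r" idiom corresponds roughly to collections.Counter in Python. The sketch below assumes the conditions have already been extracted into a list; the list here is a tiny made-up sample, not the Ansible data.

```python
from collections import Counter

# Hypothetical sample; in practice this would hold every extracted condition.
conditions = [
    "__name__ == '__main__'",
    "state == 'present'",
    "__name__ == '__main__'",
    "not module.check_mode",
    "__name__ == '__main__'",
    "state == 'present'",
]

# Counter does the job of "sort | uniq -c"; most_common() adds the
# "sort -n -r" step by ordering from most to least frequent.
usage = Counter(conditions)
for condition, count in usage.most_common():
    print(count, condition)
```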

What we want eventually is a histogram showing how many single-use conditions there are, how many conditions are used twice, etc. Use “cut” to extract the counts, then the same “sort | uniq -c” trick to get a histogram.

> grep -R --include='*.py' 'if ' . | perl -nle 'print $1 if /.*if (.*):/' | sort | uniq -c | sort -n -r | cut -c 1-5 | sort -n | uniq -c
28611 1
4817 2
1335 3
623 4

Sure enough, there are lots of conditions (28K) used once, many fewer used twice, many fewer used three times, and on down.

Down at the bottom we have one condition used 2332 times.
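
The histogram step is a "count of counts". Here is a small Python sketch of it, assuming a mapping from condition to number of uses is already in hand; the mapping below is invented for illustration.

```python
from collections import Counter

# Hypothetical usage counts: condition -> number of times it appears.
usage = {
    "__name__ == '__main__'": 5,
    "state == 'present'": 2,
    "not module.check_mode": 2,
    "args.verbose > 2": 1,
    "args.verbose > 1": 1,
    "argcomplete": 1,
}

# Counting the counts gives the histogram: times used -> number of conditions.
histogram = Counter(usage.values())

# Print in the same "how many conditions, times used" layout as the output above.
for times_used in sorted(histogram):
    print(histogram[times_used], times_used)
```

Working from the counts directly also avoids the cut -c step, which depends on how wide uniq -c pads its numbers.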

Graphing this data we get an inkling that we’re not in Normalistan any more.

[Figure: number of conditions plotted against how many times each is used, linear axes]

Shifting the axes to logarithmic shows something like a power-law distribution.

[Figure: the same data plotted on logarithmic axes]
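
A power law shows up as a roughly straight line on log-log axes, and the slope of that line estimates the exponent. The sketch below fits a least-squares line to the logarithms of the four histogram rows shown earlier. It is only a rough check: it uses four points rather than the full data set, and a straight-line fit on log-log axes is a crude estimator.

```python
import math

# The four histogram rows shown earlier: (times used, number of conditions).
histogram = [(1, 28611), (2, 4817), (3, 1335), (4, 623)]


def loglog_slope(points):
    """Least-squares slope of log(y) against log(x)."""
    xs = [math.log(x) for x, _ in points]
    ys = [math.log(y) for _, y in points]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    covariance = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    variance = sum((x - x_mean) ** 2 for x in xs)
    return covariance / variance


slope = loglog_slope(histogram)
print(f"estimated exponent: {slope:.2f}")
```

A clearly negative slope says the number of conditions falls off steeply as the number of uses goes up, which is what the log-log plot suggests.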

There is a trend in how often a condition “ought” to appear. And there you have it — preferential attachment at work. The more often a condition appears in a codebase, the more likely that condition is to be used the next time a conditional appears.
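
To see how preferential attachment alone can produce this shape, here is a toy simulation. It is an assumption-laden sketch, not a model of how Ansible was actually written: each new conditional either introduces a brand-new condition with a fixed probability, or reuses an existing condition chosen in proportion to how often it has already been used. The function name, the probability 0.3, and the number of conditionals are all made up for illustration.

```python
import random
from collections import Counter


def simulate(num_conditionals, new_condition_probability, seed=0):
    """Toy preferential-attachment process over condition ids."""
    rng = random.Random(seed)
    uses = []  # one entry per conditional written so far, holding its condition id
    next_id = 0
    for _ in range(num_conditionals):
        if not uses or rng.random() < new_condition_probability:
            # Introduce a condition nobody has written before.
            condition = next_id
            next_id += 1
        else:
            # Picking a past use uniformly at random selects each condition
            # with probability proportional to how often it has been used.
            condition = rng.choice(uses)
        uses.append(condition)
    return Counter(uses)


usage = simulate(num_conditionals=20000, new_condition_probability=0.3)
histogram = Counter(usage.values())

# Same layout as the shell histogram: number of conditions, times used.
for times_used in sorted(histogram)[:4]:
    print(histogram[times_used], times_used)
print("most-used condition appears", max(usage.values()), "times")
```

Under these assumptions the simulated histogram has the same character as the real one: a large number of single-use conditions, sharply fewer used twice or three times, and a few conditions used far more often than the rest.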
