Conditions Are Power-Law Distributed: An Example
I observed to Mike Hill that conditions in code are power-law distributed. That is, there is one condition (if XXX) that is used more than any other, two that are used half that much and so on until you have lots of conditions that are only used once.
I wanted to double-check myself, since this is a pretty science-y observation. I did. I was right. (Dang, that’s not very science-y. Let me try again.) The prediction I made matched a new observation.
I’ll walk you through how I got the data (it’s a long Unix command line — feel free to tell me how to do it beautifully with R or something).
First, we want to extract the if statements from our codebase (I picked Ansible for no good reason except that I had it around). This will do it:
> grep -R --include='*.py' 'if ' .
./packaging/release/tests/version_helper_test.py: if isinstance(expected, type):
./packaging/release/changelogs/changelog.py: if argcomplete:
./packaging/release/changelogs/changelog.py: if args.verbose > 2:
./packaging/release/changelogs/changelog.py: elif args.verbose > 1:
This does indeed give us a bunch of if statements. Now we need to strip out the extraneous portions, the “if” at the beginning and the “:” at the end (this being Python).
> grep -R --include='*.py' 'if ' . | perl -nle 'print $1 if /.*if (.*):/'
args.verbose > 2
args.verbose > 1
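For anyone who would rather not read Perl, here is a minimal sketch of the same extraction step in Python. It assumes the same regular expression as the one-liner above; the name `extract_conditions` and the sample lines are mine, for illustration only.

```python
import re

# Same pattern as the Perl one-liner: skip ahead to the last "if " on the
# line, then capture everything up to the last colon.
CONDITION_RE = re.compile(r".*if (.*):")

def extract_conditions(lines):
    """Yield the condition text of each line that looks like an if/elif."""
    for line in lines:
        match = CONDITION_RE.match(line)
        if match:
            yield match.group(1)

# Hypothetical sample input, shaped like the grep output above.
sample = [
    "    if args.verbose > 2:",
    "    elif args.verbose > 1:",
    "    return value",
]
conditions = list(extract_conditions(sample))
print(conditions)
```

Like the Perl version, this treats `elif` as just another `if`, and it will happily pick up an "if " inside a comment or string. That is good enough for a rough count.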
Now we have just the conditions. How many of each are there? First sort them, then pass them through uniq -c to count them.
> grep -R --include='*.py' 'if ' . | perl -nle 'print $1 if /.*if (.*):/' | sort | uniq -c
1 " " in v
2 " at '^' position" in err
1 "#LogService.ClearLog" in _data[u"Actions"]
4 "%s" not in validate
Sort these numerically in reverse order and we can see the heavy hitters.
> grep -R --include='*.py' 'if ' . | perl -nle 'print $1 if /.*if (.*):/' | sort | uniq -c | sort -n -r
2332 __name__ == '__main__'
682 'message' in response
645 state == 'present'
644 not module.check_mode
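The grep, sort, and uniq -c steps can be sketched in Python with `collections.Counter`. This is one possible version, not the pipeline I actually ran: `iter_python_lines` and `count_conditions` are hypothetical names, and the sample lines are made up.

```python
import re
from collections import Counter
from pathlib import Path

CONDITION_RE = re.compile(r".*if (.*):")

def iter_python_lines(root):
    """Yield every line of every .py file under root (like grep -R --include)."""
    for path in Path(root).rglob("*.py"):
        with open(path, encoding="utf-8", errors="replace") as handle:
            yield from handle

def count_conditions(lines):
    """Count how often each condition text appears (like sort | uniq -c)."""
    counts = Counter()
    for line in lines:
        match = CONDITION_RE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts

# Made-up sample lines; on a real checkout, pass iter_python_lines(".") instead.
sample = [
    "if __name__ == '__main__':",
    "    if state == 'present':",
    "if __name__ == '__main__':",
]
counts = count_conditions(sample)
for condition, n in counts.most_common():  # heavy hitters first
    print(n, condition)
```

`most_common()` plays the role of the final `sort -n -r`, so the most frequent conditions come out on top.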
What we want eventually is a histogram showing how many single-use conditions there are, how many conditions are used twice, etc. Use “cut” to extract the counts (the first five characters of each line on my machine; the width of the count column differs between uniq implementations, so adjust the range if yours pads differently), then the same “sort | uniq -c” trick to get a histogram.
> grep -R --include='*.py' 'if ' . | perl -nle 'print $1 if /.*if (.*):/' | sort | uniq -c | sort -n -r | cut -c 1-5 | sort -n | uniq -c
Sure enough, there are lots of conditions (28K) used once, many fewer used twice, many fewer used three times, and on down.
Down at the bottom we have one condition used 2332 times.
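The "count the counts" trick is short in Python too. A sketch, assuming you already have a mapping from condition to number of uses; the numbers below are invented for illustration and are not the Ansible data.

```python
from collections import Counter

def frequency_histogram(condition_counts):
    """Map 'times used' to 'number of distinct conditions used that often'."""
    return Counter(condition_counts.values())

# Made-up counts, for illustration only.
condition_counts = Counter({"a": 1, "b": 1, "c": 1, "d": 2, "e": 5})
histogram = frequency_histogram(condition_counts)
for times_used in sorted(histogram):
    print(histogram[times_used], times_used)
```

Each printed line reads the same way as the shell output: how many conditions, then how many times each of them is used.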
Graphing this data we get an inkling that we’re not in Normalistan any more.
Shifting the axes to logarithmic shows something like a power-law distribution.
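A straight line on log-log axes is what a power law looks like, and the slope of that line is the exponent. Here is a rough sketch of estimating the slope by least squares on the logs. It is a quick eyeball check rather than a rigorous fit, the function name is hypothetical, and the input is synthetic rather than my measured histogram.

```python
import math

def loglog_slope(histogram):
    """Least-squares slope of log(count) against log(times_used)."""
    xs = [math.log(times_used) for times_used in histogram]
    ys = [math.log(count) for count in histogram.values()]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return covariance / variance

# Synthetic histogram built to follow count = 1000 / times_used**2,
# so the slope should come out close to -2.
synthetic = {k: 1000 / k**2 for k in (1, 2, 4, 8, 16)}
slope = loglog_slope(synthetic)
print(round(slope, 3))
```

Real data will be noisier, especially in the tail where only one or two conditions share a given count.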
There is a trend in how often a condition “ought” to appear. And there you have it — preferential attachment at work. The more often a condition appears in a codebase, the more likely that condition is to be used the next time a conditional appears.
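To see how that rule alone can produce a heavy tail, here is a toy simulation. It is a sketch under assumptions of my own choosing, not a model fitted to Ansible: each new conditional invents a brand-new condition with probability `p_new`, and otherwise reuses the condition from a randomly chosen earlier conditional. The function name and both parameters are hypothetical.

```python
import random
from collections import Counter

def simulate_conditions(n_conditionals, p_new, seed=0):
    """Toy preferential-attachment model (all parameters are hypothetical).

    Each new conditional either introduces a brand-new condition with
    probability p_new, or copies the condition of a uniformly chosen earlier
    conditional, so popular conditions are picked in proportion to their use.
    """
    rng = random.Random(seed)
    uses = []      # condition id of every conditional written so far
    next_id = 0
    for _ in range(n_conditionals):
        if not uses or rng.random() < p_new:
            uses.append(next_id)
            next_id += 1
        else:
            uses.append(rng.choice(uses))
    return Counter(uses)

counts = simulate_conditions(n_conditionals=20000, p_new=0.3)
histogram = Counter(counts.values())
for times_used in sorted(histogram)[:5]:
    print(histogram[times_used], times_used)
```

Even this crude model gives lots of single-use conditions, many fewer used twice, and a handful of heavy hitters at the far end.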