Toward a False xkcdization of Data
Disclaimer: This post was triggered by reading Scott Galloway’s “The Four” and it can easily be read as an attack on the book. It is not meant as such. I do have some problems with the views expressed in the book and I might even dedicate a post to them. This post, however, uses several problems I found within this book to point to what, unfortunately, is a broader problem.
A month ago I didn’t even know who Scott Galloway was. Then somebody shared a link to one of his talks on our Facebook Workplace platform (1 out of the Four). I watched the video on Google’s YouTube (2 out of the Four) using my Apple iPad (3 out of the Four) and really loved it. “The Four”, Galloway’s book, was immediately pre-ordered to my Amazon Kindle (4 out of the Four!). On October 3rd it arrived and I started reading it. About 25% into the book I started feeling a bit uneasy: things that had totally blown me away in Galloway’s talk did not hold up as well under the closer inspection that reading allows. Eventually I found myself developing a rather critical and skeptical approach to the book, despite my general tendency to agree with its main points.
This post, however, will only address one point, which played a major role in my disillusionment process, and which I chose to call false xkcdization.
I love xkcd. I think Randall Munroe is a real genius, and even though I sometimes grow tired of it, I eventually find myself returning to xkcd and still appreciating its brilliance. One of the prominent features of xkcd is its plotting style: following the overall style of the comic strips, the plots have the appearance of hand-drawn sketches, as if telling us not to take them too seriously. But Munroe never uses this as an excuse or permission not to take the underlying data or the presentation seriously.
In The Four (and, as I later found out, in his Medium blog as well) Galloway uses an xkcd-like plotting style, which goes great with his general tone of self-doubt and his ever-present humorous stance.
Axes of Evil
But then, 23% into the book, I came across this quote:
The result? From 1997 to 2005 The Gap more than tripled in revenue, from $6.5 billion to $16.0 billion, while Levi Strauss & Co. sank from $6.9 billion to $4.1 billion.
Now, 6.5*3=19.5, which means that The Gap did not more than triple its revenue; going from $6.5 billion to $16.0 billion is growth by a factor of roughly 2.5. This is, of course, a simple and rather silly mistake, which can (and does) happen to anyone. But after noticing it I became much more sensitive, which made me, one page later, take a screenshot of the plot supposedly showing the same data.
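The arithmetic is easy to check. Here is a minimal Python sketch that uses only the four revenue figures from the quote above (the variable names are mine, chosen for illustration):

```python
# Revenue figures as quoted from the book, in billions of dollars.
gap_1997, gap_2005 = 6.5, 16.0
levis_1997, levis_2005 = 6.9, 4.1

# Growth factor = final revenue divided by initial revenue.
gap_factor = gap_2005 / gap_1997
levis_factor = levis_2005 / levis_1997

print(f"The Gap grew by a factor of {gap_factor:.2f}")
print(f"Levi's shrank to {levis_factor:.2f} of its 1997 revenue")
print("More than tripled?", gap_factor > 3)
```

The Gap comes out at a bit under 2.5 times its 1997 revenue, which is impressive but is not "more than tripled".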
There is so much wrong here that I don’t even know where to start. There is the naive linear interpolation between two data points eight years apart (but maybe there was no other choice), and there is the always suspicious decision to draw the horizontal x-axis at a value which is not zero. But most of all, there is the complete disrespect for the idea of a plot actually reflecting the underlying data. The Gap, as quoted above, started with $6.5B in 1997, yet the plot puts it at exactly the same level as 2005 Levi’s (which is $4.1B). The vertical distance between the two in 1997 is blown way out of proportion, just in order to make the shift look much more dramatic than it actually was. And it really was dramatic, even without the graphic manipulation. Here is a plain old boring Excel plot using the actual data in the text.
Instead of a dramatic David and Goliath we get a much more mundane tale of two opponents with practically equal starting points, one making the right decision leading to impressive growth, the other one slowly losing ground. The discussion of the steps taken by The Gap in order to secure this success doesn’t have to change, even if the plot is much less impressive.
And then, a few pages later, came a discussion of tuition fees. Galloway, a professor at NYU Stern, leads a brave and important attack against the ever-rising cost of tuition. In order to emphasize his point he includes the following plot, “comparing” overall inflation to the cost of tuition.
Once again, I don’t even know where to start. The 200% inflation is drawn as an almost flat horizontal line. This choice could make some sense if the plot included several curves, all of them normalized against the inflation curve. But given the data in the plot this choice makes it totally pointless, since the y-axis loses any significance as actually measuring something. Furthermore, the inflation curve isn’t a truly flat line, and since the human brain is trained to detect even slight deviations from flat or perpendicular lines this choice actually gives the impression that the 200% inflation is a gentle, almost unobservable creep upwards. Once again, here is an Excel plot of the data, assuming a uniform annual increase of 3.7% for inflation and 8.3% in tuition over the same range.
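For readers who want to reproduce the two curves without Excel, here is a rough Python sketch of the same compounding. The annual rates are the ones stated above. The 30-year span is my assumption for illustration only; it is picked because 3.7% a year compounds to roughly the 200% figure over that period, and the actual range of the plot may differ.

```python
def cumulative_increase(annual_rate, years):
    """Total percentage increase after compounding annual_rate for `years` years."""
    return ((1 + annual_rate) ** years - 1) * 100

YEARS = 30  # ASSUMPTION: illustrative span, not taken from the book

# Year-by-year curves, both normalized to 0% at the start.
inflation_curve = [cumulative_increase(0.037, y) for y in range(YEARS + 1)]
tuition_curve = [cumulative_increase(0.083, y) for y in range(YEARS + 1)]

print(f"Inflation after {YEARS} years: +{inflation_curve[-1]:.0f}%")
print(f"Tuition after {YEARS} years:   +{tuition_curve[-1]:.0f}%")
```

Under these assumptions inflation ends a little under +200% and tuition close to +1,000%. On a shared, honest y-axis the inflation curve is clearly not flat, yet the gap between the two is still striking.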
And just as before, the main story Galloway wants to tell remains the same, even if the plot is much less dramatic.
But the sins of The Four do not stop at distorted graphical presentation.
28% into the book (Kindle location 1302) comes this quote.
You dedicate thirty-five minutes of each of your days to Facebook. Combined with its other properties, Instagram and WhatsApp, that number jumps to fifty minutes. People spend more time on the platform than any behavior outside of family, work, or sleep.
Which is followed by this plot.
I do not want to argue with Galloway’s interpretation of the numbers, just point to the evident fact that the plot actually gives a number of 60 minutes (35+25) instead of the 50 minutes appearing in the text just above it. Admittedly, the numbers are not that clearly defined, and different definitions or measurement methods can easily lead to different results. Nevertheless, one should either admit the discrepancies between data sources or pick one and stick with it. Quoting one and then displaying a plot with different numbers is not a valid option.
And then there is the case of the 2016 digital advertising growth, which appears twice in the book, each time accompanied by a plot.
The bottom line, in both cases, is similar: digital advertising is dominated by two players, and the long tail is either dying (-3%) or losing significance (a mere +10%). Again, one has to admit that the actual definitions here can be vague, and different decisions as to what counts as “other” in digital advertising can easily lead to different outcomes, but using two significantly different results for the same measurement is not a valid option.
Data and data science are all about collecting data and then using mathematical and statistical tools to infer insights. And while Galloway manages to collect a lot of data and weave them into a story, sometimes one can observe a tendency to miss the second part, that of using math to understand what is going on.
Let’s look at the following quote (Kindle location 1408):
Over the last five years, only thirteen in the S&P 500 have outperformed the index each year — evidence of our winner-take-all economy.
Thirteen out of 500, that’s a really small number, one might be tempted to think, strong evidence of winner-take-all dynamics, as Galloway says. This might even be true, but it is far from being a logical conclusion from the numbers.
The S&P 500 is a curated list of 500 large companies to begin with, and one can expect most of them to show similar annual growth. Now let us suppose that each year exactly 50% of them (that is, 250 firms) grow more than the average and exactly 250 grow less than the average. Let us further suppose that those lists of 250 are totally random and independent of the previous year’s lists. Then the probability of beating the index in five consecutive years is 1 in 32 (0.5 raised to the fifth power), which for 500 firms yields an expected number of 500/32=15.625 firms. While 13 is smaller than this, it is far from being evidence of anything as dramatic as Galloway claims. (In fact, a more realistic model would have slightly fewer than 250 firms beating the index in a given year, since the median in such cases is usually lower than the average, bringing the expected number even closer to 13.)
The same disregard for math can be found in the seemingly innocent joke made by Galloway when discussing career advice:
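The coin-flip model can be written down in a few lines of Python. This is only a sketch of the toy model described above, not a claim about how the market behaves. It also treats the 500 firms as independent binomial trials, which is an approximation I am adding: with exactly 250 winners each year the firms are not strictly independent.

```python
from math import comb

N_FIRMS = 500
YEARS = 5
P_BEAT_ONE_YEAR = 0.5  # toy-model assumption from the text: a coin flip each year

# Probability that one firm beats the index five years in a row.
p_streak = P_BEAT_ONE_YEAR ** YEARS   # 1/32
expected_firms = N_FIRMS * p_streak   # 500/32

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), computed by direct summation."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

# How surprising is seeing 13 or fewer such firms under pure chance?
p_at_most_13 = binom_cdf(13, N_FIRMS, p_streak)

print(f"Expected number of five-year winners: {expected_firms}")
print(f"P(13 or fewer) under the coin-flip model: {p_at_most_13:.2f}")
```

Under this approximation, 13 or fewer five-year winners is a perfectly ordinary outcome of pure chance, with a probability nowhere near small enough to count as evidence of anything.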
Given that most of us — and statistics support me on this — are average.
(Kindle Locations 3015–3016)
This is, of course, far from being true, and statistics support me on this. Even in the most well-behaved distributions, only a few of us are indeed average.
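To put a rough number on "only a few", here is a small Python sketch for a normal distribution. What counts as "average" is my own assumption for illustration: being within a tenth of a standard deviation of the mean. Other cut-offs give other fractions.

```python
from math import erf, sqrt

def fraction_within(k):
    """Fraction of a normal population within k standard deviations of the mean."""
    return erf(k / sqrt(2))

# ASSUMPTION: "average" means within 0.1 standard deviations of the mean.
strictly_average = fraction_within(0.1)
# A far more generous cut-off, for comparison.
loosely_average = fraction_within(0.5)

print(f"Within 0.1 sd of the mean: {strictly_average:.1%}")
print(f"Within 0.5 sd of the mean: {loosely_average:.1%}")
```

With the strict cut-off about 8% of the population qualifies, and even with the generous half-a-standard-deviation cut-off it is less than 40%. Either way, "most of us" are not average.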
Axes of Evil 2
Finally, let’s take a look at this plot, illustrating Galloway’s advice not to pick a career path based on its “sexiness” (Kindle location 3274):
This looks like a classical xkcd plot, having the audacity to quantify the unquantifiable and put it on a plot, but this is also where the false xkcdization is most evident. Randall Munroe would never settle for something so sketchy and non-informative. In xkcd the axes would have been given some measurable meaning, even if the measurement is seemingly absurd, with units of sexiness and fulfillment. Munroe would also have taken care to include the important outliers (the rare jobs in which sexiness and fulfillment go hand in hand, the mass of crap jobs in which neither is achieved), and to add annotations to selected points in order to emphasize the message (and make some jokes).
None of these things happen here. It’s just a pretty meaningless plot trying to give a data-like appearance to Galloway’s thesis.
Play and Respect
As I said earlier, I love xkcd, and I believe it has immense influence on our current data-driven culture. One of the best things about it is the way it gave us all a license to be playful with data, and to approach data analysis with an open mind and a sense of humor. But playfulness does not mean data can be treated with disrespect. Playing is fun only as long as it is accompanied by respect, even if you are playing with data.