Data Never Lies

rendeiro
Published in Feedzai Techblog
Jun 15, 2018

The Importance of Data

Products succeed and fail on the market every day as a result of decision making driven by data. When executives want to pitch or trial new ideas, revamp pricing, ramp up engineering capacity, or strategically sunset an old product, one requirement is common: you need to show the data!

Metrics are key to informing these strategic product discussions and guiding decisions on business direction. Be it A/B test results, trends of AARRR metrics, or the latest NPS report, data finds its way to all kinds of meeting rooms, usually translated to a visual medium as a chart or graphical display of some sort.

Netflix has the luxury of a high volume of new users on which to A/B test product features. Reviewing their A/B tests between 2012 and 2015, Netflix product managers found that only 39% of these tests produced positive results, with 15% actually hurting their core metric (screen time). ¹

Booking.com goes a step further: all production changes, from copy changes to major features, are wrapped as A/B tests. 90% of these tests fail and are discarded. ²

These two examples spell a clear warning: As a Product Manager, if you’re not measuring the success of the features you roll out, chances are you’re hurting your Product.

In this article, we’ll cover the value of data visualisation, common pitfalls when presenting data, and discuss guidelines on how to spot distortions in how data is presented, how to prevent them, and how to design visualisations that are fit for purpose.

Data Presentation Hygiene

Although data is ubiquitous in executive decision-making, it is often misrepresented and misunderstood. We easily fall prey to cognitive and perceptual biases when dealing with data, especially when it is visualised. Often this happens by accident; deception does not require intent.

Let’s start with the basics: there are a few simple signals you should be on the lookout for that may tell you something is off.

  • No data point without a source: Did you know that 82% of people will believe a statistic quoted without a source, even if you just made it up on the spot to prove your point? Be wary of figures that don’t reference a source. Request access to the real data.
  • No KPI without a definition: All too often we assume that metric definitions are universal. In fact, there is a lot of leeway in how ratios, percentages, and most aggregate metrics are computed, which can be exploited to force a specific interpretation of the data. Furthermore, a simple metric like an average might be meaningless if the distribution of the underlying variable is extreme.
  • Challenge authority pledges: Around the middle of the 20th century, when the ills of tobacco were not well known and publicised, ads supporting tobacco brands were quite different. The one portrayed below by Camel shows a man with well-combed silver hair, wearing a lab coat and an impeccably knotted tie, elegantly holding a cigarette alongside a bold and misleading claim. Do these doctors condone tobacco smoking? Do they condone this specific brand over others? Is it healthy (or at least, not unhealthy) to smoke? Is the man in the ad a doctor? All of the above are subtly implied to be true, yet are likely false. “Repeated nationwide surveys” by itself is a red flag: which demographics were controlled for, and how often were the surveys repeated? Perhaps until the results “looked good”?
Fig 1. Ad for Camel Cigarettes | source: Google Images
  • Correlations can be spurious: Finally, just because two variables correlate, neither causation nor an indirect relation is necessarily implied. The website Spurious Correlations is a welcome reminder that high correlation coefficients between time-series variables can be accidental.
Fig. 2 Spurious Correlation — Arcade Revenues vs. CS PhD in the US | Source: Spurious Correlations by Tyler Vigen
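To see how easily this happens, here is a minimal sketch in Python. The two series are made up for illustration (they are not the arcade or PhD figures from the chart); all they share is an upward trend. Comparing period-to-period changes instead of raw levels is one common sanity check, added here as an assumption rather than something taken from the Spurious Correlations site.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def differences(series):
    """Period-to-period changes, which remove a shared upward trend."""
    return [later - earlier for earlier, later in zip(series, series[1:])]


# Hypothetical series: unrelated by construction, both simply grow over time.
series_a = [2 * i + (i % 3) for i in range(12)]
series_b = [100 + 5 * i + 3 * (i % 2) for i in range(12)]

r_levels = pearson(series_a, series_b)
r_changes = pearson(differences(series_a), differences(series_b))

print(f"correlation of levels:  {r_levels:.2f}")
print(f"correlation of changes: {r_changes:.2f}")
```

The raw levels correlate strongly only because both series grow; once the trend is removed, very little of that relationship is left.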

While these few signals and pointers help to diagnose obvious deceptions in data presentation, borrowing from the work of Edward Tufte in his seminal book The Visual Display of Quantitative Information ³, we can more formally define Excellence and Integrity in Statistical Graphics and derive more precise heuristics to measure distortions in data visualisation.

The Power of Visualisation

Data by itself is amenable to manipulation and misrepresentation, and graphics add complexity where further deception can hide. So why bother? Why not stick with tables and figures? After all, numerical calculations are sure to be rigorous.

Consider Anscombe’s Quartet: four bivariate data-sets that can be described by the same linear model, and for which 11 different statistical aggregate measures produce the same values. Presented as a table of figures, the four sets are hard for the human eye to process and practically indistinguishable.

Yet, this simple data-set quartet shows us that while numbers are rigorous, graphics can be more precise and more immediate.

As illustrated below, you can understand the data far more clearly by plotting it. Those four data-sets look nothing like each other when plotted.

Fig 3. Anscombe’s Quartet | source: The Visual Display of Quantitative Information by Edward Tufte
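The claim about matching aggregates is easy to check yourself. The sketch below uses the quartet’s published values and computes only a handful of the measures (means, least-squares line, and correlation), not all 11; the helper name `summarise` is just an illustrative choice.

```python
from math import sqrt

# Anscombe's quartet: sets I-III share the same x values, set IV has its own.
X_SHARED = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X_SHARED, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X_SHARED, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X_SHARED, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarise(xs, ys):
    """Return (mean x, mean y, slope, intercept, correlation) for one data-set."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx                      # least-squares regression slope
    intercept = mean_y - slope * mean_x    # least-squares regression intercept
    return mean_x, mean_y, slope, intercept, sxy / sqrt(sxx * syy)


summaries = {name: summarise(xs, ys) for name, (xs, ys) in QUARTET.items()}
for name, stats in summaries.items():
    print(name, [round(value, 2) for value in stats])
```

Rounded to two decimals, every set prints the same row, which is exactly why the table alone tells you so little and the plot tells you so much.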

Excellence in Data Visualisation

Having covered the basic hygiene factors on data presentation, let’s now tackle the topic from the opposite perspective: what is excellence in data visualisation?

The influential author Prof. Edward Tufte provides a succinct and complete standard:

“Excellence in statistical graphics consists of complex ideas communicated with clarity, precision, and efficiency.”

— Edward Tufte

As hinted at by the example of Anscombe’s Quartet, visualisation invites exploration. Richer multivariate infographics can be formulated that encourage the viewer to explore and make comparisons amongst the data.

Below, my all-time favourite statistical graphic: Charles Minard’s classic 1869 illustration of Napoleon’s 1812 Russian campaign, which follows his army from its departure at the Polish-Russian border.

Fig 4. Napoleon’s March by Charles Minard, 1869 | source: Wikipedia

A thick band illustrates the size of his army at specific geographic points during their advance and retreat. It displays six types of data in two dimensions, namely: number of troops, distance traveled, direction of travel, temperature, latitude and longitude, and location relative to specific dates.

The Right Tool

One should also take care to pick the right tool for the job. Based on the task at hand and the nature of the data, picking the right chart is fairly straightforward. Here’s a useful guide ⁴:

Fig 5. Chart Chooser Tool by Dr. Andrew V. Abela | source: http://extremepresentation.typepad.com/

For a dictionary that explains in further detail how specific types of charts fit different communication purposes and data sources, consider the Data Viz Project.⁵

Lying with Data

While some distortions may be involuntary and easily avoidable (e.g., 3D effects on charts, which make it harder to assess real proportions), others may be ill-intended deceptions that are more subtle to detect.

A classic example of distortion is truncating the y-axis as demonstrated below.

Fig 6. Is it misleading to truncate the y-axis? | source: design by Ramiro on redbubble.com

Deception can also be achieved on such simple charts by playing with a series of similar tricks. Consider the data-set of 1–10 point scale ratings of the show Dexter, depicted in the chart below.

Fig 7. a) Dexter 1–10 Show Ratings — as-is | source: IMDB

The aforementioned trick of the truncated y-axis clearly adds a dramatic flair here, amplifying the sad decline of the show by the end of S8.

Fig 7. b) Dexter 1–10 Show Ratings — Truncated y-axis | source: IMDB

Alternatively, one could artificially set a high maximum for the y-axis, rendering the differences between data points less salient.

Fig 7. d) Dexter 1–10 Show Ratings — maximised y-axis | source: IMDB

Or, understandably, one could pretend the last 4 episodes of Season 8 never took place…

Fig 7. e) Dexter 1–10 Show Ratings — Lie by omission | source: IMDB
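The two axis tricks above can be made concrete with a little arithmetic. This is a minimal sketch in Python: the ratings 9.0 and 6.0 are hypothetical stand-ins (not the actual IMDB values), and the helper names are made up for illustration. It measures how tall each bar is drawn, as a fraction of the plot height, for a given choice of y-axis limits.

```python
def drawn_height(value, y_min, y_max):
    """Fraction of the plot height a bar for `value` occupies on a y-axis from y_min to y_max."""
    return (value - y_min) / (y_max - y_min)


def visual_ratio(high, low, y_min, y_max):
    """How many times taller the `high` bar is drawn than the `low` bar."""
    return drawn_height(high, y_min, y_max) / drawn_height(low, y_min, y_max)


def visual_gap(high, low, y_min, y_max):
    """Gap between the two bars as a fraction of the plot height."""
    return drawn_height(high, y_min, y_max) - drawn_height(low, y_min, y_max)


# Hypothetical ratings for a strong and a weak episode on a 1-10 scale.
best, worst = 9.0, 6.0

honest_ratio = visual_ratio(best, worst, 0, 10)     # axis starts at zero
truncated_ratio = visual_ratio(best, worst, 5, 10)  # axis truncated at 5

honest_gap = visual_gap(best, worst, 0, 10)         # axis ends at the scale maximum
maximised_gap = visual_gap(best, worst, 0, 100)     # axis maximum set artificially high

print(f"bar height ratio, axis 0-10:  {honest_ratio:.1f}x")
print(f"bar height ratio, axis 5-10:  {truncated_ratio:.1f}x")
print(f"gap as share of plot, 0-10:   {honest_gap:.0%}")
print(f"gap as share of plot, 0-100:  {maximised_gap:.0%}")
```

With these numbers, truncating the axis makes a rating that is 1.5 times higher look 4 times higher, while stretching the axis to 100 shrinks a gap that fills 30% of the plot down to 3%.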

Advanced Lie Detection for Data Visualisation

Generalising these tricks of deception, Tufte proposes a simple metric, which he dubs the Lie Factor, formally:

Lie Factor = size of the effect shown in the graphic / size of the effect in the data

Borrowing an example from the visualisation bible itself, consider the chart below depicting the evolution of oil prices.

Fig 8. OPEC Benchmark Prices 1970–1979 | source: Washington Post, March 29, 1979 (via Visual Display of Quantitative Information by Edward Tufte)

Comparing the values for Jan 1970 and April 1979, $1.80 and $14.54 respectively, the price increased by roughly 7x the starting value (about 708%). However, measuring the graphical elements that represent these quantities (the ink spent on the smallest and largest oil drilling rigs pictured), the increase actually shown is 67x.

This yields a Lie Factor of about 9.5. In other words, the graphical representation amplifies the actual effect in the data by almost 10x.
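The arithmetic behind that number can be reproduced in a few lines. The sketch below is in Python and takes the 67x visual increase as quoted above rather than re-measuring the graphic; the function names are illustrative only.

```python
def relative_increase(first, last):
    """Increase from `first` to `last`, expressed as a multiple of `first`."""
    return (last - first) / first


def lie_factor(effect_in_graphic, effect_in_data):
    """Tufte's Lie Factor: effect shown in the graphic divided by effect in the data."""
    return effect_in_graphic / effect_in_data


effect_in_data = relative_increase(1.80, 14.54)  # oil price, Jan 1970 to April 1979
effect_in_graphic = 67.0                         # visual increase, as quoted in the text

factor = lie_factor(effect_in_graphic, effect_in_data)

print(f"effect in data:    {effect_in_data:.2f}x")
print(f"effect in graphic: {effect_in_graphic:.0f}x")
print(f"Lie Factor:        {factor:.1f}")
```

A Lie Factor close to 1 means the graphic is faithful to the data; the further it drifts from 1, the more the picture overstates (or understates) the numbers.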

Conclusion

Data visualisation is a powerful and important tool guiding key business decisions. It should be wielded with care and consideration.

By fitting the methods to the data at hand and the purpose of the communication, and by applying guidelines and heuristics such as the ones presented here, one can ensure a high level of integrity in the graphics produced and achieve greater clarity in the art of communicating data through a visual medium.

Homework ? :-)

Finally, to minimise visual clutter, noise and potential miscommunication in data visualisation, I challenge you to research another heuristic by Prof. Tufte, the data-ink ratio, formally:

amount of ink in a chart devoted to data / all ink devoted to print the chart
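As a head start, here is a rough sketch of the ratio in Python. The chart elements and their ink amounts are entirely hypothetical (arbitrary units, invented for illustration); in practice you would have to estimate them from the actual graphic.

```python
def data_ink_ratio(ink_by_element, data_elements):
    """Share of the total ink that is spent on elements that carry data."""
    total_ink = sum(ink_by_element.values())
    data_ink = sum(ink for name, ink in ink_by_element.items() if name in data_elements)
    return data_ink / total_ink


# Hypothetical ink budget of a cluttered bar chart, in arbitrary units.
cluttered = {"bars": 40, "gridlines": 30, "border": 10, "background": 20}

# The same chart after erasing ink that does not represent data.
decluttered = {"bars": 40, "border": 10}

ratio_before = data_ink_ratio(cluttered, data_elements={"bars"})
ratio_after = data_ink_ratio(decluttered, data_elements={"bars"})

print(f"data-ink ratio before: {ratio_before:.0%}")
print(f"data-ink ratio after:  {ratio_after:.0%}")
```

The higher the ratio, the more of the chart is doing actual work.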

Sources

  1. Presented at the Product Management Festival 2016 in Zürich by Jan Dante, Director of Product Experimentation at Netflix
  2. Presented at Mind the Product Hamburg Engage 2018 by Lukas Vermeer, Sr. Product Owner Experimentation at Booking.com
  3. The Visual Display of Quantitative Information by Edward Tufte
  4. The Chart Chooser by Dr. Andrew V. Abela
  5. The Data Viz Project: http://datavizproject.com/

rendeiro
Feedzai Techblog

Product guy, passionate for usability, charts, photography and typography.