How to lie with big data

Maciej Piwoni
Published in DataShop
6 min read · Apr 3, 2016

The volume of available data keeps increasing, but our understanding of it still trails behind. One of the recurring themes is data-driven marketing. Does more data mean better decisions? I doubt it. Better decisions come from proper insight and understanding.

I have put together a couple of examples of how data can be misused to get the ‘right’ results.

I have to confess, the title and theme were inspired by a fantastic book by Darrell Huff, “How to Lie with Statistics”. The book is more than 50 years old, but it still rocks.

1. Measuring what’s available, not what’s important

Replacing a difficult question with an easier one

Every single marketer I have met claims to focus only on the important metrics. It is a serious business after all, with no time to waste. Let’s take a look at one of the most common questions: “Where should I spend my marketing money?”

This is the moment in which a lot of marketers would produce a chart looking like the one below:

Half a truth is a lie, as the old saying goes. Measuring where the conversion took place, or finding the last digital touchpoint, is relatively easy. Does it tell the full story of the customer journey? Should you redistribute your marketing budget based on the chart above? Definitely not.

The customer journey was never linear or triggered by a single message. Google argues that even “impulse buying” is conditioned by many micro-moments.

Understanding the complete customer journey is one of the most important goals of digital marketing. It is also one of the biggest challenges. Only a complete understanding of all touchpoints and their importance allows you to spend marketing money effectively. The shortcut is to ask the easy question: what happened right before the conversion?
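The gap between the easy question and the hard one can be illustrated with a minimal sketch. The journeys and channel names below are hypothetical, invented purely for illustration, and linear attribution is only one of several possible multi-touch models, not a method prescribed here.

```python
from collections import defaultdict

# Hypothetical converting journeys: ordered touchpoints (illustrative only).
journeys = [
    ["Display", "Social", "Paid Search"],
    ["Email", "Paid Search"],
    ["Social", "Organic Search", "Paid Search"],
    ["Display", "Email"],
]


def last_touch(paths):
    """Give 100% of each conversion to the final touchpoint."""
    credit = defaultdict(float)
    for path in paths:
        credit[path[-1]] += 1.0
    return dict(credit)


def linear(paths):
    """Split each conversion equally across every touchpoint in the path."""
    credit = defaultdict(float)
    for path in paths:
        share = 1.0 / len(path)
        for channel in path:
            credit[channel] += share
    return dict(credit)


last_touch_credit = last_touch(journeys)
linear_credit = linear(journeys)

for channel in sorted(linear_credit):
    print(
        f"{channel:15s} "
        f"last-touch={last_touch_credit.get(channel, 0.0):.2f} "
        f"linear={linear_credit[channel]:.2f}"
    )
```

In this made-up data, Display and Social receive no credit at all under last-touch, even though they appear in most journeys. A budget based on the last touchpoint alone would cut them.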

Understanding averages

Avoid choosing irrelevant and misleading measurements

“Average” is one of the most commonly used words, and one of the most commonly misused. Usually it refers to the arithmetic mean, i.e. the sum of all values divided by the number of values, often rounded. While very common, it can be very misleading.

A second problem with the mean is that the number is often quoted without any context or background:

“We managed to achieve on average 28 conversions per channel”.

A number on its own is meaningless. What is needed is context. This can be solved by showing the distribution of the data or, even better, by using a histogram.

Use histograms for instant visual feedback. They provide a much better picture of the underlying data set.

Always be precise with terms when using averages:

  • mean: regular meaning of “average” — arithmetic mean,
  • median: the middle value,
  • mode: the most frequent value,
  • range: difference between the min. and max. value.

In the example above, the mean (arithmetic average) produces misleading results. The actual performance of each channel is either much higher (Referral) or much lower than the mean. It is a typical scenario where use of the mean on its own is misleading.

Let’s take a look at what our toolbox can tell us:

  • range: a spread of 78 conversions points to a large discrepancy between the best and the worst performing channel,
  • median: the middle value of 12 indicates that at least half of all channels perform below “the average” (27.6 conversions),
  • mode: the most frequent number of conversions is 8, well below the number quoted as the ‘average’.

Together, these three numbers indicate the presence of an outlier, a value that is distant from the other observations (the Referral channel).
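The toolbox is easy to reproduce with Python's standard `statistics` module. The underlying chart data is not reproduced in this text, so the values below are hypothetical: five channels chosen only so that they yield the figures quoted above. The channel names other than Referral are assumptions as well.

```python
import statistics

# Hypothetical per-channel conversions, chosen only to reproduce the
# statistics quoted in the text. Not the original data set.
conversions = {
    "Referral": 86,
    "Organic": 24,
    "Email": 12,
    "Social": 8,
    "Display": 8,
}

values = list(conversions.values())

mean = statistics.mean(values)          # arithmetic mean
median = statistics.median(values)      # middle value
mode = statistics.mode(values)          # most frequent value
value_range = max(values) - min(values) # spread between best and worst

# Channels that perform below the quoted "average"
below_mean = [name for name, v in conversions.items() if v < mean]

print(f"mean={mean} median={median} mode={mode} range={value_range}")
print("below the mean:", below_mean)
```

With these assumed values, four of the five channels sit below the mean, which is exactly the situation a single quoted "average" hides.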

Avoiding survivorship bias

Understand lack of visibility

Survivorship bias, or survival bias, is the logical error of concentrating on the people or things that “survived” some process and inadvertently overlooking those that did not because of their lack of visibility.

During World War II, the statistician Abraham Wald took survivorship bias into his calculations when considering how to minimise bomber losses to enemy fire. Researchers from the Center for Naval Analyses had conducted a study of the damage done to aircraft that had returned from missions, and had recommended that armour be added to the areas that showed the most damage (shown in red above). Wald noted that the study only considered the aircraft that had survived their missions — the bombers that had been shot down were not present for the damage assessment. The holes in the returning aircraft, then, represented areas where a bomber could take damage and still return home safely. Wald proposed that the Navy instead reinforce the areas where the returning aircraft were unscathed (shown in blue above), since those were the areas that, if hit, would cause the plane to be lost (as per Wikipedia).

What does it mean for marketers?

Quite often marketers focus on understanding customers who converted. A lot of effort and research is spent on building profiles of successful customers (shown in blue below). This approach is an example of survivorship bias. Marketing activities are built around the profiles of customers who converted, ignoring everybody who dropped out along the way. This means that optimising conversion will hit a wall at some stage.

Instead, marketers should focus on every stage of the customer journey (marked in yellow above). This provides much bigger scope for driving the overall number of conversions up. Try to understand why people are not converting and how to fix it.
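Looking at every stage can start with something as simple as counting who was lost between stages. The sketch below assumes a five-stage funnel with hypothetical stage names and counts. Real funnels will differ.

```python
# Hypothetical funnel (illustrative only): people remaining at each stage.
funnel = [
    ("Visit", 10000),
    ("Product view", 4000),
    ("Add to cart", 1000),
    ("Checkout", 400),
    ("Purchase", 200),
]


def stage_dropoff(stages):
    """Return (from_stage, to_stage, lost, drop_rate) for each consecutive pair."""
    rows = []
    for (stage, count), (next_stage, next_count) in zip(stages, stages[1:]):
        lost = count - next_count
        rows.append((stage, next_stage, lost, lost / count))
    return rows


dropoff = stage_dropoff(funnel)
for stage, next_stage, lost, rate in dropoff:
    print(f"{stage} -> {next_stage}: lost {lost} ({rate:.0%})")

# The transition that loses the most people in absolute terms
biggest_leak = max(dropoff, key=lambda row: row[2])

converted = funnel[-1][1]
not_converted = funnel[0][1] - converted
print(f"converted={converted}, never converted={not_converted}")
```

In this made-up funnel, profiling only the 200 buyers says nothing about the 9,800 people who left, most of them before viewing a single product.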

Correlation vs. causation

Understanding cause and effect

Do ice cream sales cause more murders?

During the summer months, sales of ice cream increase. The number of murders does, too. Plotting those two numbers on a single chart shows two very similar trends. This may lead to the conclusion that ice cream sales influence the number of murders. A more plausible explanation is that both follow a third factor, such as warm weather. Illusory correlation may lead to serious misconceptions.
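A small numerical sketch makes the trap visible. All figures below are hypothetical and deliberately constructed so that both series depend on temperature and on nothing else they share. The partial correlation shown is the standard first-order formula and is used here only as one illustrative way of controlling for a third variable.

```python
import math

# Hypothetical monthly data (Jan..Dec), constructed for illustration only.
temperature = [2, 4, 8, 13, 18, 22, 25, 24, 19, 13, 7, 3]
noise_a = [1, -1] * 6         # unrelated wiggle in ice cream sales
noise_b = [1, 1, -1, -1] * 3  # unrelated wiggle in murders

# Both series depend on temperature, not on each other.
ice_cream = [20 + 10 * t + 15 * a for t, a in zip(temperature, noise_a)]
murders = [30 + t + 2 * b for t, b in zip(temperature, noise_b)]


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_correlation(x, y, z):
    """Correlation of x and y after removing the linear effect of z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz**2) * (1 - r_yz**2))


r_raw = pearson(ice_cream, murders)
r_partial = partial_correlation(ice_cream, murders, temperature)

print(f"raw correlation:             {r_raw:.2f}")
print(f"controlling for temperature: {r_partial:.2f}")
```

The raw correlation is strong, yet it disappears once temperature is taken into account. Real data is rarely this clean, but the question "what else moves both numbers?" is the same.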

eCommerce is one of the industries that benefit from the increasing amount of available data. Every digital step and touchpoint on the customer journey is recorded. Sometimes even the most ‘obvious’ metrics may not behave in the way we expect.

In the example above, we have an eCommerce store with a constant volume of traffic. Revenue increases (to the delight of the CFO) but the conversion rate decreases (to the agony of the CMO). Intuitively those two metrics should move together, but not in this case. Additional data helps, again by asking not-so-obvious questions: Does the order value change? How does the behaviour of repeat customers differ from that of new customers? Is there a change in the demographics of the visitors?
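One way the two metrics can diverge is through order value. The sketch below assumes the common decomposition revenue = sessions × conversion rate × average order value. All numbers are hypothetical and are not taken from the chart.

```python
def revenue(sessions, conversion_rate, avg_order_value):
    """Revenue as sessions x conversion rate x average order value."""
    orders = sessions * conversion_rate
    return orders * avg_order_value


# Hypothetical figures: traffic is constant, conversion rate falls,
# average order value rises.
before = {"sessions": 100_000, "conversion_rate": 0.020, "avg_order_value": 50.0}
after = {"sessions": 100_000, "conversion_rate": 0.018, "avg_order_value": 60.0}

rev_before = revenue(**before)
rev_after = revenue(**after)

print(f"before: conversion rate {before['conversion_rate']:.1%}, revenue {rev_before:,.0f}")
print(f"after:  conversion rate {after['conversion_rate']:.1%}, revenue {rev_after:,.0f}")
```

With these assumed numbers, fewer visitors buy but each order is larger, so revenue rises while the conversion rate falls. Whether that is what happened in a given store is exactly what the additional questions are meant to establish.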

Uncovering cause and effect in online marketing is not easy. Difficult questions remain. Does a correlation exist between the metrics we are focusing on? Is it linear or non-linear? Is the data set discrete or continuous? These are just a few. One thing is certain though: there is no substitute for hard work.

“If you torture the data long enough, it will confess.”

Ronald H. Coase, Essays on Economics and Economists

Maciej Piwoni
DataShop

Global Data Strategy Manager. Critical Thinker. Digital Evangelist. Data Geek.