The first three rules of data analysis

Pete Davies
Feb 25, 2016

Don’t underestimate their significance

1. Correlation ≠ Causation

An oldie and a goodie. Oft-cited, almost as frequently misunderstood. Compare:

  1. “We changed to 1-week sprints and sign-ups went up”
  2. “Sign-ups went up because we changed to 1-week sprints”
  3. “When we have shorter sprints, sign-ups go up.”

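These are easily confused with each other, but the differences are important. The first is an observation, the second is a claim about cause, and the third is a general rule. One may lead to the next, but the only way to know is with a controlled test. Don’t let bad assumptions about causation turn into long-standing company lore.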
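For changes that can be shown to some users and not others, a controlled test starts with random assignment. Here is a minimal sketch, assuming each user has a stable string ID. The function name `assign_variant`, the experiment label, and the 50/50 split are all illustrative and not taken from any particular tool.

```python
import hashlib


def assign_variant(user_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministically place a user in 'control' or 'treatment'.

    Hashing the experiment name together with the user ID gives each user
    a stable bucket, so they see the same variant on every visit, and a
    different experiment name reshuffles the buckets.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    # Map the first 32 bits of the hash to a value in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "treatment" if bucket < treatment_share else "control"


if __name__ == "__main__":
    for uid in ("user-1", "user-2", "user-3"):
        print(uid, assign_variant(uid, "hypothetical-signup-test"))
```

Because the assignment depends only on the hash, nothing else that happened that week (a sprint change, a press mention) can line up with one group and not the other, and that is what lets you move from statement 1 to statement 2.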

Spurious Correlations is fun.

2. If it sounds too good to be true, it probably is

I can’t overstate the importance of this one. There are few things worse than being a company’s stat-keeper and jumping up from your desk exclaiming that the latest A/B test doubled week 2 retention… only for someone to point out that you have a misplaced decimal point.†

Even worse: when nobody spots the error for months††. In the meantime, many important product, staffing, and budget decisions have been made based on the inaccurate result.

(Also: when you call it too soon, before the result has reached statistical significance, or before you’ve factored in that user behavior often deviates in reaction to a change and then settles again. Give it time.)
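One way to check whether a difference has reached significance is a two-proportion z-test. The sketch below uses only the standard library. The function name and the example counts are made up for illustration, and this is one common test rather than the method we used at Medium.

```python
import math


def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Return (z, two-sided p-value) for the difference between two conversion rates."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both groups convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0, 1.0  # no variation at all, so nothing to detect
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


if __name__ == "__main__":
    # Hypothetical counts: 10/100 conversions in control, 12/100 in treatment.
    z, p = two_proportion_z_test(10, 100, 12, 100)
    print(f"z = {z:.2f}, p = {p:.2f}")
```

With samples that small, a 20% relative lift is nowhere near significant. Note too that running the test every day and stopping the first time it dips below 0.05 inflates the false positive rate, which is another reason to decide the duration up front and give it time.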

It’s easy to avoid. Very few product changes or new features cause overnight step-change improvements in user behavior, but that is easy to forget when you (and the team) are pretty sure that the thing you just launched (and worked on for months!) truly rivals sliced bread in the all-time great invention stakes.

Double-check everything. It’s good for checking the numbers, and the biases. At Medium we did test result reviews the same way the developers did code reviews. This saved more than a few blushes.
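Part of a review can be automated. Here is a rough sketch of a pre-review sanity check that flags results which look too good to be true. The function name `review_flags` and the 50% threshold are assumptions for illustration, so pick a limit that fits your own product’s history.

```python
def review_flags(control_rate: float, treatment_rate: float,
                 max_relative_lift: float = 0.5) -> list:
    """Return a list of reasons a result deserves a second look before it is shared."""
    flags = []
    # A rate outside [0, 1] usually means a unit mix-up or a misplaced decimal point.
    for name, rate in (("control", control_rate), ("treatment", treatment_rate)):
        if not 0.0 <= rate <= 1.0:
            flags.append(f"{name} rate {rate} is outside [0, 1]; check units and decimal points")
    # Very large relative changes are rare, so treat them as suspect until reviewed.
    if control_rate > 0:
        lift = (treatment_rate - control_rate) / control_rate
        if abs(lift) > max_relative_lift:
            flags.append(f"relative lift of {lift:.0%} exceeds {max_relative_lift:.0%}; re-check the query")
    return flags


if __name__ == "__main__":
    # Hypothetical week 2 retention: 20% in control, "40%" in treatment.
    for reason in review_flags(0.20, 0.40):
        print(reason)
```

A flag does not mean the result is wrong, only that a second pair of eyes should look before anyone jumps up from their desk.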

Side note: this is why (in my experience) the best product data analysts have a healthy skepticism about all their analysis. I gave it a label: being data curious.

3. No pie charts

I’m aware of only two legitimately optimal uses of a pie chart:

Every other pie chart I’ve seen would invariably be better expressed with just a table of raw numbers (yup they work great!), bar chart, or column chart. If you know of any other exceptions that prove this rule, let’s have ‘em.

One day I will find the time to write a full post about this vitally important topic. But for now, this comprehensive Priceonomics piece will do just fine.

† I have done this.
†† I have not done this (that I’m aware of).

Written by Pete Davies. Building new things @gethudson. Previously @Medium, @Automattic, @BBCNews.