Source: Autodesk Research

The big picture, the small pictures, and the numbers that don’t matter

Tara Greenwood
May 6, 2019


I LOVE abstract thinking. Finding and applying patterns is how I turn data into information. But relying on summary statistics alone can turn data into misinformation. We’ll look at some fun examples that (literally) illustrate the importance of looking at your whole dataset, getting feedback on your results, and looking again.

In 1973, Francis Anscombe constructed a quartet of eleven-point datasets that share the same mean, variance, correlation, and best-fit line, yet look entirely different when plotted. X1 shows a simple linear relationship with a fairly normal distribution; X2 has neither a normal distribution nor a linear relationship; X3 has a linear relationship, but one outlier throws off its best-fit line; and X4 shows an even more extreme example of a non-normal distribution and a single outlier’s influence on the line.

This graphic represents the four datasets defined by Francis Anscombe for which some of the usual statistical properties (mean, variance, correlation and regression line) are the same, even though the datasets are different. Reference: Anscombe, Francis J. (1973) Graphs in statistical analysis. American Statistician, 27, 17–21.
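Anscombe’s point is easy to verify. Here’s a quick check on two of the four sets, with the values transcribed from the 1973 paper (this snippet is my own illustration, not Anscombe’s method):

```python
import numpy as np

# Anscombe's quartet, sets I and II (sets I-III share the same x values)
x  = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

for name, y in [("I", y1), ("II", y2)]:
    slope, intercept = np.polyfit(x, y, 1)  # least-squares best-fit line
    print(f"set {name}: mean(y)={np.mean(y):.2f}  "
          f"var(y)={np.var(y, ddof=1):.2f}  "
          f"corr={np.corrcoef(x, y)[0, 1]:.3f}  "
          f"fit: y = {intercept:.2f} + {slope:.2f}x")
```

Both sets report a mean of about 7.50, a correlation of about 0.816, and the same fit, y ≈ 3.00 + 0.50x — yet one is a linear cloud and the other a parabola.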

We don’t actually know how Anscombe constructed those datasets. Now there’s a great site, DrawMyData, where we can build a dataset with specific stats by clicking points onto a graph until the summary stats come out the way we want, then export it as a CSV file. Thanks, Robert Grant! Alberto Cairo used it to update the quartet with a charismatic set called the Datasaurus:

It must have taken a while.

In “Same Stats, Different Graphs: Generating Datasets with Varied Appearance and Identical Statistics through Simulated Annealing,” Autodesk researchers Justin Matejka and George Fitzmaurice took that Datasaurus set as a starting point and ran 200,000 perturbations to produce the Datasaurus Dozen: twelve new shapes with the same summary statistics. The whole dataset is available for download, with the name of each subset and its x-y coordinates. Here is my scatterplot of the full set with its best-fit line:

Pictured: ‘dino’, ‘away’, ‘h_lines’, ‘v_lines’, ‘x_shape’, ‘star’,
‘high_lines’, ‘dots’, ‘circle’, ‘bullseye’, ‘slant_up’,
‘slant_down’, ‘wide_lines’
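Checking the per-subset stats yourself is a one-liner-per-column job with pandas. The real download is (to my knowledge) a tab-separated file with `dataset`, `x`, and `y` columns; since I can’t bundle it here, this sketch runs on a tiny stand-in frame with the same structure (the values below are illustrative, not from the real data):

```python
import pandas as pd

# In practice you'd load the real file, something like:
#   df = pd.read_csv("DatasaurusDozen.tsv", sep="\t")   # columns: dataset, x, y
# Stand-in frame with the same shape (values are made up for illustration):
df = pd.DataFrame({
    "dataset": ["dino"] * 3 + ["star"] * 3,
    "x": [55.4, 51.5, 46.2, 58.2, 58.0, 53.1],
    "y": [97.2, 96.0, 94.5, 91.9, 92.2, 90.2],
})

# Per-subset summary statistics -- on the real Datasaurus Dozen these
# agree to two decimal places across all thirteen subsets.
summary = df.groupby("dataset").agg(
    x_mean=("x", "mean"), y_mean=("y", "mean"),
    x_std=("x", "std"), y_std=("y", "std"),
)
summary["xy_corr"] = pd.Series(
    {name: g["x"].corr(g["y"]) for name, g in df.groupby("dataset")})
print(summary)
```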

Clearly, it is not a very well-fitting line. Suppose the model is accepted anyway, and any subset with those same summary stats is assumed to look similar. Remember, there are thirteen distinct-looking datasets hidden in that messy scatterplot (you’ll see them at the end; it’s worth it). Someone familiar with a specific subset might argue for a very different linear relationship.

Not pictured: better-fit line

This gives us an example of the Amalgamation Paradox (or Simpson’s paradox): a statistical phenomenon in which a trend that appears in each of several groups of data reverses when the groups are combined. Here’s a classic example:

Source: Wikipedia
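The paradox is easy to reproduce with a handful of points. In this toy example (the numbers are mine, not from the figure above), each group trends upward on its own, but the pooled fit trends downward:

```python
import numpy as np

# Two groups, each with a clear positive trend on its own...
group_a = (np.array([1.0, 2.0, 3.0]), np.array([10.0, 11.0, 12.0]))
group_b = (np.array([7.0, 8.0, 9.0]), np.array([1.0, 2.0, 3.0]))

def slope(x, y):
    """Least-squares slope of y on x."""
    return np.polyfit(x, y, 1)[0]

print("group A slope:", slope(*group_a))   # positive
print("group B slope:", slope(*group_b))   # positive

# ...but pooled together, the trend reverses:
x = np.concatenate([group_a[0], group_b[0]])
y = np.concatenate([group_a[1], group_b[1]])
print("pooled slope: ", slope(x, y))       # negative
```

The group membership here is the confounding variable: which group a point belongs to predicts its position far better than x does.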

And here is an example within our own data subset:

The red dots represent the best of six lucky guesses.

The amalgamation paradox tells us there may be some confounding variables at play. In the case of the datasaurus dozen, it’s sampling bias. Matejka and Fitzmaurice put a lot of work into biasing each subset’s distribution until it had the same stats as the whole set. This is some truly beautiful sampling bias:

Look at what you can do with 200,000 perturbations!
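A stripped-down version of the idea looks like this. The real method also biases each move toward a target shape and uses simulated annealing to escape local minima; this sketch (my simplification, not the paper’s code) keeps only the core constraint — accept a random nudge only if the summary stats, rounded to two decimals, are unchanged:

```python
import random
import numpy as np

def stats(x, y):
    """Summary stats rounded to two decimals, as in the paper."""
    return tuple(round(v, 2) for v in
                 (np.mean(x), np.mean(y), np.std(x), np.std(y),
                  np.corrcoef(x, y)[0, 1]))

def perturb(x, y, n_iters=20000, step=0.1, seed=0):
    """Randomly nudge points while keeping the rounded stats fixed.
    A simplified sketch: the real algorithm also steers points toward
    a target shape via simulated annealing."""
    rng = random.Random(seed)
    target = stats(x, y)
    x, y = list(x), list(y)
    for _ in range(n_iters):
        i = rng.randrange(len(x))
        old = x[i], y[i]
        x[i] += rng.uniform(-step, step)
        y[i] += rng.uniform(-step, step)
        if stats(x, y) != target:   # reject any move that changes the stats
            x[i], y[i] = old
    return x, y
```

Run long enough, with a shape-seeking bias added to the acceptance rule, this is how a scatterplot can be herded into a star, a circle, or a dinosaur without its summary statistics ever budging.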

To wrap this up: one look at our data may not be enough. If our predictions come up strange and inconsistent, the next step may be to examine subsets in isolation. For data scientists and pattern enthusiasts, chasing down confounding variables is all part of the game.

You can see my repository on GitHub: https://github.com/TSGreenwood/datasaurus_visual_explorations
