Are Summary Statistics Enough when Analyzing Data?
Anscombe’s quartet, the Datasaurus dataset, and the Datasaurus Dozen
Hello fellow NLP enthusiasts! Have you ever analyzed a dataset by looking only at its summary statistics, without ever plotting it? This article may convince you that visualizing data is always a good idea and should never be overlooked. Enjoy! 😄
Anscombe’s quartet
In 1973, the statistician Francis Anscombe wanted to demonstrate:
- The importance of graphing data when analyzing it, and
- The effect of outliers and other influential observations on statistical properties.
With this goal in mind, he created Anscombe’s quartet: four datasets with nearly identical summary statistics but very different distributions, which become obvious as soon as the data are plotted. Each dataset consists of 11 two-dimensional (x, y) points.
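Here are the summary statistics shared by the four datasets, courtesy of Wikipedia: the mean of x is 9 and its sample variance is 11; the mean of y is about 7.50 and its sample variance is about 4.125; the correlation between x and y is about 0.816; and the least-squares regression line is approximately y = 3.00 + 0.500x.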
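If you want to check these numbers yourself, here is a minimal sketch that uses only the Python standard library. The data values are the quartet as it is commonly published; the names `QUARTET` and `describe` are made up for this example.

```python
import math
import statistics

# Anscombe's quartet as commonly published (datasets I-III share the same x values).
X_123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X_123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X_123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X_123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (
        [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
        [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89],
    ),
}


def describe(xs, ys):
    """Return means, sample variances, Pearson correlation and least-squares line."""
    n = len(xs)
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    var_x, var_y = statistics.variance(xs), statistics.variance(ys)
    # Sample covariance, using the same n - 1 denominator as the variances.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    slope = cov / var_x
    return {
        "mean_x": mean_x,
        "var_x": var_x,
        "mean_y": mean_y,
        "var_y": var_y,
        "corr": cov / math.sqrt(var_x * var_y),
        "slope": slope,
        "intercept": mean_y - slope * mean_x,
    }


for name, (xs, ys) in QUARTET.items():
    stats = describe(xs, ys)
    print(name, {key: round(value, 2) for key, value in stats.items()})
```

The four printed rows are nearly indistinguishable, which is exactly the point: nothing in them hints at how different the scatter plots look.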
The quartet is still often used to explain why data should be visualized before being analyzed through summary statistics, which on their own are inadequate for inferring relationships in realistic datasets. Moreover, always remember that real data may not be clean: there may be outliers, missing values, errors in how the data were created, and so on. Even if domain knowledge tells you how your data should be distributed, never assume that the data you actually have are clean.
The Datasaurus
Later, Alberto Cairo created the Datasaurus dataset, a set of two-dimensional points built around the same idea as Anscombe’s quartet: never trust summary statistics alone, and always visualize your data.
Here are the descriptive statistics of the Datasaurus dataset. Try to imagine what the dataset may look like when plotted.
Here is a plot of the dataset. Did you imagine it like this? Maybe the name “Datasaurus” was too much of a spoiler…
The Datasaurus Dozen
Inspired by Anscombe’s quartet and the Datasaurus, the paper “Same Stats, Different Graphs: Generating Datasets with Varied Appearance and Identical Statistics through Simulated Annealing” shows how to generate new datasets whose distributions look different but whose summary statistics are very similar to those of a given dataset (e.g. the Datasaurus dataset).
The authors generated 12 (a dozen, indeed) new datasets of two-dimensional points with summary statistics very similar to those of the Datasaurus dataset.
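To get a feel for the idea, here is a simplified sketch in the spirit of the paper, not the authors' actual code. It repeatedly nudges one random point, keeps the move if it brings the point closer to a target shape (or, while the "temperature" is still high, sometimes even if it does not), and rejects any move that would change the summary statistics. The circle target, the tolerance of 0.005, the linear cooling schedule, and all function names are assumptions made for this example.

```python
import math
import random
import statistics


def summary(points):
    """Return (mean_x, mean_y, sd_x, sd_y, pearson_r) for a list of (x, y) points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sd_x, sd_y = statistics.stdev(xs), statistics.stdev(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in points) / (len(points) - 1)
    return (mean_x, mean_y, sd_x, sd_y, cov / (sd_x * sd_y))


def stats_match(a, b, tol=0.005):
    """True if every statistic in a is within tol of its counterpart in b."""
    return all(abs(u - v) < tol for u, v in zip(a, b))


def distance_to_circle(point, center=(50.0, 50.0), radius=30.0):
    """Distance from a point to the nearest spot on the (assumed) target circle."""
    return abs(math.hypot(point[0] - center[0], point[1] - center[1]) - radius)


def morph_towards_circle(points, steps=5000, max_temp=0.4, shift=0.1, seed=0):
    """Nudge points towards the circle while keeping the original statistics."""
    rng = random.Random(seed)
    target_stats = summary(points)  # always compare against the ORIGINAL dataset
    current = list(points)
    for step in range(steps):
        temperature = max_temp * (1 - step / steps)  # linear cooling schedule
        i = rng.randrange(len(current))
        x, y = current[i]
        candidate = (x + rng.gauss(0, shift), y + rng.gauss(0, shift))
        improves = distance_to_circle(candidate) < distance_to_circle(current[i])
        # Accept improving moves, and occasionally worse ones while still "hot".
        if improves or rng.random() < temperature:
            trial = current[:i] + [candidate] + current[i + 1:]
            # Reject the move if the summary statistics would drift.
            if stats_match(summary(trial), target_stats):
                current = trial
    return current


# Hypothetical starting cloud of 40 random points.
rng = random.Random(42)
start = [(rng.uniform(20, 80), rng.uniform(20, 80)) for _ in range(40)]
result = morph_towards_circle(start)
print([round(v, 2) for v in summary(start)])
print([round(v, 2) for v in summary(result)])
```

Because every accepted move is checked against the statistics of the original dataset, small changes cannot accumulate into a large drift. Plotting `start` and `result` side by side is the best way to see what the extra iterations buy you.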
Learn more about the Datasaurus Dozen with this graphical explanation.
Next steps
Possible next steps are:
- Learn more about robust statistics, i.e. summary statistics that are less sensitive to outliers.
- Learn about simulated annealing as an optimization technique.
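As a small taste of the first item, the sketch below compares the mean (not robust) with the median (robust) on the y values of Anscombe’s third dataset. The "corrupted" version, where the outlier is recorded ten times too large, is a hypothetical scenario invented for this example.

```python
import statistics

# y values of Anscombe's third dataset; 12.74 is the outlier.
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
# Hypothetical data-entry error: the outlier is recorded ten times too large.
y3_corrupted = [127.4 if y == 12.74 else y for y in y3]

for label, ys in (("original", y3), ("corrupted", y3_corrupted)):
    mean = round(statistics.fmean(ys), 2)
    median = statistics.median(ys)
    print(f"{label}: mean={mean}, median={median}")
```

The mean jumps by more than 10 because of a single bad value, while the median does not move at all.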
Thank you for reading! If you are interested in learning more about NLP and Data Science, remember to follow NLPlanet on Medium, LinkedIn, Twitter, and join our new Discord server!