Anscombe’s Quartet: The Importance of Data Visualization

Muhammad Usman
Nov 3 · 3 min read

People often believe that “numerical calculations are exact, but graphs are rough.” That belief is wrong, and I held it myself before learning data analytics.

If you are new to data science or one of its subfields, this example is a good first step toward understanding why data visualization matters alongside statistical results.

Image from Google

Anscombe’s Quartet is the model example of why data visualization matters. The statistician Francis Anscombe constructed it in 1973 to demonstrate both the importance of plotting data before analyzing it and the effect of outliers on statistical properties. It comprises four data-sets, each consisting of eleven (x, y) points. The key observation is that all four share nearly identical descriptive statistics (mean, variance, standard deviation, etc.) but have very different graphical representations. Each plot shows a different behavior that the summary statistics alone do not reveal.

Four Data-sets

Applying the standard statistical formulas to each of the four data-sets above gives nearly the same results for all of them:

Average Value of x = 9

Average Value of y = 7.50

Variance of x = 11

Variance of y = 4.12

Correlation Coefficient = 0.816

Linear Regression Equation: y = 0.5x + 3
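These figures can be checked with a few lines of plain Python. The following is a minimal sketch, not the code linked from this article: it assumes the standard published values of the quartet (the table above is an image, so the numbers are typed in from the commonly cited data-set), uses the sample variance (dividing by n - 1), and the names `QUARTET` and `summarize` are made up for illustration.

```python
# Commonly published values of Anscombe's quartet (assumed to match the table above).
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]  # shared x for data-sets I-III
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(x, y):
    """Return mean, sample variance, Pearson r and least-squares line for (x, y)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)  # sum of squared deviations of x
    syy = sum((b - mean_y) ** 2 for b in y)  # sum of squared deviations of y
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return {
        "mean_x": mean_x,
        "mean_y": mean_y,
        "var_x": sxx / (n - 1),
        "var_y": syy / (n - 1),
        "corr": sxy / (sxx * syy) ** 0.5,
        "slope": slope,
        "intercept": mean_y - slope * mean_x,
    }


SUMMARY = {name: summarize(x, y) for name, (x, y) in QUARTET.items()}

for name, stats in SUMMARY.items():
    print(name, {key: round(value, 2) for key, value in stats.items()})
```

Rounded to two decimal places, the four printed rows should be almost indistinguishable from one another, which is exactly the point of the quartet.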

Judged by these summary statistics alone, the four data-sets look almost identical. But when we plot them on the x-y coordinate plane, we get the following results, and each plot shows a different behavior.

Graphical Representation of Anscombe’s Quartet
  • Data-set I: a set of (x, y) points that show a linear relationship with some scatter around the line.
  • Data-set II: the points follow a smooth curve rather than a straight line, so the relationship is not linear (it looks quadratic).
  • Data-set III: a tight linear relationship between x and y, except for one large outlier that pulls the regression line away from the other points.
  • Data-set IV: the value of x is the same for every point except one, and that single outlier is what produces the high correlation coefficient.
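The differences in data-sets II to IV can also be confirmed numerically, without a plot. The sketch below is only an illustration under the same assumption as before (the commonly published quartet values, typed in by hand); the helper `pearson` and the variable names are hypothetical and not taken from the linked GitHub code.

```python
# Commonly published values for data-sets II, III and IV (assumed, see lead-in).
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
Y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
Y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
X4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


# Data-set II: x runs over 4..14 in steps of 1, so a (nearly) constant
# second difference of y is the signature of a parabola.
y2_by_x = [y for _, y in sorted(zip(X123, Y2))]
first_diff = [b - a for a, b in zip(y2_by_x, y2_by_x[1:])]
second_diff = [round(b - a, 2) for a, b in zip(first_diff, first_diff[1:])]

# Data-set III: drop the point farthest from the shared line y = 0.5x + 3
# and recompute the correlation of the remaining ten points.
outlier = max(range(len(X123)), key=lambda i: abs(Y3[i] - (0.5 * X123[i] + 3)))
x3_trimmed = [v for i, v in enumerate(X123) if i != outlier]
y3_trimmed = [v for i, v in enumerate(Y3) if i != outlier]
r3_trimmed = pearson(x3_trimmed, y3_trimmed)

# Data-set IV: list the distinct x values.
distinct_x4 = sorted(set(X4))

print("II  second differences:", second_diff)
print("III outlier (x, y):", (X123[outlier], Y3[outlier]), "r without it:", round(r3_trimmed, 4))
print("IV  distinct x values:", distinct_x4)
```

The second differences for data-set II barely change, the correlation for data-set III jumps to almost exactly 1 once the outlier is removed, and data-set IV has only two distinct x values. None of this is visible in the shared summary statistics, but all of it is obvious at a glance in the plots.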

Python code on GitHub !

Output of Python code

Data-sets which are identical over a number of statistical properties, yet produce dissimilar graphs, are frequently used to illustrate the importance of graphical representations when exploring data. This isn’t to say that summary statistics are useless. They’re just misleading on their own. It’s important to use these as just one tool in a larger data analysis process. Visualizing our data allows us to revisit our summary statistics and re-contextualize them as needed.

“Visualization gives you answers to questions you didn’t know you had.” - Ben Shneiderman

Reference Research Paper : https://www.autodeskresearch.com/publications/samestats

    Muhammad Usman

    Written by

    Masters Student in RWTH Aachen University | Software Development Engineer at Symantec Corporation https://stackoverflow.com/users/9180179/usman
