Summary statistics can deceive. Here’s how you can get smarter with data visualization — Anscombe’s quartet

akansha khandelwal · Geek Culture · Aug 31, 2021

To make the case, let me introduce you to Anscombe’s quartet. It comprises four datasets that share nearly identical summary statistics: the same measures of central tendency, the same correlation, and even the same linear regression model. On paper they look alike in every way. Throw in data visualization, however, and a completely different story begins to emerge. In this blog, you will find detailed notes on the dataset and its visualization using Python libraries.

The dataset

To explain this concept, I use the four datasets of Anscombe’s quartet, each containing 11 pairs of x and y values.

The statistical summary is essentially identical across all four datasets: the mean of x is 9.0, the mean of y is roughly 7.50, the correlation between x and y is about 0.816, the fitted regression line is y = 3.0 + 0.5*x, and the R squared value is about 0.67. I have shared the detailed code for creating the regression model later in this blog.

Make a scatter plot for each dataset (detailed code towards the end of the blog) and, lo and behold, the four graphs tell very different stories:

  • Dataset 1: We can see a clear positive linear correlation between x and y.
  • Dataset 2: The relationship between x and y is clearly non-linear (the points trace a curve), so a straight-line fit is misleading despite the matching summary statistics.
  • Dataset 3: There is a linear relationship between x and y, but a single outlier drags the correlation down from nearly 1 to 0.816.
  • Dataset 4: A single high-leverage outlier produces a high correlation value even though x is constant for every other observation. Leverage is a measure of how far the independent values of an observation are from those of the other observations.

As seen above, it’s clear that we cannot rely solely on statistical summary. As we try to get a feel for the data and analyze it, visualization is a key step.

In the rest of the blog, I will take you through the detailed steps of understanding each dataset, arriving at the statistical values, and plotting the graphs.

Below are the Python libraries I have used to demonstrate this (a sketch of the corresponding imports follows the list):

  1. NumPy — to perform numerical & statistical operations on a dataset
  2. Pandas — for creating data frames
  3. Matplotlib & Seaborn to perform data visualization and develop inferences
  4. LinearRegression from the scikit-learn library to get the coefficient and intercept by training the model
  5. r2_score from scikit-learn to get the R squared values
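
Here is a minimal sketch of the imports this walkthrough relies on; the aliases np, pd, plt and sns are my own conventions and may differ from the original notebook.

```python
# Numerical operations and data frames
import numpy as np
import pandas as pd

# Data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Regression model and R squared metric from scikit-learn
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
```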

Let’s start analyzing the datasets:

First dataset

1. Constructing the first dataset
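
The detailed code is in the notebook linked later in the blog; a minimal sketch of this step, assuming NumPy arrays x1 and y1 and a data frame df1 (names of my own choosing), could look like this:

```python
# Anscombe's first dataset: 11 (x, y) pairs
x1 = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5])
y1 = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])

# Combine the two arrays into a data frame for convenience
df1 = pd.DataFrame({"x1": x1, "y1": y1})
```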

2. Checking the values
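
A quick way to eyeball the values, continuing with the df1 data frame from the previous step:

```python
# Show all 11 rows of the first dataset
print(df1)
```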

3. Checking the statistical summary

We can observe that the mean of x1 is 9.0 and the mean of y1 is approximately 7.50.
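
These numbers can be obtained, for example, like this (continuing with df1, x1 and y1 from above):

```python
# Full descriptive statistics: count, mean, std, min, quartiles, max
print(df1.describe())

# Or just the two means
print(np.mean(x1))   # -> 9.0
print(np.mean(y1))   # -> ~7.50
```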

4. Now let’s check the correlation between x1 and y1

The correlation between x1 and y1 is approximately 0.816.
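
One way to compute it is NumPy’s correlation-matrix function; df1.corr() in Pandas gives the same value:

```python
# Pearson correlation matrix of x1 and y1; the off-diagonal entry is corr(x1, y1)
corr_matrix = np.corrcoef(x1, y1)
print(corr_matrix[0, 1])   # -> ~0.816
```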

5. Let’s get the Linear Regression Line equation

Reshaping the x values from a one-dimensional array of shape (11,) to a two-dimensional array of shape (11, 1), since the scikit-learn library expects a 2D feature array.
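
A minimal sketch of the reshape step, assuming the NumPy array x1 from above:

```python
# scikit-learn expects features as a 2D array of shape (n_samples, n_features)
X1 = x1.reshape(-1, 1)   # shape goes from (11,) to (11, 1)
```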

Training the model
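
Fitting the model might look roughly like this; model and X1 are variable names of my own choosing:

```python
# Ordinary least squares fit: y1 ~ intercept + slope * x1
model = LinearRegression()
model.fit(X1, y1)

print(model.intercept_)   # -> ~3.0
print(model.coef_[0])     # -> ~0.5
```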

After training the model on the x1 and y1 values with scikit-learn, we get an intercept of 3.0 and a slope of 0.5, which gives the regression line:

y1 = 3.0 + 0.5 * x1

6. Now let’s check the r squared value

The R squared value is approximately 0.67.
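
A sketch of how r2_score can be used here, comparing the observed y1 values with the model’s predictions:

```python
# Predict y1 from x1 and measure how much variance the fitted line explains
y1_pred = model.predict(X1)
print(r2_score(y1, y1_pred))   # -> ~0.67
```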

In summary, for Dataset 1 we get: mean of x1 = 9.0, mean of y1 ≈ 7.50, correlation ≈ 0.816, regression line y1 = 3.0 + 0.5 * x1, and R squared ≈ 0.67.

Similarly, we can analyze the other datasets; the code is available in my Git repository linked here.

To make comparison easy, let’s plot all four datasets side by side.
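
A minimal sketch of such a side-by-side plot, reusing x1 and y1 from above and defining the remaining three datasets the same way; the figure layout and styling here are my own choices:

```python
# The remaining Anscombe datasets (x2 = x3 = x1; in dataset 4, x is 8 for all but one point)
x2, y2 = x1.copy(), np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])
x3, y3 = x1.copy(), np.array([7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73])
x4 = np.array([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8])
y4 = np.array([6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89])

# One scatter plot with a fitted regression line per dataset, on shared axes
fig, axes = plt.subplots(2, 2, figsize=(10, 8), sharex=True, sharey=True)
datasets = [(x1, y1, "Dataset 1"), (x2, y2, "Dataset 2"),
            (x3, y3, "Dataset 3"), (x4, y4, "Dataset 4")]

for ax, (x, y, title) in zip(axes.flat, datasets):
    sns.regplot(x=x, y=y, ax=ax, ci=None)
    ax.set_title(title)

plt.tight_layout()
plt.show()
```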

The resulting plot is the one we saw earlier.

Conclusion:

When it comes to understanding data, visualization can lift the veil and save you from making faulty assumptions based on summary statistics alone, as we saw with Anscombe’s quartet.

References:

https://en.wikipedia.org/wiki/Anscombe's_quartet
