Scatter Plots, Why & How

Darío Weitz

Published in Analytics Vidhya, Feb 20, 2020 (7 min read)

Storytelling, Tips & Warnings

AKA: Scatter Charts, Scatter Graphs, Scatter Diagrams, Dot Charts, XY Graphs

Why: scatter plots are used to determine whether two numerical variables are correlated. They are suitable for distribution analysis of two numerical variables. They should not be used to show trends over time, and they are not suitable for comparison analysis. They are also not appropriate when the message is about quantities.

Definition 1: a correlation is a measure (a metric) of the relationship between two variables. You can calculate (using an equation) the correlation coefficient, which takes values between -1 and 1: a value close to 0 indicates no linear correlation; a value close to 1 indicates a very strong direct relationship between the two variables; a value close to -1 indicates a very strong inverse relationship between them; values above 0.7 or below -0.7 indicate, respectively, strong direct or inverse relationships; values between -0.3 and 0.3 indicate weak or null relationships.
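The definition above does not name a specific equation. A common choice is Pearson's correlation coefficient, and the following is a minimal sketch of it in plain Python, assuming that is the coefficient meant. The function name `pearson_r` and the sample data are hypothetical, chosen only for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (proportional to the covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (proportional to the variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data: y grows exactly with x, so r is at its upper bound.
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
r_direct = pearson_r(x, y)
# Reversing the direction of y flips the sign of the coefficient.
r_inverse = pearson_r(x, [-v for v in y])
```

Swapping the arguments, `pearson_r(y, x)`, gives the same value, because the coefficient is symmetric in the two variables.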

What is the nature of the message: the starting point is a data set of two numerical variables, drawn as points in the x-y Cartesian plane. The message comes from the shape those points form, which reveals whether or not a correlation exists. That correlation can be positive or negative, and a significant number of data points is needed to reveal it. Even though each point indicates exact numerical values, the purpose of the visualization is to determine whether a relationship or correlation exists between the variables represented, rather than to focus on the exact values.

How: a point is drawn for each observation of a pair of numerical variables (A, B), positioned vertically according to the value of variable A and horizontally according to the value of variable B. Taken together, the points make up a scattered cloud in the reference plane. The chart is then read to determine whether the two variables are correlated and how strong the correlation is, judged by how closely the plotted points lie to one another. The sign of the correlation coefficient indicates whether the variables are correlated or anticorrelated, that is, whether higher values of one variable correspond to higher or lower values of the other. You can add colors, labels or different visual markers to include other variables, especially categorical ones, but doing so can significantly hinder the storytelling of the visualization.
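As an illustration of the procedure just described, here is a minimal sketch using matplotlib, assuming that library is installed. The variable names and the data are hypothetical. The non-interactive "Agg" backend is selected only so the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; an assumption for headless runs
import matplotlib.pyplot as plt

# Hypothetical observations: one (x, y) pair per observation.
hours_studied = [1, 2, 3, 4, 5, 6, 7, 8]    # plotted horizontally
exam_score = [52, 55, 61, 60, 68, 72, 75, 81]  # plotted vertically

fig, ax = plt.subplots(figsize=(5, 4))
# One unconnected point per observation.
points = ax.scatter(hours_studied, exam_score)
ax.set_xlabel("Hours studied")
ax.set_ylabel("Exam score")
ax.set_title("Exam score vs. hours studied")

# To view the result, call fig.savefig("scatter.png") or plt.show()
# (the latter needs an interactive backend).
```

In this made-up data set the cloud rises from left to right, which is the visual signature of a direct (positive) relationship.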

Definition 2: a categorical variable, also called a qualitative variable, is one that usually takes a limited number of values corresponding to mutually exclusive categories or groups. These values can be numerical, but they represent mutually exclusive groups rather than quantities (e.g. gender: 1- Male; 2- Female; 3- Other).

Example: Richard Florida (1) collected data about factors that might impact regional variations in smoking and obesity, such as income, education, and even the ways people commute to work. Metros where greater shares of people walk and bike to work do better on the Metro Health Index. Conversely, the share of people who drive to work alone is negatively associated with the Metro Health Index. Higher levels of smoking and obesity are consistently associated with higher shares of people who drive to work alone.

Schematic Diagram:

Storytelling: scatter plots (dot charts) share with line plots the idea of mapping quantitative data from two numerical variables. They differ in that individual points are not connected by lines. Instead, they express the message through the distribution of the points in the Cartesian plane.

Three important features of the data set can be found in a scatter plot: 1.- Outliers, data points that are very different from all the others in the dataset and do not seem to fit the same pattern. These anomalous values might represent valuable information to analyze, but first it must be verified that they are not due to measurement errors; 2.- Gaps, intervals that contain no data. Visible gaps in the data justify an in-depth analysis to explain their presence; 3.- Clusters, isolated groups of data points, which can also merit a particular analysis of the reason for their presence in the graph. Of course, gaps and clusters might also result from errors in the data collection methodology.

A regression line is often added to the scatter plot. Also called the Line of Best Fit or Trend Line, it expresses mathematically the relationship between the two numerical variables. A regression line is a straight line that relates a dependent (response) variable to one or several independent (explanatory) variables. The goal of the regression line is to estimate unmeasured values of the dependent variable by interpolation, or to forecast by extrapolation. Special care should be taken not to confuse correlation with causation.
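One common way to obtain such a line is ordinary least squares with a single explanatory variable x and a response y. The sketch below assumes that method; the function names `fit_line` and `predict` and the data are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x must take at least two distinct values")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    # The fitted line always passes through the point (mean_x, mean_y).
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(x, slope, intercept):
    """Estimate y for a given x using the fitted line."""
    return slope * x + intercept

# Hypothetical observations.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
slope, intercept = fit_line(xs, ys)

inside = predict(2.5, slope, intercept)   # interpolation: within the observed x range
outside = predict(8.0, slope, intercept)  # extrapolation: beyond the observed x range
```

Estimates outside the observed range rest on the assumption that the linear pattern continues, so they deserve more caution than interpolated ones.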

Definitions 3 & 4: causality is the “relationship established between cause and effect”. The independent variable is the one that you manipulate, and the dependent variable is the one that you observe.

Tips for scatter plots

Scatter plots improve with the increase in the number of data points;
The Cartesian plane x-y should start at (0,0) so that the presence or absence of correlation is clearly manifested;
Symmetry: the idea of correlation between variable A and variable B is equivalent to that of variable B with respect to A;
You can draw regression lines when one of the variables is independent and the other dependent;
Categorical variables can be added through colors (preferably), visual markers or labels (text). However, doing so adds complexity to the visualization;
If colors are used to indicate different categories, they should form a harmonious combination and need not carry any explicit meaning. Use a qualitative color scale in which all colors are different enough to distinguish the categories but none stands out;
It is convenient to use visual markers (squares, triangles, asterisks, etc.) when there will be a printed output;
A labeled scatter plot uses text to identify each point of data.
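The tips on colors and markers can be sketched with matplotlib, assuming that library is installed. The group names and data are hypothetical, and `tab10` is just one of matplotlib's built-in qualitative palettes, picked here as an example. Each category gets both its own color and its own marker, so the groups stay distinguishable in a grayscale printout.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; an assumption for headless runs
import matplotlib.pyplot as plt

# Hypothetical data: (x values, y values) for each category.
groups = {
    "Group A": ([1.0, 1.5, 2.2, 3.1], [2.0, 2.4, 3.1, 3.9]),
    "Group B": ([2.0, 2.8, 3.5, 4.4], [1.1, 1.6, 2.0, 2.7]),
    "Group C": ([0.5, 1.2, 2.9, 4.0], [3.5, 3.8, 4.6, 5.2]),
}
markers = ["o", "s", "^"]         # circle, square, triangle
palette = plt.get_cmap("tab10")   # qualitative scale: distinct, unordered colors

fig, ax = plt.subplots(figsize=(5, 4))
for i, (label, (xs, ys)) in enumerate(groups.items()):
    # One scatter call per category so each gets its own legend entry.
    ax.scatter(xs, ys, color=palette(i), marker=markers[i], label=label)
ax.set_xlabel("Variable B")
ax.set_ylabel("Variable A")
ax.legend(title="Category")
```

A qualitative palette is used because the categories have no inherent order; a sequential palette would suggest a ranking that does not exist.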

Warnings

Scatter plots are usually difficult to understand for non-technical audiences;
Remember that categories are usually not ordered, so colors should not represent some sort of order if it does not exist;
Try not to show many regression lines. More than two regression lines on the screen might confuse the audience;
Several correlation coefficients can be visualized using Correlograms. However, they are very abstract diagrams that are not simple to interpret, and the numerical values that give rise to them do not appear in the chart;
Try not to use grids. If they are absolutely necessary, they should be faint;
Do not confuse scatter plots with bubble plots. The latter replace the individual points with colored disks or bubbles. The area of each bubble represents a third numerical variable. Bubble charts focus on proportions rather than correlations.

Alternative: a variation on the standard scatter plot is the Quadrant Chart. Basically, it is a scatter plot divided into four sections or quadrants. The goal is to make the chart more readable and easier to interpret.

In the same way as with a scatter plot, a point is drawn for each observation of a pair of numerical variables. A third categorical variable is represented by means of colors or visual markers. The chart is divided into four regions by means of vertical and horizontal lines. The four sections or quadrants are not necessarily of equal size.
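The division into four regions amounts to comparing each point with one vertical and one horizontal dividing line. Below is a minimal sketch in plain Python; the function name `quadrant`, the split values and the campaign data are all hypothetical.

```python
def quadrant(x, y, x_split, y_split):
    """Name the region a point falls in, given the two dividing lines."""
    horizontal = "right" if x >= x_split else "left"
    vertical = "upper" if y >= y_split else "lower"
    return f"{vertical}-{horizontal}"

# Hypothetical campaigns: name -> (cost, conversions).
campaigns = {
    "Email": (20, 90),
    "Search ads": (80, 120),
    "Billboards": (95, 15),
    "Flyers": (10, 12),
}

# The dividing lines need not sit at the center of the axes,
# so the four regions can differ in size.
COST_SPLIT = 50
CONVERSIONS_SPLIT = 60

regions = {
    name: quadrant(cost, conv, COST_SPLIT, CONVERSIONS_SPLIT)
    for name, (cost, conv) in campaigns.items()
}
```

With these made-up numbers, "Email" falls in the upper-left region (low cost, many conversions) and "Billboards" in the lower-right (high cost, few conversions).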

Quadrant charts are used to present data that can be categorized into four regions, such as a SWOT (Strengths, Weaknesses, Opportunities, and Threats) analysis. They can also be used to analyze the results of marketing campaigns. The following chart tries to shed light on the efficiency of various marketing campaigns.

To sum up, you might use a scatter plot when:

You want to show the relationship or correlation that exists between two numerical variables;
You are looking for distribution-based conclusions for two numerical variables;
You want to show or determine the existence of linear or non-linear trends or patterns, positive or negative correlations, clusters, outliers (anomalous values) or gaps in the datasets.

1.- Florida, Richard. Why Some Cities Are Healthier Than Others. CITYLAB, January 5, 2012. https://www.citylab.com/design/2012/01/why-some-cities-are-healthier-others/365/

2.- J. Gert van Dijk, Roland D. Thijs. Vasovagal Syncope. JACC: Clinical Electrophysiology, Volume 3, Issue 9, September 2017, Pages 1054–1055
