Improving the scatter plot

The scatter plot is ubiquitous, and deservedly so. But a simple scatter plot:

has far too little information. The enhanced scatter plot I showed in the first figure adds several things:

  • A regression line: That’s the straight line.
  • A loess smoothed line: That’s the sort of wavy line. It’s a nonparametric smoother of the data. The fact that it doesn’t deviate much from the straight line is an indicator that linear regression is fairly appropriate here.
  • Labels for each point. Here, they are states and the variables are unemployment and infant mortality.
  • Kernel density plots for each variable, a good univariate graphic.
  • A confidence ellipse, letting you see that DC and Mississippi are bivariate outliers.

I created the fancy version in SAS. My code was:

PROC IMPORT OUT= WORK.UnempIM 
DATAFILE= "C:\personal\Graphics\UnEmpChildMort.csv"
DBMS=CSV REPLACE;
GETNAMES=YES;
DATAROW=2;
RUN;

which gets the data.

proc template;
define statgraph scatdens2;
begingraph; *BEGIN DEFINING THE GRAPH;
entrytitle "Scatter plot with density plots";
*CREATE A TITLE;
layout lattice/columns = 2 rows = 2
columnweights = (.8 .2) rowweights = (.8 .2)
columndatarange = union rowdatarange = union;
*LAYOUT LATTICE...SETS UP A GRID OF GRAPHS;
*COLUMNWEIGHTS AND ROWWEIGHTS SETS
THE RELATIVE SIZE OF THE INDIVIDUAL
COLUMNS AND ROWS;
columnaxes;
columnaxis /label = 'Unemployment (%)'
griddisplay = on;
columnaxis /label = '' griddisplay = on;
endcolumnaxes;
*COLUMNAXES SETS PARTICULAR
CHARACTERISTICS OF COLUMNS;
*THE SECOND ONE HAS NO LABEL (NONE WOULD FIT);
rowaxes;
rowaxis /label = 'Infant Mortality (per XXX)'
griddisplay = on;
rowaxis /label = '' griddisplay = on;
endrowaxes;
layout overlay; *STARTS THE ACTUAL GRAPHING OF DOTS ETC;
scatterplot x = unemployment y = infantmortality/datalabel = stateab;
*GRAPHS THE DOTS;
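loessplot x = unemployment y = infantmortality;
*LOESS CURVE WITH THE DEFAULT SMOOTHING;
loessplot x = unemployment
y = infantmortality/smooth = 1;
*SECOND LOESS CURVE WITH THE SMOOTHING PARAMETER SET TO 1;
ellipse x = unemployment y = infantmortality
/type = predicted;
*PREDICTION ELLIPSE FOR THE TWO VARIABLES;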
entry "Prediction ellipse (" {unicode alpha}"=.05)"/autoalign = auto textattrs = (color = red);
endlayout;
densityplot infantmortality/orient = horizontal kernel();
densityplot unemployment / kernel ();
endlayout;
endgraph;
end;
run;

which defines a template using the Graph Template Language (GTL). The text after each * is a comment on what that part of the code does. For more, see my paper on scatter plots.

and

options nodate nonumber ;
title;
title2;
ods pdf file = "c:\personal\presentations\SASGF14\scatterdens.pdf";
proc sgrender data = UnempIM template = scatdens2;
*NOW WE RENDER THE TEMPLATE WE CREATED;
run;
ods pdf close;

which makes a plot.

Of course, you might want other information in some cases. If either variable were discrete, you might add histograms instead of density plots. You might want to color code the points (e.g., I could have color coded by region of the country). You might want several different loess lines or, if there were discontinuities, you might want to use wavelets to generate the line.
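
As a rough sketch of the first two of those variations, the template below swaps the density plots for histograms and color codes the points by group. It is an illustration, not the code behind the figure: the template name scatgroup and the region column are hypothetical, and you would need to add a region variable to the data set before rendering it. The histograms are shown only for the syntax, since both variables here are continuous.

```sas
proc template;
define statgraph scatgroup; /* HYPOTHETICAL TEMPLATE NAME */
begingraph;
entrytitle "Scatter plot with grouped points and histograms";
layout lattice/columns = 2 rows = 2
columnweights = (.8 .2) rowweights = (.8 .2)
columndatarange = union rowdatarange = union;
columnaxes;
columnaxis /label = 'Unemployment (%)' griddisplay = on;
columnaxis /label = '' griddisplay = on;
endcolumnaxes;
rowaxes;
rowaxis /label = 'Infant Mortality' griddisplay = on;
rowaxis /label = '' griddisplay = on;
endrowaxes;
layout overlay;
/* REGION IS AN ASSUMED COLUMN, NOT PART OF THE ORIGINAL CODE */
scatterplot x = unemployment y = infantmortality
/group = region name = "pts" datalabel = stateab;
loessplot x = unemployment y = infantmortality;
/* LEGEND MAPPING COLORS TO REGIONS */
discretelegend "pts";
endlayout;
/* HISTOGRAMS REPLACE THE KERNEL DENSITY PLOTS */
histogram infantmortality/orient = horizontal;
histogram unemployment;
endlayout;
endgraph;
end;
run;

proc sgrender data = UnempIM template = scatgroup;
run;
```

The group = option is what drives the color coding. Everything else follows the same lattice layout as the template above, so the histograms line up with the scatter plot axes in the same way the density plots did.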