Statistical Visualization In R — 2

Published in Analytics Vidhya, Dec 3, 2019

If you have not read part 1 of this R data analysis series, kindly go through the following article, where we discussed Many Ways of Reading Data into R -1.
The contents of this article are the gist of a couple of books that I was introduced to during my IIM-B days.
R for Everyone — Jared P. Lander
Practical Data Science with R — Nina Zumel & John Mount
All the code blocks discussed in the article are available as R Markdown at the GitHub link.

One of the hardest parts of analysis is producing quality supporting graphics. Conversely, a good graph is one of the best ways to present findings. Fortunately, R provides excellent graphing capabilities, both in the base installation and with add-on packages such as lattice and ggplot2. In this article, we will briefly introduce some simple graphs using base graphics and then show their counterparts in ggplot2.

Graphics are used in statistics primarily for two reasons: Exploratory Data Analysis (EDA) and presenting results. Both are incredibly important, but each is targeted at a different audience.

Base Graphics

When graphing for the first time with R, most people use base graphics and then move on to ggplot2 when their needs become more complex. This section is here for completeness and because base graphics are sometimes simply needed, especially for modifying the plots generated by other functions.

Before we go any further we need some data. Most of the datasets built into R are tiny, even by the standards of ten years ago. A good dataset for example graphs is, ironically, included with ggplot2. In order to access it, ggplot2 must first be installed and loaded. The purpose of this article is to introduce some basic statistical plots using base graphics and ggplot2, so we will use a simple dataset and focus more on charting concepts.

require(ggplot2)
data(diamonds)
head(diamonds)

Base Histograms

The most common graph of data in a single variable is a histogram. It shows the distribution of values for that variable. Creating a histogram is very simple and is shown below for the carat column in diamonds.

hist(diamonds$carat, main="Carat Histogram", xlab="Carat")

This shows the distribution of carat size. Notice that the title was set using the main argument and the x-axis label using the xlab argument. Histograms break the data into buckets, and the heights of the bars represent the number of observations that fall into each bucket.
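To make the idea of buckets concrete, here is a minimal sketch that asks hist for the bucket boundaries and counts without drawing anything. The value breaks = 50 in the last call is only an illustrative choice, not a recommendation from the books.

```r
library(ggplot2)  # only needed for the diamonds data

# Compute the histogram without plotting it
h <- hist(diamonds$carat, plot = FALSE)

h$breaks  # boundaries of the buckets
h$counts  # number of observations falling into each bucket

# Every diamond lands in exactly one bucket
sum(h$counts) == nrow(diamonds)

# Suggest a finer bucketing (R treats a single number as a suggestion)
hist(diamonds$carat, breaks = 50, main = "Carat Histogram", xlab = "Carat")
```

Changing the number of buckets can noticeably change how the distribution looks, so it is worth trying a few values during EDA.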

Base Scatterplot

It is frequently useful to see two variables in comparison with each other; this is where a scatterplot is used. We will plot the price of diamonds against the carat using formula notation.

plot(price ~ carat, data = diamonds)

The ~ separating price and carat indicates that we are viewing price against carat, where price is the y value and carat is the x value. It is also possible to build a scatterplot by simply specifying the x and y variables without the formula interface.

plot(diamonds$carat, diamonds$price)

Boxplots

Boxplots are often among the first graphs taught to statistics students. They are often used as a statistical mechanism to find outliers in data. Given their ubiquity, it is important to learn them, and thankfully R has the boxplot function to help us construct one.

boxplot(diamonds$carat)

The idea behind the boxplot is that the thick middle line represents the median and the box is bounded by the first and third quartiles. That is, the middle 50% of the data (the Interquartile Range, or IQR) is held in the box. The lines extend to the most extreme observations that lie within 1.5*IQR of the box in each direction. Any points beyond that are plotted individually as outliers.
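The numbers behind the picture can be inspected directly. The sketch below uses boxplot.stats from base R, which computes the statistics a boxplot draws; the variable names (bs, box_height, lower_fence, upper_fence) are just illustrative.

```r
library(ggplot2)  # only needed for the diamonds data

x <- diamonds$carat
bs <- boxplot.stats(x)

# stats: lower whisker, lower hinge, median, upper hinge, upper whisker
bs$stats

# The hinges are (essentially) the first and third quartiles,
# so their difference is the height of the box
box_height <- bs$stats[4] - bs$stats[2]
lower_fence <- bs$stats[2] - 1.5 * box_height
upper_fence <- bs$stats[4] + 1.5 * box_height

# Points outside the fences are the ones drawn individually as outliers
length(bs$out)
```

Note that the whiskers stop at actual data points, so the upper whisker can end below the upper fence.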

ggplot2

While R’s base graphics are extremely powerful and flexible and can be customized to a great extent, using them can be labor-intensive. Two packages, ggplot2 and lattice, were built to make graphing easier. Now we will recreate all the previous graphs with ggplot2 and expand the examples with more advanced features.

Initially the ggplot2 syntax is harder to grasp, but the effort is more than worthwhile. The basic structure of ggplot2 starts with the ggplot function, which at its most basic should take the data as its first argument. After initializing the object, we add layers using the + symbol. To start, we discuss only geometric layers such as points, lines and histograms. Furthermore, each layer can have different aesthetic mappings and even different data.
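As a rough illustration of layers carrying their own data and aesthetics, the sketch below draws all diamonds in grey and then adds a second layer that uses only a subset. The cutoff of 3 carats and the name big are arbitrary choices made for this example.

```r
library(ggplot2)

# Hypothetical subset used only for the second layer
big <- diamonds[diamonds$carat > 3, ]

p <- ggplot(diamonds, aes(x = carat, y = price)) +
  # Layer 1: inherits the data and the x/y mapping from ggplot()
  geom_point(color = "grey60") +
  # Layer 2: its own data and an extra aesthetic mapping (color by cut)
  geom_point(data = big, aes(color = cut))

p
```

The second layer still inherits x and y from the ggplot call; only the data and the color mapping are overridden.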

ggplot2 Histograms and Densities

As we did above using base graphics, let’s plot the distribution of diamond carats using ggplot2. This is built using ggplot and geom_histogram as shown below.

ggplot(data = diamonds) + geom_histogram(aes(x=carat))

A similar display is the density plot, which is done by changing geom_histogram to geom_density. We also specify the color to fill in the graphs using the fill argument.

ggplot(data=diamonds) + geom_density(aes(x=carat), fill="grey50")

Whereas the histogram displays the count of data in buckets, the density plot shows the probability of an observation falling within a sliding window along the variable of interest. The difference between the two is subtle but important: histograms are more of a discrete measurement, while density plots are more of a continuous measurement.
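One way to see the two side by side is to put the histogram on the density scale and overlay the density curve. This is only a sketch: the binwidth of 0.1 and the colors are arbitrary, and after_stat assumes a reasonably recent ggplot2 (3.3.0 or later).

```r
library(ggplot2)

p <- ggplot(diamonds, aes(x = carat)) +
  # Histogram rescaled so that the bar areas sum to 1
  geom_histogram(aes(y = after_stat(density)), binwidth = 0.1, fill = "grey80") +
  # Smoothed density estimate drawn on top
  geom_density(color = "black")

p
```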

ggplot2 Scatterplots

Here we not only see the ggplot2 way of making a scatterplot but also show some of the power of ggplot2. In the next few examples, we will be using ggplot(diamonds, aes(x=carat, y=price)) repeatedly, which ordinarily would require a lot of redundant typing. Fortunately, we can save ggplot objects to variables and add layers later. Here we add a third dimension to the scatterplot using the color column.

g <- ggplot(diamonds, aes(x=carat,y=price))
g + geom_point(aes(color=color))

Notice that we set color=color inside aes. This is because the designated color will be determined by the data. Also, see that a legend was automatically generated.
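The difference between mapping a color inside aes and setting it outside aes can be sketched as follows; the object names mapped and fixed and the color "steelblue" are just illustrative.

```r
library(ggplot2)

g <- ggplot(diamonds, aes(x = carat, y = price))

# Mapped: the color depends on the data column `color`, so a legend is drawn
mapped <- g + geom_point(aes(color = color))

# Set: every point gets the same fixed color, and no legend is drawn
fixed <- g + geom_point(color = "steelblue")

mapped
fixed
```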

ggplot2 has the ability to make faceted graphs, or small multiples, and this is done using facet_wrap or facet_grid. facet_wrap takes the levels of one variable, cuts up the underlying data according to them, makes a separate pane for each set and arranges them to fit in the plot. Here row and column placement have no real meaning.

g + geom_point(aes(color=color)) + facet_wrap(~color)

On the other hand, facet_grid acts similarly but assigns all levels of a variable to either a row or a column, as shown below.

g + geom_point(aes(color=color)) + facet_grid(cut~clarity)

After understanding how to read one pane in this plot we can easily understand all the panes and make quick comparisons.

ggplot2 Boxplots and Violins Plots

Being a complete graphics package, ggplot2 offers geom_boxplot. Even though a single boxplot is one-dimensional, using only a y aesthetic, there needs to be some x aesthetic, so we will use 1.

ggplot(diamonds,aes(y=carat,x=1)) + geom_boxplot()

This can be neatly extended to drawing multiple boxplots, one for each level of a variable as shown below.

ggplot(diamonds,aes(y=carat,x=cut)) + geom_boxplot()

Getting fancy, we can swap out the boxplots for violin plots using geom_violin. Violin plots are similar to boxplots except that the boxes are curved, giving a sense of the density of the data.
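A minimal sketch of the swap, reusing the same aesthetics as the multiple-boxplot example and replacing only the geom (the name v is arbitrary):

```r
library(ggplot2)

# Same mapping as the boxplot per cut, with geom_violin instead of geom_boxplot
v <- ggplot(diamonds, aes(y = carat, x = cut)) + geom_violin()

v
```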

We can add multiple layers (geoms) to the same plot, as seen below. Notice that the order of the layers matters. In the graph on the left, the points are underneath the violins, while in the graph on the right, the points are on top of the violins. Also note that the gridExtra package helps arrange multiple graphs in rows and columns.

require(gridExtra)
p1 <- ggplot(diamonds,aes(y=carat,x=cut)) + geom_point() + geom_violin()
p2 <- ggplot(diamonds,aes(y=carat,x=cut)) + geom_violin() + geom_point()
grid.arrange(p1, p2, ncol=2)

ggplot2 Line Graphs

Line charts are often used when one variable has a certain continuity, but that is not always necessary because there is often a good reason to use a line with categorical data.

Let’s create a simple line plot using the economics data from the ggplot2 package.

data(economics)
head(economics)
ggplot(economics, aes(x=date,y=pop)) + geom_line()

A common task for line plots is displaying a metric over the course of a year for many years. To prepare the economics data we will use the lubridate package, which provides convenient functions for manipulating dates. We need to create two new variables, year and month. To simplify things we will subset the data to include only the years from 2000 onward.

require(lubridate)
## Create month and year columns
economics$year <- year(economics$date)
economics$month <- month(economics$date)
## Subset the data
econ2000 <- economics[which(economics$year>=2000),]
head(econ2000)

Now let’s create a line plot depicting multiple years as follows. The first two lines of the code block create the line graph with a separate line and color for each year. Notice that we converted year to a factor so that it gets a discrete color scale, and then named the scale using scale_color_discrete(name="Year"). Last, the title, x-label and y-label were set with labs. All these pieces put together build a professional-looking, publication-quality graph.

g <- ggplot(econ2000,aes(x=month,y=pop))
g <- g + geom_line(aes(color=factor(year), group=year))
g <- g + scale_color_discrete(name="Year")
g <- g + labs(title="Population Growth", x="Month",y="Population")
g

Themes

A great part of ggplot2 is the ability to use themes to easily change the way plots look. While building a theme from scratch can be a daunting task, the ggthemes package has put together themes that recreate commonly used styles of graphs. Following are a few styles: The Economist, Excel, Edward Tufte and The Wall Street Journal.

require(ggthemes)
g2 <- ggplot(diamonds, aes(x=carat,y=price)) + geom_point(aes(color=color))
## Let's apply a few themes
p1 <- g2 + theme_economist() + scale_color_economist()
p2 <- g2 + theme_excel() + scale_color_excel()
p3 <- g2 + theme_tufte()
p4 <- g2 + theme_wsj()
## Arrange the four plots in a 2 x 2 grid (grid.arrange comes from gridExtra, loaded earlier)
grid.arrange(p1, p2, p3, p4, nrow=2, ncol=2)

In this article we have seen both base graphics and ggplot2 graphs, the latter being both nicer and easier to create. We have covered histograms, scatterplots, boxplots, violin plots, line plots and density graphs. We have also looked at using colors and small multiples for distinguishing data. It is just a humble introduction to ggplot2 and base plotting in R. There are many other features in ggplot2, such as jittering, stacking, dodging and alpha, which we will cover in the following articles as and when required.

In the next article, we will see some of the commonly used data manipulation methods available in R.

Group Manipulation In R — 3

Do share your thoughts and support by commenting and sharing the article among your peer groups.