# Statistical Visualization In R — 2

If you have not read part 1 of this R data analysis series, please go through the earlier article, Many Ways of Reading Data into R - 1, first.

The contents of this article are a gist of a couple of books that I was introduced to during my IIM-B days.

R for Everyone — Jared P. Lander

Practical Data Science with R — Nina Zumel & John Mount

All the code blocks discussed in the article are available as R Markdown in the GitHub link.

One of the hardest parts of analysis is producing quality supporting graphics. Conversely, a good graph is one of the best ways to present findings. Fortunately, R provides excellent graphing capabilities, both in the base installation and with add-on packages such as `lattice` and `ggplot2`. In this article, we will briefly introduce some simple graphs using base graphics and then show their counterparts in `ggplot2`.

Graphics are used in statistics primarily for two reasons: `Exploratory Data Analysis (EDA)` and presenting results. Both are incredibly important but must be targeted to different audiences.

# Base Graphics

When graphing for the first time with R, most people use base graphics and then move on to `ggplot2` when their needs become more complex. This section is here for completeness and because base graphics are still needed, especially for modifying the plots generated by other functions.

Before we go any further, we need some data. Most of the datasets built into R are tiny, even by the standards of ten years ago. A good dataset for example graphs is, ironically, included with `ggplot2`. To access it, `ggplot2` must first be installed and loaded. The purpose of this article is to introduce some basic statistical plots using `base graphics` and `ggplot2`, so we will use a simple dataset and focus more on charting concepts.

`require(ggplot2)`

`data(diamonds)`

`head(diamonds)`

# Base Histograms

The most common graph of data in a single variable is a `histogram`. This shows the distribution of values for that variable. Creating a `histogram` is very simple, as shown below for the `carat` column in `diamonds`.

`hist(diamonds$carat, main="Carat Histogram", xlab="Carat")`

This shows the distribution of `carat` size. Notice that the title was set using the `main` argument and the x-axis label with the `xlab` argument. Histograms break the data into buckets, and the heights of the bars represent the number of observations that fall into each bucket.
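To make the bucketing concrete, the sketch below asks `hist` for the bins it computed instead of drawing them. It assumes `ggplot2` is installed so that `diamonds` is available; the variable name `h` is only illustrative.

```r
library(ggplot2)   # only needed here for the diamonds data
data(diamonds)

# plot = FALSE returns the histogram object without drawing it
h <- hist(diamonds$carat, plot = FALSE)

h$breaks   # bucket boundaries chosen by hist
h$counts   # number of observations in each bucket

# Every observation lands in exactly one bucket
sum(h$counts) == nrow(diamonds)
```

Passing a different `breaks` argument to `hist` changes the number of buckets, which can noticeably change how the same data looks.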

# Base Scatterplot

It is frequently useful to see two variables in comparison with each other; this is where a `scatterplot` is used. We will plot the `price` of diamonds against the `carat` using `formula` notation.

`plot(price ~ carat, data = diamonds)`

The `~` separating `price` and `carat` indicates that we are viewing `price` against `carat`, where `price` is the `y` value and `carat` is the `x` value. It is also possible to build a scatterplot by simply specifying the `x` and `y` variables without the `formula` interface.

`plot(diamonds$carat, diamonds$price)`

# Boxplots

`Boxplots` are often among the first graphs taught to statistics students. They are often used as a statistical mechanism to find outliers in data. Given their ubiquity, it is important to learn them, and thankfully R has the `boxplot` function to help us construct one.

`boxplot(diamonds$carat)`

The idea behind the `boxplot` is that the thick middle line represents the median and the box is bounded by the first and third `quartiles`. That is, the middle 50% of the data, the Interquartile Range or IQR, is held in the box. The lines (whiskers) extend to the most extreme observations that lie within 1.5 times the IQR of the box in each direction. Any points beyond that are plotted individually as outliers.
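The numbers behind a boxplot can be inspected directly. The following is a minimal sketch using `boxplot.stats` from base R, which reports the hinges (approximately the first and third quartiles), the whisker ends and the outlying points; the names `bs`, `box_height`, `upper_fence` and `lower_fence` are illustrative only.

```r
library(ggplot2)   # only needed here for the diamonds data
data(diamonds)

# boxplot.stats returns the numbers that boxplot draws
bs <- boxplot.stats(diamonds$carat)

# stats: lower whisker, lower hinge, median, upper hinge, upper whisker
bs$stats

# The hinges are (approximately) the first and third quartiles
box_height  <- bs$stats[4] - bs$stats[2]   # height of the box, i.e. the IQR
upper_fence <- bs$stats[4] + 1.5 * box_height
lower_fence <- bs$stats[2] - 1.5 * box_height

# Points outside the fences are the ones drawn as outliers
n_outliers <- sum(diamonds$carat > upper_fence | diamonds$carat < lower_fence)
n_outliers == length(bs$out)
```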

# ggplot2

While R's base graphics are extremely powerful and flexible and can be customized to a great extent, using them can be labor-intensive. Two packages, `ggplot2` and `lattice`, were built to make graphing easier. Now we will recreate all the previous graphs and expand the examples with more advanced features.

Initially the `ggplot2` syntax is harder to grasp, but the effort is more than worthwhile. The basic structure of `ggplot2` starts with the `ggplot` function, which at its most basic should take the data as its first argument. After initializing the object, we add layers using the `+` symbol. To start, we will just discuss `geometric` layers such as `points`, `lines` and `histograms`. Furthermore, each layer can have different `aesthetic` mappings and even different data.
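As a rough illustration of layers carrying their own data, the sketch below adds a second point layer that uses only a subset of `diamonds`. The 3-carat cutoff and the names `p`, `big_stones` and `p_layers` are arbitrary choices made for this example.

```r
library(ggplot2)
data(diamonds)

# Initialize the plot: data first, then the shared aesthetic mappings
p <- ggplot(diamonds, aes(x = carat, y = price))

# Layer 1 inherits the data and mappings from ggplot()
# Layer 2 brings its own data (diamonds above 3 carats) and a fixed color
big_stones <- subset(diamonds, carat > 3)
p_layers <- p +
  geom_point() +
  geom_point(data = big_stones, color = "red")

p_layers
```

Because the second layer is added last, its red points are drawn on top of the first layer.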

# ggplot2 Histograms and Densities

As we did above using `base` graphics, let's plot the distribution of `diamonds` carats using `ggplot2`. This is built using `ggplot` and `geom_histogram` as shown below.

`ggplot(data = diamonds) + geom_histogram(aes(x=carat))`

A similar display is the `density` plot, which is made by changing `geom_histogram` to `geom_density`. We also specify the color used to fill the graph with the `fill` argument.

`ggplot(data=diamonds) + geom_density(aes(x=carat), fill="grey50")`

Whereas the `histogram` displays the count of data in buckets, the `density` plot shows the probability of an observation falling within a sliding window along the variable of interest. The difference between the two is subtle but important: `histograms` are more of a discrete measurement, while `density` plots are more of a continuous measurement.
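One way to compare the two views is to draw them on the same scale. The sketch below is an illustration only: it rescales the histogram to densities with `after_stat(density)` (available in `ggplot2` 3.3.0 and later) and overlays the density curve. The bin width of 0.1 and the colors are arbitrary choices.

```r
library(ggplot2)
data(diamonds)

# Put the histogram on the density scale so both layers share a y-axis
p_dist <- ggplot(diamonds, aes(x = carat)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 0.1,
                 fill = "grey80", color = "white") +
  geom_density(color = "black")

p_dist
```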

# ggplot2 Scatterplots

Here we not only see the `ggplot2` way of making a `scatterplot` but also show some of the power of `ggplot2`. In the next few examples, we will be using `ggplot(diamonds, aes(x=carat, y=price))` repeatedly, which ordinarily would require a lot of redundant typing. Fortunately, we can save `ggplot` objects to variables and add layers later. Here we add a third dimension to the `scatterplot` using the `color` column.

`g <- ggplot(diamonds, aes(x=carat, y=price))`

`g + geom_point(aes(color=color))`

Notice that we set `color=color` inside `aes`. This is because the designated color will be determined by the data. Also, see that a legend was automatically generated.
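The difference between mapping a color inside `aes` and setting one outside it can be sketched as follows. The color name `"steelblue"` and the object names `p_mapped` and `p_fixed` are arbitrary choices for this example.

```r
library(ggplot2)
data(diamonds)

g <- ggplot(diamonds, aes(x = carat, y = price))

# Mapped: the point color follows the data's color column, so a legend is drawn
p_mapped <- g + geom_point(aes(color = color))

# Set: every point gets the same fixed color, so no legend is needed
p_fixed <- g + geom_point(color = "steelblue")
```

Printing `p_mapped` and `p_fixed` one after the other shows the same points, with and without the data-driven coloring.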

`ggplot2` has the ability to make faceted graphs, or small multiples, and this is done using `facet_wrap` or `facet_grid`. `facet_wrap` takes the levels of one variable, cuts up the underlying data according to them, makes a separate pane for each set and arranges them to fit in the plot. Here row and column placement have no real meaning.

`g + geom_point(aes(color=color)) + facet_wrap(~color)`

On the other hand, `facet_grid` acts similarly but assigns all levels of a variable to either a row or a column, as shown below.

`g + geom_point(aes(color=color)) + facet_grid(cut~clarity)`

After understanding how to read one pane in this plot we can easily understand all the panes and make quick comparisons.

# ggplot2 Boxplots and Violins Plots

Being a complete graphics package, `ggplot2` offers `geom_boxplot`. Even though a single boxplot is one-dimensional and uses only a `y` aesthetic, it is drawn against some `x` aesthetic (older versions of `ggplot2` require one), so we will use 1.

`ggplot(diamonds,aes(y=carat,x=1)) + geom_boxplot()`

This can be neatly extended to drawing multiple `boxplots`, one for each level of a variable, as shown below.

`ggplot(diamonds,aes(y=carat,x=cut)) + geom_boxplot()`

Getting fancy, we can swap out the `boxplots` for `violin` plots using `geom_violin`. `Violin` plots are similar to `boxplots` except that the boxes are curved, giving a sense of the density of the data.

We can add multiple layers (geoms) on the same plot, as seen below. Notice that the order of the layers matters. In the graph on the left, the points are underneath the violins, while in the graph on the right, the points are on top of the violins. The `gridExtra` package helps arrange multiple graphs in rows and columns.

`require(gridExtra)`

`p1 <- ggplot(diamonds, aes(y=carat, x=cut)) + geom_point() + geom_violin()`

`p2 <- ggplot(diamonds, aes(y=carat, x=cut)) + geom_violin() + geom_point()`

`grid.arrange(p1, p2, ncol=2)`

# ggplot2 Line Graphs

`Line` charts are often used when one variable has a certain continuity, but that is not strictly necessary; there is often a good reason to use a line with `categorical` data.

Let's create a simple `line` plot using the `economics` data from the `ggplot2` package.

`data(economics)`

`head(economics)`

`ggplot(economics, aes(x=date, y=pop)) + geom_line()`

A common task for line plots is displaying a metric over the course of a year for many years. To prepare the `economics` data we will use the `lubridate` package, which has convenient functions for manipulating dates. We need to create two new variables, `year` and `month`. To simplify things, we will subset the data to include only the years from 2000 onward.

`require(lubridate)`

`## Create month and year columns`

`economics$year <- year(economics$date)`

`economics$month <- month(economics$date)`

`## Subset the data`

`econ2000 <- economics[which(economics$year>=2000),]`

`head(econ2000)`
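For readers new to `lubridate`, here is a small sketch of what `year` and `month` return for a single date. The date itself is an arbitrary example; `label = TRUE` is a `lubridate` option that returns month names instead of numbers.

```r
library(lubridate)

d <- as.Date("2005-03-01")

year(d)                  # 2005
month(d)                 # 3
month(d, label = TRUE)   # an ordered factor holding the month abbreviation
```

Using `month(economics$date, label = TRUE)` instead of the numeric month is one possible way to get month names on the x-axis of a plot.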

Now let's create line plots depicting multiple years as follows. The first two lines of the code block create the line graph with a separate line and color for each year. Notice that we converted `year` to a `factor` so that it would get a discrete color scale, and then the scale was named using `scale_color_discrete(name="Year")`. Last, the `title`, `x-label` and `y-label` were set with `labs`. All these pieces put together build a professional-looking, publication-quality graph as below.

`g <- ggplot(econ2000, aes(x=month, y=pop))`

`g <- g + geom_line(aes(color=factor(year), group=year))`

`g <- g + scale_color_discrete(name="Year")`

`g <- g + labs(title="Population Growth", x="Month", y="Population")`

`g`

# Themes

A great part of `ggplot2` is the ability to use themes to easily change the way plots look. While building a theme from scratch can be a daunting task, the `ggthemes` package has put together themes that recreate commonly used styles of graphs. Following are a few styles: `The Economist`, `Excel`, `Edward Tufte` and `The Wall Street Journal`.

`require(ggthemes)`

`g2 <- ggplot(diamonds, aes(x=carat, y=price)) + geom_point(aes(color=color))`

`## Let's apply a few themes`

`p1 <- g2 + theme_economist() + scale_color_economist()`

`p2 <- g2 + theme_excel() + scale_color_excel()`

`p3 <- g2 + theme_tufte()`

`p4 <- g2 + theme_wsj()`

`grid.arrange(p1, p2, p3, p4, nrow=2, ncol=2)`
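A middle ground between a ready-made theme and building one from scratch is to start from a complete theme and override a few elements. The sketch below uses only functions that ship with `ggplot2` itself; the particular tweaks, the plot title and the name `p_custom` are illustrative choices, not part of the original examples.

```r
library(ggplot2)
data(diamonds)

g2 <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(aes(color = color))

# Start from a complete built-in theme, then override individual elements
p_custom <- g2 +
  theme_minimal() +
  theme(
    legend.position = "bottom",                # move the legend under the plot
    plot.title = element_text(face = "bold")   # make the title bold
  ) +
  labs(title = "Diamond Price by Carat")       # hypothetical title

p_custom
```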

In this article we have seen both base graphics and `ggplot2` graphs, the latter being both nicer and easier to create. We have covered `histograms`, `scatterplots`, `boxplots`, `violin plots`, `line plots` and `density` graphs. We have also looked at using colors and small multiples for distinguishing data. This is just a humble introduction to `ggplot2` and `base` plotting in R. There are many other features in `ggplot2`, such as `jittering`, `stacking`, `dodging` and `alpha`, which we will cover in following articles as and when required.

In the next article, we will see some of the commonly used data manipulation methods available in R.

Do share your thoughts and support by commenting and sharing the article among your peer groups.