If you have not read part 1 of the R data analysis series, kindly go through the following article, where we discussed Many Ways of Reading Data into R - 1.
The content of this article is the gist of a couple of books I was introduced to during my IIM-B days.
R for Everyone — Jared P. Lander
Practical Data Science with R — Nina Zumel & John Mount
All the code blocks discussed in the article are available as R Markdown at the GitHub link.
One of the hardest parts of analysis is producing quality supporting graphics. Conversely, a good graph is one of the best ways to present findings. Fortunately, R provides excellent graphing capabilities, both in the base installation and with add-on packages such as ggplot2. In this article, we will briefly introduce some simple graphs using base graphics and then show their counterparts in ggplot2.
Graphics are used in statistics primarily for two reasons: exploratory data analysis (EDA) and presenting results. Both are incredibly important but must be targeted to different audiences.
When graphing for the first time with R, most people use base graphics and then move on to ggplot2 when their needs become more complex. This section is here for completeness and because base graphics are sometimes simply needed, especially for modifying the plots generated by other functions.
Before we go any further we need some data. Most of the datasets built into R are tiny, even by the standards of ten years ago. A good dataset for example graphs is, ironically, included with ggplot2. To access it, ggplot2 must first be installed and loaded. The purpose of this article is to introduce some basic statistical plots using base graphics and ggplot2, so we will use a simple dataset and focus more on charting concepts.
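As a minimal sketch of the setup described above, the following loads ggplot2 and takes a first look at the diamonds data. It assumes ggplot2 is already installed; the install line is left as a comment.

```r
# Install once if needed: install.packages("ggplot2")
library(ggplot2)

# diamonds ships with ggplot2; data() copies it into the workspace
data(diamonds)

# Quick look at the structure and the first few rows
str(diamonds)
head(diamonds)
```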
The most common graph of data in a single variable is a histogram, which shows the distribution of values for that variable. Creating a histogram is very simple, as shown below for the carat column in diamonds.
hist(diamonds$carat, main="Carat Histogram", xlab="Carat")
This shows the distribution of carat size. Notice that the title was set using the main argument and the x-axis label using the xlab argument. Histograms break the data into buckets, and the heights of the bars represent the number of observations that fall into each bucket.
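To make the idea of buckets concrete, here is a small sketch that asks hist for the bucket boundaries and counts without drawing anything. The value breaks = 50 is only an illustrative choice, and hist treats it as a suggestion rather than an exact number of buckets.

```r
library(ggplot2)  # provides the diamonds data

# Compute the histogram without drawing it
h <- hist(diamonds$carat, plot = FALSE)

# Bucket boundaries and the number of observations in each bucket
h$breaks
h$counts

# Every observation lands in exactly one bucket
sum(h$counts) == nrow(diamonds)

# Suggest a finer bucketing (illustrative value)
hist(diamonds$carat, breaks = 50, main = "Carat Histogram", xlab = "Carat")
```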
It is frequently useful to see two variables in comparison with each other; this is where a scatterplot is used. We will plot the price of diamonds against the carat:
plot(price ~ carat, data = diamonds)
The ~ separating price and carat indicates that we are viewing price against carat, where price is the y value and carat is the x value. It is also possible to build a scatterplot by simply specifying the x and y variables without the formula interface.
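The two ways of calling plot can be sketched side by side. The axis labels in the second call are our own addition, since without them base R labels the axes with the full expressions.

```r
library(ggplot2)  # provides the diamonds data

# Formula interface: y ~ x, with columns looked up in data
plot(price ~ carat, data = diamonds)

# Same plot without a formula: x first, then y
plot(diamonds$carat, diamonds$price, xlab = "carat", ylab = "price")
```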
Boxplots are often among the first graphs taught to statistics students. They are often used as a statistical mechanism to find outliers in data. Given their ubiquity, it is important to learn them, and thankfully R has the boxplot function to help us construct one.
The idea behind the boxplot is that the thick middle line represents the median and the box is bounded by the first and third quartiles. That is, the middle 50% of the data (the interquartile range, or IQR) is held in the box. The lines extend to the most extreme observations within 1.5*IQR of the box in both directions. Outlier points are then plotted beyond that.
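A short sketch of the boxplot for carat, together with a by-hand version of the 1.5*IQR rule. Note that boxplot computes its box edges with hinges (see fivenum), which can differ slightly from quantile, so the fences below are an approximation of what is drawn rather than an exact reproduction.

```r
library(ggplot2)  # provides the diamonds data

# Base graphics boxplot of a single variable
boxplot(diamonds$carat)

# Quartiles and interquartile range
q <- quantile(diamonds$carat, c(0.25, 0.5, 0.75))
iqr <- IQR(diamonds$carat)

# Points beyond these fences are drawn as outliers
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[3] + 1.5 * iqr

outliers <- diamonds$carat[diamonds$carat < lower_fence |
                           diamonds$carat > upper_fence]
length(outliers)
```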
While R's base graphics are extremely powerful and flexible and can be customized to a great extent, using them can be labor-intensive. Two packages, ggplot2 and lattice, were built to make graphing easier. Now we will recreate all the previous graphs with ggplot2 and expand the examples with more advanced features.
At first the ggplot2 syntax is harder to grasp, but the effort is more than worthwhile. The basic structure of ggplot2 starts with the ggplot function, which at its most basic takes the data as its first argument. After initializing the object, we add layers using the + symbol. To start, we will just discuss geometric layers such as points, lines, and histograms. Furthermore, each layer can have different aesthetic mappings and even different data.
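As a rough sketch of layers carrying their own data and aesthetics, the plot below draws all diamonds in grey and then adds a second point layer for a subset. The subset big and the 3-carat cutoff are hypothetical choices made only for this illustration.

```r
library(ggplot2)

# Hypothetical subset used only for illustration
big <- diamonds[diamonds$carat > 3, ]

p <- ggplot(diamonds, aes(x = carat, y = price)) +
  # First layer: all rows, fixed grey color
  geom_point(color = "grey60") +
  # Second layer: its own data and an extra aesthetic mapping
  geom_point(data = big, aes(color = cut))

p
```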
ggplot2 Histograms and Densities
As we did above using base graphics, let's plot the distribution of diamond carats using ggplot2. This is built using geom_histogram, as shown below.
ggplot(data = diamonds) + geom_histogram(aes(x=carat))
A similar display is the density plot, which is done by changing geom_histogram to geom_density. We also specify the color to fill in the graph using the fill argument.
ggplot(data=diamonds) + geom_density(aes(x=carat), fill="grey50")
Whereas the histogram displays the count of data in buckets, the density plot shows the probability of an observation falling within a sliding window along the variable of interest. The difference between the two is subtle but important: histograms are more of a discrete measurement, while density plots are more of a continuous measurement.
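The bucket width and the width of the sliding window can both be tuned. The sketch below uses the binwidth argument of geom_histogram and the adjust argument of geom_density; the values 0.1 and 2 are illustrative assumptions, not recommendations.

```r
library(ggplot2)

# binwidth sets the bucket width in units of carat (illustrative value)
p_hist <- ggplot(data = diamonds) +
  geom_histogram(aes(x = carat), binwidth = 0.1)

# adjust scales the smoothing bandwidth; larger is smoother (illustrative value)
p_dens <- ggplot(data = diamonds) +
  geom_density(aes(x = carat), fill = "grey50", adjust = 2)

p_hist
p_dens
```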
Here we not only see the ggplot2 way of making a scatterplot but also show some of the power of ggplot2. In the next few examples, we will be using ggplot(diamonds, aes(x=carat, y=price)) repeatedly, which ordinarily would require a lot of redundant typing. Fortunately, we can save ggplot objects to variables and add layers later. Here we add a third dimension to the plot, color.
g <- ggplot(diamonds, aes(x=carat,y=price))
g + geom_point(aes(color=color))
Notice that we set color=color inside aes. This is because the designated color will be determined by the data. Also, see that a legend was automatically generated.
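For contrast, here is a small sketch of the difference between mapping color inside aes and setting a fixed color outside it. The color name "steelblue" is just an example value.

```r
library(ggplot2)

g <- ggplot(diamonds, aes(x = carat, y = price))

# Mapped: color depends on the color column, and a legend is generated
p_mapped <- g + geom_point(aes(color = color))

# Fixed: every point gets the same color, and no legend is needed
p_fixed <- g + geom_point(color = "steelblue")

p_mapped
p_fixed
```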
ggplot2 has the ability to make faceted graphs, or small multiples, and this is done using facet_wrap or facet_grid. facet_wrap takes the levels of one variable, cuts up the underlying data according to them, makes a separate pane for each set, and arranges them to fit in the plot. Here row and column placement have no real meaning.
g + geom_point(aes(color=color)) + facet_wrap(~color)
On the other hand, facet_grid acts similarly but assigns all levels of a variable to either a row or a column, as shown below.
g + geom_point(aes(color=color)) + facet_grid(cut~clarity)
After understanding how to read one pane in this plot we can easily understand all the panes and make quick comparisons.
ggplot2 Boxplots and Violin Plots
Being a complete graphics package, ggplot2 offers geom_boxplot. Even though a boxplot is one-dimensional, using only a y aesthetic, there needs to be some x aesthetic, so we will use 1.
ggplot(diamonds,aes(y=carat,x=1)) + geom_boxplot()
This can be neatly extended to drawing multiple boxplots, one for each level of a variable, as shown below.
ggplot(diamonds,aes(y=carat,x=cut)) + geom_boxplot()
Getting fancy, we can swap out the boxplots for violin plots using geom_violin. Violin plots are similar to boxplots except that the boxes are curved, giving a sense of the density of the data.
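The text does not show code for this step, so here is a minimal sketch. It assumes the same aesthetics as the per-cut boxplot above, with only the geom changed.

```r
library(ggplot2)

# Same mapping as the per-cut boxplot, drawn as violins instead
p_violin <- ggplot(diamonds, aes(y = carat, x = cut)) + geom_violin()

p_violin
```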
We can add multiple layers (geoms) on the same plot, as seen below. Notice that the order of the layers matters. In the graph on the left, the points are underneath the violins, while in the graph on the right, the points are on top of the violins. The gridExtra package helps arrange multiple graphs in rows and columns.
library(gridExtra)
p1 <- ggplot(diamonds,aes(y=carat,x=cut)) + geom_point() + geom_violin()
p2 <- ggplot(diamonds,aes(y=carat,x=cut)) + geom_violin() + geom_point()
grid.arrange(p1, p2, ncol=2)
ggplot2 Line Graphs
Line charts are often used when one variable has a certain continuity, but that is not always necessary because there is often a good reason to use a line with categorical data. Let's create a simple line plot using the economics data from ggplot2.
ggplot(economics, aes(x=date,y=pop)) + geom_line()
A common task for line plots is displaying a metric over the course of a year for many years. To prepare the economics data we will use the lubridate package, which has convenient functions for manipulating dates. We need to create two new variables, year and month. To simplify things we will subset the data to include only years starting with 2000.
library(lubridate)
## Create month and year columns
economics$year <- year(economics$date)
economics$month <- month(economics$date)
## Subset the data
econ2000 <- economics[which(economics$year>=2000),]
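A variation on the preparation step, offered as a sketch: lubridate's month function accepts label = TRUE, which returns month names as an ordered factor instead of numbers. This works on a copy named econ (a name chosen here for illustration) so that the economics data used elsewhere is left as it is.

```r
library(ggplot2)    # provides the economics data
library(lubridate)

# Work on a copy so the original economics data is untouched
econ <- economics

econ$year <- year(econ$date)
# label = TRUE gives an ordered factor of month names
econ$month <- month(econ$date, label = TRUE)

# Keep only the years from 2000 onward
econ2000_labeled <- econ[econ$year >= 2000, ]

range(econ2000_labeled$year)
levels(econ2000_labeled$month)
```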
Now let's create line plots depicting multiple years as follows. The geom_line layer creates the line graph with a separate line and color for each year. Notice that we converted year to a factor so that it would get a discrete color scale, and then the scale was named using scale_color_discrete(name="Year"). Last, the title, x-label, and y-label were set with labs. All these pieces put together build a professional-looking, publication-quality graph, as below.
g <- ggplot(econ2000,aes(x=month,y=pop))
g <- g + geom_line(aes(color=factor(year), group=year))
g <- g + scale_color_discrete(name="Year")
g <- g + labs(title="Population Growth", x="Month",y="Population")
g
A great part of ggplot2 is the ability to use themes to easily change the way plots look. While building a theme from scratch can be a daunting task, the ggthemes package has put together themes that recreate commonly used styles of graphs. Following are a few styles: The Economist, Excel, Edward Tufte, and The Wall Street Journal.
library(ggthemes)
library(gridExtra)
g2 <- ggplot(diamonds, aes(x=carat,y=price)) + geom_point(aes(color=color))
## Let's apply a few themes
p1 <- g2 + theme_economist() + scale_color_economist()
p2 <- g2 + theme_excel() + scale_color_excel()
p3 <- g2 + theme_tufte()
p4 <- g2 + theme_wsj()
grid.arrange(p1, p2, p3, p4, nrow=2, ncol=2)
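If ggthemes is not available, ggplot2 itself ships a few complete themes and the theme function for individual tweaks. The sketch below uses theme_bw, theme_minimal, and a legend placement change; which theme looks best is a matter of taste, and these are only examples.

```r
library(ggplot2)

g2 <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(aes(color = color))

# Built-in complete themes
p_bw <- g2 + theme_bw()

# A complete theme plus one individual tweak
p_min <- g2 + theme_minimal() + theme(legend.position = "bottom")

p_bw
p_min
```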
In this article we have seen both base graphics and ggplot2 graphs, which are both nicer and easier to create. We have covered histograms, scatterplots, boxplots, line plots, and density graphs. We have also looked at using colors and small multiples for distinguishing data. It is just a humble introduction to ggplot2 and base plotting in R. There are many other features in ggplot2, such as alpha, which we will cover in the following articles as and when required.
In the next article, we will see some of the commonly used data manipulation methods available in R.
Do share your thoughts and support by commenting and sharing the article among your peer groups.