Khangjrakpam Arjun, South Asian University, New Delhi
April 29, 2019
The simple graph has brought more information to the data analyst’s mind than any other device.
— John Tukey
Data exploration is the art of looking at our data, rapidly generating hypotheses, quickly testing them, then repeating again and again and again.
The goal of data exploration is to generate many promising leads that we can later explore in more depth. Visualization is a great place to start with R programming because the payoff is so clear: we get to make elegant and informative plots that help us understand data. In this article we’ll dive into visualization, learning the basic structure of a ggplot2 plot and powerful techniques for turning data into plots. Here I shall be focusing on ggplot2, one of the core members of the tidyverse. To access the datasets, help pages, and functions, we load the tidyverse by running library(tidyverse).
We shall use the mpg data frame found in ggplot2. mpg contains observations collected by the US Environmental Protection Agency on 38 models of cars.
Amongst the variables of mpg are:
- model: model name
- displ: engine displacement, in litres
- year: year of manufacture
- cyl: number of cylinders
To learn more about the mpg dataset, we can open its help page by running ?mpg.
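Before plotting, it can help to look at the data frame itself. The following is a minimal sketch using base R helpers (dim(), names(), head()); it assumes only that ggplot2 is installed, since mpg ships with that package.

```r
library(ggplot2)  # mpg is bundled with ggplot2

# Number of rows (cars) and columns (variables)
dim(mpg)

# Variable names, including model, displ, year, and cyl listed above
names(mpg)

# First few observations
head(mpg)
```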
To plot mpg, we run this code to put displ on the x-axis and cty on the y-axis:
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = cty))
The plot shows a negative relationship between engine size (displ) and fuel efficiency (cty). In other words, cars with big engines use more fuel. Does this confirm or refute our hypothesis about fuel efficiency and engine size?
The greatest value of a picture is when it forces us to notice what we never expected to see.
— John Tukey
We can convey information about our data by mapping the aesthetics in our plot to the variables in our dataset. For example, we can map the colors of our points to the trans variable to reveal the type of transmission of each car:
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = trans))
To map an aesthetic to a variable, we associate the name of the aesthetic with the name of the variable inside aes(). ggplot2 will automatically assign a unique level of the aesthetic (here a unique color) to each unique value of the variable, a process known as scaling.
In the preceding example, we mapped trans to the color aesthetic, but we could have mapped trans to the size aesthetic in the same way. In this case, the size of each point would reveal its transmission type. We get a warning when we do this, because mapping an unordered variable (trans) to an ordered aesthetic (size) is not a good idea:
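A sketch of the size mapping, assuming the same x and y variables as the color example above (displ and hwy). The object name p_size is only illustrative.

```r
library(ggplot2)

# Assumption: same axes as the color example, with trans mapped to size
p_size <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, size = trans))

# Drawing the plot is what should surface ggplot2's warning
# about using size for a discrete variable
p_size
```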
Or we could have mapped trans to the alpha aesthetic, which controls the transparency of the points, or the shape of the points:
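Both alternatives can be sketched as follows, again assuming displ and hwy on the axes. The names p_alpha and p_shape are hypothetical.

```r
library(ggplot2)

# trans mapped to transparency
p_alpha <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, alpha = trans))

# trans mapped to point shape
p_shape <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, shape = trans))

p_alpha
p_shape
```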
What happened to auto(s5), auto(s6), etc.? ggplot2 will only use six shapes at a time. By default, additional groups will go unplotted when we use the shape aesthetic.
One way to add additional variables is with aesthetics. Another way, particularly useful for categorical variables, is to split our plot into facets, subplots that each display one subset of the data. To facet our plot by a single variable, we use facet_wrap(). The first argument of facet_wrap() should be a formula, which we create with ~ followed by a variable name (here “formula” is the name of a data structure in R, not a synonym for “equation”). The variable that we pass to facet_wrap() should be discrete:
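A minimal sketch of single-variable faceting. The choice of class as the faceting variable and nrow = 2 for the layout are assumptions made for illustration; any discrete variable in mpg would work the same way.

```r
library(ggplot2)

# One subplot per value of class, arranged in two rows
p_wrap <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_wrap(~ class, nrow = 2)

p_wrap
```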
To facet our plot on the combination of two variables, we add facet_grid() to our plot call. The first argument of facet_grid() is also a formula. This time the formula should contain two variable names separated by a ~:
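For two variables, a sketch might look like this. Using drv for the rows and cyl for the columns is an assumption; the formula reads as rows ~ columns.

```r
library(ggplot2)

# Rows are split by drv, columns by cyl
p_grid <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(drv ~ cyl)

p_grid
```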
A geom is the geometrical object that a plot uses to represent data. People often describe plots by the type of geom that the plot uses. For example, bar charts use bar geoms, line charts use line geoms, boxplots use boxplot geoms, and so on. Scatterplots break the trend; they use the point geom.
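Not every aesthetic works with every geom. We can set the shape of a point, but we cannot set the “shape” of a line. On the other hand, we could set the linetype of a line. geom_smooth() will draw a different line, with a different linetype, for each unique value of the variable that we map to linetype: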
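A sketch of the linetype mapping, assuming displ and hwy on the axes and drv as the grouping variable. geom_smooth() picks a smoothing method on its own here and prints a message saying which one it used.

```r
library(ggplot2)

# One smoothed line per drivetrain, each with its own linetype
p_smooth <- ggplot(data = mpg) +
  geom_smooth(mapping = aes(x = displ, y = hwy, linetype = drv))

p_smooth
```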
Here geom_smooth() separates the cars into three lines based on their drv value, which describes a car’s drivetrain. One line describes all of the points with a 4 value, one line describes all of the points with an f value, and one line describes all of the points with an r value. Here, 4 stands for four-wheel drive, f for front-wheel drive, and r for rear-wheel drive.
If we place mappings in a geom function, ggplot2 will treat them as local mappings for the layer. It will use these mappings to extend or overwrite the global mappings for that layer only. This makes it possible to display different aesthetics in different layers:
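The difference between global and local mappings can be sketched like this. The choice of class for the point color is an assumption for illustration.

```r
library(ggplot2)

# Global mappings (x, y) live in ggplot() and apply to every layer
p_layers <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  # Local mapping: color applies to the points only
  geom_point(mapping = aes(color = class)) +
  # No local mapping: the smoother uses just the global x and y
  geom_smooth()

p_layers
```

Because color is set only inside geom_point(), the smoother draws a single line through all the points rather than one line per class.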
These are some of the basic features which we can use to generate graphs using ggplot2.
We can generally use geoms and stats interchangeably. For example, we can create a plot using stat_count() instead of geom_bar():
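A sketch of the two equivalent calls. It assumes the diamonds dataset that ships with ggplot2 and its cut variable, which the later bar chart discussion also relies on.

```r
library(ggplot2)

# Bar chart written with the geom; its default stat is "count"
p_geom <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))

# The same chart written with the stat; its default geom is "bar"
p_stat <- ggplot(data = diamonds) +
  stat_count(mapping = aes(x = cut))

p_geom
p_stat
```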
This works because every geom has a default stat, and every stat has a default geom. This means that we can typically use geoms without worrying about the underlying statistical transformation.
We might want to draw greater attention to the statistical transformation in our code. For example, we might use stat_summary(), which summarizes the y values for each unique x value, to draw attention to the summary that we’re computing:
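A sketch of stat_summary() on the diamonds data, summarizing depth for each cut. The variables are assumptions, and the argument names fun, fun.min, and fun.max assume ggplot2 3.3.0 or later (earlier versions used fun.y, fun.ymin, and fun.ymax).

```r
library(ggplot2)

# For each cut: a point at the median depth,
# with a line running from the minimum to the maximum
p_summary <- ggplot(data = diamonds) +
  stat_summary(
    mapping = aes(x = cut, y = depth),
    fun.min = min,
    fun.max = max,
    fun = median
  )

p_summary
```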
There’s one more piece of magic associated with bar charts. We can color a bar chart using either the color aesthetic, or more usefully, fill:
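The two options can be compared side by side. This sketch assumes the diamonds dataset and maps cut to both the x-axis and the color or fill.

```r
library(ggplot2)

# color only changes the outline of each bar
p_outline <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, color = cut))

# fill colors the interior of each bar
p_fill <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = cut))

p_outline
p_fill
```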
Note what happens if we map the fill aesthetic to another variable, like clarity: the bars are automatically stacked. Each colored rectangle represents a combination of cut and clarity:
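A sketch of the stacked version, assuming the diamonds dataset with cut on the x-axis and clarity as the fill.

```r
library(ggplot2)

# Each bar is split into stacked segments, one per clarity level
p_stack <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = clarity))

p_stack
```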
The stacking is performed automatically by the position adjustment specified by the position argument.
position = "dodge" places overlapping objects directly beside one another. This makes it easier to compare individual values:
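A sketch of the dodged bar chart, using the same assumed variables from diamonds (cut and clarity).

```r
library(ggplot2)

# Bars for each clarity level sit side by side within each cut
p_dodge <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = clarity), position = "dodge")

p_dodge
```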
Coordinate systems are probably the most complicated part of ggplot2. The default coordinate system is the Cartesian coordinate system where the x and y position act independently to find the location of each point. There are a number of other coordinate systems that are occasionally helpful:
- coord_flip() switches the x- and y-axes. This is useful (for example) if we want horizontal boxplots. It’s also useful for long labels — it’s hard to get them to fit without overlapping on the x-axis:
- coord_polar() uses polar coordinates. Polar coordinates reveal an interesting connection between a bar chart and a Coxcomb chart:
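Both coordinate systems can be sketched briefly. The datasets and variables here are assumptions for illustration (class and hwy from mpg for the boxplots, cut from diamonds for the bar chart), and the theme() and labs() calls are only cosmetic.

```r
library(ggplot2)

# Horizontal boxplots: build the usual boxplot, then flip the axes
p_flip <- ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
  geom_boxplot() +
  coord_flip()

# A bar chart with full-width bars ...
bar <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = cut), show.legend = FALSE, width = 1) +
  theme(aspect.ratio = 1) +
  labs(x = NULL, y = NULL)

# ... becomes a Coxcomb chart in polar coordinates
p_polar <- bar + coord_polar()

p_flip
p_polar
```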
We have seen much more than how to make scatterplots, bar charts, and boxplots. We learned a foundation that we can use to make any type of plot with ggplot2. To see this, let’s make a code template:
ggplot(data = <DATA>) +
  <GEOM_FUNCTION>(
    mapping = aes(<MAPPINGS>),
    stat = <STAT>,
    position = <POSITION>
  ) +
  <COORDINATE_FUNCTION> +
  <FACET_FUNCTION>
Our new template takes seven parameters, the bracketed words that appear in the template. The seven parameters in the template compose the grammar of graphics, a formal system for building plots. The grammar of graphics is based on the insight that we can uniquely describe any plot as a combination of a dataset, a geom, a set of mappings, a stat, a position adjustment, a coordinate system, and a faceting scheme. We could use this method to build any plot that we imagine. In other words, we can use the code template that we’ve learned in this article to build hundreds of thousands of unique plots.
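To make the template concrete, here is one way to fill in all seven parameters. Every choice below is an assumption made for illustration (the diamonds dataset, a dodged bar chart, flipped coordinates, faceting by color); in practice we rarely spell out all seven, because ggplot2 supplies sensible defaults.

```r
library(ggplot2)

p_full <- ggplot(data = diamonds) +            # <DATA>
  geom_bar(                                    # <GEOM_FUNCTION>
    mapping = aes(x = cut, fill = clarity),    # <MAPPINGS>
    stat = "count",                            # <STAT>
    position = "dodge"                         # <POSITION>
  ) +
  coord_flip() +                               # <COORDINATE_FUNCTION>
  facet_wrap(~ color)                          # <FACET_FUNCTION>

p_full
```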
Reference: R for Data Science by Hadley Wickham and Garrett Grolemund
#R #DataScience #Visualization #ggplot2