Getting Good at ggplot2
My take on using the ggplot2 package
Today marks the second year of my university life and my first time using ggplot2, as I’m currently taking a module in data visualization. I thought it would be interesting to write as I learn, so let’s see how it goes.
From what I knew beforehand, ggplot2 is a popular R package for data visualization. Used correctly, it can produce statistically meaningful graphs and charts.
In this article, I’d like to share how to start your ggplot2 journey and maybe gg (get good) at it.
Setting up your R workspace
The first thing to do when using ggplot2 is to make sure that the package is installed. To install, run
install.packages("ggplot2")
If R prompts you with options during the installation (for example, to choose a CRAN mirror), pick any of them and let the installation finish. Next, just like in Python, after installing the package we have to load it into our workspace by running
library(ggplot2)
The mtcars dataset
After setting up ggplot2, we need a dataset to start with. R ships with a built-in dataset called mtcars that you can have a look at.
According to the R documentation, the mtcars dataset was extracted from the 1974 Motor Trend US magazine and describes 32 cars (1973-74 models). You can read the documentation to see what each column stands for, or run str(mtcars) to find out the dataset’s structure.
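Here is a quick way to get a feel for the dataset before plotting. This small sketch only uses base R functions, so it assumes nothing beyond a standard R installation.

```r
# mtcars ships with base R (package "datasets"), so no import is needed.
str(mtcars)       # column names, types, and the first few values
head(mtcars, 3)   # the first three rows
dim(mtcars)       # number of rows and columns
```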
Starting our first plot
In my opinion, the simplest plot that can be done with this dataset is the scatter plot.
Let’s say we want to plot the mpg column on the x-axis and the disp column on the y-axis. We can plot them as follows.
ggplot(data = mtcars, aes(x = mpg, y = disp)) +
geom_point()
The ggplot(...) function creates the plot object: it sets up the canvas and records which data and columns to use, but it doesn’t draw any points by itself. geom_point() then adds a layer that draws one point per row.
Both functions must go together. Without geom_point(), you only get an empty canvas with axes and no points.
The data argument is the dataset that you want to plot, while aes stands for aesthetics: the mapping from columns in the data to visual properties of the plot, such as the x and y positions.
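To see the split between the canvas and the layer, we can build the plot in two steps. This is only a sketch: the names canvas and scatter are mine, and peeking at the layers element is just a way to look inside the plot object, which may behave differently across ggplot2 versions.

```r
library(ggplot2)

# ggplot() alone stores the data and the mapping but has no layers yet,
# so printing it shows an empty canvas with axes only.
canvas <- ggplot(data = mtcars, aes(x = mpg, y = disp))
length(canvas$layers)

# Adding geom_point() appends one layer that draws the points.
scatter <- canvas + geom_point()
length(scatter$layers)

print(scatter)
```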
Quick tip: since data comes first in ggplot() and x and y come first in aes(), we can shorten the code by dropping the argument names and just running
ggplot(mtcars, aes(mpg, disp)) +
geom_point()
Now that you’ve run the code and seen the plot, it actually looks good, right?
Adding color to the plot
We can definitely add some color to the plot. Usually, we use color to group the points by another column. For example, using the previous plot, let’s color the points by the value of cyl. Our code now becomes
ggplot(mtcars, aes(mpg, disp, color = cyl)) +
geom_point()
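The plot now looks like this.
So far it looks good, but if you look at the dataset more closely, the cyl column has only 3 different values: 4, 6, and 8. Because cyl is stored as a number, ggplot2 treats it as continuous and colors the points with a gradient, even though there are really just three groups.
This means the data type can affect our visualization. To resolve it, we can convert the cyl column into a factor by applying factor() to cyl as shown below.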
ggplot(mtcars, aes(mpg, disp, color = factor(cyl))) +
geom_point()
Now the plot looks different (and probably makes more sense), right?
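If you’re curious why factor() changes the legend, it helps to compare the column before and after the conversion. A small sketch using base R only (cyl_factor is just a name I picked for illustration):

```r
# cyl is stored as a number, which ggplot2 treats as continuous.
class(mtcars$cyl)
unique(mtcars$cyl)

# factor() turns it into a categorical variable with one level per value.
cyl_factor <- factor(mtcars$cyl)
class(cyl_factor)
levels(cyl_factor)
```

With a factor, each level gets its own distinct color instead of a position on a gradient.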
Size does matter!
Besides color, we can also scale the points’ size by another column. Simply add the size parameter to the aes function and you’re good to go!
For now, let’s continue from our latest code and use the hp column as the size parameter.
ggplot(mtcars, aes(mpg, disp, color = factor(cyl), size = hp)) +
geom_point()
Invisibility!
Other than size, we can also make the points transparent. To do so, we simply add the alpha parameter inside the geom_point() function. alpha ranges from 0 (fully transparent) to 1 (fully opaque), so for points at 30% opacity, we run
ggplot(mtcars, aes(mpg, disp, color = factor(cyl), size = hp)) +
geom_point(alpha = 0.3)
Axis scaling
This part should be quick. Let’s say you’re plotting values that span several orders of magnitude, like countries’ populations. We might need to put the x-axis or the y-axis on a logarithmic scale so that the points are spread more evenly across the figure. The code below puts both the x-axis and the y-axis on a log10 scale.
ggplot(mtcars, aes(mpg, disp, color = factor(cyl), size = hp)) +
geom_point(alpha = 0.3) +
scale_x_log10() +
scale_y_log10()
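To get an intuition for what a log scale does, here is a tiny sketch with made-up numbers (they are not real population figures, just powers of ten chosen for illustration):

```r
# Made-up values spanning several orders of magnitude (not real data).
population <- c(1e3, 1e5, 1e7, 1e9)

# On a linear axis the gaps between neighbours are wildly uneven...
diff(population)

# ...while on a log10 axis the same values are evenly spaced.
log_population <- log10(population)
diff(log_population)
```

This is why a log axis helps when a few huge values would otherwise squash everything else into a corner.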
Dividing into subgraphs
Sometimes, colors alone may not help much in separating the groups. Splitting the plot into subgraphs may be the better solution. To do so, we apply faceting: add facet_wrap(~ column) to your code and replace column with the name of the column to split by.
Let’s go back to this code.
ggplot(mtcars, aes(mpg, disp, color = factor(cyl))) +
geom_point()
Instead of using cyl for the color parameter, we can use that column as the facet variable. The code now changes to this!
ggplot(mtcars, aes(mpg, disp)) +
geom_point() +
facet_wrap(~ cyl)
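Each panel only shows the rows belonging to its group, so it can be handy to check how many cars each panel gets. The sketch below also stacks the panels in one column using the ncol argument of facet_wrap(); the names cars_per_cyl and stacked are mine.

```r
library(ggplot2)

# How many cars fall into each cyl group, i.e. how many points each panel gets.
cars_per_cyl <- table(mtcars$cyl)
cars_per_cyl

# Same faceted plot, with the panels stacked in a single column.
stacked <- ggplot(mtcars, aes(mpg, disp)) +
  geom_point() +
  facet_wrap(~ cyl, ncol = 1)
print(stacked)
```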
To zero and beyond!
As you can see in each of our plots, the mpg axis always starts around 10 and doesn’t include 0. Sometimes it is important to include 0 on one of our axes to give a better impression of how far the data is from the lowest possible value, as in an exam mark distribution.
Now let’s go back to our plot whose code is
ggplot(mtcars, aes(mpg, disp, color = factor(cyl))) +
geom_point()
To include zero on the mpg axis, which is the x-axis, we do the following.
ggplot(mtcars, aes(mpg, disp, color = factor(cyl))) +
geom_point() +
expand_limits(x = 0)
Similarly, for the y-axis, we can pass y = 0 to the expand_limits function. We can also expand both axes to include zero by doing
ggplot(mtcars, aes(mpg, disp, color = factor(cyl))) +
geom_point() +
expand_limits(x = 0, y = 0)
Of course, the expand_limits function isn’t limited to including zero. We can include any value outside our data range, let’s say x = 50.
ggplot(mtcars, aes(mpg, disp, color = factor(cyl))) +
geom_point() +
expand_limits(x = 50)
Note that you can’t pass the same parameter twice. If you need to include several values of x (or y), pass them together as a vector. For example, to include both x = 0 and x = 50, set x = c(0, 50) in the expand_limits function (the range 0:50 works too). Pretty awesome feature, isn’t it?
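Putting that together, here is a minimal sketch of expanding the x-axis in both directions at once. The name wide_plot is mine, and the first line is only there to confirm that 0 and 50 really lie outside the data.

```r
library(ggplot2)

# The mpg values lie strictly between 0 and 50.
range(mtcars$mpg)

# Passing a vector to x asks ggplot2 to include both 0 and 50 on the x-axis.
wide_plot <- ggplot(mtcars, aes(mpg, disp, color = factor(cyl))) +
  geom_point() +
  expand_limits(x = c(0, 50))
print(wide_plot)
```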
Adding title
Simply add the ggtitle() function and you’re good! For example,
ggplot(mtcars, aes(mpg, disp, color = factor(cyl))) +
geom_point() +
ggtitle("Plot of disp against mpg")
Alternatively, you can put the title inside the labs() function, like this.
ggplot(mtcars, aes(mpg, disp, color = factor(cyl))) +
geom_point() +
labs(title = "Plot of disp against mpg")
Labeling the plot
With the above plot, you can see that the axis and color labels are simply the column names. We can rename them in much the same way as adding a plot title, by using labs(). We can also add a subtitle to the plot.
ggplot(mtcars, aes(mpg, disp, color = factor(cyl))) +
geom_point() +
labs(
title = "Plot of disp against mpg",
subtitle = "This plot uses ggplot2",
x = "Miles per US gallon",
y = "Displacement (cu.in.)",
color = "Cylinders"
)
Finishing with style
After setting the title, we might want to give the plot a certain theme, or maybe a different background.
Suppose we save our latest plot to my_plot.
my_plot <- ggplot(mtcars, aes(mpg, disp, color = factor(cyl))) +
geom_point() +
labs(
title = "Plot of disp against mpg",
subtitle = "This plot uses ggplot2",
x = "Miles per US gallon",
y = "Displacement (cu.in.)",
color = "Cylinders"
)
The easiest way is to use the built-in theme functions by adding them to my_plot, as shown below.
my_plot + theme_gray()
my_plot + theme_bw()
my_plot + theme_linedraw()
my_plot + theme_light()
my_plot + theme_minimal()
my_plot + theme_classic()
my_plot + theme_void()
my_plot + theme_dark()
The second easiest way is to use the themes provided by the ggthemes package.
install.packages("ggthemes")
library(ggthemes)
Let’s see what happens when we apply each theme to my_plot. I’ll leave the end result to those who run this ;)
my_plot + theme_tufte()
my_plot + theme_economist()
my_plot + theme_stata()
my_plot + theme_wsj()
my_plot + theme_calc()
my_plot + theme_hc()
I’ve put a link on this part’s title, so you can try the more complex form of styling a plot.
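For a small taste of styling beyond the ready-made themes, individual elements can be adjusted with the theme() function. This is only a sketch of two tweaks I picked for illustration (legend position and a bold title), and the name styled is mine.

```r
library(ggplot2)

styled <- ggplot(mtcars, aes(mpg, disp, color = factor(cyl))) +
  geom_point() +
  labs(title = "Plot of disp against mpg", color = "Cylinders") +
  theme_minimal() +                             # start from a ready-made theme
  theme(
    legend.position = "bottom",                 # move the legend below the plot
    plot.title = element_text(face = "bold")    # make the title bold
  )
print(styled)
```

The theme() call goes after theme_minimal() on purpose: a ready-made theme replaces all theme settings, so tweaks added before it would be overwritten.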
Finally, beyond the scatter plot!
Of course, data visualization doesn’t stop at the scatter plot. There are other types of plots, which you can add after your ggplot code.
- Line plot: simply change geom_point() to geom_line().
- Bar plot: simply change geom_point() to geom_col().
- Histogram: this one’s a bit different because you only need to define the x-axis and the number of bins, but not the y-axis (ggplot2 computes the count in each bin for you). For example, to plot the mpg column with 5 bins, we can use
ggplot(mtcars, aes(mpg)) +
geom_histogram(bins = 5)
- Box plot: simply change geom_point() to geom_boxplot(). This plot works best if the x-axis is categorical (i.e. a factor). Let’s replace the mpg column with the factorized cyl column, resulting in this code.
ggplot(mtcars, aes(factor(cyl), disp)) +
geom_boxplot()
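Bar plots are often easier to read when each bar summarises a group rather than a single row. Here is a minimal sketch, assuming we want the average mpg per cyl group. The summary itself and the names mean_mpg and bars are my own choices for illustration.

```r
library(ggplot2)

# Average mpg per number of cylinders, computed with base R's aggregate().
mean_mpg <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)
mean_mpg

# One bar per cyl group; the bar height is the average mpg of that group.
bars <- ggplot(mean_mpg, aes(factor(cyl), mpg)) +
  geom_col()
print(bars)
```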
Note that you can combine some of these visualizations, for example, combining the line plot and the scatter plot.
ggplot(mtcars, aes(mpg, disp)) +
geom_point() +
geom_line()
And there we go, you’ve covered all the basic functions in ggplot2! Pat yourself on the back for reading this or even trying it out!
Here’s the full R code if you want to experiment with more new things. Cheers :)
Last Notes
- This article doesn’t cover how to read non-built-in data such as CSV or TSV files; you can refer to this documentation on reading CSV files instead.
- Besides mtcars, R has other built-in datasets which you can try out, such as iris. The diamonds dataset, which comes with ggplot2, is another good one.
- You can find more details on the ggplot function and the aes function in the R documentation. It also explains that colour and color are interchangeable in the aes function.
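As a starting point for plotting your own data, here is a rough sketch of reading a CSV file with base R’s read.csv() and handing it to ggplot(). Since I don’t have a real file to share, the sketch first writes mtcars to a temporary CSV; the names csv_path, cars_from_csv, and csv_plot are mine.

```r
library(ggplot2)

# Write mtcars to a temporary CSV file, just so there is a file to read.
csv_path <- tempfile(fileext = ".csv")
write.csv(mtcars, csv_path, row.names = FALSE)

# read.csv() is base R; it returns a data frame that ggplot() accepts directly.
cars_from_csv <- read.csv(csv_path)

csv_plot <- ggplot(cars_from_csv, aes(mpg, disp)) +
  geom_point()
print(csv_plot)
```

For your own file, replace csv_path with the path to that file.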