Getting Good at ggplot2

My take on using the ggplot2 package

Russell Saerang
Data Folks Indonesia
8 min read · Aug 9, 2021

Image by Analytics Vidhya

Today marks the second year of my university life and my first time using ggplot2, since I’m currently taking a module in data visualization. I thought it would be interesting to write as I learn, so let’s see how it goes.

From what I knew beforehand, ggplot2 is a really good R package for data visualization. Used correctly, it can produce clear and statistically meaningful graphs and charts.

In this article, I’d like to share how to start your ggplot2 journey and maybe gg (getting good) at it.

Setting up your R workspace

The first thing to do when using ggplot2 is to make sure that the package is installed. To install it, run install.packages("ggplot2") in the R console.

If R prompts you with any options (such as choosing a CRAN mirror), pick one and let the installation finish. Next, just like in Python, after installing we have to import/load the package into our workspace by running library(ggplot2).
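Putting both steps together, here is a minimal sketch that installs ggplot2 only when it is missing and then loads it. The requireNamespace() guard is an addition for convenience, not part of the steps above.

```r
# Install ggplot2 only if it is not already available on this machine
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2")
}

# Load the package into the current session
library(ggplot2)
```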

The mtcars dataset

After setting up ggplot2, we need a dataset to start with. R ships with a built-in dataset called mtcars that we can use.

mtcars dataset will look like this

According to the R documentation, the mtcars dataset was extracted from the 1974 Motor Trend US magazine and contains information on 32 cars (1973–74 models). You can read more about what each column stands for by running ?mtcars, or run str(mtcars) to find out the dataset’s structure.
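A quick way to peek at the data before plotting, using only base R functions (the variable name dims is just for illustration):

```r
# First six rows of the built-in dataset
head(mtcars)

# Column names, types, and a preview of the values
str(mtcars)

# Number of rows and columns: 32 cars, 11 variables
dims <- dim(mtcars)
print(dims)
```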

Starting our first plot

In my opinion, the simplest plot we can make with this dataset is the scatter plot.

Let’s say we want to plot the mpg column on the x-axis and disp column on the y-axis. We can plot them as follows.
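A minimal sketch of that scatter plot, assuming the spelled-out form with named data, x, and y arguments (the variable name p is only for illustration):

```r
library(ggplot2)

# Map mpg to the x-axis and disp to the y-axis, then draw one point per car
p <- ggplot(data = mtcars, aes(x = mpg, y = disp)) +
  geom_point()

print(p)
```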

Running the code above produces this plot!

The ggplot(...) function creates the figure (the canvas) of the plot and records which data and which column mappings to use, while geom_point() adds the layer that actually draws the points.

Both functions must go together: ggplot(...) on its own only draws empty axes, and geom_point() on its own has no plot to attach to.

The data argument is the dataset that you want to plot, while aes stands for aesthetics: the mapping from columns in the data to visual properties such as x position, y position, color, and size.

Quick tip, we can actually shorten the code by just running
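A sketch of the shorter form, which I assume simply drops the argument names. This works because data is the first argument of ggplot(), and x and y are the first two arguments of aes().

```r
library(ggplot2)

# Same scatter plot, with data, x, and y matched by position
p_short <- ggplot(mtcars, aes(mpg, disp)) +
  geom_point()

print(p_short)
```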

Now that you’ve run the code and seen the plot, it actually looks good, right?

Adding color to the plot

Yes, we definitely can add some color to the plot. Usually, we use color to classify the points based on another column. For example, using the previous plot, we want to classify these points based on the value of cyl. Then, our code now becomes
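A minimal sketch, assuming the color is mapped inside aes() on top of the shortened scatter plot:

```r
library(ggplot2)

# Color each point by its number of cylinders (cyl is still numeric here)
p_color <- ggplot(mtcars, aes(mpg, disp, color = cyl)) +
  geom_point()

print(p_color)
```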

The plot now becomes like this.

Apply color based on the cyl column

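So far it looks good, but if you look at the dataset more closely, the cyl column has only 3 different values: 4, 6, and 8. Because cyl is stored as a number, ggplot2 treats it as continuous and colors the points along a gradient, even though there are really just three groups.

This means the data type can affect our visualization. To resolve it, we can convert the cyl column into a factor by applying factor() on cyl as shown below.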
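A sketch of the conversion, assuming factor() is applied directly inside aes() rather than by modifying mtcars itself:

```r
library(ggplot2)

# The three distinct cylinder counts in the data
cyl_levels <- levels(factor(mtcars$cyl))
print(cyl_levels)

# Treat cyl as a category so each value gets its own discrete color
p_factor <- ggplot(mtcars, aes(mpg, disp, color = factor(cyl))) +
  geom_point()

print(p_factor)
```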

Now the plot looks different (and probably makes more sense), right?

Converting cyl into factors produces a more reasonable plot!

Size does matter!

Besides color, we can also adjust the points’ size based on another column. Simply add the size parameter inside the aes function and you’re good to go!

For now, let’s continue from our latest code and use the hp column as the size parameter.
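Continuing from the factor-colored plot, a minimal sketch with hp mapped to size:

```r
library(ggplot2)

# Bigger points mean more horsepower; color still shows the cylinder group
p_size <- ggplot(mtcars, aes(mpg, disp, color = factor(cyl), size = hp)) +
  geom_point()

print(p_size)
```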

Size really does matter now!

Invisibility!

Other than size, we can also apply transparency to the points. To do so, we simply set the alpha parameter inside the geom_point() function. alpha runs from 0 (fully transparent) to 1 (fully opaque). If we want to apply 30% transparency, we run
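A minimal sketch, assuming the intended value is alpha = 0.3. That value is my assumption: since alpha measures opacity, 0.3 draws each point at 30% opacity, while alpha = 0.7 would be the value for points that are only 30% see-through.

```r
library(ggplot2)

# alpha is set outside aes() because it is a fixed value, not a column mapping
p_alpha <- ggplot(mtcars, aes(mpg, disp, color = factor(cyl), size = hp)) +
  geom_point(alpha = 0.3)

print(p_alpha)
```

Overlapping points now show up as darker patches, which makes crowded regions easier to read.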

Now there’s a clear distinction between different points!

Axis scaling

This part should be quick. Let’s say you’re plotting something whose values span a very wide range, like countries’ populations. We might need to put the x-axis or the y-axis on a logarithmic scale so that the points are spread more evenly across the figure. The code below is self-explanatory since it scales both the x-axis and the y-axis.

Notice that the ticks aren’t equally spaced because the axes have been scaled logarithmically
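A sketch of log-scaled axes using ggplot2’s scale_x_log10() and scale_y_log10(). The base plot here is an assumption; the same two calls can be appended to any of the earlier plots.

```r
library(ggplot2)

# Put both axes on a base-10 logarithmic scale
p_log <- ggplot(mtcars, aes(mpg, disp)) +
  geom_point() +
  scale_x_log10() +
  scale_y_log10()

print(p_log)
```

To scale only one axis, keep just the matching call.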

Dividing into subgraphs

Sometimes, colors may not help us that much in separating the groups. Splitting the plot into subgraphs may be the better solution. To do so, we apply faceting: add facet_wrap(~ column) to your code, replacing column with the name of the column you want to split by.

Let’s go back to this code.

Instead of using cyl for the color parameter, we can use that column as a facet wrapper instead. The code now changes to this!
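A sketch of both versions side by side, assuming the starting point is the scatter plot colored by factor(cyl):

```r
library(ggplot2)

# Before: cyl separates the groups by color
p_before <- ggplot(mtcars, aes(mpg, disp, color = factor(cyl))) +
  geom_point()

# After: cyl splits the plot into one panel per cylinder count
p_facet <- ggplot(mtcars, aes(mpg, disp)) +
  geom_point() +
  facet_wrap(~ cyl)

print(p_facet)
```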

Now the cyl column is used as a facet wrapper instead of a color component!

To zero and beyond!

As you can see in each of our plots, the mpg axis always starts at around 10 and doesn’t include 0. Sometimes, it is important to include 0 on one of our axes to give a better impression of how far the data is from the lowest possible value, as with an exam mark distribution.

Now let’s go back to our plot whose code is

To include zero in the mpg axis, which is the x-axis, we do the following.
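A minimal sketch using expand_limits(), assuming the plot we go back to is the basic mpg versus disp scatter plot:

```r
library(ggplot2)

# The starting plot: the x-axis only covers the observed mpg range
p_base <- ggplot(mtcars, aes(mpg, disp)) +
  geom_point()

# Stretch the x-axis so that it also covers x = 0
p_zero_x <- p_base + expand_limits(x = 0)

print(p_zero_x)
```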

As you can see, now the mpg axis includes zero!
How about including x = 50 in our plot?

Similarly, for y-axis, we can add the parameters inside the expand_limits function. We can also expand both axes to include zero by doing
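A sketch that expands both axes at once, on the same assumed base plot:

```r
library(ggplot2)

# Include the origin by naming both x and y in a single expand_limits() call
p_zero_xy <- ggplot(mtcars, aes(mpg, disp)) +
  geom_point() +
  expand_limits(x = 0, y = 0)

print(p_zero_xy)
```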

Of course, the expand_limits function does not limit itself to include only zero. We can include more values outside our data range, let’s say x = 50.

Note that you can’t use the same parameter twice, because R doesn’t allow an argument to be supplied more than once in a single function call. If you need to include several values of x (or y), pass them as a vector instead. For example, to include both x = 0 and x = 50 you can set x = c(0, 50) (or x = 0:50) in the expand_limits function. Pretty awesome feature, isn’t it?
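Both cases can be sketched as follows (the variable names are only for illustration):

```r
library(ggplot2)

p_base <- ggplot(mtcars, aes(mpg, disp)) +
  geom_point()

# Include a value above the data range
p_fifty <- p_base + expand_limits(x = 50)

# Include several values at once by passing a vector
p_range <- p_base + expand_limits(x = c(0, 50))

print(p_range)
```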

Adding a title

Simply add the ggtitle() function and you’re good! For example,
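A minimal sketch. The title text here is made up for illustration, so use whatever suits your plot.

```r
library(ggplot2)

# ggtitle() takes the title text as its first argument
p_title <- ggplot(mtcars, aes(mpg, disp, color = factor(cyl))) +
  geom_point() +
  ggtitle("Displacement vs. miles per gallon")  # hypothetical title

print(p_title)
```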

A plot with a title, at last!

Alternatively, you can put the title inside the labs() function just like this.
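The labs() version might look like this, again with a made-up title:

```r
library(ggplot2)

# Same result as ggtitle(), but set through the title argument of labs()
p_labs_title <- ggplot(mtcars, aes(mpg, disp, color = factor(cyl))) +
  geom_point() +
  labs(title = "Displacement vs. miles per gallon")  # hypothetical title

print(p_labs_title)
```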

Labeling the plot

With the above plot, you can see that the axis and color labels are simply the column names (or the expression we used, like factor(cyl)). We can rename them in much the same way as adding a plot title, by using labs(). We can also add a subtitle to the plot.
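A sketch with every label set in one labs() call. All the label texts are assumptions chosen for illustration; the units come from the mtcars documentation.

```r
library(ggplot2)

p_labeled <- ggplot(mtcars, aes(mpg, disp, color = factor(cyl))) +
  geom_point() +
  labs(
    title = "Displacement vs. miles per gallon",   # hypothetical title
    subtitle = "32 cars from the mtcars dataset",  # hypothetical subtitle
    x = "Miles per gallon",
    y = "Displacement (cu. in.)",
    color = "Cylinders"                            # renames the color legend
  )

print(p_labeled)
```

Each argument name other than title and subtitle matches the aesthetic it labels, so size = "..." would rename a size legend in the same way.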

Renaming them makes the whole plot cleaner!

Finishing with style

After setting the title, we might need to give the plot a certain theme, or maybe a background.

Suppose we save our latest plot to my_plot.

The easiest way is to use the quick theme functions by appending them to my_plot just like below
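A sketch of this step, assuming my_plot holds the colored scatter plot and picking a few of the themes that ship with ggplot2:

```r
library(ggplot2)

# Save the plot to a variable so themes can be swapped without retyping it
my_plot <- ggplot(mtcars, aes(mpg, disp, color = factor(cyl))) +
  geom_point()

# Each call returns a new plot; my_plot itself is left unchanged
plot_bw      <- my_plot + theme_bw()
plot_minimal <- my_plot + theme_minimal()
plot_classic <- my_plot + theme_classic()
plot_dark    <- my_plot + theme_dark()

print(plot_minimal)
```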

The second easiest way is to just use the themes provided by the ggthemes package.

Let’s see what happens if we apply each theme to my_plot. I’ll leave the end result to those who run this ;)
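A sketch that tries a few ggthemes themes. The selection is my own pick, and the code skips the themes with a message if ggthemes isn’t installed.

```r
library(ggplot2)

my_plot <- ggplot(mtcars, aes(mpg, disp, color = factor(cyl))) +
  geom_point()

if (requireNamespace("ggthemes", quietly = TRUE)) {
  # One themed copy of my_plot per ggthemes theme
  themed <- list(
    economist       = my_plot + ggthemes::theme_economist(),
    fivethirtyeight = my_plot + ggthemes::theme_fivethirtyeight(),
    tufte           = my_plot + ggthemes::theme_tufte()
  )
  for (themed_plot in themed) print(themed_plot)
} else {
  message("ggthemes is not installed; run install.packages(\"ggthemes\") first")
  themed <- list()
}
```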

I’ve put a link on this part’s title, so you can try the more complex form of styling a plot.

Finally, beyond the scatter plot!

Of course, data visualization doesn’t stop at the scatter plot. ggplot2 offers many other plot types, each of which is added as a layer after your ggplot(...) call.

  • Line plot, simply change geom_point() to geom_line().
The data isn’t actually suitable for line plots but I just want to show the visual difference
  • Bar plot, simply change geom_point() to geom_col().
The data isn’t actually suitable for bar plots but I just want to show the visual difference
  • Histogram

This one’s a bit different because you only need to define the x-axis and the number of bins, but not the y-axis (it shows the count of values in each bin). For example, to plot the mpg column with 5 bins, we can use

Histogram of the mpg column
  • Box plot, simply change geom_point() to geom_boxplot(). This plot works best if the x-axis is categorical (i.e. factor). Let’s replace the mpg column with the factorized cyl column, resulting in this code.
We’ve got an outlier in the group with cyl = 6, interesting!
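The four plot types in the list above can be sketched as follows. I’m assuming disp stays on the y-axis for the line, bar, and box plots; the variable names are only for illustration.

```r
library(ggplot2)

base_plot <- ggplot(mtcars, aes(mpg, disp))

# Line plot: connects the observations in order of mpg
p_line <- base_plot + geom_line()

# Bar plot: one bar per observation, with height given by disp
p_bar <- base_plot + geom_col()

# Histogram: only x is mapped; the y-axis is the count per bin
p_hist <- ggplot(mtcars, aes(mpg)) +
  geom_histogram(bins = 5)

# Box plot: a categorical x-axis, one box per cylinder group
p_box <- ggplot(mtcars, aes(factor(cyl), disp)) +
  geom_boxplot()

print(p_box)
```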

Note that you can combine some of these visualizations, for example, combining the line plot and the scatter plot.
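A minimal sketch of that combination. Each geom is simply added as another layer on the same plot.

```r
library(ggplot2)

# Layers are drawn in the order they are added: the line first, points on top
p_combo <- ggplot(mtcars, aes(mpg, disp)) +
  geom_line() +
  geom_point()

print(p_combo)
```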

Fusion between scatter plot and line plot

And there we go, you’ve covered all the basic functions in ggplot2! Pat yourself on the back for reading this or even trying it out!

Here’s the full R code if you want to experiment with more new things. Cheers :)

Full R code used in this article :)

Last Notes

  • This article doesn’t cover how to read non-built-in data such as CSVs or TSVs; you can refer to this documentation on reading CSV files instead.
  • Besides mtcars, there are other ready-made datasets you can try out, such as iris (built into R) and diamonds (bundled with ggplot2).
  • You can find more references for the ggplot function and the aes function in the R documentation. It also explains why colour and color can be used interchangeably in the aes function.
