10 Levels of ggplot2: From Basic to Beautiful

Published in

CompassRed Data Blog

9 min readDec 13, 2019

Recently I discovered WIRED’s “Levels” series on YouTube. The concept is simple. An expert of some interesting skill (i.e., ice sculpture, origami, or knife making) explains the concept in many levels — from easy to complex.

Level 1 origami cicada vs. Level 11 origami cicada from **11 Levels of Origami: Easy to Complex**

This format is a wonderful way to explore many different skills.

One skill that is essential for anyone who works with data to learn is how to build a graph that tells a story. The ggplot2 package is one of the best tools to do that. In this article I’ll explore how to build a graph with ggplot2 from basic to beautiful.

Data To Explore

We’ll be using a dataset near and dear to my heart. Philadelphia Parking Tickets. You can get some more information about the dataset here.

Also, as we’re exploring the data, you can follow along with all of the code for it here:

rharrington31/10_levels_of_ggplot2

You can't perform that action at this time. You signed in with another tab or window. You signed out in another tab or…

github.com

Level 0: A Basic Plot

I can’t even count this as a real level. It’s a pre-level. Fundamentally, this is what a ggplot2 code block will always look like. A ggplot2 code block will always include:

Data (which we see in the tickets object)
Mapping (which we see in the aes(x = violation_desc) parameter)
Geometry (which we see from geom_bar)

For more details about how this works, check out this article. Our code block ends up looking like:

tickets %>% 
  ggplot(aes(x = violation_desc)) +
  geom_bar()

Here’s what we get from this: hot garbage. There’s good news, though. We can only get better from here!

Level 1: Data Cleaning

There are a lot of fundamental issues with our Level 0 graph. The biggest issue, though, comes before we ever even make the graph. Before we do anything else, we need to clean our data.

The Level 0 graph has 95 distinct values on the x-axis. This is far too many for a person to be able to interpret. Trimming the number of distinct values to 10 will dramatically improve the interpretability of the graph. We can start doing this by inspecting the distinct values. Here’s a sample of the values with their counts:

 1 METER EXPIRED CC     281060
 2 METER EXPIRED        181329
 3 OVER TIME LIMIT      156859
 4 EXPIRED INSPECTION   138575
 5 STOP PROHIBITED CC   115898
 6 STOPPING PROHIBITED   47395
 7 PARKING PROHBITED     47232
 8 PARKING PROHBITED CC  45082
 9 OVER TIME LIMIT CC    24585
10 PASSENGR LOADNG ZONE  24359

Immediately we start to see how to improve the number of distinct values that we have. First, for our purposes, we can remove any of the CC suffixes. We notice that there are some words that are misspelled like PROHBITED. There are also words that are similar to each other like STOP and STOPPING. Using the stringr package’s str_replace function can help us make those changes.

Once we make all of these changes, we’re still left with 81 distinct values. A great method to fix this is to use the fct_lump function from the forcats package. The function is simple enough — it allows you to decide how many distinct values you’d like to keep based upon some other value. Everything else is then designated as “Other”. Check those steps out here.

After implementing these changes we’re almost able to read all of the words on the x-axis. There are a bunch of ways to take care of that. We’ll get to it at Level 4.

Level 2: Use Default Themes

The gray background of the default ggplot2 theme is borderline iconic (and never changing). There are plenty of other themes to choose from, though. I have always been a fan of the simplicity of theme_minimal. You should explore some of the others here. When we add this code in, we’re left with:

Level 3: `scale_y_continuous` + `scales::comma`

The next level is to control how our axes look. It turns out that scientific notation (0e+00, 1e+05, etc.) is a less-than-ideal way of looking at an axis. We can use the scale_y_continuous function coupled with the comma function from the scales package to adjust the y-axis. You can see that in code here. We end up with a much nicer y-axis.

Level 4: fct_reorder + coord_flip()

After some of that prep work, we can now take care of making it easier to read our x-axis. One great technique to use for this is to pair the fct_reorder function with the coord_flip function. Similar to fct_lump, the fct_reorder function is also from the forcats package and does what it says: it reorders factors based upon another variable. We add this to our data prep process by using mutate to change the order of violation descriptions. After doing this, the bars of our graph will be ordered from low to high.

The coord_flip function can then be included as part of our ggplot2 code. This function also does what it says: it flips the axes so that our x-axis becomes our y-axis and vice versa. Our code ends up looking like this. After implementing it we end up with our first passable graph.

This would be a great graph to use internally or as part of a draft. I would want to make more changes if it were being shared externally.

Level 5: Window Dressing

There are a few simple tweaks that we can make in order to make the graph presentable:

Change the titles of the axes
Add a title for the graph
Add some color

We can take care of the first two issues by adding the labs function at the end of our ggplot2 code. Any of the titles on the plot (title, subtitle, x-axis, y-axis, caption, legends, etc.) can be altered in this way.

To add color, we need to go to our geom_col function. We can pass an additional parameter to it called fill. We’re able to designate the fill color in with this parameter in several ways — for example, as a name (“red”) or a hex code (“#E83536”). Now we have a graph that is very presentable. We’re well past “basic” and on our way to “beautiful”!

Level 6: Build Your Own Theme

Once we’ve applied some basic window dressing, we now have the opportunity to dive into the theme function. There are many components that can be modified via theme. Changes can be made at a high level to the entire graph and then further fine tuned. For example, in this case, we might decide to change the family of all of the text to Roboto with a default size of 10. From there, we are able to specifically change the format of the titles of the axes by specifying axis.text. If we need to fine tune even further (change each individual axis on its own), we could have also changed axis.text.x or axis.text.y. After these implementations, our code becomes this and we end up with a highly customized looking graph.

Level 6: Feeling Cute, Might Graph Later

At this point, we have something that looks like it could be our own theme. If we’re making multiple plots, it would be worthwhile to officially build our own theme. Here’s what that could look like:

theme_compassred <- function () { 
  theme_minimal(base_size = 10, base_family = "Roboto") %+replace% 
    theme(axis.title = element_text(face = "bold"),
          axis.text = element_text(face = "italic"),
          plot.title = element_text(face = "bold",
                                    size = 12)
          
    )
}

Even further, we can set this theme for every plot automatically:

theme_set(theme_compassred())

Level 7: Layer Multiple Geometries

One of the beautiful features of ggplot2 is the ease by which multiple geometries can be used in sync with each other. Currently, we have been using geom_col on its own. However, we can make it easier to read the plot by adding a geom_label as well.

When we have multiple geom_ functions as part of our code, it is possible for the geometries to either use the same aesthetic mapping or to have different aesthetic mappings. In this case, geom_label can use the same mappings for its x and y coordinates, but requires an additional mapping called label. There are also several parameters that we must pass to geom_label in order to properly place the labels on the graph. Here’s the code and the graph:

Level 8: Highlight a Key Field

At this point with our graph, the end user still has to do quite a bit of work to interpret the results. The key goal of our last levels is to reduce the cognitive load of the end user and make it easier for them to immediately understand the goal of our graph.

One great way to accomplish this is by highlighting a key field. We can do so with four small changes:

Transform the data by creating a new field called highlight
Move the fill to inside the aesthetic mapping and set it to highlight
Change colors with scale_fill_manual
Remove the legend with guides

After those small code changes, we end up with:

Level 9: Annotate

In this level, our goal is still to remove the cognitive load on the end user. Highlighting the value that we care about helps a lot, but we can literally spell out our takeaway even further. There are many ways to do this, but we’ll do so by taking advantage of an additional geom_label, a geom_curve, and some reorganization of our code.

At a high level, here is how our code is reorganized:

# Build an aggregated data frame of all of the tickets
tickets_agg <- 
   mutate(...)# Create the note that we'd like to share
ticket_note <-
   paste("Your text here")# Create a dataframe to position your arrow
arrow_position <-
   data.frame(...)# Build your graph
ggplot() +
   geom_col(data = tickets_agg,
            mapping = aes(x    = field,
                          y    = count,
                          fill = highlight)) +
   geom_label(data = tickets_agg,
              mapping = aes(x     = field,
                            y     = count,
                            label = count)) +
   geom_label(data = filter(tickets_agg,
                            highlight = T),
              mapping = aes(x     = field,
                            y     = count,
                            label = ticket_note)) +
   geom_curve(data = arrow_position,
              mapping = aes(x    = x_start,
                            y    = y_start,
                            xend = x_end,
                            yend = y_end))

In Level 7 we mentioned that when there are multiple geom_ functions in our code that sometimes we need to give the different geom_ functions different data and mappings. We take full advantage of this here.

Level 10: Tidy It Up

The last level is to reduce the cognitive load on the end user by removing everything that isn’t necessary. This is actually a relatively straightforward exercise, but requires you to put yourself in the shoes of the end user. You must answer the question “what can be removed while still making the graph clear?” Here are all of the things that we remove here:

We don’t need all of the grid lines. Having labels for all of the bars makes the grid lines redundant.
Similarly, we don’t need the values of our x-axis. These are also redundant because of the labels.
We don’t need either of our axis labels. The title makes it very clear what each axis represents. They end up just being noise.

Putting all of this together, we end up with our final, Level 10 graph.

Putting it All Together

Overall, a lot of simple steps (and a couple of more complicated ones) can take your graph from basic to beautiful. Here’s what that transformation looks like, one step at a time: