10 Levels of ggplot2: From Basic to Beautiful

Ryan Harrington
Dec 13, 2019

Recently I discovered WIRED’s “Levels” series on YouTube. The concept is simple: an expert in some interesting skill (e.g., ice sculpture, origami, or knife making) explains it at many levels, from easy to complex.

Level 1 origami cicada vs. Level 11 origami cicada from 11 Levels of Origami: Easy to Complex

This format is a wonderful way to explore many different skills.

One essential skill for anyone who works with data is building a graph that tells a story. The ggplot2 package is one of the best tools for that. In this article I’ll explore how to build a graph with ggplot2 from basic to beautiful.

Data To Explore

Also, as we’re exploring the data, you can follow along with all of the code for it here:

Level 0: A Basic Plot

  • Data (which we see in the tickets object)
  • Mapping (which we see in the aes(x = violation_desc) parameter)
  • Geometry (which we see from geom_bar)

For more details about how this works, check out this article. Our code block ends up looking like:

library(dplyr)    # provides the %>% pipe
library(ggplot2)

tickets %>%
  ggplot(aes(x = violation_desc)) +
  geom_bar()

Here’s what we get from this: hot garbage. There’s good news, though. We can only get better from here!

Level 0: Hot Garbage.

Level 1: Data Cleaning

The Level 0 graph has 95 distinct values on the x-axis. This is far too many for a person to be able to interpret. Trimming the number of distinct values to 10 will dramatically improve the interpretability of the graph. We can start doing this by inspecting the distinct values. Here’s a sample of the values with their counts:

 1 METER EXPIRED CC     281060
 2 METER EXPIRED        181329
 3 OVER TIME LIMIT      156859
 4 EXPIRED INSPECTION   138575
 5 STOP PROHIBITED CC   115898
 6 STOPPING PROHIBITED   47395
 7 PARKING PROHBITED     47232
 8 PARKING PROHBITED CC  45082
 9 OVER TIME LIMIT CC    24585
10 PASSENGR LOADNG ZONE  24359

Immediately we start to see how to improve the number of distinct values that we have. First, for our purposes, we can remove any of the CC suffixes. We notice that there are some words that are misspelled like PROHBITED. There are also words that are similar to each other like STOP and STOPPING. Using the stringr package’s str_replace function can help us make those changes.
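As a rough sketch of that cleanup, assuming descriptions like the ones in the sample above: the toy vector and the exact patterns here are illustrative and are not the code used for the article's graphs.

```r
library(stringr)

# Toy sample of descriptions, taken from the values listed above
violation_desc <- c("METER EXPIRED CC", "METER EXPIRED",
                    "STOP PROHIBITED CC", "STOPPING PROHIBITED",
                    "PARKING PROHBITED", "PARKING PROHBITED CC")

# Drop the trailing CC suffix
cleaned <- str_replace(violation_desc, " CC$", "")
# Fix the misspelling
cleaned <- str_replace(cleaned, "PROHBITED", "PROHIBITED")
# Merge STOP into STOPPING (only when STOP is the whole first word)
cleaned <- str_replace(cleaned, "^STOP ", "STOPPING ")

unique(cleaned)
```

Six raw values collapse to three here. On a real data frame the same calls would typically sit inside a mutate() step.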

Once we make all of these changes, we’re still left with 81 distinct values. A great method to fix this is to use the fct_lump function from the forcats package. The function is simple enough: it lets you decide how many distinct values to keep, ranked by how often they occur (or by a weight you supply). Everything else is then designated as “Other”. Check those steps out here.
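To make the lumping concrete, here is a minimal sketch on made-up data. The category counts are invented for illustration, and n = 2 is used only to keep the output short (the article keeps 10).

```r
library(forcats)

# Toy data: four categories with decreasing frequency (illustrative only)
violation_desc <- factor(c(rep("METER EXPIRED", 5),
                           rep("OVER TIME LIMIT", 3),
                           rep("EXPIRED INSPECTION", 2),
                           "PASSENGER LOADING ZONE"))

# Keep the 2 most frequent values; everything else becomes "Other"
lumped <- fct_lump(violation_desc, n = 2)

table(lumped)
```

The two rarest categories are merged into a single “Other” level, which forcats places last.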

After implementing these changes we’re almost able to read all of the words on the x-axis. There are a bunch of ways to take care of that. We’ll get to it at Level 4.

Level 1: Less Hot Garbage

Level 2: Use Default Themes

Level 2: Merely Garbage

Level 3: scale_y_continuous + scales::comma

Level 3: Approaching Okay

Level 4: fct_reorder + coord_flip()

The coord_flip function can then be included as part of our ggplot2 code. This function also does what it says: it flips the axes so that our x-axis becomes our y-axis and vice versa. Our code ends up looking like this. After implementing it we end up with our first passable graph.
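A minimal sketch of this level, assuming the tickets have already been counted into a small data frame. The name tickets_agg is borrowed from Level 9, and the counts are invented.

```r
library(ggplot2)
library(forcats)

# Toy counts per violation (illustrative, not the real totals)
tickets_agg <- data.frame(
  violation_desc = c("METER EXPIRED", "OVER TIME LIMIT",
                     "EXPIRED INSPECTION", "Other"),
  count = c(460, 180, 140, 300)
)

# Order the categories by their count, smallest first
ordered_desc <- fct_reorder(tickets_agg$violation_desc, tickets_agg$count)

p <- ggplot(tickets_agg,
            aes(x = fct_reorder(violation_desc, count), y = count)) +
  geom_col() +    # bars use the precomputed counts
  coord_flip()    # categories move to the vertical axis

p
```

Because fct_reorder sorts in ascending order and coord_flip draws the first level at the bottom, the largest bar ends up at the top of the chart.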

This would be a great graph to use internally or as part of a draft. I would want to make more changes if it were being shared externally.

Level 4: Not Too Shabby!

Level 5: Window Dressing

  1. Change the titles of the axes
  2. Add a title for the graph
  3. Add some color

We can take care of the first two issues by adding the labs function at the end of our ggplot2 code. Any of the titles on the plot (title, subtitle, x-axis, y-axis, caption, legends, etc.) can be altered in this way.

To add color, we need to go to our geom_col function. We can pass an additional parameter to it called fill. We’re able to designate the fill color with this parameter in several ways, for example as a name (“red”) or a hex code (“#E83536”). Now we have a graph that is very presentable. We’re well past “basic” and on our way to “beautiful”!
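Roughly, the two changes look like this on a toy data frame. The title and axis wording are placeholders, not the text used in the article's graph.

```r
library(ggplot2)

# Toy counts per violation (illustrative only)
tickets_agg <- data.frame(
  violation_desc = c("METER EXPIRED", "OVER TIME LIMIT", "Other"),
  count = c(460, 180, 300)
)

p <- ggplot(tickets_agg, aes(x = violation_desc, y = count)) +
  geom_col(fill = "#E83536") +   # constant color, set outside aes()
  coord_flip() +
  labs(title = "Parking tickets by violation",   # placeholder wording
       x = "Violation",
       y = "Number of tickets")

p
```

Setting fill outside aes() applies one color to every bar; mapping it inside aes() would instead color bars by a data field.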

Level 5: The Midway Point

Level 6: Build Your Own Theme

Level 6: Feeling Cute, Might Graph Later

At this point, we have something that looks like it could be our own theme. If we’re making multiple plots, it would be worthwhile to officially build our own theme. Here’s what that could look like:

theme_compassred <- function() {
  theme_minimal(base_size = 10, base_family = "Roboto") %+replace%
    theme(axis.title = element_text(face = "bold"),
          axis.text = element_text(face = "italic"),
          plot.title = element_text(face = "bold",
                                    size = 12))
}

Even further, we can set this theme for every plot automatically:

theme_set(theme_compassred())

Level 7: Layer Multiple Geometries

When we have multiple geom_ functions as part of our code, it is possible for the geometries to either use the same aesthetic mapping or to have different aesthetic mappings. In this case, geom_label can use the same mappings for its x and y coordinates, but requires an additional mapping called label. There are also several parameters that we must pass to geom_label in order to properly place the labels on the graph. Here’s the code and the graph:
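A minimal sketch of layering geom_label on top of geom_col, assuming a small toy data frame. The hjust and nudge_y values are illustrative guesses rather than the settings used for the article's graph.

```r
library(ggplot2)

# Toy counts per violation (illustrative only)
tickets_agg <- data.frame(
  violation_desc = c("METER EXPIRED", "OVER TIME LIMIT", "Other"),
  count = c(460, 180, 300)
)

# x and y are declared once in ggplot() and shared by both layers;
# geom_label adds only the extra label mapping
p <- ggplot(tickets_agg, aes(x = violation_desc, y = count)) +
  geom_col(fill = "#E83536") +
  geom_label(aes(label = scales::comma(count)),
             hjust = 1,        # right-align the label at the bar's end
             nudge_y = -5) +   # shift it slightly back into the bar
  coord_flip()

p
```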

Level 7: Onions Have Layers

Level 8: Highlight a Key Field

One great way to accomplish this is by highlighting a key field. We can do so with four small changes:

  • Transform the data by creating a new field called highlight
  • Move the fill to inside the aesthetic mapping and set it to highlight
  • Change colors with scale_fill_manual
  • Remove the legend with guides
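The four changes can be sketched as follows on toy data. Which bar to highlight and the grey used for the other bars are assumptions made for illustration.

```r
library(ggplot2)

# Toy counts per violation (illustrative only)
tickets_agg <- data.frame(
  violation_desc = c("METER EXPIRED", "OVER TIME LIMIT", "Other"),
  count = c(460, 180, 300)
)

# 1. New field flagging the bar to emphasize
tickets_agg$highlight <- tickets_agg$violation_desc == "METER EXPIRED"

p <- ggplot(tickets_agg,
            aes(x = violation_desc, y = count,
                fill = highlight)) +                  # 2. fill inside aes()
  geom_col() +
  scale_fill_manual(values = c("TRUE" = "#E83536",    # 3. choose the colors
                               "FALSE" = "grey70")) +
  guides(fill = "none") +                             # 4. drop the legend
  coord_flip()

p
```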

After those small code changes, we end up with:

Level 8: Ombre

Level 9: Annotate

At a high level, here is how our code is reorganized:

# Build an aggregated data frame of all of the tickets
tickets_agg <-
  mutate(...)

# Create the note that we'd like to share
ticket_note <-
  paste("Your text here")

# Create a data frame to position your arrow
arrow_position <-
  data.frame(...)

# Build your graph
ggplot() +
  geom_col(data = tickets_agg,
           mapping = aes(x = field,
                         y = count,
                         fill = highlight)) +
  geom_label(data = tickets_agg,
             mapping = aes(x = field,
                           y = count,
                           label = count)) +
  # Only the highlighted row gets the note (== compares, = would not)
  geom_label(data = filter(tickets_agg,
                           highlight == TRUE),
             mapping = aes(x = field,
                           y = count,
                           label = ticket_note)) +
  geom_curve(data = arrow_position,
             mapping = aes(x = x_start,
                           y = y_start,
                           xend = x_end,
                           yend = y_end))

In Level 7 we mentioned that when our code contains multiple geom_ functions, we sometimes need to give them different data and mappings. We take full advantage of that here.

Level 10: Tidy It Up

  • We don’t need all of the grid lines. Having labels for all of the bars makes the grid lines redundant.
  • Similarly, we don’t need the values of our x-axis. These are also redundant because of the labels.
  • We don’t need either of our axis labels. The title makes it very clear what each axis represents. They end up just being noise.
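A sketch of how these three removals might be expressed with theme(), using toy data and theme_minimal as a stand-in base theme. Note that after coord_flip, axis.text.x in theme() refers to the horizontal axis as drawn, which is where the counts sit.

```r
library(ggplot2)

# Toy counts per violation (illustrative only)
tickets_agg <- data.frame(
  violation_desc = c("METER EXPIRED", "OVER TIME LIMIT", "Other"),
  count = c(460, 180, 300)
)

p <- ggplot(tickets_agg, aes(x = violation_desc, y = count)) +
  geom_col(fill = "#E83536") +
  geom_label(aes(label = count), hjust = 1) +
  coord_flip() +
  labs(title = "Parking tickets by violation") +   # placeholder title
  theme_minimal() +
  theme(panel.grid = element_blank(),    # no grid lines
        axis.text.x = element_blank(),   # no count values on the axis
        axis.title = element_blank())    # no axis titles

p
```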

Putting all of this together, we end up with our final, Level 10 graph.

Level 10: 🔥🔥🔥

Putting it All Together

And here’s how our final block of code ends up looking:

CompassRed Data Blog

We live for data and analytics.

Thanks to Eugene Olkhov

