Brandyn Lilley
Jul 18, 2021

Scatterplots with labels in R

This article shows you how to format a chart so that it looks consistent every time, and how to add labels to specific data points.

The data

I’ve used the World Development Indicators dataset published by The World Bank (https://datacatalog.worldbank.org/dataset/world-development-indicators). It’s a truly marvellous dataset with so many indicators that you’ll spend most of your time puzzling over the endless plotting possibilities. Only your creativity limits the depth of insight you can gain from it.

Data wrangling

The dataset is in pretty good shape, and we’ll only need to do some minor wrangling to make it a little more malleable for our purposes.

Required packages

Start by importing these packages:

library(tidyverse) # A staple for data wrangling
library(janitor)   # Great for cleaning your column names, etc.
library(readxl)    # To import your dataset from *.xlsx
library(writexl)   # If you want to save your intermediate data frames
library(ggrepel)   # Great for labelling data in ggplot
library(bbplot)    # My favourite ggplot theme modifier from the BBC (installed from GitHub: bbc/bbplot)

The bbplot package is excellently explained here: https://bbc.github.io/rcookbook/#how_to_create_bbc_style_graphics

In short, it is a fantastic way to create beautiful BBC style graphics that consistently look good. In addition, I found it to be an outstanding shortcut to keep my charts looking standardised and professional for my presentations.

Import, clean and pivot to long-form

In the following code block, I import the raw data from an Excel file into a data frame named “raw_data” using the readxl package.

While the data is still in its wide form, I use the janitor package to clean the column names. This eliminates all the funny characters and spaces that usually drive us nuts when trying to reference anything in our code.

I then pivot my data frame into long-form by moving all the years into their own column named “year” and the associated values into a column I creatively named “value”.

You’ll notice that each of my years was prefixed by an annoying “x”, so I got rid of it by using “substr” to only select characters 2 to 5. So, for example, “x1969” becomes “1969”.

I converted the “year” column to a factor, but that was probably unnecessary. I don’t think anything turns on it for this article.
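The factor only starts to matter if you later need the years as numbers. As a small base-R sketch (the values below are made up to mimic the cleaned "year" column), calling as.integer directly on a factor returns its internal level codes rather than the years, so the conversion needs to go through as.character first.

```r
# Illustrative values only, mimicking the cleaned-up "year" column
year <- as.factor(c("1969", "2019", "2020"))

codes <- as.integer(year)                # internal level codes, not the years
years <- as.integer(as.character(year))  # the actual years as integers

print(codes)
print(years)
```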

# Import raw data
raw_data <- readxl::read_xlsx("c:/workspace/repos/world_bank_data/inputs/world_bank_data.xlsx", sheet = "Data")

# Clean data headings
wide_data <- janitor::clean_names(raw_data)

# Create long form dataframe
long_data <- wide_data %>%
  pivot_longer(cols = x1960:x2020,
               names_to = "year",
               values_to = "value") %>%
  mutate(year = substr(year, start = 2, stop = 5)) %>%
  mutate(year = as.factor(year))
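An alternative worth knowing about: pivot_longer has a names_prefix argument that strips a prefix while pivoting, which would replace the substr step. Here is a minimal sketch on a made-up two-country table. The column names mimic the cleaned World Bank headings, but the table and its values are invented for illustration.

```r
library(tidyr)
library(tibble)

# Made-up miniature of the wide data; the values are invented
toy_wide <- tibble(
  country_name = c("A", "B"),
  x2018 = c(1.5, 2.5),
  x2019 = c(3.5, 4.5)
)

toy_long <- pivot_longer(
  toy_wide,
  cols = x2018:x2019,
  names_to = "year",
  names_prefix = "x",  # drops the leading "x" while pivoting
  values_to = "value"
)

print(toy_long)
```

The "year" column comes out as character here, so it can still be converted to a factor or a number afterwards.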

Back to wide form

To make my data frame more manageable, I home in on the specific subset of data that I want to plot. We always have our “long_data” data frame to create further subsets if we wish to perform further analysis.

In this case, I’m puzzled by the high debt burdens that some countries face, so I focus on interest payments as a percentage of revenue and interest payments as a percentage of expense. I assumed there would be a strong correlation and was curious about which countries bear the heaviest burden. I also narrowed my results down to 2019 since a large portion of the 2020 data is missing.

interest_df <- long_data %>%
  select(!indicator_code) %>%
  filter(indicator_name == "Interest payments (% of revenue)" |
           indicator_name == "Interest payments (% of expense)") %>%
  filter(year == "2019")
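If the list of indicators grows, the chain of "|" comparisons gets unwieldy. One way to write the same kind of filter is with %in%, sketched here on a few made-up rows shaped like “long_data” (the rows, the “wanted” vector and the values are all invented for illustration).

```r
library(dplyr)
library(tibble)

# Made-up rows in the same shape as long_data; the values are invented
toy_long <- tibble(
  country_name   = c("A", "A", "A", "B"),
  indicator_name = c("Interest payments (% of revenue)",
                     "Interest payments (% of expense)",
                     "Population, total",
                     "Interest payments (% of revenue)"),
  year           = c("2019", "2019", "2019", "2018"),
  value          = c(10, 12, 1000, 8)
)

# Hypothetical helper vector holding the indicators to keep
wanted <- c("Interest payments (% of revenue)",
            "Interest payments (% of expense)")

# Conditions separated by a comma inside filter() are combined with AND
toy_subset <- toy_long %>%
  filter(indicator_name %in% wanted, year == "2019")

print(toy_subset)
```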

Since we want to chart a scatterplot, we’ll need values for both the “x” and “y” arguments for the aes mapping in ggplot. So I find it easiest at this stage to pivot my data back to wide form so I can have each variable in its own column.

interest_wide <- interest_df %>%
  pivot_wider(names_from = indicator_name, values_from = value) %>%
  janitor::clean_names()
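To see what this step does to the column names, here is a minimal sketch on made-up data (the countries and values are invented). After the pivot, each indicator becomes its own column, and clean_names turns the indicator labels into the snake_case names that the plotting code refers to.

```r
library(dplyr)
library(tidyr)
library(tibble)
library(janitor)

# Made-up long-form rows; the values are invented
toy_df <- tibble(
  country_name   = c("A", "A", "B", "B"),
  indicator_name = rep(c("Interest payments (% of revenue)",
                         "Interest payments (% of expense)"), times = 2),
  value          = c(10, 12, 20, 25)
)

# One row per country, one column per indicator, then tidy the headings
toy_wide <- toy_df %>%
  pivot_wider(names_from = indicator_name, values_from = value) %>%
  clean_names()

print(names(toy_wide))
```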

Get plotting

I use geom_text_repel to plot the labels from the “country_name” column, and it does a pretty great job. However, ggrepel might not display all the values because of too many overlaps, which is unfortunate if there are a few specific data points that we really want to show in the chart.

I deal with this by first letting ggrepel have its best stab at plotting as many labels as it can in the line “geom_text_repel(aes(label = country_name), size = 3)”.

To get my specific labels, I create a character vector called “countrylist” containing the names of the countries that I want to show: in this case, South Africa and Egypt.

I then duplicate my wide form data frame and create a column named “plotname”. This “plotname” column is initially identical to the “country_name” column. I then use an “ifelse” to go back over the “plotname” column and check whether each value is in the “countrylist”. If it is, the country name stays; if it’s not, the name is replaced with an empty string. In this case, I end up with a “plotname” column that only contains the words “South Africa” and “Egypt”.
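The “ifelse” logic is easy to check on its own in base R. In this small sketch, “Country A” and “Country B” are placeholder names added for illustration.

```r
# Placeholder country names alongside the two I want to keep
country_name <- c("South Africa", "Egypt", "Country A", "Country B")
countrylist  <- c("South Africa", "Egypt")

# Keep the name if it is in countrylist, otherwise use an empty string
plotname <- ifelse(country_name %in% countrylist, country_name, "")

print(plotname)
```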

“bbc_style()” and “finalise_plot” do all the hard work formatting the chart theme, and there’s a great explanation of how it all works at https://bbc.github.io/rcookbook/#how_to_create_bbc_style_graphics.

# Specific labels I want to plot
countrylist <- c("South Africa", "Egypt")

interest_wide2 <- interest_wide %>%
  mutate(plotname = country_name) %>%
  mutate(plotname = ifelse(plotname %in% countrylist, plotname, ""))

# Plot correlation
interest_plot <- ggplot(interest_wide2,
                        aes(x = interest_payments_percent_of_expense,
                            y = interest_payments_percent_of_revenue))

# geom_point() alone: adding geom_jitter() as well would draw every point
# a second time at a random offset, away from its label
interest_plot2 <- interest_plot +
  geom_point()

# First layer: ggrepel labels as many countries as it can
# Second layer: max.overlaps = Inf stops the chosen labels from being dropped
interest_plot3 <- interest_plot2 +
  geom_text_repel(aes(label = country_name), size = 3) +
  geom_text_repel(aes(label = plotname), size = 3, max.overlaps = Inf)

interest_plot4 <- interest_plot3 +
  bbc_style() +
  theme(axis.title.x = element_text(size = 12),
        axis.title.y = element_text(size = 12)) +
  labs(title = "Crushing debt burden",
       subtitle = "Data for 2019",
       x = "Interest payments\n(% of expenses)",
       y = "Interest payments\n(% of revenue)")

finalise_plot(plot_name = interest_plot4,
              source_name = "Source: World Bank",
              save_filepath = "c:/workspace/repos/world_bank_data/outputs/figures/interest_burden.png",
              width_pixels = 640,
              height_pixels = 450)
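If a chosen label still goes missing on a crowded chart, ggrepel’s max.overlaps argument is the setting to look at. The following is a minimal sketch on invented data, using only ggplot2 and ggrepel (no bbplot), with a single label layer that carries just the chosen names. Rows with an empty-string label get no text, but ggrepel still keeps the other labels away from those points.

```r
library(ggplot2)
library(ggrepel)

set.seed(42)  # ggrepel's label placement involves randomness

# Invented points; "A", "B" and "C" are placeholder countries
toy <- data.frame(
  x = c(5, 10, 15, 20, 25),
  y = c(6, 11, 14, 22, 27),
  country_name = c("A", "South Africa", "B", "Egypt", "C")
)

# Only the chosen countries keep a label
toy$plotname <- ifelse(toy$country_name %in% c("South Africa", "Egypt"),
                       toy$country_name, "")

toy_plot <- ggplot(toy, aes(x = x, y = y)) +
  geom_point() +
  # max.overlaps = Inf asks ggrepel never to drop a label for overlapping
  geom_text_repel(aes(label = plotname), size = 3, max.overlaps = Inf)

print(toy_plot)
```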

Conclusion

Once you have the look and feel of your chart just the way you like it, you’ll find it pretty easy to replicate with some copy-pasting. The bbplot package has been a real game-changer in helping me produce good-looking graphics.