Maximizing the Power of Scatter Plots in R with ggplot2 — part 6 of “R for Applied Economics” guide

Dima Diachkov
11 min read · Feb 4, 2023

Today we are going to dive into scatter plots in their various forms. This type of visualization is one of the most important, especially for comparing a set of objects that share some features and differ in others. It is also one of the best aids for multidimensional data exploration and clustering. So if you want to succeed in economic analysis in R, you have to master scatter plots…

I think that this task is appointed for you, Frodo; and that if you do not find a way, no one will — Elrond

Elrond by MidJourney AI

The Power of Scatter Plots in ggplot2

Scatter plots are a crucial component of exploratory data analysis, and ggplot2, a well-known data visualization library in the R programming language, provides a comprehensive platform for creating visually appealing and insightful scatter plots. In this post, we will delve into the world of scatter plots, exploring why they are so universal and effective, as well as how to use ggplot2 to create stunning scatter plots for macroeconomic variables.

As you may already know, ggplot2 is a popular data visualization library for R that provides an elegant and flexible way to create complex plots. We covered its basic functions earlier (part 2 of the guide). It is built on the Grammar of Graphics (part 3 of the guide), a systematic approach to plotting data that lets users layer multiple elements and customize the plot.

Why Scatter Plots?

Scatter plots are one of the most versatile and effective types of plots for visualizing the relationship between two variables. They are especially useful when dealing with continuous data, as they can easily reveal patterns, trends, and outliers. Moreover, scatter plots are a simple but powerful tool for exploratory data analysis, helping us to quickly identify relationships between variables.

Scatter Plots in Economics

When it comes to macroeconomic variables, scatter plots can be used to visualize relationships between, for example, inflation and economic growth, interest rates and exchange rates, or GDP and unemployment. Furthermore, scatter plots provide a quick and straightforward way to identify relationships between variables, making them an essential tool in exploratory data analysis. They are literally everywhere.

These are just a few examples I found on Google; both come with full source code that is ready to use in your own projects.

Source: http://rafalab.dfci.harvard.edu/dsbook/ggplot2.html
Source: http://r-statistics.co/Top50-Ggplot2-Visualizations-MasterList-R-Code.html

They may be static or dynamic, and everything is customizable. You will enjoy using them as soon as you master the basics.

With ggplot2 (plus a few supporting packages) we can easily create scatter plots with different markers, colors, and labels, and add regression lines, smooth curves, and other features to better illustrate the relationship between the variables.
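As a quick preview of the regression lines and smooth curves mentioned above, here is a minimal sketch on the built-in mtcars dataset (the basics are covered step by step further below). The object name p_fit and the colors are my own choices for illustration, not something prescribed by ggplot2.

```r
library(ggplot2)

# Scatter plot of the built-in mtcars data with two fitted layers on top
p_fit <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  # Straight least-squares line, drawn with its confidence band
  geom_smooth(method = "lm", formula = y ~ x, se = TRUE, color = "red") +
  # Flexible LOESS curve, drawn without a band to limit clutter
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE, color = "blue")

print(p_fit)
```

Comparing the straight line with the LOESS curve is a cheap way to check whether a linear relationship is a reasonable description of the data.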

Quick tips for creating effective scatter plots in ggplot2

  • Choose appropriate scales for both the x and y axes, so that the patterns in the data are clearly visible
  • Use meaningful labels and titles, and consider adding a caption to provide additional context
  • Use color and shape to represent additional variables or groupings
  • Avoid overloading the plot with too many elements, and consider using facets or small multiples to compare multiple scatter plots side-by-side
  • To enhance scatter plots in ggplot2, we can use packages such as ggthemes (additional themes), ggrepel (text labels that avoid overlapping), and scales (formatting of axis and legend labels). These packages let us create more compelling and informative plots.
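The tips above can be sketched in a few lines of code. This is only an illustration on the built-in mtcars dataset; the title, caption text, and the object name p_tips are assumptions I made up for the example.

```r
library(ggplot2)

p_tips <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point(size = 2) +
  # Scales: pick axis breaks that keep the pattern readable
  scale_x_continuous(breaks = seq(1, 6, by = 1)) +
  # Labels: meaningful title, axis names, legend name, and a caption
  labs(
    title = "Heavier cars travel fewer miles per gallon",
    x = "Weight (1000 lbs)",
    y = "Miles per gallon",
    color = "Cylinders",
    caption = "Data: built-in mtcars dataset"
  ) +
  # Small multiples: one panel per number of gears instead of one crowded plot
  facet_wrap(~ gear) +
  theme_minimal()

print(p_tips)
```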

How to make your first ggplot2 scatter plot

Plain and simple. Here’s an example of a basic ggplot2 scatter plot built on a test dataset for illustrative purposes: mtcars, which ships with base R (in the datasets package), so nothing has to be downloaded.

# chunk 1
library(ggplot2)

# Load data
data(mtcars)

# Create scatter plot with 2 dimensions
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  xlab("Weight (1000 lbs)") +
  ylab("Miles per Gallon")
Output for the code above

Simple, but already shows a story. But let’s make the story more visible, more explicit.

# chunk 2

# Create a ggplot object using the mtcars data set
ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl), size = hp, shape = factor(am))) +
  # Add the geom_point layer to the plot, which will create the scatter plot
  geom_point() +
  # Label the x-axis as "Weight (1000 lbs)"
  xlab("Weight (1000 lbs)") +
  # Label the y-axis as "Miles per Gallon"
  ylab("Miles per Gallon") +
  # Add a discrete color scale, labeling the scale as "Number of Cylinders"
  scale_color_discrete(name = "Number of Cylinders") +
  # Add a continuous size scale, labeling the scale as "Horsepower"
  scale_size_continuous(name = "Horsepower") +
  # Add a discrete shape scale, labeling the scale as "Transmission"
  scale_shape_discrete(name = "Transmission") +
  # Apply the classic ggplot2 theme to the plot
  theme_classic()
Output for the code above

Just to clarify: we’ve added color to represent the number of cylinders in the cars, size to represent the horsepower of the cars, and shape to represent the type of transmission (automatic or manual). We’ve also added labels for the color, size, and shape scales, and applied the theme_classic style for a clean and easy-to-read visual.
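One small caveat: with factor(am) the transmission legend shows the raw codes 0 and 1. In mtcars, 0 stands for automatic and 1 for manual, so a possible refinement is to recode the variable before plotting. The sketch below is one way to do it; the names cars_labeled and p_labeled are hypothetical helpers introduced only for this example.

```r
library(ggplot2)

# Copy the data and attach readable labels (in mtcars: am 0 = automatic, 1 = manual)
cars_labeled <- mtcars
cars_labeled$transmission <- factor(cars_labeled$am,
                                    levels = c(0, 1),
                                    labels = c("Automatic", "Manual"))
cars_labeled$cylinders <- factor(cars_labeled$cyl)

p_labeled <- ggplot(cars_labeled,
                    aes(x = wt, y = mpg, color = cylinders,
                        size = hp, shape = transmission)) +
  geom_point() +
  # labs() names the axes and all three legends in one call
  labs(x = "Weight (1000 lbs)", y = "Miles per Gallon",
       color = "Number of Cylinders", size = "Horsepower",
       shape = "Transmission") +
  theme_classic()

print(p_labeled)
```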

It is neat and descriptive. What patterns do you see here? Right, several patterns at once, which we can hide or emphasize depending on the scope of our analysis. Now let’s move on to the real-world data that we collected earlier (inflation and unemployment by country).

Hands-on session: plotting multidimensional real-time macroeconomic data with ggplot2 scatter plots

Data prep

We already know how to write user-defined functions (recap: part 4), and we built some pretty impressive web-parsing code earlier. I have modified it a little further, since for now we will stick to the website “Trading Economics”. Everything we do in this guide series is interconnected, so I will keep announcing upcoming objectives and referencing previously prepared scripts and concepts.

First, we have to parse the data (recap: part 1) and join it (recap: part 5).

# chunk 3
parse_tradecon_table <- function(link = "", table_number = 1, indicator_name = "value")
{
  # we load the packages inside the function so they are attached automatically every time you use it
  library(rvest)
  library(dplyr)

  # here we check that the provided link is a non-empty string
  if (link == "") {
    stop("No URL provided")
  }

  # then we try to parse the URL; if any step fails, we stop the function with a readable error message
  parsed_data <- tryCatch(read_html(link),
    error = function(e) stop("Something went wrong...Please, check the link you provided.", call. = FALSE))
  parsed_table <- tryCatch(html_table(parsed_data),
    error = function(e) stop("Something went wrong...Seems like there are no tables available.", call. = FALSE))
  df <- tryCatch(as.data.frame(parsed_table[[table_number]]),
    error = function(e) stop(paste0("Something went wrong...Seems like the link does not have table number ", table_number, " or any tables at all"), call. = FALSE))

  # keep only the country name and the latest value, renaming "Last" to the indicator name
  output_df <- df %>%
    select(Country, Last) %>%
    rename(!!indicator_name := Last, country = Country)

  return(output_df)
}

# Extraction stage
infl_df <- parse_tradecon_table("https://tradingeconomics.com/country-list/inflation-rate?continent=europe", indicator_name = "inflation")
unemp_df <- parse_tradecon_table("https://tradingeconomics.com/country-list/unemployment-rate?continent=europe", indicator_name = "unemployment")
gdp_df <- parse_tradecon_table("https://tradingeconomics.com/country-list/gdp-annual-growth-rate?continent=europe", indicator_name = "gdp_growth")

# Transformation stage - in our case this is a merge by country
merged_df <- infl_df %>%
  full_join(unemp_df, by = "country") %>%
  full_join(gdp_df, by = "country")

# the 27 EU member states
european_union <- c("Austria","Belgium","Bulgaria","Croatia","Cyprus",
                    "Czech Republic","Denmark","Estonia","Finland","France",
                    "Germany","Greece","Hungary","Ireland","Italy","Latvia",
                    "Lithuania","Luxembourg","Malta","Netherlands","Poland",
                    "Portugal","Romania","Slovakia","Slovenia","Spain",
                    "Sweden")

merged_df$eu_country <- factor(ifelse(merged_df$country %in% european_union, "EU-countries", "Other countries"))

I would like to clarify what is going on here and highlight some differences from the script we made earlier. This new R code defines a function called parse_tradecon_table that takes three arguments: link, table_number, and indicator_name. The function downloads the page at the provided link, parses it, and returns a data frame with the selected columns. Since we are focused on Trading Economics tables, the function is designed to work only with this website’s indicator pages with regional breakdowns, which saves us some time during the transformation stage that follows.

Here is a step-by-step explanation of the code:

  1. Load required libraries: The rvest and dplyr libraries are loaded for scraping the data and manipulating the data frame, respectively.
  2. Check the provided link: If the provided link is an empty string, the function stops with an error message “No URL provided”.
  3. Parse the URL: The read_html function is used to parse the URL, and the function stops with an error message if something goes wrong.
  4. Extract tables: The html_table function is used to extract all tables from the parsed page, and the function stops with an error message if that fails.
  5. Extract relevant data: The selected table is converted with the as.data.frame function, and the function stops with an error message if the selected table number does not exist.
  6. Select and rename columns: Only the Country and Last columns are kept with select and then renamed with rename; the steps are chained with the %>% pipe. The !! operator unquotes indicator_name, so the string stored in that variable (rather than the literal name indicator_name) becomes the new column name, and := is used in place of = because the left-hand side is an expression. You may not have met this operator before, as it is not widely known (I should dedicate a special issue of this guide to hidden or lesser-known features of R).
  7. Return the result: Finally, we return the clean, simplified data frame (in one of our next classes we will look into loops and how to make your code iterate over a bunch of economic websites and collect and merge the data into one data frame).
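To see these steps in isolation without hitting the website, here is a small offline sketch. It assumes a table with the same Country and Last columns that our function expects; the HTML snippet and the numbers in it are made up purely for illustration, and minimal_html is the rvest helper for building a tiny page from a string.

```r
library(rvest)
library(dplyr)

# A made-up page with one table (illustrative values, not real statistics)
page <- minimal_html('
  <table>
    <tr><th>Country</th><th>Last</th><th>Previous</th></tr>
    <tr><td>Austria</td><td>11.2</td><td>10.2</td></tr>
    <tr><td>Belgium</td><td>8.05</td><td>10.35</td></tr>
  </table>')

# html_table() returns a list with one element per table on the page
tables <- html_table(page)
df <- as.data.frame(tables[[1]])

# The new column name lives in a variable, so we unquote it with !! and assign with :=
indicator_name <- "inflation"
out <- df %>%
  select(Country, Last) %>%
  rename(!!indicator_name := Last, country = Country)

print(out)

# Asking for a table that does not exist raises an error, which tryCatch() can intercept
missing_table <- tryCatch(tables[[5]], error = function(e) "no such table")
print(missing_table)
```

Without !!, rename would create a column literally called indicator_name, which is exactly what we want to avoid here.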

After defining the function, the code uses it to scrape data from three different links, one for each macroeconomic variable: inflation rate, unemployment rate, and annual GDP growth rate. The resulting data frames are then joined by the country column. The last step attaches an attribute that marks whether each country is an EU member.
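The join and the EU flag are easier to follow on toy data. The sketch below uses made-up numbers and a deliberately shortened membership list (both are assumptions for illustration only) to show what full_join does with countries that appear in only one table.

```r
library(dplyr)

# Toy inputs with illustrative values; note that the country sets differ
infl  <- data.frame(country = c("Austria", "Turkey"), inflation = c(11.2, 57.7))
unemp <- data.frame(country = c("Austria", "Norway"), unemployment = c(7.5, 3.2))

# full_join keeps every country from both tables and fills the gaps with NA
merged <- full_join(infl, unemp, by = "country")

# Shortened membership list, just for the demo
eu <- c("Austria")
merged$eu_country <- factor(ifelse(merged$country %in% eu,
                                   "EU-countries", "Other countries"))

print(merged)
```

Rows with NA in a plotted variable are dropped by ggplot2 with a warning, so a country missing from one of the source tables will simply not appear in the scatter plot.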

Now let’s plot what we have with a few colors.

# chunk 4
# Load necessary packages
library(ggrepel)

# Plot data with ggplot2
ggplot(merged_df, aes(x = inflation, y = gdp_growth, color = eu_country, size = (unemployment^2))) +
  geom_point() +
  geom_text_repel(aes(label = substr(country, 1, 3)), size = 3) +
  scale_color_manual(name = "EU Status", values = c("EU-countries" = "red", "Other countries" = "black")) +
  # guide = "none" hides the size legend (guide = FALSE is deprecated in recent ggplot2 versions)
  scale_size_continuous(name = "Unemployment", range = c(2, 7), guide = "none") +
  xlab("Inflation rate") +
  ylab("GDP growth rate") +
  ggtitle("Scatter plot of EU and non-EU countries by Inflation vs GDP Growth in Europe (size by Unemployment)") +
  theme(legend.title = element_text(size = 12, face = "bold"),
        legend.text = element_text(size = 10),
        plot.title = element_text(size = 14, face = "bold", hjust = 0.5),
        axis.title = element_text(size = 12, face = "bold"))
Output for the code above

What do we see in our story? We have a relatively dense cluster of European countries growing at a positive rate with relatively low inflation. However, some Eastern European countries form a subcluster with higher inflation. Countries outside the EU are far more dispersed, for many reasons. For example, the non-standard approach to the monetary system in Turkey, the consequences of the war for Ukraine, and an unusual slowdown of Liechtenstein’s economy made these countries deviate from the average performance of their peers. There is a lot going on here, but we will explore it later.

Now I want to show you one more thing: facets, which are extremely useful. They will help you a lot when exploring huge, multidimensional datasets that can be grouped by some attribute.

The facet_wrap function in ggplot2 is used to display multiple small plots, known as facets, in a single figure. Each plot is created for a unique combination of one or more categorical variables. In other words, it allows you to display the same plot for different subsets of the data, based on the values of one or more categorical variables. The facet_wrap function takes the facets argument, which is a formula that specifies the categorical variable(s) to use for splitting the data into facets. The resulting facets are laid out in a grid, with each panel showing the data for one value (or combination of values) of the categorical variable(s).

By the way, you can control the layout of the facets using the nrow and ncol arguments in the facet_wrap function.
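Here is a minimal sketch of how nrow and ncol change the layout, again on the built-in mtcars dataset. The object names (p_base, p_one_row, and so on) are hypothetical and chosen only for this illustration.

```r
library(ggplot2)

p_base <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# One panel per number of cylinders, all in a single row
p_one_row <- p_base + facet_wrap(~ cyl, nrow = 1)

# The same panels stacked in a single column
p_one_col <- p_base + facet_wrap(~ cyl, ncol = 1)

# Two variables in the formula: one panel per cylinder/transmission combination
p_two_vars <- p_base + facet_wrap(~ cyl + am, ncol = 3)

print(p_one_row)
print(p_one_col)
print(p_two_vars)
```

A single row works well when you compare values along the y-axis, while a single column makes it easier to compare positions along the x-axis.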

Okay, let’s roll.

# chunk 5 - facet demo
# Plot data with ggplot2
ggplot(merged_df, aes(x = inflation, y = gdp_growth, color = unemployment)) +
  geom_point() +
  geom_text_repel(aes(label = substr(country, 1, 3)), size = 3) +
  scale_color_continuous(name = "Unemployment rate, %", low = "blue", high = "red") +
  xlab("Inflation rate") +
  ylab("GDP growth rate") +
  ggtitle("Scatter plot of EU and non-EU countries by Inflation, GDP Growth, and Unemployment in Europe") +
  theme(legend.title = element_text(size = 12, face = "bold"),
        legend.text = element_text(size = 10),
        plot.title = element_text(size = 14, face = "bold", hjust = 0.5),
        axis.title = element_text(size = 12, face = "bold")) +
  facet_wrap(~eu_country, ncol = 2)
Output for the code above

The resulting scatter plot is very informative and easy to read. One of our main allies here is the scale_color_continuous function, which creates a color scale based on the unemployment rate. The xlab and ylab functions label the x-axis and y-axis, respectively.

By the way, the plot is given a title using the ggtitle function, and the overall appearance of the plot is formatted using the theme function. The legend title and legend text are styled (font size and weight) using the legend.title and legend.text arguments, respectively. The axis titles get a bold, larger font via the axis.title argument.

The overall result is a scatter plot that shows the relationship between the inflation rate, GDP growth rate, and unemployment rate across countries, with the dots colored according to the unemployment rate and the plot split into two facets by EU/non-EU status. This is probably a good point to put our pencils down. But do not worry, plenty of functions are still waiting to be discovered.

As usual, the FULL code is available at the designated Github repo for your convenience: https://raw.githubusercontent.com/TheLordOfTheR/R_for_Applied_Economics/main/Part6.R

Conclusion

In conclusion, scatter plots are an essential tool for macroeconomic analysis, and ggplot2 provides a flexible and user-friendly platform for creating them. By combining ggplot2 with packages such as ggrepel, we can take our scatter plots to the next level and communicate the relationships between macroeconomic variables more effectively. This part of the guide was relatively basic, just to show you the standard functionality plus a couple of tricks. But let’s stay in touch, more guides are coming!

Please clap 👏 and subscribe if you want to support me. Thanks!❤️‍🔥

Dima Diachkov

Balancing passion with reason. In pursuit of better decision making in economic analysis and finance with data science via R+Python