How to use dplyr to join datasets in R — part 5 of “R for Applied Economics” guide

Dima Diachkov
6 min readJan 31, 2023

--

In this series of posts, we have been exploring the power of R for applied economics. Today, we will be taking a deeper dive into dplyr package.

As we have previously discussed, R is a powerful tool for data manipulation and visualization. One of the key advantages of R is its ability to handle multidimensional data with ease. This makes it an ideal choice for analyzing and plotting economic data.

But how do we map datasets to each other and merge them? How to do it effectively and transparently in R?

So, today we join to practice joins😉

I would rather share one lifetime with you than face all the Ages of this world alone.” — Arwen

Arwen and Aragorn by MidJourney AI

What is dplyr?

dplyr is a powerful and widely used package in R for data manipulation. It provides a set of easy-to-use and efficient tools for working with dataframes, including functions for filtering, grouping, and summarizing data. Some of the key features of dplyr include its ability to handle large datasets, its consistent and intuitive syntax, and its flexibility in working with different data types.

Most popular functions of dplyr

Here are some of the most popular functions of dplyr:

  1. filter(): filters rows of a dataframe based on a specified condition.
  2. arrange(): sorts rows of a dataframe based on one or more columns.
  3. select(): selects specific columns of a dataframe.
  4. group_by(): groups data into subsets based on one or more columns.
  5. summarize(): computes summary statistics for grouped data.
  6. … and joins. There are many of them, but I will carefully guide you through this dark forest.

Using dplyr to join dataframes

In R, data can often be stored in multiple dataframes, and it’s common for economic researched or analyst to combine these dataframes into a single dataframe for analysis. This process is known as “joining” dataframes, and dplyr makes it super simple to do so.

Here are examples of how to perform different types of joins in dplyr:

Left Join

df_left_join <- left_join(df1, df2, by = c("column_name1" = "column_name2"))

In a left join, all the rows from the left dataframe (df1) are kept and the matching rows from the right dataframe (df2) are included. If there is no match in the right dataframe, the corresponding row in the result will contain NA values for columns from df2.

Right Join

df_right_join <- right_join(df1, df2, by = c("column_name1" = "column_name2"))

In a right join, all the rows from the right dataframe (df2) are kept and the matching rows from the left dataframe (df1) are included. If there is no match in the left dataframe, the corresponding row in the result will contain NA values for columns from df1.

Full Join

df_full_join <- full_join(df1, df2, by = c("column_name1" = "column_name2"))

In a full join, all rows from both dataframes are included in the result, and if there is no match in either dataframe, the corresponding row in the result will contain NA values for columns from the non-matching dataframe.

Inner Join

library(dplyr)
df_inner_join <- inner_join(df1, df2, by = c("column_name1" = "column_name2"))

In an inner join, only the rows that match in both dataframes are included in the result.

Visually, all these join types can be represented as follows:

Credits: Hadley Wickham | R for Data Science

Please let me know in the comments if you need to know how to do joins for a set of keys (not just one column) OR any other related problem. So far, I hope, you might have noticed that ALL by arguments are the same. Please go back and check it (or just trust me).

So basically, in many situations we just have to pick the right type of join and mapping of keys is usually a straightforward task. In economical research, you must be aware of the keys that you need to use. For example, in our research activity, we often merge data by country key (very aggregated level), sometimes by client ID (with highly detailed information), and sometimes by transaction key (with extremely granular datasets). The type of join will help you to achieve what is needed in your research.

So basically what you need to know is:

  1. Left Join: To keep all the rows from one dataframe and match the corresponding rows from the other.
  2. Right Join: To keep all the rows from the second dataframe and match the corresponding rows from the first.
  3. Full Join: To keep all rows from both dataframes, creating missing values where there are no matches.
  4. Inner Join: To only keep the rows that have matching values in both dataframes.

Practice: joining of datasets on inflation and unemployment by country

Now we will use dplyr to join datasets on inflation and unemployment in the EU, which we parsed from the website during the last class (part 4).
Please examine this piece of code.

# chunk 1
library(dplyr)

# chunk 2
# local functions (here we will create a generic functions)
parse_web_table <- function(link = "", table_number = 1)
{
# we call the package inside of the function so it is called automatically every time you use the function
library(rvest)

# here we check the provided link for being non-empty string
if(link == "")
{stop("No URL provided")}

# then we try to parse the URL, but if it fails - we print error message and stop function
try(parsed_data <- read_html(link), stop("Something went wrong...Please, check the link you provided."))
try(parsed_table <- html_table(parsed_data), stop("Something went wrong...Seems like there are no tables available."))
try(df <- as.data.frame(parsed_table[[table_number]]), stop(paste0("Something went wrong...Seems like the link does not have table number ",table_number, " or any tables at all")))

return(df)
}

# chunk 3
# let's call the function with the desired link
infl_df <- parse_web_table("https://tradingeconomics.com/country-list/inflation-rate?continent=europe") %>%
select(Country, Last) %>%
rename(infl = Last)

# let's call the function with another desired link
unemp_df <- parse_web_table("https://tradingeconomics.com/country-list/unemployment-rate?continent=europe") %>%
select(Country, Last) %>%
rename(unempl = Last)

The code calls our parse_web_table function with two links and processes the resulting data frames by selecting specific columns and renaming columns. The processed data frames are stored in the infl_df and unemp_df objects. We use dplyr right away, did you notice? select() and rename() . These functions quickly filtered columns and then we renamed them.
Let’s have a look at the content (just run glimpse() on both dataframes)

The content of two dataframes

We see keys and values. Good. Let’s merge them.

Our case is standard. We have two identical (= same keys) datasets and we just merge them by the key (Country) with inner_join(). By the way, if key name is identical in both dataframes, you just write it once! Convenient, isn’t it?

# chunk 4
join_df <- infl_df %>%
inner_join(unemp_df, by = "Country")

# always check the result
join_df %>% glimpse()
Output for the code above

Alright, we see that all countries have now two values: inflation and unemployment.

Now let’s finish this task and draw something simple to make sure that it works. I create a variable with a list of EU countries, then I check every country based on this variable and form attribute eu_country in the dataset.

#chunk 5
european_union <- c("Austria","Belgium","Bulgaria","Croatia","Cyprus",
"Czech Republic","Denmark","Estonia","Finland","France",
"Germany","Greece","Hungary","Ireland","Italy","Latvia",
"Lithuania","Luxembourg","Malta","Netherlands","Poland",
"Portugal","Romania","Slovakia","Slovenia","Spain",
"Sweden","United Kingdom")

join_df$eu_country <- factor(ifelse(join_df$Country %in% european_union, "EU-countries", "Other countries"))

plot(join_df$infl, join_df$unempl, col = join_df$eu_country)
legend("topleft", legend = levels(join_df$eu_country), pch = 19, col = factor(levels(join_df$eu_country)))
The output for the code above

Awesome! Country groups are colored and joined dataframe sparkles with new colors (wordplays are my weakness) too. Our merger was executed and hence we will be able to create more sophisticated solutions for visualization, modeling, and other applications.

As usual, the FULL code is available at the designated Github repo for your convenience: https://raw.githubusercontent.com/TheLordOfTheR/R_for_Applied_Economics/main/Part5.R

Conclusion

Overall, the dplyr package is a powerful tool for data manipulation in R. Whether you need to join dataframes, compute summary statistics, or create new columns, dplyr has you covered. Give it a try and see how it can simplify your data analysis process! Today you have mastered the various ..._join() types, but other functions are yet to be learned. We will explore dplyr deeper very soon. Stay tuned!

Please clap 👏 and subscribe if you want to support me. Thanks!❤️‍🔥

--

--

Dima Diachkov

Balancing passion with reason. In pursuit of better decision making in economic analysis and finance with data science via R+Python