How to Merge Multiple Data Sets in R With Binds and Joins
Let me share the glory of the full_join() function with you using the R language!
Have you ever been working with multiple tables of data in R and had trouble merging them into a single table? Did you go to Google to tell you to “just use rbind” or see four different ways of joining other people’s data that doesn’t make sense? Well, I have too and so have many of my students.
Here is a concise little guide to how to use rbind, several kinds of joins, and how to resolve common issues when using these functions!
For your convenience, I have made an R Markdown file (.rmd) available on my GitHub at this link if you would rather take a look through the notated code and run it cell by cell or clone it to your machine.
To get started, make a new R Markdown file in your RStudio, use a regular .R file, or use a cloud version of R. We’re not going to be too fancy here, so you should not have any issues. I am assuming that you can get an R environment working either in RStudio or in the cloud. Let me know if you are having problems in the responses at the end, and I will do what I can to help you troubleshoot the problem!
Let’s get started!
Objectives
1. Learn about binds and joins
2. Combine all data into a single table
3. Resolve all issues to eliminate error messages for the final table (NA’s are okay as long as the structure is right)
Load Libraries and Create the Data
The only library we need is tidyverse. It has many useful features and libraries, but this is just a good habit to have when it comes to being proficient with R. Tidyverse includes the dplyr package, which is where our joins are located. If you want to be a minimalist, just use the dplyr package rather than tidyverse, and you will do just fine.
Let’s get started with R code. Note if you are new: The “#” character is a comment meaning that anything following it will not be run by the computer. I used comments to explain much of the code simply with plain English.
# Tidyverse allows us to quickly and easily load many useful packages and functions with a single line of code# You can install packages and load the package by uncommenting the next two lines
# install.packages("tidyverse")
# library(tidyverse)# I like require better because it does the same thing in one line of code rather than two.
# Load tidyverse
require(tidyverse)
Next, let’s deal with creating the data. I am going to make three small but subtly different data frames starting with inputting the data into vectors by hand. You’ll see what I mean. Here’s the code:
Create the first data frame and check it comes out right
# Create dataset 1
price <- c(1, 4, 6, 7, 8)
item <- c("candy", "socks", "tea", "coffee", "pens")
retailer <- c("Meijer", "Wal-Mart", "Publix", "Tim Hortons", "Staples")# Combine vectors into data frame
df1 <- data.frame(price, item, retailer)# Print top rows
head(df1)
If you’re using an R Markdown file, the output should look like this. If you’re outputting to the console, the same results should appear there.

Create the second data frame and check it comes out right
# Create dataset 2
Retailer <- c("Best Buy", "Amazon", "Newegg", "eBay")
Item <- c("GPS", "SD Card Reader", "RTX 3090", "overpriced RTX 2080ti")
Price <- c(200, 20, 1499, 1100)# Combine vectors into data frame
df2 <- data.frame(Retailer, Item, Price)# Print top rows
head(df2)
Just like df1, the output here should look like:

Create the third data frame and check it comes out right
# Create dataset 3
items <- c("keyboard", "mouse", "Raspberry Pi 4", "books", "dry erase board", "anti-virus software")
Manufacturer <- c("Logitech", "Logitech", "Raspberry Pi Foundation", "Penguin Publishing", "Office Max", "Kaspersky Labs")
price <- c(30, 40, 55, 120, 20, 50)
cost_of_production <- c(10, 12, 40, 30, 4, 8)# Combine vectors into data frame
df3 <- data.frame(cost_of_production, price, items, Manufacturer)# Print top rows
head(df3)
Just like df1 and df2, output should look like:

So, there are intentional differences here. Before moving on, ask yourself what looks different between the three data frames. Keep track of these differences while we explore the next section where we get a bunch of common errors. Don’t worry, the answers will be given over time, but it will help you get better with R if you take a few moments now to really examine the code to this point. Good job on making it this far! Let’s keep going :)
Common Errors and Issues with Binds and Joins
Beware! There will be quite a few errors coming up. Don’t worry, I messed these things up a lot too when I was trying to get the hang of this when learning R!
rbind()
So, if you use Google to find how to combine multiple data sets in R, the rbind() function is bound to come up. For your convenience, here’s a link to the documentation.
So let’s just go for it, right? Try this code next:
# First, let's experiment with cbind and rbind. Often when the data requires merging different datasets together, this is the first thing many people try.# rbind Documentation https://www.rdocumentation.org/packages/SparkR/versions/2.4.6/topics/rbind# Try rbind first with just the first two data sets
rbind_test <- rbind(df1, df2)# Print top rows
head(rbind_test)# This gave us an error! Why? Let's try cbind next to see if that helps us avoid error messages
The output should look like this:

cbind()
So, what gives? Let’s try cbind() and see if we can just mash them together. Here is the link to the documentation for cbind().
Here’s the code:
# cbind Documentation: https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/cbind# cbind test with just the first two data sets
cbind_test <- cbind(df1, df2)# Print top rows
head(cbind_test)# Whoa! We got a different error that time? What gives? Why is R so particular about things?
Here’s the output:

So, rbind() said the names do not match and cbind() says we have different numbers of rows. What can we do? Let’s try using some kind of join function!
Joins
One of the qualities of R is that is somewhat of a mashup of other languages and concepts from other languages. There is a whole art and science of joining tables and pieces of tables together from the world of databases and SQL. We will not be getting into that here because I like my sanity, but that that’s the origin story.
Here is the documentation link to the different types of joins with the dplyr package. We will not be using all of them, but we’ll get through the more popular ones.
Here are some common issues back to back:
# Well, rbind and cbind do not work, right? They work, so we must be doing something wrong here. Let's see if we can use joins to get our data to cooperate# Documentation link for joins: https://dplyr.tidyverse.org/reference/join.html# Let's try different types of joins# Left Join
left_join_test <- left_join(df1, df2, by = "Price")
head(left_join_test)
# Come on! Another error? Why? Let's try different tables
Output:

Another way to do left_join():
# Maybe data frames 1 and 3 will go together
left_join_test_2 <- left_join(df1, df3, by = "price")
head(left_join_test_2)# Something happened without an error! But it doesn't look quite right :( Why are there two columns of NA's present?
Output:

Okay, so the left_join has a parameter called “by” where we set that equal to “price”. That gave us a result. Not the one we wanted, but a result nonetheless. Our problem surrounding combining these data sets are because of both the column names not being exactly the same for joins and not being the same length for binds.
So let’s try to get all the column names to be the same. There are a lot of ways to do this, but this one is the simplest for what most people need.
Rename Some Columns
We can do this a lot of ways, but this is the simplest illustration I could think to use. At the end of the day, we need columns that mean the same thing to have the exact same name.
Here’s the code:
# I am going to change df1's names to be the same as df2. It's definitely possible to go the other way, but I think this looks better.
# If you are using RStudio, you can click on the object name in the Environment tab or do "view(object)" to see the data# Change df1's names to match df2
names(df1)[1] <- "Price" # Column 1 is price, but we want it to be Price like df2
names(df1)[2] <- "Item" # same as Price
names(df1)[3] <- "Retailer" # same deal again# Check the names worked
head(df1)# There we go! Now df1 and df2 have the same names but in a different order.
Here’s the output:

Let’s revisit rbind()
Now that df1 and df2 have the same column names, let’s revisit our old friend rbind()!
Here’s the code:
# Let's try rbind again to try to merge df1 and df2. They have the same column names, so this should be great, right?rbind_test_2 <- rbind(df1, df2)# Check result
rbind_test_2# Look at that! We got our Price, Item, and Retailer columns combined!
# rbind will "bind" the rows from the second data set to the first. df2 "binds" to df1.
# Now let's see what happens with adding df3 into the equation
Here’s the output:

Incorporate df3
With a successful binding of df1 and df2, R automatically knows to bind the right columns together. Our next task is to get df3 to cooperate. Let’s take a look at some common errors after we rename the “Price” and “Item” columns to be in line with our previous schema.
Here’s the code:
# We know we need to change the column names, but df3 has different parameters. What's going to happen when we try to bind them together?# Let's change the names we know we need to change
names(df3)[2] <- "Price"
names(df3)[3] <- "Item"# use rbind
rbind_test_3 <- rbind(rbind_test_2, df3)# check data
rbind_test_3# Oh no! number of columns do not match. How can we resolve this?
# We could try taking out columns, but the data would not match up. We would only have two out of four columns be the same anyway. That doesn't really get us that far. We'll probably need some kind of join to accomplish this
Here’s the output:

The rbind() function does not want to cooperate because it is trying to put three columns with five, which it does not understand. We need a different tactic.
Try Inner Join Again
Inner join kind of worked last time. Maybe we can try that again? Let’s see!
Here’s the code:
# Inner Join
IJtest <- inner_join(rbind_test_2, df3)
IJtest # 0 observations of 5 variables? Why?# What if we swap which is the right and which is the left table?
IJtest2 <- inner_join(df3, rbind_test_2)
IJtest2 # Still 0 observations of 5 rows# When we inner join, we are looking for where ALL the rows in the first table have matching values in the second table. Since we do not have all matches, we are not actually joining anything. The only step that is completed is R figuing out we're trying to get 5 columns.
Here’s the output:

When we inner join, we are looking for where ALL the rows in the first table have matching values in the second table. Since we do not have all matches, we are not actually joining anything. The only step that is completed is R figuring out we’re trying to get 5 columns.
Dang. Let’s try something else.
Try Left Join
Left join should help us out and get the right number of columns. It merges by columns that overlap, so maybe this will get our result.
Here’s the code:
# Left Join
LJtest <- left_join(rbind_test_2, df3)
LJtest# Alright, well we got some of the data. However, the only thing we accomplished is adding the two extra columns to the picture. Left joining is still just looking for matches that aren't there
Here’s the output:

We only have 9 rows, when we should have 15. We’re getting closer!
Try Right Join
The right_join() function does a similar thing to left join, but it has a different order of operations and how the tables get combined. Let’s take a look.
Here’s the code:
# Right Join
RJtest <- right_join(rbind_test_2, df3)
RJtest# Right join is interesting because we get the five columns, but only the six rows of df3. This is because we return the rows from the second table (right table) and all rows from the first table (left table) that have matches. The only thing that matches these conditions is just the data in df3. Kind of weird, but that's how it works.
Here’s the output:

Not quite what we wanted, but we can’t be that far away from getting it!
Try Full Join
Using full_join() should combine all of our rows and columns the right way, at least in theory. Let’s test it out!
Here’s the code:
# Full Join
FJtest <- full_join(rbind_test_2, df3)
FJtest# Success! We got the right table. We have 5 columns with 15 rows with all the data. If you want a slightly more appealing looking table, we can switch the tables around in the full_join() function.
# According to the documentation: full_join will "return all rows and all columns from both x and y. Where there are not matching values, returns NA for the one missing." That is exactly what we did here. The next cell shows what we could have done from the beginning to simplify our efforts.
Here’s the output:


We did it! We got all three tables to cooperate with a full_join() function! Since I did this in an R Markdown file, it breaks up the tables like this, so I left it like this and took another screenshot. Depending how you do it, it may be a single table, broken up like mine, or printed to the console.
Another Way We Could Have Done This
Just like most problems in R, there is more than one way to solve the problem. Here is an alternative solution or two that can help you get a better understanding of the magic of the full_join() function!
Here’s the code:
# We could have done all the name changes at one time. I copy/pasted from above for consistency and simplicity.
# Change df1's names to match df2
names(df1)[1] <- "Price" # Column 1 is price, but we want it to be Price like df2
names(df1)[2] <- "Item" # same as Price
names(df1)[3] <- "Retailer" # same deal again# Change df3's names to match df2
names(df3)[2] <- "Price"
names(df3)[3] <- "Item"# rbind df1 and df2
FJintermediate <- full_join(df1, df2)
# RBfinal <- rbind(df1, df2) # gives the same result# Full Join both sets together
FJfinal <- full_join(df3, FJintermediate)
# FJfinal2 <- full_join(df3, RBfinal) # gives the same result# Check final data
FJfinal
# FJfinal2 # gives the same result# Note that there is a way to do the reassignment of column names within full_join(), but it is a little dense and harder to explicitly see what is happening with the code. The documentation shows how to do this.
Here’s the output:

Note: The last five rows are on the other table like we just saw above. This is long enough so I didn’t include it for brevity.
Conclusion
Whew! That was a bit of a process, but I wanted to thoroughly show with simple examples how using the full_join() function in R can be a real life saver. Will it solve everything? No. However, I hope that you can take what you learned here today to apply to your problems in the future.
Happy coding!