How to Merge Multiple Data Sets in R With Binds and Joins

Tyler Harris
Oct 14, 2020 · 12 min read

Let me share the glory of the full_join() function with you using the R language!

Image for post
Image for post
Photo by Martin W. Kirst on Unsplash

Have you ever been working with multiple tables of data in R and had trouble merging them into a single table? Did you go to Google to tell you to “just use rbind” or see four different ways of joining other people’s data that doesn’t make sense? Well, I have too and so have many of my students.

Here is a concise little guide to how to use rbind, several kinds of joins, and how to resolve common issues when using these functions!

For your convenience, I have made an R Markdown file (.rmd) available on my GitHub at this link if you would rather take a look through the notated code and run it cell by cell or clone it to your machine.

To get started, make a new R Markdown file in your RStudio, use a regular .R file, or use a cloud version of R. We’re not going to be too fancy here, so you should not have any issues. I am assuming that you can get an R environment working either in RStudio or in the cloud. Let me know if you are having problems in the responses at the end, and I will do what I can to help you troubleshoot the problem!

Let’s get started!

Objectives

1. Learn about binds and joins
2. Combine all data into a single table
3. Resolve all issues to eliminate error messages for the final table (NA’s are okay as long as the structure is right)

Load Libraries and Create the Data

The only library we need is tidyverse. It has many useful features and libraries, but this is just a good habit to have when it comes to being proficient with R. Tidyverse includes the dplyr package, which is where our joins are located. If you want to be a minimalist, just use the dplyr package rather than tidyverse, and you will do just fine.

Let’s get started with R code. Note if you are new: The “#” character is a comment meaning that anything following it will not be run by the computer. I used comments to explain much of the code simply with plain English.

# Tidyverse allows us to quickly and easily load many useful packages and functions with a single line of code# You can install packages and load the package by uncommenting the next two lines
# install.packages("tidyverse")
# library(tidyverse)
# I like require better because it does the same thing in one line of code rather than two.
# Load tidyverse
require(tidyverse)

Next, let’s deal with creating the data. I am going to make three small but subtly different data frames starting with inputting the data into vectors by hand. You’ll see what I mean. Here’s the code:

Create the first data frame and check it comes out right

# Create dataset 1
price <- c(1, 4, 6, 7, 8)
item <- c("candy", "socks", "tea", "coffee", "pens")
retailer <- c("Meijer", "Wal-Mart", "Publix", "Tim Hortons", "Staples")
# Combine vectors into data frame
df1 <- data.frame(price, item, retailer)
# Print top rows
head(df1)

If you’re using an R Markdown file, the output should look like this. If you’re outputting to the console, the same results should appear there.

Image for post
Image for post
Code to Create df1 Object with R

Create the second data frame and check it comes out right

# Create dataset 2
Retailer <- c("Best Buy", "Amazon", "Newegg", "eBay")
Item <- c("GPS", "SD Card Reader", "RTX 3090", "overpriced RTX 2080ti")
Price <- c(200, 20, 1499, 1100)
# Combine vectors into data frame
df2 <- data.frame(Retailer, Item, Price)
# Print top rows
head(df2)

Just like df1, the output here should look like:

Image for post
Image for post
Code to Create df2 Object with R

Create the third data frame and check it comes out right

# Create dataset 3
items <- c("keyboard", "mouse", "Raspberry Pi 4", "books", "dry erase board", "anti-virus software")
Manufacturer <- c("Logitech", "Logitech", "Raspberry Pi Foundation", "Penguin Publishing", "Office Max", "Kaspersky Labs")
price <- c(30, 40, 55, 120, 20, 50)
cost_of_production <- c(10, 12, 40, 30, 4, 8)
# Combine vectors into data frame
df3 <- data.frame(cost_of_production, price, items, Manufacturer)
# Print top rows
head(df3)

Just like df1 and df2, output should look like:

Image for post
Image for post
Code to Create df3 with R

So, there are intentional differences here. Before moving on, ask yourself what looks different between the three data frames. Keep track of these differences while we explore the next section where we get a bunch of common errors. Don’t worry, the answers will be given over time, but it will help you get better with R if you take a few moments now to really examine the code to this point. Good job on making it this far! Let’s keep going :)

Common Errors and Issues with Binds and Joins

Beware! There will be quite a few errors coming up. Don’t worry, I messed these things up a lot too when I was trying to get the hang of this when learning R!

rbind()

So, if you use Google to find how to combine multiple data sets in R, the rbind() function is bound to come up. For your convenience, here’s a link to the documentation.

So let’s just go for it, right? Try this code next:

# First, let's experiment with cbind and rbind. Often when the data requires merging different datasets together, this is the first thing many people try.# rbind Documentation https://www.rdocumentation.org/packages/SparkR/versions/2.4.6/topics/rbind# Try rbind first with just the first two data sets
rbind_test <- rbind(df1, df2)
# Print top rows
head(rbind_test)
# This gave us an error! Why? Let's try cbind next to see if that helps us avoid error messages

The output should look like this:

Image for post
Image for post
Error Message for the rbind() Function with R

cbind()

So, what gives? Let’s try cbind() and see if we can just mash them together. Here is the link to the documentation for cbind().

Here’s the code:

# cbind Documentation: https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/cbind# cbind test with just the first two data sets
cbind_test <- cbind(df1, df2)
# Print top rows
head(cbind_test)
# Whoa! We got a different error that time? What gives? Why is R so particular about things?

Here’s the output:

Image for post
Image for post
cbind() Function Error with R

So, rbind() said the names do not match and cbind() says we have different numbers of rows. What can we do? Let’s try using some kind of join function!

Joins

One of the qualities of R is that is somewhat of a mashup of other languages and concepts from other languages. There is a whole art and science of joining tables and pieces of tables together from the world of databases and SQL. We will not be getting into that here because I like my sanity, but that that’s the origin story.

Here is the documentation link to the different types of joins with the dplyr package. We will not be using all of them, but we’ll get through the more popular ones.

Here are some common issues back to back:

# Well, rbind and cbind do not work, right? They work, so we must be doing something wrong here. Let's see if we can use joins to get our data to cooperate# Documentation link for joins: https://dplyr.tidyverse.org/reference/join.html# Let's try different types of joins# Left Join
left_join_test <- left_join(df1, df2, by = "Price")
head(left_join_test)
# Come on! Another error? Why? Let's try different tables

Output:

Image for post
Image for post
left_join() Error Message with R

Another way to do left_join():

# Maybe data frames 1 and 3 will go together
left_join_test_2 <- left_join(df1, df3, by = "price")
head(left_join_test_2)
# Something happened without an error! But it doesn't look quite right :( Why are there two columns of NA's present?

Output:

Image for post
Image for post
left_join Error Message with R

Okay, so the left_join has a parameter called “by” where we set that equal to “price”. That gave us a result. Not the one we wanted, but a result nonetheless. Our problem surrounding combining these data sets are because of both the column names not being exactly the same for joins and not being the same length for binds.

So let’s try to get all the column names to be the same. There are a lot of ways to do this, but this one is the simplest for what most people need.

Rename Some Columns

We can do this a lot of ways, but this is the simplest illustration I could think to use. At the end of the day, we need columns that mean the same thing to have the exact same name.

Here’s the code:

# I am going to change df1's names to be the same as df2. It's definitely possible to go the other way, but I think this looks better.
# If you are using RStudio, you can click on the object name in the Environment tab or do "view(object)" to see the data
# Change df1's names to match df2
names(df1)[1] <- "Price" # Column 1 is price, but we want it to be Price like df2
names(df1)[2] <- "Item" # same as Price
names(df1)[3] <- "Retailer" # same deal again
# Check the names worked
head(df1)
# There we go! Now df1 and df2 have the same names but in a different order.

Here’s the output:

Image for post
Image for post

Let’s revisit rbind()

Now that df1 and df2 have the same column names, let’s revisit our old friend rbind()!

Here’s the code:

# Let's try rbind again to try to merge df1 and df2. They have the same column names, so this should be great, right?rbind_test_2 <- rbind(df1, df2)# Check result
rbind_test_2
# Look at that! We got our Price, Item, and Retailer columns combined!
# rbind will "bind" the rows from the second data set to the first. df2 "binds" to df1.
# Now let's see what happens with adding df3 into the equation

Here’s the output:

Image for post
Image for post
rbind() Function Works When Column Names Are Exactly the Same with R

Incorporate df3

With a successful binding of df1 and df2, R automatically knows to bind the right columns together. Our next task is to get df3 to cooperate. Let’s take a look at some common errors after we rename the “Price” and “Item” columns to be in line with our previous schema.

Here’s the code:

# We know we need to change the column names, but df3 has different parameters. What's going to happen when we try to bind them together?# Let's change the names we know we need to change
names(df3)[2] <- "Price"
names(df3)[3] <- "Item"
# use rbind
rbind_test_3 <- rbind(rbind_test_2, df3)
# check data
rbind_test_3
# Oh no! number of columns do not match. How can we resolve this?
# We could try taking out columns, but the data would not match up. We would only have two out of four columns be the same anyway. That doesn't really get us that far. We'll probably need some kind of join to accomplish this

Here’s the output:

Image for post
Image for post
rbind() Function Error with R

The rbind() function does not want to cooperate because it is trying to put three columns with five, which it does not understand. We need a different tactic.

Try Inner Join Again

Inner join kind of worked last time. Maybe we can try that again? Let’s see!

Here’s the code:

# Inner Join
IJtest <- inner_join(rbind_test_2, df3)
IJtest # 0 observations of 5 variables? Why?
# What if we swap which is the right and which is the left table?
IJtest2 <- inner_join(df3, rbind_test_2)
IJtest2 # Still 0 observations of 5 rows
# When we inner join, we are looking for where ALL the rows in the first table have matching values in the second table. Since we do not have all matches, we are not actually joining anything. The only step that is completed is R figuing out we're trying to get 5 columns.

Here’s the output:

Image for post
Image for post
inner_join() Function Kind of Working with R

When we inner join, we are looking for where ALL the rows in the first table have matching values in the second table. Since we do not have all matches, we are not actually joining anything. The only step that is completed is R figuring out we’re trying to get 5 columns.

Dang. Let’s try something else.

Try Left Join

Left join should help us out and get the right number of columns. It merges by columns that overlap, so maybe this will get our result.

Here’s the code:

# Left Join
LJtest <- left_join(rbind_test_2, df3)
LJtest
# Alright, well we got some of the data. However, the only thing we accomplished is adding the two extra columns to the picture. Left joining is still just looking for matches that aren't there

Here’s the output:

Image for post
Image for post
left_join() Function Kind of Works with R

We only have 9 rows, when we should have 15. We’re getting closer!

Try Right Join

The right_join() function does a similar thing to left join, but it has a different order of operations and how the tables get combined. Let’s take a look.

Here’s the code:

# Right Join
RJtest <- right_join(rbind_test_2, df3)
RJtest
# Right join is interesting because we get the five columns, but only the six rows of df3. This is because we return the rows from the second table (right table) and all rows from the first table (left table) that have matches. The only thing that matches these conditions is just the data in df3. Kind of weird, but that's how it works.

Here’s the output:

Image for post
Image for post
right_join() Function Kind of Working with R

Not quite what we wanted, but we can’t be that far away from getting it!

Try Full Join

Using full_join() should combine all of our rows and columns the right way, at least in theory. Let’s test it out!

Here’s the code:

# Full Join
FJtest <- full_join(rbind_test_2, df3)
FJtest
# Success! We got the right table. We have 5 columns with 15 rows with all the data. If you want a slightly more appealing looking table, we can switch the tables around in the full_join() function.
# According to the documentation: full_join will "return all rows and all columns from both x and y. Where there are not matching values, returns NA for the one missing." That is exactly what we did here. The next cell shows what we could have done from the beginning to simplify our efforts.

Here’s the output:

Image for post
Image for post
Successful full_join() Function Part 1
Image for post
Image for post
Successful full_join() Function Part 2

We did it! We got all three tables to cooperate with a full_join() function! Since I did this in an R Markdown file, it breaks up the tables like this, so I left it like this and took another screenshot. Depending how you do it, it may be a single table, broken up like mine, or printed to the console.

Another Way We Could Have Done This

Just like most problems in R, there is more than one way to solve the problem. Here is an alternative solution or two that can help you get a better understanding of the magic of the full_join() function!

Here’s the code:

# We could have done all the name changes at one time. I copy/pasted from above for consistency and simplicity.
# Change df1's names to match df2
names(df1)[1] <- "Price" # Column 1 is price, but we want it to be Price like df2
names(df1)[2] <- "Item" # same as Price
names(df1)[3] <- "Retailer" # same deal again
# Change df3's names to match df2
names(df3)[2] <- "Price"
names(df3)[3] <- "Item"
# rbind df1 and df2
FJintermediate <- full_join(df1, df2)
# RBfinal <- rbind(df1, df2) # gives the same result
# Full Join both sets together
FJfinal <- full_join(df3, FJintermediate)
# FJfinal2 <- full_join(df3, RBfinal) # gives the same result
# Check final data
FJfinal
# FJfinal2 # gives the same result
# Note that there is a way to do the reassignment of column names within full_join(), but it is a little dense and harder to explicitly see what is happening with the code. The documentation shows how to do this.

Here’s the output:

Image for post
Image for post
Final Success Image with full_join() in R

Note: The last five rows are on the other table like we just saw above. This is long enough so I didn’t include it for brevity.

Conclusion

Whew! That was a bit of a process, but I wanted to thoroughly show with simple examples how using the full_join() function in R can be a real life saver. Will it solve everything? No. However, I hope that you can take what you learned here today to apply to your problems in the future.

Happy coding!

The Startup

Medium's largest active publication, followed by +752K people. Follow to join our community.

Tyler Harris

Written by

I write about technology and business. Working on a PhD in IT. Have a MS in IT & a BS in Economics with CompTIA Security+, Network+, and A+ certifications.

The Startup

Medium's largest active publication, followed by +752K people. Follow to join our community.

Tyler Harris

Written by

I write about technology and business. Working on a PhD in IT. Have a MS in IT & a BS in Economics with CompTIA Security+, Network+, and A+ certifications.

The Startup

Medium's largest active publication, followed by +752K people. Follow to join our community.

Medium is an open platform where 170 million readers come to find insightful and dynamic thinking. Here, expert and undiscovered voices alike dive into the heart of any topic and bring new ideas to the surface. Learn more

Follow the writers, publications, and topics that matter to you, and you’ll see them on your homepage and in your inbox. Explore

If you have a story to tell, knowledge to share, or a perspective to offer — welcome home. It’s easy and free to post your thinking on any topic. Write on Medium

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store