Data Analysis in R- Series(VII); Performing Filtering Joins using dplyr
Introduction
Welcome to part seven of the “Data Analysis in R” series. In this article, we will be discussing filtering joins in R. If you haven’t, you should read part six of this series first. In part six, I introduce joins; what they are, what they are used for, and all the various kinds of joins, as well as some of the basic things you need to know before getting into it fully. If you have read part six, let’s get started with part seven.
Recall that in R, various types of joins can be performed using functions from the dplyr
package.
- Semi Join: It returns only the rows from the first data frame that have a match in the second dataset.
- Anti Join: It returns only the rows from the first data frame that do not have a match in the second dataset.
The appropriate join type depends on the specific requirements of your data integration and analysis tasks. Additionally, we will discuss bind_rows() and bind_cols(), which are other ways to merge data in R.
Semi Join
A semi join returns only the rows from the first (left) dataset that have a match in the second (right) dataset. It is used to filter rows from the first dataset that have matching values in the specified columns in the second dataset. It is helpful when you want to find common elements between datasets without duplicating rows.
“A semi join differs from an inner join because an inner join will return one row of x for each matching row of y, whereas a semi join will never duplicate rows of x.”- statseducation.com
Anti Joins
An anti join returns only the rows from the first (left) dataset that do not have a match in the second (right) dataset. It is used to filter out rows from the first dataset that do not have corresponding matching values in the specified columns of the second dataset. It helps in finding elements that are unique to the first dataset compared to the second dataset.
Dataset
# Load the required library
library(dplyr)
# Create the customers table
customers <- data.frame(
CustomerID = c(1, 2, 3, 4, 5),
CustomerName = c("Alice", "Bob", "Charlie", "David", "Eva"))
# Create the orders table
orders <- data.frame(
OrderID = c(101, 102, 103, 104, 105, 106),
CustomerID = c(3, 2, 4, 1, 3, 6))
Now we have two data frames, one named customers
and one named orders
, with CustomerID
as the common field.
Inner Join vs Semi Join vs Anti Join- An example
Inner_Join
# Inner Join
inner_join_result <- inner_join(customers, orders, by = "CustomerID")
print(inner_join_result)
#sample output
CustomerID CustomerName OrderID
1 1 Alice 104
2 2 Bob 102
3 3 Charlie 101
4 3 Charlie 105
5 4 David 103
The inner join returns rows where there is a match between the “CustomerID” in the “customers” table and the “CustomerID” in the “orders” table. It keeps only the matched rows from both tables.
Semi_Join
# Semi Join
semi_join_result <- semi_join(customers, orders, by = "CustomerID")
print(semi_join_result)
#sample output
CustomerID CustomerName
1 1 Alice
2 2 Bob
3 3 Charlie
4 4 David
The semi join returns only the rows from the “customers” table where there is a match between the “CustomerID” in the “customers” table and the “CustomerID” in the “orders” table. It filters out the customers who do not have any orders. Notice how only one instance of the record with customer id 3 is returned in the output, this is because the semi_join() function only returns unique records.
Anti_Join
# Anti Join
anti_join_result <- anti_join(customers, orders, by = "CustomerID")
print(anti_join_result)
#sample output
CustomerID CustomerName
1 5 Eva
The anti join returns only the rows from the “customers” table that do not have matching values in the “CustomerID” column with the “CustomerID” column of the “orders” table. It filters out the customers who have placed orders and retains only the customers who have not placed any orders.
bind_rows()
Additionally, we will discuss the bind_rows()
verb which is another way to merge data in R.
The bind_rows()
function in dplyr
is used to combine multiple data frames vertically, stacking them one on top of another, instead of joining them side by side. It combines rows from different data frames to create a single larger data frame. This function is useful when you have multiple datasets with the same columns and want to stack them together.
Let us create a dataset once again to demonstrate how this function works. This dataset contains data about students taking the same subject but in different classes/sections.
# Load the required libraries
library(dplyr)
# Create dataset 1
section1 <- data.frame(
ID = c(1, 2, 3, 4),
Name = c("Alice", "Bob", "Charlie", "David"),
Score = c(85, 92, 78, 95))
# Create dataset 2
section2 <- data.frame(
ID = c(2, 3, 5, 6),
Name = c("Eva", "Frank", "Grace", "Hannah"),
Score = c(88, 90, 85, 89))
Now let’s apply bind_rows()
# Combine the datasets using bind_rows()
all_classes <- section1 %>%
bind_rows(section2)
# Print the combined dataset
print(all_classes)
#sample output
ID Name Score
1 1 Alice 85
2 2 Bob 92
3 3 Charlie 78
4 4 David 95
5 2 Eva 88
6 3 Frank 90
7 5 Grace 85
8 6 Hannah 89
In this example, bind_rows()
has combined section1
and section2
vertically to create all_classes
. The resulting dataset contains all the rows from section1
followed by all the rows from section2
.
This could be a bit confusing, though. How do we know what section a student belongs to? You can mutate each data frame individually to add a class column that lets us know what section the student belongs to when joined. Let’s demonstrate this below:
# Add the "class" column to section1 data frame
section1 <- section1 %>%
mutate(Class = "Section1")
#print
section1
#sample output
ID Name Score Class
1 1 Alice 85 Section1
2 2 Bob 92 Section1
3 3 Charlie 78 Section1
4 4 David 95 Section1
# Add the "class" column to section2 data frame
section2 <- section2 %>%
mutate(Class = "Section2")
#print
section2
#sample output
ID Name Score Class
1 2 Eva 88 Section2
2 3 Frank 90 Section2
3 5 Grace 85 Section2
4 6 Hannah 89 Section2
Now, when we bind the rows, we can easily find out what section each student belongs to.
# Combine the datasets using bind_rows()
all_classes <- bind_rows(section1, section2)
# Print the combined dataset
print(all_classes)
#sample_output
ID Name Score Class
1 1 Alice 85 Section1
2 2 Bob 92 Section1
3 3 Charlie 78 Section1
4 4 David 95 Section1
5 2 Eva 88 Section2
6 3 Frank 90 Section2
7 5 Grace 85 Section2
8 6 Hannah 89 Section2
** Note that with bind_rows()
, the column names from both datasets must match for it to work correctly. If the column names do not match, bind_rows()
will return an error. In my examples, I have demonstrated two ways to execute the bind_rows() function. You can also pass more than two data frames separated by commas.
Let us perform a small analysis using this combined dataset to show how it could be utilized.
all_classes %>%
group_by(Class) %>%
summarize(average_score = mean(Score))
#sample output
Class average_score
<chr> <dbl>
1 Section1 87.5
2 Section2 88
bind_cols()
bind_cols
is a function that allows you to bind columns of different data frames together. It is used to combine data frames horizontally, such that the columns from one data frame are added next to the columns from another data frame. This function is useful when you have data with the same number of rows and want to merge them side by side. It also does not require the data frames to have any columns in common.
Let’s demonstrate bind_cols
with a simple example:
# Load the required library
library(dplyr)
# Create two data frames
students <- data.frame(ID = c(1, 2, 3), Name = c("Alice", "Bob", "Charlie"))
scores <- data.frame(Score = c(85, 92, 78))
# Combine the data frames using bind_cols
class_info <- bind_cols(students, scores)
# Print the combined data frame
print(class_info)
#sample output
ID Name Score
1 1 Alice 85
2 2 Bob 92
3 3 Charlie 78
In this example, we created two data frames students
and scores
. students
contains columns "ID" and "Name," and scores
contains a column "Score." We used bind_cols
to merge these two data frames horizontally based on the fact the they both contain the same number of rows, resulting in the class_info
.
That’s it for part seven of this series!, if you have made it this far, I applaud you, well done really. Remember, it is okay not to understand new concepts immediately you encounter them, so do not feel discouraged if that happens to be the case. Feel free to come back to this article as many times as you need and practice as many times as it takes for you to understand what you’re learning, you will get there eventually.
I hope you enjoyed this article and found it useful, and I also hope you’ll tune in for part eight where we will discuss Data Visualization using ggplot. I strongly encourage you to find datasets to practice these dplyr concepts.
Kindly share this article with those who need it if you found it helpful, and peruse my Medium profile to find the other parts of this series. Happy learning!