Data Analysis in R- Series(VII); Performing Filtering Joins using dplyr

7 min readJul 29, 2023

Introduction

Welcome to part seven of the “Data Analysis in R” series. In this article, we will be discussing filtering joins in R. If you haven’t, you should read part six of this series first. In part six, I introduce joins; what they are, what they are used for, and all the various kinds of joins, as well as some of the basic things you need to know before getting into it fully. If you have read part six, let’s get started with part seven.

Recall that in R, various types of joins can be performed using functions from the dplyr package.

Semi Join: It returns only the rows from the first data frame that have a match in the second dataset.
Anti Join: It returns only the rows from the first data frame that do not have a match in the second dataset.

The appropriate join type depends on the specific requirements of your data integration and analysis tasks. Additionally, we will discuss bind_rows() and bind_cols(), which are other ways to merge data in R.

Semi Join

A semi join returns only the rows from the first (left) dataset that have a match in the second (right) dataset. It is used to filter rows from the first dataset that have matching values in the specified columns in the second dataset. It is helpful when you want to find common elements between datasets without duplicating rows.

“A semi join differs from an inner join because an inner join will return one row of x for each matching row of y, whereas a semi join will never duplicate rows of x.”- statseducation.com

Anti Joins

An anti join returns only the rows from the first (left) dataset that do not have a match in the second (right) dataset. It is used to filter out rows from the first dataset that do not have corresponding matching values in the specified columns of the second dataset. It helps in finding elements that are unique to the first dataset compared to the second dataset.

Dataset

# Load the required library
library(dplyr)

# Create the customers table
customers <- data.frame(
  CustomerID = c(1, 2, 3, 4, 5),
  CustomerName = c("Alice", "Bob", "Charlie", "David", "Eva"))

# Create the orders table
orders <- data.frame(
  OrderID = c(101, 102, 103, 104, 105, 106),
  CustomerID = c(3, 2, 4, 1, 3, 6))

Now we have two data frames, one named customers and one named orders, with CustomerID as the common field.

Inner Join vs Semi Join vs Anti Join- An example

Inner_Join

# Inner Join
inner_join_result <- inner_join(customers, orders, by = "CustomerID")
print(inner_join_result)


#sample output
   CustomerID CustomerName OrderID
1          1        Alice     104
2          2          Bob     102
3          3      Charlie     101
4          3      Charlie     105
5          4        David     103

The inner join returns rows where there is a match between the “CustomerID” in the “customers” table and the “CustomerID” in the “orders” table. It keeps only the matched rows from both tables.

Semi_Join

# Semi Join
semi_join_result <- semi_join(customers, orders, by = "CustomerID")
print(semi_join_result)


#sample output
   CustomerID   CustomerName
1          1        Alice
2          2          Bob
3          3      Charlie
4          4        David

The semi join returns only the rows from the “customers” table where there is a match between the “CustomerID” in the “customers” table and the “CustomerID” in the “orders” table. It filters out the customers who do not have any orders. Notice how only one instance of the record with customer id 3 is returned in the output, this is because the semi_join() function only returns unique records.

Anti_Join

# Anti Join
anti_join_result <- anti_join(customers, orders, by = "CustomerID")
print(anti_join_result)


#sample output
  CustomerID   CustomerName
1       5        Eva

The anti join returns only the rows from the “customers” table that do not have matching values in the “CustomerID” column with the “CustomerID” column of the “orders” table. It filters out the customers who have placed orders and retains only the customers who have not placed any orders.

bind_rows()

Additionally, we will discuss the bind_rows() verb which is another way to merge data in R.

The bind_rows() function in dplyr is used to combine multiple data frames vertically, stacking them one on top of another, instead of joining them side by side. It combines rows from different data frames to create a single larger data frame. This function is useful when you have multiple datasets with the same columns and want to stack them together.

Let us create a dataset once again to demonstrate how this function works. This dataset contains data about students taking the same subject but in different classes/sections.

# Load the required libraries
library(dplyr)

# Create dataset 1
section1 <- data.frame(
  ID = c(1, 2, 3, 4),
  Name = c("Alice", "Bob", "Charlie", "David"),
  Score = c(85, 92, 78, 95))

# Create dataset 2
section2 <- data.frame(
  ID = c(2, 3, 5, 6),
  Name = c("Eva", "Frank", "Grace", "Hannah"),
  Score = c(88, 90, 85, 89))

Now let’s apply bind_rows()

# Combine the datasets using bind_rows()
all_classes <- section1 %>%
bind_rows(section2)

# Print the combined dataset
print(all_classes)

#sample output
  ID    Name Score
1  1   Alice    85
2  2     Bob    92
3  3 Charlie    78
4  4   David    95
5  2     Eva    88
6  3   Frank    90
7  5   Grace    85
8  6  Hannah    89

In this example, bind_rows() has combined section1 and section2 vertically to create all_classes. The resulting dataset contains all the rows from section1 followed by all the rows from section2.

This could be a bit confusing, though. How do we know what section a student belongs to? You can mutate each data frame individually to add a class column that lets us know what section the student belongs to when joined. Let’s demonstrate this below:

# Add the "class" column to section1 data frame
section1 <- section1 %>% 
mutate(Class = "Section1")

#print
section1

#sample output
  ID    Name     Score   Class
1  1    Alice    85      Section1
2  2    Bob      92      Section1
3  3    Charlie  78      Section1
4  4    David    95      Section1



# Add the "class" column to section2 data frame
section2 <- section2 %>% 
mutate(Class = "Section2")

#print
section2

#sample output
  ID    Name   Score   Class
1  2    Eva    88      Section2
2  3    Frank  90      Section2
3  5    Grace  85      Section2
4  6    Hannah 89      Section2

Now, when we bind the rows, we can easily find out what section each student belongs to.

# Combine the datasets using bind_rows()
all_classes <- bind_rows(section1, section2)

# Print the combined dataset
print(all_classes)

#sample_output
  ID    Name   Score  Class
1  1   Alice    85     Section1
2  2     Bob    92     Section1
3  3 Charlie    78     Section1
4  4   David    95     Section1
5  2     Eva    88     Section2
6  3   Frank    90     Section2
7  5   Grace    85     Section2
8  6  Hannah    89     Section2

** Note that with bind_rows(), the column names from both datasets must match for it to work correctly. If the column names do not match, bind_rows() will return an error. In my examples, I have demonstrated two ways to execute the bind_rows() function. You can also pass more than two data frames separated by commas.

Let us perform a small analysis using this combined dataset to show how it could be utilized.

all_classes %>%
group_by(Class) %>%
summarize(average_score = mean(Score))


#sample output
  Class    average_score
  <chr>            <dbl>
1 Section1          87.5
2 Section2          88

bind_cols()

bind_cols is a function that allows you to bind columns of different data frames together. It is used to combine data frames horizontally, such that the columns from one data frame are added next to the columns from another data frame. This function is useful when you have data with the same number of rows and want to merge them side by side. It also does not require the data frames to have any columns in common.

Let’s demonstrate bind_cols with a simple example:

# Load the required library
library(dplyr)

# Create two data frames
students <- data.frame(ID = c(1, 2, 3), Name = c("Alice", "Bob", "Charlie"))
scores <- data.frame(Score = c(85, 92, 78))

# Combine the data frames using bind_cols
class_info <- bind_cols(students, scores)

# Print the combined data frame
print(class_info)


#sample output
   ID  Name   Score
1  1   Alice    85
2  2     Bob    92
3  3 Charlie    78

In this example, we created two data frames students and scores. students contains columns "ID" and "Name," and scores contains a column "Score." We used bind_cols to merge these two data frames horizontally based on the fact the they both contain the same number of rows, resulting in the class_info.

That’s it for part seven of this series!, if you have made it this far, I applaud you, well done really. Remember, it is okay not to understand new concepts immediately you encounter them, so do not feel discouraged if that happens to be the case. Feel free to come back to this article as many times as you need and practice as many times as it takes for you to understand what you’re learning, you will get there eventually.

I hope you enjoyed this article and found it useful, and I also hope you’ll tune in for part eight where we will discuss Data Visualization using ggplot. I strongly encourage you to find datasets to practice these dplyr concepts.

Kindly share this article with those who need it if you found it helpful, and peruse my Medium profile to find the other parts of this series. Happy learning!

Connect with me on LinkedIn and Twitter :)