R to Python: A Guide to Recreating Dplyr’s Convenient Joins in Python

Rob W
Artificially Intelligent
5 min readMay 1, 2023
Photo by Alexander Grey

Introduction

If you are one of the many R users who is making the shift to python, you may find yourself depending on the convenience of some of R’s most beloved libraries. At the surface, the jump from the convenience and simplicity of R can seem a bit daunting as the python landscape, while ample, can often produce what feels like too many translations for a given piece of functionality. Finding straight forward translations is not always easy.

The purpose of this is to share an informed perspective about which methods to lean on as you shift your workflow over to python.

For this article, we’ll be exploring alternatives to dplyr’s convenient methods for performing all sorts of joins.

Stay tuned for other articles that address some of these key questions.

What you’ll learn

For this article, we’ll be breaking down primary translations for R’s join functions within the dplyr library.

By reading this article, you can expect to learn primary python translations for the following functions:

  • left_join
  • right_join
  • inner_join
  • outer_join
  • anti_join
  • semi_join

Joins Overview

Join functions are used to combine two or more data frames based on a common column.

Below is a quick review of the varying definitions

  • left_join: returns all rows from the left table, and any matching rows from the right table
  • right_join: returns all rows from the right table, and any matching rows from the left table
  • inner_join: returns only the records that appear in both tables
  • outer_join: This function returns all rows from both tables, and fills in any missing values with NA
  • anti_join: This function returns all rows from the left table that do not have a match in the right table.
  • semi_join: This function returns all rows from the left table that have a match in the right table.

Comparison Time

Before we jump in, we’ll be using two sample dataframes. I’ll include the code to generate these two sample dataframes so you can follow along:

# libraries
import pandas as pd
# sample df1
df1 = pd.DataFrame({
'join_column': ['A', 'B', 'C', 'D'],
'col1': [10, 20, 30, 40],
'col2': [100, 200, 300, 400]
})
# sample df2
df2 = pd.DataFrame({
'join_column': ['B', 'D', 'E', 'F'],
'col3': ['X', 'Y', 'Z', 'W'],
'col4': ['P', 'Q', 'R', 'S']
})

With that out of the way, let’s dive in

Left Join

R

In r you get to enjoy the simplicity and flow of dplyr, you specify your dataframe of interest, df1 in this case, and join to it df2 on the common field

df3 <- df1 %>%
left_join(df2, by = "join_column")

Python

Luckily in python things aren’t all that different. The main difference is syntactical, merge versus left_join, on rather than by and so forth.

One other difference is that we’re able to use .merge as a method on the left dataframe. One of the interesting and more convenient aspects of merge is the how parameter. This allows us to determine the type of join we’re interested in. Not that it’s all that tough to remember left_join, right_join, inner_join, but hey… sometimes you’ve got to enjoy the simple things.

df3 = df1.merge(df2, on = "join_column", how = "left")

One other note is that you can also call merge as a function, in which case the command would look like this:

df3 = pd.merge(df1, df2, on = "join_column", how = "left")

Not super different, but useful to be aware that both function and method are from pandas and both are available.

The result:

Right Join

Near self explanatory, the only thing changing here is that we’re performing a right join, also the same thing as swapping which dataframe we call first.

We’ll breeze through the examples here.

R

Only thing changing here is the function

df3 <- df1 %>%
right_join(df2, by = "join_column")

Python

df3 = df1.merge(df2, on = "join_column", how = "right")

The result:

Inner Join

Similar to right join, the next few functions are super simple, so we’ll breeze through.

R

df3 <- df1 %>%
inner_join(df2, by = "join_column")

Python

df3 = df1.merge(df2, on = "join_column", how = "inner")

The result:

Outer Join

R

df3 <- df1 %>%
outer_join(df2, by = "join_column")

Python

df3 = df1.merge(df2, on = "join_column", how = "outer")

Anti Join

Here’s where Python gets a tad bit more involved, but in its defense, anti joins are not the most common.

R

df3 <- df1 %>%
anti_join(df2, by = "join_column")

Python

df3 = df1.merge(df2["join_column"], on = "join_column", how = "left", indicator=True)
.query('_merge == "left_only"')
.drop('_merge', 1)

There are a few different pieces to explain here.

When we populate the indicator parameter to True, a new column called “_merge” is added, which in the case of a left join says “left_only” or “both”. As you might guess, “left_only” means that there wasn’t a match in the right dataframe and in the case of “both”, there was a match.

We indicate True so that we can then use .query to specify that we only want records that were not found in the right dataframe.

And finally, as a bit of cleanup, we drop the “_merge” column.

And the last piece I’ll mention, we also have to subset the right dataframe to the join column only. Otherwise we will have added all of the null columns from the right dataframe as well. Which would be pointless and against the traditional use of the anti_join.

Like I said, a lot more involved, but still simple enough conceptually.

The results:

Semi Join

As a reminder, semi joins are the the opposite of the anti_join. We will only return records from the left dataframe in which there are matches with the right dataframe.

Effectively nothing changes in R.

R

df3 <- df1 %>%
anti_join(df2, by = "join_column")

Python

Similar to what we saw above with anti joins; however, in this case we query where _merge is equal to “both” which requires that there was a match.

pd.merge(df1, df2['join_column'], on = "column_name", how = "left", indicator=True)\
.query('_merge == "both"')\
.drop('_merge', 1)

Conclusion:

We’ve covered a lot of ground quickly. When it comes to join pandas steps in similar fashion to dplyr in R. We’ve explored simple and effective python translations for key join functionality in R, specifically, dplyr.

Specifically, we’ve explored the following types of joins:

  • left_join
  • right_join
  • inner_join
  • outer_join
  • anti_join
  • semi_join

We hope that this article proves helpful. Let us know what alternative methods you make use of in Python rather than pandas’ merge functionality.

--

--

Rob W
Artificially Intelligent

I’m a Data Science Leader sharing lessons learned & tips of the trade! Twitter: @data_lessons