Joins in Pandas
Introduction to Joins in Pandas
Pandas is a popular Python library for data analysis that provides powerful tools for manipulating and analyzing data. One of the most common operations when working with data is joining two or more data sets together.
In Pandas, joins can be performed using the merge
function. The merge
function combines two DataFrames based on the values of one or more columns. There are four types of joins in Pandas: inner join, left join, right join, and outer join.
Inner Join
An inner join returns only the rows that have matching values in both Data Frames. Here’s some sample code that demonstrates an inner join:
import pandas as pd
# create the first DataFrame
df1 = pd.DataFrame({
'employee_id': ['1', '2', '3', '4'],
'name': ['John', 'Mary', 'David', 'Sarah']
})
# create the second DataFrame
df2 = pd.DataFrame({
'employee_id': ['2', '3', '4', '5'],
'department': ['HR', 'Finance', 'IT', 'Marketing']
})
# perform an inner join on the 'employee_id' column
result = pd.merge(df1, df2, on='employee_id', how='inner')
print(result)
Output:
employee_id name department
0 2 Mary HR
1 3 David Finance
2 4 Sarah IT
In this example, the merge
function performs an inner join on the employee_id
column. The resulting DataFrame contains only the rows that have matching values in both DataFrames.
Left Join
A left join returns all the rows from the left DataFrame and the matching rows from the right DataFrame. If there are no matching rows in the right DataFrame, the result will contain NaN
values. Here's some sample code that demonstrates a left join:
import pandas as pd
# create the first DataFrame
df1 = pd.DataFrame({
'employee_id': ['1', '2', '3', '4'],
'name': ['John', 'Mary', 'David', 'Sarah']
})
# create the second DataFrame
df2 = pd.DataFrame({
'employee_id': ['2', '3', '4', '5'],
'department': ['HR', 'Finance', 'IT', 'Marketing']
})
# perform a left join on the 'employee_id' column
result = pd.merge(df1, df2, on='employee_id', how='left')
print(result)
Output:
employee_id name department
0 1 John NaN
1 2 Mary HR
2 3 David Finance
3 4 Sarah IT
In this example, the merge
function performs a left join on the employee_id
column. The resulting DataFrame contains all the rows from the left DataFrame and the matching rows from the right DataFrame. The row with employee_id
equal to '1'
doesn't have a matching row in the right DataFrame, so the department
value is NaN
.
Right Join
A right join returns all the rows from the right DataFrame and the matching rows from the left DataFrame. If there are no matching rows in the left DataFrame, the result will contain NaN
values. Here's some sample code that demonstrates a right join:
import pandas as pd
# create the first DataFrame
df1 = pd.DataFrame({
'employee_id': ['1', '2', '3', '4'],
'name': ['John', 'Mary', 'David', 'Sarah']
})
# create the second DataFrame
df2 = pd.DataFrame({
'employee_id': ['2', '3', '4', '5'],
'department': ['HR', 'Finance', 'IT', 'Marketing']
})
# perform a right join on the 'employee_id' column
result = pd.merge(df1, df2, on='employee_id', how='right')
print(result)
Output:
employee_id name department
0 2 Mary HR
1 3 David Finance
2 4 Sarah IT
3 5 NaN Marketing
In this example, the merge
function performs a right join on the employee_id
column. The resulting DataFrame contains all the rows from the right DataFrame and the matching rows from the left DataFrame. The row with employee_id
equal to '5'
doesn't have a matching row in the left DataFrame, so the name
value is NaN
.
Outer Join
An outer join returns all the rows from both DataFrames. If there are no matching rows, the result will contain NaN
values. Here's some sample code that demonstrates an outer join:
import pandas as pd
# create the first DataFrame
df1 = pd.DataFrame({
'employee_id': ['1', '2', '3', '4'],
'name': ['John', 'Mary', 'David', 'Sarah']
})
# create the second DataFrame
df2 = pd.DataFrame({
'employee_id': ['2', '3', '4', '5'],
'department': ['HR', 'Finance', 'IT', 'Marketing']
})
# perform an outer join on the 'employee_id' column
result = pd.merge(df1, df2, on='employee_id', how='outer')
print(result)
Output:
employee_id name department
0 1 John NaN
1 2 Mary HR
2 3 David Finance
3 4 Sarah IT
4 5 NaN Marketing
In this example, the merge
function performs an outer join on the employee_id
column. The resulting DataFrame contains all the rows from both DataFrames. The row with employee_id
equal to '1'
doesn't have a matching row in the right DataFrame, so the department
value is NaN
. Similarly, the row with employee_id
equal to '5'
doesn't have a matching row in the left DataFrame, so the name
value is NaN
.
Joins are a powerful tool in Pandas that allow you to combine multiple DataFrames based on the values of one or more columns. In this article, I covered the four types of joins in Pandas: inner join, left join, right join, and outer join. I also provided sample code to illustrate how each type of join works.
If you find this blog useful, then you can share a few claps, comment, and follow me for more data science stuff.