Python Pandas-Merging Using the ‘how’ Argument
In reality, Data Science projects often involves gathering information from variety of sources which might require data from multiple tables. Therefore, in order to conduct analysis, there is a need to join the tables. Merging in Python Pandas is a very effective way to successfully carry out this operation.
This tutorial aims to take you through the step by step process of different methods of merging data frames with pandas library in Python using the “how” argument.
First, Import datasets and convert to tables
import pandas as pd
df = pd.read_csv(‘link.csv’)
#convert year column to dateobject
df[‘Year’] = pd.to_datetime(df[‘Year’], format = ‘%Y’).dt.year
#groupby country
df= pd.DataFrame(df1.groupby([‘Country’, ‘Year’],as_index=False).sum())
print(df)
df1 = pd.read_excel (r’path were file is stored.xlsx’)
print(df1)
So, we have two tables: df and df1
df columns= Country, Year and Value
df1 columns= Country Name, Country Code, Year and value
In order to merge both tables, a primary key is needed. Notice that the column that signifies Country has different names for both tables. Therefore, let’s rename table df1 ‘Country Name’ column to ‘Country’ and group by the column.
Now, merging both dataframes
Notice that we merged left, this implies that the table above contain only rows that match with table df only. That is, any extra countries contained in table df1 that is not in table df is not included in the above table df2
Now let’s use right, inner and outer merge
The inner merge remove rows that do not match in both dataframes. This is the default pandas merge in Python if you do not specify the kind of merge you want.
Kindly let me know of any comments, suggestions or questions you might have :)