Update a DataFrame Based on Common Values of Two DataFrames

Published in

The Startup

5 min readSep 9, 2020

Imagine you have a spreadsheet of employee data. You have their name, their age, their contact information, etc. This data changes over time as employees enter and/or leave the company, or simply update their information.

In today’s tutorial, I will show you how to update a pandas DataFrame in this professional context to keep track of the employee status: if they are still present in the new data(frame), then we just update their information; otherwise we assume they have left the company. Brand new employees will simply be added to our dataset. We’ll assume we get a new listing of our employees in regular intervals and all current employees are included in that listing. We then use that listing to update our dataset with the latest data.

For instance, according to the dataset at this point in time

The company only has four employees.

However, we just got a new employee listing and it looks like this

Over time we got two new employees (Kevin and Stanley with ids 5 and 6), but the employee with id 2 is missing. In the context of this demo we’ll interpret this as that employee leaving the company. We’ll update the first DataFrame with the new data, but the ids won’t reset. So at the end, the employee dataset will look like this

So that we know who has been a part of our company over time, we see that Jim (id 2) is still in the dataset, and Kevin and Stanley are now a part of the company. However, Jim, the only “original” employee that was not in the latest listing, now has a “in_company” value of False, that is, he left the company. Notice also that Michael and Pam are one year older and Dwight has changed his email address. For employees that leave the company we’ll keep whatever was the last data we had about them (age and email in this demo).

The code

Okay, now that you have seen the results of the code, let’s dive into the code.

Main script

As you can see in the code gist above, we only need a handful of lines to update the DataFrame. A good chunk of the script is to create the DataFrames since we are not loading any external files to keep the demo simple.

So, the DataFrames are exactly the same you’ve seen in the beginning, but to remind ourselves, this is the original employee dataset

And this is the latest listing we got, that is, the data used to update the dataset

Again, notice how Jim is not in the new listing, that is, we assume he has left the company. On the other hand, new people have joined and others have changed their information.

Moving on to the important code: line 23 of the code gist. This is probably the most important line, as we locate all rows whose employee id is in the “id” column of both the original and the new DataFrames. Since the data we want to change for employees that left the company is in the “in_company” column, we also specify that column to change the values of all matching cells to False.

To make that code clearer, the original["id"].isin(new_data["id"]) part returns a pandas Series of boolean values where True means the employee id is present in both DataFrames and False otherwise.

As expected, only Jim, the second employee of the four original employees, returns False because his id is only present in the first DataFrame. Michael, Pam and Dwight are in both DataFrames, so the id check returns True. The tilde (~) that precedes this check

~original["id"].isin(new_data["id"])

“inverts” the values of the check. In other words, only Jim has a value of True now. This is important because the values of the “in_company” column will only be changed for rows that return True in this check. In other words, we find which employees are in both DataFrames, and select the other ones to update the employment status accordingly.

The rest of the code is pretty simple. On line 27 we append the new DataFrame to the original DataFrame that now has the employment statuses updated. Keep in mind we can safely append the rows of new data because the DataFrames’ columns match, most importantly their column names and data types. However, this operation creates duplicate entries for the employees that were already recorded.

This is the purpose of line 30. It looks at the “id” column to find duplicate ids and only keeps their last occurrence. Since the new DataFrame had the up to date information of all employees, we assume the age and email in that new DataFrame is what we want to keep about them.

The operations on lines 32 and 35 are just some housekeeping. We sort the resulting DataFrame by employee “id” from oldest to most recent employee, and reset the DataFrame index (not the ids!) to have a nice default index.

One last time, this is our employee data, updated with the most recent information and employment status

Based on the latest listing, Jim is no longer part of the company, but Kevin and Stanley joined at some point. Michael, Pam and Dwight also updated their information.

Conclusion

I had a specific scenario in mind and this was the demo I created as a proof of concept. I think it is an important data transformation in pandas, to change the values of a column in a DataFrame, based on the presence of the values of another column in a different DataFrame.

I’ve struggled to put this transformation into words when searching it online, but I hope this article now offers you a solution if you come across this problem in the future.

As usual, the code is available on my GitHub repository.

Update a DataFrame Based on Common Values of Two DataFrames

The code

Conclusion

Written by José Fernando Costa