The Basics of Merging Dataframes

Joel Sherman
4 min readJul 26, 2020

--

As I continue to learn Python’s data science toolkit (which seems as vast and as ever-expanding as the universe, by the way), I’ve come across some useful information in how to manipulate pandas dataframes, and also some of the basics with generating plots in Matplotlib. I wrote about those experiences here and here because the best way to learn is to try to explain things to others, right?

In this installment of learning by explaining, I will tackle some of the basics of merging dataframes in Pandas. But before I get into it, you may be wondering what I mean by “merging” dataframes and maybe more importantly, why you should care about the topic.

Photo by Robert Anasch on Unsplash

What is “Merging” and Why Is It Important?

When doing data analysis, we often need to work with more than one data table or data source. Kaggle is nice for providing clean, single csv files for your machine learning project, but this often isn’t how we do our work in the real-world. It’s estimated and commonly cited that the bulk (80%) of a data scientist’s time is spent data wrangling and cleaning. Merging comes into place when we need to combine multiple datasets or disparate sources of data into one unified frame for analysis.

Quite literally, we can combine data vertically (stack two or more tables with similar columns to combine rows) or horizontally (stack two or more tables with similar rows to combine columns), and Pandas offers a variety of tools for doing this, including append(), concat(), join() and, you guessed it, merge(). And when I use the term “merge”, I refer to any and all of these methods unless I specifically state “merge method”.

Why Not Do Your Merging In SQL?

I know what you’re thinking, and as a long-time SQL developer, let me just say that I hear you! This concept is so engrained in SQL’s DNA that it’s hard not to want to prepare or engineer your frame in SQL first, before bringing it into a development environment for further modeling and analysis. But here’s the problem: not all data worthy of analysis is structured, nor lives in an RDBMS! More (most?) often, the data we need for machine learning models or data visualizations is ad hoc, highly unstructured and dispersed in a variety of locations. Once again, the power of Pandas and merging!

I’ll stick with my own personal sleep and cycling training data for this exercise (pun intended) and make the files available on my public github account here and here in case anyone wants to replicate my work. After some tidying and subsetting, I have two, one-week spans of sleep data that look like this

As well as one, two-week span of training and recovery data that looks like this

How can we bring these three dataframes together into one dataframe for analysis?

Vertically Merging Data

Let’s start with the sleep data, which are my sleep times (in seconds) for each of three sleep stages (REM, Deep and Light) per day. We obviously need to stack these frames vertically, and we have a couple options.

Option 1: Append()

The append method is like a SQL union, which allows dataframes to be merged vertically, on top of each other. We can do this to our sleep dataframes using append() like this

Sleep_append = sleep1.append(sleep2)

Option 2: Concat()

As an alternative to append, concat can merge dataframes either vertically or horizontally, using the axis argument, specifically axis=1 for horizontal, and axis=0 for the vertical. Here it is in action

Sleep_concat = pd.concat([sleep1, sleep2], axis=0)

With either method, we get a merged sleep dataframe called Sleep with 2 weeks worth of data, like this

Horizontally Merging Data

Now let’s merge the training and recovery data, which contain my daily recovery points (RecPts), as well as the previous days trainings’ real-perceived exertion (RPE) and training stress (TSS). Since these data and the sleep data share a row (i.e. the date), we’ll merge this horizontally. Again, Pandas gives us options.

Option 1: Concat()

As mentioned above, we can specify the axis=1 argument to merge the frames horizontally with concat(), like this

Combined_concat = pd.concat([Sleep, TraRec], axis=1)

Option 2: Merge()

The merge method works similarly, but is perhaps more flexible when working with frames with different labels, composite indexes, etc. In our simple case, it works like this to produce the exact same dataframe as concat() above

Combined_merge = pd.merge(Sleep, TraRec, on='date')

Option 3: Join()

And finally, the method nearest and dearest to my SQL-scripting heart, which is actually a more generalized version of the merge() method above, is the join() method. It works like this

Combined_join = Sleep.join(TraRec, how='inner')

Any one of these options will produce our combined dataframe below

Those are some of the very basics in merging dataframes, both vertically and horizontally.

--

--

Joel Sherman

I’m an experienced data professional at the intersection of public policy and economics, trying to make sense of the world, one dataset at a time.