Pivoting with Python — A Baller Excel Move Ported to Pandas

Chris Bruehl
Learning Data
7 min readJun 8, 2023

--

Photo by The New York Public Library on Unsplash

Pivot Tables are a staple in data analysis for Excel users, and for good reason. They allow us to aggregate data in a multi-dimensional manner, by “pivoting” the unique values of a column into individual columns, making aggregations “wide” rather than “long” as we see with traditional groupby aggretation. This can yield greater insights than “long”, and in many cases make plotting much easier as well.

It took me a few years of working with Pandas to really make Pivot Tables a staple of my analytical workflow. For years I had shunned anything Excel in favor of SQL & Pandas, but when I rediscovered the pivot_table() method in Pandas, I found a lot of great use cases for it. Sorry Excel friends, I was lost.

Introduction to Pivot Tables

We have a dataset representing bookings at a Portugese hotel. There are a lot of great angles for analysis here. We might ask:

  • Which countries a responsible the most revenue?
  • Do different age groups spend more per night?
  • Which market segments produce the most revenue for each country?
  • Has the revenue ranking for countries shifted over the years?

While we can answer all of these questions without pivot tables, the last two questions were why pivot tables were created. Rather than make a data set that has #market_segments * #countries rows with a traditional groupby, pivoting out one of our categorical variables, such as country, can present the data much more clearly for analysis. Any time we add a second column to aggregate by, it’s worth considering a pivot table.

Let’s take a look:

Summing Total Revenue by Nationality and Market Segment

Which results in the following DataFrame:

Because .pivot_table() is a DataFrame method, it will naturally recognize column names from the DataFrame we’re pivoting.

The primary arguments for the pivot_table method are as follows:

  • index — Specifies which column should provide the values that make up the index, or row labels of the pivoted dataframe
  • columns — Specifies which column should provide the values that make up the column labels of the pivoted dataframe
  • values — Specifies which column we want to aggregate
  • aggfunc — Specifies which column we want to aggregate by the row/column combinations. Default is ‘mean’, common options include ‘count’, ‘sum’, ‘median’, ‘min’ and ‘max’
  • margins — Default is False. Setting to True will perform row-wise and column-wise aggregations for each category, giving us a ‘total’ row and column

Pandas Pivot Tables vs. Excel Pivot Tables

Thankfully, these Pandas and Excel are quite similar.

For those of you coming from Excel, the two tools are quite similar. The biggest difference is that there is no ‘Filter’ argument in the Pandas equivalent, but we can easily filter our DataFrame before calling the pivot_table() method. As seen above, we first use the query method to filter down to our top 10 countries before pivoting.

Notice that we have one NaN value, for Brazil (‘BRA’) Aviation customers in the second row of our DataFrame. We’ll take a look at that later.

Heatmaps + Pivot Tables

Let’s take a quick look at a heatmap on the pivot table above, which I’ve assigned to the variable ‘hotels_pivot’. I’m first importing the seaborn library, then using its heatmap function on my pivot table, specifying a colormap of “RdYlGn”, which is more familiar to many of us than the default.

This heatmap… isn’t great. The only place we see really color pop significantly is in the ‘Other’ column which dominates revenue, and in our ‘All’ columns, which aggregate our rows and columns and thus have much larger values. I’m going to tweak my pivot table code to use ‘mean’ instead of ‘sum’. This will allow us to see which types of customers generate the most revenue on average.

Ahhh, much better.

By switching from ‘sum’ to ‘mean’, we can generate some interesting insights quickly with the heatmap. Not surprisingly, ‘Complementary’, which means free (or perhaps greatly reduced price), has by far the lowest average revenue. There is an argument to made for dropping this column from our analysis altogether.

The ‘Aviation’ segment tends to have the highest average revenue per booking, especially Spanish (ESP) Aviation customers. We also see that ‘Groups’ from the Netherlands (‘NLD’) have a nice average revenue. Finally, looking at the right hand column, the USA, Brazil (BRA), and Portugal (PRT) tend to have lower total revenues on average than the other countries.

The white color for Brazil here indicates a missing value, let’s take a look at an argument that can help fix that.

Missing Values:

If we have cells with NaNs, that indicates that no data for the specific combination occurred. We can use fill_value= to pass a scalar into missing values. In general I tend not to use this argument as it can misrepresent data, but in some cases it can be useful:

Here we’ve used fill_value = 0

Note that using 0 here makes Brazil really stand out within the aviation category. The color of the missing cell was white previously, and personally I prefer that. Using the mean of Aviation wouldn’t raise as many eyebrows, but is also misleading.

Pivot Tables Are Great for Plotting

I’m going to create a new pivot table. This time I’m going to perform a little time series analysis. I’m going to extract month from the ‘BookingDate’ column, then use month as the rows, and a subset of nationalities (“Germany”, “Spain”, and “France”) as columns.

This naturally creates a DataFrame that is well suited for plotting. Because most plotting libraries use the DataFrame index as the x-axis, and columns as y-axis categories, we can easily prepare data for plotting with pivot tables.

Germans prefer to travel in October, when weather starts to get chilly!

Above, we can see that the French and Spanish head to Portugal in droves during August, Europe’s traditional vacation month, while Germans tend to wait until fall to make their trips. Understanding this pattern could help us better market to and cater services to customers on a seasonal basis based on nationality.

The same applies to bar charts as well. Without changing the structure of our DataFrame at all, we can create grouped or stacked bar charts easily. Each column serves as either a group or a stack.

Above, all we’ve done is call the bar method, and passed in stacked=True to generate a stacked bar chart. Everything else is the same. If we omitted stacked=True, a grouped bar chart would be created instead. We could also use a lambda function to create a 100% stacked bar chart to show how each country contributes to total revenue. Let’s take a look below.

The above code calculates the percentage that each cell within a row contributes to the row total, and all rows sum to 100. For example, in month 1, Germany made up 20.7% of revenue, while France made up 58.2%. Let’s plot this!

We can now see that Spain is traditionally the major revenue provider, while Germany’s surge in October makes them the dominant tourist group then.

Multi-Index Pivot Tables

It’s also possible to have multiple levels for both our index (rows) and columns. I don’t love doing this in most cases as creates a pretty complex DataFrame, but it can be useful for doing super granular analyses.

First I’m going to create bins for the “Age” column, to breakdown our demographics a bit more.

This allows us to quickly see that our ‘Aviation’ and ‘Complementary’ customers are overwhelmingly in our older age groups. Passing this into a heatmap could yield some more interesting insights.

We could add a secondary (or third, etc.) level to columns as well by passing multiple columns into a list in the columns argument as well.

Multiple Aggregations

Finally, we can also calculate multiple aggregations for our columns by passing multiple functions into the aggfunc argument.

Here, I’m passing both sum and mean as elements in a list to the aggfunc argument.

This essentially creates a duplicate pivot table, as we can see sum on the left for all categories, before repeating for mean on the right. I clipped the table on the right hand as it’s pretty long.

Summary

Pivot tables are an awesome analytical tool that are typically associated with Excel. But they have some really nice properties that make them excellent analytical tools, regardless of the software you’re using. Next time you need to plot a multi-series dataset, want a nice summary table, or want to make a table pop with a heatmap, consider Pivot Tables.

If you liked this, we’d love for you to subscribe to our publication.

And drop us a comment letting us know what you’d like to see next. You can also submit your own writing for us to promote, so don’t be shy if you would like to become a contributor. Check out submission instructions here.

Happy learning!

--

--