Reshaping Data with Pandas (Part 1)

Tom McKenzie
8 min read · Sep 15, 2021


In this Part 1 we’ll do a whirlwind 🌪 tour of tools for pulling and stretching the shape of tabular data, including the following methods:

  • melt() / stack()
  • pivot() / unstack()
  • pivot_table()

In Part 2 we look at aggregating and grouping data as a means of reshaping tables into more condensed forms.

Introduction

Being able to quickly manipulate the shape of tabular data is an essential skill for any data practitioner. This article will take a brief look at the methods available in the python data wrangling library pandas for molding and reshaping tables of data. If the data tables are like Rubik’s cubes, these methods are your available moves.

Note that in pandas tabular data is stored in “dataframes”, so this terminology will be used throughout to refer specifically to the pandas objects which store our tabular data.

Data

For the examples given below, we’ll use the “iris” dataset available via the seaborn library.

# import Iris dataset from seaborn
import seaborn as sns

df = sns.load_dataset('iris')
The “iris” dataset.

From wide to long

Data comes in lots of shapes and sizes. For tabular data there are two common formats that are used in the wild: wide and long.

Wide format: lots of columns.

Long format: few columns, more rows.

The concept of “tidy” data generally refers to tables where one row is one observation, and each column is a property of that observation. This is the most succinct way to present and interpret data; however, there may be cases where you need to reshape tables that have lots of columns to a “long” format, or vice versa.

melt()

In pandas, this wide-to-long transformation can be achieved using the melt() method of a dataframe. To control the transformation we specify the column(s) we want to keep (id_vars) and the column(s) we wish to “melt” into row variables (value_vars). A new column (named “value” by default) is then created to store the data originally found under each value_vars column.

The melt() method will thus result in a dataframe with one or more indexing columns followed by two extra columns: a “variable” column and its associated “value” column.

For example, let’s melt all of the numerical columns in the iris dataset, keeping the “species” as the indexing column:

df_melt = df.melt(
    id_vars='species',
    value_vars=['sepal_length', 'sepal_width', 'petal_length', 'petal_width'],
    var_name='measurement_type',
    value_name='unit_value',
)

Note that both id_vars and value_vars can be either a single column name or a list of column names. The keyword arguments var_name and value_name are used to rename the resulting columns on the fly.

Our “melted” iris dataframe.
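As a quick sanity check on how melt() multiplies rows, here is a minimal sketch on a tiny hypothetical dataframe (a two-row stand-in for the iris data, not the real dataset): two rows with two value columns melt into four rows with three columns.

```python
import pandas as pd

# a tiny stand-in for the iris data: one id column, two value columns
tiny = pd.DataFrame({
    'species': ['setosa', 'versicolor'],
    'sepal_length': [5.1, 7.0],
    'petal_length': [1.4, 4.7],
})

tiny_melt = tiny.melt(
    id_vars='species',
    value_vars=['sepal_length', 'petal_length'],
    var_name='measurement_type',
    value_name='unit_value',
)

# 2 rows x 2 value columns -> 4 rows; columns are id + variable + value
print(tiny_melt.shape)  # (4, 3)
```

In general, melting n rows over k value columns yields n × k rows, with one identifier column per id_var plus the variable/value pair.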

stack()

Similar to melt(), the stack() method is an alternative that uses the dataframe’s current index as the row index and collapses the column labels into a new innermost index level, producing a Series of values (hence the to_frame() call below to get a dataframe back).

For example, if we first set “species” as the dataframe index we can then “stack” the other columns:

df.set_index('species').stack().to_frame()
Using stack() to stack all the numerical columns on top of each other (as rows).

Other than a dropna boolean flag (which drops resulting rows whose values are all NaN), the only argument stack() takes is level: which column level to stack for a dataframe with MultiIndex columns. The default is the innermost level (level=-1). Let’s set up a dataframe with MultiIndex columns and take a look at the behaviour.

import pandas as pd

arrays = [
    ['sepal_length1', 'sepal_width1', 'petal_length1', 'petal_width1', 'species'],
    ['sepal_length2', 'sepal_width2', 'petal_length2', 'petal_width2', 'species']
]
mult_idx = pd.MultiIndex.from_arrays(arrays)

df_mult = df.copy()
df_mult.columns = mult_idx
Hierarchical index dataframe
The iris dataframe with a fake hierarchical column multiindex.

If we set “species” as the index and then stack on the first column level we collapse the first column names into row variables, leaving the second column level to remain as columns:

Stacking the hierarchical dataframe on index level 0.
Stacking on the first column level with df_mult.set_index('species').stack(level=0).

Likewise, we can collapse the second column level by passing level = 1:

Stacking the hierarchical dataframe on index level 1.
Stacking on the second column level with df_mult.set_index('species').stack(level=1).

These particular examples aren’t especially interesting as the column names are just duplicates of each other, but they show the general behaviour of using stack() with hierarchical/multiindex column dataframes.

From long to wide

I find something uniquely intuitive about “melting” dataframes, but unfortunately there is no “unmelt” in pandas: reversing the melt operation can take a bit more care over how duplicate values should be handled.

To achieve a long-to-wide reshaping we can use the pivot(), unstack(), or (sometimes) pivot_table() methods. These can be a bit trickier to handle as it becomes very important to take notice of whether or not you will have duplicated indexes. In these cases the methods won’t know where to put certain values, leading to the dreaded ValueError!

pivot()

When we don’t want to aggregate any data but only want to reshape the rows/columns, we can use pivot(). However, if we try to use pivot to reshape our melted dataframe from above directly we run into the following error:

ValueError: Index contains duplicate entries, cannot reshape.

This happens because, for a single species (e.g. “setosa”), there are multiple entries for petal_length, so pandas doesn’t know which row each of the _length and _width values should go in.
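To see the failure concretely, here is a minimal reproduction using a small hand-built dataframe (a stand-in for the melted iris data) with a duplicated species + measurement_type pair:

```python
import pandas as pd

# two rows share the same species + measurement_type -> duplicate index entries
dup = pd.DataFrame({
    'species': ['setosa', 'setosa'],
    'measurement_type': ['petal_length', 'petal_length'],
    'unit_value': [1.4, 1.3],
})

try:
    dup.pivot(index='species', columns='measurement_type', values='unit_value')
except ValueError as err:
    # pivot() cannot decide where 1.4 vs 1.3 should go for (setosa, petal_length)
    print(err)
```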

One way to overcome this is to modify the original dataframe. If we assume that each row in the original dataframe corresponds to a specific “iris”, we can assign each a unique ID. To do this we’ll just use the dataframe index, but move it to be an actual column (with the column name “index”).

Now we can melt the dataframe but pass both the “index” and the “species” as the id_vars:

df_melt2 = df.reset_index().melt(
    id_vars=['index', 'species'],
    value_vars=['sepal_length', 'sepal_width', 'petal_length', 'petal_width'],
    var_name='measurement_type',
    value_name='unit_value',
)
Melted dataframe with index column.
Our new “melted” iris dataframe with index as a column.

Now, each “index + species” pair is unique, and so we can pivot the data on these two columns to return to our original, “unmelted”, dataframe:

df_melt2.pivot(
    index=['index', 'species'],
    columns='measurement_type',
    values='unit_value'
).reset_index()
Pivoted dataframe from the melted dataframe.
The result of pivoting our melted dataframe (without aggregation).

And now we’re back to the original dataframe with the integrity of each row retained! We can then either drop the “index” column or set it back to being the actual index if we like.

unstack()

Like stack(), unstack() works from the index of the dataframe, but performs a long-to-wide transformation (i.e. it is the opposite of stack()). It too will throw an error if there are duplicates in the index.

To take the above df_melt2 long-format dataframe from long to wide, we can set the index to all three of the "index", "species", and "measurement_type" columns (guaranteeing it is unique) and then unstack. The default level is the innermost index level, which in our case is "measurement_type", exactly what we want:

df_melt2.set_index(['index', 'species', 'measurement_type']).unstack()
Result of unstacking the long-format dataframe
Unstacking the long-format (“melted”) dataframe.

If we wanted to “unstack” one of the other indexing columns we can use the level argument to specify which one (remember, unstack will convert one column into multiple columns). The levels are ordered in the way we defined the index, i.e. ["index", "species", "measurement_type"], so to unstack the "species" column we pass level = 1:

df_melt2.set_index(['index', 'species', 'measurement_type']).unstack(level=1)

Which looks like this:

Unstacking dataframe with index specified.
Unstacking while specifying the level.

Note that because of the unique ID column (“index”) this won’t combine or aggregate data for each “iris” in each species (hence the NaNs). To do that we will need some form of aggregation…

pivot_table()

The pivot_table() method is slightly different from each of the above methods in that it applies aggregation across multiple rows of the data. It can be used where we have duplicated indexes and we don’t mind combining (“aggregating”) them. By default it only aggregates numeric columns, and it is meant as an analogue of spreadsheet-style pivot table functions.

Using our melted dataframe df_melt we can pivot and aggregate with the following pivot_table() call:

df_melt.pivot_table(
    values='unit_value',
    index='species',
    columns='measurement_type',
    aggfunc='mean'
)

Here we specify the column on which we wish to perform the aggregation as values, the index (the column or columns we wish to keep as rows), and the columns (the column or columns whose values we want to convert to columns).

The aggfunc is the aggregate function to apply to the multiple values associated with each "species + measurement_type" combination. The default is the 'mean', but other useful functions might be 'median', 'count' or even 'nunique' (number of unique values).

Result of using pivot_table on long-format dataframe
Performing numerical aggregation while pivoting with pivot_table.

We can even pass multiple aggregate functions as a list (e.g. aggfunc=['mean', 'count', 'nunique']) and end up with a very wide dataframe.
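As a sketch of what that looks like, here is a small hand-built long-format frame (not the full iris data) pivoted with two aggregate functions; the resulting columns become a two-level MultiIndex of (aggfunc, measurement_type) pairs:

```python
import pandas as pd

# a small long-format stand-in: two species, two measurements each
long_df = pd.DataFrame({
    'species': ['setosa', 'setosa', 'versicolor', 'versicolor'],
    'measurement_type': ['petal_length'] * 4,
    'unit_value': [1.4, 1.3, 4.7, 4.5],
})

wide = long_df.pivot_table(
    values='unit_value',
    index='species',
    columns='measurement_type',
    aggfunc=['mean', 'count'],
)

# columns are now a two-level MultiIndex: (aggfunc, measurement_type)
print(wide.columns.tolist())
```

With the full iris data and four measurement types, three aggfuncs would give 3 × 4 = 12 columns, hence the "very wide" result.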

Bonus! Exploding a column

The explode() method in pandas is something I find myself using a lot. If you have a column whose values are actually Python ordered collections (e.g. lists or tuples, but not sets), i.e. there are multiple values per cell, you can “explode” them into individual rows, duplicating that row’s values in all other columns. Note that this will create duplicated instances of each row, so the table will no longer be normalized, but it will make it much easier to perform subsequent operations on the exploded column (e.g. string operations, numerical aggregations/calculations, etc.).

As an example, let’s mash together all of the numerical columns of the original “iris” dataset to create a column of list-type that is the collection of values.
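The original construction code isn't shown in the article; one way to build such a column (the name lengths_and_widths matches the explode() call below, but the exact construction here is an assumption) is to collect the numeric columns row-wise into lists. The two-row dataframe here is a stand-in for the full iris data:

```python
import pandas as pd

# a two-row stand-in for the iris dataframe loaded earlier
df = pd.DataFrame({
    'sepal_length': [5.1, 7.0],
    'sepal_width': [3.5, 3.2],
    'petal_length': [1.4, 4.7],
    'petal_width': [0.2, 1.4],
    'species': ['setosa', 'versicolor'],
})

num_cols = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
# collect the four numeric values into a single list per row
df['lengths_and_widths'] = df[num_cols].values.tolist()
```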

Dataframe with list values
The iris dataframe with values collected in a list for each row.

Now we can use explode() and specify the column we wish to “explode” out into separate rows:

df.explode(column='lengths_and_widths')
Dataframe after using explode method on list column
Wow! What an explosion!

Pay special attention to the index of the resulting dataframe, and how it has created duplicates of each initial row. The number of duplicates equals the length of the collection for that specific row, so the number of duplicated rows can differ from index to index. An empty collection will result in a single row with NaN in the exploded column.

Besides column, the only other argument the explode() method takes is an ignore_index boolean flag, which resets the index of the result to fresh unique values (0, 1, 2, …).
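A small sketch (on a hypothetical two-row frame, not the iris data) showing both behaviours mentioned above, the NaN produced by an empty collection and ignore_index resetting the index:

```python
import pandas as pd

# one row with a two-element list, one row with an empty list
small = pd.DataFrame({'id': ['a', 'b'], 'vals': [[10, 20], []]})

# without ignore_index, rows 0 and 1 would keep their original index labels;
# with it, the result gets a fresh 0..n-1 index
exploded = small.explode('vals', ignore_index=True)

# row 'a' becomes two rows (10 and 20); the empty list for 'b' becomes NaN
print(exploded)
```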

Summary

In summary, we can be specific with our reshaping of data with melt() and pivot(), or use the “index” of the dataframe for convenience with stack() and unstack(). If we need some aggregation of numerical data when reshaping we can turn to pivot_table(), and if you have collections as cell values take a look at explode().

I hope that has been helpful. In Part 2 I’ll look into more aggregation of data as a means of reshaping with groupby() and agg()!

This article originally appeared on my blog.


Tom McKenzie

Former synthetic chemist now enjoying design-focused data science.