Reshaping Data with Pandas (Part 1)
In this Part 1 we’ll do a whirlwind 🌪 tour of tools for pulling and stretching the shape of tabular data, including the following methods:
- melt() / stack()
- pivot() / unstack()
- pivot_table()
In Part 2 we look at aggregating and grouping data as a means of reshaping tables into more condensed forms.
Introduction
Being able to quickly manipulate the shape of tabular data is an essential skill for any data practitioner. This article takes a brief look at the methods available in pandas, the Python data-wrangling library, for molding and reshaping tables of data. If data tables are like Rubik’s cubes, these methods are your available moves.
Note that in pandas tabular data is stored in “dataframes”, so this terminology will be used throughout to refer specifically to the pandas objects which store our tabular data.
Data
For the examples given below, we’ll use the “iris” dataset available via the seaborn library.
# import Iris dataset from seaborn
import seaborn as sns
df = sns.load_dataset('iris')
From wide to long
Data comes in lots of shapes and sizes. For tabular data there are two common formats that are used in the wild: wide and long.
Wide format: lots of columns.
Long format: few columns, more rows.
The concept of “tidy” data generally refers to tables where one row is one observation, and each column is a property of that observation. This is the most succinct way to present and interpret data; however, there may be cases where you need to reshape tables that have lots of columns to a “long” format, or vice versa.
melt()
In pandas, this wide-to-long transformation can be achieved using the melt() method of a dataframe. To control the transformation we specify the column(s) we want to keep (id_vars) and the column(s) we wish to “melt” into row variables (value_vars). A new column (named “value” by default) is then created to store the data originally found under each value_vars column.
The melt() method thus results in a dataframe with one or more indexing columns followed by two extra columns: a “variable” column and its associated “value” column.
For example, let’s melt all of the numerical columns in the iris dataset, keeping the “species” as the indexing column:
df_melt = df.melt(
    id_vars='species',
    value_vars=['sepal_length', 'sepal_width', 'petal_length', 'petal_width'],
    var_name='measurement_type',
    value_name='unit_value',
)
Note that both id_vars and value_vars can be either a single column name or a list of column names. The keyword arguments var_name and value_name are used to rename the resulting columns on the fly.
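To see the shape change concretely, here’s the same call on a tiny hand-built frame (a hypothetical stand-in for the iris data, so seaborn isn’t needed):

```python
import pandas as pd

# hypothetical miniature stand-in for the iris dataframe
df_small = pd.DataFrame({
    'species': ['setosa', 'virginica'],
    'sepal_length': [5.1, 6.3],
    'sepal_width': [3.5, 3.3],
})

df_melt_small = df_small.melt(
    id_vars='species',
    value_vars=['sepal_length', 'sepal_width'],
    var_name='measurement_type',
    value_name='unit_value',
)
# 2 rows x 2 melted columns -> 4 rows, with columns:
# species, measurement_type, unit_value
```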
stack()
Similar to melt(), the stack() method is an alternative that uses the current index of the dataframe as the row index and collapses all other columns into the “variable” and “value” columns.
For example, if we first set “species” as the dataframe index we can then “stack” the other columns:
df.set_index('species').stack().to_frame()
Other than a dropna boolean flag (to drop rows that are entirely NaN), the only argument stack() takes is level: which column level to stack for a dataframe with multiindex columns. The default is the innermost level (level=-1). Let’s set up a dataframe with multiindex columns and take a look at the behaviour.
import pandas as pd

arrays = [
    ['sepal_length1', 'sepal_width1', 'petal_length1', 'petal_width1', 'species'],
    ['sepal_length2', 'sepal_width2', 'petal_length2', 'petal_width2', 'species'],
]
mult_idx = pd.MultiIndex.from_arrays(arrays)
df_mult = df.copy()
df_mult.columns = mult_idx
If we set “species” as the index and then stack on the first column level (level=0), we collapse the first-level column names into row variables, leaving the second column level to remain as columns. Likewise, we can collapse the second column level by passing level=1.
These particular examples aren’t especially interesting, as the column names are just duplicates of each other, but they show the general behaviour of using stack() with hierarchical/multiindex column dataframes.
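As a self-contained sketch of this level behaviour (using a hypothetical two-level-column frame rather than the iris data):

```python
import pandas as pd

# hypothetical frame with two column levels
cols = pd.MultiIndex.from_arrays([
    ['a1', 'b1'],  # outer level (level=0)
    ['a2', 'b2'],  # inner level (level=1)
])
df_mult_demo = pd.DataFrame([[1, 2], [3, 4]], columns=cols)

# stack the outer level: 'a1'/'b1' move into the row index,
# leaving the inner labels 'a2'/'b2' as the remaining columns
stacked_outer = df_mult_demo.stack(level=0)

# stack the inner level (the default, level=-1)
stacked_inner = df_mult_demo.stack(level=-1)
```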
From long to wide
There’s something uniquely intuitive I find about “melting” dataframes, but unfortunately there is no “unmelt” in pandas; reversing the melt operation can take a bit more care about how things need to be handled.
To achieve a long-to-wide reshaping we can use the pivot(), unstack(), or (sometimes) pivot_table() methods. These can be a bit trickier to handle, as it becomes very important to notice whether or not you will have duplicated indexes. In those cases the methods won’t know where to put certain values, leading to the dreaded ValueError!
pivot()
When we don’t want to aggregate any data but only want to reshape the rows/columns, we can use pivot(). However, if we try to use pivot() to reshape our melted dataframe from above directly, we run into the following error:
ValueError: Index contains duplicate entries, cannot reshape
This happens because, for a single species (e.g. “setosa”), there are multiple entries for petal_length, and so pandas doesn’t know in which row to put each of the values for the different _length and _width variables.
One way to overcome this is to modify the original dataframe. If we assume that each row in the original dataframe corresponds to a specific “iris”, we can assign each a unique ID. To do this we’ll just use the dataframe index, but move it to be an actual column (with the column name “index”).
Now we can melt the dataframe, passing both “index” and “species” as the id_vars:
df_melt2 = df.reset_index().melt(
    id_vars=['index', 'species'],
    value_vars=['sepal_length', 'sepal_width', 'petal_length', 'petal_width'],
    var_name='measurement_type',
    value_name='unit_value',
)
Now, each “index + species” pair is unique, and so we can pivot the data on these two columns to return to our original, “unmelted”, dataframe:
df_melt2.pivot(
    index=['index', 'species'],
    columns='measurement_type',
    values='unit_value',
).reset_index()
And now we’re back to the original dataframe with the integrity of each row retained! We can then either drop the “index” column or set it back to being the actual index if we like.
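As a quick self-contained check of this round trip (again on a hypothetical miniature frame standing in for the iris data):

```python
import pandas as pd

# hypothetical miniature frame with a duplicated species
df_small = pd.DataFrame({
    'species': ['setosa', 'setosa'],
    'petal_length': [1.4, 1.3],
    'petal_width': [0.2, 0.2],
})

melted = df_small.reset_index().melt(
    id_vars=['index', 'species'],
    var_name='measurement_type',
    value_name='unit_value',
)

# pivot back: each ("index", "species") pair is unique, so no ValueError
restored = melted.pivot(
    index=['index', 'species'],
    columns='measurement_type',
    values='unit_value',
).reset_index()
# restored holds the same values as df_small, row for row
```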
unstack()
Similarly to the stack() method, unstack() works by using the index of the dataframe, but performs a long-to-wide transformation (i.e. the opposite of stack()). It will also throw an error if there are duplicates in the index.
To take the above df_melt2 long-format dataframe from long to wide, we can set the index to all three of the "index", "species", and "measurement_type" columns to ensure that it is unique, then unstack. By default, unstack() operates on the innermost index level, in our case "measurement_type", which is exactly what we want:
df_melt2.set_index(['index', 'species', 'measurement_type']).unstack()
If we wanted to “unstack” one of the other indexing columns we can use the level argument to specify which one (remember, unstack converts one index level into multiple columns). The levels are ordered the way we defined the index, i.e. ["index", "species", "measurement_type"], so to unstack the "species" column we pass level=1:
df_melt2.set_index(['index', 'species', 'measurement_type']).unstack(level=1)
Note that because of the unique ID column (“index”) this won’t combine or aggregate data for each “iris” in each species (hence the NaNs). To do that we will need some form of aggregation…
pivot_table()
The pivot_table() method is slightly different from each of the above methods in that it applies aggregation to groups of rows. It can be used where we have duplicated indexes and we don’t mind combining (“aggregating”) them. By default it aggregates only numeric values, and it is meant as an analogue of spreadsheet-style pivot tables.
Using our melted dataframe df_melt, we can pivot and aggregate with the following pivot_table() call:
df_melt.pivot_table(
    values='unit_value',
    index='species',
    columns='measurement_type',
    aggfunc='mean',
)
Here we specify the column on which we wish to perform the aggregation as values, the index (the column or columns we wish to keep as rows), and the columns (the column or columns whose values we want to convert to columns).
The aggfunc is the aggregate function to apply to the multiple values associated with each "species + measurement_type" combination. The default is 'mean', but other useful functions are 'median', 'count', or even 'nunique' (number of unique values).
We can even pass multiple aggregate functions as a list (e.g. aggfunc=['mean', 'count', 'nunique']) and end up with a very wide dataframe.
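For instance, on a hypothetical miniature melted frame, a list of aggregate functions produces a column MultiIndex of (function, measurement_type) pairs:

```python
import pandas as pd

# hypothetical miniature melted frame with duplicated combinations
melted = pd.DataFrame({
    'species': ['setosa', 'setosa', 'virginica', 'virginica'],
    'measurement_type': ['petal_length'] * 4,
    'unit_value': [1.4, 1.6, 5.5, 5.1],
})

wide = melted.pivot_table(
    values='unit_value',
    index='species',
    columns='measurement_type',
    aggfunc=['mean', 'count'],
)
# columns are ('mean', 'petal_length') and ('count', 'petal_length')
```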
Bonus! Exploding a column
The explode() method in pandas is something I find myself using a lot. If you have a column whose values are actually python collections (e.g. lists or tuples), i.e. there are multiple values per cell, you can “explode” them into individual rows, duplicating that row’s values in all other columns. Note that this creates duplicated instances of each row, so the table will no longer be normalized, but it makes it much easier to perform subsequent operations on the exploded column (e.g. string operations, numerical aggregations/calculations, etc.).
As an example, let’s mash together all of the numerical columns of the original “iris” dataset to create a list-type column holding the collection of values for each row.
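One way to build such a column (sketched on a hypothetical miniature frame; the frame and column names here are illustrative):

```python
import pandas as pd

# hypothetical miniature stand-in for the iris dataframe
df_small = pd.DataFrame({
    'species': ['setosa', 'virginica'],
    'sepal_length': [5.1, 6.3],
    'petal_length': [1.4, 5.6],
})

num_cols = ['sepal_length', 'petal_length']
# gather each row's numeric values into one list-valued cell
df_small['lengths_and_widths'] = df_small[num_cols].values.tolist()
df_small = df_small.drop(columns=num_cols)
```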
Now we can use explode() and specify the column we wish to “explode” out into separate rows:
df.explode(column='lengths_and_widths')
Pay special attention to the index of the resulting dataframe, and how it has created duplicates of each initial row. The number of duplicates equals the length of the collection for that specific row, so the number of duplicated rows may differ for each index. An empty collection results in a single row with NaN for the exploded column.
The only other argument available in the explode() method is an ignore_index boolean flag, which resets the index to unique values.
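A minimal sketch of both behaviours (on a hypothetical miniature frame with a list-valued column):

```python
import pandas as pd

df_small = pd.DataFrame({
    'species': ['setosa', 'virginica'],
    'lengths_and_widths': [[5.1, 1.4], [6.3, 5.6]],
})

exploded = df_small.explode('lengths_and_widths')
# index repeats per original row: [0, 0, 1, 1]

exploded_reset = df_small.explode('lengths_and_widths', ignore_index=True)
# index is regenerated as unique values: [0, 1, 2, 3]
```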
Summary
In summary, we can be specific with our reshaping of data with melt() and pivot(), or use the index of the dataframe for convenience with stack() and unstack(). If we need some aggregation of numerical data when reshaping we can turn to pivot_table(), and if you have collections as cell values take a look at explode().
I hope that has been helpful. In Part 2 I’ll look into more aggregation of data as a means of reshaping, with groupby() and agg()!
This article originally appeared on my blog.