Part1: Stack, Unstack
This page is the first part of the blog Reshaping data in pandas. click here to go back to main page.
Stack
In pandas terminology stacking is referred as transforming horizontal column names to vertical column values . stack
function converts the data from a wide to long format.Below mentioned are the input parameters to the function.
level
: Prescribed level(s) to stack from column axis onto index axis (int,str, list, default -1).dropna
: Whether to drop rows with missing values in the resulting frame, defaults (bool, default True).
Let us start by anaylsing the sample dataset shown below.
# Sample data
df_sensors
We can see that the two year values 2017
and 2018
are present as column names, but ideally both of them depict a single variable which can named as Year
. We can resturcture the two Year values as a single column using stack
as shown below.
# Stacking the Year columns into a single column.
(df_sensors.set_index(['Sensor', 'Metric'])
.stack()
.reset_index()
.rename(columns={'level_2': 'Year', 0: 'Value'}))
Let us now understand the above code snippet in a step wise manner
Step1:
# Applying stack on sample data
df_sensors.stack()
We can see that by default stack
takes in all the columns and converts them into a single vertical column at the inner most level. The above output is a series with hierarchical index, all the column names are added as an index at the innermost level and each row is repeated as values.
If there are any columns which we want to stay in the index we have to explicitly set them using set_index
method before performing the stack
.
Step 2:
Let us now apply stack
by setting the Metric
column as index and analyze the output.
df_sensors.set_index(['Metric']
df_sensors.set_index(['Metric']).stack()
The output of above step is a series
with hierarchical index. In order to convert it to a data frame we have to use pandas helper method reset_index
# Processing the stacked output
(df_sensors
.set_index(['Metric'])
.stack()
.reset_index())
We can see that the output is still messy i.e in level_1
column there are multiple data types hence the data is still not in a desired form.
Step 3:
In order to convert the multiple year columns to a single year column we have to set the columns Sensor
, Metric
as index and apply stack
.
# Stacking the Year Columns
df_sensors.set_index(['Sensor', 'Metric']).stack()
The above output is a hierarchical indexed series which is in the desired form. In order to convert it to a dataframe we have to use the helper method’s reset_index
and rename
as shown below:
# Processing the stacked output
(df_sensors.set_index(['Sensor', 'Metric'])
.stack()
.reset_index()
.rename(columns={'level_2': 'Year', 0: 'Value'}))
When more than one column header is present in the dataset we can stack the desired column header using the level
parameter of the stack function as hown below:
Let us add a column header named Value
on top of the Year columns .
# Add a column level `Value` on top of Year columns
df_sensors2 = df_sensors.set_index(['Sensor', 'Metric'])
df_sensors2.columns = pd.MultiIndex.from_product([['Value'], df_sensors2.columns])
df_sensors2
Level 0
: [Value]
Level 1
: [ 2017, 2018]
Now we can stack the two year columns by specifying the level parameter while applying stack
# Using Level parameter to stack the year columns
df_sensors2.stack(level=1)
# Processing the output
(df_sensors2
.stack(level=1)
.reset_index()
.rename(columns={'level_2':'Year'}))
Points to Remember:
By default stack takes in all columns present at the inner most level in the column axis and stacks them row wise . If you want any column to stay in index you need to set your index column explicitly using
pd.DataFrame.set_index
method before performingstack
and then usepd.DataFrame.reset_index()
to convert the output to a data frame.
Unstack:
unstack
is an inverse operation of stack
i.e it converts a data frame from long to wide format. By default it takes the inner most index values and returns a data frame by reshaping them as the columns. Below are the parameters for the unstack
method.
level
: level(s) of index tounstack
, defaults to -1 (int or string or list of these)fill_value
: replace NAN with value specified ifunstack
produces missing values.
In order to understand the unstack
method let us take the stacked output and see how we can invert the stack
operation.
# Loadind stacked data into a dataframe
df_stacked = (df_sensors.set_index(['Sensor', 'Metric'])
.stack()
.reset_index()
.rename(columns={'level_2': 'Year', 0: 'Value'}))df_stacked
We can now invert the stacked output by following the below three steps:
- Set the columns [
Sensor
,Metric
andYear
] as index. - Apply
unstack
. - Process the output to a consumable form.
# Setting the index columns
df_stacked.set_index(['Sensor', 'Metric','Year'])
# Applying unstack
(df_stacked
.set_index(['Sensor', 'Metric','Year'])
.unstack())
# Processing the Output
dfx = (df_stacked
.set_index(['Sensor', 'Metric', 'Year'])
.unstack())
dfx.columns = dfx.columns.droplevel([0])
dfx.rename_axis([None], axis=1).reset_index()
Let us see some more examples of unstack
for better understanding:
- Unstacking
sensors
metric values as columns : We can useunstack
operation on the stacked data frame to obtain the desired result as shown below.
# Setting Metric as innermost index level
df_stacked.set_index(['Sensor', 'Year', 'Metric'])
# Unstacking the Metric column values.
df_stacked.set_index(['Sensor', 'Year', 'Metric']).unstack()
# Processing the output
dfy = (df_stacked
.set_index(['Sensor', 'Year', 'Metric'])
.unstack())
dfy.columns = dfy.columns.droplevel([0])
dfy.rename_axis([None], axis=1).reset_index()
2. Unstacking both Year
and Metrics
as columns : Using the level
parameter we can unstack
methods to reshape multistack
function converts a data frame from wide to long format,ple column values as individual hierarchical columns as shown below.
# Unstacking at multiple index levels at a time.
(df_stacked
.set_index(['Sensor', 'Year', 'Metric'])
.unstack(level=[-3,-2]))
# Changing the order of the unstacked levels.
(df_stacked
.set_index(['Sensor', 'Year', 'Metric'])
.unstack(level=[-2,-1]))