Part2: Melt, Pivot and Pivot_table
This page is the secondpart of the blog Reshaping data in pandas. click here to go back to main page
Melt
Pandas has different ways to accomplish the same task, the difference being readability and performance. Pandas data frame method name melt
works similarly to stack
but gives more flexibility.The method takes in the below 5 parameters out of which two parameters namely id_vars
and value_vars
are crucial to understand how to reshape your data.
id_vars
: list of column names that you want to preserve as columns and not reshape (list,tuple or ndarray)(optional).value_vars
: list of column names that you want to reshape as columns (list,tuple or ndarray) (optional).var_name
: Name to use for the variable column, defaults tovariable.
(scalar)(optional)value_name
:Name to use for the value column, defaults to `value`(scalar)(optional).col_level
:If column are multi index then use this level to melt (int or string)(optional).
All of the above parameters mentioned are optional. Let us try to understand the usage of each of the parameters by applying it to our sample dataset.
Working of Melt:
- The
id_vars
or the identification variables remain in the same column but repeat for each of thevalue_vars
- One crucial aspect of
melt
is that it ignores the values in index, in fact it drops the existing index and replaces it with theRangeIndex
. so if you have values in index that you want to keep, you need to do areset_index
before applyingmelt
.
Let’s apply the melt function on our sample sensors data frame.
# Sample data
df_sensors
# Applying Melt on sample data
df_sensors.melt(id_vars=['Sensor', 'Metric'],
value_vars=['2017', '2018'])
melt
works similar to stack
but gives more flexibility. By default melt
assigns variable and value as the names for value_vars, this can be overrided by passing the desired column names as an input to the parameters var_name
and value_name
as shown below.
# Setting the names of variable and value columns.
df_sensors.melt(id_vars=['Sensor', 'Metric'],
value_vars=['2017', '2018'],
var_name='Year',
value_name='value')
Point to Remember: The advantage of
melt
overstack
is that you can mention the column name which you want to preserve in the index without explicitly setting them as index .
Pivot:
pivot
method is similar to unstack
but is more easy to use. It takes in 3 parameters (mentioned below) as input which are index
, columns
and values
index
: Column to use for the new frames index, if none use current index(string or object).columns
: Column to use for the new frames columns (string or object).values
: Column(s) to use for the new frames values (string or object).
Working of pivot:
- Each parameter takes in a single column as a string.
- The
index
remains in the vertical and becomes the new index. - The values of the columns referenced by
columns
becomes new column names. - The values referenced by the
values
are tiled to correspond the intersection of their former index and columns label.
Let us consider the previous melted output and see how to apply pivot
in order to reshape it to original form.
# loading melted data as a dataframe
df_melted = df_sensors.melt(id_vars=['Sensor', 'Metric'],
value_vars=['2017', '2018'],
var_name='Year',
value_name='value')
df_melted
We cannot use pivot
directly to pivot the Year
column , as mentioned above pivot raises an ValueError
when there are duplicate values in the index columns.
# Pivot raises a ValueError
df_melted.pivot(index=['Sensor', 'Metric'], columns='Year', values='value')
In order to do that we’ll have to set the index columns explicitly before applying pivot
as shown below.
# Setting the index and applying pivot
(df_melted
.set_index(['Sensor', 'Metric'])
.pivot(columns='Year')['value'])
# Processing the pivot output
(df_melted
.set_index(['Sensor', 'Metric'])
.pivot(columns='Year')['value']
.reset_index()
.rename_axis([None], axis=1))
Points to Remember:
pivot
raises aValueError
when any index or column combinations has duplicate values.
pivot
cannot take more than one index at a time.In order to achieve that you will have to explicitly set the index and apply pivot on columns.
Pivot_table:
pivot_table
is a versatile and flexible function. Below mentioned are the list of the input parameters to the fucntion.. Below mentioned are the input parameters to the function.
index
: column(s) which is intended to stay as index. (column, list, array, Grouper)columns
: column(s) which are pivoted (column, list, array, Grouper).values
: column to aggregate.aggfunc
: function, list of aggregation functions.(defaults tomean
)fill_value
: Scalar to fill for missing values in the result (scalar, default None).margins
: whether to add all rows/columns (Bool, default False )(eg: subtotal or grand total)dropna
: Do not include columns whose values for all rows are NaN (Bool, default True).margins_name
: Name of the row / column that will contain the totals when margins is True (string, default All).
Working of pivot_table:
pivot_table
is a generalization ofpivot
that can handle duplicate values for one pivoted index/column pair. Specifically, you can givepivot_table
a list of aggregation functions using keyword argumentaggfunc
. The defaultaggfunc
ofpivot_table
isnumpy.mean
.pivot_table
also supports using multiple columns for the index and column of the pivoted table. A hierarchical index will be automatically generated for you.pivot_table
also has some additional default parameters namedmargins
,margins_name
which indicates whether to add rows and columns total in the output or not.
Let us see how can we apply pivot_table
to the previous melted output in order to convert it to original form.
# loading melted data as a dataframe
df_melted = df_sensors.melt(id_vars=['Sensor', 'Metric'],
value_vars=['2017', '2018'],
var_name='Year',
value_name='value')
df_melted
# Applying pivot_table on melted output
(df_melted
.pivot_table(index=['Sensor', 'Metric'],
columns='Year',
values='value'))
# Processing the Output
(df_melted
.pivot_table(index=['Sensor', 'Metric'],
columns='Year',
values='value')
.rename_axis([None],axis=1)
.reset_index())
pivot_table
is widely used to summarize the numerical data. You can perform multiple aggregations like count, sort, average, total etc using pivot_table
. We can use the parameters aggfunc
, margins
to perform any no of aggregations at a time on the chosen column values.
# Performing aggregations using aggfunc.
df_melted.pivot_table(index=['Sensor', 'Year'],
columns='Metric',
values='value',
aggfunc=[np.sum, np.mean],
fill_value=0,
margins=True,
margins_name='Total')
Key Differences between pivot
and pivot_table:
pivot
is used for pivoting without aggregation. Therefore, it can’t deal with duplicate values for one index/column pair whereaspivot_table
can be used for both the purposes.pivot
doesn’t accept a list for indexpivot_table
accepts.