Part2: Melt, Pivot and Pivot_table
This page is the secondpart of the blog Reshaping data in pandas. click here to go back to main page
Melt
Pandas has different ways to accomplish the same task, the difference being readability and performance. Pandas data frame method name melt works similarly to stack but gives more flexibility.The method takes in the below 5 parameters out of which two parameters namely id_vars and value_vars are crucial to understand how to reshape your data.
id_vars: list of column names that you want to preserve as columns and not reshape (list,tuple or ndarray)(optional).value_vars: list of column names that you want to reshape as columns (list,tuple or ndarray) (optional).var_name: Name to use for the variable column, defaults tovariable.(scalar)(optional)value_name:Name to use for the value column, defaults to `value`(scalar)(optional).col_level:If column are multi index then use this level to melt (int or string)(optional).
All of the above parameters mentioned are optional. Let us try to understand the usage of each of the parameters by applying it to our sample dataset.
Working of Melt:
- The
id_varsor the identification variables remain in the same column but repeat for each of thevalue_vars - One crucial aspect of
meltis that it ignores the values in index, in fact it drops the existing index and replaces it with theRangeIndex. so if you have values in index that you want to keep, you need to do areset_indexbefore applyingmelt.
Let’s apply the melt function on our sample sensors data frame.
# Sample data
df_sensors
# Applying Melt on sample data
df_sensors.melt(id_vars=['Sensor', 'Metric'],
value_vars=['2017', '2018'])
melt works similar to stack but gives more flexibility. By default melt assigns variable and value as the names for value_vars, this can be overrided by passing the desired column names as an input to the parameters var_name and value_name as shown below.
# Setting the names of variable and value columns.
df_sensors.melt(id_vars=['Sensor', 'Metric'],
value_vars=['2017', '2018'],
var_name='Year',
value_name='value')
Point to Remember: The advantage of
meltoverstackis that you can mention the column name which you want to preserve in the index without explicitly setting them as index .
Pivot:
pivot method is similar to unstack but is more easy to use. It takes in 3 parameters (mentioned below) as input which are index, columns and values
index: Column to use for the new frames index, if none use current index(string or object).columns: Column to use for the new frames columns (string or object).values: Column(s) to use for the new frames values (string or object).
Working of pivot:
- Each parameter takes in a single column as a string.
- The
indexremains in the vertical and becomes the new index. - The values of the columns referenced by
columnsbecomes new column names. - The values referenced by the
valuesare tiled to correspond the intersection of their former index and columns label.
Let us consider the previous melted output and see how to apply pivot in order to reshape it to original form.
# loading melted data as a dataframe
df_melted = df_sensors.melt(id_vars=['Sensor', 'Metric'],
value_vars=['2017', '2018'],
var_name='Year',
value_name='value')
df_melted
We cannot use pivot directly to pivot the Year column , as mentioned above pivot raises an ValueError when there are duplicate values in the index columns.
# Pivot raises a ValueError
df_melted.pivot(index=['Sensor', 'Metric'], columns='Year', values='value')
In order to do that we’ll have to set the index columns explicitly before applying pivot as shown below.
# Setting the index and applying pivot
(df_melted
.set_index(['Sensor', 'Metric'])
.pivot(columns='Year')['value'])
# Processing the pivot output
(df_melted
.set_index(['Sensor', 'Metric'])
.pivot(columns='Year')['value']
.reset_index()
.rename_axis([None], axis=1))
Points to Remember:
pivotraises aValueErrorwhen any index or column combinations has duplicate values.
pivotcannot take more than one index at a time.In order to achieve that you will have to explicitly set the index and apply pivot on columns.
Pivot_table:
pivot_table is a versatile and flexible function. Below mentioned are the list of the input parameters to the fucntion.. Below mentioned are the input parameters to the function.
index: column(s) which is intended to stay as index. (column, list, array, Grouper)columns: column(s) which are pivoted (column, list, array, Grouper).values: column to aggregate.aggfunc: function, list of aggregation functions.(defaults tomean)fill_value: Scalar to fill for missing values in the result (scalar, default None).margins: whether to add all rows/columns (Bool, default False )(eg: subtotal or grand total)dropna: Do not include columns whose values for all rows are NaN (Bool, default True).margins_name: Name of the row / column that will contain the totals when margins is True (string, default All).
Working of pivot_table:
pivot_tableis a generalization ofpivotthat can handle duplicate values for one pivoted index/column pair. Specifically, you can givepivot_tablea list of aggregation functions using keyword argumentaggfunc. The defaultaggfuncofpivot_tableisnumpy.mean.pivot_tablealso supports using multiple columns for the index and column of the pivoted table. A hierarchical index will be automatically generated for you.pivot_tablealso has some additional default parameters namedmargins,margins_namewhich indicates whether to add rows and columns total in the output or not.
Let us see how can we apply pivot_table to the previous melted output in order to convert it to original form.
# loading melted data as a dataframe
df_melted = df_sensors.melt(id_vars=['Sensor', 'Metric'],
value_vars=['2017', '2018'],
var_name='Year',
value_name='value')
df_melted
# Applying pivot_table on melted output
(df_melted
.pivot_table(index=['Sensor', 'Metric'],
columns='Year',
values='value'))
# Processing the Output
(df_melted
.pivot_table(index=['Sensor', 'Metric'],
columns='Year',
values='value')
.rename_axis([None],axis=1)
.reset_index())
pivot_table is widely used to summarize the numerical data. You can perform multiple aggregations like count, sort, average, total etc using pivot_table . We can use the parameters aggfunc, margins to perform any no of aggregations at a time on the chosen column values.
# Performing aggregations using aggfunc.
df_melted.pivot_table(index=['Sensor', 'Year'],
columns='Metric',
values='value',
aggfunc=[np.sum, np.mean],
fill_value=0,
margins=True,
margins_name='Total')
Key Differences between pivot and pivot_table:
pivotis used for pivoting without aggregation. Therefore, it can’t deal with duplicate values for one index/column pair whereaspivot_tablecan be used for both the purposes.pivotdoesn’t accept a list for indexpivot_tableaccepts.
