Pandas For Beginners: Reshaping Dataframes (Part 2)
In the last tutorial, we demonstrated the Pandas melt function to reshape the wide-data form into the long-form. This tutorial will focus on the opposite process. To bring everyone on the same page, presenting a visual of the problem we are trying to solve:
Most of us have worked on scenarios like these when using Microsoft excel. Yes, the wide-form of the table resembles the famous pivot tables. We commonly use them to summarize our data based on a specific dimension (expense category in our case). To perform these summarization tasks, Pandas offer a ready-made function pivot_table. A step by step approach to implement this function is detailed below:
Assumption and Recommendation
Being hands-on is the key to master programming. We recommend that you continue to implement the codes as you follow through with the tutorial. The sample data and the associated Jupiter notebook are available in the Scenario_3 folder of this GitHub link.
Following is the list of Python concepts and pandas functions/ methods used in the tutorial:
Let’s get started
Step 1 — Keeping the data ready
For this tutorial, we have created a dummy dataset containing the monthly expenses incurred by a family across different expense categories for the years: 2018, 2019, and 2020. The dictionary for this data set and the sample data snapshot is as follows:
- Year — Calendar Year
- Month — Month of the year
- Expense Category — Categories across which the family incurred expenses
- Expense Value — Expenses value incurred by the family
Step 2 — Importing pandas package and the data set in Python.
Once you have the data available, the next step is to import it to your Python environment.
#### Sample Code
#### Importing Pandas
import pandas as pd#### Importing Data File - Change the Windows Folder Location
imp_data = pd.read_csv("C:\\Ujjwal\\Analytics\\Git\\Scenario_3\\Expenses_Long.csv")
We have used the Pandas read_csv function to read the data in Python.
Step 3 — Creating the wide-form of the expense table (without aggregation).
Once we have read the data, apply the Pandas pivot_table function to create the wide-form of the input dataframe.
#### Create the wide-form
Wide_data = pd.pivot_table(imp_data,index=["Year","Expense Category"], columns="Month",values = "Expense Value").reset_index()#### Sample Output
The pivot_table function expects the following parameters as input:
- Dataframe name — The first argument to the pivot_table function is the dataframe on which we have to apply the operation.
- Index— This is synonymous with the rows section of excel’s pivot table. Column names passed to the index parameter will form the rows of the final dataframe.
- Columns — This is synonymous with the columns section of excel’s pivot table. Column names passed to the column parameter will form the new columns of the final dataframe.
- Values — This is synonymous with the values section of excel’s pivot table. Column names passed to the values parameter will be used to fill in the cells of the final dataframe
Points to note
- reset_index() — We have used the reset_index function to remove the multilevel index created by pivot_table.
- To pass multiple columns to index, column, or value parameters, use them in the form of lists.
- Just like the melt function, pivot_table also returns a new pivoted dataframe as the output. Capture this output in a new variable (Wide_data in our case) for future use.
Did you know that the pivot table is one of the most common features of Microsoft excel? Quick implementation of data wrangling steps in Python can substantially reduce our model building life cycle. I hope that the solution presented above was helpful.
Do you know how to aggregate the data when converting it into long-form? Can you calculate the average expense per month in 2018, 2019, and 2020 across different expense categories? For the answer, check the code provided in the GitHub repository.
HAPPY LEARNING ! ! ! !