Pandas: Pivot & Multiindex
Pivot Table, Stack/ Unstack are essential pandas methods to work with MultiIndex objects
Creating a robust prediction engine starts with creating the right dataset. In most of real-life data science problems, we don’t get a clean & well-prepared dataset for model building. During this whole process,
- we define the problem and scope,
- define and create the response variable, derive new features,
- do a lot of data processing & visualization.
And all the above steps need pandas again. In the last exercise, we discussed index Pandas Index Explained, which is crucial to understand pandas better. The previous article explained index usage for selection and filtering. In this exercise, we see the usage of Stack/Unstack, Pivot_table, CrossTab for data processing & visualization. We will be using the UCI Machine Learning Adult Dataset,
Business Problem: Classification (a person earns more than 50k or less); Response feature: Label ; Predictors: country, age, education, occupation, marital status etc.
The following are explaining snippets from the notebook, GitHub repo for this exercise can be cloned from the link at the end. Reading the data now😌
Some of the predictor features are age, workclass, education, marital_status, race, sex, native_country, hours_per_week and response feature is label. Also, we are replacing label values <=50K, >50K by 0 and 1 respectively.
The bi-variate analysis explains and validates the relationship between features. And categorical features need groupby and apply functions to understand their relationship with numeric and other categorical features. In this context Pandas Pivot_table, Stack/ Unstack & Crosstab methods are very powerful.
Pivot_table
It takes 3 arguments with the following names: index, columns, and values. Categorical Features from the dataset can be passed in index and columns. The cell values of the new table are taken from a column given as the values parameter.
Coming back to our dataset, usually, earnings are frequently associated with the occupation. So, let’s pivot the occupation and see which occupations have high-earning people.
Multiple columns can also be used during pivoting, we can use pivot_table like the following for multiple columns.
Also, check pt.index and pt.columns, it will print the index and columns of pivot_table. Multi- index columns can be reduced to a single level and can also be renamed like following:-
CrossTab
pd.crosstab can be used just like pivot_table, but by default, it computes a simple frequency table of two (or more) categorical features. Because values are not mandatory for crosstab, but values and aggregation function can be passed, if required.
Let’s look at a simple example of crosstab and plot it. Education is also one important factor in estimating earnings.
Crosstab can be used just like pivot_table by passing values and aggregation parameters. Following is a Bar plot of people earn more than 50K across education and sex(male/female) features with the help of crosstab
pd.cross tab also has a parameter Normalized. While we do bivariate analysis, frequency tables can also be normalized to show percentages rather than counts:
pd.crosstab(adult.education,adult.label_50, normalize = True)
Stack/ Unstack
Stacking a DataFrame means moving the innermost column index to become the innermost row index. The inverse operation is called unstacking.
What is Stacking? -If we have multiple-index columns and we reduce dataset columns by moving the innermost column level to the innermost row level. So the stack is cols to rows(innermost).
What is Unstacking?- If we have multiple-index rows and we reduce dataset rows by moving the innermost row level to innermost column level. So unstack is rows to cols(innermost).
In our downloaded data, we don’t have any multiple indexes, but our pivoted dataset has multiple indexes. Viewing the pivot_table dataset
stack moves the inner level column to inner level rows and unstacking can again be brought it back to the original pivot shape.
Stacking/Unstacking is possible at different levels directly with the parameter level.
These methods really help us bring those insights and model explanations to come out. Hierarchical / Multi-level indexing is very exciting as it opens the door to some quite sophisticated data analysis and manipulation, especially for working with higher dimensional data. Jupyter nb for this exercise can be downloaded from the Github repo.
Thanks for reading. If you have liked this article, you may also like Pandas Index -Invisible Soul, Scaling & Transformation When & Where