Pandas: Pivot & Multiindex

Pivot Table, Stack/ Unstack are essential pandas methods to work with MultiIndex objects

Manu Sharma
Analytics Vidhya
5 min readSep 14, 2019

--

Out of clutter, find simplicity

Creating a robust prediction engine starts with creating the right dataset. In most of real-life data science problems, we don’t get a clean & well-prepared dataset for model building. During this whole process,

  • we define the problem and scope,
  • define and create the response variable, derive new features,
  • do a lot of data processing & visualization.

And all the above steps need pandas again. In the last exercise, we discussed index Pandas Index Explained, which is crucial to understand pandas better. The previous article explained index usage for selection and filtering. In this exercise, we see the usage of Stack/Unstack, Pivot_table, CrossTab for data processing & visualization. We will be using the UCI Machine Learning Adult Dataset,

Business Problem: Classification (a person earns more than 50k or less); Response feature: Label ; Predictors: country, age, education, occupation, marital status etc.

The following are explaining snippets from the notebook, GitHub repo for this exercise can be cloned from the link at the end. Reading the data now😌

Some of the predictor features are age, workclass, education, marital_status, race, sex, native_country, hours_per_week and response feature is label. Also, we are replacing label values <=50K, >50K by 0 and 1 respectively.

The bi-variate analysis explains and validates the relationship between features. And categorical features need groupby and apply functions to understand their relationship with numeric and other categorical features. In this context Pandas Pivot_table, Stack/ Unstack & Crosstab methods are very powerful.

Pivot_table

It takes 3 arguments with the following names: index, columns, and values. Categorical Features from the dataset can be passed in index and columns. The cell values of the new table are taken from a column given as the values parameter.

Coming back to our dataset, usually, earnings are frequently associated with the occupation. So, let’s pivot the occupation and see which occupations have high-earning people.

Exec Managerial, Prof-speciality and sales have the highest number of male and female people who earn more than 50K, it may come out as one important feature.

Multiple columns can also be used during pivoting, we can use pivot_table like the following for multiple columns.

multi-index columns

Also, check pt.index and pt.columns, it will print the index and columns of pivot_table. Multi- index columns can be reduced to a single level and can also be renamed like following:-

Single level columns after combining the previous two levels

CrossTab

pd.crosstab can be used just like pivot_table, but by default, it computes a simple frequency table of two (or more) categorical features. Because values are not mandatory for crosstab, but values and aggregation function can be passed, if required.
Let’s look at a simple example of crosstab and plot it. Education is also one important factor in estimating earnings.

frequency table of education and sex

Crosstab can be used just like pivot_table by passing values and aggregation parameters. Following is a Bar plot of people earn more than 50K across education and sex(male/female) features with the help of crosstab

cross tab plot: index on x-axis and values on the y-axis

pd.cross tab also has a parameter Normalized. While we do bivariate analysis, frequency tables can also be normalized to show percentages rather than counts:

Stack/ Unstack

Stacking a DataFrame means moving the innermost column index to become the innermost row index. The inverse operation is called unstacking.
What is Stacking? -If we have multiple-index columns and we reduce dataset columns by moving the innermost column level to the innermost row level. So the stack is cols to rows(innermost).
What is Unstacking?- If we have multiple-index rows and we reduce dataset rows by moving the innermost row level to innermost column level. So unstack is rows to cols(innermost).
In our downloaded data, we don’t have any multiple indexes, but our pivoted dataset has multiple indexes. Viewing the pivot_table dataset

Initial Pivot table

stack moves the inner level column to inner level rows and unstacking can again be brought it back to the original pivot shape.

Pivot table after stacking
Stack and unstack brought original pivot table

Stacking/Unstacking is possible at different levels directly with the parameter level.

These methods really help us bring those insights and model explanations to come out. Hierarchical / Multi-level indexing is very exciting as it opens the door to some quite sophisticated data analysis and manipulation, especially for working with higher dimensional data. Jupyter nb for this exercise can be downloaded from the Github repo.

Thanks for reading. If you have liked this article, you may also like Pandas Index -Invisible Soul, Scaling & Transformation When & Where

--

--