Slicing Hierarchical Indexes

Published in

Analytics Vidhya

5 min readMay 5, 2020

In the previous tutorial, we learnt some useful tools on how to get a more nuanced data analysis done on some datasets particularly if they are presented in higher dimensions. It is what we call Hierarchical Indexing or MultiIndexing. A DataFrame is said to be hierarchically indexed when it has more than one index column or higher dimensions. You can refresh your memory here just so you get a sense of where we are and what it is I will be talking about in this tutorial.

I will be getting into some useful analysis that can be done on MultiIndexed DataFrames. I will be using the same Fifa19 dataframe I used in the last post.

In [1]: import pandas as pd
In [2]: data = {'Position': ['GK','GK','GK', 'DF','DF','DF',
            'MF','MF','MF','CF','CF','CF'],   
            'Name': ['De Gea', 'Coutois','Allison','Van Dijk',
            'Ramos','Godin','Hazard','Kante','De Bruyne', 'Ronaldo'
            'Messi','Neymar'],
            'Overall': ['91','88','89','89','91','90','91',
            '90','92','94','93','92'],
            'Rank': ['1st','3rd','2nd','3rd','1st','2nd',
            '2nd','3rd','1st','1st','2nd','3rd']}
In [3]: fifa19 = pd.DataFrame(data, columns=['Position','Name',
                'Overall','Rank'])
In [4]: fifa19
Out[4]: 
             Position     Name        Overall     Rank
     0        GK         De Gea          91        1st
     1        GK         Coutios         88        3rd
     2        GK         Allison         89        2nd
     3        DF         Van Dijk        89        3rd
     4        DF         Ramos           91        1st
     5        DF         Godin           90        2nd
     6        MF         Hazard          91        2nd
     7        MF         Kante           90        3rd
     8        MF         De Bruyne       92        1st
     9        CF         Ronaldo         94        1st
     10       CF         Messi           93        2nd
     11       CF         Neymar          92        3rd

This Dataframe is from the player ratings of a few selected players fromEA Sport's FIFA19 dataset. The dataframe is not currently presented in the format we like it so I will quickly jump a few steps and get it Multiindexed and sorted. If you miss any step, you can refer to the previous post here.

In [5]: fifa19.set_index(['Position','Rank']).sort_index()
Out[5]:              Name     Overall
    Position  Rank  
     CF       1st    Ronaldo     94
              2nd    Messi       93
              3rd    Neymar      92
     DF       1st    Ramos       91
              2nd    Godin       90
              3rd    Van Dijk    89
     GK       1st    De Gea      91
              2nd    Coutois     89
              3rd    Allison     88
     MF       1st    De Bruyne   92
              2nd    Hazard      91
              3rd    Kante       90

Our data is now sorted hierarchical manner and ready to be analysed.

Sorting a hierarchically indexed dataframe is useful slicing. For instance, if we want to find out who is the best Centre-Forward among these lot according to the FIFA19 dataset, we will call the .locaccessor and pass a tuple ('CF' ,'1st') to extract the highest-ranked Forward in the DataFrame.

In [6]: fifa19.loc[('CF', '1st')]
Out[6]:
      Name    Ronaldo
      Overall 94
      Name: (CF, 1st), dtype: object

However, if we follow the tuple with a specific column name, ie, either ‘Name’ or ‘Overall’, we will extract a single element of the table.

In [7]: fifa19.loc[('CF', '1st'), 'Name']
Out[7]: 
       'Ronaldo'

Next, if we happen to call the .loc accessor with a single string, let’s say ‘GK’ alone, Pandas will then slice the outer index and return all the rows corresponding to string sliced, ie, ‘GK’.

In [8]: fifa19.loc['GK']
Out[8]:       Name      Overall
    Rank
     1st     De Gea     91
     2nd     Alisson    89
     3rd     Coutois    88

Another thing we can do is to extract the slice using a list of indexes. We do this by passing a list of two or more index strings after calling the .locaccessor. Pandas then returns all rows of those indexes.

In [9]: fifa19.loc[['GK', 'MF']]
Out[9]:              Name       Overall
 Position    Rank
       GK    1st     De Gea      91
             3rd     Coutois     88
             2nd     Alisson     89
       MF    2nd     Hazard      91
             3rd     Kante       90
             1st     De Bruyne   92

There are instances where we need to find out, for instance, only the highest-ranked players in some specific positions. Pandas makes it possible to do that using tuple consisting of a list and the inner index of choice. What I am saying is, if we are required to slice, let’s say, the 3rd best player amongst the Goalkeepers (‘GK’) and the Midfielders (‘MF’), we will pass a tuple after calling the .loc accessor and in the tuple, pass a list comprising ‘GK’ and ‘MF’ only and then ‘3rd’ after a comma. We will then add a colon at the very end of the .loc call to indicate that we want all the rows for those indexes returned.

Let’s see

In [10]: fifa19.loc[(['GK' ,'MF'], '3rd'), :]
Out[10]:          Name     Overall
  Postion  Rank  
      GK    3rd   Coutois  88
      MF    3rd   Kante    90

In the event, we are only interested in a specific column rather than the entire row of the selected Positions, we call that specific column instead of the colon :

In [10]: fifa19.loc[(['GK' ,'MF'], '3rd'), 'Name']
Out[10]:  
       Postion  Rank  
       GK       3rd   Coutois  
       MF       3rd   Kante    
       Name: Name, dtype: object

This fanciful way of slicing indexes is also applicable for the inner levels. Looking up data based on inner indexes could be tricky at times. You could run into problems if you pay attention to detail. The syntax is quite similar to the outer indexing. That is, if we want to see the Best and Worst ranked Centre-Forward in the DataFrame, we could do that by calling ‘CF’ followed by a list of ‘1st’ and ‘3rd’ and then a colon : to return all columns of those rows.

In [11]: fifa19.loc[('CF', ['1st' , '3rd']),:]
Out[11]:         Name     Overall
  Postion  Rank  
      CF   1st   Ronaldo   94
           3rd   Neymar    92

There is this last trick I feel I have to introduce you to when it comes to slicing hierarchical indexes. It is the use of python’s in-built function Slice()

A tuple used for indexing does not recognise slicing with colon natively. To make that work, we use Python’s Slice() explicitly.

A Slice() creates an instance of type slice. As we already know, a Slice can be used to index sequences in Start, End, Step.

Let’s say we want to select the highest-rated players in each of the positions, we could do with a

In [12]: fifa19.loc[(['CF','DF','GK' ,'MF', ], '1st'), :]
out[12]:           Name     Overall
  Postion  Rank  
      CF   1st     Ronaldo     94
      DF   1st     Ramos       91
      GK   1st     De Gea      91
      MF   1st     De Bruyne   92

That works perfectly but we can have a ‘simpler’ alternative which is the use of Slice().

In [13]: fifa19.loc[(slice(None), '1st'), :]
out[13]:           Name     Overall
  Postion  Rank  
      CF   1st     Ronaldo     94
      DF   1st     Ramos       91
      GK   1st     De Gea      91
      MF   1st     De Bruyne   92

We use ‘None’ to get the default value for all of the 3 slice arguments.

Slice(None) creates a slice instance as Slice(none, none, none) and hence returns all the rows that have a ‘1st’ index.

That seems like a handful. You should have enough on your plate to chew on at this point. I will end it here. You can take time out and practice these on your own. The concepts are relatively easy to follow and understand.

In the next tutorial, I will be dealing with Pivoting, Pivot tables, Stacks and Melting.

Until then, Keep Coding!

Slicing Hierarchical Indexes

Written by Jimmy Aidoo