Splitting Pandas DataFrames using groupby operator

Ioannis Maragkakis
8 min readNov 8, 2022

--

Pandas is one of the Python libraries that i use everyday in my job as a data engineer. Throughout my first project when i started working, there were many times that i had to use the groupby operation to group the data and apply some aggregations. These times i was blindly using a specific groupby format to apply the calculations without clearly understanding the inner depths of the operator. That is why one day i said, i am going to study every little detail about it. And so i did. Here i am trying to distribute my hours of sitting in front of the computer trying to understand the method.

Groupby operation

There are 3 things happening when using groupby:

  1. Splitting the dataframe in different groups (e.g. male and female)
  2. Applying a function to each of the group separately (e.g. mean)
  3. Combine the groups into a dataframe again

In this tutorial, only the first option is going to be presented. More tutorials about the rest are going to be created in the future.

WHAT

First initialize the following libraries:

import pandas as pd
import numpy as np

We will use the following dataframe to apply and present the calculations:

df = pd.DataFrame(
[
("bird", "chickens", 7.5, 3, 'ground'),
("bird", "falcons", 6.0, 1, 'air'),
("bird", "owls", 4.0, 1.5, 'air'),
("bird", "parrots", 50.0, 1.6, 'air'),
("mammal", "elephants", 48.0, 5000, 'ground'),
("mammal", "hamsters", 2.5, 0.1, 'ground'),
("mammal", "humans", 70.0, 70, 'ground'),
("mammal", "rabbits", 9, 2, 'ground'),
("mammal", "whales", 55, 90000, 'sea'),
('reptiles', 'crocodiles', 70.0, 1100, 'river'),
('reptiles', 'lizards', 5.0, 0.5, 'ground'),
('reptiles', 'snakes', 9.0, 1.8, 'ground'),
('reptiles', 'tortoises', 115.0, 60, 'ground'),
('reptiles', 'turtles', 30.0, 1.8, 'sea'),
('Amphibians', 'frogs', 10.0, 0.02, 'lake'),
('Amphibians', 'toads', 6.0, 0.06, 'lake'),
('Amphibians', 'salamanders', 15.0, 0.14, 'river'),

],

columns=("class", "animal", "avg_lifespan_yr", 'avg_weight_kg', 'natural_habitat'),
)
df = df.set_index('animal')
df
Different animals and their characteristics

This dataframe shows the class of different animals and it contains values about their average lifespan, their average weight in kilos and their natural habitat. I have made it for the sake of this tutorial, so please don’t get angry if you find any abnormal information 😣 I love animals too !!

Splitting the DataFrame

To split the dataframe we will use the groupby method. There are 4 different ways to split the dataframe:

1. Using a python function

Create a function that takes as input the values of the dataframe index, separately and return the class that you want to categorize them. In the following example i categorize the index rows according to the length of each individual index value:

def group_length(ind_value): # ind_value is each index value
# create two groups:
# group1='large' if length of index >6 else group2='small'
if len(ind_value) > 6:
return 'large'
else:
return 'small'
grouped_df = df.groupby(group_length) # insert the function name
grouped_df.groups # use the .groups attribute to get a dictionary
# with the groups (large, small) as the keys and the index values
# as the dictionary values that belong to each group accordingly
# result:
{
'large':
['chickens', 'falcons', 'parrots', 'elephants',
'hamsters', 'rabbits', 'crocodiles', 'lizards', 'tortoises',
'turtles', 'salamanders'],
'small':
['owls', 'humans', 'whales', 'snakes', 'frogs', 'toads']
}

2. Using a list or numpy ndarray

Use a python list or ndarray with the same length as the length of the dataframe index, which contains the different groups that you want to group the dataframe index values. Each list element or group corresponds to the element that is in the same position in the dataframe index. That means, the third element of the list refers to the third element in the dataframe index.

# create a python list with 2 groups (group1, group2) and categorize
# each index value accordingly
group_list = [
'group1', 'group2', 'group1', 'group1', 'group1','group1',
'group1', 'group1', 'group2', 'group1','group1', 'group2',
'group1', 'group2', 'group1', 'group2', 'group1'
]
# pass the list into the groupby method
grouped_df = df.groupby(group_list)
grouped_df.groups # get the groups
# result:
{
'group1':
['chickens', 'owls', 'parrots', 'elephants', 'hamsters',
'humans', 'rabbits', 'crocodiles', 'lizards', 'tortoises',
'frogs', 'salamanders'],
'group2':
['falcons', 'whales', 'snakes', 'turtles', 'toads']
}
# take into account that the 3rd element of the group_list
# ('group1') corresponds to the third element of the dataframe index
# values ('owls')

You can do the same grouping using a numpy ndarray.

3. Using a python dictionary or Pandas Series

Create a dictionary with keys, each individual index value of the dataframe and with values, the group that corresponds to this index value.

# here we create 3 groups (group1, group2, group3)
# The dictionary contains all the index values as the dictionary
# keys and their groups as the dictionary values
dct_groups = {

"chickens": 'group2',
"falcons": 'group3',
"owls": 'group2',
"parrots": 'group1',
"elephants": 'group1',
"hamsters": 'group1',
"humans": 'group2',
"rabbits": 'group3',
"whales": 'group1',
"crocodiles": 'group2',
"lizards": 'group1',
"snakes": 'group3',
"tortoises": 'group1',
"turtles": 'group2',
"frogs": 'group3',
"toads": 'group1',
"salamanders": 'group3'

}
grouped_df = df.groupby(dct_groups)
grouped_df.groups # use the .groups attribute to get a dictionary
# with the groups
# result:
{
'group1':
['parrots', 'elephants', 'hamsters', 'whales', 'lizards',
'tortoises', 'toads'],
'group2':
['chickens', 'owls', 'humans', 'crocodiles', 'turtles'],
'group3':
['falcons', 'rabbits', 'snakes', 'frogs', 'salamanders']
}

You will get the same results, if you instead use a Series. The index of the Series must be the same as the dataframe index and the values, the group that corresponds to each Series index value. Try to do the same thing but instead of the dictionary, insert:

s = pd.Series(dct_groups)

3. Using a dataframe column

Insert a string refering to a column name of the dataframe so that, the unique values of this column will be the groups. Instead of a column name, you can also insert an index level name. However, if the string is used as a column name and as an index level name at the same time, a ValueError will be raised.

# group the dataframe by the 'class' column (the unique values)
grouped_df = df.groupby('class')
grouped_df.groups
# result:
{
'Amphibians':
['frogs', 'toads', 'salamanders'],
'bird':
['chickens', 'falcons', 'owls', 'parrots'],
'mammal':
['elephants', 'hamsters', 'humans', 'rabbits', 'whales'],
'reptiles':
['crocodiles', 'lizards', 'snakes', 'tortoises', 'turtles']
}

Also keep in mind that, df.groupby('class') is the same thing as df.groupby(df['class']).

3. A list of any of the above things

You can use more than one of the above things to create a Pandas MuntiIndex.

# create two group lists and group using these two lists to create
# a Pandas MultiIndex
group_list = [
'group1', 'group2', 'group1', 'group1', 'group1','group1',
'group1', 'group1', 'group2', 'group1','group1', 'group2',
'group1', 'group2', 'group1', 'group2', 'group1'
]
group_list2 = [
'group3', 'group3', 'group4', 'group4', 'group4','group4',
'group4', 'group3', 'group4', 'group3','group4', 'group3',
'group3', 'group4', 'group3', 'group4', 'group4'
]
# group by both lists
grouped_df = df.groupby([group_list, group_list2])
grouped_df.groups # get the groups
# result:
{
('group1', 'group3'):
['chickens', 'rabbits', 'crocodiles', 'tortoises', 'frogs'],
('group1', 'group4'):
['owls', 'parrots', 'elephants', 'hamsters', 'humans',
'lizards', 'salamanders'],
('group2', 'group3'):
['falcons', 'snakes'],
('group2', 'group4'):
['whales', 'turtles', 'toads']
}

You can also use a list and a dictionary together to group:

# get the group_list and the dictionary (dct_groups) from abovegrouped_df = df.groupby([group_list, dct_groups])
grouped_df.groups
# result:
{
('group1', 'group1'):
['parrots', 'elephants', 'hamsters', 'lizards', 'tortoises'],
('group1', 'group2'):
['chickens', 'owls', 'humans', 'crocodiles'],
('group1', 'group3'):
['rabbits', 'frogs', 'salamanders'],
('group2', 'group1'):
['whales', 'toads'],
('group2', 'group2'):
['turtles'],
('group2', 'group3'):
['falcons', 'snakes']
}

However, the most usual way to combine multiple methods is by using two or more columns from the dataframe:

# groupby 'class' and 'natural_habitat' columns
grouped_df = df.groupby(['class', 'natural_habitat']).groups
grouped_df.groups
# results:
{
('Amphibians', 'lake'):
['frogs', 'toads'],
('Amphibians', 'river'):
['salamanders'],
('bird', 'air'):
['falcons', 'owls', 'parrots'],
('bird', 'ground'):
['chickens'],
('mammal', 'ground'):
['elephants', 'hamsters', 'humans', 'rabbits'],
('mammal', 'sea'):
['whales'],
('reptiles', 'ground'):
['lizards', 'snakes', 'tortoises'],
('reptiles', 'river'):
['crocodiles'], ('reptiles', 'sea'): ['turtles']
}

Grouping by the level of the index

# rearrange the dataframe to have a MultiIndex as the index
df_new = df.copy() # make a copy of the initial dataframe
new_index = [df.index, df['class']] # set two columns as the index
del df_new['class']
df_new = df_new.set_index(new_index)
df_new
New dataframe with MultiIndex

Group by the an index level using the level argument specifying one or more index names or an integer ranging from 0(left-most index) to n(total number of indexes, right-most level):

# groupby the 'class' index level
df_new.groupby(level='class').groups
# you can also do df_new.groupby(level=1) for the 'class' level# results:
{
'Amphibians':
[('frogs', 'Amphibians'), ('toads', 'Amphibians'),
('salamanders', 'Amphibians')],
'bird':
[('chickens', 'bird'), ('falcons', 'bird'), ('owls', 'bird'),
('parrots', 'bird')],
'mammal':
[('elephants', 'mammal'), ('hamsters', 'mammal'), ('humans',
'mammal'), ('rabbits', 'mammal'), ('whales', 'mammal')],
'reptiles':
[('crocodiles', 'reptiles'), ('lizards', 'reptiles'), ('snakes',
'reptiles'), ('tortoises', 'reptiles'), ('turtles',
'reptiles')]
}

Or groupby many levels:

# here we use integers to denote the index levels
df_new.groupby(level=[1, 0]).groups
# result:
{
('Amphibians', 'frogs'): [('frogs', 'Amphibians')],
('Amphibians', 'salamanders'): [('salamanders', 'Amphibians')],
('Amphibians', 'toads'): [('toads', 'Amphibians')],
('bird', 'chickens'): [('chickens', 'bird')],
('bird', 'falcons'): [('falcons', 'bird')],
('bird', 'owls'): [('owls', 'bird')],
('bird', 'parrots'): [('parrots', 'bird')],
('mammal', 'elephants'): [('elephants', 'mammal')],
('mammal', 'hamsters'): [('hamsters', 'mammal')],
('mammal', 'humans'): [('humans', 'mammal')],
('mammal', 'rabbits'): [('rabbits', 'mammal')],
('mammal', 'whales'): [('whales', 'mammal')],
('reptiles', 'crocodiles'): [('crocodiles', 'reptiles')],
('reptiles', 'lizards'): [('lizards', 'reptiles')],
('reptiles', 'snakes'): [('snakes', 'reptiles')],
('reptiles', 'tortoises'): [('tortoises', 'reptiles')],
('reptiles', 'turtles'): [('turtles', 'reptiles')]
}

Or even by using an index level and a column name:

# group by the index level 'class' and the column name 
# 'natural_habitat'
df_new.groupby(['class', 'natural_habitat']).groups
# result:
{
('Amphibians', 'lake'): [('frogs', 'Amphibians'), ('toads',
'Amphibians')],
('Amphibians', 'river'): [('salamanders', 'Amphibians')],
('bird', 'air'): [('falcons', 'bird'), ('owls', 'bird'),
('parrots', 'bird')],
('bird', 'ground'): [('chickens', 'bird')],
('mammal', 'ground'): [('elephants', 'mammal'), ('hamsters',
'mammal'), ('humans', 'mammal'), ('rabbits', 'mammal')],
('mammal', 'sea'): [('whales', 'mammal')],
('reptiles', 'ground'): [('lizards', 'reptiles'), ('snakes',
'reptiles'), ('tortoises', 'reptiles')],
('reptiles', 'river'): [('crocodiles', 'reptiles')],
('reptiles', 'sea'): [('turtles', 'reptiles')]
}

If you want to group by an index level using an integer and a dataframe column, use the pd.Grouper method:

df.groupby([pd.Grouper(level=0), 'natural_habitat'])

GroupBy object attributes

After you create the groups you can do the following:

  1. Get a specific group using the .get_group('name_of_group')
df_grouped = df.groupby('class') # group by the 'class' column# there are four groups in the 'class' column: bird, mammal, 
# reptiles, Amphibians. Here we get the 'bird' group. The get_group
# method returnds a dataframe containing the selected group
df_grouped.get_group('bird')
# result:
Groupby the “class” column name and get the “bird” group

You can also get a group from a MultiIndex:

# group by class and natural_habitat
df_grouped = df.groupby(['class', 'natural_habitat'])
# group by class and natural_habitat
df_grouped = df.groupby(['class', 'natural_habitat'])
# get a specific group (remeber now we have a multiindex so we need
# a tuple):
df_grouped.get_group( ('bird', 'air') )

2. Get a dictionary with the groups as the dictionary keys and the index values as the dictionary values, using the .groups attribute (we used this one from the beginning)

3. Get the number of the groups using len(df_grouped.groups)

4. Iterate through the groups

# group by the 'class' column
df_grouped = df.groupby('class')
# loop through the grouped object and get the group name as the
# first value and the dataframe of this group as the second value
for group, data in df_grouped:
print(group)
print(data)
print()

Conclusion

There are more things to cover in the groupby section. After getting the groups we need to perform some computations to the different groups before we combine them together again. That means a function needs to be applied to each group independently such as an aggregation (sum, mean etc.), a tranformation (standardization, fill nans etc.) and filtration (head, tail, etc.) which will be explained in a future tutorial. I hope this tutorial helped you understand the different ways you can split the dataframe into different groups. Thank you very much for your time to read this and let’s keep in touch :)

giannis.marar@hotmail.com

www.linkedin.com/in/ioannis-maragkakis-1ba2851a9

Till next time !

--

--