DataFrames with Pandas: Part 3 (Aggregating)

Md Aminul Islam
5 min read · May 20, 2023


In the previous section, we were introduced to filtering rows, along with some other useful DataFrame and Series methods such as sort_values, value_counts, unique, and isin. If you have not read the previous two articles, I strongly recommend going through Part-1 and Part-2 before proceeding with this one. Today, we will see how we can aggregate rows in a DataFrame.

Aggregating

Data Scientists often need to aggregate rows to summarize data. For example, in our elections DataFrame, we may want to calculate the total Popular vote for each year, or the total number of Candidates in each year. To answer these types of questions, we need to group the rows in our DataFrame. We will look at these questions in detail in the rest of this discussion.

Let's first load the elections DataFrame that we worked with in the previous sections.

import pandas as pd

elections = pd.read_csv("elections.csv")
elections
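For reference, the rest of this article uses the same columns we relied on in the earlier parts: Year, Candidate, Party, Popular vote, Result, and %. A quick look at the first few rows is an easy sanity check (a small sketch, assuming the same elections.csv from Parts 1 and 2):

# Peek at the first rows to confirm the columns used below:
# Year, Candidate, Party, Popular vote, Result, %
elections.head()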

Aggregating on a Single Column

First of all, let's say we want to calculate the sum of all Popular vote values.

elections['Popular vote'].sum()

The above is a simple sum of Popular vote over all the records. Now, let's return to our earlier questions. We want to calculate the total Popular vote for each year, so we need to take the sum of Popular vote per year rather than over all records. That means we need to (i) group our dataset by Year and (ii) then take the sum (aggregation) of Popular vote for each group. To group the data in a DataFrame, we use the groupby method. In summary, the expression looks like this:

elections.groupby('Year')['Popular vote'].sum()

Notice that the above output is a Series object rather than a DataFrame. So, the general formula for grouping data looks like this:

elections             # the DataFrame
.groupby('Year')      # column(s) to group by
['Popular vote']      # column(s) to aggregate
.sum()                # how to aggregate
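
As a side note, if you would prefer a DataFrame rather than a Series, two common options (a sketch, not something the rest of this article depends on) are selecting the column with a list, or passing as_index=False to groupby:

# Selecting with a list of column names keeps the result as a DataFrame
elections.groupby('Year')[['Popular vote']].sum()

# Or keep Year as a regular column instead of the index
elections.groupby('Year', as_index=False)['Popular vote'].sum()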

Now, let's answer the other question we introduced earlier: the total number of Candidates in each year. For this, we need to (i) group our dataset by Year and (ii) then count the Candidate entries in each group. For the counting, we can use the size() aggregation function.

elections.groupby('Year')['Candidate'].size()
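
Note that size() simply counts the rows in each group. If we instead wanted the number of distinct candidate names per year (in case a name could appear more than once in a year), a hedged alternative is nunique():

# Count distinct Candidate values within each Year group
elections.groupby('Year')['Candidate'].nunique()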

Aggregating on Multiple Columns

If we want to subdivide our dataset by more than one column, we need to group it on multiple columns. For instance, suppose we want to know how many of each Result (loss/win) there are for each Party. For this, we group our dataset on the Party and Result columns by passing a list of column names to the groupby function.

elections_group_by_party_result = (
    elections.groupby(['Party', 'Result'])['Result'].size()
)
elections_group_by_party_result

The resulting output above is a Series object. If you look at the Series more closely, you will see that Whig has two entries: one for loss (4) and one for win (2). So, if a party has both lost and won at least one election, there will be two entries for that party. A Series like this is said to have a multi-level index.

Working with a Series that has a multi-level index can be tedious. We can easily convert it into a DataFrame with a single index using the reset_index() method.

elections_group_by_party_result.reset_index(name='Count')

The above DataFrame is much easier to understand and interpret than the multi-level index Series.
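
As an aside (a sketch, assuming a reasonably recent pandas version), we can also avoid the intermediate multi-level Series altogether by passing as_index=False to groupby; size() then returns a DataFrame directly, with the count in a column named size:

# Group on both columns but keep them as regular columns in the output;
# size() returns a DataFrame with a 'size' column instead of a
# multi-level index Series
elections.groupby(['Party', 'Result'], as_index=False).size()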

Custom Aggregating Functions

Pandas has built-in aggregation functions such as sum(), mean(), size(), first(), max(), and min(). So far, we have seen some of them in use. But there can be cases where the built-in aggregation functions do not serve our purpose. In that case, we need to write our own custom aggregation function and apply it with .agg(func), where func is the function we define. Let's say we want to calculate, for each year, the difference in vote percentage between the winner and the closest competitor. For that, we need to write a custom aggregation function that computes the difference in % between the candidate with the highest percentage and the candidate with the second-highest percentage in each year.

def diff_with_nearest_candidate(s):
    # max value in the group
    max_value = s.max()
    # 2nd largest value in the group
    second_highest = s.nlargest(2).min()
    # difference between the winner and the closest competitor
    return max_value - second_highest


elections.groupby('Year')['%'].agg(diff_with_nearest_candidate)

The above Series shows, for each year, the difference in vote percentage between the winner and the closest competitor.
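
For a one-off computation like this, the same logic can also be written inline with a lambda; this sketch is equivalent to the named function above:

# Same aggregation expressed as a lambda passed to .agg()
elections.groupby('Year')['%'].agg(lambda s: s.max() - s.nlargest(2).min())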

Today, we have discussed how to aggregate rows in a DataFrame on a single column and on multiple columns, with both built-in and custom aggregation functions. You can find all the examples and the dataset mentioned here in this GitHub repository. In the next part, we will discuss how to join two or more DataFrames together to connect data across them. Please follow or subscribe if you want to be notified about future posts on Data Science, Machine Learning, and Causal Inference.



Md Aminul Islam

I am a PhD student in Computer Science at University of Illinois Chicago. My interests are in reasoning with data using Data Science and Machine Learning.