Writing custom aggregation functions with Pandas

This is my first post about… well… anything (It’s true, It took a bit while to publish it, but I wrote it before the one about Pokemon GO). I’ve been working as a data analyst for the last year and a half at the time of this post and I’ve mainly used Python with Pandas. Pandas is a great module for data analysis and it uses some neat data structures such as Series and DataFrames. If you’re interested on learning Pandas, I recommend checking out 10 minutes to pandas. It’s a great place to start!

Now that you’ve taken a look at Pandas, lets go to the matter at hand. Aggregation functions with Pandas. If you’re wondering what that really is don’t worry! An aggregation function takes multiple values as input which are grouped together on certain criteria to return a single value. A few of these functions are average, count, maximum, among others.

A case use of an aggregation function on Pandas is, for example, when you’ve got a DataFrame (I’ll refer to as df on the code snippets) like the following:

Item-Value DataFrame

On the above DataFrame each row is an item of type A, B or C and its value. A common task would be to know how much value you’ve got for each type of item. In order to do this, you just group by item and sum the value.

In : df.groupby(‘item’).value.sum()
Out: item
Item A 70
Item B 177
Item C 40
Name: value, dtype: int64

For this case it’s pretty straight forward. We’ve got a sum function from Pandas that does the work for us. If there wasn’t such a function we could make a custom sum function and use it with the aggregate function in order to achieve the same result.

In : def test_sum(series):
return reduce(lambda x, y: x + y, series)
     df.groupby('item').agg({'value': ['sum', test_sum]})

The aggregation function we created receives the value Series from the DataFrame and them sums all the items from the series to get the same result as the sum function from Pandas:

Resulting DataFrame after the aggregation

Of course this is a dull example, as it’s not useful at all given the existence of the sum function. I’ve been working on a real world use case today, when we wanted to verify if every sales analyst was tied to a manager and I ended up creating the following aggregation function in order to return the set of every analyst for a given manager.

def agg_analyst_per_manager(series):
analyst_list = series.astype(unicode).tolist()
analyst_list = filter(lambda analyst: analyst != '', analyst_list)
return set(analyst_list)

Another example of a custom aggregation function I’ve created is

def count_nulls(series):
return len(series) — series.count()

Knowing how to create a custom aggregation function has proved useful a few times in order to rapidly aggregate data in anyway I need to without much complication. That’s all for my first programming text!