Counting, as a fraction of a group

One of the common operations that I keep coming back to in my data science projects is counting the number of items in a group as a fraction of the total number of items in the group.

For example, in one dataset of marathon runners I have columns for gender and age and also for expected_time, where some runners have indicated their expected finish-time. I’d like to group runners by gender and age and count the percentage of runners (in each group) who provide an expected_time value, to see how it varies with gender and age.

In the past I have always calculated this using division, with two separate groupby operations: one to count the number of runners (by gender and age) with non-null expected_time entries divided by another to count the total number of runners (by gender and age). Straightforward, but not very elegant.

I recently came across a better way to do this, based on the fact that the fraction that I want is the same as the mean of non-null values in a boolean data frame, as follows:

df.groupby([‘gender’, ‘age’])[‘expected_time’].apply(
lambda x: x.notnull().mean())

Here, each group is transformed into a boolean dataframe, based on whether expected_time is non-null or not. Since True is considered to be a 1 (and False a 0) then the mean of this boolean dataframe is the fraction of True values (non-null expected_time values). For instance if we have a grouping of 100 runners and 30 of them have provided expected_time values, then the mean will be 30/100, which is exactly the fraction we want.

That’s a lot neater and more efficient. One groupby instead of two and a much more succint piece of code, with less scope for errors … I think.