Counting, as a fraction of a group
One of the common operations that I keep coming back to in my data science projects is counting the number of items in a group as a fraction of the total number of items in the group.
For example, in one dataset of marathon runners I have columns for
age and also for
expected_time, where some runners have indicated their expected finish-time. I’d like to group runners by
age and count the percentage of runners (in each group) who provide an
expected_time value, to see how it varies with gender and age.
In the past I have always calculated this using division, with two separate
groupby operations: one to count the number of runners (by
age) with non-null
expected_time entries divided by another to count the total number of runners (by
age). Straightforward, but not very elegant.
I recently came across a better way to do this, based on the fact that the fraction that I want is the same as the
mean of non-null values in a boolean data frame, as follows:
lambda x: x.notnull().mean())
Here, each group is transformed into a boolean dataframe, based on whether
expected_time is non-null or not. Since
True is considered to be a 1 (and
False a 0) then the mean of this boolean dataframe is the fraction of
True values (non-null
expected_time values). For instance if we have a grouping of 100 runners and 30 of them have provided
expected_time values, then the
mean will be 30/100, which is exactly the fraction we want.
That’s a lot neater and more efficient. One
groupby instead of two and a much more succint piece of code, with less scope for errors … I think.