Pandas Tips/Tricks —Show a count of each of the 3 most frequent values of field A for each field B value.
This blog post is for Python/Pandas users because we’re the best
(j/k everyone’s special).
Task:
Show a count of each of the 3 most frequent values of field A for each field B value.
1. Compute frequencies of field A values grouped by field B.
2. Show only the 3 highest counts of field A for each field B value
Real World Example of Above Task:
Say you have San Francisco housing project data: https://data.sfgov.org/Housing-and-Buildings/Housing-Inventories-2005-to-2018/jzn7-79x2. Each row consists of a housing project, including the housing type (apartment, business, school) and completion status (Completed, Not Completed, etc).
The word “housing” appears to indicate not just a dwelling that people live in, but one where people conduct a particular activity (business, schooling, prayer, the arts, etc).
Suppose we’d like to show the 3 most frequent housing types and their counts for each project completion status.
In other words,
1. Count each housing type for each project completion status.
2. Show only the 3 highest counts of housing types for each project status.
The Code
data.groupby(['fieldB','fieldA']).size()
.reset_index(name='Count')
.sort_values(['fieldB','Count'],ascending=[1,0])
.groupby('fieldB').head(3)
The Code Using Example Housing Data
data.groupby(['ACTION','EXT_USE2']).size()
.reset_index(name='Count')
.sort_values(['ACTION','Count'],ascending=[1,0])
.groupby('EXT_USE2').head(3)
ACTION — Housing Project Completion Status
EXT_USE2 — Use of Project (Housing Type)
I modified the original EXT_USE field into EXT_USE2 since some categories had differing values representing the same thing, like “ARTIST LIVE/WORK” and ”ARTIST LIVE/ WORK”.
In the above, I sort the Count values (essential) AND the ACTION values (non-essential). Sorting the ACTION values is not essential to the problem statement, but just makes the output cleaner.
1st Line of Code — data.groupby([‘fieldB’,’fieldA’]).size()
This returns a Series object of the count of each distinct (Field B, Field A) value, sorted in descending order.
2nd Line of Code — reset_index(name=’Count’)
reset_index() turns the Series into a DataFrame with 3 columns. By default, the values column is annoyingly renamed to ‘0’ (but without the quotes, so you’d literally just reference it using df[0]).
Avoid this by supplying the argument name=’Blah Blah’, which renames it to ‘Blah Blah’ (or whatever peaks your fancy).
3rd Line of Code — sort_values([‘fieldB’,’Count’],ascending=[1,0])
This sorts the DataFrame by ‘fieldB’ in ascending order, and then ‘Count’ in descending order. Sorting ‘fieldB’ is not essential to the problem statement.
4th Line of Code — groupby(‘fieldB’).head(3)
The magic sauce. For each ‘fieldB’ value, return the first 3 rows. We’ve already sorted counts in descending order for each fieldB value. We just need to pull the first 3 rows for each ‘fieldB’ value. Without the groupby(), we’d be pulling the first 3 rows of the entire DataFrame (what head(3) normally does), and not for each ‘fieldB’ value.
Putting it all together
The finale, since this is a symphonic masterpiece that’s been building to it’s inevitable, beautiful climax!
Before you say, “Wait! I see more than 3 rows for some ACTION values!”,
Look again!
This worked, it’s just that (for the keen-eyed analyst) the ACTION categories need to be wrangled/combined because they were entered slightly differently. CFC Issued was sometimes entered as ‘CFC ISSUED’ and ‘CFC_ISSUED’, and ‘FINAL_INSPECT_blah’ has a similar story.
So this result is not quite final! — Another task for another day, however!
Bye.