Pandas Tips/Tricks —Show a count of each of the 3 most frequent values of field A for each field B value.

Christopher Jose
learningdatascience
4 min readJun 20, 2019

This blog post is for Python/Pandas users because we’re the best
(j/k everyone’s special).

Task:
Show a count of each of the 3 most frequent values of field A for each field B value.

1. Compute frequencies of field A values grouped by field B.
2. Show only the 3 highest counts of field A for each field B value

Real World Example of Above Task:
Say you have San Francisco housing project data: https://data.sfgov.org/Housing-and-Buildings/Housing-Inventories-2005-to-2018/jzn7-79x2. Each row consists of a housing project, including the housing type (apartment, business, school) and completion status (Completed, Not Completed, etc).

The word “housing” appears to indicate not just a dwelling that people live in, but one where people conduct a particular activity (business, schooling, prayer, the arts, etc).

Suppose we’d like to show the 3 most frequent housing types and their counts for each project completion status.

In other words,
1. Count each housing type for each project completion status.
2. Show only the 3 highest counts of housing types for each project status.

The Code

data.groupby(['fieldB','fieldA']).size()
.reset_index(name='Count')
.sort_values(['fieldB','Count'],ascending=[1,0])
.groupby('fieldB').head(3)

The Code Using Example Housing Data

data.groupby(['ACTION','EXT_USE2']).size()
.reset_index(name='Count')
.sort_values(['ACTION','Count'],ascending=[1,0])
.groupby('EXT_USE2').head(3)

ACTION — Housing Project Completion Status
EXT_USE2 — Use of Project (Housing Type)

I modified the original EXT_USE field into EXT_USE2 since some categories had differing values representing the same thing, like “ARTIST LIVE/WORK” and ”ARTIST LIVE/ WORK”.

In the above, I sort the Count values (essential) AND the ACTION values (non-essential). Sorting the ACTION values is not essential to the problem statement, but just makes the output cleaner.

1st Line of Code — data.groupby([‘fieldB’,’fieldA’]).size()
This returns a Series object of the count of each distinct (Field B, Field A) value, sorted in descending order.

Create Series of Counts, Grouped by ACTION and EXT_USE2

2nd Line of Code — reset_index(name=’Count’)
reset_index() turns the Series into a DataFrame with 3 columns. By default, the values column is annoyingly renamed to ‘0’ (but without the quotes, so you’d literally just reference it using df[0]).

Convert Series to DataFrame

Avoid this by supplying the argument name=’Blah Blah’, which renames it to ‘Blah Blah’ (or whatever peaks your fancy).

Rename the Count Field from 0 to Count

3rd Line of Code — sort_values([‘fieldB’,’Count’],ascending=[1,0])
This sorts the DataFrame by ‘fieldB’ in ascending order, and then ‘Count’ in descending order. Sorting ‘fieldB’ is not essential to the problem statement.

Sort the DataFrame by ACTION and then by Count

4th Line of Code — groupby(‘fieldB’).head(3)
The magic sauce. For each ‘fieldB’ value, return the first 3 rows. We’ve already sorted counts in descending order for each fieldB value. We just need to pull the first 3 rows for each ‘fieldB’ value. Without the groupby(), we’d be pulling the first 3 rows of the entire DataFrame (what head(3) normally does), and not for each ‘fieldB’ value.

Putting it all together
The finale, since this is a symphonic masterpiece that’s been building to it’s inevitable, beautiful climax!

Before you say, “Wait! I see more than 3 rows for some ACTION values!”,

Look again!

This worked, it’s just that (for the keen-eyed analyst) the ACTION categories need to be wrangled/combined because they were entered slightly differently. CFC Issued was sometimes entered as ‘CFC ISSUED’ and ‘CFC_ISSUED’, and ‘FINAL_INSPECT_blah’ has a similar story.

So this result is not quite final! — Another task for another day, however!

Bye.

--

--

Christopher Jose
learningdatascience

An aspiring data scientist hungry to continue building out his data analytics toolkit.