Pandas — GroupBy.first vs GroupBy.nth vs GroupBy.head
Grouped data usually calls for aggregate computations, but there are times when all you need is the first or the nth row from each group.
Comic-con tickets🎫️ were going on sale the following Tuesday, and the sale was divided into 3 sessions: the first at 10 AM, the second at 2 PM and the last at 8 PM.
It was decided that the first user🥇 of each session would be given a free tour to any country that the user specified during the ticket reservation.
At the end of the day, the following user data was collected:
Extracting the first user from each session
A pandas DataFrame has a groupby([column(s)]).first() method, which is used to get the first record from each group.
The result of groupby.first() goes a little off the road for the last group, 8 PM, where Penny was the first one to get a ticket. For some reason Penny left the country field empty, but the result above has Australia in it.
As a result, Penny was forced to go to Australia✈️. Poor Penny🤣.
It seems that there is something wrong with the GroupBy.first method😕
The expected result should be:
groupby.first() was expected to give the above result ✔️
There is nothing wrong with the groupby.first() method. That is simply how it works.
GroupBy.first returns the first non-null value of each column within a group. So, if the first record has a null value in a column, the first non-null value further down the group is carried up into the first record. This is the reason why Penny went on a tour to Australia.
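The behavior described above can be reproduced with a small sketch. The dataset below is hypothetical (the names, sessions and countries are made up for illustration and are not the article's original table); only Penny's empty country field mirrors the story.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, not the article's original table.
df = pd.DataFrame({
    "session": ["10 AM", "10 AM", "2 PM", "2 PM", "8 PM", "8 PM"],
    "user": ["Sheldon", "Amy", "Leonard", "Raj", "Penny", "Howard"],
    "country": ["Japan", "France", "Italy", "India", np.nan, "Australia"],
})

# first() picks the first non-null value per column, not the first row.
first_per_session = df.groupby("session").first()
print(first_per_session)
```

In the 8 PM group the user column comes from Penny's row, while the country column is taken from the next row that has a value, so the output mixes two different records.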
What if an entire group contains only null values? What should be expected from GroupBy.first then?
Let's take a look at last year's Comic-con, where no one wanted to go on a tour.
Getting the first user of each session.
What if there is only one record in a group and that record has a null in it?
The result is the same as in the scenario above: the record for that group will have a null in it.
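Both cases can be checked with a short sketch. The data is again hypothetical: the 2 PM group has only nulls in the country column, and the 8 PM group consists of a single record with a null.

```python
import numpy as np
import pandas as pd

# Hypothetical data: "2 PM" is all-null in country, "8 PM" has one null record.
df = pd.DataFrame({
    "session": ["10 AM", "10 AM", "2 PM", "2 PM", "8 PM"],
    "user": ["Sheldon", "Amy", "Leonard", "Raj", "Penny"],
    "country": ["Japan", "France", np.nan, np.nan, np.nan],
})

# With no non-null value to carry up, first() leaves the cell null.
result = df.groupby("session").first()
print(result)
```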
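Is there a way to get the correct result using GroupBy.first?
One workaround is to replace np.nan (NaN) with None using the np.where() method, so that GroupBy.first does not skip the value. How GroupBy.first treats None has varied across pandas versions, so check the result on the version you use.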
There’s a simpler way to get this done without worrying about np.nan and None. GroupBy.nth comes in handy in such situations⛑️.
GroupBy.nth doesn’t fill anything in: it returns the record at the requested position in each group, even if that record has a null value in it. It also has some extra capabilities.
The np.nan is not replaced with None, and GroupBy.nth still gives the expected result. The result stays the same even if the np.nan values are replaced with None.
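A minimal sketch of this, using hypothetical data in which Penny's country is np.nan. One caveat worth knowing: the shape of the returned index differs between pandas versions (from pandas 2.0 onward, nth behaves like a filter and keeps the original row index and the grouping column), so the sketch only looks at the user and country columns.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, not the article's original table.
df = pd.DataFrame({
    "session": ["10 AM", "10 AM", "2 PM", "2 PM", "8 PM", "8 PM"],
    "user": ["Sheldon", "Amy", "Leonard", "Raj", "Penny", "Howard"],
    "country": ["Japan", "France", "Italy", "India", np.nan, "Australia"],
})

# nth(0) returns the row at position 0 of each group, nulls included.
first_rows = df.groupby("session").nth(0)
print(first_rows[["user", "country"]])
```

Here Penny keeps her empty country field instead of inheriting Australia from the next record.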
GroupBy.nth has some extra powers
Now, Comic-con wants to select 2 users from each session. The First and Third users will be the lucky winners.
GroupBy.nth can be used to get multiple specific records within each group.
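As a sketch of picking the first and third entries, nth accepts a list of zero-based positions. The data below is hypothetical, with three made-up users per session.

```python
import pandas as pd

# Hypothetical data: three users per session, in order of arrival.
df = pd.DataFrame({
    "session": ["10 AM"] * 3 + ["2 PM"] * 3 + ["8 PM"] * 3,
    "user": ["Sheldon", "Amy", "Stuart",
             "Leonard", "Raj", "Bernadette",
             "Penny", "Howard", "Leslie"],
})

# Positions are zero-based, so [0, 2] means the first and third rows.
winners = df.groupby("session").nth([0, 2])
print(winners["user"].tolist())
```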
Finally, the day arrived when Sheldon, Leonard, and Raj (but not Howard😞) got lucky. The country was pre-planned by the group. Comic-con decided to take the first 3 entries from each session.
Is there a way to get the first n records from each group?
At such a time GroupBy.head comes to the rescue🏃‍♂️. GroupBy.nth can be used too, but the catch is that it needs a list of positions running from 0 all the way to n-1.
GroupBy.head preserves the index of the original dataframe.
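The two approaches can be compared with a short sketch. The data is hypothetical (four made-up users per session), and n = 3 is taken from the story.

```python
import pandas as pd

# Hypothetical data: four users per session, in order of arrival.
df = pd.DataFrame({
    "session": ["10 AM"] * 4 + ["2 PM"] * 4 + ["8 PM"] * 4,
    "user": ["Sheldon", "Leonard", "Raj", "Howard",
             "Amy", "Stuart", "Bernadette", "Leslie",
             "Penny", "Zack", "Kripke", "Wil"],
})

n = 3
# head(n) keeps the first n rows of each group and the original row index.
top_n = df.groupby("session").head(n)

# The nth equivalent needs every position from 0 to n-1 spelled out.
via_nth = df.groupby("session").nth(list(range(n)))

print(top_n)
```

Both calls select the same users; head(n) is simply the shorter way to write it, and its result always carries the original row labels.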
GroupBy.head can also be used to get the first record from each group irrespective of np.nan (NaN) and None.
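A minimal sketch of that last point, using hypothetical data where the first record of one session holds None and the first record of another holds np.nan:

```python
import numpy as np
import pandas as pd

# Hypothetical data: the first "10 AM" row has None, the first "8 PM" row has NaN.
df = pd.DataFrame({
    "session": ["10 AM", "10 AM", "2 PM", "2 PM", "8 PM", "8 PM"],
    "user": ["Sheldon", "Amy", "Leonard", "Raj", "Penny", "Howard"],
    "country": [None, "France", "Italy", "India", np.nan, "Australia"],
})

# head(1) returns the first row of each group as-is, without filling nulls.
first_rows = df.groupby("session").head(1)
print(first_rows)
```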
Safe journey to the winners.