Group Your Data By the Index

Leandro Ruiz
The Startup


A few days ago, I was working on a risk management project when I came across a simple problem I had never seen before.

The Problem

I needed to group the rows of a transaction dataset by customer ID. The dataset contains 8,250 transactions for 1,125 customers of a bank.

Here are the first five rows:

As you can see, the first five rows all share the same ID, and I needed to group all the transactions by the ID that created them.
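The original head() output isn't reproduced here, but a minimal stand-in with the same columns gives an idea of the shape of the data (the values below are invented for illustration; only the column names come from the snippets in this article):

```python
import pandas as pd

# Toy stand-in for the transaction dataset: several transactions per
# customer id, with the columns used throughout this article.
payment_data = pd.DataFrame({
    "id": [54982353, 54982353, 54982353, 58995504, 58995504],
    "prod_code": [17, 17, 10, 6, 10],
    "new_balance": [64.0, 80.0, 52.0, 30.0, 41.0],
    "OVD_t1": [0, 1, 0, 2, 0],
    "OVD_t2": [0, 0, 1, 0, 0],
    "OVD_t3": [0, 0, 0, 1, 0],
    "pay_normal": [12, 8, 10, 5, 7],
})
print(payment_data.head())
```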

Without a solution in mind, I started searching my data science books, mostly the ones about data analysis and feature engineering, but I didn't find anything.

So I turned to Stack Overflow, where I eventually found something that looked like my situation, although I had to adapt it to the characteristics of the project.

Link to GitHub
Check the project here.

The Solution

To solve this problem you need the well-known groupby function from pandas.

First, we are going to group the prod_code values by their ID (if you want to know more, check the GitHub link above), collecting the various prod_code values of each customer into a list with the following code:

df_0 = payment_data.groupby(['id'])['prod_code'].apply(list).reset_index()
df_0.head(10)

This is the output:

As we can see, we now have all the prod_code values (one for every transaction made) grouped by ID.
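On a tiny invented example, the list aggregation looks like this (the values are made up, but the code is the same as above):

```python
import pandas as pd

payment_data = pd.DataFrame({
    "id": [1, 1, 1, 2, 2],
    "prod_code": [17, 17, 10, 6, 10],
})

# apply(list) collects every prod_code of a group into one Python list,
# so the result has exactly one row per id.
df_0 = payment_data.groupby(['id'])['prod_code'].apply(list).reset_index()
print(df_0)
# df_0.loc[0, "prod_code"] is [17, 17, 10] and df_0.loc[1, "prod_code"] is [6, 10]
```

Note that apply(list) keeps duplicates; if you wanted only the distinct codes per customer, you could apply set instead.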

The next feature that I needed to group was new_balance , where I decided to take the mean of all the new balances for each ID:

df_1 = payment_data.groupby(['id'])['new_balance'].mean().reset_index()
df_1.head(10)

Now, I had to group all the different types of overall payment: OVD_t1, OVD_t2, and OVD_t3. First, I summed every type by its corresponding ID, and then merged all the types together into one dataframe.

For this I had to use the function reduce:

import pandas as pd
from functools import reduce

df_OVD_t1 = payment_data.groupby(['id'])['OVD_t1'].sum().reset_index()
df_OVD_t2 = payment_data.groupby(['id'])['OVD_t2'].sum().reset_index()
df_OVD_t3 = payment_data.groupby(['id'])['OVD_t3'].sum().reset_index()
OVD_dfs = [df_OVD_t1, df_OVD_t2, df_OVD_t3]
df_2 = reduce(lambda left, right: pd.merge(left, right, on='id'), OVD_dfs)
df_2.head(10)
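As a side note, the three separate group-bys and the merge can also be collapsed into a single call by selecting the three columns at once before summing. A small sketch with invented values (same column names as above):

```python
import pandas as pd

payment_data = pd.DataFrame({
    "id": [1, 1, 2],
    "OVD_t1": [0, 1, 2],
    "OVD_t2": [0, 1, 0],
    "OVD_t3": [1, 0, 1],
})

# Selecting several columns before summing aggregates them all in one
# pass, producing the same table as merging three separate group-bys.
df_2_alt = payment_data.groupby('id')[['OVD_t1', 'OVD_t2', 'OVD_t3']].sum().reset_index()
print(df_2_alt)
```

Both approaches give one row per id with the three sums as columns; the reduce version is handier when the per-column results already exist as separate dataframes.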

The final feature that I needed to group was pay_normal. Here, the only operation I applied was a sum, because I needed the total number of times the payment was normal for every client.

df_3 = payment_data.groupby(['id'])['pay_normal'].sum().reset_index()
df_3.head(10)

Merging All the DataFrames Into One

The final step for this task was to merge all the dataframes I created into a single one. For this, I used the reduce function again, in the same way as when grouping the OVD types above.

dfs = [df_0, df_1, df_2, df_3]
df_final = reduce(lambda left, right: pd.merge(left, right, on='id'), dfs)
df_final.head(10)
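For completeness, the whole pipeline can also be expressed as one named aggregation, where each keyword defines an output column as a (column, aggregation) pair. A sketch on invented data (OVD_t2 and OVD_t3 are omitted for brevity; they would follow the same pattern as OVD_t1):

```python
import pandas as pd

payment_data = pd.DataFrame({
    "id": [1, 1, 2],
    "prod_code": [17, 10, 6],
    "new_balance": [64.0, 52.0, 30.0],
    "OVD_t1": [0, 1, 2],
    "pay_normal": [12, 10, 5],
})

# Named aggregation builds the same per-id summary in a single call:
# a list of prod_code values, the mean balance, and the two sums.
df_final_alt = payment_data.groupby('id').agg(
    prod_code=("prod_code", list),
    new_balance=("new_balance", "mean"),
    OVD_t1=("OVD_t1", "sum"),
    pay_normal=("pay_normal", "sum"),
).reset_index()
print(df_final_alt)
```

The reduce-and-merge approach in the article reaches the same result step by step, which is convenient when the intermediate dataframes are useful on their own.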

Conclusion

And that’s it. I hope this tutorial helps you, and thanks for reading this far.

If you liked the article, I invite you to follow me on dev.to, GitHub and LinkedIn!
