Group Your Data By the Index

Leandro Ruiz
The Startup


A few days ago, I was working on a risk management project when I came across a simple problem I had never seen before.

The Problem

I needed to group the rows of a transaction dataset by customer ID. The dataset contains 8,250 transactions for 1,125 customers of a bank.

Here are the first five rows:

As you can see, the first five rows all share the same ID, and I needed to group all the transactions by the ID that created them.
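The original head() output isn't reproduced here, but a minimal stand-in with the same columns gives an idea of the shape of the data (the values below are invented for illustration; only the column names come from the snippets in this article):

```python
import pandas as pd

# Toy stand-in for the transaction dataset: several transactions per
# customer id, with the columns used throughout this article.
payment_data = pd.DataFrame({
    "id": [54982353, 54982353, 54982353, 58995504, 58995504],
    "prod_code": [17, 17, 10, 6, 10],
    "new_balance": [64.0, 80.0, 52.0, 30.0, 41.0],
    "OVD_t1": [0, 1, 0, 2, 0],
    "OVD_t2": [0, 0, 1, 0, 0],
    "OVD_t3": [0, 0, 0, 1, 0],
    "pay_normal": [12, 8, 10, 5, 7],
})
print(payment_data.head())
```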

Without a solution in mind, I started searching my data science books, mostly the ones about data analysis and feature engineering, but I didn't find anything.

So I turned to Stack Overflow, where I eventually found something that looked like my situation, although I had to adapt it to the characteristics of the project.

Link to GitHub
Check the project here.

The Solution

To solve this problem you need the well-known groupby function from pandas.

First, we are going to group the prod_code values by their ID (if you want to know more, check the GitHub link above), collecting the various prod_code values of each customer into a list with the following code:

df_0 = payment_data.groupby(['id'])['prod_code'].apply(list).reset_index()
df_0.head(10)

This is the output:

As we can see, we now have all the prod_code values (one for every transaction made) grouped by ID.
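On a tiny invented example, the list aggregation looks like this (the values are made up, but the code is the same as above):

```python
import pandas as pd

payment_data = pd.DataFrame({
    "id": [1, 1, 1, 2, 2],
    "prod_code": [17, 17, 10, 6, 10],
})

# apply(list) collects every prod_code of a group into one Python list,
# so the result has exactly one row per id.
df_0 = payment_data.groupby(['id'])['prod_code'].apply(list).reset_index()
print(df_0)
# df_0.loc[0, "prod_code"] is [17, 17, 10] and df_0.loc[1, "prod_code"] is [6, 10]
```

Note that apply(list) keeps duplicates; if you wanted only the distinct codes per customer, you could apply set instead.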

The next feature that I needed to group was new_balance , where I decided to take the mean of all the new balances for each ID:

df_1 = payment_data.groupby(['id'])['new_balance'].mean().reset_index()
df_1.head(10)

Now, I had to group all the different types of overall payment: OVD_t1, OVD_t2, and OVD_t3. First, I summed every type by its corresponding ID, and then merged all the types together into one dataframe.

For this I had to use the function reduce:

import pandas as pd
from functools import reduce

df_OVD_t1 = payment_data.groupby(['id'])['OVD_t1'].sum().reset_index()
df_OVD_t2 = payment_data.groupby(['id'])['OVD_t2'].sum().reset_index()
df_OVD_t3 = payment_data.groupby(['id'])['OVD_t3'].sum().reset_index()
OVD_dfs = [df_OVD_t1, df_OVD_t2, df_OVD_t3]
df_2 = reduce(lambda left, right: pd.merge(left, right, on='id'), OVD_dfs)
df_2.head(10)
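As a side note, the three separate group-bys and the merge can also be collapsed into a single call by selecting the three columns at once before summing. A small sketch with invented values (same column names as above):

```python
import pandas as pd

payment_data = pd.DataFrame({
    "id": [1, 1, 2],
    "OVD_t1": [0, 1, 2],
    "OVD_t2": [0, 1, 0],
    "OVD_t3": [1, 0, 1],
})

# Selecting several columns before summing aggregates them all in one
# pass, producing the same table as merging three separate group-bys.
df_2_alt = payment_data.groupby('id')[['OVD_t1', 'OVD_t2', 'OVD_t3']].sum().reset_index()
print(df_2_alt)
```

Both approaches give one row per id with the three sums as columns; the reduce version is handier when the per-column results already exist as separate dataframes.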

The final feature that I needed to group was pay_normal. Here, the only operation I applied was a sum, because I needed the total number of times the payment was normal for every client.

df_3 = payment_data.groupby(['id'])['pay_normal'].sum().reset_index()
df_3.head(10)

Merging All the DataFrames Into One

The final step for this task was to merge all the dataframes I created into a single one. For this, I used the reduce function again, in the same way as when grouping the OVD types above.

dfs = [df_0, df_1, df_2, df_3]
df_final = reduce(lambda left, right: pd.merge(left, right, on='id'), dfs)
df_final.head(10)
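For completeness, the whole pipeline can also be expressed as one named aggregation, where each keyword defines an output column as a (column, aggregation) pair. A sketch on invented data (OVD_t2 and OVD_t3 are omitted for brevity; they would follow the same pattern as OVD_t1):

```python
import pandas as pd

payment_data = pd.DataFrame({
    "id": [1, 1, 2],
    "prod_code": [17, 10, 6],
    "new_balance": [64.0, 52.0, 30.0],
    "OVD_t1": [0, 1, 2],
    "pay_normal": [12, 10, 5],
})

# Named aggregation builds the same per-id summary in a single call:
# a list of prod_code values, the mean balance, and the two sums.
df_final_alt = payment_data.groupby('id').agg(
    prod_code=("prod_code", list),
    new_balance=("new_balance", "mean"),
    OVD_t1=("OVD_t1", "sum"),
    pay_normal=("pay_normal", "sum"),
).reset_index()
print(df_final_alt)
```

The reduce-and-merge approach in the article reaches the same result step by step, which is convenient when the intermediate dataframes are useful on their own.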

Conclusion

And that’s it. I hope this tutorial helps you, and thanks for reading this far.

If you liked the article, I invite you to follow me on dev.to, GitHub and LinkedIn!
