pandas: Find column with min/max value for each row in dataframe

José Fernando Costa
Nerd For Tech
Published in
3 min readMay 7, 2021

Sometimes you have multiple columns of measures for a single purpose, yet you only want to keep the one that performs according to your needs.

(source)

In this demo we’ll analyse a synthetic clustering model output dataset. The trick is that we have columns with the distance to the centre of each cluster, but not a column with the cluster assignment itself. In other words, it becomes hard to further analyse the model predictions.

Dataframe preview
Dataframe preview

So, we have a dataframe for a clustering model of stocks data. The Symbol, Company, Sector, Date and Price columns were the features used for prediction.

The remaining columns ten “DistancesToClusterCenter no.” columns were output by the model. These represent how far a data point is from each cluster centroid.

But here’s the problem. This dataframe did not include a column with the single cluster assignment, only the centroid distances. How do we create a column like that?

The answer is the idxmin function. As per the documentation, this function returns the index or column name of the cell with the minimum value, depending on the axis specified. We are interested in the column with the minimum value, so we can make this call:

df["ClusterAssignment"] = df.idxmin(axis="columns")

Which saves the column with minimum value in the new ClusterAssignment column for each row. However, that needs a tweak: call idxmin only on the centroid distance columns, instead of all the columns available.

names = [f"DistancesToClusterCenter no.{i}" for i in range(0, 10)]df["ClusterAssignment"] = df[names].idxmin(axis="columns")

names is a list of the centroid distance columns. This second idxmin is called only on that subset, not all columns. Here’s the result.

idxmin output
idxmin output

We can go a step further and keep only the cluster number in ClusterAssignment, instead of the complete column name .

df["ClusterAssignment"] = df["ClusterAssignment"].map(lambda value: value.split(".")[-1])

For each name, split the string on the dot (.) and keep the last substring, i.e., the cluster number.

Cleaned ClusterAssignment
Cleaned ClusterAssignment

We already have what we wanted, so we can remove the centroid distance columns. Thankfully we have those saved in the names list, for an easy drop call.

df= df.drop(names, axis="columns")
Final dataframe
Final dataframe

We have arrived at the desired dataframe: the input features and the cluster predicted by the model.

As an extra tip, you could easily repeat this process for the column with the largest value, i.e., the centroid furthest from the data point, by using idxmax instead.

df["MostDistantCluster"] = df[names].idxmax(axis="columns")df["MostDistantCluster"] = df["MostDistantCluster"].map(lambda value: value.split(".")[-1])df["MostDistantCluster"] = df["MostDistantCluster"].map(lambda value: value.split(".")[-1])

Which would result in this dataframe.

Resulting dataframe for idxmax
Resulting dataframe for idxmax

Here’s the complete script for this demo.

Find column with max/min value for each row in dataframe

--

--