pandas: Find column with min/max value for each row in dataframe
Sometimes you have multiple columns of measures for a single purpose, yet you only want to keep the one that performs according to your needs.
In this demo we’ll analyse a synthetic clustering model output dataset. The trick is that we have columns with the distance to the centre of each cluster, but not a column with the cluster assignment itself. In other words, it becomes hard to further analyse the model predictions.
So, we have a dataframe for a clustering model of stocks data. The Symbol, Company, Sector, Date and Price columns were the features used for prediction.
The remaining columns ten “DistancesToClusterCenter no.” columns were output by the model. These represent how far a data point is from each cluster centroid.
But here’s the problem. This dataframe did not include a column with the single cluster assignment, only the centroid distances. How do we create a column like that?
The answer is the idxmin
function. As per the documentation, this function returns the index or column name of the cell with the minimum value, depending on the axis specified. We are interested in the column with the minimum value, so we can make this call:
df["ClusterAssignment"] = df.idxmin(axis="columns")
Which saves the column with minimum value in the new ClusterAssignment column for each row. However, that needs a tweak: call idxmin
only on the centroid distance columns, instead of all the columns available.
names = [f"DistancesToClusterCenter no.{i}" for i in range(0, 10)]df["ClusterAssignment"] = df[names].idxmin(axis="columns")
names
is a list of the centroid distance columns. This second idxmin
is called only on that subset, not all columns. Here’s the result.
We can go a step further and keep only the cluster number in ClusterAssignment, instead of the complete column name .
df["ClusterAssignment"] = df["ClusterAssignment"].map(lambda value: value.split(".")[-1])
For each name, split the string on the dot (.) and keep the last substring, i.e., the cluster number.
We already have what we wanted, so we can remove the centroid distance columns. Thankfully we have those saved in the names
list, for an easy drop
call.
df= df.drop(names, axis="columns")
We have arrived at the desired dataframe: the input features and the cluster predicted by the model.
As an extra tip, you could easily repeat this process for the column with the largest value, i.e., the centroid furthest from the data point, by using idxmax
instead.
df["MostDistantCluster"] = df[names].idxmax(axis="columns")df["MostDistantCluster"] = df["MostDistantCluster"].map(lambda value: value.split(".")[-1])df["MostDistantCluster"] = df["MostDistantCluster"].map(lambda value: value.split(".")[-1])
Which would result in this dataframe.
Here’s the complete script for this demo.