Crafting the Perfect Data Science Evaluation Metric for your Business Case

Rochita Sundar · Published in AI Science · Jun 17, 2023
Image credits: https://unsplash.com/photos/f-wBCMj4pNM.

To ensure the success of your data science initiatives, it is imperative to bridge the gap between your machine learning model's evaluation metric and the business objective from the get-go. This article demonstrates, through an example, how to craft a business-aligned evaluation metric and how doing so can facilitate stakeholder buy-in.

Consider a scenario where the data science objective is to build a classifier that accurately predicts component failures in a business setting. The target variable is binary: a value of 1 indicates a failed component and 0 indicates a working component.

Start by identifying the key stakeholders who are most likely to be impacted by the outcomes of the model you are trying to build. Work with them to understand the risks they associate with each type of misclassification error. Use this information to construct an evaluation metric that balances the trade-off between misclassification errors while aligning with the organisation’s overall business goals.

In the given scenario, different model predictions carry different cost implications. Let’s assume it costs $15,000 to repair a faulty component that the model correctly flags, $40,000 to replace a component whose failure goes undetected, and $5,000 to inspect a component that is flagged but turns out to be healthy.

The total associated maintenance cost based on ‘your’ model predictions can be expressed as:

Total Maintenance Cost based on 'your' Model Prediction =

(Failures correctly predicted by your model)*$15,000 +
(Real failures that are not detected by your model)*$40,000 +
(Failures predicted by your model that are not real failures)*$5,000

Now let’s suppose there exists a perfect classifier that detects every real failure with no false alarms. Every faulty component is then repaired at $15,000, with no replacements and no unnecessary inspections, so the lowest achievable cost is:

Minimum Cost associated with a Perfect Classifier = 

(All real failures)*$15,000 =

(Failures correctly predicted by your model +
Real failures that are not detected by your model)*$15,000

One valid evaluation metric for the business use case can then be expressed as below. The ratio ranges from 0 to 1, reaching 1 only for a perfect classifier, and our goal will be to maximise it as we select the optimal classifier.

Evaluation Metric for the Business Case:

Minimum Cost associated with a Perfect Classifier ÷
Total Maintenance Cost based on 'your' Model Prediction

Objective: maximise(evaluation metric)
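
To make the ratio concrete, here is a small worked example with hypothetical counts (the numbers are purely illustrative and not taken from the project data). Suppose there are 100 real failures, the model correctly flags 90 of them, misses 10, and raises 20 false alarms:

Total Maintenance Cost based on the Model = 90*$15,000 + 10*$40,000 + 20*$5,000 = $1,850,000

Minimum Cost associated with a Perfect Classifier = 100*$15,000 = $1,500,000

Evaluation Metric = $1,500,000 ÷ $1,850,000 ≈ 0.81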

The code for crafting this evaluation metric is given below:

from sklearn import metrics
from sklearn.metrics import confusion_matrix

def minimum_vs_model_cost(y_true, y_pred):

    # y_true is the actual target value
    # y_pred is the predicted target value

    # Compute the confusion matrix once: rows are actual labels, columns are predictions
    cm = confusion_matrix(y_true, y_pred)
    true_positives = cm[1, 1]   # failures correctly predicted by the model
    false_positives = cm[0, 1]  # predicted failures that are not real failures
    false_negatives = cm[1, 0]  # real failures that are not detected

    # Cost if every real failure were caught and repaired (perfect classifier)
    minimum_cost = (true_positives + false_negatives)*15000

    # Cost implied by the model's predictions
    model_cost = (true_positives*15000 +
                  false_negatives*40000 +
                  false_positives*5000)

    evaluation_metric = minimum_cost/model_cost

    return evaluation_metric

scorer = metrics.make_scorer(minimum_vs_model_cost, greater_is_better=True)

The code above is self-explanatory if you are familiar with the definition of a confusion matrix. A confusion matrix is a two-dimensional table that compares the actual target labels with the predicted target labels, summarising the performance of a classifier. The mapping used in the function is listed below, followed by a small toy example.

Image Source: Confusion Matrix generated by the author.

Here,

  • Failures correctly predicted by model = True Positives
  • Real failures that are not detected by model = False Negatives
  • Failures predicted by model that are not real failures = False Positives
  • All real failures = (True Positives + False Negatives)
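
As a quick sanity check, here is a toy example (the labels are made up purely for illustration, and it assumes the minimum_vs_model_cost function defined above is in scope):

import numpy as np

y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])  # 4 real failures among 10 components
y_pred = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 0])  # 1 false alarm, 1 missed failure

print(confusion_matrix(y_true, y_pred))
# [[5 1]
#  [1 3]]  -> TN=5, FP=1, FN=1, TP=3

print(minimum_vs_model_cost(y_true, y_pred))
# minimum cost = (3 + 1)*$15,000 = $60,000
# model cost   = 3*$15,000 + 1*$40,000 + 1*$5,000 = $90,000
# metric       = 60,000/90,000 ≈ 0.67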

The following box plot illustrates the performance of various baseline classifiers, using 5-fold stratified cross-validation on the training set with the custom evaluation metric “minimum_vs_model_cost”. The baseline classifiers considered in this scenario are Logistic Regression, Decision Tree, Random Forest, Bagging, Adaptive Boosting (AdaBoost), Gradient Boosting and Extreme Gradient Boosting (XGBoost) classifiers.

Image Source: Generated by the author. Box-plot distribution of the 5-fold cross validation performance on the training set for the custom evaluation metric “minimum_vs_model_cost” for different classifiers.

Below is the code snippet that was used to generate the box-plot:

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (RandomForestClassifier, BaggingClassifier,
                              AdaBoostClassifier, GradientBoostingClassifier)
from xgboost import XGBClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
import matplotlib.pyplot as plt

models = []  # Empty list to store all the models

# Appending models into the list
models.append(("Logistic Regression", LogisticRegression(random_state=1)))
models.append(("dtree", DecisionTreeClassifier(random_state=1)))
models.append(("Random forest", RandomForestClassifier(random_state=1)))
models.append(("Bagging", BaggingClassifier(random_state=1)))
models.append(("Adaboost", AdaBoostClassifier(random_state=1)))
models.append(("GBM", GradientBoostingClassifier(random_state=1)))
models.append(("Xgboost", XGBClassifier(random_state=1)))

results = []  # Empty list to store each model's CV scores
names = []    # Empty list to store the names of the models

# Loop through all models to get the mean cross-validated score
for name, model in models:
    kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)  # Setting number of splits equal to 5
    cv_result = cross_val_score(estimator=model, X=X_train, y=y_train,
                                scoring=scorer, cv=kfold)
    results.append(cv_result)
    names.append(name)
    print("{}: {}".format(name, cv_result.mean()))

# Plotting box plots for the CV scores
fig = plt.figure(figsize=(10, 4))
fig.suptitle("Algorithm Comparison")
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()

The XGBoost classifier outperforms the other classifiers on the “minimum_vs_model_cost” score and is a promising candidate for further hyperparameter tuning. You can use the code snippet below to perform hyperparameter tuning; for a comprehensive list of XGBoost parameters, please refer to this link.

# Example code to perform hyperparameter tuning
import numpy as np
from sklearn.model_selection import RandomizedSearchCV

model = XGBClassifier(random_state=1)

# Initialise ranges for the different hyperparameters
param_grid = {"n_estimators": np.arange(150, 300, 50),
              "scale_pos_weight": [5, 10],
              "learning_rate": [0.1, 0.2],
              "gamma": [0, 3, 5],
              "subsample": [0.8, 0.9]}

# Define the scorer
scorer = metrics.make_scorer(minimum_vs_model_cost, greater_is_better=True)

# Randomised search over the hyperparameter grid
randomized_cv = RandomizedSearchCV(estimator=model,
                                   param_distributions=param_grid,
                                   n_iter=20, scoring=scorer, cv=5,
                                   random_state=1, n_jobs=-1)

randomized_cv.fit(X_train, y_train)
print("Best parameters are {} with CV score={}".format(
    randomized_cv.best_params_, randomized_cv.best_score_))

If, for example, the chosen tuned classifier scores 0.8 on the testing dataset for the custom evaluation metric “minimum_vs_model_cost”, then the total maintenance cost based on the model predictions is 1/0.8 = 1.25 times the minimum cost associated with a perfect classifier. This gives a concrete way to connect the performance of the machine learning model to the organisation’s business objective.
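
As a minimal sketch of that final check, assuming a held-out test set X_test, y_test from the same train/test split used for X_train, y_train (the names are illustrative), the tuned model can be evaluated as follows:

# The best estimator is refit on the full training set by RandomizedSearchCV
best_model = randomized_cv.best_estimator_

y_test_pred = best_model.predict(X_test)
test_score = minimum_vs_model_cost(y_test, y_test_pred)

print("Test minimum_vs_model_cost: {:.2f}".format(test_score))
print("Model cost is {:.2f}x the perfect-classifier cost".format(1/test_score))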

Being able to connect machine learning models directly to business objectives enables stakeholder alignment and allows data science projects to contribute to the bottom line, propelling the business forward.

Thank you, I hope you enjoyed reading this article! If you liked the article, you can also check out the project here.
