Classification Models Utilization by Quality Metrics

Inbal Gilead · The Startup
Sep 22, 2019 · 6 min read

Machine Learning (ML) models are the heaviest-lifting data analysis tool since querying. In short, an ML model receives large volumes of data, analyzes them, and identifies patterns that, over time, enable it to make predictions on new sets of data. These models can be continuously optimized given endless data with very little, or virtually no, human intervention.

As the volume of available data grows at a staggering rate, ML enables organizations to draw insights at scale. Various types of ML models are integrated into different areas of a company's products, used not only for research and analysis but also as an integral part of the company's core value proposition.

As more and more companies use ML models in their everyday analysis work and products, the question of reliability becomes central. Because these are statistical models, there is probabilistic behavior to consider, and each type of model has its own best practices for measuring performance.

In this post, I’ll share some best practices for one of the most common ML model types.

Classification

A classification model predicts whether an observation belongs to a specific class. The common language used to describe classification models is the confusion matrix.

Each observation that goes through the classification model will fall into one of the matrix quadrants. Evaluating a dataset against this matrix enables producing different performance measurements for the model being evaluated.
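To make the matrix concrete, here is a minimal sketch (mine, not from the original post) that tallies the four quadrants for a toy set of binary labels and predictions; the data is made up for illustration:

    # Minimal sketch: counting the four confusion-matrix quadrants.
    import numpy as np

    y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])   # real classes (hypothetical)
    y_pred = np.array([1, 0, 0, 1, 1, 0, 1, 0])   # model predictions (hypothetical)

    tp = int(np.sum((y_pred == 1) & (y_true == 1)))  # True Positive
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))  # False Positive
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))  # False Negative
    tn = int(np.sum((y_pred == 0) & (y_true == 0)))  # True Negative

    print(f"TP={tp} FP={fp} FN={fn} TN={tn}")

All of the measurements discussed below are ratios of these four counts.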

Identify The Business Case

When measuring the model’s performance, it is important to note the business use case and to set the desired business criteria. It’s also important to consider the relationship between the two types of mistakes that can occur in such models:

Type 1 error — when the model produces a False Positive (FP)

Type 2 error — when the model produces a False Negative (FN)

ML is rarely used to solve simple "yes or no" cases. In complex problems, where the line between positive and negative is a function of many features, these mistakes are interdependent. Meaning, when the behavior of the data is inconclusive, then depending on the weight given to each feature, the Type 1 error can grow at the expense of minimizing the Type 2 error, and vice versa.

Let's assume I have a model that predicts whether it is safe to approve a package for delivery. An FN can be very costly here, as in the case of a bomb threat on an airplane. So we might prefer screening all packages destined to go on board, paying with resources and time in order to bring the FN level to a minimum (as a statistician, I won't say 0). We must also be willing to "pay" for it by accepting a significant FP rate (i.e. normative subjects falsely flagged as breaching security).

In other cases, like inspecting travelers' luggage coming into the country to catch tax offenders, we would likely choose a completely different approach. Considering the lower risk, we might prefer picking specific candidates to be screened, perhaps those returning from predetermined, high-profile destinations or whose luggage weighs considerably more than it did at departure, in order to save resources and unnecessary trouble. This method will cost us some FN, but will spare us the considerable resources needed to deal with significant FP rates.

Note the simplistic example charted above: the optimized line balances both mistake types. The "Minimize FN" line simulates a case such as the security checks, at the price of falsely flagging some innocent subjects as having failed the check. The "Minimize FP" line simulates a case such as transaction processing, losing less business at the expense of approving some fraudulent transactions.
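One common way this trade-off shows up in practice is through the decision threshold applied to the model's score. The sketch below is my own illustration (scores and labels are invented): lowering the threshold drives FN toward zero at the cost of more FP, and raising it does the opposite.

    # Sketch: how moving the decision threshold trades FP against FN.
    import numpy as np

    scores = np.array([0.05, 0.20, 0.35, 0.40, 0.55, 0.60, 0.80, 0.95])  # model scores
    labels = np.array([0,    0,    1,    0,    1,    0,    1,    1   ])  # real classes

    for threshold in (0.3, 0.5, 0.7):
        pred = (scores >= threshold).astype(int)
        fp = int(np.sum((pred == 1) & (labels == 0)))
        fn = int(np.sum((pred == 0) & (labels == 1)))
        print(f"threshold={threshold}: FP={fp}, FN={fn}")

    # Low threshold (screen almost everything): FN is minimized, FP grows.
    # High threshold (flag only the obvious):   FP is minimized, FN grows.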

True Negative (TN) — in most cases, this quadrant of the matrix will hold most of the observations. It is also usually a by-product of optimizing the other predictions rather than a target in itself, since we naturally focus on predicting what something is rather than what it isn't. This is also the reason why TN is underrepresented in the performance measurements suggested below, together with the understanding that the TN size heavily influences the sample: adding TN observations to the sample can cause a significant decrease in the mistake measurements simply because of their proportional share of the whole sample.

Model Quality Metrics

Evaluating the model's performance should cover both sides of the coin, and can be done either by minimizing these measurements:

False Discovery Rate (FDR) = False Positive / Predicted Positive

Miss Rate (aka False Negative Rate, FNR) = False Negative / Real Positive

Or by maximizing these:

Precision (aka Positive Predictive Value, PPV) = True Positive / Predicted Positive

Recall (aka True Positive Rate, TPR) = True Positive / Real Positive

Business-level quality criteria should define the threshold we are willing to accept per measurement. Note also that selecting which pair of measurements to report (the ones to minimize or the ones to maximize) has a lot to do with the message you want to send.
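As a quick sketch of the arithmetic (my own example counts, assuming non-zero denominators), the four metrics above come straight from the confusion-matrix counts, and each pair sums to 1:

    # Sketch: FDR, FNR, Precision and Recall from confusion-matrix counts.
    def quality_metrics(tp, fp, fn):
        predicted_positive = tp + fp
        real_positive = tp + fn
        return {
            "FDR":       fp / predicted_positive,  # False Discovery Rate
            "FNR":       fn / real_positive,       # Miss Rate
            "Precision": tp / predicted_positive,  # = 1 - FDR
            "Recall":    tp / real_positive,       # = 1 - FNR
        }

    print(quality_metrics(tp=80, fp=20, fn=10))
    # {'FDR': 0.2, 'FNR': 0.111..., 'Precision': 0.8, 'Recall': 0.888...}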

Optimization Tools

Sometimes, even after leveraging all the available data to create the best possible predictions, the model still can't reach the criteria we set upfront. At that point, there are a few things you can consider in order to measure yourself against your goal.

Use case frequency

Remember the TN we discussed?

Consider a use case that appears once every 500K observations, or another that makes up 90% of the observations. In such cases, evaluating our measurements over a sample that represents the real use-case frequency can help shed light on whether the criteria are met (or on the size of the gap). I recommend adding the following measurements to your set, to help determine the actual FDR and FNR you can expect on a "natural" population of observations:

Fall-out (aka False Positive Rate, FPR) = False Positive / Real Negative

False Omission Rate (FOR) = False Negative / Predicted Negative

As before, instead of minimizing the FPR and FOR, you can also choose to maximize their complementary measurements:

Specificity (aka True Negative Rate, TNR) = True Negative / Real Negative

Negative Predictive Value (NPV) = True Negative / Predicted Negative
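These negative-side measurements follow the same pattern as before, only with the TN quadrant in the numerator or the negative totals in the denominator. A small sketch with invented counts (a frequent use case, so TN dominates):

    # Sketch: the negative-side metrics for a sample at real use-case frequency.
    def negative_side_metrics(tn, fp, fn):
        real_negative = tn + fp
        predicted_negative = tn + fn
        return {
            "FPR": fp / real_negative,        # Fall-out
            "FOR": fn / predicted_negative,   # False Omission Rate
            "TNR": tn / real_negative,        # Specificity (= 1 - FPR)
            "NPV": tn / predicted_negative,   # Negative Predictive Value (= 1 - FOR)
        }

    print(negative_side_metrics(tn=9800, fp=20, fn=10))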

Audit population

In some cases, we can clearly identify a set of observations for which the measurements come out significantly different than for the rest. Sometimes, it can pay off to single this group out of the sample and call it by name (I chose "Audit"). By doing so, you may gain improved measurements for the majority of your observations while identifying the specific population that behaves differently. These groups can often be optimized using an additional set of tools (maybe even a different model) in order to reach the same criteria as the other observations.

In order to decide whether you have an “audit worthy” use case on your hands, I recommend the following steps:

  1. Define the audit criteria
  2. Signal the audit observations out of the sample
  3. Evaluate the core sample measurements
  4. Evaluate the audit sample measurements

If the measurements are significantly different between the groups, you have an "audit worthy" use case. You can then start analyzing which tools are best for further improving the audit population towards your set criteria.

I also recommend defining a maximum audit size measurement, in order to make sure the use case really is suitable for an audit analysis rather than simply unsuitable for the initial model you picked (if, for instance, 80% of the observations fall under "audit"):

Audit size = Audit / Total Population
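A rough sketch of the steps above, with a hypothetical audit criterion and record layout (nothing here comes from the original post):

    # Sketch: splitting an "audit" population out of the sample and comparing metrics.
    def split_audit(observations, is_audit):
        """Split observations into (core, audit) by an audit-criteria predicate."""
        core  = [o for o in observations if not is_audit(o)]
        audit = [o for o in observations if is_audit(o)]
        return core, audit

    def error_counts(observations):
        """Tally FP and FN for a list of {'pred': 0/1, 'true': 0/1} records."""
        fp = sum(1 for o in observations if o["pred"] == 1 and o["true"] == 0)
        fn = sum(1 for o in observations if o["pred"] == 0 and o["true"] == 1)
        return fp, fn

    # Usage (hypothetical "destination" field as the audit criterion):
    # core, audit = split_audit(sample, lambda o: o["destination"] in HIGH_PROFILE)
    # audit_size = len(audit) / len(sample)   # keep below your maximum audit size
    # compare error_counts(core) with error_counts(audit) to decide if it is "audit worthy"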

That’s it!

Working with big data, we develop models over random data sets that represent a large population. Those models are often eventually applied to a specific population attached to a business use case, which at times can behave differently from the original population as a whole.

It is especially for those cases that we need these optimization tools, to make sure our research applies to the destination population and to ensure a smooth transition from lab development to real-life utilization.
