Measuring the performance of ML classification

A new publication specifies methodologies for measuring the classification performance of machine learning models, systems and algorithms

Mike Mullane
e-tech
5 min read · Oct 18, 2022


Image: the words "machine learning" on a black canvas. Image by Pete Linforth from Pixabay.

Classification is about categorizing objects such as documents or images into classes and subclasses according to their characteristics. A simple example is an email spam filter, which classifies incoming messages as either ‘spam’ or ‘not spam’. The classifier needs examples of ‘spam’ and ‘not spam’ emails to learn how to perform the task by recognizing patterns.
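As a rough illustration of how such a classifier learns from labelled examples, here is a minimal sketch using scikit-learn and a handful of made-up emails. It is only a toy; a real spam filter would be trained and evaluated on far more data.

```python
# Minimal sketch of a binary 'spam' / 'not spam' classifier (illustrative only;
# a real filter needs far more data and careful evaluation).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training examples labelled 1 = spam, 0 = not spam.
emails = [
    "Win a free prize now", "Cheap loans, click here",
    "Meeting agenda for Monday", "Your invoice is attached",
]
labels = [1, 1, 0, 0]

# Learn word patterns from the labelled examples.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(emails, labels)

print(model.predict(["Free prize inside"]))    # likely [1] -> spam
print(model.predict(["Agenda for our call"]))  # likely [0] -> not spam
```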

Evaluating the performance of the classifier is essential to improving accuracy and reducing bias. Machine learning (ML) models, for example, can be trained in a way that amplifies sexist and racist biases from the real world. In recent years, we have seen image recognition software that failed to identify non-white faces correctly. Similarly, biased data samples can teach machines that women shop and cook, while men work in offices and factories. This kind of problem can occur when the scientists who train models unwittingly introduce their own prejudices into their work.

Biases can also occur when a sample is collected in such a way that some members of the intended statistical population are less represented than others. In other words, bias arises when the data used to train a model does not accurately reflect the environment in which it will operate.

A sampling bias could be introduced, for instance, if an algorithm used for medical diagnosis is trained only on data derived from one population. Similarly, if an algorithm meant to operate self-driving vehicles all year round is trained only on data collected during the summer months, falling snowflakes might confuse the system.

Systematic value distortion

Systematic value distortion occurs when the true value of a measurement is systematically overstated or understated. This kind of error usually occurs when there is a problem with the device or process used to make the measurements.

On a relatively simple level, measurement errors might occur if training data is captured on a camera that filters out some colours. Often the problem is more complex.

In health care, for instance, it is difficult to implement a uniform process for measuring patient data from electronic records. Even superficially similar records may be hard to compare, because a diagnosis usually requires interpreting test results and making several judgements at different stages in the progression of a disease, and the timing of the initial decision depends on when a patient first felt unwell enough to see a doctor. An algorithm must also take each patient’s past medical history into account, along with all of these other variables, in order to make an accurate prognosis.

Algorithmic bias

Algorithmic bias is what happens when an ML system reflects the values of the people who developed or trained it. For example, confirmation bias may be built into an algorithm if the aim, whether intentional or unintentional, is to prove an assumption or opinion. This might happen in a business, journalistic, or political environment, for example.

There have been several high-profile cases of algorithmic bias related to social media and search engines, as well as in the field of corporate recruitment.

New international standard

The joint IEC and ISO committee on AI, SC 42, has developed a new Technical Specification — ISO/IEC TS 4213 — that specifies methodologies for measuring the classification performance of ML models, systems and algorithms. It describes methodologies for binary classification, such as the spam filter example given above, as well as multi-class and multi-label classification use cases. ISO/IEC TS 4213 sets out consistent approaches and methods that can be applied to compare results more efficiently across different evaluation regimes. The new TS builds on foundational concepts in the recently published ISO/IEC 22989, which covers AI concepts and terminology.
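For readers less familiar with the three settings mentioned above, the sketch below shows how labels are commonly represented and scored in each case. It uses ordinary scikit-learn conventions and made-up labels, not anything prescribed by the TS itself.

```python
# Illustrative label formats for binary, multi-class and multi-label classification.
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Binary: each item belongs to one of two classes (e.g. spam vs not spam).
y_true_bin = np.array([0, 1, 1, 0])
y_pred_bin = np.array([0, 1, 0, 0])
print("binary accuracy:", accuracy_score(y_true_bin, y_pred_bin))

# Multi-class: each item belongs to exactly one of several classes.
y_true_mc = np.array([0, 2, 1, 2])
y_pred_mc = np.array([0, 1, 1, 2])
print("multi-class macro F1:", f1_score(y_true_mc, y_pred_mc, average="macro"))

# Multi-label: each item may carry several labels at once (one column per label).
y_true_ml = np.array([[1, 0, 1], [0, 1, 0]])
y_pred_ml = np.array([[1, 0, 0], [0, 1, 0]])
print("multi-label micro F1:", f1_score(y_true_ml, y_pred_ml, average="micro"))
```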

“Computational algorithms are at the heart of artificial intelligence systems. This novel standard facilitates consistent and fair outcomes across different algorithmic approaches,” said SC 42 Chair Wael William Diab. “This publication reinforces the committee’s goal of enabling broad responsible adoption of AI and complements an extensive portfolio of work that addresses the entire AI ecosystem.”

Project Leader Michel Thieme said, “As academic, commercial and governmental practitioners continue to improve machine learning models, there is a real need for consistent approaches and methods to be applied to machine learning classification performance assessment. This new publication will benefit a wide range of stakeholders”.

ISO/IEC TS 4213 specifies methodologies for measuring classification performance of machine learning models, systems and algorithms. It helps to answer questions such as the following (a brief sketch after the list illustrates some of them):

· How “good” is the model?
· How reliable are its predictions?
· What is the expected frequency and size of errors?
· What is the best-performing model out of N alternatives?
· Does the model perform well over time with noisy or new production data?
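As a loose illustration of the measurements behind questions like these, the sketch below trains two candidate models on synthetic data and compares their accuracy and error patterns with scikit-learn. It shows common practice, not the procedure specified in the TS.

```python
# Comparing candidate classifiers on held-out data: accuracy answers "how good?",
# the confusion matrix shows the frequency and kind of errors, and the best of
# N alternatives is whichever scores highest on the held-out set.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "tree": DecisionTreeClassifier(random_state=0),
}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    print(name, "accuracy:", accuracy_score(y_test, preds))
    print(name, "confusion matrix:\n", confusion_matrix(y_test, preds))
```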

The TS defines evaluation as the “process of comparing the classification predictions made by the model on data to the actual labels in the data.” Evaluation concepts in ISO/IEC TS 4213 include the following (a couple of them are illustrated after the list):

· Data representativeness and bias
· Pre-processing
· Training data
· Test and validation data
· Cross-validation
· Limiting information leakage
· Limiting channel effects
· Ground truth
· ML algorithms, hyperparameters and parameters
· Evaluation environment
· Acceleration
· Appropriate baselines
· Putting performance in context.
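To make a couple of these concepts concrete, the sketch below uses scikit-learn to run stratified cross-validation with the pre-processing step kept inside the pipeline, one common way of limiting information leakage. It illustrates the ideas in practice rather than the TS’s own wording.

```python
# Cross-validation with pre-processing inside the pipeline, so statistics from
# a test fold never leak into training of that fold.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced two-class data (roughly 80/20).
X, y = make_classification(n_samples=500, weights=[0.8, 0.2], random_state=0)

# The scaler is re-fitted on each training fold, limiting information leakage.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Stratified folds preserve the class proportions in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipeline, X, y, cv=cv)
print("fold accuracies:", scores, "mean:", scores.mean())
```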

“Training data can be skewed, incomplete, outdated, disproportionate or have embedded historical biases. Such unwanted biases can propagate biases present in the training data and are detrimental to model training,” said Project Leader Lingzhong Meng.

“Moreover, training data for a particular task might not be extensible to different tasks. Extra care should be taken when splitting unbalanced data into training and test to ensure that similar distributions are maintained between training, validation and test set.”
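One common way to follow that advice in practice is stratified splitting, which keeps class proportions similar across the training, validation and test sets. The sketch below is a minimal illustration with scikit-learn and synthetic data, not a procedure taken from the TS.

```python
# Stratified three-way split of imbalanced data: the minority-class share stays
# roughly the same in the training, validation and test sets.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# First carve off a test set, then split the remainder into train/validation,
# stratifying on the labels each time.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=0)

for name, labels in [("train", y_train), ("val", y_val), ("test", y_test)]:
    print(name, "minority share:", np.mean(labels == 1).round(3))
```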

ISO/IEC TS 4213 is part of a wide spectrum of approaches to ensure fairness and reduce bias in the increasingly prevalent ML systems on which society relies. These systems touch all aspects of modern life. For instance, analysts estimate that three-quarters of job applications in the US are processed by algorithms, and many banks use artificial intelligence to make loan and credit evaluations. Left unchecked, bias can lead to a host of unfair decisions, from unwarranted job rejections to wrongly denied bank loans.

SC 42 develops international standards for AI. Its unique holistic approach considers the entire AI ecosystem, looking at technology capability and non-technical requirements such as business, regulatory and policy requirements, application domain needs, and ethical and societal concerns. SC 42 also organizes a twice-yearly ISO/IEC AI Workshop series that is freely available. Archives of the inaugural workshop from May and registration for the upcoming workshop in November can be found on the workshop series website.
