Taking a fresh look at metrics for Classification Tasks at GumGum — Introduction

Photo by Maarten van den Heuvel on Unsplash

At GumGum, we train machine learning models on unstructured data like text and images. An integral part of model development is model evaluation. For classification tasks in particular, there are many metrics to choose from; the best known are accuracy, precision, recall, the F-beta score, and the ROC AUC. Each of them gives a slightly different picture of a model’s performance. With this series, I want to give these metrics a fresh look and dive into what each of them can do for you and, even more importantly, what it cannot do for you.

Blogs in this Series:

  1. Taking a fresh look at metrics for Classification Tasks at GumGum — Introduction
  2. Taking a look at Accuracy, Precision, and Recall for Classification tasks
  3. Taking a close look at Precision for classification tasks


In this blog post, I want to give an introduction to what classification is and to the most important classification tasks GumGum is working on. This introduction lays the groundwork for the following posts in this series, where I will go in-depth into different classification metrics, their upsides and downsides, and why we use them at GumGum.

What is classification?

Classification is the task of mapping input data to a set of predefined categories. At a high level, we distinguish between binary classification problems and multiclass classification problems.
In binary classification, we have exactly two categories. For example, in image classification we might want to identify cats in images, so we would have the two categories ‘cat’ and ‘not cat’.

In the multiclass problem, we have strictly more than two categories. For example, in image classification we might want to identify all animals in our images; we will then have a class for cats, one for dogs, one for birds, and so on.

The multiclass case can then be split into two situations: multiclass and multiclass-multilabel. In plain multiclass classification, exactly one of the labels can be present per sample (e.g. if every image contains exactly one animal, only one class applies, and the classifier predicts a single animal per image).

Multilabel is the situation in which multiple classes can be present at once. If our images can contain more than one animal, we are solving a multilabel classification problem, and our classifier should be able to return more than one label at a time. If you are interested in reading more about multilabel vs. multiclass, you can read the following blog post.
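The three flavors are easiest to see in how the labels themselves are represented. Below is a minimal sketch in plain Python (the animal classes and label arrays are illustrative, not from a GumGum dataset): one string label per sample for binary and multiclass, and a binary indicator matrix for multilabel.

```python
# Binary: exactly two categories, one label per sample.
binary_labels = ["cat", "not cat", "cat"]

# Multiclass: more than two categories, still exactly one label per sample.
multiclass_labels = ["cat", "dog", "bird"]

# Multilabel: each sample can carry several labels at once.
# A common encoding is a binary indicator matrix:
# one row per image, one column per class.
classes = ["bird", "cat", "dog"]
multilabel_labels = [{"cat", "dog"}, {"bird"}, {"cat"}]

indicator = [
    [1 if c in labels else 0 for c in classes]
    for labels in multilabel_labels
]
print(indicator)  # [[0, 1, 1], [1, 0, 0], [0, 1, 0]]
```

In practice, a library helper such as scikit-learn’s `MultiLabelBinarizer` produces the same indicator matrix; multilabel metrics are then typically computed per class over its columns.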

At GumGum, we mostly deal with multiclass-multilabel problems. The NLP & CV teams work on contextual classification tasks, including Brand Safety, which I will introduce in the next section, and IAB content classification.

What is Brand Safety?

Brand Safety — An umbrella for brands. Photo by Aline de Nadai on Unsplash

In the online advertising business, Brand Safety is very important for brands. They do not want their ads to be shown right next to inappropriate or brand-unsafe content, as this associates the brand name with that content. Inappropriate or brand-unsafe content includes violence, crime, gore, disasters, death, and sexual content. Brand Safety is therefore used for anti-targeting, working as a filter that prevents ad placements on brand-unsafe pages.

In the last couple of years, a general Brand Safety taxonomy called the ‘Brand Safety Floor + Suitability Framework’ has been established by GARM. The above link outlines the exact taxonomy created by GARM (previously, the 4 A’s). GumGum offers a Brand Safety feature that detects brand-unsafe content on pages and, with it, offers Brand Safety to our clients.

For GumGum, it is essential to quantify performance on the Brand Safety classification task both at a binary level (“safe” vs. “unsafe”) and on the multiclass-multilabel problem of identifying the different categories outlined in the ‘Brand Safety Floor + Suitability Framework’. In addition, Brand Safety classification is an imbalanced data problem, as only about 16% of total traffic is brand-unsafe.

What is IAB content classification?

In addition to Brand Safety classification, the NLP team supports the IAB content taxonomy, a case of extreme classification with ~700 classes in total, which is used for ad-targeting purposes.

For instance, a soccer-shoe brand might want to place ads for its new shoes only on pages containing soccer-related content. The IAB content taxonomy has a soccer class that can be used to target soccer-related content exclusively.

The classes range from Business and Finance, through Automobile, to Hobbies and Interests, Movies, and Music and Entertainment. The taxonomy is designed to cover the main targeting needs of online advertising.


In this blog post, I introduced the different flavors of classification tasks and gave an introduction to Brand Safety classification and IAB content taxonomy classification, the most important classification tasks at GumGum.

In my next blog post, I will introduce the three most commonly used classification metrics and go over their upsides and drawbacks in depth. The metrics in question are accuracy, precision, and recall.

I will also explain why accuracy is not the right metric for GumGum and how precision and recall can overcome accuracy’s shortcomings.
