Contextual Brand Safety - I

Sooraj Subrahmannian
Apr 15 · 6 min read

Contextual Brand Safety is an ongoing series, and this is the first post. Through the series, we walk through the steps involved in building a multi-label text classification system in industry. This post sets the stage by describing the problem and our data collection process.

Introduction

One of the most important measures we take to ensure brand safety is publisher vetting. Every new piece of content from every publisher is validated against our brand safety guidelines, and publisher content is filtered so that it does not contain any of the following:

[Figure: categories of unsafe content covered by our brand safety guidelines]

Threatening content accounts for less than 10% of our entire traffic. The problem is also multi-modal: each page consists of text, images and videos, so we tackle it with both textual and visual intelligence. In this blog post, we will discuss in detail the evolution of our textual intelligence in the context of brand safety.

To make matters harder, the problem comes at scale: we process about 40 million unique pages a day, so the solution needs to scale as well.

Goals & metrics

  1. Identify all the threatening content in our traffic.
  2. Reduce inventory loss caused by safe pages being tagged as unsafe (false positives).
  3. Scalability

To achieve the goals mentioned above, it's crucial to evaluate the system using the right metrics. Hence we look at recall and false positive rate (FPR) to evaluate goals 1 and 2 respectively.

We use Requests per second (RPS), the number of input requests completely processed in one second, to understand the scalability of the model.
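As a concrete illustration, here is a minimal sketch of how recall and FPR can be computed per label in a multi-label setting; the arrays and aggregation below are hypothetical examples, not our production code.

```python
import numpy as np

def recall_and_fpr(y_true: np.ndarray, y_pred: np.ndarray):
    """Recall and false positive rate for one binary label (0/1 arrays)."""
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return recall, fpr

# In a multi-label problem, compute the pair per label (column) and
# aggregate, e.g. macro-average across classes.
y_true = np.array([[1, 0], [1, 1], [0, 0]])  # rows: pages, cols: labels
y_pred = np.array([[1, 0], [0, 1], [1, 0]])
per_label = [recall_and_fpr(y_true[:, j], y_pred[:, j])
             for j in range(y_true.shape[1])]
```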

Previous work

Our previous system took a keyword-based approach: pages were scored by looking up their keywords in a precomputed dictionary of keyword scores built from Wikipedia.

Pros: The technique is pretty fast during inference because we only have to look up scores in a keyword dictionary.

Cons: Domain drift between our publisher content and Wikipedia lets many ambiguities creep in, and a keyword-based approach often fails at capturing context.
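For intuition, here is a minimal sketch of such a keyword-lookup scorer; the keywords, scores and threshold are toy values, not the production dictionary.

```python
# Toy keyword-score dictionary illustrating the legacy approach.
KEYWORD_SCORES = {"attack": 0.9, "outbreak": 0.7, "recipe": 0.0}
THRESHOLD = 0.5  # hypothetical decision threshold

def is_threatening(text: str) -> bool:
    """Flag a page if any of its tokens has a high threat score."""
    tokens = text.lower().split()
    scores = [KEYWORD_SCORES.get(token, 0.0) for token in tokens]
    return max(scores, default=0.0) > THRESHOLD
```

Note that the decision depends only on individual tokens, which is precisely why such a system cannot take context into account.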

Why does context matter?

Let us understand this with the example given below. The article talks about the impact of coronavirus on several industries, which is not necessarily threatening. But if we chose to tag every article mentioning COVID/coronavirus, we would end up tagging many informative, good articles as threatening.

[Figure: example article discussing the impact of coronavirus on several industries]

It is not the keyword itself that is threatening; it is the context in which the word appears that decides whether an article is threatening or not.

A study by GumGum has shown that 60% of articles that contain COVID-related keywords are safe.

In addition to lacking context, like many other legacy systems, this system had accumulated a lot of features and thresholds, which made it difficult to maintain or develop further.

To alleviate these problems, we decided to look into language models. There are a variety of language models available today, but the two types we explored are:

  1. LSTM-based — ULMFiT
  2. Transformer-based — BERT

We decided to go depth-wise rather than breadth-wise in terms of research. Thus, we explored ULMFiT in depth and compared it with our existing system and the current state of the art, BERT.
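To make the setup concrete, here is a minimal sketch of configuring a BERT model for multi-label classification with the Hugging Face transformers library; the checkpoint, label count and example text are illustrative assumptions, not our production configuration.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

NUM_LABELS = 8  # hypothetical number of brand safety categories

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=NUM_LABELS,
    problem_type="multi_label_classification",  # trains with BCEWithLogitsLoss
)

texts = ["An article about the economic impact of the coronavirus"]
batch = tokenizer(texts, truncation=True, padding=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**batch).logits
probs = torch.sigmoid(logits)  # independent per-label probabilities
```

Unlike single-label softmax classification, each label gets an independent sigmoid probability, so a page can belong to several threat categories at once.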

Data collection and preprocessing

[Figure: Data requirements for language modelling]

At GumGum, we log all our data from the past couple of months in S3, which amounts to a couple of billion documents.

How do we select representative good quality samples for training and testing?

To sample more diverse, representative and good-quality data for training and testing, we need to enrich the collected data. The data collected above is run through the following steps in PySpark on Databricks (a code sketch follows the list):

[Figure: Data enrichment process]
  1. Exact deduplication based on text
  2. Filtering based on text-based statistical metrics such as readability score, type-to-token ratio and text length

Negative readability values imply bad-quality text, but positive readability scores do not always imply good text.

[Figure: Effect of type-to-token ratio score]

Extremely high or low values of the type-to-token ratio indicate garbage text.

  3. Near-duplicate detection and removal using MinHash and LSH
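Here is a minimal PySpark sketch of the three enrichment steps; the S3 path, column names, readability/length thresholds and LSH parameters are illustrative assumptions, not our production values.

```python
from pyspark.sql import SparkSession, functions as F, types as T
from pyspark.ml.feature import Tokenizer, HashingTF, MinHashLSH

spark = SparkSession.builder.getOrCreate()
docs = spark.read.parquet("s3://bucket/docs/")  # hypothetical path; id, text columns assumed

# 1. Exact deduplication based on text
docs = docs.dropDuplicates(["text"])

# 2. Filtering on text statistics (toy thresholds)
@F.udf(T.DoubleType())
def type_token_ratio(text):
    tokens = text.lower().split()
    return float(len(set(tokens))) / len(tokens) if tokens else 0.0

docs = (docs
        .withColumn("ttr", type_token_ratio("text"))
        .filter((F.col("ttr") > 0.2) & (F.col("ttr") < 0.9))
        .filter(F.length("text") > 500))

# 3. Near-duplicate removal with MinHash + LSH on hashed token vectors
tokens = Tokenizer(inputCol="text", outputCol="tokens").transform(docs)
vectors = HashingTF(inputCol="tokens", outputCol="features").transform(tokens)
lsh = MinHashLSH(inputCol="features", outputCol="hashes", numHashTables=5)
model = lsh.fit(vectors)
pairs = model.approxSimilarityJoin(vectors, vectors, threshold=0.2,
                                   distCol="jaccard_dist")
# Drop one document from each near-duplicate pair
dupe_ids = (pairs.filter("datasetA.id < datasetB.id")
                 .select(F.col("datasetB.id").alias("id"))
                 .distinct())
docs = docs.join(dupe_ids, on="id", how="left_anti")
```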

To avoid data leakage, training and testing data are separated across time. We chose October to January for training and February to March for testing.

Data sampling

In supervised training, the samples used for training must be diverse, representative and also hard for the model to learn from. To incorporate sample difficulty, we moved towards active learning approaches for multi-label classification. We have employed different active learning methods such as:

  • Buckets based on confidence score quantiles:
    Here we sample iteratively from each score bucket for every class, as shown below, so that the previously imbalanced dataset becomes more balanced (a code sketch follows this list). This naive sampling has proven effective in our experiments, probably because the new and previous classifiers were very different, so there was a lot to learn from the previous classifier. The cons of this sampling are that it does not capture the co-occurrence probability of classes well and has no direct notion of hard samples.
[Figure: Buckets of score quantiles]
  • Uncertainty sampling based on CVIRS (Category Vector Inconsistency and Ranking of Scores). This paper extends uncertainty sampling to multi-label classification, using both ranking-based and entropy-based measures to calculate uncertainty. It is effective when we have to improve a classifier with the same architecture over and over again.
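As a concrete illustration of the bucket-based strategy, here is a minimal pandas sketch; the score-column naming, bucket count and per-bucket quota are hypothetical.

```python
import pandas as pd

def sample_by_score_buckets(df: pd.DataFrame, label: str,
                            n_buckets: int = 5,
                            per_bucket: int = 100) -> pd.DataFrame:
    """Sample evenly across confidence-score quantile buckets for one label.

    Assumes df has a column f"score_{label}" holding the previous
    classifier's confidence scores (hypothetical schema).
    """
    buckets = pd.qcut(df[f"score_{label}"], q=n_buckets,
                      labels=False, duplicates="drop")
    return (df.groupby(buckets, group_keys=False)
              .apply(lambda g: g.sample(min(len(g), per_bucket),
                                        random_state=0)))

# Iterate over all classes and pool the samples, so the previously
# imbalanced dataset becomes more balanced, e.g.:
# sampled = pd.concat(
#     [sample_by_score_buckets(df, c) for c in CLASSES]).drop_duplicates()
```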

There are many more methods out there. A complete comparison and understanding of these methods deserve another blog post.

The next step is to get the sampled data annotated and to use truth inference techniques to obtain the ground truth. This step is done iteratively with multiple annotators until the required quality is achieved.
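Truth inference can be as simple as majority voting across annotators (more sophisticated methods, such as Dawid-Skene, weight annotators by their estimated reliability). A minimal majority-vote sketch with a hypothetical annotation schema:

```python
from collections import Counter

# Hypothetical schema: one set of labels per annotator per document.
annotations = {
    "doc1": [{"violence"}, {"violence", "drugs"}, {"violence"}],
}

def majority_vote(label_sets, min_votes=2):
    """Keep labels chosen by at least min_votes annotators."""
    counts = Counter(label for labels in label_sets for label in labels)
    return {label for label, votes in counts.items() if votes >= min_votes}

ground_truth = {doc: majority_vote(sets) for doc, sets in annotations.items()}
# {'doc1': {'violence'}}
```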

Conclusion

STAY TUNED…

About Me: I graduated with a Master's in Data Science from the University of San Francisco and am an NLP Scientist at GumGum, interested in applications of NLP and Speech.

We’re always looking for new talent! View jobs.

Follow us: Facebook | Twitter | LinkedIn | Instagram
