Contextual brand safety is an ongoing series in which we walk through the steps involved in building a multi-label text classification system in industry. This first post sets the stage by describing the problem and our data collection process.
GumGum is dedicated to ensuring a brand-safe environment for all our clients, advertisers and publishers alike. To implement brand safety, we use a variety of methods that help ensure we deliver ads on safe, relevant and high-quality content.
One of the most important measures we take to ensure brand safety is publisher vetting. Each new piece of content under every publisher is validated against our brand safety guidelines, and publisher content is filtered out if it contains any of the following:
The prevalence of threatening content is less than 10% of our entire traffic. The problem is also multi-modal: each page consists of text, images and videos, so we tackle it with both textual and visual intelligence. In this blog post, we discuss in detail the evolution of our textual intelligence in the context of brand safety.
To add fuel to the fire, the scale of the problem makes it even harder: we process about 40 million unique pages a day, so the solution needs to scale as well.
Goals & metrics
Our goal from a modelling perspective is threefold:
- Identify all the threatening content in our traffic.
- Reduce the inventory lost when a safe page is tagged as unsafe (false positives).
- Scale to the volume of our traffic.
We use Requests per second (RPS), the number of input requests completely processed in one second, to understand the scalability of the model.
The previous system was based on Explicit Semantic Analysis (ESA), which uses keyword scores derived from Wikipedia to decide whether a page is threatening or not.
Pros: The technique is fast at inference time because it only requires a lookup in a dictionary of keyword scores.
Cons: Domain drift between our publisher content and Wikipedia lets many ambiguities creep in, and a keyword-based approach often fails to capture context.
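In spirit, the keyword-lookup approach can be sketched as below. The scores and threshold here are made-up placeholders, not the actual ESA weights derived from Wikipedia:

```python
# Hypothetical keyword scores; the real system derives these from Wikipedia.
THREAT_SCORES = {"attack": 0.9, "virus": 0.4, "recipe": 0.0}
THRESHOLD = 0.5  # made-up decision threshold

def is_threatening(page_text):
    """Flag a page when the average score of its known keywords
    crosses the threshold: fast, but blind to context."""
    tokens = page_text.lower().split()
    hits = [THREAT_SCORES[t] for t in tokens if t in THREAT_SCORES]
    return bool(hits) and sum(hits) / len(hits) >= THRESHOLD
```

The speed comes from the single dictionary lookup per token, but the same keywords score the same regardless of the article around them.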
Why does context matter?
Let us understand this with an example. Consider an article that talks about the impact of coronavirus on several industries. This is not necessarily a threatening article, but if we chose to tag every article mentioning COVID/coronavirus, we would end up tagging many informative, good articles as threatening.
It is not the keyword itself that is threatening; it is the context in which the word appears that decides whether an article is threatening or not.
A study by GumGum has shown that about 60% of articles that contain COVID-related keywords are safe.
> Study: 60% Of Content Related To COVID-19 Considered Brand Safe
> Of the 2.8 million online pages containing COVID-19 related keywords across GumGum's publisher network, 62% were…
Beyond the context problem, like any legacy system, this one had accumulated many features and thresholds, which made it difficult to maintain or develop further.
To alleviate the above problems, we decided to look into language models. A wide variety of language models exist today, but the two we explored are ULMFiT and BERT.
We decided to go deep rather than broad in our research: we explored ULMFiT in depth and compared it with our existing system and the current state of the art, BERT.
Data collection and preprocessing
Data collection is a crucial step. Since we are interested in language models, we need data for two tasks: unsupervised fine-tuning of the language model (domain adaptation) and supervised training of the classifier.
At GumGum, we log all our data from the past couple of months in S3, which amounts to a couple of billion documents.
How do we select representative good quality samples for training and testing?
To sample more diverse, representative and good-quality data for training and testing, we first need to enrich the collected data. The collected data is run through the following steps in PySpark on Databricks:
1. Readability scoring: negative readability values imply bad-quality text, but positive readability scores do not always imply good text.
2. Type-to-token ratio: extremely high or low values indicate garbage text.
3. Near-duplicate detection and removal using MinHash and LSH.
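A minimal sketch of the type-to-token-ratio and near-duplicate steps above; the shingle size and hash count are illustrative choices, and production runs Spark-scale implementations rather than this pure-Python toy:

```python
import hashlib

def type_token_ratio(text):
    """Unique tokens / total tokens; extreme values flag garbage text."""
    tokens = text.lower().split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def shingles(text, k=3):
    """Overlapping k-token shingles used as the document's feature set."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + k]) for i in range(max(len(tokens) - k + 1, 0))}

def minhash_signature(items, num_perm=64):
    """One minimum per seeded hash function approximates a random permutation."""
    if not items:
        return [0] * num_perm
    return [
        min(int.from_bytes(hashlib.md5(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in items)
        for seed in range(num_perm)
    ]

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Pairs whose estimated Jaccard similarity exceeds a threshold are treated as near-duplicates and removed; LSH makes finding those pairs sub-quadratic at scale.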
To avoid data leakage, training and testing data are separated across time. We chose October to January for training and February to March for testing.
Different sampling techniques are employed for unsupervised and supervised training. Unsupervised training performs domain adaptation, which helps the encoder learn the nuances of the target domain (production data) that differ from the source domain (Wikipedia). For this purpose, we sampled about 250,000 articles using restricted random sampling: random samples drawn from the entire data, but with a cap on the number of samples per top-level domain, to limit the contribution of over-represented domains in the enriched data.
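The restricted random sampling described above can be sketched as follows; the function name and cap value are hypothetical:

```python
import random
from collections import defaultdict

def capped_sample(docs, n_samples, per_domain_cap, seed=0):
    """Restricted random sampling: draw documents at random, but cap
    how many may come from any single top-level domain.

    `docs` is a list of (domain, text) pairs.
    """
    rng = random.Random(seed)
    shuffled = list(docs)
    rng.shuffle(shuffled)
    sample, per_domain = [], defaultdict(int)
    for domain, text in shuffled:
        if per_domain[domain] < per_domain_cap:
            per_domain[domain] += 1
            sample.append((domain, text))
        if len(sample) == n_samples:
            break
    return sample
```

Because the shuffle happens first, the result is still random within each domain; the cap only stops a dominant publisher from crowding everyone else out.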
In supervised training, the samples must be diverse, representative and also hard for the model to learn from. To incorporate sample difficulty, we turned to active learning approaches for multi-label classification. We have employed different active learning methods, such as:
- Buckets based on confidence score quantiles:
Here we sample iteratively from each of the score buckets for all the classes, so that the previously imbalanced dataset becomes more balanced. This naive sampling has proven effective in our experiments, probably because the new and previous classifiers were very different, so there was a lot to learn from the previous classifier. The cons of this sampling are that it does not capture the co-occurrence probability of classes well and has no direct notion of hard samples.
- Uncertainty sampling based on CVIRS (Category Vector Inconsistency and Ranking of Scores). This paper extends uncertainty sampling to multi-label classification, using ranking as well as entropy-based measures to calculate uncertainty. It is effective when we have to improve a classifier with the same architecture over and over again.
There are many more methods out there; a complete comparison and understanding of them deserves another blog post.
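As an illustration, the quantile-bucket sampling can be sketched for a single class as below (the real setup iterates over all classes); the bucket count and helper names are hypothetical:

```python
import statistics

def quantile_buckets(scored_docs, n_buckets=4):
    """Split (doc_id, score) pairs into buckets by score quantile,
    using the previous classifier's confidence scores."""
    cuts = statistics.quantiles([s for _, s in scored_docs], n=n_buckets)
    buckets = [[] for _ in range(n_buckets)]
    for doc_id, score in scored_docs:
        # Count how many cut points the score exceeds to find its bucket.
        buckets[sum(score > c for c in cuts)].append(doc_id)
    return buckets

def round_robin_sample(buckets, n):
    """Take one doc from each bucket in turn, so the final sample is
    spread across the score range instead of dominated by one bucket."""
    sample, i = [], 0
    while len(sample) < n and any(buckets):
        bucket = buckets[i % len(buckets)]
        if bucket:
            sample.append(bucket.pop(0))
        i += 1
    return sample
```

Drawing evenly across score buckets is what pushes an imbalanced label distribution toward balance, at the cost of ignoring class co-occurrence.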
The next step is to have the sampled data annotated and to use truth-inference techniques to obtain the ground truth. This step is done iteratively with multiple annotators until the required quality is achieved.
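As a minimal stand-in for truth inference, a per-label majority vote over annotators might look like this; real truth-inference methods (for example, ones that weight annotators by estimated reliability) are more sophisticated:

```python
from collections import Counter

def majority_vote(annotations, min_agreement=0.5):
    """Infer per-label ground truth from several annotators' label sets.

    `annotations` is one set of labels per annotator; a label is kept
    when strictly more than `min_agreement` of annotators applied it.
    """
    n_annotators = len(annotations)
    votes = Counter(label for labels in annotations for label in labels)
    return {label for label, count in votes.items()
            if count / n_annotators > min_agreement}
```

For example, with three annotators tagging `{"violence", "drugs"}`, `{"violence"}` and `{"violence", "hate"}`, only `violence` clears the majority bar and survives into the ground truth.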
In this blog post, we discussed different techniques to enrich text data and to sample data for training and testing. In the next post, we will discuss the experiments we conducted and the steps we took to put the new model into production.