Contextual Brand Safety - II

Contextual Brand Safety Cover picture

Contextual Brand Safety is an ongoing series, and this is the second post in it. In this series, we walk through the steps involved in building a multi-label text classifier in an industry setting. This post covers model training and evaluation.

1. Introduction

2. Experimental setup

  1. Data loading and preprocessing (1 step)
  2. Model training (1 step for each layer)
Model training ecosystem

We used Weights & Biases for hyperparameter tuning and visualization. For compute, we used a GPU-enabled AWS EC2 instance to run experiments in parallel, connected to an AWS EFS volume that stored the experiment results. We used the fastai and PyTorch frameworks for rapid prototyping and experimentation.

Although we tried and tested several machine learning and deep learning models, we will discuss only the two best ones, ULMFiT and BERT, and the experiments conducted with them.

3. Experiments with ULMFiT

3.1 Language model experiments:

  1. Pretraining language model on Wikipedia
  2. Fine-tuning language model on production data
  3. Training a classifier on production data
ULMFiT approach towards language modeling

In all our experiments, we used the language model pretrained on WikiText-103. While fine-tuning the language model, we experimented with two hyperparameters:

  1. Dropout multiplier: a multiplier applied to the model's default dropout values
  2. Vocabulary size: the number of words in the vocabulary used to build the model
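
To make the two hyperparameters concrete, here is a minimal sketch in plain Python. It assumes that the dropout multiplier scales every dropout probability in the model configuration and that the vocabulary keeps the most frequent tokens up to a size cap. The configuration values, the function names, and the toy corpus are illustrative assumptions, not the settings used in these experiments.

```python
from collections import Counter

# Illustrative default dropout probabilities for an AWD-LSTM-style model.
# These numbers are assumptions for the sketch, not the experiment settings.
DEFAULT_CONFIG = {
    "emb_sz": 400,
    "n_hid": 1152,
    "input_p": 0.25,
    "hidden_p": 0.15,
    "output_p": 0.1,
    "embed_p": 0.02,
    "weight_p": 0.2,
}


def apply_dropout_multiplier(config, drop_mult):
    """Return a copy of config with every '*_p' dropout scaled by drop_mult."""
    return {
        key: value * drop_mult if key.endswith("_p") else value
        for key, value in config.items()
    }


def build_vocab(tokenized_texts, max_vocab, min_freq=1):
    """Keep the max_vocab most frequent tokens seen at least min_freq times."""
    counts = Counter(tok for text in tokenized_texts for tok in text)
    return [tok for tok, n in counts.most_common(max_vocab) if n >= min_freq]


# A multiplier of 0.3 shrinks every dropout; layer sizes stay untouched.
scaled = apply_dropout_multiplier(DEFAULT_CONFIG, drop_mult=0.3)

# Toy tokenized corpus (hypothetical); only the 3 most frequent tokens are kept.
corpus = [
    ["the", "ad", "is", "safe"],
    ["the", "ad", "is", "not", "safe"],
    ["the", "page"],
]
vocab = build_vocab(corpus, max_vocab=3)
```

A lower multiplier means weaker regularization, and a smaller vocabulary cap means more words fall out of vocabulary, which is why both are worth sweeping against corpus size.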

Dropout multiplier


In the plot above, we can observe the following:

  1. As the dropout multiplier increases, performance degrades at every data size
  2. With more data, the degradation is slower

There is a perceivable difference in performance at the larger data size. However, the evaluation sets for the 100k and 200k runs are different, although drawn from the same distribution, so it is inconclusive whether there is a real difference and, if so, how large it is.

Vocabulary size

We conducted another experiment to understand how language model (LM) performance varies with vocabulary size at different corpus sizes. The plot below shows the result.


In the plot above, we can observe the following:

  1. With a corpus size of 100K, performance increases initially and then decreases gradually, whereas with a corpus size of 250K, performance decreases monotonically as the vocabulary size grows.

An increase in vocabulary size can be seen as an increase in model complexity. With the 100K corpus, the added complexity initially improved LM performance, but performance later dropped due to overfitting. With the 250K corpus, the larger amount of data lets the model generalize with fewer features, i.e., a smaller vocabulary.

3.2 Experiments with Classifier

We conducted a total of 104 experiments while training our classifier, which is stage 3 of the ULMFiT approach. During this stage, we used Bayesian hyperparameter tuning to find the best set of hyperparameters for the task at hand. We tuned the following:

  1. Language model encoders
  2. Final layer dropouts
  3. Hidden layer size in the classifier
Variation of brand safety performance with changes in dropout, classifier hidden layer size, and encoder

Note on the encoder naming convention: a language model encoder trained with a dropout multiplier of 0.3 is named fwd_enc_fp16_0.3_AWD.
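
As a rough illustration, a Bayesian search over these three hyperparameters could be declared in the dictionary format that Weights & Biases sweeps accept. The metric name, value ranges, hidden sizes, and the set of encoder names below are assumptions made for this sketch. Only the naming pattern comes from the note above.

```python
# Hypothetical Bayesian sweep definition. With the wandb package installed,
# a dictionary like this could be passed to wandb.sweep(...) and executed
# with wandb.agent(...); neither call is made here.
sweep_config = {
    "method": "bayes",  # Bayesian optimization instead of grid or random search
    "metric": {"name": "valid_f1_micro", "goal": "maximize"},  # assumed metric name
    "parameters": {
        # Treat the fine-tuned encoder itself as a categorical hyperparameter.
        "encoder": {
            "values": [f"fwd_enc_fp16_{m}_AWD" for m in (0.3, 0.5, 0.7)]
        },
        # Dropout before the final classification layer (assumed range).
        "final_dropout": {"distribution": "uniform", "min": 0.1, "max": 0.6},
        # Hidden layer size of the classifier head (assumed choices).
        "hidden_size": {"values": [50, 100, 200]},
    },
}


def encoder_drop_mult(encoder_name):
    """Recover the dropout multiplier from a name like fwd_enc_fp16_0.3_AWD."""
    return float(encoder_name.split("_")[3])
```

Encoding the dropout multiplier in the encoder name makes it possible to group sweep results by encoder without extra bookkeeping.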

Key observations

  1. An encoder trained with a higher dropout multiplier performs best when paired with a lower classifier dropout, and vice versa.
  2. The best-performing encoder may or may not be part of the best-performing classifier.

Keep a diverse set of similarly performing encoders from the fine-tuning stage and include the choice of encoder as a hyperparameter when tuning the classifier to get the best performance.

The characteristics of the dataset and the encoder and classifier hyperparameters need to be tuned together: even though we develop the classifier in three steps, each step depends on the outcomes of the previous steps.

4. Comparison of ULMFiT vs. BERT

1. Multi-label performance (micro-averaged)

This dataset is less imbalanced across classes and is labelled for all types of threatening content.

dataset-1: Multilabel threat
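
For readers unfamiliar with micro-averaging, the sketch below shows how micro-averaged precision, recall, and F1 are computed for multi-label predictions: true positives, false positives, and false negatives are pooled across all labels before the ratios are taken. The function name and the toy label matrices are illustrative and are not taken from our datasets.

```python
def micro_prf(y_true, y_pred):
    """Micro-averaged precision, recall, and F1 for multi-label 0/1 matrices.

    Each row is one document; each column is one label.
    """
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            tp += int(t == 1 and p == 1)
            fp += int(t == 0 and p == 1)
            fn += int(t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy example: 3 documents, 3 threat labels (hypothetical data).
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 1], [1, 0, 0]]
precision, recall, f1 = micro_prf(y_true, y_pred)
# Pooled counts: tp=3, fp=1, fn=2, so precision=0.75 and recall=0.6
```

Because the counts are pooled, frequent labels weigh more than rare ones, which is worth keeping in mind when comparing models on datasets with different class balance.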

2. Binary performance

The performance was calculated on a dataset that mimics our production distribution. Only 10%–15% of the entire corpus has threatening content.

dataset-2: Binary threat

On dataset-1, BERT has better recall, which is good for our task, but its recall is worse on the production dataset (dataset-2). Even though BERT is a state-of-the-art (SOTA) model, it is unable to outperform ULMFiT here; its performance is similar, if not worse. This trend could be attributed to the following reasons:

  1. Transformers outshine LSTMs on medium to long text because they are able to capture longer-term dependencies. The median text length of our production traffic is about 50 words, which is not very long, and LSTMs handle it well.
  2. The ULMFiT encoders were fine-tuned in an unsupervised way to capture the language differences between the pretraining corpus, WikiText-103, and the production traffic, whereas BERT was not given any such special treatment.

Memory footprint & Throughput comparison


The throughput of a system is measured as the number of messages completed per second. ULMFiT is about 10 times faster than BERT, which means that serving the same traffic with BERT in production would cost roughly 10 times more than with ULMFiT.

Both ULMFiT and BERT were quantized before throughput measurement.
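
The link between throughput and serving cost can be sketched as follows. The helper names, the batch size, the stand-in model, and the traffic numbers are all hypothetical; the sketch only shows how a throughput figure is measured and why a 10x gap in throughput translates into roughly 10x as many instances for the same traffic.

```python
import math
import time


def measure_throughput(predict_fn, messages, batch_size=32):
    """Return completed messages per second for predict_fn over messages."""
    start = time.perf_counter()
    for i in range(0, len(messages), batch_size):
        predict_fn(messages[i:i + batch_size])
    elapsed = max(time.perf_counter() - start, 1e-9)  # guard against zero
    return len(messages) / elapsed


def instances_needed(target_msgs_per_sec, throughput_per_instance):
    """Number of identical instances required to sustain the target rate."""
    return math.ceil(target_msgs_per_sec / throughput_per_instance)


def dummy_predict(batch):
    """Stand-in for a real model: flags texts longer than three words."""
    return [len(text.split()) > 3 for text in batch]


messages = ["some page text to classify"] * 1000
throughput = measure_throughput(dummy_predict, messages)

# Hypothetical numbers: a model that is 10x slower needs 10x the instances.
fast_model_instances = instances_needed(1000, throughput_per_instance=500)
slow_model_instances = instances_needed(1000, throughput_per_instance=50)
```

In practice, the same harness would be run against each quantized model on identical hardware and identical messages so that the two throughput figures are comparable.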

The memory footprint of BERT is about twice that of ULMFiT. This is also a concern: because we process messages asynchronously in batches, increasing the batch size is one way to scale the system, and a model with a smaller memory footprint leaves more room for larger batches.

Key takeaways:

  1. By following the fine-tuning procedure from the ULMFiT paper, we were able to achieve performance on par with state-of-the-art models such as BERT.
  2. Quantized ULMFiT has a faster inference time than quantized BERT.

Future work:

  1. Explore distillation, such as DistilBERT, to speed up BERT inference with minimal loss in quality.
  2. Amortize the cost of the BERT encoder through multi-task learning, using BERT as a centralized encoder for all downstream NLP tasks.



About Me: I graduated with a Master's in Data Science from the University of San Francisco. I am an NLP Scientist at GumGum, interested in applications of NLP and speech.



Thoughts from the GumGum tech team
