Building NLP Classifiers Cheaply With Transfer Learning and Weak Supervision

A Step-by-Step Guide for Building an Anti-Semitic Tweet Classifier

Abraham Starosta
Feb 15, 2019 · 15 min read
Text + Intelligence = Gold… But, how can we mine it cheaply?

Introduction

There is a catch to training state-of-the-art NLP models: their reliance on massive hand-labeled training sets. That’s why data labeling is usually the bottleneck in developing NLP applications and keeping them up-to-date. For example, imagine how much it would cost to pay medical specialists to label thousands of electronic health records. In general, having domain experts label thousands of examples is too expensive.

In this post, I'll show how I got around this bottleneck by combining two techniques:
  1. Use weak supervision to build a training set from many unlabeled examples
  2. Use a large pre-trained language model for transfer learning

Background

Weak Supervision

Weak supervision (WS) helps us alleviate the data bottleneck problem by enabling us to cheaply leverage subject matter expertise to programmatically label millions of data points. More specifically, it's a framework that helps subject matter experts (SMEs) infuse their knowledge into an AI system in the form of hand-written heuristic rules or distant supervision. As an example of WS adding value in the real world, Google published a paper in December 2018 describing Snorkel DryBell, an internal tool that uses WS to build three powerful text classifiers in a fraction of the time.

Overview of the Data Programming Paradigm with Snorkel
In Snorkel's data programming paradigm, SMEs write labeling functions (LFs): small functions that vote on an example's label, or abstain. Besides hand-written heuristic rules like the regexes below, LFs can encode other weak signals:
  • Syntactic patterns: for instance, spaCy's dependency trees
  • Distant supervision: external knowledge bases
  • Noisy manual labels: crowdsourcing
  • External models: other models with useful signals
import re

# Set voting values.
ABSTAIN = 0
POSITIVE = 1
NEGATIVE = 2

# Detects common conspiracy theories about jews owning the world.
GLOBALISM = r"\b(Soros|Adelson|Rothschild|Khazar)"

def jews_symbols_of_globalism(tweet_text):
    return POSITIVE if re.search(GLOBALISM, tweet_text) else ABSTAIN
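To make the distant-supervision bullet above concrete, here is a hedged sketch of an LF that votes based on an external lexicon; the file name and lexicon (e.g. something exported from Hatebase) are hypothetical and not part of the original project:

# Hypothetical external lexicon: one lowercase hate term per line.
with open("hate_terms.txt") as f:
    HATE_TERMS = {line.strip() for line in f if line.strip()}

def contains_hate_term(tweet_text):
    # Vote POSITIVE if any lexicon term appears in the tweet, otherwise abstain.
    words = set(tweet_text.lower().split())
    return POSITIVE if words & HATE_TERMS else ABSTAIN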
So why train a discriminative classifier on top of the weak labels at all?
  • Improvement in recall: a discriminative model will be able to generalize beyond the rules in our WS model, thus often giving us a bump in recall.

Transfer Learning and ULMFiT

Transfer learning has greatly impacted computer vision. Using a ConvNet pre-trained on ImageNet as an initialization, or fine-tuning it for the task at hand, has become very common. But that success hadn't translated to NLP until ULMFiT came along.

ULMFiT
Introduction to ULMFiT
ULMFiT starts from a language model (LM) pre-trained on a large general-domain corpus (WikiText-103). From there, we:
  1. Fine-tune the LM for the task at hand with a large corpus of unlabeled data points
  2. Train a discriminative classifier by fine-tuning it with gradual unfreezing
I went with ULMFiT mainly because:
  • fastai's API is very easy to use, and this tutorial is very good
  • It produces a PyTorch model we can deploy in production

Step-By-Step Guide for Building an Anti-Semitic Tweet Classifier

In this section, we’ll dive more deeply into the steps I took to build an anti-semitic tweet classifier and I’ll share some more general things I learned throughout this process.

First step: Data Collection and Setting a Target

Collecting unlabeled data: The first step is to put together a large set of unlabeled examples (at least 20,000). For the anti-semitic tweet classifier, I downloaded close to 25,000 tweets that mention the word “jew.”
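The post doesn't include the collection script, but a minimal sketch using tweepy's 3.x search API (credentials and the output file name are placeholders) might look like this:

import tweepy
import pandas as pd

# Hypothetical credentials; fill in your own Twitter API keys.
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)

# Pull tweets mentioning "jew", skipping retweets to reduce duplicates.
rows = []
for status in tweepy.Cursor(api.search, q='jew -filter:retweets',
                            lang="en", tweet_mode="extended").items(25000):
    rows.append({"id": status.id, "tweet": status.full_text})

# Illustrative file name; the notebook below reads its own train/LF/test CSVs.
pd.DataFrame(rows).to_csv("../data/unlabeled_tweets.csv", index=False)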

View of Airtable for Text Labeling
DATA_PATH = "../data"
train = pd.read_csv(os.path.join(DATA_PATH, "train_tweets.csv"))
test = pd.read_csv(os.path.join(DATA_PATH, "test_tweets.csv"))
LF_set = pd.read_csv(os.path.join(DATA_PATH, "LF_tweets.csv"))
train.shape, LF_set.shape, test.shape

>> ((24738, 6), (733, 7), (438, 7))

Second Step: Building a Training Set With Snorkel

Building our Labeling Functions is a pretty hands-on stage, but it will pay off! I expect that if you already have domain knowledge, this should take about a day (and if you don't, it might take a couple of days). Also, this section is a mix of what I did for my project specifically and some general advice on how to use Snorkel that you can apply to your own projects.

# Common insults against jews.
INSULTS = r"\bjew (bitch|shit|crow|fuck|rat|cockroach|ass|bast(a|e)rd)"

def insults(tweet_text):
    return POSITIVE if re.search(INSULTS, tweet_text) else ABSTAIN

# If tweet author is jewish then it's likely not anti-semitic.
JEWISH_AUTHOR = r"((\bI am jew)|(\bas a jew)|(\bborn a jew))"

def jewish_author(tweet_text):
    return NEGATIVE if re.search(JEWISH_AUTHOR, tweet_text) else ABSTAIN
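The snippets below assume the LFs have been gathered into a list (plus a name mapping used for the summary table), and that a small helper, make_Ls_matrix, applies every LF to every tweet. That helper comes from the original notebook; a minimal sketch of it, assuming a "tweet" text column and only the LFs shown above (the full project had many more), could be:

import numpy as np

# All labeling functions, with readable names for the LF summary below.
LFs = [jews_symbols_of_globalism, insults, jewish_author]
LF_names = {i + 1: lf.__name__ for i, lf in enumerate(LFs)}

def make_Ls_matrix(df, LFs):
    """Apply every LF to every tweet; return an (n_examples, n_LFs) vote matrix."""
    L = np.zeros((len(df), len(LFs)), dtype=int)
    for i, (_, row) in enumerate(df.iterrows()):
        for j, lf in enumerate(LFs):
            L[i, j] = lf(row["tweet"])  # "tweet" column name is an assumption
    return L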
import numpy as np
from scipy import sparse
from metal.analysis import lf_summary

# We build a matrix of LF votes for each tweet in the hand-labeled LF set.
LF_matrix = make_Ls_matrix(LF_set, LFs)

# Get true labels for the LF set.
Y_LF_set = np.array(LF_set['label'])

display(lf_summary(sparse.csr_matrix(LF_matrix),
                   Y=Y_LF_set,
                   lf_names=LF_names.values()))
LF Summary
  • Coverage: % of samples for which at least one LF votes positive or negative. You want to maximize this while keeping good accuracy.
  • Polarity: tells you which label values the LF returns.
  • Overlaps & Conflicts: tells you how much an LF overlaps and conflicts with other LFs. Don't worry about these too much; the Label Model will actually use them to estimate the accuracy of each LF.
# Fraction of tweets in the LF set that receive at least one LF vote.
label_coverage(LF_matrix)
>> 0.8062755798090041
from sklearn.metrics import classification_report
from metal.label_model.baselines import MajorityLabelVoter

# Baseline: label the LF set by majority vote across the LFs.
mv = MajorityLabelVoter()
Y_LF_majority_votes = mv.predict(LF_matrix)
print(classification_report(Y_LF_set, Y_LF_majority_votes))
Classification Report for Majority Voter Baseline
Google Sheet I used for tuning my LFs
from metal.label_model import LabelModel

Ls_train = make_Ls_matrix(train, LFs)

# You can tune the learning rate and class balance.
label_model = LabelModel(k=2, seed=123)
label_model.train_model(Ls_train, n_epochs=2000, print_every=1000,
                        lr=0.0001,
                        class_balance=np.array([0.2, 0.8]))
Precision-Recall Curve for Label Model
# To use all information possible when we fit our classifier, we can
# actually combine our hand-labeled LF set with our training set
# (the corresponding tweets get concatenated in the same order).
Y_train = np.concatenate([label_model.predict(Ls_train), Y_LF_set])
  1. Add it to the Label Matrix and check that its accuracy is at least 50%. Try to get the highest accuracy possible, while keeping a good coverage. I grouped different LFs together if they relate to the same topic.
  2. Every once in a while you’ll want to use the baseline Majority Vote model (provided in Snorkel Metal) to label your LF set. Update your LFs accordingly to get a pretty good score just with the Majority Vote model.
  3. If your Majority Vote model isn’t good enough, then you can fix your LFs or go back to step 1 and repeat.
  4. Once your Majority Vote model works, then run your LFs over your Train set. You should have at least 60% coverage.
  5. Once this is done, train your Label Model!
  6. To validate the Label Model, I ran it over my training set and printed the 100 most anti-semitic tweets and the 100 least anti-semitic tweets to make sure it was working correctly (see the sketch after this list).
  • On LF coverage: You want to have at least one LF voting positive/negative for at least 65% of our training set. This is called LF Coverage by Snorkel.
  • If you’re not a domain expert to start, you’ll get ideas for new LFs as you label your 600 initial data points.

Third Step: Build Classification Model

The last step is to train our classifier to generalize beyond our noisy hand-made rules.

Baselines
from fastai.text import *

# LM_TWEETS (defined earlier in the notebook) holds the unlabeled tweets used for LM fine-tuning.
data_lm = TextLMDataBunch.from_df(train_df=LM_TWEETS, valid_df=df_test, path="")
learn_lm = language_model_learner(data_lm, pretrained_model=URLs.WT103_1, drop_mult=0.5)

learn_lm.unfreeze()
for i in range(20):
    learn_lm.fit_one_cycle(cyc_len=1, max_lr=1e-3, moms=(0.8, 0.7))
learn_lm.save('twitter_lm')
learn_lm.predict("i hate jews", n_words=10)
>> 'i hate jews are additional for what hello you brother . xxmaj the'
learn_lm.predict("jews", n_words=10)
>> 'jews out there though probably okay jew back xxbos xxmaj my'
# Classifier model data: reuse the fine-tuned LM's vocab.
data_clas = TextClasDataBunch.from_df(path="",
                                      train_df=df_trn,
                                      valid_df=df_val,
                                      vocab=data_lm.train_ds.vocab,
                                      bs=32,
                                      label_cols=0)

learn = text_classifier_learner(data_clas, drop_mult=0.5)
learn.freeze()

# Find a good learning rate, then fine-tune with gradual unfreezing.
learn.lr_find(start_lr=1e-8, end_lr=1e2)
learn.recorder.plot()

learn.fit_one_cycle(cyc_len=1, max_lr=1e-3, moms=(0.8, 0.7))
learn.freeze_to(-2)
learn.fit_one_cycle(1, slice(1e-4, 1e-2), moms=(0.8, 0.7))
learn.freeze_to(-3)
learn.fit_one_cycle(1, slice(1e-5, 5e-3), moms=(0.8, 0.7))
learn.unfreeze()
learn.fit_one_cycle(4, slice(1e-5, 1e-3), moms=(0.8, 0.7))
A few training epochs
Precision-Recall curve of ULMFiT with Weak Supervision
Classification Report for ULMFiT Model
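The curve and report above come from the model's validation predictions. A sketch of how they might be produced with fastai's get_preds and scikit-learn (the positive-class column index and the 0.5 threshold are assumptions):

import matplotlib.pyplot as plt
from sklearn.metrics import classification_report, precision_recall_curve

# Probabilities on the validation set (get_preds defaults to the valid set in fastai v1).
preds, y_true = learn.get_preds()
pos_probs = preds.numpy()[:, 1]  # assumes the anti-semitic class is column 1

precision, recall, _ = precision_recall_curve(y_true.numpy(), pos_probs)
plt.plot(recall, precision)
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.show()

# Hard predictions at a 0.5 threshold for the classification report.
print(classification_report(y_true.numpy(), (pos_probs > 0.5).astype(int)))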

Having Fun With Our Model

Below is a pretty cool example of how the model catches that “doesn’t” changes the tweet’s meaning!

learn.predict("george soros controls the government")
>> (Category 1, tensor(1), tensor([0.4436, 0.5564]))
learn.predict("george soros doesn't control the government")
>> (Category 0, tensor(0), tensor([0.7151, 0.2849]))
learn.predict("fuck jews")
>> (Category 1, tensor(1), tensor([0.1996, 0.8004]))
learn.predict("dirty jews")
>> (Category 1, tensor(1), tensor([0.4686, 0.5314]))
learn.predict("Wow. The shocking part is you're proud of offending every serious jew, mocking a religion and openly being an anti-semite.")
>> (Category 0, tensor(0), tensor([0.9908, 0.0092]))
learn.predict("my cousin is a russian jew from ukraine- 💜🌻💜 i'm so glad they here")
>> (Category 0, tensor(0), tensor([0.8076, 0.1924]))
learn.predict("at least the stolen election got the temple jew shooter off the drudgereport. I ran out of tears.")
>> (Category 0, tensor(0), tensor([0.9022, 0.0978]))

Does Weak Supervision Actually Help?

I was curious if WS was necessary to obtain this performance, so I ran a little experiment. I ran the same process as before, but without the WS labels, and got this Precision-Recall curve:

Precision-Recall curve of ULMFiT without Weak Supervision

Conclusions

  • Weak supervision + ULMFiT helped us hit 95% precision and 39% recall. That was much better than all the baselines, so that was very exciting. I was not expecting that at all.
  • This model is very easy to keep up-to-date. There’s no need for relabeling, we just update the LFs and rerun the WS + ULMFiT pipeline.
  • Weak supervision makes a big difference by allowing ULMFiT to generalize better.

Next Steps

  • I believe we can get the most gains by putting more effort into the LFs to improve the Weak Supervision model. I would first include LFs based on external knowledge bases like Hatebase's repository of hate speech patterns. Then, I would write new LFs based on spaCy's dependency tree parsing.
  • We didn’t do any hyperparameter tuning but that could likely help improve both the Label Model and ULMFiT performance.
  • We can try different classification models such as fine-tuning BERT or OpenAI’s Transformer.
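For the last bullet, a hedged sketch of what fine-tuning BERT on the same weak labels might look like, using Hugging Face's transformers library (not used in this post; the example texts and label convention follow the predictions shown earlier, everything else is illustrative):

import torch
from transformers import BertForSequenceClassification, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# A single toy training step on weakly labeled tweets (0 = not anti-semitic, 1 = anti-semitic).
texts = ["george soros controls the government", "my cousin is a russian jew from ukraine"]
labels = torch.tensor([1, 0])

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
outputs = model(**batch, labels=labels)
outputs.loss.backward()  # in practice, inside a normal optimizer loop over the full training set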

Written by Abraham Starosta (abraham@sculptintel.com), MS in CS from Stanford, working on NLP.

Sculpt: Text Intelligence Without Coding
