Labeling ML Data with Snorkel

Nick Doiron
Nov 2

In previous posts on the AOC Reply Dataset, I mentioned the difficulty of training a troll detector with Google AutoML or scikit-learn when I don’t want to manually label 110k Tweets. In practice, I used SQL to find overtly profane keywords, then bundled all Tweets from those authors into one category.
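That keyword heuristic can be sketched in a few lines of Python instead of SQL. This is a minimal illustration, not the actual query I used; the keyword list and tweet tuples here are hypothetical placeholders.

```python
# Hypothetical keyword list standing in for the real profanity list.
PROFANE_KEYWORDS = {"damn", "hell"}

def flag_authors(tweets):
    """Return authors who used an overtly profane keyword in any Tweet.

    `tweets` is a list of (author, text) pairs.
    """
    flagged = set()
    for author, text in tweets:
        words = set(text.lower().split())
        if words & PROFANE_KEYWORDS:
            flagged.add(author)
    return flagged

tweets = [
    ("user_a", "what the hell is this"),
    ("user_b", "great thread, thanks"),
]
flag_authors(tweets)  # → {"user_a"}
```

Bundling every Tweet from a flagged author into one category is what makes this "weak" labeling: the rule is cheap and noisy, not ground truth.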
Since then, I’ve learned that the term for my problem is “weak supervision,” and that Snorkel is a leading tool for building a better supervised learning dataset with labeling functions. Recent research around Snorkel includes Snuba, DryBell, and SuperGLUE. Generally useful elements tend to get merged back into the main Snorkel library, so we will stick to that.

The concept is to write several different labeling functions, which Snorkel will figure out how to combine and weight. For example, a troll Tweet could contain profanity, weird conspiracy theories, certain hashtags, etc. These are all red flags, but in a world of probabilities some carry more weight and meaning. Explicitly racist hashtags are almost always used negatively, but profanity can go either way (“keep fucking rocking it”).
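To make the idea concrete, here is a toy sketch of labeling functions and weighted voting in plain Python. This is not Snorkel’s API: Snorkel learns the weights from how the functions agree and disagree across the unlabeled data, whereas the weights and keywords below are hand-set, hypothetical values chosen only to illustrate the "stronger vs. weaker signal" point.

```python
# Each labeling function votes TROLL, NOT_TROLL, or abstains.
ABSTAIN, NOT_TROLL, TROLL = -1, 0, 1

def lf_profanity(text):
    # Profanity is a weak signal: it can appear in friendly Tweets too.
    return TROLL if "fucking" in text.lower() else ABSTAIN

def lf_hashtag(text):
    # A hypothetical explicit hashtag: a much stronger troll signal.
    return TROLL if "#lockherup" in text.lower() else ABSTAIN

def lf_supportive(text):
    # Supportive language nudges the label the other way.
    return NOT_TROLL if "thank" in text.lower() else ABSTAIN

# Hand-set illustrative weights; Snorkel would learn these instead.
WEIGHTS = {lf_profanity: 0.3, lf_hashtag: 0.9, lf_supportive: 0.6}

def label(text, threshold=0.5):
    """Combine non-abstaining votes as a weighted average."""
    score = total = 0.0
    for lf, weight in WEIGHTS.items():
        vote = lf(text)
        if vote != ABSTAIN:
            score += weight * vote
            total += weight
    if total == 0:
        return ABSTAIN  # no function fired
    return TROLL if score / total >= threshold else NOT_TROLL

label("you #lockherup fucking idiot")         # → TROLL
label("keep fucking rocking it, thank you")   # → NOT_TROLL
label("nice weather today")                   # → ABSTAIN
```

The payoff over a single SQL query is that each noisy rule stays small and auditable, and the combiner resolves their conflicts instead of you doing it by hand.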

Snorkel has other tutorials which could apply to you, such as validating crowdsourced data and generating similar text or images (data augmentation).

I didn’t see a solution to my #1 problem (suggesting additional keywords for my labeling functions), but if you’re developing a programmatic labeling solution for your project and want to avoid the hurdles of building a SQL database, I would highly recommend Snorkel for your pre-processing and labeling tasks.
