Understanding Snorkel

An overview of “Snorkel: Rapid Training Data Creation with Weak Supervision” (2017) by A. Ratner, S. H. Bach, H. Ehrenberg, J. Fries, S. Wu, and C. Ré

  1. The system should be able to use labels from different weak supervision sources.
  2. The system should output probabilistic labels that can be used to train popular classifiers able to generalize beyond the noisy labels.
  3. A user should be able to supervise and interact with the system.

  1. Writing labeling functions:
    Labeling functions (LFs) wrap various weak supervision sources in a flexible interface.
  2. Modeling accuracies and correlations:
    Snorkel learns a generative model from the labeling functions’ agreements and disagreements, which lets it estimate their accuracies and correlations.
  3. Training a discriminative model:
    Although the generative model outputs probabilistic labels, the ultimate goal is to train a discriminative model (such as a popular ML classifier) that can generalize beyond the noisy labels the generative model produces.
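To make steps 2 and 3 concrete, here is a minimal sketch in plain Python. It is not Snorkel's implementation. It assumes a small label matrix of LF votes (1 for positive, -1 for negative, 0 for abstain), and it combines the votes with a simple unweighted vote as a stand-in for the generative model, which would instead learn a separate accuracy for each labeling function. The function names, the toy data, and the tiny logistic regression are all made up for illustration.

```python
import math

ABSTAIN, NEG, POS = 0, -1, 1

def soft_labels(label_matrix):
    """Turn LF votes into P(y = +1) per example with an unweighted vote.

    Stand-in for the generative model: every LF gets equal weight here,
    whereas Snorkel would estimate each LF's accuracy.
    """
    probs = []
    for votes in label_matrix:
        pos = sum(1 for v in votes if v == POS)
        neg = sum(1 for v in votes if v == NEG)
        # If every LF abstains, fall back to an uninformative 0.5.
        probs.append(0.5 if pos + neg == 0 else pos / (pos + neg))
    return probs

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(w, b, x):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def train_logreg(features, probs, epochs=500, lr=0.5):
    """Fit logistic regression on soft labels with batch gradient descent."""
    w = [0.0] * len(features[0])
    b = 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for x, p in zip(features, probs):
            # Cross-entropy gradient with a soft target p is (pred - p) * x.
            err = predict_proba(w, b, x) - p
            for k, xk in enumerate(x):
                grad_w[k] += err * xk
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

# Toy label matrix: rows are examples, columns are three hypothetical LFs.
L = [
    [POS, POS, ABSTAIN],
    [POS, ABSTAIN, NEG],
    [NEG, NEG, NEG],
    [ABSTAIN, ABSTAIN, ABSTAIN],
]
# Toy feature vectors for the same four examples.
X = [[1.0, 0.2], [0.8, 0.5], [0.1, 0.9], [0.5, 0.5]]

probs = soft_labels(L)          # probabilistic training labels
w, b = train_logreg(X, probs)   # discriminative model trained on them
```

The point of the last step is that the classifier sees the features, not the LF votes, so it can score examples on which every labeling function abstains.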
Snorkel’s architecture

Labeling functions as a language for weak supervision

Example of a labeling function written in Python
Example of a labeling function written with Snorkel’s declarative language
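As a rough illustration of the two styles named in the captions above, the sketch below shows a hand-written labeling function and a small factory that builds one from a keyword list. Neither is taken from the paper or from Snorkel's API: `lf_causes`, `make_keyword_lf`, the keywords, and the example sentences are all hypothetical, and the 1 / -1 / 0 encoding for positive / negative / abstain is an assumed convention.

```python
import re

ABSTAIN, NEG, POS = 0, -1, 1

def lf_causes(sentence):
    """Hand-written LF: vote POS when the sentence contains 'cause(s)'."""
    if re.search(r"\bcauses?\b", sentence, re.IGNORECASE):
        return POS
    return ABSTAIN

def make_keyword_lf(keywords, label):
    """Hypothetical declarative-style helper: build an LF from keywords."""
    lowered = [k.lower() for k in keywords]

    def lf(sentence):
        text = sentence.lower()
        # Vote with the given label on any keyword hit, otherwise abstain.
        return label if any(k in text for k in lowered) else ABSTAIN

    return lf

# Declared from data rather than written as custom code.
lf_treats = make_keyword_lf(["treats", "used to treat"], NEG)

sentences = [
    "Smoking causes lung cancer.",
    "Aspirin treats headaches.",
    "The weather is nice today.",
]

# One row per sentence, one column per labeling function.
label_matrix = [[lf(s) for lf in (lf_causes, lf_treats)] for s in sentences]
```

Each function either votes or abstains, so the resulting matrix is mostly zeros wherever no heuristic applies. That sparsity is expected when labeling functions have low coverage.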

Generative and discriminative models

Modeling dependencies

Evaluation

  • Snorkel performs better than distant supervision because it incorporates a broader range of labeling sources
  • Snorkel-based models perform almost as well as models trained on hand-labeled data
  • Snorkel is time-efficient
  • Snorkel is easy to use, even for first-time users
  • Snorkel offers users a new way to interact with the modeling process
