An overview of “Snorkel: Rapid Training Data Creation with Weak Supervision” (2017) by A. Ratner, S. H. Bach, H. Ehrenberg, J. Fries, S. Wu and C. Ré
Machine learning has advanced considerably in recent years, but it still needs large amounts of data to reach a high level of accuracy. This is especially true of deep learning techniques, which have become very popular.
Creating training sets is expensive: the traditional approach is to hire a team that labels data manually. A cheaper alternative is weak supervision, which produces “weak” labels that are noisier and less accurate than hand labels. One of its most popular forms is distant supervision, a technique that uses an existing domain-specific data source to label a dataset. The challenge with weak supervision is improving the accuracy of the models trained on such labels.
The data programming paradigm was proposed to address the challenges of data labeling through “modeling multiple label sources without access to ground truth, and generating probabilistic training labels representing the lineage of the individual labels”.
Snorkel was built on this paradigm to let users create large training datasets quickly and inexpensively.
Snorkel’s design rests on 3 cornerstones:
- The system should be able to use labels from different weak supervision sources.
- The system should output probabilistic labels that are used to train popular classifiers, which generalize beyond the noisy labels.
- A user should be able to supervise and interact with the system.
One problem with early versions of Snorkel was that users found it difficult to combine different labeling sources in one model. This was resolved by wrapping labeling functions (LFs) in a common interface layer, together with a dedicated language for expressing different kinds of those functions.
Another important feature of Snorkel is its ability to learn not only the accuracies of the labeling functions but also their dependencies and correlations. Furthermore, when building its generative model, it can either model the correlations between functions or fall back on a majority vote, whichever gives better results.
Snorkel demonstrates that weak supervision “as the sole port of interaction for machine learning” has many advantages over traditional feature engineering and leads to a completely different workflow.
Snorkel’s architecture consists of 3 stages:
- Writing labeling functions:
LFs can contain various weak supervision sources wrapped in a flexible interface.
- Modeling Accuracies and Correlations:
Snorkel learns a generative model from the labeling functions’ correlations, that is, from where they agree and disagree with one another.
- Training a discriminative model:
The generative model outputs probabilistic labels, but the ultimate goal is to use them to train a discriminative model (for example, a popular ML classifier) that generalizes beyond the noisy generative model.
Labeling functions as a language for weak supervision
Snorkel can employ different kinds of weak supervision sources: patterns, heuristics, external knowledge bases, crowdsourced labels, etc. For that reason, a flexible system to implement all those sources was needed. Snorkel’s creators interacted closely with users and came up with two ways of writing LFs: custom functions, usually written in Python, and declarative functions.
As an example, the authors give a labeling function written in Python that looks for the word “causes” in a span of text and returns “True” if the word is found, or “False” otherwise.
The same function can be alternatively expressed in a declarative language.
The declarative interface for writing LFs lets users draw on the most popular sources of labels: pattern-based rules, distant supervision, weak classifiers, and labeling function generators, which can be built within the Snorkel framework.
Labeling functions take Candidate objects as input. A Candidate is a tuple of components of the hierarchically structured data that the functions are meant to process.
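To make the two styles concrete, here is a minimal sketch in plain Python. It does not use the Snorkel library: the `Candidate` tuple, the function names, and the convention of returning `None` to abstain are all assumptions made for illustration. The first function is hand-written; the two factories imitate the declarative style by generating an LF from a pattern or from a knowledge base.

```python
import re
from typing import Callable, NamedTuple, Optional, Set, Tuple


class Candidate(NamedTuple):
    """Hypothetical stand-in for a candidate: a sentence and two entity mentions."""
    sentence: str
    chemical: str
    disease: str


# Assumed convention: an LF votes True or False, or returns None to abstain.
LabelingFunction = Callable[[Candidate], Optional[bool]]


def lf_causes(c: Candidate) -> Optional[bool]:
    """Hand-written LF: is the word 'causes' between the chemical and the disease?"""
    start = c.sentence.find(c.chemical)
    end = c.sentence.find(c.disease)
    if start == -1 or end == -1 or start > end:
        return None  # mentions missing or in the wrong order: abstain
    between = c.sentence[start + len(c.chemical):end]
    return "causes" in between.split()


def make_pattern_lf(pattern: str, label: bool) -> LabelingFunction:
    """Declarative-style generator: vote `label` when the regex matches."""
    regex = re.compile(pattern)

    def lf(c: Candidate) -> Optional[bool]:
        return label if regex.search(c.sentence) else None

    return lf


def make_kb_lf(known_pairs: Set[Tuple[str, str]]) -> LabelingFunction:
    """Distant supervision: vote True when the pair is in a knowledge base."""

    def lf(c: Candidate) -> Optional[bool]:
        return True if (c.chemical, c.disease) in known_pairs else None

    return lf


candidate = Candidate(
    "Aspirin causes stomach bleeding in some patients.",
    "Aspirin",
    "stomach bleeding",
)
lfs = [
    lf_causes,
    make_pattern_lf(r"\bnot\b", False),
    make_kb_lf({("Aspirin", "stomach bleeding")}),
]
votes = [lf(candidate) for lf in lfs]  # one vote per labeling function
```

Each function sees the same candidate and either votes or abstains, so sources as different as a regular expression and a knowledge base lookup end up behind one interface.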
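One way to picture such a Candidate is as a pair of text spans that keep a link to their parent sentence and document. The sketch below is an assumption-laden illustration: the class names, fields, and helper methods are hypothetical and are not Snorkel's actual data model.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Document:
    name: str


@dataclass
class Sentence:
    document: Document  # parent in the hierarchy
    words: List[str]


@dataclass
class Span:
    sentence: Sentence  # parent in the hierarchy
    word_start: int     # index of the first word, inclusive
    word_end: int       # index of the last word, inclusive

    def text(self) -> str:
        return " ".join(self.sentence.words[self.word_start:self.word_end + 1])


@dataclass
class Candidate:
    """A tuple of spans; here a (chemical, disease) pair from one sentence."""
    chemical: Span
    disease: Span

    def words_between(self) -> List[str]:
        first, second = sorted(
            [self.chemical, self.disease], key=lambda s: s.word_start
        )
        return first.sentence.words[first.word_end + 1:second.word_start]


doc = Document("abstract-1")
sent = Sentence(doc, "Aspirin causes stomach bleeding".split())
cand = Candidate(chemical=Span(sent, 0, 0), disease=Span(sent, 2, 3))
```

A labeling function that receives `cand` can inspect the mentions themselves, the words between them, or anything further up the hierarchy, such as the document name.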
Generative and discriminative models
The core of Snorkel is building a generative model from the LFs. Snorkel first treats the LFs as independent voters, assuming there is no correlation between them. The next step is to include correlations, so that functions encoding similar rules are not counted as separate votes.
The important detail here is that the generative model is built without any access to the “ground truth”, that is, to the true labels. The labels it attaches to data points are probabilistic and are meant to be used to train a discriminative model.
The discriminative model aims to generalize beyond the noisy LFs. The more unlabeled data Snorkel is given, the better the predictive performance of the discriminative model, just as traditional models improve when the amount of hand-labeled training data increases.
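The independent-voter step can be sketched as a weighted vote. The code below assumes binary labels encoded as +1 and -1 with 0 for abstention, a uniform class prior, and LF accuracies that are already known. That last assumption is a simplification for illustration: Snorkel estimates the accuracies from the LFs' agreements and disagreements.

```python
import numpy as np


def probabilistic_labels(L, accuracies):
    """P(y = +1) per data point, treating the LFs as independent voters.

    L: label matrix, one row per data point and one column per LF,
       with entries +1, -1, or 0 (abstain).
    accuracies: assumed-known accuracy of each LF, strictly between 0 and 1.
    """
    L = np.asarray(L, dtype=float)
    acc = np.asarray(accuracies, dtype=float)
    weights = np.log(acc / (1.0 - acc))  # log-odds weight of each LF's vote
    scores = L @ weights                 # abstentions contribute nothing
    return 1.0 / (1.0 + np.exp(-scores))


# Two LFs that encode the same rule vote +1, a third LF votes -1.
p_duplicated = probabilistic_labels([[1, 1, -1]], [0.7, 0.7, 0.7])[0]

# Counting the duplicated rule only once leaves one vote each way.
p_deduplicated = probabilistic_labels([[1, -1]], [0.7, 0.7])[0]
```

With the duplicated rule counted twice the model is about 70% confident in the positive label, while counting it once gives 50%. This is the double vote that modeling correlations is meant to prevent.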
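Training on probabilistic labels means each data point contributes to the loss in proportion to how likely each label is. A minimal sketch, assuming a logistic regression trained by plain gradient descent on the cross-entropy against soft labels (the function names and hyperparameters are illustrative, not taken from the paper):

```python
import numpy as np


def train_noise_aware_logreg(X, soft_labels, lr=0.5, epochs=500):
    """Fit logistic regression to probabilistic labels P(y = +1)."""
    X = np.asarray(X, dtype=float)
    p = np.asarray(soft_labels, dtype=float)
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])  # append a bias column
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        q = 1.0 / (1.0 + np.exp(-(Xb @ w)))  # current predictions
        # Gradient of the cross-entropy between soft labels p and predictions q.
        w -= lr * Xb.T @ (q - p) / len(p)
    return w


def predict_proba(w, X):
    X = np.asarray(X, dtype=float)
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return 1.0 / (1.0 + np.exp(-(Xb @ w)))


# One feature; the soft labels would come from the generative model.
X_train = [[-2.0], [-1.0], [1.0], [2.0]]
soft = [0.1, 0.3, 0.7, 0.9]
w = train_noise_aware_logreg(X_train, soft)
p_new = predict_proba(w, [[3.0], [-3.0]])  # points no LF needs to cover
```

The trained model makes predictions from features alone, so it can label data points on which every labeling function abstains.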
Once the LFs are written, Snorkel decides whether to simply apply a majority vote to determine each label or to model the LF accuracies in the generative model. The decision is based on the label matrix density (the mean number of non-abstention labels per data point) and on an optimizer proposed by Snorkel’s creators. As for density, the generative model performs best at medium density, while majority vote gives the best results at low and high density. Density alone was not a sufficient criterion, however, so another optimization rule was introduced, based on “the ratio of positive to negative labels for each data point”.
While working on the early version of Snorkel, the authors observed that users struggled to specify an efficient dependency structure. Correlated labeling functions hurt the model, but correcting for those correlations is a hard, dataset-specific task. For this reason, it was important to have a method that does so automatically, based only on the output of the functions. Snorkel computes a threshold estimate that determines the point beyond which adding more correlations to the model would be too computationally expensive.
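The quantities this decision relies on are easy to compute from the label matrix. The sketch below assumes labels encoded as +1 and -1 with 0 for abstention. It only computes the inputs (density, majority vote, and per-point positive and negative counts) and does not reproduce the optimizer described in the paper.

```python
import numpy as np


def label_density(L):
    """Mean number of non-abstention labels per data point."""
    L = np.asarray(L)
    return float(np.mean(np.sum(L != 0, axis=1)))


def majority_vote(L):
    """Sign of the summed votes per data point; 0 means a tie or no votes."""
    return np.sign(np.sum(np.asarray(L), axis=1))


def vote_counts(L):
    """Number of positive and negative labels for each data point."""
    L = np.asarray(L)
    return np.sum(L == 1, axis=1), np.sum(L == -1, axis=1)


L = [
    [1, 0, -1, 1],
    [0, 0, 0, 1],
    [-1, -1, 0, 0],
]
density = label_density(L)       # (3 + 1 + 2) / 3
mv = majority_vote(L)
positives, negatives = vote_counts(L)
```

Only the first row contains a conflict between LFs. Rows like that are where a weighted vote can differ from a majority vote, which is why the balance of positive and negative labels per data point matters for the decision.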
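The trade-off behind such a threshold can be illustrated with a much simpler proxy than the paper's method. The sketch below is a swapped-in technique, not Snorkel's structure learning: it flags a pair of LFs as correlated when their agreement rate, measured on data points where both vote, reaches a threshold. It uses only the LF outputs, and it shows how lowering the threshold increases the number of correlations a model would have to handle.

```python
from itertools import combinations

import numpy as np


def agreement_rate(a, b):
    """Fraction of data points on which two LFs agree, where both vote."""
    both = (a != 0) & (b != 0)
    if not both.any():
        return 0.0
    return float(np.mean(a[both] == b[both]))


def correlated_pairs(L, threshold):
    """Pairs of LF columns whose agreement rate reaches the threshold."""
    L = np.asarray(L)
    return [
        (j, k)
        for j, k in combinations(range(L.shape[1]), 2)
        if agreement_rate(L[:, j], L[:, k]) >= threshold
    ]


# LF 0 and LF 1 always agree; LF 2 agrees with each of them half the time.
L = [
    [1, 1, -1],
    [1, 1, 1],
    [-1, -1, 1],
    [-1, -1, -1],
]
strict = correlated_pairs(L, 0.9)  # only the near-duplicate pair
loose = correlated_pairs(L, 0.5)   # every pair is flagged
```

A strict threshold keeps the model small and cheap, while a loose one adds many pairs that carry little signal. This proxy ignores pairs that systematically disagree, which a full treatment of dependencies would also need to consider.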
Snorkel has been tested with data from different domains and, most importantly, with real-world users. The key take-aways from evaluating Snorkel’s performance are:
- Snorkel performs better than distant supervision by incorporating a broader range of labeling sources
- Models trained with Snorkel perform almost as well as models trained on hand-labeled data
- Snorkel is time-efficient
- Snorkel is easy to use even for first-time users
- Snorkel offers a new way the users can interact with the modeling process