Big Data Playground for Engineers : Snorkel Scikit Transformer

Mageswaran D
7 min read · Apr 18, 2020

This is part of the series called Big Data Playground for Engineers and the content page is here!

Git:

A fully functional code base and use case examples are up and running.

Repo: https://github.com/gyan42/spark-streaming-playground

Website: https://gyan42.github.io/spark-streaming-playground/build/html/index.html

Snorkel

In 2016, AI researchers from Stanford University introduced a new paradigm known as data programming, which allows data engineers to express weak supervision strategies and generate probabilistic training labels that represent the lineage of the individual labels. The ideas behind data programming were incredibly compelling, but at first they lacked a practical implementation.

Snorkel: rapid training data creation with weak supervision, Ratner et al., VLDB'18

Snorkel is a framework for creating training datasets with weak supervision. It tackles one of the central questions in supervised machine learning: how do you get a large enough set of labelled training data to power modern deep models?

If we think about the traditional process for building a training dataset, it involves three major steps: data collection, data labeling and feature engineering. From a complexity standpoint, data collection is fundamentally trivial, as most organizations understand what data sources they have. Feature engineering is getting to the point of being 70%-80% automated using algorithms. The real effort is in the data labeling stage.

Labeling training data often involves domain experts manually processing large semi-structured or unstructured datasets. This is typically known as strong supervision labeling; it tends to produce very high-quality datasets but is also cost-prohibitive for most companies. Alternatively, weak supervision labeling relies on programmable heuristics that produce noisy labels. One of the most popular weak labeling techniques is distant supervision, in which the records of an external knowledge base are heuristically aligned with data points to produce noisy labels. While less accurate, weak labeling techniques are far more feasible from a cost perspective.

Snorkel lets you throw everything you’ve got at the problem. Heuristics, external knowledge bases, crowd-sourced workers, you name it. These are known as weak supervision sources because they may be limited in accuracy and coverage. All of these get combined in a principled manner to produce a set of probability-weighted labels. The authors call this process ‘data programming’. The end model is then trained on the generated labels.

There are three main stages in the Snorkel workflow:

  1. Instead of hand-labelling large quantities of training data, users write labelling functions which capture patterns and heuristics, connect with external knowledge bases (distant supervision), and so on. A labelling function is a Python function which, given an input, can either output a label or abstain. Snorkel also includes a number of declarative labelling functions that can be used out of the box.
  2. Snorkel learns a generative model over all of the labelling functions, so that it can estimate their accuracies and correlations. “This step uses no ground-truth data, learning instead from the agreements and disagreements of the labeling functions.”
  3. Snorkel outputs a set of probabilistic labels which can then be used to train a wide variety of machine learning models (see the sketch after this list).
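To make these three stages concrete, here is a minimal end-to-end sketch against the Snorkel 0.9-style API. The label scheme, keyword rules and toy DataFrame are illustrative assumptions for this article's AI-tweet use case, not Snorkel's own tutorial code.

import pandas as pd
from snorkel.labeling import labeling_function, PandasLFApplier
from snorkel.labeling.model import LabelModel

ABSTAIN, OTHER, AI = -1, 0, 1   # illustrative label scheme for this walkthrough

# Stage 1: write labelling functions instead of hand-labelling
@labeling_function()
def lf_keyword_ml(x):
    # Vote AI if an obvious keyword appears, otherwise abstain
    return AI if "machine learning" in x.text.lower() else ABSTAIN

@labeling_function()
def lf_keyword_weather(x):
    # Vote "other" for clearly unrelated content
    return OTHER if "weather" in x.text.lower() else ABSTAIN

lfs = [lf_keyword_ml, lf_keyword_weather]   # in practice, many such functions

# Stage 2: apply the LFs and learn a generative model over their votes
df_train = pd.DataFrame({"text": ["machine learning is fun", "nice weather today"]})
L_train = PandasLFApplier(lfs=lfs).apply(df=df_train)
label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L_train=L_train, n_epochs=100, seed=123)

# Stage 3: probabilistic labels that can train any downstream model
probs_train = label_model.predict_proba(L=L_train)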

Label Model

The generative model

Once we have a collection of labelling functions, an obvious thing to do would be to ask each function to label a candidate and use majority voting to determine the resulting label. In fact, in situations where we have very few votes on an input (e.g., most of the labelling functions abstain) and in situations where we have lots of votes, majority voting works really well. In between these two extremes, however, taking a weighted vote based on modelling labelling function accuracy works better.

Snorkel uses a heuristic based on the ratio of positive to negative labels for each data point to decide whether to use majority voting or to build a generative model of function accuracy in order to perform weighted voting.
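Snorkel ships both voting strategies, so the difference is easy to see in code. The toy label matrix below is purely an illustrative assumption (-1 means the labelling function abstained).

import numpy as np
from snorkel.labeling.model import LabelModel, MajorityLabelVoter

# Toy [N x M] label matrix: 4 data points voted on by 3 labelling functions
L_train = np.array([
    [ 1,  1, -1],
    [ 0, -1,  0],
    [ 1,  0,  1],
    [-1, -1,  0],
])

# Unweighted majority vote over the LF outputs
majority_model = MajorityLabelVoter(cardinality=2)
preds_majority = majority_model.predict(L=L_train)

# Generative model: learns per-LF accuracies, then takes a weighted vote
label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L_train=L_train, n_epochs=100, seed=123)
preds_weighted = label_model.predict(L=L_train)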

Essentially, we take the expected count of instances in which a weighted majority vote could possibly flip the incorrect predictions of an unweighted majority vote under best-case conditions, which is an upper bound on the expected advantage.

When a generative model is called for it is built as a factor graph, applying all labelling functions to the unlabelled data points and capturing the labelling propensity, accuracy, and pairwise correlations of the functions. The details of learning the model are given in an earlier paper, ‘Learning the structure of generative models without labeled data.’

Dealing with correlated labels

Often the provided labelling functions are not independent. For example, functions could be simple variations of each other, or they could depend on a common source of distant supervision.

If we don’t account for the dependencies between labelling functions, we can get into all sorts of trouble.

Getting users to somehow indicate dependencies by hand is difficult and error-prone.

We therefore turn to our method for automatically selecting which dependencies to model without access to ground truth (see ‘Learning the structure of generative models without labeled data’). It uses a pseudo-likelihood estimator, which does not require any sampling or other approximations to compute the objective gradient exactly. It is much faster than maximum likelihood estimation, taking 15 seconds to select pairwise correlations to be modeled among 100 labeling functions with 10,000 data points.

The estimator does rely on a hyperparameter, though: a correlation threshold that trades off predictive performance against computational cost. With large values of this threshold no correlations are included, and as we reduce the value progressively more correlations are added, starting with the strongest. The paper plots the number of correlations added for different values of the correlation threshold in three different tasks.

Generally, the number of correlations grows slowly at first, then hits an “elbow point” beyond which the number explodes… setting the threshold to this elbow point is a safe tradeoff between predictive performance and computational cost.

Snorkel In Action: AI Tweets or Not!

How about building a scikit-learn Transformer-style Snorkel tagger?

git clone https://gist.github.com/Mageswaran1989/1197796d81674444391b25074f79b989
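The gist above has the full implementation; as a rough idea of what such a wrapper looks like, here is a minimal sketch. The class name SnorkelTagger, its parameters and the output column name are my own illustrative choices, not necessarily what the gist uses.

from sklearn.base import BaseEstimator, TransformerMixin
from snorkel.labeling import PandasLFApplier
from snorkel.labeling.model import LabelModel

class SnorkelTagger(BaseEstimator, TransformerMixin):
    """Applies a set of labelling functions plus a LabelModel like a sklearn transformer."""

    def __init__(self, lfs, cardinality=2, n_epochs=500, seed=42):
        self.lfs = lfs
        self.cardinality = cardinality
        self.n_epochs = n_epochs
        self.seed = seed

    def fit(self, df, y=None):
        # Build the [N x M] label matrix and fit the generative label model
        self.applier_ = PandasLFApplier(lfs=self.lfs)
        L = self.applier_.apply(df=df)
        self.label_model_ = LabelModel(cardinality=self.cardinality, verbose=False)
        self.label_model_.fit(L_train=L, n_epochs=self.n_epochs, seed=self.seed)
        return self

    def transform(self, df):
        # Attach the predicted (weak) labels as a new column
        L = self.applier_.apply(df=df)
        df = df.copy()
        df["slabel"] = self.label_model_.predict(L=L)
        return df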

Before continuing, read the spam dataset tutorial from the Snorkel team.

So, can this create a golden base dataset?

Garbage in, garbage out! The answer is no.

In short, Snorkel helps:

  • Infuse domain knowledge into the annotation process in a semi-supervised manner
  • Create a base dataset in far less time than the many human hours required by traditional methods

We need labelling functions that can assign a label to a given text.

A labelling function (LF) is a Python function that returns a class for a given text: 1 for AI tweets, 0 for others, and -1 when unsure (abstain).
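A minimal sketch of such an LF, assuming tweets arrive as DataFrame rows with a text column; the keyword list is illustrative, not the project's actual rules (those live in the gist above).

from snorkel.labeling import labeling_function

ABSTAIN, OTHER, AI = -1, 0, 1   # -1 = unsure (abstain), 0 = other, 1 = AI tweet

@labeling_function()
def lf_ai_keywords(x):
    # Vote AI when a strong multi-word keyword appears, otherwise abstain
    keywords = ["machine learning", "deep learning", "neural network"]
    text = x.text.lower()
    return AI if any(kw in text for kw in keywords) else ABSTAIN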

Be careful with LFs and their labels. In our case of annotating a tweet as AI or not, it is easy to end up in a grey middle ground. For example, the keyword “machine learning” can also be seen as two separate words, “machine” and “learning”; the same goes for “neural network”, “deep learning”, “natural language processing”, etc.

If we rely on word-level matching, our labelling functions will match false-positive tweets as well as the genuinely positive ones.

Generally the text is broken into sentences, tokens and n-grams, which are then used to decide a label for the text under consideration.

Document -> Sentences -> Tokens -> n-Grams.
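A tiny sketch of the n-gram idea, using plain word tokenisation (generic Python, not a specific Snorkel utility).

import re

def ngrams(text, n=2):
    # Break text into word tokens, then slide a window of size n over them,
    # so multi-word keywords like "machine learning" can be matched as a unit
    tokens = re.findall(r"\w+", text.lower())
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams("Machine learning beats plain machine matching", n=2))
# ['machine learning', 'learning beats', 'beats plain', 'plain machine', 'machine matching']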

  • It can be as simple as checking whether a keyword is present in the text
  • Some regex patterns, such as http links, chemical names, phone numbers or hashtags
  • Extracting relationships between tokens using Named Entity Recognition and using that information to assign a label
  • Using an external library or model to come up with a label, like textblob for sentence polarity (see the sketches after this list)
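A couple of sketches of these patterns, reusing the label scheme from above; the regex, the hashtag list and the textblob threshold are illustrative assumptions only.

import re
from snorkel.labeling import labeling_function
from textblob import TextBlob

ABSTAIN, OTHER, AI = -1, 0, 1

@labeling_function()
def lf_ai_hashtag(x):
    # Regex pattern: hashtags such as #AI, #MachineLearning, #DeepLearning
    return AI if re.search(r"#(ai|machinelearning|deeplearning)\b", x.text, re.I) else ABSTAIN

@labeling_function()
def lf_textblob_polarity(x):
    # External library: textblob's sentence polarity as a (very) weak signal;
    # strongly negative tweets are assumed here to be unlikely AI announcements
    return OTHER if TextBlob(x.text).sentiment.polarity < -0.5 else ABSTAIN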

Next we need a generative model that can take the outputs of all these labelling functions, collected through an applier, and learn how to combine them into a single label for each new text.

When N samples are labelled with M LFs through a labelling function applier, we get a label matrix of size [N x M].

This label matrix is the input to the Label Model.
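In Snorkel terms, the applier builds that matrix and the LabelModel consumes it. A sketch, reusing the lfs list and df_train DataFrame from the earlier sketches (both names are assumptions of this article).

from snorkel.labeling import PandasLFApplier
from snorkel.labeling.model import LabelModel

applier = PandasLFApplier(lfs=lfs)          # lfs: list of labelling functions
L_train = applier.apply(df=df_train)        # df_train: DataFrame with a "text" column
print(L_train.shape)                        # (N, M): one row per sample, one column per LF

label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L_train=L_train, n_epochs=500, seed=123)
slabels = label_model.predict(L=L_train)    # one weak label per sample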

How do we evaluate the Snorkel annotation? A good question indeed.

It depends on whether golden labels are available for the given data or not:

  • If labels are available, we can measure accuracy with the built-in scoring function
  • If no golden labels are available, we have to rely on the coverage of the LFs. This is a little tricky because some LFs may overlap with others; getting good coverage for a given dataset is an art we learn by experience!
                     j Polarity  Coverage  Overlaps  Conflicts
is_ai_tweet          0      [1]  0.314338  0.054371   0.054371
is_not_ai_tweet      1      [0]  0.346300  0.161463   0.000000
not_data_science     2      [0]  0.109594  0.096002   0.043110
not_neural_network   3      [0]  0.002218  0.001706   0.001479
not_big_data         4      [0]  0.110675  0.103054   0.005517
not_nlp              5      [0]  0.013308  0.011830   0.000341
not_ai               6      [0]  0.008929  0.007621   0.006768
not_cv               7      [0]  0.010863  0.009441   0.002047

Label Model Accuracy: 84.6%
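The summary above is the kind of table Snorkel's LFAnalysis produces. A sketch of how such numbers can be obtained, assuming L_train, lfs and a fitted label_model as in the earlier sketches, plus a golden label array Y when one exists.

from snorkel.labeling import LFAnalysis

# Per-LF diagnostics: polarity, coverage, overlaps and conflicts
print(LFAnalysis(L=L_train, lfs=lfs).lf_summary())

# With golden labels Y available, score the label model directly
metrics = label_model.score(L=L_train, Y=Y, tie_break_policy="random")
print(f"Label Model Accuracy: {metrics['accuracy'] * 100:.1f}%")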

As part of the gist, there is a dataset to try out.

The text column contains the actual tweets that were collected, the slabel column is annotated with the Snorkel code available here, and a label column is created as a copy of slabel just to show the metric calculations.
