Programmatically labeling data using Snorkel, with an example

Data annotation using Python's Snorkel, with code

Mehul Gupta
Data Science in your pocket

--


Assume you wish to build a binary classifier for detecting a disease (say COVID) given some symptoms for different people as tabular features. You have been provided with 100s of GBs of data with symptoms for people. Sounds easy (you may even start dreaming of XGBoost by now).

But here is a twist!

You don't have any labels!! What you have is just the features & no ground truth.

Manually annotating any dataset is both cumbersome & costly, so doing everything by hand is something you would wish to skip.

Is there a Pythonic way to make this annotation business less cumbersome & more cost-effective?

Yes!!

Snorkel can be very handy when you are facing such an issue: building your training dataset when you have little or no labeled data at all.

Let's jump straight into a dummy problem statement & try labeling using Snorkel.

Problem statement: We have been given some words, and for each word we wish to assign a label: 1) English dictionary word: 1, 2) Gibberish: 0.

We will try labeling these words & finally check how accurately Snorkel performed. Hand in hand, we will also discuss the different features of Snorkel that we will be using.

A few things we need beforehand

*Unlabeled data

*Some rough rules that one can think of for separating the classes. This is the intuition on which Snorkel will label the data.

For example, a word with length>20 would most probably be gibberish, as English dictionary words aren't that long. This doesn't mean the rule won't falter; if it works for 'most' of the cases, it is fine. Even if not, Snorkel will figure it out & you can modify it later (see the sketch after this list).

*A small amount of labeled data to check the quality of the labels generated (not a necessity)
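To make this concrete, here is a minimal sketch (plain Python, no Snorkel yet) of what such a rough rule could look like; the threshold of 20 is just the illustrative value from above:

def probably_gibberish(word, max_len=20):
    # rough heuristic, not a guarantee: dictionary words are rarely this long
    return len(word) > max_len

print(probably_gibberish('xkcdqwrtzplmn' * 2))  # True
print(probably_gibberish('covid'))              # False
print(probably_gibberish('pneumonoultramicroscopicsilicovolcanoconiosis'))  # True, yet it is a real word: the rule falters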

CODE ALERT

1. Import libraries

from gibberish import Gibberish
import enchant
import random
import string
import pandas as pd
import numpy as np
import itertools
import re
from snorkel.labeling import labeling_function
from snorkel.labeling import PandasLFApplier
from snorkel.labeling import LFAnalysis
from snorkel.labeling.model.label_model import LabelModel
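A quick setup note: the third-party packages used above are all pip-installable; on Linux/macOS, pyenchant may additionally require the system enchant library:

pip install snorkel gibberish pyenchant pandas numpy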

2. Preparing the unlabeled dataset, i.e. a mix of gibberish & English dictionary words (this is just for demo purposes; skip it if you already have some unlabeled data)

gib = Gibberish()
eng_dict = enchant.Dict("en_US")

def generate_data(count):
    # generate `count` gibberish words
    gib_words = gib.generate_words(count)
    # generate English words: ask pyenchant to suggest dictionary words for random seed strings
    eng_words = list(itertools.chain(*[eng_dict.suggest("".join([random.choice(string.ascii_lowercase) for _ in range(random.randint(3, 10))])) for _ in range(count)]))
    # gibberish -> label 0, dictionary words -> label 1
    words = [x.lower() for x in gib_words + eng_words]
    labels = list(np.zeros(len(gib_words), dtype=np.int8)) + list(np.ones(len(eng_words), dtype=np.int8))
    return pd.DataFrame(data={'word': words}), np.array(labels)
  • The generate_data() function takes count, i.e. the number of gibberish words to generate & the number of seed strings used to generate English words via pyenchant
  • Gibberish words are assigned label=0 while English dictionary words get label=1
  • The gibberish & English words generated are merged (lowercased) into the list 'words', alongside their labels in 'labels'
  • It returns a dataframe with one column 'word' & the labels separately as a np.array
train_df, _ = generate_data(1000)
validate_df, validate_labels = generate_data(1000)

Note that we didn't keep the labels for the training dataset. This is to replicate the real-world scenario where we won't have labels for the majority of the data, with only a handful of samples being labeled (the validation set).

Let’s observe train_df & validate_df
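A quick peek (the exact words will differ on every run since both generators are random, & the row count of the English half depends on how many suggestions pyenchant returns per seed string):

print(train_df.shape, validate_df.shape)
print(train_df.sample(5))
print(validate_df.sample(5))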

3. Defining labeling functions

ABSTAIN = -1
DICT_WORD = 1
GIBBERISH = 0

@labeling_function()
def no_vowel(record):
    # no vowels at all -> probably gibberish
    if sum([1 if x in record['word'] else 0 for x in ['a','e','i','o','u']]) == 0:
        return GIBBERISH
    else:
        return ABSTAIN

@labeling_function()
def not_all_vowels(record):
    # at least one consonant present -> lean towards dictionary word
    if sum([1 if x in ['a','e','i','o','u'] else 0 for x in record['word']]) < len(record['word']):
        return DICT_WORD
    else:
        return ABSTAIN

@labeling_function()
def length(record):
    # short words are more likely to be dictionary words
    if len(record['word']) < 8:
        return DICT_WORD
    else:
        return GIBBERISH

@labeling_function()
def consecutive_consonants(record):
    # 3+ consecutive consonants -> lean towards gibberish
    if re.findall(r'(?:(?![aeiou])[a-z]){3,}', record['word']):
        return GIBBERISH
    else:
        return DICT_WORD

@labeling_function()
def consecutive_vowels(record):
    # 3+ consecutive vowels -> lean towards gibberish
    if re.findall(r"\b(?=[a-z]*[aeiou]{3,})[a-z]+\b", record['word']):
        return GIBBERISH
    else:
        return DICT_WORD

This requires some explanation!!

  • Labeling functions are nothing but soft rules we build out of intuition/domain knowledge. Depending on some criteria, a labeling function can return either one of the valid labels (0 or 1 in our case) or -1 (ABSTAIN)

What is Abstain?

So, when you define a rule, there may be cases it simply cannot assign to either class. For example, Rule 1 states that if there is no vowel in the word, it is Gibberish. But if at least one vowel is present, does that make it a dictionary word? Not really. So we return ABSTAIN, i.e. 'we aren't sure'.

So, we must know a few things about labeling functions:

  • A rule must output a label for every scenario possible under the condition mentioned. If it is not clear what the output should be in some scenario, use ABSTAIN, i.e. -1

  • We need the decorator @labeling_function() to declare a function as a labeling function

  • We can have as many labeling functions as we wish

  • The rules formed aren't required to be mathematically backed!! As mentioned, they can come from domain knowledge or even gut feeling. How do we determine the quality of the rules formed? We will discuss this later using LFAnalysis

So what are the rules we formed for our problem?

no_vowel: if no vowels in the word, it should be Gibberish, else ABSTAIN

not_all_vowels: if not all characters are vowels, English dictionary word, else ABSTAIN (all vowels, no consonants)

length: if word length <8 then dictionary word, else Gibberish

consecutive_consonants: if 3 or more consecutive consonants in a word, Gibberish, else dictionary word

consecutive_vowels: if 3 or more consecutive vowels in a word, Gibberish, else dictionary word

Note: All these rules have no mathematical backing & are built just out of intuition!!

Moving ahead

4. Applying labeling functions to training & validation data

lfs = [no_vowel, not_all_vowels, length, consecutive_consonants, consecutive_vowels]
applier = PandasLFApplier(lfs=lfs)
L_train = applier.apply(df=train_df)
L_validate = applier.apply(df=validate_df)
  • PandasLFApplier applies a list of labeling functions to a pandas dataframe. Snorkel has other appliers as well, depending on the data type over which the labeling functions have to be applied, like DaskLFApplier(), SparkLFApplier(), etc.

Let’s observe L_train
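A quick way to inspect it yourself:

print(L_train.shape)  # (num_words, 5): one column per labeling function
print(L_train[:5])    # each entry is -1 (abstain), 0 (Gibberish) or 1 (Dict_word)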

So, each word has now turned into a 1x5 vector. As you must have guessed, each value corresponds to the output of one of the 5 labeling functions for that particular word. Let's take a deeper dive into the performance of the labeling functions:

LFAnalysis(L_train, lfs).lf_summary()

This output looks interesting & may require some explanation. It can be very handy in determining the quality of the rules formed.

  • Polarity: the set of labels the function has returned at least once over the given dataset, excluding ABSTAIN (as it is not a label). Hence, no_vowel() has polarity [0] as it either detects Gibberish (0) or abstains (-1). Similarly, the polarity of length() is [0, 1] as it can return both labels. It helps to catch logical errors in the functions.
  • Coverage: the proportion of data for which the labeling function gave an output of Gibberish/Dict_word, but not Abstain. Hence, coverage isn't 1 for no_vowel() (it must have returned ABSTAIN for some words), while it is 1 for the last 3 functions (they either can't abstain, or never did on any word). A function with low coverage can be modified for better coverage.
  • Overlaps: the fraction of samples where this function gave a non-abstain output & at least one other function gave a non-abstain output on the same sample too. For example: assume no_vowel() returns 0 for 10 & -1 for 30 out of 40 samples. Suppose not_all_vowels() also labels 4 of those 10 samples, length() labels 5 others of those 10 (no sample labeled by both), & the remaining 2 functions abstain everywhere. Then the overlap for no_vowel() = (4+5)/40 = 9/40 = 0.225, since 9 of the 10 samples labeled by no_vowel() were also labeled by at least one other function.
  • Conflicts: sort of the reverse of overlaps: the fraction of samples where this function's non-abstain output disagrees with at least one other function's non-abstain output. (If these definitions still feel abstract, a runnable toy example follows this list.)
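Here is a tiny hand-built label matrix (3 dummy labeling functions, 5 samples; the values are made up purely for illustration) that you can feed straight to LFAnalysis & compare against the definitions above:

import numpy as np
from snorkel.labeling import LFAnalysis

# rows = samples, columns = labeling functions; -1 means abstain
L_toy = np.array([
    [ 0,  0, -1],  # LF0 & LF1 both label & agree -> overlap, no conflict
    [ 0,  1, -1],  # LF0 & LF1 both label & disagree -> overlap AND conflict
    [-1,  1,  1],  # LF1 & LF2 overlap & agree
    [-1, -1,  0],  # only LF2 labels -> adds to its coverage, no overlap
    [ 0, -1, -1],  # only LF0 labels
])
print(LFAnalysis(L_toy).lf_summary())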

5. Training model

label_model = LabelModel(verbose=False)
label_model.fit(L_train=L_train, n_epochs=1000, seed=100)
preds_train_label = label_model.predict(L=L_train)
preds_valid_label = label_model.predict(L=L_validate)

Snorkel provides LabelModel(), which estimates the accuracies of the labeling functions from the agreements & disagreements in their outputs & finally combines the outputs from each of the functions to give a single label. More about how the model works can be read here

The model is trained on the label matrix produced by applying the labeling functions to the training data
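Before using these predictions downstream, one more trick is worth knowing: rows where every labeling function abstained carry no signal at all, & Snorkel's filter_unlabeled_dataframe drops them for us. A minimal sketch (in our toy problem the last 3 functions never abstain, so nothing actually gets filtered here, but in real pipelines this step matters):

from snorkel.labeling import filter_unlabeled_dataframe

# probabilistic labels (a probability per class) instead of hard 0/1
probs_train = label_model.predict_proba(L=L_train)

# drop rows where all labeling functions returned -1
train_df_filtered, probs_filtered = filter_unlabeled_dataframe(
    X=train_df, y=probs_train, L=L_train
)
print(len(train_df), '->', len(train_df_filtered))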

6. Analyzing results

It's time we observe how well the model annotated the data, given our set of rules, on the validation set.

print('validate metrics')
print(label_model.score(L_validate, Y=validate_labels,metrics=["f1","accuracy",'precision','recall']))
final metrics on the validation set

The results look pretty good, with an F1 close to 75%, given that:

  • We used just 2000 words in the training dataset

  • The labeling functions were a quick first thought; with some time spent, we could come up with higher-quality labeling functions

  • The entire process took <2–3 minutes

  • We had no labels in the beginning

A further deep dive can make things much more interesting

LFAnalysis(L_validate, lfs).lf_summary(validate_labels)

So, when we provide the ground truth to LFAnalysis as well, we get 3 more columns than in the training-data analysis

Correct & Incorrect: of all non-abstain labels (only 0 & 1 in our case), how many were correct & how many incorrect

Empirical Accuracy: nothing but accuracy computed over just the non-abstain outputs (0 & 1)

As we can observe, our first function, no_vowel(), is a total disaster (accuracy=0). Also, consecutive_consonants() appears to be the best labeling function of the lot. Hence, LFAnalysis can help us understand the quality of our labeling functions.
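Acting on that insight is cheap: drop (or rewrite) the weak function, re-apply & re-fit. A quick sketch of that iteration, assuming we simply discard no_vowel:

# v2: same pipeline without the weakest rule
lfs_v2 = [not_all_vowels, length, consecutive_consonants, consecutive_vowels]
applier_v2 = PandasLFApplier(lfs=lfs_v2)
L_train_v2 = applier_v2.apply(df=train_df)
L_validate_v2 = applier_v2.apply(df=validate_df)

label_model_v2 = LabelModel(verbose=False)
label_model_v2.fit(L_train=L_train_v2, n_epochs=1000, seed=100)
print(label_model_v2.score(L_validate_v2, Y=validate_labels,
                           metrics=["f1", "accuracy"]))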

Final words

Snorkel appears to be a big gun, providing a hybrid solution that uses heuristics & statistical models to assign labels to unlabeled samples, saving both time & money. Such scenarios are prevalent in real-world data problems, where we have massive datasets but no annotations. On top of that, if we wish to train some heavy deep learning model, manual annotation is just not an option. Even though we have only touched the tip of the iceberg with Snorkel this time, it appears promising & worth giving a shot.
