How Snorkel, a semi-supervised learning technique, solved invoice accounting at Tide

Dioscuri, the project that matched Tide invoices with transactions only with a small hand-labeled dataset

Natalia Koupanou
Tide Engineering Team
10 min read · Apr 22, 2020


TL;DR

  1. With Snorkel, Dioscuri achieved high accuracy in detecting the transactions made for a particular invoice, without large hand-labeled datasets and in less than two sprints.
  2. Weak supervision sits between unsupervised and supervised learning. Snorkel is a Python package that combines labels from different weak supervision sources.
  3. With Snorkel, business knowledge and other models were harnessed to “weakly” label data programmatically.
  4. Snorkel uses a list of user-defined heuristics to label data and creates a generative model that weights the importance of each rule.
  5. The noisy labels created by the trained Snorkel label model can be used to train a classifier that generalises beyond them.

Project Dioscuri had the objective of identifying the transaction(s) related to a receivable invoice, if there were any. Examples of invoices issued from Tide and paid into Tide accounts were not readily available or easily identifiable, so to achieve Dioscuri’s goal it was important to label whether an invoice-transaction combination was correct. However, this meant either waiting a long time to collect such data or investing highly paid subject matter experts’ time in hand-labeling historical data. Thankfully Snorkel, a Python package for programmatically building and managing training data, helped us meet Dioscuri’s goal quickly and with a high level of accuracy.

So, let’s understand the problem and how Snorkel helped …


What is Dioscuri?

In Greek mythology, the Dioscuri were “the twins, Castor and Pollux, who were reunited as stars in the sky by Zeus after Castor’s death” (aka Gemini in Latin) [6].

At Tide, Dioscuri is the project that matches receivable invoices raised from the Tide app with incoming transactions.

At Tide, we want to do what we can to help our members have a healthy cash flow. As Paul Uppal, Small Business Commissioner, said:

“A healthy cash flow is the life blood of any small business. Getting your credit and payment process right from the start is crucial if you want to build and grow your business.” [5]

This project was challenging for many reasons. To name a few:

  • The invoice reference number is not always unique in small businesses.
  • Payers of invoices don’t always use an invoice reference number in the payment.
  • An invoice might never be paid in full.
  • Amending invoices is not an option currently with Tide.
  • Invoice-to-transaction matching isn’t always a 1:1 relationship; it can be many:1 or even 1:many.
  • For frequent customers of our members, with multiple invoices and payments, it’s harder to identify which transaction is for which invoice.
  • But probably the toughest problem was the lack of labeled data.
How a data scientist feels without labeled data

What is semi-supervised learning?

Before diving into the details of Snorkel, it is important to know what semi-supervision (or, as it is often termed, weak supervision: noisier, higher-level supervision) is.

Supervised learning is “the machine learning task of learning a function that maps an input (X) to an output (Y) based on example input-output pairs” (Y = f(X)) [1]. Unsupervised learning is a “type of machine learning that looks for previously undetected patterns in a data set (X) with no pre-existing labels (i.e. no Y) and with a minimum of human supervision” [2].

Semi-supervised learning is in between the two machine learning approaches as it “combines a small amount of labeled data with a large amount of unlabeled data during training” with the benefit of improving learning accuracy.

Why Snorkel?

Snorkel has proven itself on various business problems and had been applied at numerous companies before Tide, including Google, IBM, SAP, Ant Financial, Accenture and Microsoft.

  • It can leverage business knowledge to label data programmatically [4]. Heuristics and rules coming from business expertise can be included.
  • It can transfer learning from previous models and even from non-servable data [3], i.e. data that is slow or expensive to use in production. The labels created by Snorkel can later be used to train a model with servable features, e.g. inexpensive, real-time signals.
  • It brings development time and cost down by an order of magnitude according to Google’s AI team [3].
  • Snorkel doesn’t need massive sets of hand-labeled training data. It can programmatically create many weak labels without any hand-labeled data.
  • As a result, it can considerably improve the accuracy of models with a flexible form of transfer learning. The Snorkel DryBell paper shows that it takes roughly 12K hand-labeled examples to match the predictive accuracy of the weakly supervised classifier trained on data labeled by Snorkel [3].
  • By combining various weak proxies Snorkel can create labels even for concepts that are loosely defined and cannot be easily collected or created manually, such as user engagement and quality of lead or traffic.

How to apply Snorkel?

Labeling Functions & Preprocessors

Labeling functions (LFs) are essentially heuristics or rules that capture organisational knowledge, originating either from business teams or from data models, expressed as a logical statement.

Depending on the statement, an LF either assigns a label to a data point (in this case: invoice-transaction match or no match) or abstains (i.e. assigns no label) [7].

Left: labeling functions that assign 1 to data points or abstain. Right: labeling functions that assign 0 to data points or abstain.

A list of LFs is defined to cover a variety of weak supervision strategies. Some cover cases where an invoice-transaction match might be correct, and others cover combinations of invoices and transactions that might not match.

Before defining the labeling functions there is a lot of data preparation required, as with all data models. Even though feature engineering is beyond the scope of this blog post, it is worth mentioning Snorkel’s @preprocessor decorator, which can be used for preprocessing functions.

Preprocessors map a data point to a new data point [7].

Preprocessors also offer memoization (caching of input/output) to avoid re-executing a data mapping that is needed by multiple LFs.

Labeling functions (LFs) in Snorkel are created with the @labeling_function decorator, which can be applied to any Python function that returns a label for a single data point. Any preprocessors that need to run on a data point before an LF executes can be specified in the labeling_function decorator, together with any necessary labeling resources (i.e. keyword arguments for the labeling function), as shown in the code example below.
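As an illustration, here is a minimal sketch of a preprocessor and two labeling functions, loosely following the Snorkel tutorial pattern [7]. The field names (invoice_reference, transaction_reference, invoice_date, transaction_date) and the 30-day window are hypothetical placeholders, not our production rules:

from snorkel.labeling import labeling_function
from snorkel.preprocess import preprocessor

ABSTAIN = -1
MATCH = 1
NO_MATCH = 0

# Hypothetical preprocessor: normalise the transaction reference once and
# memoize the result so every LF relying on it avoids recomputation.
@preprocessor(memoize=True)
def clean_reference(x):
    x.clean_ref = x.transaction_reference.strip().lower()
    return x

# LF that assigns MATCH or abstains: the invoice reference appears in the
# cleaned transaction reference.
@labeling_function(pre=[clean_reference])
def lf_reference_in_payment(x):
    return MATCH if x.invoice_reference.lower() in x.clean_ref else ABSTAIN

# LF that assigns NO_MATCH or abstains, with a labeling resource passed in
# as a keyword argument via the decorator.
@labeling_function(resources=dict(max_days=30))
def lf_outside_payment_window(x, max_days):
    days = (x.transaction_date - x.invoice_date).days
    return NO_MATCH if days < 0 or days > max_days else ABSTAIN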

Label Matrix

After defining all the LFs, we can apply them in a couple of lines of code, as shown in the example below. Because of our dataset size, we used the DaskLFApplier here to speed up the calculations. However, Snorkel also supports other DataFrame-like structures, including Pandas and PySpark [7].
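A sketch of this step, assuming the LFs above and a candidate DataFrame df_candidates with one row per invoice-transaction pair (both names are placeholders); the Pandas applier is shown, with the Dask-based applier noted as an alternative:

from snorkel.labeling import PandasLFApplier
# For larger datasets, a Dask-based applier is also available:
# from snorkel.labeling.apply.dask import DaskLFApplier

lfs = [lf_reference_in_payment, lf_outside_payment_window]

applier = PandasLFApplier(lfs=lfs)
L_train = applier.apply(df=df_candidates)

# L_train is the label matrix: one row per candidate pair, one column per LF.
print(L_train.shape)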

The output of the above code is a label matrix, which is a NumPy array with one column for each LF and one row for each datapoint.

As observed from the example above, the same data point might be labeled by multiple LFs or by none of them, and its labels might agree or differ. Noisy LFs without perfect accuracy are expected, though, and Snorkel provides a utility for computing statistics and analyses about LFs.
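This summary can be produced with LFAnalysis (a sketch, reusing L_train and lfs from the example above):

from snorkel.labeling import LFAnalysis

# Per-LF statistics: polarity, coverage, overlaps and conflicts
LFAnalysis(L=L_train, lfs=lfs).lf_summary()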

As shown below, the output of the above code is a summary report for LFs and includes:

  • Polarity: The set of labels the LF outputs (excluding abstains)
  • Coverage: % of dataset the LF labels
  • Overlaps: % of the dataset that the LF and at least one other LF label
  • Conflicts: % of the dataset that the LF and at least one other LF label and disagree on

Label Model

The observed overlaps and conflicts of the LFs are expected and used by Snorkel to create a generative model that estimates the accuracies of LFs.

Probabilistic (confidence-weighted) training labels (aka “noise-aware training labels”) are produced by re-weighting and combining the LFs’ outputs based on the LF accuracies estimated by the label model [7].
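A minimal sketch of fitting the label model and producing these probabilistic labels (the hyperparameters are illustrative, not our tuned values):

from snorkel.labeling.model import LabelModel

# Fit the generative label model on the label matrix alone (no ground truth)
label_model = LabelModel(cardinality=2, verbose=True)
label_model.fit(L_train=L_train, n_epochs=500, log_freq=100, seed=123)

# Probabilistic ("noise-aware") training labels for each candidate pair
probs_train = label_model.predict_proba(L=L_train)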

As the following diagram shows, there was a dramatic decrease in the number of possible matched combinations of invoices and transactions. Note that some extra validation rules were added on top of the Snorkel probabilities to assess and handle cases such as a transaction being matched with multiple invoices.

The validated labels found from Snorkel were then used to train a classifier. The problem was essentially transformed into a supervised one, and there is a huge variety of classifiers, from logistic regression and ensembles of classifiers to neural networks, that can be trained to generalise beyond the noisy labels.
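As a sketch of that step (feature_columns is a placeholder for the servable features, and any scikit-learn classifier could stand in for the logistic regression):

from sklearn.linear_model import LogisticRegression
from snorkel.labeling import filter_unlabeled_dataframe
from snorkel.utils import probs_to_preds

# Drop candidate pairs on which every LF abstained
df_train, probs_filtered = filter_unlabeled_dataframe(
    X=df_candidates, y=probs_train, L=L_train
)

# Harden the probabilistic labels and train a downstream classifier
preds_train = probs_to_preds(probs=probs_filtered)
clf = LogisticRegression(max_iter=1000)
clf.fit(df_train[feature_columns], preds_train)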

Snorkel DryBell workflow diagram from [4]

How to assess Snorkel results?

Similarly to the evaluation of classification models, there are various metrics we can use to assess the success of Snorkel’s models. To name a few:

  • Type I error (false positive rate) = false positives / (false positives + true negatives)
  • Type II error (false negative rate) = false negatives / (false negatives + true positives)
  • Recall = true positives / (true positives + false negatives)
  • Accuracy = (true positives + true negatives) / (all positives + all negatives)
  • Coverage = % of dataset labeled

Note that, depending on the application, different targets are set for these metrics. But for all metrics apart from coverage, we need to know whether a prediction is correct or not. For this reason, we carried out a short hand-labeling exercise and treated the resulting small dataset as a golden dataset against which to measure the metrics listed above.
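As a sketch, assuming y_true holds the golden labels and y_pred the model’s predictions as 0/1 NumPy arrays, these metrics can be computed as follows:

import numpy as np

def evaluate(y_true, y_pred):
    # Confusion-matrix counts against the hand-labeled golden dataset
    tp = np.sum((y_pred == 1) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    return {
        "type_i_error": fp / (fp + tn),
        "type_ii_error": fn / (fn + tp),
        "recall": tp / (tp + fn),
        "accuracy": (tp + tn) / len(y_true),
    }

# Coverage is measured on the full dataset instead: the share of data points
# the pipeline labels (i.e. does not abstain on).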

What is the maths behind Snorkel?

  1. Let an unlabeled data point be Xᵢ, its associated unknown label Yᵢ, and the total number of data points m.
  2. Say we have n labeling functions (LFs); then for a binary classification task an LF is a function λⱼ: X → {-1, 0, 1} (0 denotes abstain).
  3. Let Λ be the matrix of label outputs with dimensions m x n, such that Λᵢⱼ = λⱼ(Xᵢ).
  4. The parameters of the generative label model, w, can be estimated by maximising the log marginal likelihood of the observed labels Λ:
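In LaTeX notation, following [3], this objective can be written as:

\hat{w} = \arg\max_{w} \log P_w(\Lambda) = \arg\max_{w} \log \sum_{Y} P_w(\Lambda, Y)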

  5. Probabilistic training labels are then calculated as follows:
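Following [3], the probabilistic label for data point i is the posterior under the fitted label model:

\tilde{Y}_i = P_{\hat{w}}(Y_i \mid \Lambda)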

  6. With the training dataset (X, Ỹ), a discriminative classifier h_𝜃 can be trained as usual by minimising the expected loss with respect to Ỹ:
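Following [3], with a loss function \ell this amounts to:

\hat{\theta} = \arg\min_{\theta} \frac{1}{m} \sum_{i=1}^{m} \mathbb{E}_{y \sim \tilde{Y}_i}\left[\, \ell\big(h_\theta(X_i),\, y\big) \right]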

The definitions above follow the Snorkel DryBell 2019 paper [3], where more detailed explanations can be found.

Thanks for reading and feel free to share! I would love to hear what your experience with Snorkel was.

If you’re interested in applying machine learning in FinTech, join Tide data team: https://www.tide.co/careers/

References

  1. Stuart J. Russell, Peter Norvig (2010). Artificial Intelligence: A Modern Approach, Third Edition, Prentice Hall ISBN 9780136042594.
  2. Hinton, Geoffrey; Sejnowski, Terrence (1999). Unsupervised Learning: Foundations of Neural Computation. MIT Press. ISBN 978-0262581684.
  3. Stephen H. Bach, Daniel Rodriguez, Yintao Liu et al. (2019), Snorkel DryBell: A Case Study in Deploying Weak Supervision at Industrial Scale.
  4. Alex Ratner, Stanford University and Cassandra Xia, Google AI (2019). Harnessing Organizational Knowledge for Machine Learning, Google AI Blog.
  5. https://www.smallbusinesscommissioner.gov.uk/home-page/get-your-invoices-paid-on-time/
  6. https://www.merriam-webster.com/dictionary/Dioscuri
  7. https://www.snorkel.org/use-cases/01-spam-tutorial
