NOTE: This article assumes you have a license for Prodigy and that you understand active learning and machine teaching at a cursory level.
I’ve seen very few examples of useful custom Prodigy recipes that go beyond the basic tasks the tool was originally designed for (text classification, named entity recognition and image annotation).
These tasks all work exceptionally well but Prodigy is capable of so much more. Recently, I’ve been framing more and more of my problems at work as machine teaching problems and I’m super happy with the results.
In this post we’re going to walk through building a more complex custom Prodigy recipe. The basic problem we’re trying to solve is linking entity records across multiple data sets. To do this we’ll be using the awesome Python dedupe library for the heavy lifting. Our example data sets are CSV files with product data from two fictitious stores (Abt and Buy).
Here’s a sample of what our data looks like:
See https://github.com/dedupeio/dedupe-examples/tree/master/record_linkage_example for the original
If we examine the first row of each data set, we can immediately see they refer to the same product, the “Linksys EtherFast Ethernet Switch”. While that observation is obvious to us, our computers may have trouble with this logical jump. That’s where Machine Teaching (aka Prodigy) comes in.
Prodigy Recipe Configuration
A Prodigy recipe is just a Python function decorated with @prodigy.recipe. The decorator defines the arguments needed to run the annotation task in Prodigy. Our recipe asks for a Prodigy dataset to store annotations in, the paths to the two CSV files containing the records to be linked and resolved, and the path to a JSON file containing the fields configuration for dedupe to train with. See https://docs.dedupe.io/en/latest/Variable-definition.html for an explanation of the fields configuration object.
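A recipe ultimately returns a dictionary of components that tells Prodigy how to run the task. As a rough sketch of that shape (the function and variable names here are illustrative, not the actual recipe's code):

```python
def link_records_components(dataset, stream, update_fn):
    """Assemble the components dict a Prodigy recipe returns.

    In the real recipe, this would live inside a function decorated with
    @prodigy.recipe, which also declares the CLI arguments
    (dataset, --left, --right, --fields).
    """
    return {
        "dataset": dataset,    # Prodigy dataset that stores the annotations
        "view_id": "html",     # custom HTML interface for comparing records
        "stream": stream,      # generator yielding candidate-pair tasks
        "update": update_fn,   # callback invoked with answers on each save
    }
```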
Our fields configuration is fairly simple. Essentially we want to compare the title, description and price columns of our two CSV dataset files. The “type” attribute tells dedupe how to treat the value in that column for each observation. Dedupe provides a suite of distance algorithms to help narrow down the potential duplicate records it presents to the annotator.
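For reference, a fields configuration matching that description might look like the following (the exact variable types and the "has missing" flags are assumptions on my part; see the dedupe variable-definition docs linked above for the full list of types):

```json
[
  {"field": "title", "type": "String"},
  {"field": "description", "type": "Text", "has missing": true},
  {"field": "price", "type": "Price", "has missing": true}
]
```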
Running the annotation task
Now that we have a better understanding of our data and annotation task, let’s start the annotation process and give our model something to work with.
$ prodigy dataset link_records_example
$ prodigy records.link link_records_example \
--left data/raw_dedupe_abtbuy_abt.csv \
--right data/raw_dedupe_abtbuy_buy.csv \
--fields fields.json
This will launch Prodigy with the custom HTML interface for linking records.
In the interface, a row is highlighted green if the field has an exact string match across both datasets; otherwise it is left unhighlighted.
If you think the two records refer to the same product, as they do in the image above, click accept; otherwise, click reject.
When you click the save button, the model is updated with your annotations and your progress indicator advances.
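Conceptually, the recipe's update callback receives the batch of answered tasks each time you save. A hypothetical sketch (the real recipe would also convert these answers into labeled pairs for dedupe):

```python
def make_update(counts):
    """Build an update callback that tallies accept/reject answers.

    Prodigy calls the returned function with a list of answered tasks;
    each task dict carries an "answer" key of "accept", "reject", or "ignore".
    """
    def update(answers):
        for answer in answers:
            if answer["answer"] == "accept":
                counts["match"] += 1
            elif answer["answer"] == "reject":
                counts["distinct"] += 1
        # A real recipe would also feed these labeled pairs back to dedupe.
    return update
```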
In order to reach 100% progress, the dedupe library recommends at least 10 positive and 10 negative examples.
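One plausible way to compute that progress figure, assuming the recipe follows dedupe's 10-and-10 recommendation (this helper is hypothetical, not the recipe's actual formula):

```python
def annotation_progress(n_match, n_distinct, target=10):
    """Fraction of dedupe's recommended minimum annotations collected.

    Caps each count at the target so extra annotations on one side
    can't mask a shortage on the other.
    """
    return (min(n_match, target) + min(n_distinct, target)) / (2 * target)
```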
Once you end the annotation session, a model is batch trained on your annotations and evaluated on the rest of the dataset. Records the model thinks should be linked together are written to a file named ./data_matching_output.csv, and a copy of the learned dedupe settings is saved to ./data_matching_learned_settings.
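To give a sense of what that batch-training step involves, here is a hedged sketch built on the dedupe 2.x RecordLink API (the function name, argument shapes, and threshold here are my assumptions; the recipe's actual internals may differ):

```python
def train_and_link(fields, left_records, right_records, labeled, threshold=0.5):
    """Sketch: batch-train a dedupe RecordLink model from saved annotations.

    `labeled` is a dict like {"match": [...], "distinct": [...]} built from
    the accept/reject answers stored in the Prodigy dataset.
    """
    import dedupe  # imported lazily; assumes the dedupe library is installed

    linker = dedupe.RecordLink(fields)
    linker.prepare_training(left_records, right_records)
    linker.mark_pairs(labeled)
    linker.train()
    # Pairs scoring above the threshold become linked records; the learned
    # settings can be persisted with linker.write_settings(file_obj).
    return linker.join(left_records, right_records, threshold)
```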
The point of this article is to open your eyes to the many potential uses of Prodigy beyond its built-in tasks.