Harnessing Lineage for Continuous Improvement of Deep Learning Datasets

Ari Surana
8 min read · Sep 1, 2023


Machine learning at scale requires a robust data engine geared towards continuous data quality improvement. In this article I share my insights on how to build such a system with a fine-grained lineage tracking solution.

The Problem

Deep learning models’ success rests on the datasets they are trained on. One of the most reliable sources of performance gains is better training data.

However, the journey to model excellence doesn’t end at dataset creation; continuously improving dataset quality is equally vital.

Continuous Improvement

Applying machine learning to real-world solutions requires iteration. Just like software projects, over time we need to address new functionality, refinements to existing functionality, regressions, bugs, and so on. These are all critical to the success and usability of any application.

However, unlike most software projects, machine learning models rely heavily on datasets for most of these improvements. This is why we need the ability to continuously improve training datasets and fix quality issues as we find them, while keeping track of what’s changed.

Reproducibility

Improving dataset quality usually means correcting labels that may have been collected some time ago. We build datasets from various sources such as archives, medical records, public information, the internet, etc.

Over time, as processes, techniques, tools and knowledge improve, these datasets and their labels need to evolve and improve, while older versions are still maintained for reproducibility.

Reproducibility Challenge

Why should we care about reproducibility in industrial applications?

Building machine learning models has historically been mostly an academic and scientific exercise, so you may have come across the reproducibility challenge in this field.

This is simply a precursor to the reproducibility challenge we face in industry as AI and machine learning models become prevalent in every application, from chatbots to copilots to medical assistants.

Training models for real-world applications is an iterative exercise with a long evolutionary chain of experiments, tweaks and architectural changes. It is always a challenge to pick the best model for the application at hand. To compare and evaluate models built across time, access to frozen training datasets is crucial. To track and explain improvements in models, access to the evolutionary history of your labelled dataset is equally important.

Beyond the need for an evolutionary framework, many applications of machine learning warrant deeper scrutiny and regulation of the data used to train the model. Applications in fields like medicine, insurance and law enforcement can have massive real-world repercussions due to small biases in the training dataset. As such, these datasets should be subject to a higher degree of analysis and tracking, ensuring we eliminate biases as and when we discover them over time.

Needless to say, any serious machine learning operation needs to focus on reliable and reproducible datasets.

Tracking Versions

Traditionally, open datasets have solved this problem by simply publishing each new version of the entire dataset as a fresh copy. This works well for relatively slow-moving datasets. COCO is an example of this approach, where newer snapshots are available to download independently.

COCO downloads

For industrial applications, where we need to gather tens of millions of images with hundreds of millions of labels, this approach falls short.

Typically we expect labels to be improved continuously, on a daily basis, by an expert human workforce of hundreds of people meticulously refining labels.

We expect to train new models every month, if not every week. This speed of iteration, combined with the scale of the data, means whole-dataset snapshots simply cannot keep up.

We need granular lineage tracking that allows large scale datasets to evolve at speed while not compromising on tracking, reproducibility and flexibility.

What is Lineage?

Lineage refers to the historical record of the origin, transformation, and evolution of data. It encompasses the entire lifecycle of data, including its creation, processing, and any changes it undergoes over time.

In the context of this article I will use the word lineage to describe the transformation or correction of ground truth related to a feature over time. In practice this may look like the following example:

Imagine the case of semantic segmentation to detect buildings on aerial imagery. We select a region on the map, and send it to be labelled by an expert human labeller:

Round 1 — human labelling.

While experts labelled the image to the best of their knowledge, there is always a propensity for systemic issues and human error. For example, the labelling instructions may not cover how to treat a small patch of tiles that looks like, but really isn’t, part of any building.

Eventually, as we detect such systemic issues and train our workforce on how to handle these cases, we need another human to verify and fix the label.

Round 2 — label correction.

While this seems simple enough, the time between the original label and its correction can span months or years. Tracking this lineage graph is going to be instrumental.

Simple label lineage.

A DAG emerges

As you can imagine, tracking lineage like this can result in many different structures.

Various types of lineage structures.

You can represent these as a series of directed acyclic graphs, or “DAGs of labels”, that sit within the larger graph of the feature-and-label dataset. Capturing this information alongside your labels is only part of the challenge; what you really need is a way to transform it into a training dataset, which needs feature and label pairs.
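
To make these shapes concrete, here is a minimal sketch (node ids are made up) of how a chain, a fork and a merge can each be written down as plain node-parent pairs:

-- Each row is one edge: (node, parent) means "node was derived from parent".
select * from (
    values
        ('B', 'A'), ('C', 'B'),   -- a chain: A corrected into B, B corrected into C
        ('E', 'D'), ('F', 'D'),   -- a fork: D corrected independently into E and F
        ('X', 'V'), ('X', 'W')    -- a merge: V and W consolidated into X
) as edges(node, parent)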

A naive approach can be to aggregate all the labels ever gathered.

Naive aggregation.

This approach makes some sense, as you are building consensus from all human labels.

However, this approach has a massive problem: it ignores the fact that new information is more likely to be correct than old. As we improve tooling, knowledge and processes over time, labels resulting from corrections are, by definition, “better” than the original labels. Aggregating newly corrected labels with older ones therefore dilutes, and in some cases reverses, the improvement.

An optimal approach would be to boost the data that has passed through more human attention. We can do this easily by finding the leaf nodes of these DAGs and discarding the parent nodes.

Boost the leaf nodes for better accumulation of knowledge.

This approach guarantees that there will always be an improvement with every correction in our dataset.

However, we may still end up with many leaf nodes, and we can use different strategies to tackle this case depending on the problem at hand.

A couple of example solutions that can be considered:

Human in the loop consolidation of many labels into one.

Human in the loop consolidation

Or simple aggregation of leaf nodes (sketched in SQL below).

Simple aggregation of boosted nodes.
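
As one hedged illustration of the second strategy: assuming the leaf labels have been materialised into a leaf_labels table with feature_id, label and created_at columns (all hypothetical names), you can collect the leaves per feature, or simply keep the most recent one:

-- Collect every leaf label per feature; consensus (e.g. a pixel-wise
-- majority vote over segmentation masks) happens downstream.
select
    feature_id,
    array_agg(label) as leaf_labels
from leaf_labels
group by feature_id

-- Alternatively, in engines that support max_by (Presto/Athena, Spark SQL),
-- keep only the most recently created leaf per feature:
-- select feature_id, max_by(label, created_at) as label
-- from leaf_labels group by feature_id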

These approaches finally yield what the model is trained on: a feature + label pair.

Atomic unit of a training dataset.

However, a single feature-label pair does not a dataset make. We need millions of such DAG traversals and consolidations to freeze a trainable dataset. Scale is a challenge.

A dataset made of boosted leaves.

Tackling Lineage at scale

A key challenge for this solution is tackling these DAG traversals in a way that scales while still maintaining reliability and transparency.

Computing over large-scale relational data has been solved by relational databases and data warehouses for many years now; it is battle-hardened, well-understood technology. SQL engines are excellent at handling relationship structures and are extremely optimised to scan and calculate arbitrary relational data.

Consider any modern data warehouse such as AWS Athena, Snowflake or Google BigQuery. All of these technologies can scale to calculate millions, even billions, of such graph traversals, and do it cheaply.

SQL lifts the heavy load

To optimally leverage SQL engines, we need to serialise the graph of metadata into a structure that databases understand and are optimised for.

We serialise only the lineage information needed for this purpose: the nodes and their parent links.

Having immutable data is quite important for maintaining integrity and reproducibility over time, so only the children record their parent relationships. As new children are introduced, the table grows linearly.

Serialise the lineage into SQL relations.
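
As a minimal sketch of what this can look like (table and column names are illustrative, not prescriptive), an append-only table where every row is a node with an optional link to its parent:

-- Append-only: rows are never updated or deleted, preserving immutability.
-- A correction simply inserts a new row that points at the row it supersedes.
create table nodes (
    node       varchar,   -- unique id of this label version
    parent     varchar,   -- id of the label this one corrects; null for originals
    created_at timestamp  -- when this version was produced
);

-- Round 1: the original label has no parent.
insert into nodes values ('label_a', null, timestamp '2021-03-01 10:00:00');

-- Round 2, possibly years later: the correction records label_a as its parent.
insert into nodes values ('label_b', 'label_a', timestamp '2023-06-15 14:30:00');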

Using simple SQL semantics, we can compute the children of each node with a single query. Here is example pseudo-SQL (the exact implementation will vary slightly based on the underlying engine):

select
    p.node,
    p.parent,
    array_agg(c.node) as children  -- listagg / list_agg in some engines
from nodes as p
left outer join nodes as c
    on p.node = c.parent           -- a child is any row naming p as its parent
group by 1, 2
Calculate children with a simple SQL join.

The lack of any children provides an easy marker for finding leaf nodes.

Boosted leaf nodes are identified for easy querying.
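
Concretely, one way to pull out the leaves is an anti-join: keep every node that never appears as anyone’s parent:

-- A leaf is a node that no other node points to as its parent.
select p.node
from nodes as p
left outer join nodes as c
    on p.node = c.parent
where c.node is null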

From this point, implementing flexible solutions to get to the final dataset is trivial.
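
For example, assuming label payloads live in a labels table keyed by the same node id and carrying a feature_id, with the imagery itself in a features table (all hypothetical names), publishing the trainable dataset is a join of features against the leaf labels:

with leaf_nodes as (
    -- the leaf-node anti-join from above
    select p.node
    from nodes as p
    left outer join nodes as c on p.node = c.parent
    where c.node is null
)
select f.feature, l.label
from leaf_nodes as ln
join labels as l
    on l.node = ln.node              -- label payloads, keyed by node id
join features as f
    on f.feature_id = l.feature_id   -- the imagery each label belongs to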

This also gives you easy access to ALL of the history of your dataset, in exactly the same way you would access its latest version.
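
Because the table is append-only, reproducing an old version is the same query with one extra predicate: ignore every node created after the snapshot date (the timestamp below is illustrative):

-- "The dataset as it looked on 2022-01-01": filter first, then find leaves.
with as_of as (
    select * from nodes
    where created_at <= timestamp '2022-01-01 00:00:00'
)
select p.node
from as_of as p
left outer join as_of as c
    on p.node = c.parent
where c.node is null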

Conclusion

When building training datasets for real-world applications of ML models, consider the long-term evolution of your dataset. Capturing parent lineage information with your labels is best practice. Leveraging modern SQL warehouses provides a powerful solution for publishing datasets and selecting optimal labels from the lineage. Treating rich historical data as a first-class citizen allows easy navigation of the iterative path that is machine learning.


Ari Surana

Principal Machine Learning Engineer and Technical Leader working towards building more intelligent machines for a bountiful future.