Complaint-driven Training Data Debugging

Eugene Wu
Apr 29, 2020



Machine learning is often called “Software 2.0” due to its reliance on training data to “program” the predictive model. These models are used to make product recommendations, predict user characteristics, recognize patterns, and extract content from text, images, audio, and videos. The success of machine learning has led industry and science to increasingly incorporate it into their business processes and data analytics pipelines.

A major issue is how to debug complex workflows that combine traditional data analytics with machine learning predictions. In particular, errors in training data can cause the model to mis-predict, and ultimately affect analysis results. How can we help users and developers bridge the gap between errors in the analysis output, which are easy to find, and errors in the training data, which are considerably harder to find? Our recent SIGMOD 2020 paper, Complaint-driven Training Data Debugging for Query 2.0, proposes a complaint-driven approach to training data debugging that addresses this problem.

This is a collaborative research project between Weiyuan Wu and Jiannan Wang from SFU, and Lampros Flokas and Eugene Wu from Columbia University. We will present this work at the SIGMOD 2020 conference during the week of June 14–19, 2020. We presented a shorter version at MLOps earlier this year that lays out our larger motivation.

Shoot us an email if you are interested or have questions about our work!


Modern analytics workflows combine ML and data analytics (diagram below). Training data is ingested (gray) and used to train models (green), the models are used to make predictions (green), and the predictions are combined with other datasets or predictions for data analysis (orange). Different parts of this complex workflow may even be managed by different development teams (e.g., ML researchers, ML deployment engineers, and database admins).

The challenge is that errors in the training data at the very beginning of the workflow can easily introduce errors that are only caught in downstream or final outputs. When this happens, is there any hope to identify the erroneous training data that caused those errors?

Errors in training data can silently corrupt model predictions and lead to analysis errors that are only found when examining the downstream results.

This problem is difficult because analytics workflows differ from traditional programming. Traditional software will conveniently crash or throw exceptions when there is a bug, and there is a huge ecosystem of tools to assist with debugging. In contrast, ML is “programmed” by the training data, and bugs manifest as errors in the training data (or model specification). Instead of crashing, the model will simply mis-predict, and silently corrupt the analysis results. These corruptions may later manifest as unfair treatment of loan applications, inappropriate or missing marketing to a target user base, or incorrect trends that lead to poor business decisions.

How is training data debugged today?

A dominant approach [1] used at companies like Google is to add syntactic checks during data ingestion to check that new training data contains the expected attributes with appropriate data types, and features match expected distributions. The challenge is that these checks are at the very beginning of the workflow (above) — how does a developer know what checks to write in order to prevent errors in the downstream analyses or applications (some of which may not even exist yet)? For instance, a model built by the ML team may be used by the marketing team to analyze customer trends.
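As a rough illustration of what such ingestion-time checks look like, here is a minimal sketch in plain Python. The schema, the column names, and the `validate_batch` helper are all hypothetical and are not the API of TFX or any other library; the range check is only a crude stand-in for a real distribution check.

```python
# Hypothetical schema: column name -> (expected type, optional (min, max) range).
SCHEMA = {
    "user_id": (int, None),
    "age": (int, (0, 120)),
    "churned": (bool, None),
}

def validate_batch(records, schema=SCHEMA):
    """Return human-readable problems found in a batch of training records."""
    problems = []
    for i, rec in enumerate(records):
        for col, (typ, bounds) in schema.items():
            if col not in rec:
                problems.append(f"record {i}: missing '{col}'")
                continue
            val = rec[col]
            # bool is a subclass of int in Python, so reject it for int columns.
            if not isinstance(val, typ) or (typ is int and isinstance(val, bool)):
                problems.append(f"record {i}: '{col}' has type {type(val).__name__}")
                continue
            if bounds is not None and not (bounds[0] <= val <= bounds[1]):
                problems.append(f"record {i}: '{col}'={val} outside {bounds}")
    return problems

batch = [
    {"user_id": 1, "age": 34, "churned": False},
    {"user_id": 2, "age": 340, "churned": True},  # out-of-range age
    {"user_id": 3, "churned": False},             # missing age
]
problems = validate_batch(batch)
```

Note what this kind of check cannot see: a record whose `churned` label is simply wrong has the right attributes, types, and ranges, so it passes every check.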

The other approach is to use techniques like influence analysis [4,5] to identify training data that most influenced a given mis-prediction. However, who labels these mis-predictions? Although this may be feasible for applications that directly show predictions to users (e.g., a movie recommendation), it’s unrealistic to label mis-predictions for analytics workflows that process thousands or millions of records and predictions.
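For reference, the influence-function estimate from [4] scores a training record z against a single test point z_test as follows. The notation follows [4]: L is the loss, θ̂ the trained parameters, n the number of training records, and H the Hessian of the average training loss at θ̂.

```latex
\mathcal{I}_{\text{up,loss}}(z, z_{\text{test}})
  = -\nabla_\theta L(z_{\text{test}}, \hat\theta)^{\top}
     H_{\hat\theta}^{-1}
     \nabla_\theta L(z, \hat\theta),
\qquad
H_{\hat\theta} = \frac{1}{n} \sum_{i=1}^{n} \nabla_\theta^{2} L(z_i, \hat\theta)
```

Removing z corresponds to upweighting it by -1/n, so the loss on z_test changes by approximately -(1/n) times this quantity. The formula needs a labeled z_test, which is exactly the labeling burden described above.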

errors are defined with respect to the application... Complaint-driven debugging is a paradigm that lets developers complain about application-level errors, rather than individual prediction errors

Fundamentally, errors are defined with respect to the application (or workflow) outputs, and that’s where it is most natural to spot and label errors. Complaint-driven debugging is a paradigm that lets developers complain about errors in workflow outputs rather than individual prediction errors, and leverages the workflow structure to identify training data errors that are likely the culprit.

Rain: Complaint-driven Debugging for Query 2.0

Our paper tackles this problem in the context of “Query 2.0”: database queries that leverage ML prediction as part of the query logic. Since business data is primarily stored and analyzed in a database, it is natural to “bring the model to the data”. In fact, this class of queries is already supported and widely used in databases such as PostgreSQL [6], BigQuery [7], and SQL Server [8].

Consider an email marketing company that we worked with. The company specializes in email marketing campaigns for retail companies (their customers). Customers can define user cohorts based on user profile data as well as predicted attributes (e.g., churn likelihood). Cohort metrics are tracked in dashboards over time.

Example of Query 2.0, which computes statistics about active users that the ML model predicts will churn. The middle diagram is the analytics data flow, where the red portions correspond to ML training, and the gray portion corresponds to the analytics query. The line chart on the right plots the query results over time; the user asks why the statistic over the past two weeks has dropped as compared to her expectations.

The above shows a query that counts the number of active users that the model predicts will churn. Running the query weekly generates the line chart. The customer is surprised that the count dropped over the last two weeks, and the reasons are opaque to both the customer and the company’s engineers. It turned out that the customer had uploaded new training records that caused systematic model mis-predictions, which led to the drop.
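To make the shape of such a query concrete, here is a small sketch that uses Python's built-in `sqlite3` module and registers a model as a SQL function. The `users` table, its columns, and the threshold-based `predict_churn` function are hypothetical stand-ins for illustration; they are not the company's schema or model.

```python
import sqlite3

def predict_churn(days_since_login):
    """Hypothetical stand-in for a trained churn model (1 = will churn)."""
    return 1 if days_since_login > 14 else 0

conn = sqlite3.connect(":memory:")
# Expose the model to SQL so the query can call it like any other function.
conn.create_function("predict_churn", 1, predict_churn)

conn.execute(
    "CREATE TABLE users (id INTEGER, active INTEGER, days_since_login INTEGER)"
)
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [(1, 1, 3), (2, 1, 20), (3, 0, 40), (4, 1, 30)],
)

# Query 2.0: the model prediction is part of the WHERE clause.
(churn_count,) = conn.execute(
    "SELECT COUNT(*) FROM users "
    "WHERE active = 1 AND predict_churn(days_since_login) = 1"
).fetchone()
conn.close()
```

Here `churn_count` is the single number that would be plotted each week. A complaint in this setting is a statement about that number (for example, "this count is too low"), not about any individual prediction.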

Rain formalizes and solves this problem. Given a description of the incorrect query outputs (called complaints), as well as the training and query workflow, it identifies the training records that are most influential in causing the user complaint.

So, how does it work?

At a high level, we want to combine two steps. First, we want to propagate the user complaints back to the model predictions that, if flipped, would help fix the complaints. Second, we want to find the training records that, if removed, would cause the desired predictions to flip.

The great news is that existing work from the provenance literature [2,3] helps address the first step by encoding it as an integer linear programming (ILP) problem, and the influence functions work [4] from the ML literature addresses the second step. Unfortunately, both steps are error-prone, and naively combining them causes the errors to cascade. Ultimately, this leads to a method that has difficulty finding the training data errors.
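One plausible source of error in the first step is ambiguity: many different sets of prediction flips can satisfy the same complaint. The sketch below is not an ILP solver and is not Rain's method; it simply enumerates, for a hypothetical COUNT query, every minimal set of predictions whose flip would make the count match the complaint.

```python
from itertools import combinations

def minimal_flip_sets(predictions, target_count):
    """Enumerate minimal sets of prediction indices to flip so that
    the number of 1-predictions equals target_count."""
    gap = target_count - sum(predictions)
    if gap == 0:
        return [()]
    # Too few positives: flip 0s to 1s. Too many: flip 1s to 0s.
    flip_from = 0 if gap > 0 else 1
    candidates = [i for i, p in enumerate(predictions) if p == flip_from]
    return list(combinations(candidates, abs(gap)))

# Hypothetical predictions for six queried records; the complaint says
# the count of 1s should be 4, but it is currently 2.
predictions = [1, 0, 0, 1, 0, 0]
flip_sets = minimal_flip_sets(predictions, target_count=4)
```

Even in this tiny case there are six equally valid answers, and the number grows combinatorially with the data. If the first step picks the wrong one, the second step is asked to explain mis-predictions that were never wrong.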

Our contribution is a holistic approach that combines both steps into a single optimization problem. In addition to more accurately identifying training data errors, the optimization problem can efficiently run as a TensorFlow program and benefit from the recent deep learning hardware advances.
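The intuition behind scoring training records directly against the query result can be sketched on a toy problem. Everything in this sketch is an assumption made for illustration: the 1-D model, the data, the relaxation of COUNT into a sum of predicted probabilities, and the use of brute-force leave-one-out retraining in place of the influence-function approximation. It is not Rain's formulation or its TensorFlow implementation.

```python
import math

def train(records):
    """Fit a toy 1-D model: the decision boundary is the midpoint of the class means."""
    neg = [x for x, y in records if y == 0]
    pos = [x for x, y in records if y == 1]
    return (sum(neg) / len(neg) + sum(pos) / len(pos)) / 2

def relaxed_count(boundary, query_xs):
    """Relaxed COUNT(*) WHERE predict(x) = 1: sum probabilities instead of hard 0/1."""
    return sum(1 / (1 + math.exp(-(x - boundary))) for x in query_xs)

def rank_suspects(records, query_xs):
    """Rank training records by how much deleting each one raises the relaxed count
    (the complaint in this example is that the count is too low)."""
    base = relaxed_count(train(records), query_xs)
    scores = []
    for i in range(len(records)):
        rest = records[:i] + records[i + 1:]
        scores.append((relaxed_count(train(rest), query_xs) - base, i))
    return [i for _, i in sorted(scores, reverse=True)]

# (feature, label) pairs; the last two records are deliberately mislabeled as 0.
training = [(0.0, 0), (1.0, 0), (2.0, 0), (8.0, 1), (9.0, 1), (10.0, 1),
            (9.5, 0), (8.5, 0)]
queried = [5.0, 6.0, 7.0, 8.0]
suspects = rank_suspects(training, queried)
```

In this toy setup the two mislabeled records (indices 6 and 7) come out at the top of `suspects`, and no individual prediction ever had to be labeled. Retraining once per record does not scale, which is where an influence-style approximation becomes necessary.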

So, does it work?

We ran experiments for a range of SQL queries (SPJA queries) and a range of datasets (an entity resolution dataset, the Adult income dataset, the Enron spam dataset, and MNIST). We also evaluated cases where complaints are made for multiple queries that use the same ML model, and showed how combining the complaints more effectively pinpoints training data errors. Overall, our experiments find that leveraging query semantics is key to identifying the training data errors that affect query result errors.

A single complaint using Rain’s approach can more accurately identify training errors than individually labelling 700+ model mis-predictions

We want to highlight a particularly exciting result. We compared the efficacy of individually labelling model mis-predictions and using influence functions (Point Complaint) with Rain’s approach of labelling errors in query results (Agg Complaint). Here, we executed an aggregation query and specified a single complaint on its single output value. The following graph shows that a single complaint using Rain’s approach can more accurately identify training errors than individually labelling 700+ model mis-predictions!

A single complaint using Rain’s approach (Agg Complaint) can more accurately identify training errors than individually labelling 700+ model mis-predictions (Point Complaint).

In our paper, we motivated Rain’s complaint-driven debugging with a business analytics use case. However, it can also be applied to other settings such as model fairness. Concepts such as group fairness are easily expressed as complaints over group-by aggregation queries (e.g., count of predictions grouped by race). Making these connections is part of our ongoing work.
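As a hedged sketch of that connection, the snippet below computes a group-by aggregate over predictions and turns a gap between groups into complaints. The group labels, the data, the 0.1 tolerance, and the choice of comparing positive-prediction rates (a demographic-parity style check) are illustrative assumptions, not a description of our ongoing work.

```python
from collections import defaultdict

def positive_rate_by_group(rows):
    """rows: iterable of (group, prediction) pairs with prediction in {0, 1}.
    Mirrors a GROUP BY query that averages predictions per group."""
    totals, positives = defaultdict(int), defaultdict(int)
    for group, pred in rows:
        totals[group] += 1
        positives[group] += pred
    return {g: positives[g] / totals[g] for g in totals}

def parity_complaints(rates, tolerance=0.1):
    """Complain about each group whose rate trails the highest rate by more
    than tolerance. Each complaint is (group, observed rate, expected rate)."""
    best = max(rates.values())
    return [(g, r, best) for g, r in sorted(rates.items()) if best - r > tolerance]

# Hypothetical (group, prediction) pairs.
rows = [("A", 1), ("A", 1), ("A", 0), ("A", 1),
        ("B", 0), ("B", 1), ("B", 0), ("B", 0)]
rates = positive_rate_by_group(rows)
complaints = parity_complaints(rates)
```

Each resulting tuple reads as "the aggregate for this group should be higher", which has the same form as the business analytics complaint on the churn count.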

Final Thoughts

Machine learning is increasingly integrated into every facet of our computing systems, and data analytics is no exception. However, the universal truth in computing is that bugs happen, and programming with data means the surface area for introducing bugs has exploded. Although research that makes ML faster and cheaper is crucial for promoting adoption, research for explaining analysis results and debugging training data is crucial for understanding what we have adopted.


[1] Baylor et al. TFX: A TensorFlow-Based Production-Scale Machine Learning Platform.

[2] Meliou et al. Tiresias: the database oracle for how-to queries.

[3] Kanagal et al. Sensitivity analysis and explanations for robust query evaluation in probabilistic databases.

[4] Koh et al. Understanding Black-box Predictions via Influence Functions.

[5] Zhang et al. Training Set Debugging Using Trusted Items.

[6] Hellerstein et al. The MADlib analytics library or MAD skills, the SQL.


[8] Karanasos et al. Extending Relational Query Processing with ML Inference