It has been too long since the data management community has really gotten together in person, so we are super excited about the 2022 SIGMOD conference that’s coming up in a month. Better yet, it’s right down the street from Columbia!
The WuLab will be presenting several projects at the main conference, workshops, and demos. This post highlights what we’ll be up to, and we will follow up with detailed posts about the projects. Come say hi and find us in the hallways!
All of this work is thanks to the excellent students in the WuLab and our collaborators around the world.
Yiru Chen will be presenting PI2: End-to-end Interactive Visualization Interface Generation from Queries on Thurs 6/16, 2PM in 204A. Interactive visualization interfaces are critical in every stage of data management, but are deeply challenging to actually build: doing so requires analysis, visualization, front-end, systems, and design expertise. But what do you do if you’ve only taken intro to databases?
PI2 is the first system to generate fully interactive multi-visualization analysis interfaces simply from a few example SQL analysis queries. We show how PI2 can be integrated into SQL notebooks to aid analysis, used to generate interfaces from natural language, and used to help author complex interfaces. Our user studies also show how PI2 matches interfaces that developers might design manually, and goes beyond new commercial tools such as Hex.
Lampros Flokas will be presenting Complaint-Driven Training Data Debugging at Interactive Speeds on Tues 6/14, 4:15-4:35PM in 202B. At SIGMOD 2020, we presented Rain, a training-data debugging system for SQL queries that use ML model inference (how many users are predicted to churn?). Rain ranks the training examples that, if deleted, would most resolve user-specified errors in the SQL output (complaints). It does so by relaxing the entire training and query process into a differentiable function, and estimating its sensitivity to training examples.
Although effective, Rain is very slow because it needs to compute the Hessian (second-order derivatives with respect to the model parameters). Thus, if you use a neural network with millions of parameters, Rain could take minutes to run. Our paper shows why the full Hessian is not even needed, and in fact introduces numerical instabilities. We then propose cheap data structures that speed up data debugging for million-parameter NN models by >70,000x over Rain, from minutes to ~1 ms. Co-authors: Jiannan Wang, Weiyuan Wu, Yejia Liu (SFU); Nakul Verma (Columbia University).
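To make the complaint-driven idea concrete, here is a brute-force sketch in Python. It is not Rain's algorithm or the method in the new paper: instead of a differentiable relaxation, it simply retrains a toy model once per deleted example, which is exactly the kind of cost the real systems avoid. The 1-D nearest-centroid model, the data, and the names `train`, `churn_count`, and `rank_deletions` are all assumptions made for illustration.

```python
from statistics import mean

def train(examples):
    """Fit a toy 1-D nearest-centroid model: one mean per label."""
    by_label = {}
    for x, label in examples:
        by_label.setdefault(label, []).append(x)
    return {label: mean(xs) for label, xs in by_label.items()}

def predict(model, x):
    """Return the label whose centroid is closest to x."""
    return min(model, key=lambda label: abs(model[label] - x))

def churn_count(model, users):
    """Stand-in for: SELECT COUNT(*) FROM users WHERE predict(x) = 'churn'."""
    return sum(predict(model, x) == "churn" for x in users)

def rank_deletions(examples, users, expected_count):
    """Rank training examples by how much deleting each one reduces the complaint error."""
    base_error = abs(churn_count(train(examples), users) - expected_count)
    scores = []
    for i in range(len(examples)):
        remaining = examples[:i] + examples[i + 1:]  # hypothetical deletion of example i
        error = abs(churn_count(train(remaining), users) - expected_count)
        scores.append((base_error - error, i))
    # Largest improvement first.
    return sorted(scores, key=lambda s: -s[0])

# Hypothetical training data; the last example (index 6) is mislabeled as "churn".
examples = [
    (1.0, "stay"), (2.0, "stay"), (3.0, "stay"),
    (8.0, "churn"), (9.0, "churn"), (10.0, "churn"),
    (1.0, "churn"),
]
users = [3.5, 5.0, 5.2, 9.0]

# Complaint: the query reports too many churners; the user expects a count of 1.
ranking = rank_deletions(examples, users, expected_count=1)
print(ranking[0])  # (improvement, index) of the most helpful deletion
```

In this toy setup the mislabeled example is the only deletion that changes the query result, so it ranks first. The sketch retrains once per training example, which is hopeless for a million-parameter network and is why sensitivity-based estimates matter.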
Zachary Huang will be presenting Reptile: Aggregation-level Explanations for Hierarchical Data on Tues June 14, 5PM in 202B. Traditional data cleaning tools try to find and repair erroneous values in a dataset. However, in high-stakes settings, users don’t want automated cleaning. Instead, they want help finding regions of the data that are wrong, and they want to know how the data are wrong so they can decide what interventions are most realistic.
For instance, our collaborators in Columbia Climate School’s Financial Instruments Sector Team (FIST) visit and survey thousands of villages across Ethiopia, Zambia, and Senegal to understand how droughts affect their livelihoods, and to design national insurance policies. They visualize the survey data at the national level and cross-check it with external climate and rainfall data. However, when they spot an inconsistency, they want to incrementally “zoom in” on the specific region, then district, then village that is most responsible. They can then decide whether to re-interview those farmers, visit the villages to collect new data, or take another course of action.
Reptile is such a tool! Rather than propose fixes to individual records, it identifies subgroups whose aggregate statistics deviate from expectations in ways that affect a final query result: for instance, that New York is oversampled (count is high), or that California’s average rainfall is lower than expected. In each iteration, Reptile uses natural hierarchies in the data (geography, time, organization, etc.) to recommend the next level to zoom into, and suggests anomalous subgroups at that level. Reptile borrows ideas from multi-level modeling to accurately estimate expected statistics based on external data sources, and from factorized learning to quickly train hundreds of models. Reptile has been used by FIST to clean data and help insure millions of farmers in Africa.
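The "zoom in" loop can be illustrated with a small Python sketch. This is only a rough illustration under strong assumptions, not Reptile's implementation: the expected statistics are hard-coded rather than estimated with multi-level models from external data, the complaint is simply "the total count is too high", and the hierarchy, region names, numbers, and the functions `aggregate` and `drill_down` are all hypothetical.

```python
def aggregate(node):
    """Sum the leaf counts under a node of the hierarchy."""
    if isinstance(node, dict):
        return sum(aggregate(child) for child in node.values())
    return node

def drill_down(observed, expected):
    """At each level, follow the child whose aggregate most exceeds its expected value."""
    path = []
    while isinstance(observed, dict):
        # Excess of each child's observed aggregate over its expected aggregate.
        excess = {
            name: aggregate(observed[name]) - aggregate(expected[name])
            for name in observed
        }
        worst = max(excess, key=excess.get)
        path.append((worst, excess[worst]))
        # Zoom into the most anomalous child and repeat one level down.
        observed, expected = observed[worst], expected[worst]
    return path

# Hypothetical survey counts: region -> district -> number of responses.
observed = {
    "North": {"A": 100, "B": 110},
    "South": {"C": 95, "D": 240},
}
# Hypothetical expectations (in practice these would be estimated, not given).
expected = {
    "North": {"A": 100, "B": 100},
    "South": {"C": 100, "D": 100},
}

path = drill_down(observed, expected)
print(path)
```

Here the sketch first flags the South region and then district D within it. A real tool would let the user inspect and confirm each step instead of following the top suggestion automatically.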
Workshops and Demos
Lampros Flokas will discuss How I Stopped Worrying About Training Data Bugs and Started Complaining on Sunday 6/12, 2-2:20PM in 203B. A major challenge in modern ML is that it involves many teams (model designers, data engineers, ML ops, domain experts), each with a limited view of, and limited expertise in, the whole process. Model designers are not domain experts, and domain experts (e.g., doctors) do not design models. This means that model designers cannot assess the training data quality themselves, while domain experts only see the final results after the model has been trained, deployed, and used to produce predictions and charts. What do we do when the domain expert spots errors in these downstream results?
To this end, our lab has long advocated for a complaint-driven approach to training data debugging. Can the errors that downstream experts identify be directly used to find errors in training data? Is it possible to know how fixes to training data or the sourcing process affect downstream applications at the user level? The presentation will cover our past work and open questions towards carrying out our vision. Co-authored with Weiyuan Wu, Jiannan Wang (SFU); Nakul Verma (Columbia University).
Jeff Tao will be demoing PI2: Interactive Visualization Interface Generation for SQL analysis in Notebooks on Tues 6/14 from 11–12:30PM and 4–6PM. Visualization is a crucial part of the data analysis workflow, but notebooks, the tool of choice for authoring data analyses, have limited support for creating interactive visualizations. Further, creating visualizations is a tedious process, requiring design, programming, and analysis expertise. This work unites PI2, our system for automatically generating interactive data interfaces given SQL queries, with JupyterLab, the popular notebook platform. Our demo lets users perform SQL-based analyses in a notebook and generate interactive analysis interfaces with a single click of a button. Co-authored with Yiru Chen.
Gerardo Vitagliano, visiting us from the Hasso Plattner Institute, will demo Mondrian: Spreadsheet Layout Detection on Tues, 6/14 from 1–3PM and 4–6PM. Data preparation for spreadsheets is often a nightmare: multiple tables can be crammed into a single file, and related information can be spread across multiple files. Mondrian automates spreadsheet preparation in data lakes. It automatically identifies the data layout in each sheet, and extracts shared layouts across sheets. Data scientists can then prepare or clean extracted tables in groups. The demo lets users try Mondrian on real-world spreadsheet corpora, and can be accessed on the web! Co-authored with Lucas Reisener, Lan Jiang, Mazhar Hameed, Felix Naumann (Hasso Plattner Institute).
Student Research Competition
We also have two students who will present their research as part of the ACM SIGMOD Student Research Competition!
Both presentations are on Tues 6/14, 11-12:30PM.
Alex Yao will present Fast Provenance-powered Query Explanations. Query explanation engines are powerful tools for answering “why” questions about query results: they search for hypothetical interventions over the input data that remove unexpected output behavior. However, the performance of these systems is highly dependent on the time spent evaluating the original query over hypothetical interventions. Leveraging advances in fast fine-grained provenance capture, we create a new compilation engine for efficiently evaluating such interventions. We achieve multiple orders of magnitude speed improvements over existing solutions, and can evaluate over a million explanations per second. This speed enables explanations over larger and more complex data than previously possible.
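As a rough illustration of why provenance helps here, the Python sketch below records, for a simple GROUP BY SUM query, which input rows contributed to each output group, and then evaluates "what if we deleted the rows matching this predicate?" by touching only those rows. This is a hand-written toy, not the compilation engine described above; the table, the column names, and the functions `run_with_lineage` and `evaluate_intervention` are assumptions for the example.

```python
from collections import defaultdict

def run_with_lineage(rows):
    """Evaluate SELECT region, SUM(sales) GROUP BY region, recording input row ids per group."""
    totals = defaultdict(float)
    lineage = defaultdict(list)
    for row_id, row in enumerate(rows):
        totals[row["region"]] += row["sales"]
        lineage[row["region"]].append(row_id)
    return dict(totals), dict(lineage)

def evaluate_intervention(rows, totals, lineage, group, predicate):
    """Return the group's SUM if rows matching predicate were deleted.

    Only the rows in the group's lineage are scanned, not the whole input.
    """
    removed = sum(rows[i]["sales"] for i in lineage[group] if predicate(rows[i]))
    return totals[group] - removed

# Hypothetical input table.
rows = [
    {"region": "East", "product": "A", "sales": 10.0},
    {"region": "East", "product": "B", "sales": 500.0},
    {"region": "East", "product": "A", "sales": 12.0},
    {"region": "West", "product": "B", "sales": 11.0},
    {"region": "West", "product": "A", "sales": 9.0},
]

totals, lineage = run_with_lineage(rows)

# "Why is East so high?" Try deleting each product and see what East's SUM becomes.
candidates = {
    product: evaluate_intervention(
        rows, totals, lineage, "East", lambda r, p=product: r["product"] == p
    )
    for product in ("A", "B")
}
best = min(candidates, key=candidates.get)  # intervention that lowers the result the most
print(candidates, best)
```

Each candidate explanation is scored by reading only the complained-about group's lineage. Because the original query is never re-run over the full input, evaluating very large numbers of interventions becomes plausible.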
Sughosh Kaushik will present SQL interfaces for Smoked Duck. Smoked Duck is a project in our lab that extends DuckDB with the ability to capture and query fine-grained lineage at interactive speeds. However, it relies on a bunch of custom hacks to query the materialized lineage. Sughosh’s work shows how lineage querying can be natively integrated into DuckDB in a way that conforms to its extension APIs and leverages its already fast query execution engine. In this way, DuckDB users can query the underlying database as well as the lineage of those queries!
New Researcher Symposium
Eugene Wu will be a panelist at the SIGMOD New Researcher Symposium on Tues 6/14, 6–8PM. The panel topic is “The Dos and Don’ts of Sharing Research”. But really, come and ask the panelists any questions you wish about doing research!