Columbia@SIGMOD 2023

Published in thewulab · Jun 16, 2023

It’s another year, and SIGMOD 2023 is coming up soon in beautiful Seattle, Washington.

Columbia’s data management group, including the WuLab, is excited to present research across visualization, data structures, compilers, lineage, semantic layers, ML on CPU-GPU, and data education. Say hi to us in the hallways; we would love to chat about our work (and data in general)!

Paper Presentations

We will present OM³: An Ordered Multi-level Min-Max Representation for Interactive Progressive Visualization of Time Series on Tuesday, June 20, 2:00 pm – 3:30 pm in Evergreen A. OM3 is a progressive data structure for interactive visualizations over massive time series datasets. A major challenge is the sheer amount of data that needs to be transferred to the client and rendered. Although M4 introduces query rewriting techniques that push “pixel-perfect” aggregations down to the database to reduce network and rendering costs, it increases query processing costs during user interactions. OM3 is a hierarchical data structure that is progressive, in that any cut of the data structure is sufficient to render an approximate visualization. In this way, a few bytes are enough to immediately render a coarse-grained approximation (for when the user brushes or interacts rapidly), and the rendering quickly converges to the full-resolution visualization. This data structure complements our earlier VLDB 2020 work on Khameleon, which treats interactive interfaces as a scheduling and communications problem and uses progressive encodings of the response as a key technique. Co-authored with Yunhai Wang, Yuchun Wang, Xin Chen, Yue Zhao, Fan Zhang, Eugene Wu, Chi-Wing Fu, and Xiaohui Yu.
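To make the idea of a progressive min-max hierarchy concrete, here is a minimal Python sketch. It is a simplified illustration under our own assumptions, not the data structure from the paper: the function name `build_minmax_levels` is hypothetical, and the input length is assumed to be a power of two. Each level stores (min, max) pairs, and sending levels coarsest-first means the client can draw something after receiving only the first few values.

```python
def build_minmax_levels(values):
    """Build levels of (min, max) pairs, ordered coarsest first.

    Illustrative only: assumes len(values) is a power of two.
    """
    # Finest level: every sample is its own (min, max) bucket.
    level = [(v, v) for v in values]
    levels = [level]
    while len(level) > 1:
        # Merge adjacent buckets into one bucket covering both.
        level = [
            (min(a[0], b[0]), max(a[1], b[1]))
            for a, b in zip(level[0::2], level[1::2])
        ]
        levels.append(level)
    levels.reverse()  # coarsest first, so any prefix of levels is renderable
    return levels


series = [3, 7, 1, 4, 9, 2, 6, 5]
levels = build_minmax_levels(series)
coarse = levels[1]  # two buckets: enough for a two-column preview
full = levels[-1]   # one bucket per sample: full resolution
```

A client that has received only `levels[1]` can already draw a vertical min-to-max bar per pixel column, then refine the picture as deeper levels arrive.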

Demonstrations

Haneen Mohammed and Charlie Summers will demo SmokedDuck on Tuesday, June 20, 2:00 pm – 3:30 pm and Thursday, June 22, 10:00 am – 11:30 am. Fine-grained lineage tracks the relationships between the inputs and outputs of a query, and is particularly useful in analytical applications such as query debugging, view maintenance, query explanation, and data cleaning. Prior approaches rewrite SQL queries to also track lineage, but this can slow query execution in analytical engines that are designed to process complex queries on large datasets. Moreover, they mainly capture lineage at the logical level. SmokedDuck is the first system that extends a high-performance analytical DBMS (DuckDB) to support fast lineage capture and querying. Our key insight is identifying a duality between lineage and data movement in columnar data models. In this demonstration, we show how a user can leverage operator-level lineage to understand and debug a query with SQLTutor, an application built on top of SmokedDuck (and presented at the DataEd workshop during SIGMOD!). Attendees will be able to upload data and execute queries, then explore query-level and operator-level lineage visually to track down bugs.
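As a rough illustration of what fine-grained lineage means for a single operator, the toy Python sketch below computes a group-by sum and, while it is already touching each input row, records which row ids fed each output group. This is not SmokedDuck's implementation and uses no DuckDB API; the function `group_sum_with_lineage` and the sample data are hypothetical.

```python
from collections import defaultdict


def group_sum_with_lineage(rows, key, value):
    """Group-by SUM that also records the input row ids behind each group."""
    totals = defaultdict(int)
    lineage = defaultdict(list)
    for rid, row in enumerate(rows):
        group = row[key]
        totals[group] += row[value]
        # The operator already routes this row to its group;
        # remembering the row id at that moment is the lineage.
        lineage[group].append(rid)
    return dict(totals), dict(lineage)


orders = [
    {"region": "east", "amount": 10},
    {"region": "west", "amount": 5},
    {"region": "east", "amount": -3},
]
totals, lineage = group_sum_with_lineage(orders, "region", "amount")

# Backward lineage query: which input rows produced the "east" total?
suspects = [orders[rid] for rid in lineage["east"]]
```

When debugging, a surprising total for `"east"` can be traced back to `suspects`, where the negative amount stands out.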

HILDA Workshop

Yiru Chen and Jeffrey Tao will present DIG: The Data Interface Grammar at HILDA on June 18 from 12:00 pm – 12:20 pm. Data interfaces are visual representations of data transformations. The increasing complexity of these transformations makes it challenging to design interfaces and optimize their interactive performance. To address this, we propose the data interface grammar (DIG), a compact and analyzable representation of data transformations. DIG leverages the insight that interactions within an interactive data interface can be expressed as production rules in a grammar. By interacting with the interface, users make choices that are reflected in the grammar, leading to new data transformation queries. DIG consists of minor extensions to the popular Parsing Expression Grammar and directly maps to data interfaces. This enables novel functionality such as automatic interface generation, automatic interface backend optimization, tutorial generation, and workload generation. Co-authored with Eugene Wu.
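One way to picture "interactions as production rules" is the small Python sketch below. It is a hypothetical toy, not DIG's syntax or a Parsing Expression Grammar implementation: the `grammar` dictionary, the `expand` function, and the table and column names are all made up for illustration. Each choice rule stands in for a widget (say, a dropdown), and the user's selections determine which query the grammar produces.

```python
# Hypothetical toy grammar: a sequence rule plus two choice rules.
grammar = {
    "query": ["SELECT ", "agg", "(price) FROM sales WHERE region = ", "region"],
    "agg": {"choices": ["AVG", "SUM"]},            # could map to a dropdown
    "region": {"choices": ["'east'", "'west'"]},   # could map to radio buttons
}


def expand(symbol, picks):
    """Expand a symbol into query text using the user's picks."""
    rule = grammar.get(symbol)
    if rule is None:
        return symbol  # not a rule name, so treat it as literal text
    if isinstance(rule, dict):
        return rule["choices"][picks[symbol]]  # resolve a choice rule
    return "".join(expand(part, picks) for part in rule)  # sequence rule


# The user picks SUM in the first widget and 'east' in the second.
q = expand("query", {"agg": 1, "region": 0})
```

Because the set of choice rules is explicit, a tool could in principle enumerate every reachable query, which hints at how workload generation or backend optimization might build on such a representation.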

Zachary Huang will be presenting “Aggregation Consistency Errors in Semantic Layers and How to Avoid Them” at HILDA on June 18 from 4:10 pm – 4:30 pm. The presentation addresses a common problem: analysts find it challenging to analyze data across multiple tables because of complications arising from joins and aggregations. Data engineers usually mitigate this by predefining “semantic layers,” yet this can lead to “aggregation consistency issues,” such as inflated total revenue due to double counting from join fanouts. We explain why this cannot be solved “offline” and why existing BI solutions rely on heuristics that can be incorrect or misleading. We show how adopting a formalism called semi-ring aggregation, along with weighting join relationships, generalizes heuristic solutions used in market attribution and order management, and addresses aggregation consistency issues. We then propose a human-in-the-loop framework that allows users to test different weighting strategies and visualize the outcomes. Co-authored with Pavan Kalyan Damalapati and Eugene Wu.
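The double-counting problem is easy to reproduce. The Python sketch below uses made-up tables (`orders`, `shipments`) and shows a naive sum over a join inflating revenue, followed by one possible weighting that splits each order's revenue evenly across its join matches. The even split is only an assumption chosen for illustration; it is one strategy among many and not necessarily the one the paper recommends.

```python
from collections import Counter
from fractions import Fraction

orders = {1: 100, 2: 50}                     # order_id -> revenue
shipments = [(1, "A"), (1, "B"), (2, "C")]   # (order_id, shipment_id)

# Join orders with shipments: order 1 fans out into two rows.
joined = [(oid, ship, orders[oid]) for oid, ship in shipments]

# Naive aggregation over the join counts order 1's revenue twice.
naive_total = sum(rev for _, _, rev in joined)

# Assumed strategy: weight each joined row by 1 / fanout of its order.
fanout = Counter(oid for oid, _ in shipments)
weighted_total = sum(
    rev * Fraction(1, fanout[oid]) for oid, _, rev in joined
)
```

Here `naive_total` is 250 while the true revenue is 150; the weighted sum recovers 150. Other weightings (for example, by shipment size) would distribute revenue differently per shipment while keeping the same total.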

DaMoN Workshop

Junyoung Kim will be presenting “AMULET: Adaptive Matrix-Multiplication-Like Tasks” at DaMoN on June 19 from 5:20 pm – 5:30 pm. Many useful data science tasks can be represented as variants of matrix multiplication. However, executing these tasks is challenging: compilers cannot produce fast machine code for them when they are written as nested loops, and hand-tuned linear algebra libraries are not flexible enough to represent variations of matrix multiplication. Matrix multiplication is also commonly used in machine learning, but because of these difficulties, only standard matrix multiplication tends to be used there. Fast performance for variants of matrix multiplication could lead to the discovery of new machine learning models and more useful tasks. To support this goal, we introduce Amulet, a compiler framework that uses compiler optimization techniques and database-style adaptive execution to get fast performance on variants of matrix multiplication. We show through experiments that Amulet achieves speedups compared to existing compilers on a variety of matrix-multiplication-like tasks that are not supported by linear algebra libraries. Co-authored with Kenneth Ross.
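For readers wondering what a "variant of matrix multiplication" looks like, here is a small Python sketch. It is not Amulet's API; the helper `matmul_like` and its parameters are hypothetical. It keeps the familiar triple loop but lets the caller swap the multiply and add steps for other operations, such as the min-plus product used in shortest-path computations.

```python
def matmul_like(A, B, combine, reduce, init):
    """Triple-loop 'matrix multiplication' with pluggable operators.

    combine replaces the elementwise multiply, reduce replaces the sum,
    and init is the starting value of each accumulator.
    """
    n, k, m = len(A), len(B), len(B[0])
    C = [[init] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            acc = init
            for p in range(k):
                acc = reduce(acc, combine(A[i][p], B[p][j]))
            C[i][j] = acc
    return C


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]

# Standard product: multiply, then sum.
standard = matmul_like(A, B, lambda x, y: x * y, lambda a, b: a + b, 0)

# Min-plus variant: add, then take the minimum.
min_plus = matmul_like(A, B, lambda x, y: x + y, min, float("inf"))
```

This plain nested-loop form is flexible but slow, which is exactly the gap described above: a linear algebra library would run the standard product quickly, but typically offers no routine for the min-plus version.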

Zachary Huang will be presenting “Random Forests over normalized data in CPU-GPU DBMSes” at DaMoN on June 19 from 12:00 pm – 12:10 pm. Random forests are widely used in machine learning, and high-performance libraries like LightGBM and XGBoost typically employ GPUs for better results. Nevertheless, these libraries are limited when training on normalized datasets stored in DBMSes: GPU memory is limited, and joins can create a much larger table that easily exceeds the available memory. To address this, we translate training to join-aggregation queries and apply aggregation pushdown before joins; aggregation pushdown avoids the creation of large intermediate joins, making it ideal for GPU processing. We investigate different data placement and query execution strategies and find that the unique properties of training ML models using aggregation pushdown necessitate different design decisions. We show that it is possible not only to train random forest models on an off-the-shelf CPU DBMS (DuckDB) and GPU dataframe library (cuDF), but also to outperform the leading LightGBM ML library by 1.5× on SSB SF=10. Co-authored with Pavan Kalyan Damalapati, Rathijit Sen, and Eugene Wu.
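The effect of aggregation pushdown can be seen in a few lines of Python. This is a toy sketch with made-up tables (`sales`, `stores`) and hypothetical function names, not the system's queries: it computes per-feature counts and sums of a target value, the kind of statistics a tree learner could use to score a split, once by aggregating over the joined rows and once by aggregating on the join key first.

```python
from collections import defaultdict

sales = [(1, 4.0), (1, 6.0), (2, 1.0), (3, 3.0), (3, 5.0)]  # (store_id, y)
stores = {1: "urban", 2: "rural", 3: "urban"}  # store_id -> feature value


def agg_after_join(sales, stores):
    """Join every fact row to its dimension row, then aggregate."""
    out = defaultdict(lambda: [0, 0.0])
    for sid, y in sales:
        feat = stores[sid]  # one joined row per sales row
        out[feat][0] += 1
        out[feat][1] += y
    return {k: tuple(v) for k, v in out.items()}


def agg_pushdown(sales, stores):
    """Aggregate by join key first, then join the small partial result."""
    per_store = defaultdict(lambda: [0, 0.0])
    for sid, y in sales:
        per_store[sid][0] += 1
        per_store[sid][1] += y
    out = defaultdict(lambda: [0, 0.0])
    for sid, (cnt, total) in per_store.items():  # one row per store
        feat = stores[sid]
        out[feat][0] += cnt
        out[feat][1] += total
    return {k: tuple(v) for k, v in out.items()}
```

Both functions return the same (count, sum) per feature value, but the pushdown version only ever joins one partial row per store, so the intermediate result stays small no matter how many sales rows there are.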

DataEd Workshop

Sean Kross will be presenting Teaching Data Science by Visualizing Data Table Transformations: Pandas Tutor for Python, Tidy Data Tutor for R, and SQL Tutor on June 23 in the DataEd 3:30 pm session. The past two decades have witnessed a sea change in the data ecosystem: more people than ever demand tools to transform and analyze data. In response, there’s a growing ecosystem of data tools that span programming (Python, R) and query (SQL) languages. This diversity in users and programming tools makes learning new data tools difficult: a single line of code could perform multiple data transformation steps, but require different syntax and potentially even different semantics across the tools. Along with collaborators at UCSD, we have developed a JavaScript library that visualizes step-by-step data transformations and the relationships between an operation’s input and output tables. On top of it, we also developed the *Tutor suite of PandasTutor, RTutor, and SQLTutor for the most popular libraries and query languages. Co-authored with Sam Lau, Philip Guo, and Eugene Wu.