The 2018 Very Large Databases (VLDB) conference is in just a few days. Fotis Psallidas will be representing the WuLab at the conference and presenting his research paper. We also are honored to receive a Test-of-Time award! This award is given to the VLDB paper published ten years earlier that has had the most influence since its publication.
This blog post introduces both research projects, and we hope you will read the full papers. If you are attending VLDB, find Fotis in the halls and say hi!
Smoke: Fine-grained Lineage at Interactive Speed
Fotis will present an incredibly fast query engine called Smoke that supports tracking and querying lineage. What I love about this research is that it turns the prevailing expectation about lineage tracking on its head. Contrary to popular belief, Smoke shows that tracking data lineage is not simply query processing overhead that should be avoided. Instead, by tracking and indexing lineage very efficiently, it can be used to speed up future queries quickly enough to power highly interactive applications! Below, I describe the research highlights.
Background: Fine-grained lineage is metadata that tracks the dependency information between input and output records of a query, and has a wide range of applications such as debugging, explanation, auditing, security, and more. In fact, our recent HILDA paper outlines interesting connections between lineage and interactive applications. However, the challenge is that actually tracking lineage for individual records can be incredibly expensive — especially for query engines that are already very fast.
Modern query engines can run SQL analytical queries so quickly that they are often bottlenecked by how fast hardware is able to read and process data. But tracking would require writing a huge number of pointers to track the relationship between each output record and its corresponding input record for each operator in the query. This can easily cripple query performance. On our micro-benchmarks, traditional approaches to track lineage can slow down a single GROUPBY operator by over 250x! In contrast, Smoke reduces the slow down to as low as 0.3x — an >800x reduction.
The main contribution is a set of design principles when creating a query engine to support lineage tracking and querying. These principles include tight integration of lineage tracking logic within the physical query plan using query compilation, very efficient lineage index representations that are essentially integer arrays, and to leverage data structures that are already constructed during query execution to also track and index lineage data — because the best work is work that someone else does!
Exciting Results: One of our exciting results is that Smoke can track and query lineage so quickly that it can be used to power highly interactive use cases such as visualization and data profiling applications. We show howusing Smoke to run these applications can be as fast or faster than state-of-the-art, hand-optimized implementations!
For instance, we implemented the following cross-filter visualization over the US flight delays dataset and compared it to a baseline approach called LAZY, which does not perform any optimization and simply re-runs the relevant SQL queries to re-compute and re-render the views upon each interaction. We also compare with visualization optimization technique based on Data Cubes, which are data structures that precompute interaction results before loading the visualization. BT is running Smoke to generate a backward index from output to input records, while BT+FT uses Smoke to also create a forward index from input to output records. We simulated every possible interaction in the visualization and measured the total amount of time it took the system to respond to each interaction.
The plot on the right shows cumulative time in log scale. LAZY is orders of magnitude slower than all other approaches — 230x slower than Smoke. The purple Data Cube line takes ~100 seconds to precompute the data structure, but then answers each interaction instantaneously. In contrast, it turns out that the backward and forward indexes are far cheaper to create than the data cubes but can be used to derive data cubes on the fly. This is why the BT+FT interaction latency is nearly as fast as state-of-the-art visualization optimization techniques!
WebTables: Exploring the Power of Tables on the Web
We are honored to receive the 10-year Test-of-Time award for WebTables: exploring the power of tables on the web. This research performed the first large-scale study of tabular data on the web, by extracting table-like relations from Google’s 2008 index of the web. We extracted >14 billion HTML tables, and from there, identified >150 million relation-like tables. This was orders of magnitude larger than the biggest corpus of tables in existence at the time!
At one point during the project, the table parsing and extraction pipeline kept crashing due to out-of-memory errors and I couldn’t figure out why. During debugging, the image below kept turning up and I was confused why. It turns out that the image is actually an HTML Table, where each table cell is a pixel. The internet is insane.
The following decade has seen WebTables extended to support other data types, to extract high-quality tables, and to support a rich set of table-search and table-completion applications. WebTables have also been integrated into many Microsoft and Google products and services. A few examples include table-based answers to web search queries, and a synonym API that many websites use to enrich their own search functionalities. We have written a 10-year retrospective that looks back on the progress since 2008.