Data Council 2024: The future data stack is composable, and other hot takes

Chase Roberts
Vertex Ventures US
Apr 2, 2024

I wrote the synopsis below for the Vertex Ventures investment team as a debrief to last week’s Data Council conference. The summary wasn’t originally intended to be published as a blog post, but one of my partners suggested I post it publicly. So, here it is — entirely unedited. Special thanks to the presenters, including Michelle Ufford Winters, Ryan Dolley, Jan Soubusta, Wes McKinney, Caitlin Colgrove, George Fraser, Leah Weiss, Amit Sangani, Kyle Weller, and the countless practitioners who shared their experiences working with data.

Data lake(houses) aren’t hot yet but might be soon. The definition of a data lake shifted from “store all of your structured and unstructured data in its native format” to “data in S3 stored with a common data structure” — hence, the term lakehouse (presenters used these terms interchangeably). The appeal is storing data at low cost in object storage and relying on open table formats (Hudi, Iceberg, Delta) for ACID guarantees, schema evolution, updates & deletes (merge/upsert), and metadata abstractions.
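
To make those guarantees concrete, here’s a minimal sketch using the deltalake (delta-rs) Python bindings: write a table, then upsert into it in a single atomic commit. The path, schema, and values are hypothetical; a real lakehouse would point at an s3:// URI rather than local disk.

    import pyarrow as pa
    from deltalake import DeltaTable, write_deltalake

    # Hypothetical local path; swap in an s3:// URI for a real lakehouse.
    path = "/tmp/sales_delta"

    initial = pa.table({"order_id": [1, 2], "amount": [100.0, 250.0]})
    write_deltalake(path, initial, mode="overwrite")  # ACID write, version 0

    updates = pa.table({"order_id": [2, 3], "amount": [300.0, 75.0]})

    # Merge (upsert): update matching rows, insert the rest, one atomic commit.
    (
        DeltaTable(path)
        .merge(
            source=updates,
            predicate="t.order_id = s.order_id",
            source_alias="s",
            target_alias="t",
        )
        .when_matched_update_all()
        .when_not_matched_insert_all()
        .execute()
    )

The resulting table stays readable by any engine that speaks the Delta format, which is the decoupling the lakehouse pitch rests on.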

This observation corresponds to the second takeaway: open standards have reached a support level such that the composable data stack is increasingly feasible. By “support,” I mean engines, languages, and other components of database systems now integrate these standards: Apache Arrow, Substrait, Ibis, open table formats, and more. I created a diagram with more detail.

Data lakes and the open data stack (ODS) go hand-in-hand. You can store your data once (S3 + OSS table format) and choose the db engine that makes sense for your workload. Here’s what one of the Apache Arrow creators proposed at the conference:

  • ≤ 1TB — DuckDB, Snowflake, DataFusion, Athena, Trino, Presto, etc.
  • 1–10TB — Spark, Dask, Ray, etc.
  • > 10TB — hardware-accelerated processing (e.g., Theseus).

The front-end languages are portable, too (Ibis, Malloy, PRQL, or a standard transpilable SQL). But what are the benefits of an ODS? Cheap storage, hardware acceleration, modularity with reusable components, simpler management, and a better user experience.
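
As a rough illustration of that portability, here’s an Ibis sketch where the query is written once and the engine is just a connection object. The file and column names are invented:

    import ibis

    # DuckDB today; in principle, swap the connection for another supported
    # backend (Snowflake, Spark, etc.) without rewriting the query.
    con = ibis.duckdb.connect()

    sales = con.read_parquet("sales.parquet")  # hypothetical file

    by_region = (
        sales.group_by("region")
        .aggregate(total=sales.amount.sum())
        .order_by(ibis.desc("total"))
    )

    print(by_region.to_pandas())  # execution happens inside the chosen engine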

This quote from a paper discussed at the conference nicely summarizes the advantages:

By clearly outlining APIs and encapsulating responsibilities, data management software could more easily be adapted, for example, to leverage novel devices and accelerators as the underlying hardware evolves. By relying on a modular stack that reuses [the] execution engine and language frontend, data systems code could provide a more consistent experience and semantics to users, from transactional to analytic systems, from stream processing to machine learning workloads.

Imagine a scenario where a data engineer develops a batch processing job in Apache Spark using Scala to extract sales data from Databricks periodically. This data is cleaned, transformed, and enriched with additional customer demographic information pulled from a PostgreSQL database. Once processed, the enriched sales data is loaded into Snowflake for complex querying and reporting purposes. In parallel, a data scientist leverages this consolidated data in Snowflake to build a predictive model for sales forecasting using Python and scikit-learn. The predictions are then exported to a MongoDB database to serve real-time sales forecast data to a dashboard application, enhancing decision-making processes for sales managers. Such a workflow is not easily integrated: it involves coordination across multiple systems and teams. A composable system aims to simplify this.
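
Here’s a rough sketch of what the composable version of one hop in that workflow could look like, with Arrow as the interchange format between the query engine and the ML library. The table and columns are invented for illustration:

    import duckdb
    from sklearn.linear_model import LinearRegression

    con = duckdb.connect()

    # Hypothetical enriched sales data; in an ODS this would live once,
    # in object storage, under an open table format.
    con.execute("""
        CREATE TABLE sales AS
        SELECT * FROM (VALUES (25, 3, 120.0), (40, 5, 310.0), (33, 2, 150.0))
            AS t(customer_age, prior_orders, amount)
    """)

    # The engine hands back Arrow; scikit-learn consumes it via pandas.
    # No CSV exports or per-system drivers in between.
    tbl = con.execute("SELECT * FROM sales").arrow()
    df = tbl.to_pandas()

    model = LinearRegression().fit(df[["customer_age", "prior_orders"]], df["amount"])
    print(model.predict([[30, 4]]))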

Perhaps the most compelling feature of the ODS is it counteracts data silos. Data gravity is real: migration is expensive, retraining workforces is difficult, and rewriting data workloads to new interfaces requires considerable resources. So, buyers age with their vendors as these vendors’ database systems become archaic. This learning was my third takeaway: VCs are 5–10 years ahead of practitioners. I spoke to multiple data people stuck in legacy systems and still inching their way to the cloud. VCs have moved on from data catalogs, yet practitioners told me they look forward to solving data discovery.

So it’ll be easy to adopt the ODS, right? Not entirely — there are a few barriers (the first two bullets are quoted from Wes McKinney’s presentation):

  • Knowing which engine to use is non-obvious (Magpie from MSFT helps).
  • SQL dialects are non-portable, and vendors aren’t incentivized to fix this (SQLGlot, Coral, and Substrait help; see the sketch after this list).
  • Complexity! Assembling these systems is the purview of data architects and sophisticated DevOps teams.
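
On the second bullet, tools like SQLGlot attack dialect lock-in by parsing SQL into a dialect-neutral tree and re-rendering it for another engine. A minimal sketch with an invented query:

    import sqlglot

    # Snowflake-flavored SQL: DATEADD and ILIKE aren't universal.
    query = "SELECT DATEADD(day, 7, order_date) AS due FROM orders WHERE name ILIKE '%acme%'"

    # Rewrite it for DuckDB.
    print(sqlglot.transpile(query, read="snowflake", write="duckdb")[0])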

Other conference takeaways:

  • The MDS was MIA. The modern data stack was more likely the product of VC hype than a set of technologies that will support large, independent companies. Snowflake and Databricks have consolidated OLAP wallet share, offering full-stack solutions and leaving scraps for hopeful MDS vendors. Despite the MDS exhaustion among early adopters, multiple practitioners still lamented data discovery and quality challenges.
  • AI applications aren’t yet a reality. While there were a handful of presentations about AI and LLMs, the adoption among practitioners was relatively muted. Presenters forecasted LLMs’ role in simplifying discovery, exploration, movement, preparation, analysis, visualization, modeling, maintenance, and operations. Yet customer anecdotes about LLMs applied to data infrastructure were absent.

Where should we invest?

If we believe the ODS will be the next major design pattern, we’ll need better abstractions for assembling the components and managing these systems in a distributed context. This abstraction might look something like a control plane.

As I mentioned earlier, Snowflake and Databricks crowd out budgets for ancillary tools like data catalogs, quality management, observability, and lineage tools. On the other hand, the problems these tools solve persist among the practitioners I spoke with. I suspect the unsolved problems aren’t due to missing product-market fit but slow transitions to modern database architectures and broken organizational processes (e.g., data producers not coordinating with data consumers). Data access control came up a few times, but I’m unsure if solving this problem will support a venture-scale outcome.

Finally, the intersection of LLMs and data infrastructure tools is still in its infancy. I bet LLMs produce many new features in existing tools — like using natural language for analytical queries or transformations — rather than new investable data infrastructure product categories.

For more insights and updates from Vertex Ventures US, sign up for our weekly newsletter here.
