Opportunities in the data stack

Sivesh Sukumar
Balderton
Jun 10, 2022

Introduction to the data stack

The data stack has its roots in the 1970s with the invention of ETL, which allowed businesses to manage their data in a structured manner. The stack then evolved in a natural but unstructured way, with engineers building a mesh of point-to-point data pipelines wherever they were required. In 2012, the launch of Amazon Redshift and other “modern data warehouses” leveraging Massively Parallel Processing (MPP) brought some structure back to the data stack, as they made it cost-effective to store and process all of a business’s data (structured and unstructured) within the warehouse.

Since then, a huge array of tools has been built around this central hub, leading to what most people know today as the “modern data stack”. There are many definitions of what the modern data stack actually entails, but some common traits are:

  • It’s cloud-based
  • It’s modular and customisable
  • It’s best-of-breed first (choosing the best tool for a specific job, versus an all-in-one solution)
  • It runs on SQL (at least for now)

I believe it represents a core shift towards business-orientated data teams, who can now leverage a range of modular data tools (usually built around a central data warehouse, such as Snowflake) to build data workflows that power analytical business decisions.

A brief history of data infrastructure

It’s worth noting that there is nothing particularly modern about the layers in the stack people use today. Ingestion, storage, transformation, BI and unification layers have always been integral to data infrastructure. What’s changed is the underlying technology (e.g. the adoption of cloud) and the fact that demand for businesses to be data-driven now outpaces the supply of talent. Crucially, data teams can now command the respect they deserve because they can prove their ROI, leading to a huge increase in the investment flowing into the space.

A [simple] data stack

Opportunities

With technology consistently evolving, there are always going to be opportunities across the stack, but there are a few clear trends that we’re excited about at Balderton. If you’re building in the space, we’d love to hear from you!

1. Real-time / event streaming

There’s no debate that data loses value as it ages. A dashboard refreshed every minute is always going to be more valuable than one refreshed every week. A recommendation based on your activity over the past hour is better than one based on what you were doing this time last week.

There’s clear value in real-time data, but it has historically been more effort than it’s worth to build real-time products. Kafka is an open-source event streaming platform (created by some brilliant engineers at LinkedIn) built with a few key design principles: a simple API for both producers and consumers, a design for high throughput, and a scaled-out architecture from the beginning. The value here isn’t just in real-time data: the event-based technology also simplifies data workflows, as there’s less reliance on databases and you can process your data as and when it comes in.
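To give a feel for that simplicity, here’s a minimal sketch using the confluent-kafka Python client. The broker address and topic name are placeholder assumptions, not anything prescribed by Kafka itself:

```python
# A minimal sketch of Kafka's producer/consumer API, using the
# confluent-kafka Python client. The broker address and topic name
# ("localhost:9092", "user-events") are illustrative assumptions.
import json
from confluent_kafka import Producer, Consumer

# Producer: publish events to a topic with a simple, fire-and-forget API.
producer = Producer({"bootstrap.servers": "localhost:9092"})
producer.produce(
    "user-events",
    key="user-123",
    value=json.dumps({"action": "page_view", "page": "/pricing"}),
)
producer.flush()  # block until all queued messages are delivered

# Consumer: pull events as they arrive, no intermediate database required.
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "analytics",          # consumers in a group share partitions
    "auto.offset.reset": "earliest",  # start from the beginning if no offset
})
consumer.subscribe(["user-events"])
while True:
    msg = consumer.poll(timeout=1.0)
    if msg is None:
        continue
    if msg.error():
        print(msg.error())
        continue
    event = json.loads(msg.value())
    print(f"{msg.key().decode()}: {event}")  # process each event as it comes in
```

The same pattern scales out: add partitions to the topic and more consumers to the group, and throughput grows without changing the application code.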

Confluent is seeing great growth alongside Kafka

However, despite Confluent (the company founded by Kafka’s creators) seeing incredible growth (71% at a $500mm run-rate!), there is still much to be done before the masses start building in real-time. Redpanda is one alternative, but it’s still not trivial to adopt. There’s a colossal opportunity in making this great technology more accessible, whether that’s building serviceable data products (Popsink) or high-performance ML models (Quix).

2. The last mile of analytics

The last mile is not only about making data more accessible but also about making it more actionable. The technology underlying Snowflake and the majority of the data stack is state of the art; however, very few businesses are able to extract maximum value from it.

There is a huge range of tools aiming to solve this, such as reverse ETL, data workspaces and spreadsheet-based BI. We believe that a robust semantic layer is vital for solving this problem. Looker made a great effort with LookML, which led to it becoming the go-to BI tool for the modern data stack, but it’s clear that this is unlikely to last following its acquisition by Google.
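To illustrate the idea behind a semantic layer, here’s a minimal, hypothetical sketch in Python: a metric is defined once, centrally, and every downstream tool compiles it to the same SQL. The metric definitions and the compile function are invented for illustration, not any real product’s API:

```python
# A hypothetical sketch of a semantic layer: metrics are defined once
# and compiled to SQL on demand, so every BI tool, spreadsheet or
# reverse-ETL job shares one definition of "revenue".
from dataclasses import dataclass, field

@dataclass
class Metric:
    name: str
    table: str
    expression: str               # aggregation over a column
    filters: list = field(default_factory=list)

METRICS = {
    "revenue": Metric(
        name="revenue",
        table="analytics.orders",
        expression="SUM(amount)",
        filters=["status = 'complete'"],
    ),
}

def compile_metric(metric_name: str, group_by: str) -> str:
    """Compile a named metric into warehouse SQL."""
    m = METRICS[metric_name]
    where = " AND ".join(m.filters) or "TRUE"
    return (
        f"SELECT {group_by}, {m.expression} AS {m.name}\n"
        f"FROM {m.table}\n"
        f"WHERE {where}\n"
        f"GROUP BY {group_by}"
    )

print(compile_metric("revenue", group_by="order_month"))
```

This is essentially what LookML did for Looker: because the definition lives in one place, a change to what “revenue” means propagates to every dashboard and export automatically.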

3. The open-source data stack

Open-source is becoming a common feature among data winners. The benefits include accelerated enterprise adoption (there’s no need for a procurement process to get started), accelerated innovation with a roadmap driven by the community, and the flexibility that the open nature allows. It’s also a plus that enterprises are able to access the code: if something goes wrong, they know they can fix it themselves if they need to. Great data businesses are often a function of integrations, due to the huge permutations of stacks available, and open-source can be a great way to maintain a steady output of integrations and keep a platform ahead of the pack.

As an investor, it can feel a bit backwards when a company builds great software and then gives it away for free. However, if it’s done right, OSS business models can be some of the best. Balderton was an early investor in MySQL, one of the first examples of this, and we continue to invest in OSS.

CNET article from 2008

4. Data operations

DataOps is by no means a new priority, but it’s clear that it’s the missing piece in the modern data stack. The pace of development in data engineering has led to several quality issues, and with larger organisations now implementing these tools, those issues are no longer acceptable.

Three buckets that we’re interested in are Orchestration (including lineage), Access (including catalogues/metrics stores) and Observability. Observability is particularly interesting as it’s relatively new and hasn’t really been defined. I see it revolving around the key concept of dynamic data monitoring, which shouldn’t be confused with static data testing. Data testing (e.g. Great Expectations) is all about defining clear static rules about what you expect to see, e.g. “does my numerical data contain any letters in it?”. Data monitoring is more about tracking how your data changes over time and creating alerts if something unusual occurs.

Advancements in metadata analytics now make it possible to automate a huge amount of data monitoring, leading to several businesses being built in the space, including Monte Carlo. Today, none of these platforms is perfect; one big issue is “alert fatigue”, where businesses are overwhelmed with alerts and have no easy way of managing or prioritising them. Given that integrations are a key success factor for data platforms, and this is amplified in DataOps, it’s important to have full-stack connectivity so a business can really understand the downstream impact of any error.
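To make the distinction concrete, here’s a minimal sketch in plain pandas. In practice the static rule would live in a tool like Great Expectations and the monitor in an observability platform; the column names, history window and 3-sigma threshold below are illustrative assumptions:

```python
# A minimal sketch contrasting static data testing with dynamic data
# monitoring. Column names, the history window and the 3-sigma
# threshold are illustrative assumptions, not a real platform's API.
import pandas as pd

# --- Static testing: a fixed rule that either passes or fails -------
# The rule from the text: "does my numerical data contain any letters?"
raw = pd.DataFrame({"amount": ["19.99", "42.00", "3O.50"]})  # note the letter O
has_letters = raw["amount"].str.contains(r"[A-Za-z]")
print("static test passed:", not has_letters.any())  # False: "3O.50" fails

# --- Dynamic monitoring: learn what "normal" looks like over time ---
def row_count_anomaly(history: pd.Series, today: int, sigmas: float = 3.0) -> bool:
    """Alert if today's row count drifts more than `sigmas` standard
    deviations from its recent history. No static rule is written down;
    the definition of "unusual" comes from the data itself."""
    return abs(today - history.mean()) > sigmas * history.std()

daily_rows = pd.Series([10_120, 9_980, 10_305, 10_050, 9_870])  # last 5 days
print("alert:", row_count_anomaly(daily_rows, today=4_200))     # True
```

The static test will never catch the second failure mode (the table silently shrinking), and the monitor will never catch the first; a mature observability setup needs both, plus the lineage to know which dashboards sit downstream of the broken table.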

Despite high demand, there is still a lot of room for improvement in data observability solutions

Conclusions

Data management is clearly a priority for all businesses, and we’ve seen time and time again how it can lead to market dominance. The modern data stack has driven us forward, but with technology constantly evolving, there are always going to be opportunities to innovate.

It’s important not only to build workflows that make data more accessible but also to focus on making it more actionable, whether that be by increasing trust in the data or building in real-time.

Balderton has its roots in the data stack going back to Business Objects, which was founded by Bernard Liautaud, our Managing Partner. Additionally, Balderton was an early investor in a huge array of successful data businesses including MySQL, Talend, Toucan Toco, Matillion, Kili Technologies and Funnel. We’re always trying to keep our finger on the pulse of data innovation, so if you’re building in the space, we’d love to hear from you!
