What’s Next in the Modern Data Stack?

Laura Waldenstrom
Inkef

--

A lot has been written about “The Data Space” and the related VC hype, and the insatiable appetite to deploy capital in this highly sought-after sector has not gone unnoticed either. At Inkef, we can only echo the excitement for this new breed of companies and the dawn of what has been dubbed the “decade of data”.

It is our firm belief that every modern company, from SMEs to enterprises, needs to become data-driven to stay competitive, since dependence on data to carry out daily operations has increased dramatically in recent years. Deriving insights from data is directly linked to a company’s ability to drive product innovation, understand its customers, optimize financial management and navigate the competitive landscape. Despite the accelerated innovation that has taken place in the data world during the last four years alone, and the huge amount of venture capital invested, we still believe we are early in the adoption curve and that there is ample room for innovation and the emergence of new category leaders. In this blog post we outline some of the important trends and a couple of the subcategories that we are most excited about at the moment, and why.

The modern era of data infrastructure

Two decades ago, at the beginning of the “Big Data Era”, enterprise data was moved between silos through fragile and error-prone data pipelines. These highly complex pipelines could take months of development time to build and maintain. Since then, data architecture standards have undergone a remarkable transformation: from monolithic systems and expensive storage to a modular, cloud-based architecture where data is treated as code.

In the last couple of years, data infrastructure entered the modern era, thanks to the adoption of cloud data warehouses like Snowflake. Since the beginning of 2020 the term “the modern data stack” has gone mainstream. It refers to the new, best-of-breed data architecture, which is centred around a cloud data warehouse (or cloud data lake) and built to cope with massive amounts of data. It is modular, runs on SQL and consists of a suite of tools for every part of the data workflow.

The Modern Data Stack. Source: Inkef

From ‘making data useful’ to ‘making use of data’

At the dawn of the modern data era, (investor) attention shifted from the last mile of the data life cycle, where tons of venture capital had been invested in BI systems and analytics dashboards over the previous decade, to data and AI infrastructure tools sitting deeper in the stack. As first-mile data issues became evident to the world outside the data teams, VCs jumped on every opportunity to invest in tools solving “billion dollar problems” such as data integration, transformation, storage, monitoring and pipeline orchestration.

After conversations with a number of data leaders in enterprises and scale-ups, our hypothesis is that there now exists a next-gen, best-of-breed tool in each category of the modern data stack, as depicted above. Tools like dbt, Snowflake and Fivetran are deployed and utilized by the entire data community (at least for the time being). And some of the categories which were frequently cited as “underserved” early in 2020, such as data observability, quickly went from being empty to relatively crowded, with start-ups such as Monte Carlo, Databand, Datafold, Metaplane, and Acceldata, collectively raising hundreds of millions of dollars in less than two years.

In other words, with a selection of tools from the modern data stack, and the right talent in place, a company is well-equipped to make its data useful. However, useful data doesn’t generate value unless it is actually being used to run the business. And this is still a massive problem. In fact, more than 70% of enterprises are still behind in their ability to create value from data and 87% of all data projects never even make it to production.

Roadblocks to creating value from data

There are a number of reasons why data is still not being (optimally) used to create value in companies. Two notable ones driving our investment thesis in the data space are (1) the disconnection between data teams and data consumers, and (2) immature data governance.

1. Disconnection between data teams and data consumers

A common setup in organisations is one where data products are produced and maintained by data teams, who also own the data workflow and infrastructure. This setup has several pitfalls. To start with, data products that don’t involve data consumers (often sitting in the business teams) in the design phase tend to see lower usage and adoption, for example due to a lack of trust in the data or its perceived value. Moreover, as business users have to continuously send requests to the data team whenever a figure seems off or a dashboard needs to be tweaked, a high burden is placed on the data team, whose time is already scarce and expensive. Another problem is that data consumers often don’t even know what they need, or simply lack an understanding of what data is available (and, as such, what information could be derived from it).

We believe that bridging the gap between data and business teams is crucial to becoming truly data-driven. The more companies include data roles (data engineers, scientists, analysts) as partners or integral parts of the business teams, the more value will be created. On the back of this notion, the data mesh concept, which was on everyone’s lips last year, has risen in popularity. A data mesh isn’t a tool or a service, but a design concept that treats data as a product. As opposed to storing data centrally and having dedicated data teams assigned to business projects, serving data consumers with processed and transformed data, a data mesh is built around decentralized, domain-specific ownership of data: each business domain is responsible for hosting, preparing and serving its own data as a product.
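To make this more concrete, here is a minimal, purely illustrative SQL sketch of domain-oriented ownership: one business domain owns its schema, curates a view as a “data product” and grants other domains governed access. The schema, view and role names are our own examples, and a real data mesh of course involves far more (contracts, platform tooling, discoverability) than a few grants.

```sql
-- Illustrative sketch: the Marketing domain owns its own schema and
-- publishes a curated view as a "data product" for the rest of the company.
CREATE SCHEMA marketing;

CREATE VIEW marketing.campaign_performance AS
SELECT
    campaign_id,
    SUM(spend)       AS total_spend,
    SUM(conversions) AS total_conversions
FROM marketing.raw_campaign_events
GROUP BY campaign_id;

-- Consumers in other domains get explicit, governed access to the product,
-- not to the raw data behind it.
GRANT SELECT ON marketing.campaign_performance TO finance_analyst;
```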

Data democratisation is, in our view, another key enabler of creating more value from data. Democratizing access to data, by giving more individuals access to visualization, data modelling, discovery, ML and statistics tools, presents a tremendous opportunity to help people solve problems and derive meaning from data more quickly and in new, exciting ways. Building a data infrastructure that caters for a higher level of self-service for downstream customers also unburdens the data and engineering teams. This is a major benefit as we are facing a shortage of data talent and demand is only increasing.

2. Data governance immaturity

Good data governance ensures that data is accessible, reliable, available and of high quality, while at the same time supporting data security and privacy compliance. We see it as the combination of three key elements: observability, discovery, and privacy & security. Over the past several years data governance has become a corporate necessity. Increased compliance requirements and data privacy regulations such as GDPR and CCPA have pushed data governance high up the strategic agenda for businesses. Still, it is frequently cited as a major hurdle by data teams, and more than 80% of organizations have immature data governance.

Insufficient data governance is a main contributor to why many data projects never reach production. As companies implement tools from the modern data stack and democratize access to working with data, it is important to ensure that only the people who are entitled to access the data are using it, and that all elements of the data stack follow compliance best practices. Moreover, as data becomes more accessible and discoverable, the need for it to be reliable and trustworthy only grows.

Data privacy presents a massive opportunity due to the “privacy debt” tech companies have accumulated. There is a huge gap between where privacy policies, processes and technologies are today and where they need to be, and we see proof of this every other day as data breaches continue to reach the headlines.

To sum up, tools that help companies create value from their data by solving some of these issues form interesting pockets of opportunity in the modern data stack. Below, we elaborate on a couple of them, in which we are keen to explore future investments.

Opportunity areas in the data stack in 2022

Codeless data consumption

We believe that low- and no-code technologies are important enablers of data democratization. Embracing low- and no-code allows companies to create a competitive advantage by moving faster, reducing the burden on engineering teams and democratising access to technologies that leverage data. When business units like Marketing, Finance and Operations are dependent on SQL-skilled data team members, it creates bottlenecks, slows down a company’s effort to become data-driven and stifles innovation. Instead, solutions like codeless data and BI platforms and no-code databases put data in the hands of end users and allow them to access data faster and build data apps and reporting dashboards, thus shortening the time to insights and actions. At the same time, data teams can free up time and resources to work on projects that, for example, improve and scale the data and analytics infrastructure.

As more companies move towards a data mesh architecture, individual business domains get to select the data tools that serve their needs best. We foresee that this movement will increase adoption of low-code platforms. It also creates opportunities for a new generation of data companies to emerge, with various degrees of low- to no-code, in several categories of the modern data stack.

Early-stage start-ups building on this trend include: Baserow*, Weld, Y42, Octolis and Whaly.

Analyst & analytics engineer empowerment

As new capabilities have been, and continue to be, introduced to the modern data stack, roles in the data team get more complicated and the division of responsibilities between data engineers, scientists and analysts will blur further. At the same time, knowledge and skill sets vary broadly between roles within the data team. For example, data engineers, who work further upstream in the data life cycle, tend to have more technical skills and are knowledgeable in a broader set of programming languages, whereas data analysts, who work further downstream, close to the data users, are mainly skilled in SQL but have a better understanding of the business operations. A major frustration associated with this setup has been the dependency on data engineers for setting up data pipelines to move and transform the data. The rise of SQL-based pipeline-building tools like dbt and Dataform has changed this for the better: analysts are now empowered to own the full data transformation process.

About two years ago, the title “Analytics Engineer” was coined to describe former analysts who are now responsible for the entire data transformation workflow (including cleaning, testing, deploying and documenting data) that happens between loading data into the warehouse and analysing it.
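To give a feel for what this SQL-centric transformation work looks like in practice, below is a minimal dbt-style model. The table and column names are made up for illustration, but the pattern (a version-controlled SELECT statement that references upstream models via ref(), which dbt compiles, materializes, tests and documents) is the one tools like dbt have popularized.

```sql
-- models/marts/fct_orders.sql (illustrative dbt-style model)
-- Cleans staged order data and enriches it with customer attributes;
-- dbt resolves the ref() calls and handles materialization in the warehouse.
{{ config(materialized='table') }}

SELECT
    o.order_id,
    o.customer_id,
    c.customer_segment,
    CAST(o.ordered_at AS date) AS order_date,
    o.amount
FROM {{ ref('stg_orders') }} AS o
LEFT JOIN {{ ref('stg_customers') }} AS c
    ON o.customer_id = c.customer_id
WHERE o.amount IS NOT NULL  -- basic cleaning step
```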

We believe that we are only at the beginning of the movement to empower data analysts and this new breed of analytics engineers. This shift will lead to more start-ups building SQL-based tools that make life easier for analysts and analytics engineers by increasing efficiency, reducing the volume of redundant tasks and leaving more room for creativity. Ideally these tools work in symbiosis with the likes of dbt. We see a number of exciting pockets of opportunity such as collaborative data science notebooks, next-generation full-stack BI platforms, and tools that facilitate self-service analytics.

Early-stage start-ups building on this trend include: Lightdash, Count, Kleene.ai, and DuckDB.

Data sharing & collaboration

Data is one of the most valuable assets of tech companies, arguably the most valuable for some. Treating “data as commerce”, i.e. sharing data between (and within) organisations, is therefore an interesting avenue for the industry going forward. Data sharing is the ability to distribute the same data resources to multiple users while maintaining high reliability and fidelity. Accessing third-party data allows companies to make data-driven decisions based on external data, ranging from public health and weather conditions to supply chain vulnerabilities and price fluctuations. Last year, Google, Databricks and Snowflake all launched data sharing and marketplace initiatives. In the coming years, we expect data sharing technologies to become a “must-have” for organisations.
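To illustrate what these offerings look like in practice, the snippet below sketches Snowflake’s SQL-based secure data sharing, where a provider exposes live, read-only data without copying it and a consumer mounts the share as a database. The share, database, table and account names are illustrative; other vendors offer comparable capabilities through their own interfaces.

```sql
-- Provider account: expose a read-only slice of data without copying it.
CREATE SHARE weather_share;
GRANT USAGE  ON DATABASE weather_db                            TO SHARE weather_share;
GRANT USAGE  ON SCHEMA   weather_db.public                     TO SHARE weather_share;
GRANT SELECT ON TABLE    weather_db.public.daily_observations  TO SHARE weather_share;
ALTER SHARE weather_share ADD ACCOUNTS = consumer_account;  -- consumer's account identifier

-- Consumer account: mount the share as a database and query it directly.
CREATE DATABASE weather_shared FROM SHARE provider_account.weather_share;
SELECT * FROM weather_shared.public.daily_observations LIMIT 10;
```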

Data sharing eliminates silos, but it also creates governance challenges. Flawless security and privacy compliance is critical for data sharing; simply passing data around via CSV files, USB drives or ad-hoc API calls makes it very hard to stay compliant with regulations like GDPR.

Data sharing also unlocks data collaboration within and across organisations. In domain-oriented data ownership models, such as the data mesh, internal data sharing and collaboration across functions become a powerful capability: domains can leverage each other’s data sources and generate more accurate insights. We see massive potential for start-ups that enable compliant and secure data collaboration, for example data federation platforms built for sectors with strict data privacy regulations like healthcare and financial services.

Early-stage start-ups building on this trend include: Harbr, Lifebit, Apheris, and RosemanLabs.

Active metadata management

We see proper metadata management as a key enabler of solid data governance, as it can facilitate data discovery, observability, privacy and security. The concept of “active metadata management” has been buzzing for some time, culminating in Gartner introducing a “Market Guide for Active Metadata” last year. The concept was formed on the notion that earlier generations of metadata management were passive, siloed platforms built for monolithic data stacks. In contrast, active metadata platforms are intelligent, cloud-native and built for the modern data stack. They continuously collect metadata at every stage of the stack (logs, query history, usage statistics, notebooks) and create intelligence from it, rather than waiting for humans to enter it manually. “Active” in this context also refers to the capability to drive actions, such as sending recommendations to data pipeline systems, generating alerts and operationalizing intelligence. Active metadata can also improve data quality by automatically triggering actions when a data quality issue is detected.
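To give a sense of where this metadata comes from, the query below shows one way usage statistics can be harvested from the warehouse itself, here using Snowflake’s ACCOUNT_USAGE.QUERY_HISTORY view as an example. An active metadata platform would collect signals like this continuously (alongside logs, lineage and access history) and turn them into alerts, recommendations or access reviews rather than a static report; the aggregation below is just an illustration.

```sql
-- Illustrative: derive simple usage metadata (who queries what, and how often)
-- from the warehouse's own query history over the last 30 days.
SELECT
    database_name,
    schema_name,
    user_name,
    COUNT(*)                       AS query_count,
    AVG(total_elapsed_time) / 1000 AS avg_runtime_seconds
FROM snowflake.account_usage.query_history
WHERE start_time >= DATEADD('day', -30, CURRENT_TIMESTAMP())
  AND execution_status = 'SUCCESS'
GROUP BY database_name, schema_name, user_name
ORDER BY query_count DESC;
```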

As data catalogs are tools that facilitate metadata management, we think that next-generation data catalogs built for active metadata management (such as Atlan) form a highly enticing pocket of opportunity, in particular as the data catalog market leaders (Collibra, Alation) were not built for the modern data stack. Next-generation data catalogs for active metadata should break down data silos and provide a single source of truth and end-to-end data visibility. They should also enable better data security and privacy when used to set up access policies and permissions.

Early-stage start-ups building on this trend include: Castor, Stemma, Zeenea, and Marquez.

* Inkef investments

At Inkef, we are extremely enthusiastic about what’s next in the world of data and would love to hear from you if you are a founder, investor, operator, or anyone who is as excited as we are about the data stack! Look out for future blog posts on the space and do reach out to laura[at]inkef.com to have a chat.
