Navigating the Data Rabbit Hole
Ever since Alice (literally) fell down the rabbit hole in Lewis Carroll’s “Alice’s Adventures in Wonderland”, the phrase has become a metaphor for getting lost in online distraction with no clear destination.
Not long ago, we too fell into a rabbit hole: the modern data stack. Data & Analytics is a core theme for us at Lightbird. After talking to many experts in the space and getting lost in said rabbit hole, we have defined a thesis on where existing challenges will spark innovation, and summarized the core aspects we consider when talking to Data & Analytics founders.
TL;DR: As more data gets generated, it becomes critical for unique business insight and value creation. With the data stack consolidating around a small set of cloud-based vendors and the cloud data warehouse becoming the system of record, new areas for innovation emerge. We are particularly excited about warehouse-native applications and Reverse ETL, which provide better, real-time insights across the whole customer lifecycle. Driven by increasing complexity and dependency, we also believe that data teams will focus more strongly on data operations. Therefore, data quality, monitoring and observability, as well as data management and governance, are becoming more important. Lastly, when assessing solutions in the Data & Analytics space, we particularly focus on the strength of the community and low barriers to adoption, combined with a 10x better user experience.
Data, data everywhere
There is little doubt among technology leaders about the importance of data to derive insights, drive decision-making and ultimately increase business value. Among the drivers are the move to the cloud, the sheer amount of data generated, and the defensibility data provides. Cloud spending is increasing and expected to reach $374bn in 2022, an estimate Bain made even before the COVID-19 pandemic. The same consultancy also expects SaaS and cloud solutions to receive the largest spending increases in 2022. What’s more, the number of SaaS applications used by the average organization has risen to 110. Consequently, more data is generated, captured, copied and consumed worldwide — an estimated 181 zettabytes by 2025. Lastly, data is seen as a strategic asset and differentiator, fueling product innovation, customer insights and business value.
However, businesses aspiring to be truly data-driven still face many challenges. While speaking to data experts at startups as well as more established corporations, it became clear that the path to (data) success remains long for many.
🔮 Expectation vs. reality: Data is often seen as the magic bullet solving all problems. Yet, more than 70% of enterprises derive low value from their data and/or have low integration of data in their strategy, operations and culture. The main reason is the lack of resources and skills to do so (see next point).
🧠 Data mindset ≠ data skills: Although many organizations want to be data-driven, they don’t know where to start or lack the resources and skills. This shortage of data specialists is prevalent across all stages. According to a survey by Gallup for the Business-Higher Education Forum, 69% of US employers expect candidates with Data Science & Analytics skills to get preference for jobs in their organizations, yet only 23% of college and university leaders say their graduates will have those skills. With the flywheel of more data produced → more data knowledge required spinning ever faster, this gap will only become more urgent to fill.
🏚️ Legacy systems: The pace of innovation in the data & analytics space has exploded in recent years (check out this market map by Firstmark to better understand what we mean). Especially in more mature companies, the main driver of data headaches is old, distributed systems. With data, it’s always easier to build new things in the data stack than to maintain old systems. A company starting today has a significant competitive advantage with its data stack over one starting 3–5 years ago.
🛡️ Data silos: Data teams are often siloed from data consumers or business units and might even, to quote one of our data experts, get requests in a freaking ticketing system like Jira. Democratizing access to data, enabling data customers and business teams to understand, find and use data, is thus critical to a) take the burden off the data teams and b) unlock more value from the available data to become truly data-driven.
The modern data stack
Despite the challenges mentioned above, huge innovation has happened in the data space. The major one is the adoption of cloud software and the cloud data warehouse as the critical infrastructure component for the emergence of the modern data stack. Why is this? The cloud data warehouse enables cheap and scalable storage, handling many different data types, timeframes and speeds. It answers the core question of where to store it all, standardizing and centralizing data from different sources and consequently becoming a system of record.
In this context, a16z voices the platform hypothesis, proposing that the whole “backend” of the data stack — not just the data warehouse but also data ingestion, storage, processing and transformation — is consolidating around a relatively small set of cloud-based vendors. As shown in the simplified illustration below, previously isolated data from different data sources is fed into the cloud data warehouse (illustrated in red below). It then gets transformed and made available in BI tools like Tableau, in data science notebooks like Deepnote, and more.
Following this thesis of the cloud data warehouse becoming the system of record or single source of truth for data, we want to highlight two areas that we are particularly excited about. The first concerns warehouse-native applications built on top of the data warehouse and Reverse ETL (Extract, Transform, Load). The second, somewhat broader area deals with the challenges arising from opening up more data systems and linking them to the cloud: Data Quality and Governance.
Warehouse-native applications & Reverse ETL
As the “backend” of the data stack starts to consolidate and monolithic, closed data systems get opened up, there is a huge potential for warehouse-native applications to emerge and to put previously siloed data into context through Reverse ETL.
💡 A wide range of new applications are emerging with a “warehouse-native” architecture in place. These applications provide better, real-time insights on the whole customer lifecycle including product usage and user journeys.
As the adoption of SaaS solutions increases, more data gets created and made available in the cloud. This data from the likes of Hubspot or Stripe can not only be fed into a data warehouse; it can also be enriched by the data already in the warehouse and synced back to the respective tool through Reverse ETL. Providers like Census or Hightouch enable this as a service. Consequently, data customers and business teams derive new and better insights right in the tools they work in every day and can act on them in real time, taking workload away from often already overstretched data teams.
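Conceptually, a Reverse ETL sync is just a mapping from enriched warehouse rows back into records a downstream SaaS tool can accept. The sketch below is a minimal, runnable illustration; the field names and matching key are hypothetical, not the actual API of Census, Hightouch, or any other vendor:

```python
# Minimal Reverse ETL sketch: turn enriched warehouse rows into update
# payloads for a downstream SaaS tool. All field names are illustrative.

def build_sync_payloads(warehouse_rows, key="email"):
    """Pair each row's match key with the enriched fields to push back."""
    payloads = []
    for row in warehouse_rows:
        payloads.append({
            "match": {key: row[key]},                              # how the tool finds the record
            "fields": {k: v for k, v in row.items() if k != key},  # enriched values to write
        })
    return payloads

# Example: a warehouse row enriching a CRM contact with computed metrics.
rows = [{"email": "ada@example.com", "lifetime_value": 1240.0, "churn_risk": 0.12}]
payloads = build_sync_payloads(rows)
```

A real connector would then send each payload to the destination tool's update endpoint, handling rate limits and retries; the interesting part is that the enrichment logic lives entirely in the warehouse.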
In addition to that, as the data warehouse becomes the system of record, it allows “frontend” developers to use it as a single point of integration (aka standardization) to build many applications on top.
We see verticalized solutions for different business functions emerging that leverage these new ways of creating, storing and using data: Fullview, Planhat or Vitally for Customer Success; Weflow for Sales and Endgame, June, Correlated, Pocus for product analytics and Product-Led Growth (check out our previous series on PLG if you’d like to dig deeper into the topic). There are also more generalist platforms like Breyta or PoggioLabs, and enablement tools like Y42, Whaly or Weld to sync user data across different SaaS tools.
A core challenge with Reverse ETL, however, which Priyanka Somrah also discusses in her newsletter “The Data Source”, is maintaining low latency in order to avoid inconsistencies between data updated in the warehouse and data fed into third-party applications like Salesforce or Zendesk. There, business users need the most up-to-date version of the data in order to act on the insights. Snowflake has already launched Change Data Capture (CDC) to address this challenge.
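Absent native CDC support, a common way to keep that latency low is an incremental sync keyed on a high-water mark such as an `updated_at` timestamp, so each run only moves rows that changed since the previous one. A simplified sketch, with a made-up schema:

```python
# Incremental-sync sketch: only push rows modified since the last run,
# tracked via a high-water-mark timestamp. The schema is hypothetical.
from datetime import datetime

def incremental_changes(rows, last_synced_at):
    """Return only rows modified since the previous sync run."""
    return [r for r in rows if r["updated_at"] > last_synced_at]

rows = [
    {"id": 1, "updated_at": datetime(2022, 5, 1, 9, 0)},
    {"id": 2, "updated_at": datetime(2022, 5, 1, 12, 30)},
]
# Only rows updated after the last sync (10:00) need to be pushed downstream.
changed = incremental_changes(rows, last_synced_at=datetime(2022, 5, 1, 10, 0))
```

CDC mechanisms such as Snowflake's table streams push this bookkeeping into the warehouse itself, exposing just the changed rows instead of requiring full-table scans on every run.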
Data Quality & Governance
One drawback of connecting more diverse data sources and the explosion of new warehouse-native applications is that it creates huge dependencies and complexities. Petr Janda, former CTO of Pleo, has illustrated this problem in the following image:
Finding, understanding, and securing data while ensuring high quality and trust is still a challenge for many. Yet it is the critical foundation upon which data success and literacy are built. Therefore, inspired by Ethan Aaron, we expect data teams to expand their mandate from analytics into operations, similar to what we saw with the rise of DevOps in engineering.
💡 There need to be better integration, visualization, security, and control capabilities that reduce the repetitive and laborious tasks required to repeatedly deliver clean, compliant data to the end user/application.
DataOps and the respective tools help users find, understand, access and secure data as well as create trust in its quality. It answers questions such as What data do we have available? Where does it live? How is it structured and used? Who has access to it? How can it be shared in a secure way?
DataOps is particularly relevant as it deals with physical systems, data models, processes and human beings. It is designed to build high-quality data and analytics solutions at an ever-faster pace and with higher reliability, and it can be broken down into different sub-categories.
🔎 Data quality, monitoring & observability: Here, the goal is to help users trust the data they find. This is especially critical as mastering data and data quality management has been and still is one of the most important trends in Data, BI and Analytics according to a report on the Top Business Intelligence Trends 2022 by BARC.
Companies are constantly battling insufficient data quality as a hurdle to making better use of their data.
Soda, a Belgium-based company, helps enterprises monitor the quality of their data and brings together different departments to collaborate and investigate data challenges together. Other players in the space are Monte Carlo, Datafold, and BigEye.
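At their core, these tools automate assertions over the data itself: null rates, value ranges, freshness. A toy version of such a check, with column names and thresholds made up for illustration, might look like this:

```python
# Toy data-quality check: flag a column whose null rate exceeds a threshold.
# Column names and the 5% threshold are illustrative, not any vendor's defaults.

def null_rate(rows, column):
    """Fraction of rows where `column` is missing or None."""
    if not rows:
        return 0.0
    missing = sum(1 for r in rows if r.get(column) is None)
    return missing / len(rows)

def run_check(rows, column, max_null_rate=0.05):
    """Return (passed, observed_rate) for a simple null-rate check."""
    rate = null_rate(rows, column)
    return rate <= max_null_rate, rate

rows = [{"email": "a@x.com"}, {"email": None}, {"email": "b@x.com"}, {"email": "c@x.com"}]
# 1 of 4 rows is null (25%), so this check fails against a 5% threshold.
passed, rate = run_check(rows, "email", max_null_rate=0.05)
```

Production tools run checks like this on a schedule against live warehouse tables and alert the team when one fails, which is where the monitoring and observability value comes from.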
🏛️ Data management & governance
The BARC survey also found data governance to be equally significant in ensuring high-quality data to work with. Data Management and Governance includes the access and collaboration aspects of working with data, which define how you handle permissions and exchanges of data, both internally and with other organizations. Companies addressing these challenges are Harbr, Oblivious and Decentriq.
Another challenge is to make data discoverable within the organization and to establish standardized metrics across all the teams to ensure alignment between the business and the data teams. Data Catalogues aim to solve exactly that and give insights on what data is actually available, where it can be found and how it is structured and being used. Tools like Castor, Atlan or Lyft’s open source program Amundsen provide such solutions.
Lastly, in order to be able to track data across repositories and pipelines for troubleshooting and/or compliance purposes, Data Lineage is critical. Manta helps enterprises understand and visualize the flow of information through the organization and brings visibility into the where, how and what of their data assets. Another well-established player in this field is Collibra, which has raised almost $600M in funding so far.
When assessing opportunities in the areas mentioned above and talking to data experts, it became clear that the bar to try and buy data tools is very high. Founders building in the Data & Analytics space should thus address the following questions from users and buyers when defining their value proposition:
🚧 Barriers to adoption: How applicable is the tool to the current tech stack and existing data sources? How well is it integrated with other tools? What are the switching and implementation costs?
📊 Data types: How fast (batch vs. real-time) and in which format (structured vs. unstructured, new vs. old) should the data be made available?
💐 Control: How much of the data stack should be kept in-house vs. outsourced? Should it be on-premise, in the cloud, or offer a hybrid solution?
💙 Community: Where can one get help? How responsive and growing is the community? Is there clear documentation available?
💨 Varia: Other aspects mentioned include the reliability and stability of the tech, calculation and loading performance, data security/GDPR, access and release management, as well as version history.
At Lightbird, we emphasize the strength of the community and low barriers to adoption as important assessment criteria, alongside a 10x better user experience. Many of the data experts we spoke to mentioned that an open-source approach with an engaged, growing community has a strong signaling effect and increases their trust — not only in how the solution will be developed over time but also in how they can get support.
We will continue to go deeper into the Data Rabbit Hole as time goes by since there is still much more to unpack. Nonetheless, we hope that this article has given you a good first overview of the drivers, challenges as well as opportunities for builders in this space. Founders focusing on building strong, sustainable communities around their products while providing a 10x better user experience than the status quo still have many areas to disrupt — and we’re bullish to partner with them, so please reach out!
And to stay up to date about everything at Lightbird, make sure to sign up for our newsletter here.
This article would not have been possible without the support of the experts that helped us nurture our understanding of the space. Many thanks to Janine Lee, Matthew Brandt, Macarena Beigier, Martin Seifer, Sarah Schneeberger, Lingxuan Zhang, Livia Kaiser, Osmo Salomaa and many more🙏
Bonus: Data Bookshelf
There are a couple of very insightful articles describing the modern data stack which we used to get started when we first fell into the rabbit hole. They might be helpful for you as well 👉 check them out here.