The modern data stack has seen explosive growth. Globally, VCs have invested roughly $12B in this space over the last two years. Snowflake had a blockbuster IPO in 2020 and is now generating about $1 billion in annual revenue. Adoption of cloud technology led to a rise in SaaS and consumer tech applications and transformed the way organisations operate. Similarly, the modern data stack is enabling organisations to access, analyse and use data to build personalised applications for their end users.
In the future, companies of every size, from small businesses to large enterprises, will be not just software driven but also data driven, especially in their decision making.
Adoption of cloud and the growth of SaaS apps have led to data growing at an exponential rate. As per IDC, 175 zettabytes of data will be created by 2025 (a zettabyte is a trillion gigabytes), and the volume will continue to grow at 23% per year. As AI/ML applications continue to rise, data and engineering teams find it challenging to build and maintain data infrastructure, and thus they need the right set of tools.
The modern data stack comprises a collection of cloud-native tools that are centered around a cloud data warehouse and cover the different stages of the data journey, from ingestion and storage to transformation and business intelligence, as shown below.
To start with, organisations need data ingestion tools like Fivetran or Hevo Data to move data from multiple sources into a cloud data warehouse such as Snowflake (or a data lake), and then run transformation logic using dbt. This data can then be used by data analysts to derive insights, by data scientists to run AI/ML experiments, or served back to applications for enrichment using reverse ETL tools; a sketch of that last hop follows below. New categories are emerging as the data ecosystem matures and companies explore opportunities to do more with their data.
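To make the reverse ETL idea concrete, here is a minimal sketch of enriched rows flowing from the warehouse back into an operational tool. The table, credentials and CRM endpoint are hypothetical placeholders, not any specific vendor's API.

```python
import requests
import snowflake.connector

# Read enriched customer attributes out of the warehouse (names are illustrative).
conn = snowflake.connector.connect(account="my_account", user="sync", password="***")
cur = conn.cursor()
cur.execute("SELECT email, lifetime_value FROM analytics.customer_scores")

for email, ltv in cur.fetchall():
    # Push each enriched attribute to a hypothetical CRM REST endpoint,
    # where sales or marketing tools can act on it.
    requests.post(
        "https://crm.example.com/api/contacts",
        json={"email": email, "lifetime_value": ltv},
        timeout=10,
    )
```

Commercial reverse ETL tools essentially productise this loop: scheduling, retries, schema mapping and per-destination connectors.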
Below are the emerging trends in this space:
A. A fundamental shift in data storage technologies is the main driver of innovation
The advent of cloud data warehouses such as Amazon Redshift and Snowflake around 2014 was an inflection point that eventually led to the rise of the modern data stack. There has been a shift in how data teams build data pipelines: a move from the ETL to the ELT technique. With ELT, data teams first extract data from various sources and load it into the data warehouse as-is, and then run transformation logic within the warehouse itself. This led to the emergence of two new categories: data ingestion tools such as Fivetran, Hevo Data and Stitch, and transformation tools such as dbt. A minimal sketch of the ELT pattern follows.
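The sketch below assumes a hypothetical source API and Snowflake table names; the in-warehouse SQL step at the end is the part that tools like dbt manage, version and test at scale.

```python
import json

import requests
import snowflake.connector

conn = snowflake.connector.connect(account="my_account", user="loader", password="***")
cur = conn.cursor()

# Extract + Load: dump raw JSON into a staging table with no reshaping yet.
orders = requests.get("https://api.example-shop.com/orders", timeout=30).json()
for order in orders:
    cur.execute(
        "INSERT INTO raw.orders_json (payload) SELECT PARSE_JSON(%s)",
        (json.dumps(order),),
    )

# Transform: the warehouse itself reshapes the raw JSON into an analytics table.
cur.execute("""
    CREATE OR REPLACE TABLE analytics.orders AS
    SELECT payload:id::string     AS order_id,
           payload:total::float   AS order_total,
           payload:ts::timestamp  AS ordered_at
    FROM raw.orders_json
""")
```

The key design choice versus classic ETL: because raw data lands first, transformations can be rewritten and re-run later without re-extracting from the source.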
Data lakes and data warehouses each have their own pros and cons. Data lakes are good for storing any kind of raw data, making them a great fit for ML experiments, but they have poor support for BI/SQL interfaces. Data warehouses, on the other hand, are ideal for structured data and BI/analytics, but offer weak support for ML workloads.
The data lakehouse aims to unify the best functionalities of data lakes and cloud data warehouses, so that the entire data team, from analysts and data engineers to data scientists, can collaborate on a single platform; the sketch below illustrates the idea. Snowflake's Data Cloud and Databricks's lakehouse are frontrunners in this innovation.
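Here is a minimal PySpark sketch of the lakehouse pattern, in which a single Delta Lake table serves both an analyst's SQL query and a data scientist's feature preparation. The storage paths and column names are assumptions for illustration, and the session config assumes the delta-spark package is installed.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("lakehouse-sketch")
    # These two configs enable Delta Lake on a stock Spark session.
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# One table on cheap object storage, readable by everyone (hypothetical path).
events = spark.read.format("delta").load("s3://lake/events")
events.createOrReplaceTempView("events")

# Analysts run warehouse-style SQL on it...
daily_actives = spark.sql(
    "SELECT event_date, COUNT(DISTINCT user_id) AS dau FROM events GROUP BY event_date"
)

# ...while data scientists derive ML features from the identical source,
# so BI and ML never diverge onto separate copies of the data.
features = events.groupBy("user_id").count().withColumnRenamed("count", "event_count")
features.write.format("delta").mode("overwrite").save("s3://lake/features/user_activity")
```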
Just as the adoption of cloud data warehouses led to the rise of new categories in the data stack, the emergence of lakehouses will unlock opportunities for new categories. And as the market for data warehouses and lakehouses grows (Snowflake's annual revenue has been growing at roughly 100% year over year), so does the market size for all adjacent categories in the data stack and ML stack space.
B. What more can we do with data? The rise of data products
The tech stack for data ingestion, storage and processing has fairly matured, and thus meets the bottom of the data hierarchy of needs for most organisations. Now they can focus on higher-value applications of data, such as real-time streaming, machine learning and data products, which sit higher in that hierarchy.
Modern tech organisations use historical data to generate insights and understand operational metrics. But can data be leveraged to build data products? Netflix's personalised recommendation page is a prime example. I have been reading the Netflix tech blog recently, and I am amazed by the kind of algorithms and A/B testing experiments the teams run in the back end so that over 200 million subscribers each get a personalised homepage.
However, building data products is quite challenging: it is very time consuming, and data engineers need to stitch together multiple open-source tools, from Apache Kafka and Apache Spark to job schedulers such as cron or Airflow and memory/cache layers, to move data from the warehouse to end-user apps or microservices (a sketch of one such hop follows). Implementation time for such a project can vary from weeks to months.
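As a hedged sketch of that warehouse-to-app hop, the Airflow task below copies precomputed recommendations from the warehouse into a cache that an application can read at request time. Connection details, table and key names are all hypothetical.

```python
from datetime import datetime

import redis
import snowflake.connector
from airflow import DAG
from airflow.operators.python import PythonOperator


def publish_recommendations():
    # Pull precomputed recommendations out of the warehouse (illustrative names).
    conn = snowflake.connector.connect(
        account="my_account", user="loader", password="***", database="ANALYTICS"
    )
    rows = conn.cursor().execute(
        "SELECT user_id, recommended_items FROM marts.user_recommendations"
    ).fetchall()

    # Write them into a low-latency cache the serving API reads per request.
    cache = redis.Redis(host="cache.internal", port=6379)
    for user_id, items in rows:
        cache.set(f"recs:{user_id}", items)


with DAG(
    dag_id="publish_recommendations",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    PythonOperator(task_id="warehouse_to_cache", python_callable=publish_recommendations)
```

Even this simplified version involves three systems (warehouse, orchestrator, cache), which is exactly the operational burden that emerging data API platforms aim to remove.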
In the future, more consumer-facing companies will build data products to gain a competitive advantage, and the back-end data infrastructure that enables this will evolve. Snowflake and Databricks are adding capabilities through their lakehouse platforms to enable companies to build real-time data products on top of them.
We believe there is a large white space here, and new categories will emerge, such as no-code data API platforms that enable organisations to build data products faster.
C. Need for real-time/streaming pipelines will become mainstream
The majority of companies use batch processing, wherein data is processed in batches at certain intervals; this is data at rest, stored in a data warehouse over a period of time. Streaming data processing, by contrast, happens on data in motion, as it flows through the pipeline. Currently, there are a limited number of use cases for real-time streaming: online fraud detection, Uber's dynamic pricing, Netflix's personalised recommendations, and so on. A minimal batch job is sketched below for contrast.
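To make the batch side of the contrast concrete, here is a minimal nightly rollup of the kind a cron entry or an orchestrator would trigger. The tables and metric are hypothetical; note that insight only becomes available after the scheduled run, never sooner.

```python
import snowflake.connector


def nightly_revenue_rollup():
    # Summarise yesterday's orders into a daily metrics table (names illustrative).
    conn = snowflake.connector.connect(account="my_account", user="batch", password="***")
    conn.cursor().execute("""
        INSERT INTO analytics.daily_revenue
        SELECT CURRENT_DATE - 1, SUM(order_total)
        FROM analytics.orders
        WHERE ordered_at >= CURRENT_DATE - 1 AND ordered_at < CURRENT_DATE
    """)

# Scheduled externally, e.g. a crontab entry: 0 2 * * * (02:00 daily)
```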
Confluent, the company behind Kafka, has been a trailblazer, and its successful IPO in 2021 accelerated the real-time/streaming data stack. As per a McKinsey study, the costs of real-time data messaging and streaming pipelines have decreased significantly.
However, setting up and managing a stream processing stack is more complex for data teams than managing batch processing. Companies such as Whatnot (livestream e-commerce), Netflix, Airbnb and Zillow have built internal streaming data stacks using open-source tools such as Kafka and Apache Spark. New players are emerging across categories: ClickHouse in real-time analytics; Materialize and Apache Flink in real-time processing; and Amazon Kinesis and Google Pub/Sub as cloud-hosted streaming engines. A minimal consumer loop is sketched below.
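For the streaming side of the earlier contrast, here is a minimal consumer loop using the kafka-python client: each event is scored the moment it arrives rather than waiting for a scheduled batch. The topic name and the fraud rule are illustrative assumptions.

```python
import json

from kafka import KafkaConsumer

# Subscribe to a hypothetical payments topic; each message is a JSON event.
consumer = KafkaConsumer(
    "payments",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
)

for message in consumer:
    event = message.value
    # Data in motion: the check runs per event, with no nightly window to wait for.
    if event["amount"] > 10_000:
        print(f"flagging suspicious payment {event['id']} for review")
```

Production stacks layer exactly-once semantics, windowed aggregations and state stores on top of this loop, which is where engines like Flink and Materialize come in.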
Demand for real-time streaming technologies is increasing as companies want to analyse data in real time for churn prediction, forecasting and in-app personalisation. There is an opportunity for newer categories to emerge in real-time streaming data infrastructure.
We at Gemba Capital are looking to back category-defining companies in data management, the streaming data stack, data privacy and more. If you're an entrepreneur building in the data stack space, reach out to me at kamini@gembacapital.in
We would like to thank the data engineers and data scientists at Plum, Flipkart, Carousell and Airmeet who gave their valuable time to share insights on this space.
Sources
Pitchbook, IT Business Edge, Bessemer Venture Partners, Matt Turck's MAD Landscape