The fact that we are dealing with an explosion of data all around us is not lost on anyone. A lot of time in board meetings is spent on ‘data strategy’. Are you an offline retailer? Your board is bound to deliberate how you leverage your data to get ahead of online retail. Do you run movie theaters? Your lead investor is bound to ask you to use your data and stay ahead of Hotstar and Netflix. Data teams are front and center in every organization — startups to 150-year-old conglomerates. LinkedIn India alone lists over 40,000 jobs across data science and business analytics.
What is less understood, however, is that data is not easy to handle and use at scale.
Processing, storing, and extracting value from large datasets is a challenging task. Moreover, this is a rapidly compounding problem, with 90% of the 44-zettabyte global datasphere having been created in the last two years alone. Our experience is that it hasn’t been easy — even for mature startups with all the dollars to invest and the best people to hire — to put together a cohesive data strategy. Part of this has to do with lack of experience: a lot of teams are learning as they go, making (often expensive) mistakes along the way. This is further exacerbated by a fairly complicated, rapidly evolving data stack.
Broadly, this is what the modern data stack looks like:
The early version of the data stack that evolved over the 80s and 90s was fairly linear in terms of building blocks: storage was managed by a DBMS that was connected via pipelines to somewhat static sources at one end and to fairly simplistic output tools on the other. This was almost all consultative and custom built. As more distributed data started to trickle in, organizations (led by some of the tech giants of today) started to develop tools to handle and process large, distributed data. Many of these tools are in fact well recognized open-source projects (like Hadoop, Hive, and so on) that in many ways form the heart of the modern data stack today. AWS led the wave of moving the entire data stack to the cloud, made infrastructure and compute elastic, and offered it as a service. Today, cloud storage is ubiquitous, pipelines have turned on their head (ETL to ELT) given the elasticity of compute in the cloud, orchestration is fairly mature, and so on.
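The ETL-to-ELT shift mentioned above can be made concrete with a small sketch. In ELT, raw data is loaded into the warehouse first and transformed afterwards with the warehouse's own (elastic) compute, rather than being cleaned in a pipeline before loading. The sketch below is illustrative only, using an in-memory SQLite database as a stand-in for a cloud warehouse; table and field names are invented for the example.

```python
import sqlite3

# Raw "source" records, as they might arrive from an operational system:
# amounts as strings, country codes in inconsistent case.
raw_orders = [
    {"id": 1, "amount": "19.99", "country": "in"},
    {"id": 2, "amount": "5.00", "country": "us"},
]

warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE raw_orders (id INTEGER, amount TEXT, country TEXT)")

# ELT step 1: load the data first, untransformed.
warehouse.executemany(
    "INSERT INTO raw_orders VALUES (:id, :amount, :country)", raw_orders
)

# ELT step 2: transform inside the warehouse, using its compute.
warehouse.execute(
    """CREATE TABLE orders AS
       SELECT id, CAST(amount AS REAL) AS amount, UPPER(country) AS country
       FROM raw_orders"""
)

print(warehouse.execute("SELECT * FROM orders ORDER BY id").fetchall())
# [(1, 19.99, 'IN'), (2, 5.0, 'US')]
```

Because the raw table is retained, new transformations can be run later against the original data without re-extracting from the source — one reason cheap, elastic cloud storage and compute made ELT attractive.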
While the foundation of the modern data stack has been laid through various advancements over the last decade, many issues of dealing with data have still not been addressed. Having a single source of truth is still a utopian pursuit, Excel spreadsheets are still a core component of most data pipelines, and managing enormous data assets is a large and growing problem.
Confluent’s recent S-1 has a telling representation of these silos:
In addition to this enterprise architecture-driven chaos is the human layer of complexity too, where various teams exist — at times working in silos — to maintain, utilize, and extract value from the data stack.
With this context, we dug deeper into how the modern data stack would evolve over this decade and which areas would see major disruptions. Broadly, there are various redundancies across different components, some predominant architectures are archaic, and it is still difficult to work with the data stack. We’ve briefly articulated how some of these would be re-imagined, and each of these has already started seeing innovation from companies around the world:
- New ways of thinking about data repositories: With data being split across lakes and warehouses today, we see a possible convergence to a ‘lakehouse’ architecture. We also see several forms of specialized data repositories that enable organizations to derive specific benefits, like the data fabric that aims to extract value from relationships between datasets, and time series-optimized repositories built for the real-time flows that are ubiquitous today.
- Actionable Business Intelligence Platforms: For a long time, we have lived in a world where insight was the prime commodity. In today’s world, a lag between insight and action can be lethal, and that lag needs to be crunched. We will see organizations reducing the time to action at scale. This could entail building workflows linked to insights, building feedback loops from BI back to operational platforms, and creating verticalized platforms for specific BI use cases.
- A Comprehensive DataOps Layer: Data management will evolve the same way code management has: we will see the emergence of platforms that catalog data, monitor the quality of data assets, and ensure that they’re secure and responsibly used. We’re seeing a host of interesting tooling here that will only grow, as will the role of the DataOps team.
- Collaboration as a Core Feature: In the new normal, data platforms will be built with collaboration as one of the key features, be it in creating models, querying results, or making sense of insights.
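To make the DataOps idea above a little more tangible: one of its simplest building blocks is an automated quality gate that validates a dataset before it is published downstream, much as tests gate a code merge. The sketch below is a minimal, hypothetical illustration — the function and field names are invented for the example, and real platforms offer far richer checks.

```python
def check_dataset(rows, required_fields, non_null=()):
    """Return a list of human-readable data-quality issues found in `rows`."""
    issues = []
    for i, row in enumerate(rows):
        for field in required_fields:
            if field not in row:
                issues.append(f"row {i}: missing field '{field}'")
            elif field in non_null and row[field] is None:
                issues.append(f"row {i}: null value in '{field}'")
    return issues

# Illustrative records: one clean, one with a null key, one missing a field.
rows = [
    {"user_id": 1, "email": "a@example.com"},
    {"user_id": None, "email": "b@example.com"},
    {"email": "c@example.com"},
]

issues = check_dataset(rows, required_fields=["user_id", "email"], non_null=["user_id"])
print(issues)
# ["row 1: null value in 'user_id'", "row 2: missing field 'user_id'"]
```

In a real pipeline, a non-empty issue list would block the dataset from being promoted and alert the owning team — the data analogue of a failing CI build.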
Our colleagues in the Global Vertex Network have made some very exciting investments in the data space, and this piece derives inspiration from the exciting work these companies are doing. Some of these companies include Quilt, Upsolver, and Zepl in the US; Recko in India; and Scylla, DataRails, and Aporia in Israel, among others.
This was a sneak peek into our thesis around data infrastructure and the modern data stack. We have now published Part 2 where we double click on all these trends.
Enjoy what you’re reading? Stay tuned to updates from the Vertex Network by signing up below.