In Part 1, we shared some context on the evolution of the modern data stack. In this part, we take a crystal-ball approach to predicting how the data space will evolve from here. We are excited to potentially work with groundbreaking startups that will play a significant role in the evolution of the data stack, and the following is just a limited view of what lies ahead in this space.
Without further ado, let’s dig in.
Rethinking Storage Paradigms
The holy grail of data has been to establish a 'single source of truth', where analysts (humans, and increasingly, machines) can go and magically find the proverbial insightful gold. The Enterprise Data Warehouse gained prominence in the 1980s to store structured data, and storage paradigms have now largely moved to the cloud on the same premise. The scope of a warehouse was governed by cost: intelligent storage was expensive, so warehouses stored only the data deemed relevant at a point in time. This approach had limitations, as what is considered relevant today may not be relevant tomorrow, and vice versa. Moreover, classic warehouses are extremely restrictive from a data science perspective, as they tend to be structured data stores.
Enter the data lake, which extends the premise of the holy grail by allowing organizations to store everything, in any format. This is especially relevant from the perspective of utilizing unstructured data for AI and ML applications within the enterprise. However, compute inside a traditional data lake wasn’t easy due to the underlying architecture, so organizations started building a data warehouse on top of a lake to make compute (basically analytics) easier. Talk about redundancies and monolithic thinking.
Where do we go from here? We believe the best data teams are reimagining the storage paradigm. On one hand, there are signs of convergence between the data warehouse and the data lake, namely the "lakehouse", which incorporates the best of both worlds. While the resultant architecture is not clear yet, the incumbents have different takes. For instance, Snowflake is following a warehouse-first approach, while Databricks' Delta Lake takes a lake-first approach.
In parallel, we are observing a divergence from the norm, with new data repository architectures gaining traction. These architectures aim to address the inefficiencies of warehouses and lakes in specific use cases. For example, the Data Fabric/Data Mesh leverages the Enterprise Knowledge Graph to describe relationships between disparate datasets, helping extract insights across them. These are being developed by startups such as Stardog and Cinchy, and legacy players like NetApp and Talend.
Another emerging trend is the development of architectures that handle hybrid data flows, involving large batch datasets as well as low-latency time-series streams. The latter is especially useful for processing real-time data, from social networks to self-driving cars. We believe this is a promising area with a lot of disruption and growth potential.
While we look to the future when it comes to storage, it is important to note that only ~15,000 companies use Redshift and ~3,100 use Snowflake as of 2020, and there is a long way to go in developing single-source-of-truth architectures that improve data usability.
Actionable BI and Closing the Loop
The term 'Business Intelligence' (BI) owes its origins to the world of Decision Support Systems. The idea was simple: managers needed reporting to make business decisions. Given that context, the assumption was that there would always be a human in the loop. However, this paradigm is becoming increasingly archaic. An extreme example lies in the Autonomous Vehicles (AV) space, where the premise is to eliminate this very human in the loop. While the idea of a fully autonomous future remains very much an idea, we are fairly certain that the role of the human in the loop will continue to shrink, or become far more informed, as redundancies are eliminated.
Let's pick up from where we are today. Classic BI tools are unable to aid on-the-ground execution. A banking agent out in the field for collections does not benefit from branch-level loan performance data on a central dashboard. She needs a mapped-out route indicating each location she needs to visit, in sequence. In an ideal world, each agent would be equipped with five discussion guides, each suited to a different customer segment, and a tool to prompt which guide to use for which customer. This is actionable BI, and one we believe is increasingly becoming a reality. We can paint this picture further, but hopefully you get the idea.
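The field-agent example above can be sketched as a simple rule-based guide selector. This is purely illustrative: the segment rules, thresholds, and guide names are hypothetical, not drawn from any real lending product.

```python
# Hypothetical sketch: routing a collections agent to the right discussion guide.
# Segment rules, thresholds, and guide names are illustrative assumptions.

def pick_guide(customer):
    """Map a customer record to one of five discussion guides."""
    if customer["days_past_due"] > 90:
        return "guide_hardship_restructuring"
    if customer["days_past_due"] > 30:
        return "guide_firm_reminder"
    if customer["missed_payments"] >= 2:
        return "guide_payment_plan"
    if customer["tenure_months"] < 6:
        return "guide_new_customer_education"
    return "guide_courtesy_checkin"

# A day's route, with the guide the tool would surface at each stop.
route = [
    {"name": "Customer A", "days_past_due": 95, "missed_payments": 3, "tenure_months": 24},
    {"name": "Customer B", "days_past_due": 0, "missed_payments": 0, "tenure_months": 3},
]
for stop in route:
    print(stop["name"], "->", pick_guide(stop))
```

In a real actionable-BI product, the rules would be learned or configured centrally and pushed to the agent's device, but the shape of the logic is the same: per-record decisions at the point of action, not aggregate dashboards.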
The data pipeline today is built to support high-level business decisions, not to directly fuel business operations. Hence, Excel remains a popular tool among operational teams because it is easy to use. As investors, we hold an inherent belief that wherever we find excessive Excel usage within a business vertical, it is time for someone to build a platform that first eliminates Excel and then adds use-case-specific bells and whistles. Here are a few ways this might play out:
- Human-in-the-loop Workflows: Teams have always thought of workflows and BI as separate instances. As internal tooling proliferates across workflows, organizations are beginning to realize that workflows and BI can be nicely linked together. Abstracting this, we think the gap between the data pipeline and the operational pipeline could be bridged by internal app development platforms (such as Retool and AppSmith) that merge BI and workflows into one customizable low/no-code actionable BI platform for human-in-the-loop workflows.
- Automated Feedback Loops: BI has been unidirectional — data moves from a source to an analyst, who may or may not take an action. For instance, a sales leader may not think twice about slowing growth in a region, even where the intelligence shows the company is losing market share. In an ideal world, such an event would automatically create tasks on the regional manager's to-do list, guiding sales teams to activate promotions. This is an example of connectors pushing data from BI back into SaaS tools, a pattern known as Reverse ETL, with the likes of Census and Tray building interesting products here. The possibilities are endless: feedback loops from BI to storage, storage to SaaS, you name it, could potentially revolutionize how operating teams function.
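The Reverse ETL pattern above can be sketched in a few lines. This is a minimal illustration, not how Census or Tray actually work — real tools handle authentication, scheduling, schema mapping, and retries. Here the warehouse rows are simulated in-memory and the SaaS call is stubbed.

```python
# Minimal Reverse ETL sketch: warehouse metrics trigger tasks in a (stubbed) SaaS CRM.
# All names and thresholds below are illustrative assumptions.

WAREHOUSE_ROWS = [
    {"region": "North", "market_share_delta": -0.04, "manager": "priya"},
    {"region": "South", "market_share_delta": 0.02, "manager": "arjun"},
]

def create_crm_task(manager, description):
    # Stand-in for a real SaaS API call (e.g. a CRM's task-creation endpoint).
    return {"assignee": manager, "description": description, "status": "open"}

def sync(rows, threshold=-0.02):
    """Create a task whenever a region's market-share loss exceeds the threshold."""
    tasks = []
    for row in rows:
        if row["market_share_delta"] <= threshold:
            tasks.append(create_crm_task(
                row["manager"],
                f"Investigate share loss in {row['region']}; consider promotions",
            ))
    return tasks

print(sync(WAREHOUSE_ROWS))
```

The key design point is directionality: instead of an analyst reading a dashboard and deciding to act, the pipeline itself closes the loop by writing actions back into the operational tools where the team already works.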
- Verticalized Visualizations: The hedgehogs in a world of foxes, vertical BI tools are designed for a specific user persona. For example, ML visualization tools like Plotly Dash and Streamlit enable data scientists to build visual representations of their models and circulate them as web apps to non-technical users, who can easily interpret the model inferences and act on the data at hand. Other companies like Amplitude, Locale and Glean are building visualization and analytics solutions for non-technical teams so that insights can immediately be converted to results at the point of action.
DataOps: Here and Now
There are 22,000+ job listings for DevOps roles in India. The same number for DataOps is under 2,000. We are willing to go out on a limb and say that this number will explode in the near future. Data has become increasingly complex over the last decade: data teams have grown massively in size, and the data any organization handles has become far more complicated. This has created the need for a management layer to orchestrate the data pipeline, along with specialist tooling and (human) resources.
A few DataOps opportunities that excite us include:
- Metadata Management: Think better cataloging of an organization's datasets. While the idea has been around for a while, fantastic tooling is only just starting to emerge, and it will change how metadata management is done. Some tools include Collibra, Alation, and Lyft's open-source Amundsen.
- Data Observability and Quality: Avoiding 'garbage in, garbage out' is important, and platforms such as Monte Carlo, AccelData and Soda aim to verify data quality and reliability with features solving for data freshness monitoring, distribution tracking, anomaly and outlier detection, and schema errors. Others like Sisu, Anodot, and Outlier enable observability of metrics important to various business units.
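Two of the checks named above — freshness monitoring and outlier detection — can be sketched with the standard library alone. This is a toy version under stated assumptions (a 24-hour freshness window, a z-score outlier test); production observability platforms do far more, such as learning baselines per table and tracking full distributions.

```python
# Minimal data-quality sketch: freshness check + z-score outlier detection.
# Thresholds and the sample metric are illustrative assumptions.

import statistics
from datetime import datetime, timedelta, timezone

def is_fresh(last_loaded_at, max_age=timedelta(hours=24)):
    """Flag a table as stale if its last load is older than max_age."""
    return datetime.now(timezone.utc) - last_loaded_at <= max_age

def outliers(values, z_threshold=2.0):
    """Return values more than z_threshold standard deviations from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    if stdev == 0:
        return []
    return [v for v in values if abs(v - mean) / stdev > z_threshold]

# Daily order counts: the last point is an obvious anomaly.
daily_orders = [120, 118, 125, 122, 119, 121, 540]
print(outliers(daily_orders))
```

The same shape of check, run continuously against every table and metric, is essentially what the observability platforms productize — along with alerting, lineage, and root-cause analysis on top.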
- Data Governance and Privacy: We wanted to write a whole paper on this — especially given how misunderstood the topic is among consumer companies in India (we still might, stay tuned). However, there is a whole suite of platforms that tackle Governance and Privacy, and we expect many more to emerge as these topics continue to take center stage globally.
Interestingly, some platforms are emerging to enable more than one DataOps pillar, such as Data.world, Atlan, and Datakitchen. These players aim to become the data workplace within an organization, and we anticipate more startups trying to build a GitHub for data.
Data Teams of the Future and Collaboration
What good is an article without some mention of the New Normal? Global teams, remote work, distributed knowledge assets and more mean that collaboration must be front and center for any application in the world of data. This is pivotal for organizations with complex data use cases and large data teams, as these organizations tend to be more distributed, both organizationally and geographically.
Data scientists use the notebook as their default workspace. Jupyter, for instance, is a great platform that gives data teams a workspace combining visualizations, text, mathematical models, and code, all in an interactive web environment. Several companies are building products that expand on this idea, including Jovian.ml, Zepl (a Vertex Ventures US portfolio company) and Polynote. We believe a lot more use cases and functionality will be built to augment the notebook.
Collaboration on datasets has also been gaining traction. Tools like PopSQL and Dataform enable collaboration on SQL queries and shared query libraries for common searches. There are also platforms enabling real-time collaboration. For instance, Mode and Graphy are building a Google Docs-like BI experience, while tools like Cord aim to build a Figma-for-everything by enabling annotation and chat on top of any software.
In summary, there is tremendous potential across the stack, and the key areas and trends we have broadly outlined are highlighted in the chart below.
This is the decade of data, and the team at Vertex Ventures Southeast Asia and India is keen to partner with teams solving these global challenges! We would love to hear your thoughts and takes on the data space, so do reach out to us if you're building, or know of, any exciting teams revolutionizing the data stack.