Published in VALIDIO

5 Data Trends in 2022

In 2021, we saw the buzz around the rise of the Modern Data Stack accelerate considerably. We now have a tsunami of newsletters, influencers, investors, dedicated websites, conferences, and events evangelizing it. The concept of the Modern Data Stack (albeit still in its early innings) is tightly connected with the explosion of data tools in the cloud. The cloud comes with a new model of infrastructure that helps us build these data stacks fast, programmatically, and on demand, using cloud-native technologies like Kubernetes, infrastructure-as-code tools like Terraform, and DevOps best practices. Infrastructure therefore becomes a critical factor in building and implementing a Modern Data Stack.

1. The rise of the Analytics Engineer

If 2020 and 2021 were all about the rise of the data engineer (which, according to Dice’s tech jobs report, was the fastest-growing job in tech in 2020), in 2022 the analytics engineer will make a definitive entrance into the spotlight.

LinkedIn job posting data as of 1st of December 2021

2. The data warehouse vs data lakehouse war intensifies (and lines get increasingly blurred)

Very few in the data community missed the very public showdown between Databricks and Snowflake at the end of 2021. It all started when Databricks claimed a TPC-DS benchmark record for its data lakehouse technology and said a study showed it was 2.5X faster than Snowflake. Snowflake came out fighting: in a blog post published by its founders, it said that Databricks lacked integrity and that the study was flawed.

“In the vast majority of accounts that we are in, we co-exist with Snowflake — the overlap in accounts is massive… What we’ve seen is that more and more people now feel like they can actually use the data that they have in the data lake with us for data warehousing workloads. And those might have been workloads that otherwise would have gone to Snowflake.”

The data warehouse vendors are gradually moving from their existing model toward one that converges the data warehouse and the data lake. Similarly, the vendors who started their journey on the data lake side are now expanding into the data warehouse space. Convergence is happening from both sides.

The evolution of data storage and warehouses

“I think Snowflake will be very successful, and I think Databricks will be very successful… You will also see other ones pop up in the top, I’m sure, over the next three to four years. It’s just such a big market and it makes sense that lots of people would focus on going after it.”

According to Bill Inmon, who has long been considered the father of data warehouses, the data lakehouse presents an opportunity similar to the early years of the data warehouse market. The data lakehouse can “combine the data science focus of the data lake with the analytics power of the data warehouse.”

Data Warehouse vs Data Lake vs Data Lakehouse by Striim

3. Real-time streaming pipelines and operational analytics will continue to push through

As Matt Turck notes in his MAD Landscape 2021 analysis, it feels like real-time has been a technology paradigm that has always been just about to explode. As we’ve entered 2022, the trade-off we hear about still seems to be cost and complexity. If a company is getting a cloud data warehouse off the ground and needs to show impact within 4–6 weeks, the overall notion still seems to be that setting up real-time streaming pipelines is a heavy lift compared to batch pipelines, or that it is simply overkill if the company is at the beginning of its data journey.

“If you don’t have a real-time streaming system, you have to deal with things like, okay, so data arrives every day. I’m going to take it in here. I’m going to add it over there. Well, how do I reconcile? What if some of that data is late? I need to join two tables, but that table is not here. So, maybe I’ll wait a little bit, and I’ll rerun it again.” — Ali Ghodsi on a16z
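The reconciliation problem described in the quote can be illustrated with a small sketch. It is a toy illustration rather than a real pipeline: the event tuples, the dates, and the helper names `totals_as_of` and `stale_partitions` are all hypothetical. A daily batch job publishes a total for January 1, a row for that day shows up a day late, and the rerun has to correct the figure that was already published.

```python
from collections import defaultdict
from datetime import date

# Hypothetical events: (event_date, arrival_date, amount).
EVENTS = [
    (date(2022, 1, 1), date(2022, 1, 1), 10),
    (date(2022, 1, 1), date(2022, 1, 2), 5),  # belongs to Jan 1, arrives Jan 2
    (date(2022, 1, 2), date(2022, 1, 2), 7),
]

def totals_as_of(events, as_of):
    """Sum amounts per event date, using only rows that had arrived by `as_of`."""
    totals = defaultdict(int)
    for event_date, arrival_date, amount in events:
        if arrival_date <= as_of:
            totals[event_date] += amount
    return dict(totals)

def stale_partitions(previous, current):
    """Return the days whose previously published total changed after a rerun."""
    return sorted(day for day, total in previous.items() if current.get(day) != total)

# First nightly run: only sees what had arrived by Jan 1.
first_run = totals_as_of(EVENTS, date(2022, 1, 1))

# Next nightly run: the late row has arrived, so Jan 1 must be recomputed.
rerun = totals_as_of(EVENTS, date(2022, 1, 2))
late_days = stale_partitions(first_run, rerun)
```

Here `late_days` contains January 1, the partition that has to be republished. A streaming system handles the same row as it arrives instead of waiting for the next scheduled run.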

Apache Kafka has been a solid streaming engine for the last 10 years. Enter 2022 and we see companies increasingly moving towards cloud-hosted engines like Amazon’s Kinesis and Google’s Pub/Sub.

“Cloud data warehouses were designed to support BI use cases, which amount to large queries that scan entire tables and aggregate the results. This is ideal for historical data analysis, but less so for the “what is happening now?” class of queries that are becoming increasingly popular to drive real-time decision-making. This is what operational analytics refers to. Examples include in-app personalization, churn prediction, inventory forecasting, and fraud-detection. Relative to BI, operational analytics queries join many disparate sources of data together, require real-time data ingestion and query performance, and must be able to process many queries concurrently.”

As noted by McKinsey back in 2020, the costs of real-time data messaging and streaming pipelines have decreased significantly, paving the way for mainstream use. In a recent article, McKinsey further predicts that by 2025 the way data is generated, processed, analyzed, and visualized for end-users will be dramatically transformed by new and more ubiquitous technologies, such as kappa or lambda architectures for real-time analysis, leading to faster and more powerful insights. They believe that even the most sophisticated advanced analytics will be reasonably available to all organizations as the cost of cloud computing continues to decline and more powerful “in-memory” data tools (e.g. Redis, Memcached) come online.
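For readers unfamiliar with the two architectures mentioned above, here is a deliberately simplified, in-memory sketch. A lambda architecture answers a query by merging a periodically recomputed batch view with a speed layer that covers the events since the last batch run. A kappa architecture keeps a single streaming path and derives the view by replaying the whole log. The log contents, the cutoff offset, and all function names are assumptions made for illustration, not part of the McKinsey article or of any specific product.

```python
from collections import Counter

# Hypothetical append-only event log: (offset, event_type).
LOG = [(1, "checkout"), (2, "login"), (3, "checkout"), (4, "checkout")]
BATCH_CUTOFF = 2  # assumed offset of the last event covered by the batch job

def batch_view(log, cutoff):
    """Lambda batch layer: counts recomputed from events up to the cutoff."""
    return Counter(kind for offset, kind in log if offset <= cutoff)

def speed_view(log, cutoff):
    """Lambda speed layer: counts for events the batch job has not seen yet."""
    return Counter(kind for offset, kind in log if offset > cutoff)

def lambda_query(kind, batch, speed):
    """Serving layer: merge the batch and speed views at query time."""
    return batch[kind] + speed[kind]

def kappa_view(log):
    """Kappa: one streaming path, with the view built by replaying the full log."""
    return Counter(kind for _, kind in log)

batch = batch_view(LOG, BATCH_CUTOFF)
speed = speed_view(LOG, BATCH_CUTOFF)
lambda_checkouts = lambda_query("checkout", batch, speed)
kappa_checkouts = kappa_view(LOG)["checkout"]
```

Both paths arrive at the same count for "checkout". The difference is operational: lambda maintains two code paths (batch and speed) that must agree, while kappa maintains one and relies on the log being replayable.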

The stages of data pipeline maturity by Chris Riccomini

4. The rise of Cloud Marketplaces for Modern Data Stack adoption

The PLG (product-led growth) trend has been growing for several years in the data infrastructure space, as usage-based pricing, open source, and the affordability of software have pushed purchasing decisions to the end-users. However, compared to traditional sales-led go-to-market models, product-led growth and usage-based pricing can be complex to implement and execute from both a business model and a product standpoint. Cloud Marketplaces, such as those run by AWS, GCP and Azure, are emerging as the best first step as businesses evolve towards the future of digital selling.

  • The committed enterprise spend flowing through the three major Cloud Providers exceeds $250 billion per year — and that number is climbing fast.
  • For 2021 alone, independent software providers generated over $3 billion in revenue through the Cloud Marketplaces, based on Bessemer predictions. Bessemer expects that to grow by a multiple of 10 in the coming years.
  • Forrester had projected that 17% of the $13 trillion global B2B spend would flow through e-commerce and Marketplaces by 2023 — but that number might have been reached already in 2021.
  • A 2020 Tackle survey found that 70% of software providers said they have increased focus and investment in Marketplaces as a go-to-market channel due to COVID-19.
The win-win of Cloud Marketplaces

5. Harmonization and consistency of terminology around the Modern Data Stack and data quality

It has been pretty incredible to see the data quality space in the context of the Modern Data Stack go from a niche category in 2020 to completely exploding during the past 18 months, with a combined $200M of capital flowing into the space in 2021. Even G2 noted in their recent “What Is Happening in the Data Ecosystem in 2022” article that 2022 will be all about data quality and that, in 2021, they saw an unusually steep increase in traffic to the data quality category.

The data quality category in the 2020 MAD Landscape vs the 2021 Landscape by Matt Turck

In summary

We believe we are still in the very early days of a revolution in the Modern Data Stack. Just as the cloud changed the way we work today, harnessing data through modern cloud-native infrastructure is becoming essential to companies of all sizes and industries. Additionally, as modern data stacks become more widely adopted, we expect to see numerous areas for further enhancement, including streaming data to allow companies to take real-time action.

Oliver Molander

Co-founder at Validio and early-stage tech investor at J12 Ventures. Preaching about the realities & possibilities of Data & ML.