The State of the Data Infrastructure Landscape in 2022 and Beyond

Key trends to expect in the data infrastructure domain in 2022 and beyond.


I recently came across two tweets on data Twitter that spurred my curiosity about what will happen in the data infrastructure landscape in 2022 and beyond.

Chris Riccomini and Gunnar Morling are two veterans in the data infrastructure space whom I regularly follow to keep up with what's going on.

This tweet by Chris and this one by Gunnar provided an excellent summary of what to expect in this space in 2022 and beyond. So I thought of taking those tweets to the next level by augmenting them with my own research and compiling this post.

The decade from 2010 to 2020

Gone is the decade in which many companies focused on building large-scale, complex systems that offer a great user experience. That focus gave rise to industry trends like microservices, Kubernetes, and serverless.

A few companies started to focus on analyzing the data generated by their users rather than on the operational systems. They called themselves "data-driven organizations" and started building "data products" that harness the power of user behavior. Companies like Amazon, Netflix, and Airbnb led this trend.


The decade from 2020 to 2030

The decade we live in, from 2020 to 2030, is considered the Data Decade. According to Forbes, now “every company is a data and analytics company.” That will result in more companies embracing the data-driven culture, treating data as a strategic asset, and building products that capitalize on data-driven decision-making.

As Prukalpa from Atlan correctly framed it, this decade will belong to the companies that build their data infrastructure to harness insights in four different ways.


The cultural shift from being operationally excellent to being data-driven will lead to the emergence of new roles, shifts in customer spending, and the emergence of new startups providing infrastructure and tooling around data.

The rest of this article discusses a few trends that I believe will be instrumental in driving the data industry forward in 2022 and beyond.

The Modern Data Stack

Modern Data Stack is a radically new approach to data integration that saves engineering time, allowing engineers and analysts to pursue higher-value activities.

Charles Wang, Fivetran

In a nutshell, the modern data stack (MDS) is a suite of software used to simplify data integration and analytics. A minimal viable MDS consists of a data ingestion tool, a cloud data warehouse, a data transformation tool, and a BI/analytics layer.

MDS differs from the traditional BI stack in that MDS components are hosted and managed in the cloud. Each layer of the stack can come from a different vendor, but ultimately everything is delivered as SaaS, reducing the time to set up a data stack from months to days.
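To make this concrete, here is a minimal sketch of the ELT pattern an MDS typically follows: an ingestion tool lands raw data in the warehouse, and the transformation step then runs as SQL inside the warehouse itself. All table and column names below are hypothetical.

```sql
-- The "EL" of ELT: an ingestion tool (e.g., Fivetran or Airbyte) has
-- already loaded raw data into the warehouse as raw.shop_orders.
-- The "T" then runs as plain SQL inside the warehouse.
create table analytics.daily_orders as
select
    order_date,
    count(*)        as order_count,
    sum(amount_usd) as revenue_usd
from raw.shop_orders
group by order_date;
```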

MDS aims to make data analytics accessible to everyone in the organization and to reduce the time to insights. Hence, I expect MDS to gain even more traction in 2022.

If you are curious to learn more about MDS, there are plenty of great introductions to start with.

Analytics engineering and dbt

Analytics engineers provide clean data sets to end users, modeling data in a way that empowers end users to answer their questions. While data analysts spend their time analyzing data, analytics engineers spend their time transforming, testing, deploying, and documenting data. Analytics engineers apply software engineering best practices like version control and continuous integration to the analytics code base.

Claire Carroll, getdbt.com

As Claire correctly framed it above, the analytics engineer is a role that has emerged in the data space over the past couple of years. Analytics engineers sit at the intersection of business teams, data analysts, and data engineering, bringing robust, efficient, and integrated data models to life.

dbt (not DBT, please!) is the best tool in the analytics engineer's tool belt. It is a data transformation workflow tool (the "T" in ELT) that lets teams quickly and collaboratively deploy analytics code, following software engineering best practices like modularity, portability, CI/CD, and documentation.

The goal of dbt is to make building production-grade data pipelines accessible without relying on data engineers. Going forward, the modern data stack, analytics engineering, SQL, and dbt will work in unison to quickly provision data integrations that deliver business insights in minutes rather than hours.
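To give a flavor of what this looks like in practice, here is a minimal sketch of a dbt model, with hypothetical model names. A dbt model is just a SQL SELECT statement saved in a file; the {{ ref() }} macro resolves to the upstream model's table, which is how dbt builds its dependency graph and lineage.

```sql
-- models/marts/daily_orders.sql: a minimal, hypothetical dbt model.
-- dbt materializes this SELECT as a table or view in the warehouse.
-- ref('stg_orders') points at an upstream staging model, so dbt knows
-- to build stg_orders first and records the lineage between the two.
select
    order_date,
    count(*)        as order_count,
    sum(amount_usd) as revenue_usd
from {{ ref('stg_orders') }}
group by order_date
```

Running `dbt run` builds the models in dependency order, and `dbt test` validates them against tests declared alongside the models, which is where the version control, CI/CD, and documentation practices mentioned above come in.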

The analytics engineer sits between the data engineer and the data analyst.

For more information:

What is analytics engineering? — Claire Carroll

Analytics Engineer: The Newest Data Career Role — Madison Schott

Streaming databases/low-latency OLAP databases

User-facing analytics has been picking up speed over the past couple of years. Companies that build user-facing data products have been looking for databases that can serve OLAP queries at millisecond latencies.

Low-latency, or real-time, OLAP databases are designed to answer complex OLAP queries at millisecond latencies and very high throughput (more than 100k QPS). That is a paradigm shift from the classic OLAP model, which was restricted to a handful of internal analysts who could wait minutes for a response.

These databases are also called streaming databases because they can ingest a real-time stream of data and immediately make it available for querying — aiming for greater data freshness.

When it comes to implementation, there are two main strategies.

Incrementally updated materialized view engines

Engines like Flink, ksqlDB, and Materialize let you define analytical queries in SQL. They then use the principles of stateful stream processing to keep the results of those queries incrementally updated as new data comes in.

Although this is somewhat similar to the materialized views in classic data warehouse literature, these databases operate in a more scalable, economical, and high-performance manner.
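As a sketch of the idea, here is what an incrementally maintained aggregate looks like in ksqlDB-style SQL; the orders stream and its columns are hypothetical.

```sql
-- Assumes an 'orders' stream has already been declared over a Kafka
-- topic. The engine keeps this table continuously up to date: each
-- incoming event adjusts the affected counts instead of re-running
-- the whole query from scratch.
CREATE TABLE orders_per_minute AS
  SELECT item_id,
         COUNT(*) AS order_count
  FROM orders
  WINDOW TUMBLING (SIZE 1 MINUTE)
  GROUP BY item_id
  EMIT CHANGES;
```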

Real-time OLAP databases with scatter-gather query execution

Apache Druid, Apache Pinot, ClickHouse, and Rockset are leading examples in this space. They ingest incoming data into structures called "segments," which are then placed on different servers and indexed to minimize the time needed to read them back.

When an OLAP query arrives, it is scattered across the servers hosting the relevant segments, and the partial results from each server are gathered and merged to compose the final result.
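For example, a typical user-facing query against such a database is plain SQL; what differs is the execution path underneath. Here is a Pinot-style sketch with a hypothetical table:

```sql
-- The broker scatters this query to every server holding a segment of
-- 'page_views'; each server aggregates its local segments in parallel,
-- and the broker gathers and merges the partial results.
SELECT country,
       COUNT(*) AS views
FROM page_views
WHERE event_time > ago('PT1H')  -- Pinot helper: epoch millis one hour ago
GROUP BY country
ORDER BY views DESC
LIMIT 10;
```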

Classification of streaming databases

For more information:

Understanding Materialized Views — Part 2

Metadata management and data catalogs

A data catalog creates and maintains an inventory of data assets through the discovery, description, and organization of distributed datasets. The data catalog provides context to enable data stewards, data/business analysts, data engineers, data scientists, and other line of business (LOB) data consumers to find and understand relevant datasets to extract business value.

— Gartner, Augmented Data Catalogs 2019

A data catalog is a central place to index and store an organization’s data assets across different data sources. It helps organizations discover, understand, and consume data better while serving as a single source of truth for any data item.

A typical data catalog delivers the following features (a minimal sketch follows the list):

  • Create a repository of all your data
  • Allow users to access the metadata
  • Make data lineage visible and understandable
  • Ensure data consistency and accuracy
  • Simplify data governance and compliance
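A full-featured catalog goes far beyond this, but as a minimal illustration of the "inventory of data assets" idea, every SQL warehouse already exposes raw metadata through the standard information_schema views; a catalog layers descriptions, ownership, lineage, and governance on top of metadata like this.

```sql
-- A primitive metadata inventory: list every table and column the
-- warehouse knows about. A data catalog enriches entries like these
-- with descriptions, owners, tags, lineage, and usage statistics.
SELECT table_schema,
       table_name,
       column_name,
       data_type
FROM information_schema.columns
ORDER BY table_schema, table_name, ordinal_position;
```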

To win today's race in data analytics, it is no longer enough to have a massive amount of data. Organizations need solutions like data catalogs to centrally define, access, and govern that data without descending into chaos.

For more information:

What is a data catalog?

Data lakehouses and open data architecture

Data warehouses have been around for decades and leaped forward in 2013 when Amazon Redshift first brought data warehousing to the cloud. Data lakes came into the picture when Hadoop, Hive, and the rest of the big data technologies arrived.

Just as companies were having to evaluate whether a data warehouse or a data lake was the right choice for their business, a new paradigm called the "data lakehouse" emerged.

A data lakehouse combines the best of both worlds: the data warehouse and the data lake. Monte Carlo lists the following functionalities that are helping data lakehouses further blur the lines between the two technologies:

  • High-performance SQL: technologies like Presto and Spark provide SQL interfaces at close to interactive speeds over data lakes. That opened up the possibility of data lakes directly serving analysis and exploration needs without requiring summarization and ETL into traditional data warehouses.
  • Schema: file formats like Parquet introduced a more rigid schema to data lake tables, along with a columnar layout for greater query efficiency.
  • Atomicity, Consistency, Isolation, and Durability (ACID): lake technologies like Delta Lake and Apache Hudi introduced greater reliability in write/read transactions, taking lakes a step closer to the highly desirable ACID properties that are standard in traditional database technologies (see the sketch after this list).
  • Managed services: for teams that want to reduce the operational lift of building and running a data lake, cloud providers offer various managed lake services. For example, Databricks offers a managed version of Apache Hive, Delta Lake, and Apache Spark, while Amazon Athena offers a fully managed lake SQL query engine and AWS Glue offers a fully managed metadata service.
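To illustrate the ACID point, here is a minimal sketch using Spark SQL with Delta Lake; the database and table names are hypothetical. Delta maintains a transaction log on top of Parquet files, which is what gives a lake table its transactional guarantees.

```sql
-- Create a lake table backed by Parquet files plus a Delta
-- transaction log; writes commit atomically through the log.
CREATE TABLE lakehouse.page_views (
    user_id   STRING,
    url       STRING,
    viewed_at TIMESTAMP
) USING DELTA;

-- Appends are transactional: a concurrent reader sees this row either
-- fully committed or not at all, never a partially written file.
INSERT INTO lakehouse.page_views
VALUES ('u42', '/pricing', current_timestamp());
```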

Data platform as a service (dPaaS)

Many organizations strive to build data platforms that cater to their own needs and take pride in doing so. But that is not for everyone.

Most organizations, especially those just starting their data journey, prefer someone else to build and manage the data platform for them. Or, at the very least, they want the data platform as a service, so that in-house data engineers and analysts can quickly provision data pipelines without messing around with server installations, backups, and monitoring.

All they need is reduced time to insights!

Companies like Meroxa and Decodable saw this opportunity and rose to the occasion, providing self-service data platforms as a service.

I believe there will be more like them in 2022.

Conclusion

The data infrastructure landscape is constantly changing. It is good to see many startups coming up with innovative ideas to address the gaps in the current data space.

Although I did not discuss it in detail here, concepts like reverse ETL, metrics stores, headless BI, and DataOps are shaping up nicely in the industry.

The key takeaway is that many organizations will become more data-oriented and invest more in data analytics infrastructure in this Data Decade.

If you are a consumer, adapt and survive.

If you are a vendor, capitalize and thrive.
