Look Out for These Data Engineering Trends in 2023

Anjan Banerjee
HCLTech-Starschema Blog
7 min read · Dec 8, 2022

2022 saw large-scale layoffs and departures across major industries, and things generally seem to be on track towards a big reset. Credible voices are claiming we’re moving into a recession. So how can data-driven organizations prepare themselves to tackle whatever is to come in 2023 — and even emerge stronger?

An integral component of data-driven business is data engineering, and in this post, we’ll discuss four data engineering trends that you need to be aware of whether you’re a high-level decision-maker or a data engineer in the trenches. More than just fleeting fads, these trends are all likely to have a significant influence on data engineering in the coming years, beyond whatever recession may come, as they will enable any organization in any financial environment to fine-tune processes and optimize spending.

Photo by 夜 咔罗 on Unsplash

Data Cloud Cost Optimization

Because data engineering teams have had to concentrate on speed and agility to meet the extraordinarily high expectations placed on them in the last couple of years, best practices in the field of data cloud cost optimization are still relatively new — and rarely heeded. Instead of optimizing complex or degrading queries, teams spend the majority of their time developing new queries or feeding in more data.
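A practical first step is simply finding out where the compute budget actually goes. The sketch below is one illustrative way to do that, assuming a Snowflake environment and the snowflake-connector-python package; the connection details are placeholders, and the right views and roles will vary by account.

```python
# A minimal sketch of surfacing optimization candidates, assuming Snowflake
# and the snowflake-connector-python package. Connection details are
# placeholders; adjust for your own account, roles and warehouses.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",   # hypothetical connection details
    user="cost_auditor",
    password="...",
    warehouse="ADMIN_WH",
)

# Rank last week's queries by elapsed time to spot the ones worth tuning.
EXPENSIVE_QUERIES_SQL = """
    SELECT query_text,
           warehouse_name,
           total_elapsed_time / 1000 AS elapsed_seconds,
           bytes_scanned
    FROM snowflake.account_usage.query_history
    WHERE start_time >= DATEADD('day', -7, CURRENT_TIMESTAMP())
    ORDER BY total_elapsed_time DESC
    LIMIT 20
"""

for query_text, warehouse, seconds, bytes_scanned in conn.cursor().execute(EXPENSIVE_QUERIES_SQL):
    print(f"{warehouse}: {seconds:,.0f}s, {bytes_scanned:,} bytes -> {query_text[:80]}")
```

Even a simple report like this tends to show that a small number of queries and warehouses account for most of the spend, which is where optimization effort pays off first.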

The ease and elasticity of cloud storage and compute have driven demand to the point where many organizations now face massive cloud bills. The primary causes include storing multiple copies of the same data and duplicating work across teams operating in silos. Organizations will need to make significant efforts to optimize their cloud spending, and data engineers will play a crucial role in that process.

Optimizing the usage of cloud resources will be a key focus area in the coming year, and the way to achieve it faster and more effectively lies in another trend we expect to see in 2023:

Data Catalog, Observability and Quality

The goal of data engineering is to create, maintain and optimize data pipelines that move data from its source to its end users. Even though the workflow in these pipelines is now largely standardized, consisting of well-known steps to extract, transform and load data, the process remains very sensitive to changes in the data, whether in its structure or its values. Such changes can break pipelines outright and take them offline, directly affecting their availability. This is where data cataloging, observability and quality come in.
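To make that fragility concrete, here is a minimal, hypothetical sketch of the kind of defensive check a pipeline can run before transforming a new batch, so that an upstream column rename fails loudly instead of silently corrupting downstream tables. The table and column names are invented for illustration.

```python
# A hedged sketch of a pre-load schema check; table and column names are
# hypothetical, and a real pipeline would pull the expected schema from a
# data catalog or data contract rather than a hard-coded dict.
import pandas as pd

EXPECTED_SCHEMA = {
    "order_id": "int64",
    "customer_id": "int64",
    "order_total": "float64",
    "created_at": "datetime64[ns]",
}

def validate_schema(batch: pd.DataFrame) -> None:
    """Fail the pipeline early if the incoming batch drifted from the contract."""
    missing = set(EXPECTED_SCHEMA) - set(batch.columns)
    if missing:
        raise ValueError(f"Upstream schema change: missing columns {sorted(missing)}")
    for column, expected_dtype in EXPECTED_SCHEMA.items():
        actual = str(batch[column].dtype)
        if actual != expected_dtype:
            raise TypeError(f"Column {column!r} is {actual}, expected {expected_dtype}")

# Example: validate a freshly extracted batch before the transform step runs.
batch = pd.DataFrame({
    "order_id": pd.Series([1, 2], dtype="int64"),
    "customer_id": pd.Series([10, 20], dtype="int64"),
    "order_total": pd.Series([99.5, 15.0], dtype="float64"),
    "created_at": pd.to_datetime(["2022-12-01", "2022-12-02"]),
})
validate_schema(batch)  # raises if the structure drifted
```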

A data catalog is a useful tool for organizations looking to better manage and organize their data, and it can also be used to facilitate the discovery and access of data by internal users, and even external ones; more on that later.

Observability is the ability to monitor and diagnose the performance and behavior of a system or application. In the context of data, observability refers to the ability to monitor and understand the data flows and processes within your organization and to identify and diagnose any issues or problems that may arise.
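Even lightweight observability goes a long way: emitting a handful of metrics per pipeline run makes failures and slowdowns visible before users notice them. The sketch below logs row volume, freshness lag and run duration for a hypothetical daily load; dedicated tools layer lineage, anomaly detection and alerting on top of the same idea.

```python
# A minimal observability sketch: emit per-run metrics that make pipeline
# behavior visible. The pipeline name, metric names and load function are
# hypothetical; in production these values would go to a metrics backend.
import logging
import time
from datetime import datetime, timezone

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("orders_pipeline")

def run_daily_load() -> tuple[int, datetime]:
    """Placeholder for the real extract/transform/load step."""
    rows_loaded = 1_250_000
    newest_event = datetime(2022, 12, 7, 23, 58, tzinfo=timezone.utc)
    return rows_loaded, newest_event

start = time.monotonic()
rows, newest_event = run_daily_load()
duration_s = time.monotonic() - start
freshness_lag_s = (datetime.now(timezone.utc) - newest_event).total_seconds()

log.info("rows_loaded=%d duration_s=%.1f freshness_lag_s=%.0f", rows, duration_s, freshness_lag_s)
if rows == 0 or freshness_lag_s > 6 * 3600:
    log.warning("pipeline looks unhealthy: empty load or stale data")
```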

Data quality refers to the accuracy, completeness and consistency of data within your organization. Ensuring the quality of data is important because it can impact the reliability and effectiveness of the decisions and actions that are based on your data. Data quality can be improved through the use of data validation and cleaning tools, as well as by establishing and enforcing data governance policies and standards.
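As a simple illustration, the checks below enforce completeness, consistency and uniqueness rules on a batch before it is published. The thresholds, column names and the use of pandas are all assumptions made for the sake of the example; dedicated quality tools let you express the same rules declaratively and at scale.

```python
# A hedged sketch of basic quality rules: completeness (null ratio),
# consistency (value ranges) and uniqueness. Column names and thresholds
# are hypothetical.
import pandas as pd

def run_quality_checks(batch: pd.DataFrame) -> list[str]:
    failures = []
    # Completeness: customer_id should almost never be missing.
    null_ratio = batch["customer_id"].isna().mean()
    if null_ratio > 0.01:
        failures.append(f"customer_id null ratio {null_ratio:.2%} exceeds 1%")
    # Consistency: order totals must be non-negative.
    if (batch["order_total"] < 0).any():
        failures.append("order_total contains negative values")
    # Uniqueness: order_id is the business key and must not repeat.
    if batch["order_id"].duplicated().any():
        failures.append("order_id contains duplicates")
    return failures

# A deliberately dirty batch to show the checks catching all three problems.
batch = pd.DataFrame({
    "order_id": [1, 2, 2],
    "customer_id": [10, None, 30],
    "order_total": [99.5, -5.0, 15.0],
})
problems = run_quality_checks(batch)
if problems:
    raise ValueError("Quality checks failed: " + "; ".join(problems))
```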

A few tools to keep an eye on are OpenMetadata, Monte Carlo, Castor, Atlan and Immuta.

Photo by Arif Riyanto on Unsplash

Unistore and Multi-Model Databases

Data needed for analysis traditionally had to be moved from transactional databases into analysis-specific databases. Enter Snowflake with its new Unistore concept, which lets you carry out both transactional and analytical tasks directly on the same data, bridging the gap between OLTP and OLAP systems. This technology will help reduce the number of systems you need to maintain and eliminate the need to copy and move data between them.
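Snowflake delivers Unistore through hybrid tables, which were in preview at the time of writing. The sketch below is illustrative only, with invented table and column names: a single table serves both a transactional point lookup and an analytical aggregate, with no copy into a separate analytical store.

```python
# Illustrative only: Snowflake hybrid tables (the Unistore workload) let the
# same table serve transactional lookups and analytical aggregates. Table and
# column names are hypothetical; hybrid tables were in preview at the time of
# writing, so availability depends on your account.
import snowflake.connector

conn = snowflake.connector.connect(account="my_account", user="demo", password="...")
cur = conn.cursor()

cur.execute("""
    CREATE HYBRID TABLE orders (
        order_id     INT PRIMARY KEY,   -- hybrid tables require a primary key
        customer_id  INT,
        order_total  NUMBER(10, 2),
        created_at   TIMESTAMP_NTZ
    )
""")

# OLTP-style point write and lookup against the same table...
cur.execute("INSERT INTO orders VALUES (1, 10, 99.50, CURRENT_TIMESTAMP())")
cur.execute("SELECT order_total FROM orders WHERE order_id = 1")

# ...and an OLAP-style aggregate, with no copy into a separate analytical store.
cur.execute("SELECT customer_id, SUM(order_total) FROM orders GROUP BY customer_id")
```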

With the impact that Unistore will likely have on data engineering, we expect it won’t be too long before the competition starts coming out with alternative solutions to bridge the OLTP-OLAP gap.

At the same time, the line between relational, graph and document store databases is also blurring. A “multi-model” database is a newer type of database management system (DBMS) designed to support multiple data models and data manipulation paradigms. This means the DBMS can handle different types of data structures and ways of interacting with the data, such as relational, document and graph-based models, allowing users to choose the most appropriate model and paradigm for their specific needs and applications.
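One everyday sign of this blurring is that “relational” warehouses already query document-style data in place. The hedged sketch below uses Snowflake’s VARIANT type as one example, with an invented table and JSON shape; other engines expose similar JSON or graph capabilities under their own syntax.

```python
# A hedged illustration of relational and document models meeting in one
# engine: a JSON document stored in a Snowflake VARIANT column, queried with
# relational SQL. Table name and JSON shape are hypothetical.
import snowflake.connector

conn = snowflake.connector.connect(account="my_account", user="demo", password="...")
cur = conn.cursor()

cur.execute("CREATE TABLE IF NOT EXISTS events_raw (payload VARIANT)")
cur.execute("""
    INSERT INTO events_raw
    SELECT PARSE_JSON('{"customer": {"id": 10, "tier": "gold"}, "amount": 99.5}')
""")

# Dot-path access into the document, mixed freely with ordinary SQL.
cur.execute("""
    SELECT payload:customer.tier::STRING AS tier,
           SUM(payload:amount::FLOAT)    AS total_spend
    FROM events_raw
    GROUP BY tier
    ORDER BY total_spend DESC
""")
```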

Multi-model databases are especially useful in complex, large-scale environments where different data models and paradigms are needed to support different types of data and applications. Instead of maintaining separate tools and technologies for each data need, a multi-model database can reduce the total cost of ownership of the platform.

Emerging platforms to look out for include CockroachDB, Fauna and Firebolt — and, of course, it’s always worth taking a look at industry mammoths like Snowflake, Databricks, Redshift and Synapse.

Data Democratization

The trend of data democratization will continue to promote the empowerment of entire workforces, including data engineers and data scientists. Data democratization is the process of making data readily available and accessible to everyone who needs it within your organization, and even in an open data marketplace. This can lead to better decision-making, improved collaboration and innovation, as well as increased productivity.

One feature of data democratization that we can expect to be especially valuable in 2023 is that it helps to save cost in multiple ways.

First, ensuring that data and the tools for working with it are within easy reach for all employees makes for more efficient and effective processes and operations. This promotes cost savings by reducing the need for duplicate efforts and manual processes and by enabling employees to make better decisions that are based on higher-quality data — cutting down the frequency of and resources required for necessary course corrections.

Second, data democratization can help your organization optimize spending by reducing the reliance on specialized expertise and IT support for data-related tasks. When you lower the level of technical expertise necessary to engage more broadly and deeply with data, you also reduce your organization’s reliance on IT or data experts — who in turn can leverage their advanced knowledge to perform more value-added tasks.

Third, data democratization promotes cost savings by enabling you to make better use of your data assets. The wider the range of data that users can access, the greater the chances of them identifying new opportunities for using that data to drive business value and innovation, whether that means generating new revenue streams or developing more efficient and effective processes, products and services.

Fourth, organizations that have the opportunity to set up a data marketplace can generate extra revenue by enabling external parties to subscribe to their data and charging them for it. This will also incentivize data engineers to maintain and publish cleaner, more structured data in the marketplace, leading to increased trust in your data from both internal and external users and, ultimately, more subscribers.

The most prominent public data marketplace platforms you should be familiar with include AWS Data Exchange, Snowflake Marketplace, data.world and Dawex.

Photo by Renate Vanaga on Unsplash

Conclusion

Data engineering is a rapidly changing field, with new developments constantly emerging on the horizon, which can make it difficult to predict the exact trends we’ll see in any given year. That said, we can find solid clues in the trends shaping the business needs of the largest organizations, where it’s safe to expect that cost optimization efforts will largely define the coming year.

With the increasing adoption of cloud computing, it’s likely that data engineering will increasingly take place in the cloud, supported by AI and ML technologies. Together, these solutions will enable data engineers to automate and optimize data processes and pipelines well beyond current standards to introduce more streamlined data management and analysis practices with better outcomes.

About the author

Anjan Banerjee is the Field CTO of Starschema. He has extensive experience in building data orchestration pipelines, designing multiple cloud-native solutions and solving business-critical problems for multinational companies. Anjan applies the concept of infrastructure as code as a means to increase the speed, consistency, and accuracy of cloud deployments. Connect with Anjan on LinkedIn.
