Modern Data Stack: The end is nearer than it appears…

Rakesh Kumar
8 min read · Jun 18, 2023


All change is hard at the beginning, messy in the middle and gorgeous at the end. … So are our data pipelines.

The Modern Data Stack

We’ve witnessed a remarkable shift in the realm of data engineering, moving away from rigid, hardcoded data pipelines to embrace automated, configurable low-code/no-code pipelines. This transformation has been powered by the progress of the Modern Data Stack, a suite of vendor products and tools that has revolutionised the field of data engineering as a whole. With these, activities like data collection, transformation, operationalisation and analysis have become significantly more streamlined and accessible.

Journey to the Modern Data Stack

We’ve come a long way from the era when databases like PG or DB2 were the primary sources of data. Today, enterprises rely on a multitude of systems, both on-premises and public cloud-native services, and even data sources originating beyond the confines of the organization. The Modern Data Stack has equipped us to adapt seamlessly to the ever-changing landscape of data sources and patterns. Gone are the days of data silos, as these modern data tools have shattered barriers and democratized data access across the enterprise. Now, a majority of users and processes can tap into the power of the data available to them. The Modern Data Stack has risen to the occasion, effortlessly managing the sheer volume and unpredictability of data in enterprises where traditional ETL approaches struggled.

Thanks to these advancements, we now have the ability to meet the soaring demands for enterprise data transformation at scale. The Modern Data Stack suite has enabled us to navigate the vast landscape of data with unparalleled efficiency and agility.

Modern Data Stack Ecosystem (key players)

Our data pipelines underwent a rapid transformation, transitioning from the traditional ETL (Extract, Transform, Load) approach to the more agile ELT (Extract, Load, Transform) model. This accelerated shift was driven by players such as Databricks, Snowflake, DBT, and Fivetran, propelling us into the realm of the Modern Data Stack. This innovative paradigm offered diverse semantics tailored to the unique needs of each enterprise, empowering them to select the most suitable products and tools for their business requirements.
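To make the shift concrete, here is a minimal ELT sketch in Python: the raw records are loaded first and only then modelled with SQL inside the warehouse, the pattern popularised by DBT. SQLite stands in for the warehouse, and the table and column names are illustrative assumptions rather than any particular product's conventions.

```python
# A minimal ELT sketch: land the data as-is, then transform it with SQL
# inside the "warehouse". sqlite3 stands in for a real warehouse
# (Snowflake, BigQuery, Databricks); all names are illustrative only.
import sqlite3

raw_orders = [
    ("2023-06-12", "EU", 120.0),
    ("2023-06-13", "US", 80.5),
]

conn = sqlite3.connect(":memory:")

# Extract + Load: no transformation yet, the raw shape is preserved.
conn.execute("CREATE TABLE raw_orders (order_date TEXT, region TEXT, amount REAL)")
conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", raw_orders)

# Transform: model the data with SQL inside the warehouse (the DBT pattern).
conn.execute("""
    CREATE TABLE daily_sales AS
    SELECT order_date, region, SUM(amount) AS total_amount
    FROM raw_orders
    GROUP BY order_date, region
""")

print(conn.execute("SELECT * FROM daily_sales").fetchall())
```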

With the Modern Data Stack, our data teams gained the freedom to mix and match technologies, creating a customized environment that works best for their specific needs. This flexibility enables us to unlock tremendous value within our data ecosystem. Not only does it expedite the delivery of data to its consumers, but it also enhances accessibility, empowering growth and driving advancements in machine learning.

Modern data stack pipeline

“The convergence of Databricks, Snowflake, DBT, and Fivetran has propelled us forward, enabling us to harness the full potential of our data resources and fueling our journey towards data-driven Enterprises.”

ETL → ELT → elT

… The birth of zero ETL

Anyone working in data engineering understands the significant costs associated with data ingestion and movement. Data engineers often dedicate much of their time to integrating with evolving source systems and to numerous meetings discussing schema and frequency requirements. However, with the advent of zero ETL setups, the process of moving data from transactional systems to warehouses or analytical systems has become streamlined and efficient, eliminating the need for intermediate tools or data movement. This allows for the quick transfer of data between the transactional and analytical worlds.

Zero ETL Setup

Zero ETL solutions, initially introduced by Azure Synapse Analytics and subsequently adopted by Amazon Redshift, BigQuery, Snowflake Unistore, and Databricks Delta, offer a seamless data replication process without the need for data transformation or manipulation. This approach bridges the gap between the speed of transactional systems and the requirements of analytical processing. The burdensome data ingestion process becomes a thing of the past: one fewer point of failure, and no storage latency before the data lands in the warehouse.
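Under a zero ETL setup, the remaining "pipeline" is little more than a query against the already-replicated tables. The sketch below assumes a zero-ETL integration is keeping a hypothetical `replicated_oltp.orders` table in sync; the connection object and the SQLite-flavoured date function are illustrative placeholders, not any vendor's API.

```python
# With a zero-ETL integration in place, the transactional tables are already
# replicated into the analytical engine, so the only "pipeline" left is SQL.
# `replicated_oltp.orders` and the connection object are illustrative
# assumptions; the date function is SQLite-flavoured and varies by warehouse.
ANALYTICAL_QUERY = """
    SELECT order_date, SUM(amount) AS daily_revenue
    FROM replicated_oltp.orders        -- kept in sync by the zero-ETL integration
    WHERE order_date >= DATE('now', '-7 day')
    GROUP BY order_date
    ORDER BY order_date
"""

def weekly_revenue(warehouse_conn):
    """Run the analytical query directly; no extract or load job is needed."""
    return warehouse_conn.execute(ANALYTICAL_QUERY).fetchall()
```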

“By unifying transactional and analytical storage (within the same vendor's ecosystem), zero ETL has transformed the landscape of data engineering systems. With pipelines no longer requiring effort on data ingestion and movement, the embrace of zero ETL prompts a reevaluation of the responsibilities and skill sets required in the evolving data engineering field.”

One pill makes you larger
And one pill makes you small
And the ones that mother gives you…

The journey of the Modern Data Stack is not without its challenges. The landscape is dotted with overlapping and competing features scattered across various vendor product suites. This lack of a unified platform view and foundational capabilities poses a significant challenge, especially for enterprises seeking centralized authentication, ingestion control, governance, and the ability to seamlessly integrate existing enterprise services. It’s like having all the puzzle pieces without a clear picture to guide you. To truly harness its potential, enterprises often find themselves building and stitching together various tools depending on their unique contexts.

“The legacy of the Modern Data Stack has been somewhat confusing for enterprises, leading them in different directions or, in some cases, nowhere at all. Its promise of removing data silos and democratizing data access is indeed noteworthy, but it falls short of fulfilling the dream of a comprehensive data platform infrastructure that many envisioned. The level of abstraction introduced by the Modern Data Stack has, at times, distanced engineering teams from the underlying details and considerations they once had to address.”

To truly unlock the power of the modern data stack, its integration with advanced technologies such as Machine Learning and Artificial Intelligence is imperative. The journey remains incomplete without embracing the native integration and “AI-fication” that is becoming increasingly crucial in the era of AI-driven data insights and innovations.

The way we consume data in enterprises has become more complex, and at the same time the original 3Vs have turned into 5Vs (volume, velocity, variety, veracity and value), with a greater focus on the quality and value of the data.

Have you created a data pipeline architecture as below?

A realistic enterprise data engineering reference

More often than not, this is the typical representation of the data ecosystem within enterprises: disparate data sources, mainframe systems, and a variety of ETL tools employed for data ingestion and transformation. Moreover, multiple storage and processing tools are deployed, and analytics projects encompass numerous subsystems. Each subsystem requires a distinct set of capabilities, often necessitating products from multiple vendors. Integrating these products can be a complex, fragile, and expensive endeavor, and often involves an abundance of interdependent jobs running to generate the desired datasets or reports. Despite these efforts, data silos, duplication and inadequate governance persist across the data products. Many data consumers lack visibility into the origin and curation process of the data they consume, as well as its quality. Any failure in the execution of a single job can result in the unavailability of the required data or reports, leading to potential business disruptions.

Fabric: Stitching it all together

We require a comprehensive data platform that eliminates the need for individual analytical services to be pieced together by data engineering teams, who often end up assuming the role of data integration teams. Instead, we need a unified data platform that seamlessly integrates all aspects, including data storage and compute, engineering capabilities, and data science and AI functionalities, while ensuring robust security, governance, and compliance measures are in place.

A data fabric layer serves as a connection between the various data services and tools used in data engineering, offering capabilities in areas like data access, discovery, transformation, integration, security, governance, lineage, and orchestration. Acting as a platform, it enables the integration of different data sources and destinations, providing a unified approach to managing and consuming data. Enterprises can monitor storage, performance, efficiency, cost, and user activities. The specific technologies and components of a data fabric layer may vary depending on the organisation’s needs and the chosen underlying technologies.
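As an illustration only, the sketch below shows the kind of abstraction a fabric layer provides: a single entry point that resolves a dataset from a catalog, enforces an access policy and records lineage before handing back a reader for the underlying source. The class names and the dataset are hypothetical, not a real product's API.

```python
# An illustrative sketch of what a data fabric layer abstracts: catalog-based
# discovery, centralised governance and automatic lineage capture in front of
# the underlying storage. All names here are hypothetical.
from dataclasses import dataclass, field


@dataclass
class DatasetEntry:
    name: str
    location: str          # e.g. a warehouse table or an object-store path
    owner: str
    allowed_roles: set


@dataclass
class FabricLayer:
    catalog: dict                          # dataset name -> DatasetEntry
    lineage: list = field(default_factory=list)

    def read(self, dataset: str, role: str):
        entry = self.catalog[dataset]       # discovery via the catalog
        if role not in entry.allowed_roles:  # centralised governance check
            raise PermissionError(f"{role} may not read {dataset}")
        self.lineage.append((role, dataset))  # automatic lineage capture
        return f"reader for {entry.location}"  # delegate to the real source


fabric = FabricLayer(catalog={
    "sales.daily": DatasetEntry("sales.daily", "warehouse.analytics.daily_sales",
                                owner="finance", allowed_roles={"analyst"}),
})
print(fabric.read("sales.daily", role="analyst"))
```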

Fabric addresses the gap that often arises when working with the fragmented services of the Modern Data Stack, where integrating these products is a complex, costly and frustrating undertaking.

The four key pillars of a data fabric are: data integration services, automatic collection and curation of metadata, AI/ML models for analysis, and a knowledge graph for data relationships.

Data Fabric Layers

The fabric sum is worth much more than its parts. Providing storage to all tenants and compute for their payloads without requiring data duplication is a big deal.

Microsoft Fabric is one such offering, recently announced; it seamlessly connects various Azure data services while providing a robust semantic layer. Businesses can now enjoy the benefits of a unified experience, including a single AuthN/AuthZ (Authentication and Authorization) framework, a consolidated user interface, a streamlined storage layer, a unified permission model, and the added convenience of receiving a single monthly cloud bill.

Beyond MDS: AI-fication with LLMs

What were the sales last week?

In current data engineering processes, business stakeholders express their requirements, models and metrics to data teams, who do the heavy lifting to produce the required SQL output.

The usual tools of data analysis and visualization, such as R, Python, Tableau, Power BI and SQL, require specialised skills and professional experience to use.

What if we gave our data analysts the ability to consume data in natural language rather than SQL?

Leveraging large language models would help with discovery, pattern recognition, entity matching and other challenges associated with a lack of organization. AI doesn’t care whether a book has a table of contents or an index; those are built for humans. LLMs like ChatGPT would greatly simplify access to data and speed up insights.
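A minimal sketch of that idea: the warehouse schema and the natural-language question are sent to an LLM, which drafts the SQL a data team would otherwise hand-write. The `call_llm` function is a placeholder for whichever model API the enterprise uses, and the `sales` table is a hypothetical schema; both are assumptions for illustration only.

```python
# Natural language in, SQL drafted by the model, results from the warehouse out.
# `call_llm` is a placeholder for the enterprise's LLM client (e.g. ChatGPT);
# the `sales` table is a hypothetical schema used only for illustration.
SCHEMA = "Table sales(order_date DATE, region TEXT, amount NUMERIC)"

def build_prompt(question: str) -> str:
    # Give the model the schema so it can ground the query it writes.
    return (
        "You write SQL for the warehouse schema below.\n"
        f"{SCHEMA}\n"
        f"Question: {question}\n"
        "Return only the SQL query."
    )

def call_llm(prompt: str) -> str:
    # Placeholder: swap in the chosen model API here.
    raise NotImplementedError("plug in your LLM client")

def answer(question: str, warehouse_conn):
    sql = call_llm(build_prompt(question))
    return warehouse_conn.execute(sql).fetchall()

# Example usage (once call_llm is wired up):
# answer("What were the sales last week?", warehouse_conn)
```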

Should we build narrow tables (star schema) or wide tables (OBT)? What should our warehouse design look like? What will be the best way to store and model data to leverage LLM capabilities?

AI will change things about data too…
