The Aspiration for a Modern Data Experience— part 1

Himanshu Gaurav
10 min read · Jun 17, 2024


In today's rapidly evolving business landscape, the sheer volume of data being generated is staggering. This data deluge has transformed how businesses operate, making it crucial for organizations to leverage data effectively to stay ahead of the competition; as a result, companies are investing heavily in the "Modern Data Stack." The challenge is that the definition of a Modern Data Stack keeps changing. Companies shift gears and fall prey to technologies and buzzwords (Data Fabric, Data Mesh, Data Foundation Strategy, Data Marketplace, Data Architecture, Data-as-a-Product, Data-as-a-Service, Uniform Formats, Open Architecture, Reverse ETL, Auto-scaling Pipelines, Open Formats, Modern Data Platform, Serverless, etc.).

That said, the new technology has generated a lot of excitement due to its impressive capabilities. Tasks that were once difficult, time-consuming, and costly have become quick, easy, and affordable. Previously, companies spent significant resources and still faced substantial downtime maintaining their data warehouses and databases, with scaling being a complex and lengthy process. Today, these tasks can be accomplished within a few hours.

However, the modern data stack is far less impressive to the everyday analyst, business user, and other non-technical personas, because it does not provide the aid they need in their day-to-day operations. The Modern Data Stack should focus on delivering an "Experience" rather than state-of-the-art technology; prioritizing technology for its own sake is a pitfall in most cases.

Data Experience is the key (Credit: Co-pilot Generated Image)

Let’s consider the following analogy to understand the experience better. Imagine a cutting-edge car factory that has invested in the latest manufacturing robots, highly efficient assembly lines, and advanced quality control systems. The factory operates with precision, producing cars that meet exacting standards of quality and reliability. Thanks to the latest technology, the production process is streamlined, from parts procurement to the final assembly.

However, there's a catch: the factory only produces one car model in a single color with a fixed set of features. Although it is technically capable of customizing cars, the factory doesn't accept special requests. Customers can't ask for different colors, additional features, or modifications. The factory is excellent at producing its one model efficiently, but it doesn't offer the flexibility to meet specific customer requirements.

Extending this analogy to ad-hoc analysis requirements in the data space, we need numerous customizations, such as obtaining data quickly for rapid analysis and ensuring semantic consistency, so that datasets can be combined to extract insights that leadership may find valuable.

How do you solve the experiential requirements of ad-hoc analysis in this scenario?

The above is not just a technical problem but an experiential one. Let's delve into the experiential aspirations of a data consumer step by step, starting with the aspirations for ad-hoc analysis.

Ad-hoc Analysis

Let's first examine, through an experiential lens, the solutions extensively utilized across industries to fulfill an organization's ad-hoc analysis requirements, along with their corresponding challenges and plausible solutions.

Enterprise Data Warehouse (EDW)

Data warehouses are specialized data management systems (OLAP) designed to store recent and historical data from core business systems through various Data Movement Processes. They are structured to mirror an organization’s business processes, enabling easy insight and reporting.

These are designed with “Schema on Write”, a traditional approach in which data is first structured and transformed before being loaded into a data storage system.

The construction of a data warehouse involves modeling business entities and processes, representing the data as a logical data model, and implementing it as a physical data model based on database technology. This process transforms raw data into valuable business information that BI and Dashboarding Tools use.
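To make the schema-on-write idea concrete, here is a minimal Python sketch (not tied to any particular product) that uses SQLite as a stand-in for a warehouse table; the table and column names are hypothetical. The key point is that the structure is declared first, and raw records are conformed to it before anything is stored.

```python
import sqlite3

# Schema on write: the target structure is defined up front.
con = sqlite3.connect("warehouse.db")  # stand-in for a real EDW
con.execute("""
    CREATE TABLE IF NOT EXISTS fact_sales (
        order_id    INTEGER PRIMARY KEY,
        order_date  TEXT NOT NULL,       -- ISO-8601 date
        customer_id INTEGER NOT NULL,
        amount_usd  REAL NOT NULL
    )
""")

# Raw records are validated and transformed *before* loading.
raw = [{"id": "1001", "dt": "2024-06-01", "cust": "42", "amt": "19.99"}]
rows = [(int(r["id"]), r["dt"], int(r["cust"]), float(r["amt"])) for r in raw]
con.executemany("INSERT OR REPLACE INTO fact_sales VALUES (?, ?, ?, ?)", rows)
con.commit()
```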

Enterprise Data Lake (EDL)

A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale. Data lakes are designed with "Schema on Read," a data management method that differs from the traditional schema-on-write approach. It allows various data formats to be stored as files in object stores.

Schema on read is known for its speed and scalability: because the schema does not have to be defined before data is stored, ingestion is faster, the system scales rapidly, and new data sources can be brought in on the fly, which makes data easier to access.
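As a contrast to the warehouse sketch above, here is a minimal schema-on-read sketch in Python; the lake path, file name, and fields are hypothetical, and a local folder stands in for an object store. Raw events are landed exactly as they arrive, and types and structure are applied only when the data is read.

```python
import json
from pathlib import Path

import pandas as pd

# Schema on read: land raw events exactly as they arrive (no upfront modeling).
lake = Path("lake/raw/web_events")          # stand-in for an object-store prefix
lake.mkdir(parents=True, exist_ok=True)
(lake / "2024-06-01.json").write_text(json.dumps(
    [{"cust_id": "42", "event": "add_to_cart", "ts": "2024-06-01T10:15:00Z"}]
))

# The schema is imposed only at read/query time, per use case.
events = pd.read_json(lake / "2024-06-01.json")
events["ts"] = pd.to_datetime(events["ts"])          # type applied on read
events["cust_id"] = events["cust_id"].astype(int)
print(events.dtypes)
```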

The data lake is designed to handle large volumes of data while accounting for veracity and velocity, but it does not inherently align with a business model. Extracting meaningful information, known as features, is essential to utilize this data for Data Science applications. Modern applications leverage machine learning and AI to analyze the data within a data lake, enabling the development of models for forecasting behaviors and facilitating informed decision-making.
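For instance, a data scientist might derive per-customer features from raw lake transactions along these lines; this is a hedged pandas sketch with made-up sample values and feature names, not a prescribed method.

```python
import pandas as pd

# Raw, semi-curated transaction events as they might sit in the lake.
txns = pd.DataFrame({
    "cust_id": [42, 42, 7, 7, 7],
    "amount":  [19.99, 5.00, 120.00, 35.50, 12.25],
    "ts": pd.to_datetime(["2024-05-01", "2024-05-20", "2024-04-02",
                          "2024-05-15", "2024-06-01"]),
})

# Derive per-customer features for a downstream model (e.g., propensity or churn).
features = txns.groupby("cust_id").agg(
    txn_count=("amount", "size"),
    total_spend=("amount", "sum"),
    last_purchase=("ts", "max"),
)
features["days_since_last"] = (pd.Timestamp("2024-06-17") - features["last_purchase"]).dt.days
print(features)
```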

Aspiration for Ad-hoc Analysis

In an ecosystem where the EDW and EDL exist as separate entities, we have to manually bring together all the required datasets from various sources (vendor platforms and in-house systems) to execute an ad-hoc analysis quickly. The datasets arrive in different forms and shapes, so we must understand the semantics of each, blend them appropriately, and finally derive the required outcome. We lose precious time and have to interact with multiple teams to make this happen. It is a herculean, time-consuming task for data consumers.

Is there a way for the company/data personas to leverage both the flexibility of data lakes and the structure of data warehouses? The aspiration is to unify the capabilities of both solutions so organizations can enhance decision-making, improve customer experiences, and stay competitive.

Plausible Solution & Journey:

The current trend in the Data Ecosystem involves attempts to unify all data silos and minimize the friction of operating Enterprise Data Warehouses (EDW) and Enterprise Data Lakes (EDL) as separate entities. This concept is known as the Data Lakehouse (the term is used here as a generic concept rather than a proprietary solution).

A Data Lakehouse enables organizations to consolidate all enterprise Data Assets under a single roof with consistent Governance. It leverages the strengths of Data warehouses and Data Lakes, aiming to provide a one-stop shop for all enterprise data needs.
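One concrete (but by no means exclusive) way to picture this generic idea is an open table format on shared storage that both SQL engines and Python/ML consumers can read. The sketch below uses the open-source deltalake and pandas packages purely as an illustration, with a local path standing in for object storage; the table name and values are hypothetical.

```python
import pandas as pd
from deltalake import DeltaTable, write_deltalake

# Write a governed, ACID table in an open format on (object) storage.
orders = pd.DataFrame({"order_id": [1, 2], "customer_id": [42, 7],
                       "amount_usd": [19.99, 120.00]})
write_deltalake("lakehouse/orders", orders, mode="overwrite")

# The same table serves BI-style SQL engines and Python/ML consumers alike.
df = DeltaTable("lakehouse/orders").to_pandas()
print(df)
```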

To fully empower users of the ecosystem with access to all the data generated and curated by the data team, it is essential that Data Assets are easily accessible and well governed.

Consider, for example, an ad-hoc analysis scenario that requires quick access to data from various sources and demands semantic consistency across the data flowing in from those systems.

Quick Access to Data

The ad-hoc analysis involves quickly exploring and querying data from various sources to gather insights on the fly. Query Federation comes to the rescue when rapid data access is needed without physically moving it. It simplifies data access for business users, data scientists, and analysts, enabling them to perform ad-hoc analysis without relying heavily on IT or data engineering teams.

Query federation is a data integration method that allows users to run queries and retrieve data from multiple data sources as if all the data were in one location. This approach provides a unified view of data without physically moving it, giving the impression that all the data is stored in a single place, even if it’s spread across different databases, data lakes, or other sources.
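Here is a minimal sketch of the idea, using DuckDB purely as an illustration (the article is not tied to any particular engine): a Parquet file stands in for the data lake and a SQLite database for an operational system, and one SQL statement joins them without copying either into a central store first. It assumes the duckdb, pandas, and pyarrow packages plus DuckDB's sqlite extension; file, table, and column names are hypothetical.

```python
import sqlite3

import duckdb
import pandas as pd

# Source 1: a "data lake" file (Parquet) with online transactions.
pd.DataFrame({"cust_id": [42, 7], "amount": [19.99, 120.00]}).to_parquet("online_txns.parquet")

# Source 2: an operational database (SQLite as a stand-in) with CRM segments.
crm = sqlite3.connect("crm.db")
crm.execute("CREATE TABLE IF NOT EXISTS customers (cust_id INTEGER, segment TEXT)")
crm.executemany("INSERT INTO customers VALUES (?, ?)", [(42, "loyal"), (7, "new")])
crm.commit()

# One SQL statement spans both sources; no data is copied into a single store first.
con = duckdb.connect()
con.execute("INSTALL sqlite; LOAD sqlite;")
con.execute("ATTACH 'crm.db' AS crm (TYPE SQLITE)")
result = con.execute("""
    SELECT c.segment, SUM(t.amount) AS total_spend
    FROM read_parquet('online_txns.parquet') AS t
    JOIN crm.customers AS c USING (cust_id)
    GROUP BY c.segment
""").df()
print(result)
```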

Query Federation in a Modern Data Platform

Let’s try to understand this requirement with an example of a large retail chain that wants to enhance its customer loyalty program by leveraging data from multiple sources to provide personalized offers and improve customer engagement. The retailer operates online and offline stores and has data distributed across various systems, such as point-of-sale (POS) systems, customer relationship management (CRM) software, a data lake, and a data warehouse.

  • Enterprise Data Lake (EDL): Contains online transaction data, browsing history, and customer profiles.
  • POS System: Collects in-store purchase data and customer interactions.
  • CRM System: Holds customer contact information, interaction history, and support tickets.
  • Enterprise Data Warehouse (EDW): Aggregates historical sales data, inventory levels, and marketing campaign performance.

A marketing executive wants an analyst on their team to perform an ad-hoc analysis to understand customer behavior patterns, identify high-value customers, and tailor marketing strategies accordingly. The analyst contacts the data engineering team to ask whether the required datasets are accessible and flags the request as high priority, since it comes from an executive.

This ad-hoc analysis requires integrating data from multiple sources, some of which might be available in a Data Lake or Data Warehouse and some directly from source systems. Due to this request's high-priority and urgent nature, the data engineering team might not have the option of building new pipelines for missing data sources.

The data team can utilize the query federation feature of their data platform to facilitate querying across different data sources without consolidating the data into a single repository. This can be achieved by configuring the tool to connect to the diverse data sources using system-specific connectors or adapters.

Analysts may struggle to write a single SQL query to fetch data from multiple connected sources. Even though they have the necessary data, they may find it challenging to merge the datasets because of differences in data models, schemas, data granularity, business rules, etc. The key to overcoming this challenge is to ensure semantic consistency.

Semantic Consistency

Let's look at semantic consistency through the lens of ad-hoc analysis requirements. Typically, the EDW holds well-modeled, BI-ready datasets enriched with the appropriate business context. The EDL, by its nature, may hold datasets of varying quality across different layers. Since the workloads (AI/ML workloads) that use the data lake often consume data in semi-curated form, the data is modeled the way each use case demands. So, when we say we must bring semantic consistency to the data ecosystem, we must find a balance between the two worlds (EDW and EDL) so that datasets are reliable and consistently interpretable for ad-hoc analysis.

Semantic consistency is a crucial aspect of data management that ensures that data is interpreted consistently across an organization, regardless of its source. This involves maintaining standardized definitions, formats, and meanings for data elements to ensure that they can be easily compared and utilized.

Semantic Consistency

Practical Example

In the above retailer example, semantic consistency can be achieved by creating an ad-hoc semantic layer as an intermediary between raw data from different sources or data lakes and the data warehouse. This layer will standardize the raw data from diverse sources into a consistent, business-friendly format, enabling it to be used for ad-hoc analysis.

Consistent Column Names: A raw e-commerce dataset in the EDL might have a column named “cust_id” while the EDW uses “CustID.” Map data fields from various sources to standardized definitions and column names in the ad-hoc semantic layer.

Aligned Grain: Customer transactions in the EDL might be recorded at the individual transaction level, while the EDW aggregates them daily. Aligning the grain, for instance by rolling EDL data up to a daily level before loading it into the ad-hoc semantic layer, maintains consistency.

Common Dimensional Joins: Both the EDL and EDW should use a standardized customer dimension table with consistent keys and attributes. This will make joining purchase data with customer information across various systems in the ad hoc semantic layer straightforward.

Standardized Metrics and KPIs: Ensure that the logic and rules applied to data processing and calculations are consistent across the different systems and datasets within the organization. Define metrics such as "total sales," "customer lifetime value," and "campaign ROI" uniformly in the ad-hoc semantic layer, following consistent definitions.
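Pulling these four practices together, here is a hedged pandas sketch of what such an ad-hoc semantic layer could do for the retailer example; the column names, sample values, and the "total_sales" definition are illustrative assumptions, not the retailer's actual model.

```python
import pandas as pd

# EDL side: raw online transactions at transaction grain, with source-specific names.
edl_txns = pd.DataFrame({
    "cust_id": [42, 42, 7],
    "order_ts": pd.to_datetime(["2024-06-01 09:00", "2024-06-01 17:30", "2024-06-02 11:00"]),
    "amount": [19.99, 5.00, 120.00],
})

# EDW side: in-store sales already at daily grain, with warehouse naming.
edw_daily = pd.DataFrame({
    "CustID": [42, 7],
    "SalesDate": pd.to_datetime(["2024-06-01", "2024-06-02"]),
    "StoreSales": [55.00, 10.00],
})

# 1) Consistent column names: map both sources onto shared, business-friendly names.
edl_std = edl_txns.rename(columns={"cust_id": "customer_id"})
edw_std = edw_daily.rename(columns={"CustID": "customer_id", "SalesDate": "sales_date",
                                    "StoreSales": "store_sales"})

# 2) Aligned grain: roll EDL transactions up to the EDW's daily grain.
edl_std["sales_date"] = edl_std["order_ts"].dt.normalize()
edl_daily = (edl_std.groupby(["customer_id", "sales_date"], as_index=False)
                    .agg(online_sales=("amount", "sum")))

# 3) Common dimensional join keys + 4) one standardized "total_sales" metric.
combined = edl_daily.merge(edw_std, on=["customer_id", "sales_date"], how="outer").fillna(0)
combined["total_sales"] = combined["online_sales"] + combined["store_sales"]
print(combined)
```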

The experiential aspiration for ad-hoc analysis, then, comes down to quick access to data and the ability to integrate data from various sources. Semantic consistency helps ensure data quality by reducing errors and discrepancies, allowing analysts to trust the data they use for ad-hoc analysis.

Achieving semantic consistency for data outside the EDW layers by building an ad-hoc semantic layer may be more feasible than integrating that data into the EDW. This allows faster, more exploratory development while the core EDW remains stable. If certain data proves valuable in the long term, it can be integrated into the EDW later. Data scientists can readily access data derived from the organization's business information model, such as business KPIs, and merge it with semantically consistent data. Additionally, not all data needs to be semantically consistent, so the fast, exploratory nature of the EDL approach remains applicable for modern use cases.

Business intelligence (BI) and advanced analytics are converging. We used to assume that only data analysts used technical (ad-hoc analysis) tools, that data scientists used programming, and that everyone else used BI apps, but that is no longer true. Data-wrangling and SQL tools can be helpful even for top-level data scientists, and anyone can benefit from advanced analysis.

In today's data-driven world, it's essential for individuals to seamlessly transition from accessing crucial information from trustworthy data sources to analyzing and manipulating that data with diverse groupings and filters, drawing from multiple sources to ensure semantic consistency. This enables them to conduct in-depth technical analyses and effectively communicate their insights to stakeholders.

When working with data, individuals shouldn't have to switch between different systems or rely on someone else to standardize their data for ad-hoc analysis. They should be able to self-serve their experimental data needs quickly, and if the results are promising, hand the work over to the data team for standardization with minimal friction.

The objective is to deliver a contemporary data experience, with the modern data stack as an instrument for achieving that goal. While the modern data stack offers an architectural roadmap, these aspirations aim to outline an experiential roadmap (stay tuned for Part 2 of this article, https://medium.com/@DataEnthusiast/the-aspiration-for-a-modern-data-experience-part-2-f469b99e53ad, to dive into the details). They set the standard for how our new tools should work together and the collective impact they can have. Let's focus on the experience we aim to create instead of getting tangled in technology and technical diagrams. This direct approach will better serve companies' data aspirations.

Hope you found it helpful! Thanks for reading!

Let's connect on LinkedIn!

Link to Blogs on various topics in Data Space

https://medium.com/@DataEnthusiast

Authors

Himanshu Gaurav — www.linkedin.com/in/himanshugaurav21

Bala Vignesh S — www.linkedin.com/in/bala-vignesh-s-31101b29


Himanshu Gaurav

Himanshu is a thought leader in data space who has led, designed, implemented, and maintained highly scalable, resilient, and secure cloud data solutions.