The Evolution of ETL Transformations

Nalin Bhan
Capillary Technologies
Jul 6, 2023

In today’s business landscape, where customer satisfaction and retention are of utmost importance, unlocking the potential of data becomes a critical necessity. Enter Capillary Insights, a single platform that collects, analyses, and interprets data seamlessly, using a centralized data model and a measured approach to represent a business in one complete view. By harnessing its capabilities, brands can elevate their customer data collection and analysis practices, translating raw information into valuable, actionable insights that fuel customer loyalty.

Powering Insights+ is a robust ETL (Extract, Transform, Load) pipeline that operates in a multi-tenant fashion, executing a DAG of workflows. With ETL pipeline isolation for every tenant, we cater to approximately 500 tenants across geographies. As a multi-tenant system, the dataset and schema for every tenant can vary based on multiple factors, e.g., the vertical they operate in, the geography, etc. The data pipeline at Capillary has also evolved as the nature of the data our system ingests has changed. In this article, we specifically address the challenges faced by the Transform phase of ETL and how it evolved to meet those challenges.

Central to our data infrastructure is our data warehouse, which employs a star schema model (isolated across tenants). At its heart resides the essential fact table, where each row represents a specific event that occurred at a given time. Within the fact table, we have a combination of attributes, measures, and foreign key references to dimension tables.

The dimension tables provide valuable context that helps answer the fundamental questions of who, what, where, when, how, and why for each event. Notably, the fact and dimension tables are generated from data ingested from diverse sources belonging to the various products offered by Capillary, like Loyalty+ and Engage+. Each fact represents an event associated with one of these products.
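To make the shape of this model concrete, here is a minimal, hypothetical sketch of one fact table and two dimension tables. The names and columns (dim_user, dim_store, amount, channel) are illustrative simplifications, not our actual warehouse definitions; only dim_event_user_id mirrors the naming that appears in the queries later in this post.

-- Hypothetical, simplified star-schema sketch (not the actual warehouse DDL)
CREATE TABLE dim_user (
  user_id BIGINT PRIMARY KEY,   -- the "who" for an event
  segment VARCHAR(64)
);

CREATE TABLE dim_store (
  store_id BIGINT PRIMARY KEY,  -- the "where" for an event
  city VARCHAR(128)
);

CREATE TABLE transactions (
  id BIGINT PRIMARY KEY,                                   -- one row per transaction event
  event_time TIMESTAMP,                                    -- the "when"
  amount DECIMAL(18, 2),                                   -- measure
  channel VARCHAR(32),                                     -- attribute
  dim_event_user_id BIGINT REFERENCES dim_user(user_id),   -- FK to the user dimension
  dim_store_id BIGINT REFERENCES dim_store(store_id)       -- FK to the store dimension
);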

Originally utilized by analysts for their reporting requirements, our data warehouse revealed a need to evolve as our journey progressed. Analysts often required derived data that involved joining different fact tables, leading to performance problems and resource contention.

Recognizing this recurring need, we made a strategic decision to enrich the existing facts within the ETL itself. This proactive approach empowers analysts by providing them with comprehensive data directly, simplifying their analysis and eliminating the need for additional joins.

Given the fact-to-fact joins involved, this enrichment process is computationally demanding and can potentially delay our ETL completion SLA. To comprehend the reasons behind its computational expense, let me provide a glimpse into the scale at which we operate. During the Extract phase, our ETL pipeline ingests data from various sources (such as MySQL tables or MongoDB collections) for all tenants. Following this, the pipeline branches out to handle each tenant separately to ensure isolation. One of the bigger source tables houses roughly 4.5 billion records. Picture this: that table is just one of roughly 250 source tables within a single region, and we have a total of five distinct regions.

A potential solution must take this scale, along with the following, into consideration:

  1. Metadata modeling: The enrichment process must support whatever use cases analysts come up with, so the metadata model driving it needs to be highly versatile and adaptable. Notably, transformations valid for one brand may not be applicable to another, since different use cases lead to varying event data patterns. Hence, it should be possible to override the transformation metadata at the brand level. We discuss some examples of this later in the blog.
  2. Late-arriving data: Our ETL pipeline must be able to efficiently handle backfill data (up to a few months old), which can arrive, for example, when a failure in the pipeline necessitates data reimports.
  3. Leveraging batch pipeline efficiency: Our ETL pipeline operates as a batch pipeline with a batch window of one day. Each day, we extract only the inserted/changed data, which is then processed and merged with the existing fact table history. Since the extracted data represents a small subset (less than 1%) of the total fact data, we can leverage it to compute derived measures and dimensions efficiently, avoiding full computation over the entire tables (a minimal sketch of this extract-and-merge step follows this list).
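As a rough illustration of that daily extract-and-merge step, the sketch below assumes the source rows carry an updated_at change-tracking column, that the fact history sits in a `target`.transactions__history table (mirroring the naming used for the points history later in this post), and that the warehouse supports ANSI MERGE; the column list is hypothetical and trimmed for brevity.

-- Illustrative sketch: extract only the rows touched within the 1-day batch window
-- (updated_at is an assumed change-tracking column)
CREATE TABLE base_fact.transactions_delta AS
SELECT *
FROM source.transactions
WHERE updated_at >= current_date - INTERVAL '1' DAY;

-- Upsert the delta into the fact history (column names are hypothetical)
MERGE INTO `target`.transactions__history AS hist
USING base_fact.transactions_delta AS delta
  ON hist.id = delta.id
WHEN MATCHED THEN UPDATE SET
  event_time = delta.event_time,
  amount = delta.amount
WHEN NOT MATCHED THEN INSERT
  (id, event_time, amount, bill_id, dim_event_user_id)
  VALUES (delta.id, delta.event_time, delta.amount, delta.bill_id, delta.dim_event_user_id);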

The solution

Let’s first discuss how these computed attributes are derived. Consider two facts: Transactions and Points.

A Transaction refers to a transaction done by the end user at one of the brands.

Points represent any event in which loyalty points were credited to or debited from an end user’s account.

Though Transactions and Points are different facts, there is a correlation between them. An end user may be awarded points for a transaction, or they may redeem points as currency for a specific transaction. Hence, each transaction can be associated with a certain number of points. Our objective is to add a column called TotalPoints to the Transactions fact table, representing the total points awarded for each transaction. To achieve this, we need to derive the values from the Points fact table. The SQL query for this calculation is given below:

SELECT
  trans.*,
  nvl(pts.totalPoints, 0) AS TotalPoints
FROM
  Transactions trans
  LEFT JOIN (
    SELECT
      t.Id,
      SUM(nvl(p.Points, 0)) AS totalPoints
    FROM
      Transactions t
      LEFT JOIN Points p ON t.Id = p.TransactionId
    GROUP BY
      t.Id
  ) pts ON trans.Id = pts.Id;

Running this query over the entire fact tables would be resource-intensive, as it involves joining two large tables. However, in day-to-day use, we rely on batch computation to optimize the process. Since we only consider the changes (delta) from the last day, the size of the Transactions table is significantly reduced, comprising less than 1% of the total data.

How batch computation works

Since it’s not guaranteed that an update in the transactions table will lead to an update in all corresponding points records, we cannot directly join the delta transactions table with the delta points table. Instead, the delta transactions table must be joined with the full points table, which is undesirable: an unbounded join against the full points history is very expensive, and its cost keeps increasing with every passing day as the points table grows incrementally.

Points history is huge; creating derived columns by joining delta transactions directly with history is not feasible.

All our fact tables have a user dimension denoting the end user of a particular event. We leverage this to optimize our join in the following manner:

Instead of joining directly with the full points table, we create an intermediate table, say points_reduced, by filtering the data for users who are present in the delta transactions table. The resulting intermediate table is much smaller than the full points table, and all the fact-to-fact join use cases can now be done using the transactions_delta and points_reduced tables. This simple yet intuitive optimization works well when the number of distinct users in the incremental data is a small percentage of the total number of users.

The following query illustrates how we further reduce the size of the points dataset:

-- The result of this query is materialized as the points_reduced intermediate table
SELECT
  pts_history.*
FROM
  `target`.points__history AS pts_history
  INNER JOIN (
    SELECT DISTINCT
      `b`.`bill_id`,
      `b`.`dim_event_user_id`
    FROM
      base_fact.transactions_delta AS b
  ) AS transactions_unique
    ON pts_history.bill_id = transactions_unique.bill_id
   AND pts_history.dim_event_user_id = transactions_unique.dim_event_user_id;

This reduction logic is generic and can be extended to any transformation: for any query required to compute derived columns, the reduced table can be injected into that query, as sketched below.
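For instance, the TotalPoints query from earlier might look roughly like this once the reduced table is injected; the table names follow the conventions above, while the column names mirror the simplified logical query rather than the physical schema, so treat it as a sketch rather than a production definition.

SELECT
  trans.*,
  nvl(pts.totalPoints, 0) AS TotalPoints
FROM
  base_fact.transactions_delta AS trans
  LEFT JOIN (
    SELECT
      t.Id,
      SUM(nvl(p.Points, 0)) AS totalPoints
    FROM
      base_fact.transactions_delta AS t
      -- points_reduced stands in for the full Points history
      LEFT JOIN points_reduced AS p ON t.Id = p.TransactionId
    GROUP BY
      t.Id
  ) pts ON trans.Id = pts.Id;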

We talked about the versatility of this metadata modeling earlier in the blog. Now, let’s delve into why it matters. Consider a scenario where we need to capture TotalPoints for a different brand. For this brand, however, the billId is not captured as part of the points events. Consequently, a direct join between Transactions and Points is not feasible due to the absence of a direct relationship between the two facts, and the transformation defined earlier is not applicable to this brand.

To address this issue, we have the option to override the transformation at the brand level by defining an alternative SQL definition specifically for this brand. Analysts may choose to join the two tables based on store or date for attribution purposes, but the underlying logic for optimization remains consistent.
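As a hedged illustration of such an override, a brand-specific definition might attribute points to transactions by store and date instead of bill id; dim_store_id and event_date are assumed column names for the sake of the example, and the Points side is still reduced beforehand as described above.

-- Hypothetical brand-level override: attribute points by store and date
SELECT
  trans.*,
  nvl(pts.totalPoints, 0) AS TotalPoints
FROM
  base_fact.transactions_delta AS trans
  LEFT JOIN (
    SELECT
      p.dim_store_id,
      p.event_date,
      SUM(nvl(p.Points, 0)) AS totalPoints
    FROM
      points_reduced AS p
    GROUP BY
      p.dim_store_id,
      p.event_date
  ) pts ON trans.dim_store_id = pts.dim_store_id
       AND trans.event_date = pts.event_date;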

Our system is a platform and is agnostic of the queries that analysts want to use.

Effectiveness of this optimization

In Capillary’s initial days, most of our clients were from Fashion and Retail. We observed that the end users in these verticals had a relatively low percentage of presence in the delta data compared to the entire user database. This can be attributed to the infrequent transaction pattern that’s typical of these segments.

The following chart illustrates this phenomenon.

The Y-axis captures the percentage of distinct users in the delta against the full user base over a period of three months for a particular Fashion and Retail brand. The percentages ranged between 0.3% and 0.8%, illustrating the tiny proportion of users in the delta relative to the total user base.

The reduced points dataset obtained through the intermediate table is smaller than the full points table by roughly the same factor. Therefore, this optimization significantly reduces the overall data size, which improves query performance and resource utilization during ETL. We observed that with this optimization, ETL completion time was reduced by roughly 50% to 70%, with brands having bigger datasets at the higher end of that range.

Caveats

This optimization is not without caveats. It has limitations that prevent it from guaranteeing ETL completion within the SLA. The two major ones are listed below.

Unpredictability of data magnitude

The effectiveness of this optimization is inversely related to the percentage of users in the delta. While this approach works wonders when the number of users in the delta is a small percentage of the total number of users, this is not always the case and depends on the amount of data being ingested.

If, for example, on a specific day, more than 15% of users’ data is ingested due to a large file being imported at the source, or due to promotions targeting millions of users, the optimization would not be very effective and the ETL process would get delayed. Since there is no way to control the percentage of users in the delta, there is no way to guarantee completion of the ETL process within the specified SLA.

Heterogeneity in user behavior among segments

The premise of this optimization relies on the number of distinct users in the delta being a small percentage of the total user base. But, as it turns out, this holds only for certain segments such as Fashion and Retail, and is not always the case across segments. The following graph illustrates this phenomenon for a brand operating in an entirely different segment.

As shown here, the distinct user percentage for this brand is as high as 25% and remains more or less constant across several months.

In scenarios such as this, our optimization would not be helpful, because the computation would entail an unbounded lookup over the full data, which is very expensive.

It is safe to summarize at this point that while our optimization delivered big improvements in ETL completion times for brands in the Fashion and Retail segment, the heterogeneity in user behavior across brands in other segments meant it did not deliver comparable improvements overall. So how did we solve this problem across brands?

Stay tuned

In our next blog in this series, we go into the details of how we solved these problems for brands across various segments by developing an innovative in-house solution capable of performing fact-to-fact joins while remaining impervious to various external factors such as the ones discussed above.

We introduce our solution, namely the Bounded Batch Architecture, built by leveraging concepts such as watermarking and state management inspired by streaming engines, and dive deeper into the inner workings of this groundbreaking solution, explaining how it expanded our computing capabilities from handling millions of records to effortlessly processing billions.

Join us on this exciting journey as we unravel the secrets behind its scalability, resilience, and its ability to empower businesses to navigate and thrive in the face of unparalleled data challenges.
