A Tale of ETL Workflow Scheduling in Multi-Tenant Architecture

Thrivikram G L
Capillary Technologies
Feb 26, 2024 · 10 min read

Capillary Technologies’ Insights+ serves as a centralised data analytics platform, designed to gather, analyse, and interpret data. At its core lies a star schema model within the data warehouse (isolated across tenants). Insights+ operates as a multi-tenant batch-processing ETL system, extracting data from diverse sources in OLTP systems and denormalizing it into simplified facts and dimensions. Following transformation, the data is loaded into our data warehouse, backed by Amazon S3. Analysts and clients leverage this data to generate reports, execute campaigns, and export data for external integrations. A simplified ETL workflow is shown in Fig. 1 below.

Fig.1 ETL outline

Let’s consider a scenario with 50 tenants, each requiring 20 tasks in the ETL, as shown in Figure 2. This totals around 1,000 tasks. Each new transformation adds 50 tasks to the overall workload, and each new tenant adds another 20, so incorporating new transformations or tenants significantly increases the size of the workflow. Existing off-the-shelf workflow orchestrators either posed code-maintainability challenges or struggled to handle the workload effectively. Additionally, since different tenants had varying SLAs, we needed to build our own orchestrator with a custom scheduling algorithm. This blog explores the evolution of the workflow scheduling algorithm within that orchestrator, highlighting how we manage substantial workflows while meeting SLAs.

Figure 2

Iteration 1

Figure 3 shows a model of our initial workflow, where a single transformation task is executed for all tenants. After the transform phase, there is one task per tenant to load the data into the data warehouse. Although this approach was simpler to handle with off-the-shelf workflow orchestrators like Azkaban or Airflow and enabled successful execution of multi-tenant ETL operations, it revealed a significant flaw. Any activity affecting the data volume for a specific tenant, such as a data import or a campaign run, caused a cascading delay in ETL completion for all other tenants, a phenomenon known as the “noisy neighbour” issue. In a multi-tenant architecture, such cross-tenant impacts are highly undesirable. We also needed flexibility in the transformation phase to accommodate the diverse operations of tenants across various verticals. To tackle the noisy neighbour issue and give tenants greater flexibility, we decided to isolate the ETL workflow DAG for each tenant. This significantly increased the size of our workflow DAG, prompting us to develop our own orchestrator.

Figure 3

Iteration 2

We constructed a DAG resembling the structure depicted in Figure 4, ensuring that each tenant or organisation follows its own execution path, independent of others. However, due to resource constraints, there is a limit on the maximum number of tasks that can run in parallel. As the DAG execution commences, tasks are randomly selected and submitted to the Spark cluster for execution.

Figure 4

Eventually, the DAG may reach a state depicted in Figure 5 below. In this state, Tasks T4 (for org 1), T3 (for org 2), and T2 (for org 3) are ready for execution. Employing a higher depth-first logic for scheduling, Task T4 (for org 1) is selected and submitted for execution.

Figure 5

The higher depth-first logic gives priority to tasks belonging to tenants whose earlier tasks completed quickly. This prevents them from being stuck behind tasks from tenants that take longer to complete, resolving the noisy neighbour problem.
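As a sketch, the selection step can be modelled as picking the deepest ready task. The `Task` class, `depth` field, and task names below are illustrative assumptions, not our actual orchestrator code:

```python
# Illustrative sketch of "higher depth-first" selection: among all tasks
# whose dependencies have completed, submit the one deepest in its
# tenant's DAG first.

class Task:
    def __init__(self, name, depth):
        self.name = name
        self.depth = depth  # distance of the task from its DAG's start node

def pick_next(ready_tasks):
    """Select the ready task with the greatest depth in its DAG."""
    return max(ready_tasks, key=lambda t: t.depth)

# The Figure 5 state: T4 (org 1), T3 (org 2), and T2 (org 3) are ready.
ready = [Task("T4-org1", 4), Task("T3-org2", 3), Task("T2-org3", 2)]
print(pick_next(ready).name)  # T4-org1 is submitted for execution
```

Because org 1 is further along its chain, its remaining tasks always win the tie-break, so a fast-moving tenant is never queued behind a slower one.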

While this approach addressed the noisy neighbour issue, it brought new challenges to the system. To illustrate, let’s examine Figure 6, which shows a sample query that runs for two different tenants. Tenant 1 has a larger volume of data to process than Tenant 2, so Tenant 1’s tasks take more time to execute.

Figure 6

Due to the inherent characteristics of our scheduling algorithm and the diverse nature of the data, ETL completion time correlates directly with the volume of data being processed. Typically, our larger customers handle greater volumes of data yet adhere to more stringent SLAs for ETL completion. This tight coupling between data volume and ETL completion made it difficult to meet SLAs for these tenants.

To tackle this challenge, we implemented a tiered approach by categorising tenants into different priorities based on their defined SLAs. Subsequently, we divided our primary DAG into three smaller DAGs: batch-0, batch-1, and batch-2, with batch-0 assigned the highest priority. We initiated the execution of batch-0 ahead of the other batches, providing it with a head start. Combined with the scheduling algorithm’s higher depth-first logic, this advanced start gives batch-0’s tasks higher priority throughout the run.

Figure 7 illustrates the typical timeline of our ETL execution, managed through cron jobs. Under normal circumstances, we met our SLAs without issue. However, on days with system issues, such as cluster launch failures or prolonged data imports (Sqoop) due to large data volumes, significant manual effort is needed to address the anomalies. This intervention entails adjusting cron timings to ensure batch-0 still receives its head start, and the cron jobs must then be reverted before the next day. The reliance on manual tasks introduces the risk of errors and contributes to on-call fatigue among the team.

Figure 7

Iteration 3

With this iteration, our primary aim was to address several key issues:

  • Org level isolation: We aimed to implement a solution that ensures each organisation operates within its isolated environment, minimising interference or impact from other tenants.
  • Enable business to define the SLAs: We sought to empower businesses to define their own Service Level Agreements (SLAs), allowing for tailored performance expectations based on specific organisational needs and priorities.
  • Minimal manual intervention: We aimed to streamline processes to reduce the need for manual intervention, particularly in addressing system anomalies or adjustments in scheduling, thereby enhancing operational efficiency and reducing the risk of errors.

We sought to achieve this using the As Late As Possible (ALAP) scheduling technique. ALAP is used across industries and domains such as manufacturing, supply chain management, project management, and workflow orchestration systems to optimise task scheduling by having tasks start as late as possible while still meeting deadlines. It is generally used in conjunction with other optimisation algorithms to maximise flexibility and resource utilisation. By delaying task starts until necessary, ALAP scheduling minimises resource idle time and allows better adaptation to changing conditions.

Let’s understand it better with an example.

Figure 8

Let’s analyse the dependencies among the tasks in the DAG:

  • Task 1 needs to be completed before Task 2 can start.
  • Task 2 needs to be completed before Task 3 and Task 4 can start.
  • Task 5 needs to be completed before Task 6 can start.

By scheduling Task 3, Task 4, and Task 6 for execution at or before time unit 3, we can ensure that these tasks meet the SLA, as shown in Figure 9.

Figure 9

Continuing with the same logic and applying it backward:

For Task 2 and Task 5, we must ensure that they start execution at or before time unit 2. This ensures that Task 3, Task 4, and Task 6 can start on time to meet the SLA.

Similarly, for Task 1, we need to guarantee its completion at or before time unit 1. This enables Task 2 to start on time, cascading down the workflow so that all subsequent tasks meet their deadlines and adhere to the SLA.

Figure 10 — DAG with populated ALAP times
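This backward pass can be sketched in a few lines. The dictionary-based DAG encoding and the one-time-unit task durations are assumptions matching the example above, not production code:

```python
# Backward-pass sketch of ALAP time computation for the example DAG:
# a leaf task must finish by the deadline; every other task must finish
# before its earliest successor's latest start.

def alap_times(successors, durations, deadline):
    """Return the latest allowable start time for each task."""
    alap = {}

    def latest_start(task):
        if task in alap:
            return alap[task]
        succ = successors.get(task, [])
        if not succ:
            finish = deadline                            # leaf: finish at deadline
        else:
            finish = min(latest_start(s) for s in succ)  # finish before earliest successor
        alap[task] = finish - durations[task]
        return alap[task]

    for t in durations:
        latest_start(t)
    return alap

# Figure 8 dependencies: 1 -> 2 -> {3, 4}, and 5 -> 6; unit durations.
successors = {"T1": ["T2"], "T2": ["T3", "T4"], "T5": ["T6"]}
durations = {t: 1 for t in ["T1", "T2", "T3", "T4", "T5", "T6"]}
times = alap_times(successors, durations, deadline=4)
print(times["T3"], times["T2"], times["T1"])  # 3 2 1
```

The computed values mirror the walkthrough: Tasks 3, 4, and 6 get a latest start of 3, Tasks 2 and 5 get 2, and Task 1 gets 1.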

In our multi-tenant workflow, we have different groups of organisations, each with a different SLA. Let’s examine a simplified workflow with tasks for three organisations, assuming each task takes 30 minutes:

  • Organisation T: SLA is 9:00 AM
  • Organisation G: SLA is 10:00 AM
  • Organisation F: SLA is 10:30 AM

Figure 11

We then backtrack to determine the latest start time for each task in the workflow, ensuring that all organisations meet their respective SLAs. We end up with a schedule like the one shown in Figure 12. Given sufficient resources [1], we can execute the workflow within SLA if the start task begins before 6:30 AM.
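The same backward pass works with wall-clock SLAs. In this hedged sketch, each organisation’s pipeline is simplified to a chain of four 30-minute tasks behind one shared start task; the chain lengths and the shared start task are illustrative assumptions, not the exact DAG of Figure 11:

```python
from datetime import datetime, timedelta

TASK_DURATION = timedelta(minutes=30)  # per the example's assumption

def latest_chain_start(sla, n_tasks):
    """Latest time a chain of n_tasks can begin and still finish by its SLA."""
    return sla - n_tasks * TASK_DURATION

day = datetime(2024, 2, 26)
orgs = {
    "T": (day.replace(hour=9), 4),              # SLA 9:00 AM, 4-task chain
    "G": (day.replace(hour=10), 4),             # SLA 10:00 AM
    "F": (day.replace(hour=10, minute=30), 4),  # SLA 10:30 AM
}

# The shared start task must finish before the earliest per-org latest start.
earliest = min(latest_chain_start(sla, n) for sla, n in orgs.values())
start_task_deadline = earliest - TASK_DURATION
print(start_task_deadline.strftime("%I:%M %p"))  # 06:30 AM under these assumptions
```

Organisation T’s 9:00 AM SLA dominates: its chain must begin by 7:00 AM, so the shared start task must begin by 6:30 AM, matching the schedule described above.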

Given the defined SLAs and scheduling priorities, tasks will be executed in accordance with these priorities, regardless of any delays encountered during the process. This ensures that tasks for organisations with earlier SLAs are given priority, even in situations where delays occur. By adhering to the established priorities, the workflow can effectively meet the deadlines set by each organisation’s SLA.

Figure 12

Acknowledging that task execution times may vary in reality, we must adjust our approach to accommodate this variability. This adjustment will ensure that our scheduling strategy remains robust and adaptable to the dynamic nature of task execution times.

Proactive ALAP Scheduling Strategy

Let’s consider a simple workflow for Organization G with 4 tasks and an SLA of 8:00 AM.

Figure 13

On day 1, we proceed with the workflow assuming each task takes 30 minutes to complete. After the execution is finished, we record the actual execution times of each task in the database for future reference.

Figure 14

On day 2, execution times from previous runs are used to compute ALAP scheduling times for the tasks. After each run, the actual execution times are recorded and averaged with the previous execution times.

Figure 15

Over time, the algorithm uses the average execution times from the last 5 days to compute ALAP scheduling times, ensuring adaptability to changing execution patterns. By incorporating historical execution data into the scheduling algorithm, we can dynamically adjust task schedules based on real-world performance, improving the efficiency and reliability of the workflow management system.
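A minimal sketch of this estimator: keep each task’s last five recorded run times and feed their average into the next day’s ALAP computation. The 5-day window and the 30-minute day-1 seed follow the text; the in-memory data structures are illustrative (ours live in a database):

```python
from collections import deque
from statistics import mean

WINDOW_DAYS = 5     # average over the last 5 recorded runs
SEED_MINUTES = 30   # day-1 assumption when no history exists yet

history = {}  # task name -> deque of recent run times, in minutes

def record_run(task, minutes):
    """Store an actual run time; the deque drops runs older than the window."""
    history.setdefault(task, deque(maxlen=WINDOW_DAYS)).append(minutes)

def estimated_duration(task):
    """Duration estimate used when computing the task's ALAP start time."""
    runs = history.get(task)
    return mean(runs) if runs else SEED_MINUTES

record_run("T1", 25)
record_run("T1", 35)
print(estimated_duration("T1"))  # averages the recorded runs: 30 minutes
print(estimated_duration("T2"))  # no history yet, so the 30-minute seed
```

Because the deque is bounded, a one-off slow day ages out of the estimate within five runs, so the computed ALAP times track current behaviour rather than stale history.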

Results

In our expansive ETL workflow, a single tenant typically comprises approximately 1,200 tasks within a single execution. With approximately 250 tenants in one workflow DAG, this spans a vast network of around 300,000 tasks. Despite operating at this scale, our system maintains a 90% monthly SLA adherence rate across all tenants. This underscores our commitment to efficiency and reliability in managing complex data operations at scale.

With our current approach, we have effectively tackled all the challenges we initially encountered:

  • Tenant Isolation: As depicted in Figure 16, our system ensures that higher-priority tenants, irrespective of the volume of data they process, complete their ETL tasks before lower-priority tenants. This prioritisation guarantees that lower-priority tenants’ ETL runs do not hinder the performance of higher-priority ones.

Figure 16 — ETL completion time for different priority tenants (over 2 days)

  • Enable business to define the SLAs: Depending on business needs, we can tailor SLAs to each tenant in the workflow. This flexibility allows different deadlines for different tenants, ensuring that the system aligns closely with individual contractual agreements.

  • Minimal manual intervention: Our workflow operates seamlessly, demanding minimal to no manual intervention even when confronted with anomalies or issues. Leveraging the computed ALAP times, the system reschedules tasks to their relative times in the event of disruptions, ensuring smooth continuity and adherence to deadlines, as shown in Figure 17. This automated approach enhances efficiency and minimises downtime, enabling the workflow to maintain robust performance even in challenging circumstances.

Figure 17 — ETL completion time with SLA for higher priority tenants (over 1 week)

Conclusion

In our pursuit of refining our workflow management system, we’ve implemented As Late As Possible (ALAP) scheduling principles with a nuanced approach. Recognizing the varied priorities across different task groups, we introduced diverse Service Level Agreements (SLAs) to cater to distinct priority levels. Moreover, rather than relying on fixed assumptions for task execution times, we turned to historical data to predict and incorporate task execution trends into our ALAP computations. This adaptive strategy allows us to dynamically adjust task schedules based on real-world performance, ensuring optimal resource allocation and timely completion of tasks aligned with their respective priorities. By embracing these enhancements, we’re fostering a more agile and efficient workflow ecosystem, poised to meet the evolving demands of our operations with precision and reliability.

[1] In workflow optimization, there exists a fundamental trade-off between fixed Service Level Agreements (SLAs) and resource optimization. Typically, organisations either set fixed SLAs and optimise resources accordingly or allocate fixed resources and optimise for meeting SLAs. In our specific scenario, we adhere to the former approach, where the emphasis lies on maintaining consistent SLAs while optimising resource allocation.
