AWS Design for ETL Workflow Orchestration
By Abhijit Patil and Hukumchand Shah
In our previous article, “Data Flow Design and Architecture to Build Data Warehouse on AWS,” we covered the design of the data flow (ETL job). Building a data warehouse or a data mart depends on multiple data sets and on simple or complex data transformations, which results in either a single workflow or multiple workflows with dependencies. To explain ETL orchestration, this design considers a complex scenario in which one or more workflows depend on another workflow.
Problem Statement:
In ETL (Extract, Transform, and Load) or ELT (Extract, Load, and Transform) workflows, orchestration is required to handle scenarios such as late-arriving data, job pause (hold) and resume, business date dependencies, data dependencies, idempotency, etc. On AWS, there are multiple options for performing transformations (EMR, Glue, etc.), and the dependencies between workflows can grow into a highly complex mesh. This can become a challenge for support teams managing the application and meeting SLAs.
With the design below, ETL orchestration can be easily defined, managed, extended, and supported for any kind of workflow.
In the illustrated ETL Workflow orchestration design, you can see that many workflows have complex dependencies. Each workflow can internally use different AWS services like EMR, Glue, Redshift, etc.
There are three main scenarios into which all other scenarios fall.
- One-to-One workflow dependency: in the diagram, Workflow4 depends on Workflow1.
- One-to-Many workflow dependency: in the diagram, Workflow5 and Workflow6 depend on Workflow2.
- Many-to-One workflow dependency: in the diagram, Workflow8 depends on Workflow4, Workflow5, and Workflow6.
The generic Lambda function in the design has the following responsibilities.
- Date control validation: This would read the metadata from the Business Date Control table and validate the current date against the business date before triggering the respective workflow.
Note: The business date is calculated based on the date control table and by referencing the business holiday calendar.
- Workflow status check: This would refer to the Workflow Status table to update the current status and ensure workflow idempotency.
- Trigger dependent workflows: This would read the workflow dependency from the Workflow Control table and trigger the next dependent workflow(s); a minimal sketch covering all three responsibilities follows this list.
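A minimal sketch of such a generic Lambda handler, written in Python with boto3, is shown below. It assumes each workflow is exposed as a Step Functions state machine and that the Workflow Control item carries a dependents list holding the downstream state machine ARNs; the table and attribute names follow the DynamoDB schemas described later in this article.

import boto3
from datetime import date, datetime, timezone

dynamodb = boto3.resource("dynamodb")
sfn = boto3.client("stepfunctions")

date_control = dynamodb.Table("BusinessDateControl")
workflow_status = dynamodb.Table("WorkflowStatus")
workflow_control = dynamodb.Table("WorkflowControl")

def handler(event, context):
    workflow_id = event["workflowId"]
    workflow_name = event["workflowName"]

    # 1. Date control validation: proceed only when today matches the
    #    configured current business date for this workflow.
    ctrl = date_control.get_item(
        Key={"workflowID": workflow_id, "workflowName": workflow_name}
    )["Item"]
    if ctrl["current-business-date"] != date.today().isoformat():
        return {"status": "SKIPPED", "reason": "business date not reached"}

    # 2. Workflow status check: record this execution so that reruns for
    #    the same business date can be detected (idempotency).
    workflow_status.put_item(Item={
        "executionId": event["executionId"],
        "workflowId": workflow_id,
        "workflow-status": "RUNNING",
        "business-date": ctrl["current-business-date"],
        "updated-timestamp": datetime.now(timezone.utc).isoformat(),
    })

    # 3. Trigger the dependent workflows listed in the Workflow Control table
    #    (the "dependents" attribute is an assumption for illustration).
    deps = workflow_control.get_item(
        Key={"workflowID": workflow_id, "workflowName": workflow_name}
    )["Item"].get("dependents", [])
    for dep in deps:
        sfn.start_execution(stateMachineArn=dep["stateMachineArn"])

    return {"status": "TRIGGERED", "dependents": len(deps)}

This sketch covers only the happy path; a production handler would also record failure statuses and handle retries.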
To perform all the above responsibilities, the workflow metadata can be configured in DynamoDB tables as described below. You can use any other service, such as RDS, to store this metadata.
There are four main DynamoDB tables used in this design.
- Business Date Control table
- Business Holiday Calendar table
- Workflow Status table
- Workflow Control table
Business Date Control Table:
In this table, the main attributes are workflow-id, workflow-name, previous-business-date, current-business-date, next-business-date, created-timestamp, updated-timestamp, created-by, and updated-by.
Note: DynamoDB is schema-less; you define only the key attributes when creating a new table, and all other attributes are stored freely per item, like JSON documents.
JSON schema:
{
    "Table": {
        "AttributeDefinitions": [
            {
                "AttributeName": "workflowID",
                "AttributeType": "N"
            },
            {
                "AttributeName": "workflowName",
                "AttributeType": "S"
            }
        ],
        "TableName": "BusinessDateControl",
        "TableStatus": "ACTIVE",
        "KeySchema": [
            {
                "KeyType": "HASH",
                "AttributeName": "workflowID"
            },
            {
                "KeyType": "RANGE",
                "AttributeName": "workflowName"
            }
        ],
        "CreationDateTime": 1421866952.062
    }
}
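For reference, an equivalent table could be created with boto3 as sketched below; only the two key attributes are declared up front, and the on-demand billing mode is an assumption.

import boto3

dynamodb = boto3.client("dynamodb")

# Only the key attributes need to be declared; all other attributes
# (business dates, audit columns, etc.) are stored schema-free per item.
dynamodb.create_table(
    TableName="BusinessDateControl",
    AttributeDefinitions=[
        {"AttributeName": "workflowID", "AttributeType": "N"},
        {"AttributeName": "workflowName", "AttributeType": "S"},
    ],
    KeySchema=[
        {"AttributeName": "workflowID", "KeyType": "HASH"},
        {"AttributeName": "workflowName", "KeyType": "RANGE"},
    ],
    BillingMode="PAY_PER_REQUEST",
)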
Business Holiday Calendar table:
This table will hold the holiday dates per year for all the data sets as well as workflows. You may want to further enhance this based on region, country, or any other classification.
JSON schema:
{
    "Table": {
        "AttributeDefinitions": [
            {
                "AttributeName": "workflowID",
                "AttributeType": "N"
            },
            {
                "AttributeName": "year",
                "AttributeType": "S"
            }
        ],
        "TableName": "BusinessHolidayCal",
        "TableStatus": "ACTIVE",
        "KeySchema": [
            {
                "KeyType": "HASH",
                "AttributeName": "workflowID"
            },
            {
                "KeyType": "RANGE",
                "AttributeName": "year"
            }
        ],
        "CreationDateTime": 1421866952.062
    }
}
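A sample item might look like the sketch below, with one item per workflow per year; the holidayDates attribute name and list layout are assumptions for illustration.

import boto3

table = boto3.resource("dynamodb").Table("BusinessHolidayCal")

# Hypothetical item layout: one item per workflow per year, with the
# holiday dates kept as a list attribute.
table.put_item(Item={
    "workflowID": 1,
    "year": "2021",
    "holidayDates": ["2021-01-01", "2021-07-05", "2021-12-24"],
})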
Workflow Status table:
This table will hold the status of every execution of every workflow. The main attributes are execution-id, workflow-id, workflow-name, business-date, workflow-status, description, created-by, created-timestamp, updated-by, and updated-timestamp.
JSON schema:
{
    "Table": {
        "AttributeDefinitions": [
            {
                "AttributeName": "executionId",
                "AttributeType": "S"
            },
            {
                "AttributeName": "workflowId",
                "AttributeType": "N"
            }
        ],
        "TableName": "WorkflowStatus",
        "TableStatus": "ACTIVE",
        "KeySchema": [
            {
                "KeyType": "HASH",
                "AttributeName": "executionId"
            },
            {
                "KeyType": "RANGE",
                "AttributeName": "workflowId"
            }
        ],
        "CreationDateTime": 1421866952.062
    }
}
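Idempotency can be enforced with a conditional write against this table: a duplicate trigger for the same execution key fails the condition, and the Lambda can skip re-running the workflow. A minimal sketch, assuming the attribute names above:

import boto3
from botocore.exceptions import ClientError

status_table = boto3.resource("dynamodb").Table("WorkflowStatus")

def mark_running(execution_id, workflow_id, business_date):
    # The condition fails if an item with this key already exists,
    # so a duplicate trigger for the same execution is rejected.
    try:
        status_table.put_item(
            Item={
                "executionId": execution_id,
                "workflowId": workflow_id,
                "business-date": business_date,
                "workflow-status": "RUNNING",
            },
            ConditionExpression="attribute_not_exists(executionId)",
        )
        return True
    except ClientError as e:
        if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False  # already triggered for this execution
        raise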
Workflow Control table:
This table will hold the workflow dependency details and other configurations, such as an idempotent flag and an on-hold flag, and you can extend it further. The main attributes are workflow-id, workflow-name, dependency array/JSON, is-idempotent, is-on-hold, created-by, created-timestamp, updated-by, and updated-timestamp.
JSON schema:
{
    "Table": {
        "AttributeDefinitions": [
            {
                "AttributeName": "workflowID",
                "AttributeType": "N"
            },
            {
                "AttributeName": "workflowName",
                "AttributeType": "S"
            }
        ],
        "TableName": "WorkflowControl",
        "TableStatus": "ACTIVE",
        "KeySchema": [
            {
                "KeyType": "HASH",
                "AttributeName": "workflowID"
            },
            {
                "KeyType": "RANGE",
                "AttributeName": "workflowName"
            }
        ],
        "CreationDateTime": 1421866952.062
    }
}
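Reading the control record before triggering downstream workflows lets the generic Lambda honor the on-hold flag; a minimal sketch, assuming the dependency details are stored in a dependents list attribute:

import boto3

control_table = boto3.resource("dynamodb").Table("WorkflowControl")

def next_workflows(workflow_id, workflow_name):
    item = control_table.get_item(
        Key={"workflowID": workflow_id, "workflowName": workflow_name}
    )["Item"]

    # A workflow placed on hold by the support team triggers nothing.
    if item.get("is-on-hold"):
        return []

    # Each entry in the dependents list names a downstream workflow to
    # trigger once this one completes (attribute name assumed).
    return item.get("dependents", [])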
This workflow orchestration design also supports building an operational dashboard, which would be helpful for the Support and Business teams.
Conclusion:
With this design, we have tried to cover the various scenarios involved in ETL workflow execution and orchestration. ETL orchestration can, of course, be designed in many other ways as well.
Authors:
Abhijit Patil: Director, Cloud Data Engineering and Architecture, with expertise in building enterprise-scale products and services and 20+ years of diverse experience across various technologies and the finance domain.
Hukumchand Shah: VP of Engineering, specializing in Cloud Data Engineering, Big Data, and Microservices; leadership and health enthusiast.