GenAI Assisted ETL Migration — Transforming SSIS Packages to Databricks workflows
Author: Thomas Zheng, Senior Principal @ West Monroe Partners
Motivation
As many Data Engineers have discovered, GenAI is a powerful tool for assisting with ETL migration projects. In my previous blog post, GenAI Assisted Code Migration — A Quick Test of Code Generation Using Databricks DBRX, I shared some promising initial results from using GenAI to transpile a T-SQL stored procedure into PySpark functions. Applying this approach to client delivery projects, we have observed a 60–80% efficiency gain in ETL code conversion and a 50–60% improvement in overall migration efficiency.
This time, I’m shifting focus to SSIS package conversion. In SSIS-based ETL projects, it’s common to use SSIS as an orchestration tool that invokes other data manipulation tasks, such as T-SQL stored procedures; only an estimated 20% of SSIS-based ETL projects rely solely on SSIS. This post addresses that 20%.
Dissection of an SSIS Project
Although SSIS is a GUI-based ETL tool, it stores projects as XML files. A typical SSIS project contains four types of files, as illustrated in Figure 1 below. The “conmgr” files store connection configurations, the “databases” file holds the configuration for the MSSQL database, the “params” files contain project-wide parameters, and the “dtsx” files are the SSIS packages that contain the ETL logic.
Figure 2 illustrates the components of a typical SSIS package. The ETL logic, including input, output, and data transformation, is contained within “DTS:Executable” elements, and a package can contain multiple “DTS:Executable” components. Where ordering matters, the execution order is recorded in “DTS:PrecedenceConstraint” components.
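To make this concrete, here is a minimal sketch that lists these elements with Python’s standard library. The package path is hypothetical, and the namespace and attribute names follow my reading of the SSIS 2012+ package format:

```python
# A minimal sketch: list the executables and precedence constraints in a
# package. "Package.dtsx" is a hypothetical path; element and attribute
# names follow the SSIS 2012+ package format.
import xml.etree.ElementTree as ET

DTS = "www.microsoft.com/SqlServer/Dts"  # the DTS XML namespace

root = ET.parse("Package.dtsx").getroot()

# Each DTS:Executable is a unit of ETL logic to migrate.
for ex in root.iter(f"{{{DTS}}}Executable"):
    print("Executable:", ex.get(f"{{{DTS}}}refId"), "-", ex.get(f"{{{DTS}}}ObjectName"))

# Each DTS:PrecedenceConstraint records execution order to reproduce.
for pc in root.iter(f"{{{DTS}}}PrecedenceConstraint"):
    print("Constraint:", pc.get(f"{{{DTS}}}From"), "->", pc.get(f"{{{DTS}}}To"))
```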
With this basic understanding, the migration of an SSIS package to Databricks workflows can be outlined as follows:
- Migrate each “DTS:Executable” to a set of PySpark functions.
- Construct a “main” function to define: a) Parameters b) Connections c) Execution logic
This approach ensures that the structure and functionality of the original SSIS package are preserved in the Databricks environment.
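As a rough illustration of that shape (not the framework described later), a migrated package might look like the following sketch; the paths, table names, and transformation are placeholders:

```python
# A hypothetical skeleton of a migrated package: each DTS:Executable becomes
# a PySpark function, and main() supplies parameters, connections, and the
# execution order that the precedence constraints encoded.
from pyspark.sql import DataFrame, SparkSession

def ingest_landing(spark: SparkSession, source_path: str) -> DataFrame:
    # Stands in for the ingestion DTS:Executable.
    return spark.read.option("header", True).csv(source_path)

def transform(df: DataFrame) -> DataFrame:
    # Stands in for a transformation DTS:Executable.
    return df.dropDuplicates()

def main() -> None:
    spark = SparkSession.builder.getOrCreate()
    # a) Parameters (from the project's "params" files)
    source_path = "/Volumes/main/raw/employees.csv"  # placeholder
    target_table = "main.hub.employees"              # placeholder
    # b) Connections (from the "conmgr" files) map to catalogs/schemas here
    # c) Execution logic mirrors the package's precedence constraints
    transform(ingest_landing(spark, source_path)).write.mode("overwrite").saveAsTable(target_table)

if __name__ == "__main__":
    main()
```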
Migration Approaches
Since an SSIS package is stored as XML, one approach is to build an XML parser that extracts the ETL logic and converts it into the target platform’s code. For Databricks, this could be a set of Python or SQL notebooks, or a Python module.
In this blog, we will demonstrate how to use GenAI to migrate a simple set of SSIS packages found in a public GitHub repository to a Databricks workflow.
Inspired by the talk Franco Patano gave during the Data & AI Summit 2024 (reference 1), we decided to adopt his proposed approach with some modifications. The modified approach is illustrated below:
This approach can be outlined as follows:
- Module 1 (Source Code → Source-Target Mapping): Use GenAI to extract the source-to-target mapping (STTM), including data validation and transformation rules.
- Module 2 (GenAI-Generated STTM → Validated STTM): Have a human validate and improve the GenAI-generated STTM.
- Module 3 (Validated STTM → Databricks Workflows): Apply an existing metadata-driven framework to transform the validated STTM into Databricks workflows, generating a Databricks Asset Bundle YAML file, workflows with tasks, and/or DLT pipelines.
There are several advantages to this approach:
Module 1: This is a net-new component that can be achieved using various GenAI tools. Options include prompt engineering, compound AI, prompt auto-tuning with DSPy, or any of your favorite GenAI tools. With the rapid advances in GenAI, new applicable options can be incorporated for this step.
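As one possible shape for Module 1, here is a hedged sketch that asks a Llama model served on Databricks (via the OpenAI-compatible endpoint) for an STTM. The workspace host, token handling, endpoint name, and prompt are all assumptions:

```python
# A sketch of Module 1 under assumptions: workspace URL, token, and
# serving-endpoint name are placeholders, and the prompt is illustrative.
from openai import OpenAI

client = OpenAI(
    api_key="<DATABRICKS_TOKEN>",  # assumption: a PAT with serving access
    base_url="https://<workspace-host>/serving-endpoints",
)

dtsx_xml = open("Package.dtsx").read()  # hypothetical package path

response = client.chat.completions.create(
    model="databricks-meta-llama-3-1-70b-instruct",  # assumed endpoint name
    messages=[
        {
            "role": "system",
            "content": (
                "Extract a source-to-target mapping from the SSIS package XML. "
                "Return JSON rows with source_column, target_column, data_type, "
                "transformation_rule, and validation_rule."
            ),
        },
        {"role": "user", "content": dtsx_xml},
    ],
)
print(response.choices[0].message.content)
```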
Module 2: While the latest LLMs, such as Claude 3.5 Sonnet, GPT-4o, or Llama 3.1, can achieve high accuracy in terms of correctness and completeness, our experience shows that having a “human in the loop” to validate and improve the responses from the LLM is essential to the success of any ETL migration project.
Module 3: We have an existing metadata-driven solution that can automatically generate Databricks workflows and DLT pipelines from a source-to-target mapping document. This framework consists of comprehensive Python packages developed during our past projects, which can be refactored for this purpose.
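To give a flavor of Module 3 (this is an illustrative fragment, not our actual framework), a generator can walk the validated STTM and emit the SELECT expressions for a target table:

```python
# An illustrative fragment of a metadata-driven generator: turn validated
# STTM rows into the SELECT expressions that a generated notebook or DLT
# table definition would use. The rows below stand in for Figure 6.
import pandas as pd

sttm = pd.DataFrame(
    [
        {"source_column": "FirstName", "target_column": "FirstName", "transformation_rule": None},
        {"source_column": None, "target_column": "CurrentFlag", "transformation_rule": "true"},
    ]
)

select_exprs = [
    f"{row.transformation_rule or row.source_column} AS {row.target_column}"
    for row in sttm.itertuples()
]
print("SELECT\n  " + ",\n  ".join(select_exprs))
```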
Description of SSIS Workflow
For demonstration purposes in this blog, we utilized an SSIS workflow from a public GitHub repository by Ana Escobar (reference 2). This workflow (Fig. 4) ingests employee data from an Excel worksheet and constructs a Type-2 dimension table.
The workflow consists of three SSIS packages:
- Ingestion Package: This package imports data from an Excel sheet into a database table (landing).
- Transformation Package: This package applies transformation and validation rules to the data from the landing table and writes the result to another database table (hub).
- Dimension Table Package: This package constructs a Type-2 dimension table in the data warehouse (gold).
For demonstration purposes, we assume that only “updated records” (adds, deletes, and updates) arrive incrementally.
Analysis of the LLM Response
The Source-to-Target Mapping by GenAI
Figure 5 displays the dataframe of the source-to-target mapping extracted from the response generated by the Llama 3.1 model served on a Databricks model serving endpoint.
Here are the observations:
- The Llama 3.1 model correctly identifies “derived” columns (CurrentDate, CurrentFlag) and the appropriate transformation rules for both.
- The data types for some columns (FirstName, LastName, City, AddressLine, Country, EmployeeID) are identified as “wstr,” which is likely the SSIS-specific DT_WSTR type. We can modify the prompt to map these to the “string” type, or normalize them in post-processing, as sketched after this list.
- No validation rules are extracted. This is consistent with the SSIS package, which indeed contains no validations.
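A minimal post-processing sketch, assuming the model’s STTM output has been loaded into a pandas DataFrame as in Figure 5; the type map shown is partial and illustrative:

```python
import pandas as pd

# Assumption: the model's STTM output loaded as a DataFrame (as in Figure 5).
sttm = pd.DataFrame({"target_column": ["FirstName", "Salary"],
                     "data_type": ["wstr", "numeric"]})

# Normalize SSIS-specific type tokens to Spark SQL types (partial map).
SSIS_TO_SPARK = {"wstr": "string", "str": "string", "numeric": "decimal(18,2)", "i4": "int"}
sttm["data_type"] = sttm["data_type"].map(lambda t: SSIS_TO_SPARK.get(t, t))
print(sttm)
```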
Validated STTM
Figure 6 displays the human-validated source-to-target mapping. We added two columns: `validation_name` and `transformation_name`. Additionally, we incorporated several validation rules based on the DDL of the `DimEmployee` table. A special rule, `salary_between_20k_100k`, was added to test the Databricks DLT data quality implementation.
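For context on how such a rule lands in DLT, here is a minimal sketch (not the framework’s generated code) of expectations applied to a silver table; the table names and the second rule are assumptions:

```python
# A minimal sketch of STTM validation rules expressed as DLT expectations.
# Runs inside a DLT pipeline. Table names and the second rule are
# assumptions; salary_between_20k_100k comes from the validated STTM (Fig. 6).
import dlt
from pyspark.sql import functions as F

rules = {
    "salary_between_20k_100k": "Salary BETWEEN 20000 AND 100000",
    "employee_id_not_null": "EmployeeID IS NOT NULL",  # assumed rule
}

@dlt.table(name="silver_employee")
@dlt.expect_all_or_drop(rules)  # rows failing any expectation are dropped
def silver_employee():
    return (
        dlt.read_stream("bronze_employee")  # assumed upstream table
        .withColumn("CurrentDate", F.current_date())  # derived column from the STTM
        .withColumn("CurrentFlag", F.lit(True))
    )
```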
Auto-generated Databricks Workflows/DLT Pipeline
Applying the existing metadata-driven ETL framework to the human-validated STTM (Fig. 6), we generated a Databricks DLT pipeline (DLT-STTM-SSIS-01) and a Databricks workflow (DIM_RAW_TO_GOLD).
The workflow consists of three tasks:
- Excel2CSV: This “notebook” task extracts employee data from an Excel worksheet and saves it to cloud storage as a CSV or Parquet file.
- DLT-STTM-SSIS-01: This task executes a DLT pipeline (Figure 8) that was auto-generated by the existing metadata-driven ETL framework. The pipeline includes four steps:
  a. Ingest Data: Ingest employee data from cloud storage into a Bronze table using Databricks Autoloader.
  b. Apply Rules: Apply the data quality and transformation rules from the STTM table (Figure 6). In this DLT pipeline, we used “expect_or_drop” rules. If desired, data that fails validation can be sent to a quarantine table (not implemented).
  c. Build Gold Tables: Construct two gold tables: `gold_employee` (a Type-1 table) and `gold_employee_scd2` (a Type-2 table); see the sketch after this list.
  d. Create Materialized View: Create a materialized view that matches the `DimEmployee` table from the SSIS package, except for the identity column (EmployeeKey).
- Dim_Surrogate_Key_Maintenance: As the name suggests, this task maintains the identity column of the `DimEmployee` table. Currently, Databricks DLT does not support identity columns, so this optional task is necessary for customers who wish to implement the Kimball model.
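As background on step c, a Type-2 gold table like `gold_employee_scd2` can be produced with DLT’s `apply_changes` API. This is a minimal sketch with assumed key, sequencing, and source names, not the framework’s generated code:

```python
# A minimal sketch of building the Type-2 table with DLT's CDC API.
# Runs inside a DLT pipeline; the key, sequencing column, and source
# table names are assumptions.
import dlt

dlt.create_streaming_table("gold_employee_scd2")

dlt.apply_changes(
    target="gold_employee_scd2",
    source="silver_employee",   # assumed validated upstream table
    keys=["EmployeeID"],        # business key from the STTM
    sequence_by="CurrentDate",  # assumed ordering column
    stored_as_scd_type=2,       # keep history as a Type-2 dimension
)
```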
Takeaways
- The framework proposed by Franco Patano at the Data & AI Summit 2024 is practical and can effectively reduce the complexity and effort involved in migrating a legacy ETL platform to Databricks workflows.
- LLMs such as Llama 3.1 or GPT-4 are highly effective at extracting source-to-target mapping information from an SSIS package, even with only basic prompt engineering.
- Existing metadata-driven ETL solutions can be leveraged to further reduce the effort and time required for migration.
References
- The Rise of Medallion Mesh (https://www.databricks.com/dataaisummit/session/rise-medallion-mesh-repeated) — Talk at the Data & AI Summit 2024 by Franco Patano
- SSIS_DWH (https://github.com/escobarana/SSIS_DWH) — GitHub Repo by Ana Escobar