Transformation Generator: Transformation Pipeline Automation

Andrew Hancock
Johnson & Johnson Open Source
4 min read · Oct 31, 2022

Johnson & Johnson is proud to announce our newest open source project: Transformation Generator.

As software developers, it is our job to understand the challenges, needs, and requirements of our users. This understanding enables us to build systems that empower users to do things they could not do before, or to dramatically simplify what they are already doing. As part of the development process, we often encounter tasks and challenges that are not directly related to the user’s needs: ensuring code quality, setting up infrastructure, deploying systems to different environments, and more. Many of these tasks are repetitive, predictable, time-consuming, and tedious.

What can we do to spare developers from spending time on these tasks so they can focus on solving the user’s challenges? The answer, of course, is automation. After building several data pipelines, we developed a system to automate many aspects of pipeline development.

What is a Data Pipeline?

A data pipeline is a set of processes that automate the movement and transformation of data between a source system and a target repository. Such pipelines are common in data-lake and data-warehouse systems that ingest data from many different sources. The source systems may differ in many ways, including how tables are organized, the names and types of fields, and even the level of granularity of those tables.

Within a data pipeline, individual transformation activities or processes read from one or more objects or tables, apply the desired transformation, and write a transformed object or table. These activities also need to be orchestrated to ensure that the transformations are applied in a safe order and executed in parallel where possible. Failing to do so could produce incorrect or inconsistent results and degrade performance.

What is Transformation Generator?

Transformation Generator is a set of Python command-line programs and REST APIs which automatically generate the transformation and orchestration code for a data pipeline. Out of the box, it can target ANSI SQL for transformation and Azure Databricks for orchestration. Alternate platforms can also be targeted by providing custom code generators.

How does it work?

Transformation Generator follows a declarative approach to development. In declarative programming, the user specifies “what” needs to be done, in contrast to imperative programming, which describes “how” to do it. The declarative approach is used in many areas of software development, including build systems, infrastructure as code, functional programming, and SQL itself. It is more concise because it allows the user to focus on describing only what is needed, sparing them the effort of conceiving and describing the steps to achieve it.
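The contrast is easiest to see with SQL itself. Below is a minimal sketch (using Python's standard-library sqlite3 with a made-up `orders` table) that computes the same result both ways:

```python
import sqlite3

# A hypothetical in-memory table, purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, 10.0), (2, 55.0), (3, 70.0)])

# Imperative: describe *how* -- iterate over rows, test, accumulate.
large = []
for order_id, amount in conn.execute("SELECT id, amount FROM orders"):
    if amount > 50:
        large.append(order_id)

# Declarative: describe *what* -- the engine decides how to evaluate it.
declarative = [row[0] for row in
               conn.execute("SELECT id FROM orders WHERE amount > 50")]

print(large, declarative)  # both yield [2, 3]
```

The declarative version states only the condition; the scan, comparison, and accumulation are left to the query engine.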

The user begins by specifying “data mappings.” A data mapping is a set of field-level transformations. Each transformation within a data mapping specifies a target table, a field within that table, and a “transformation expression.” The transformation expressions are written as SQL queries and describe the source tables, join conditions, and any transformations to be applied to the source fields to determine the target value.
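A data mapping of this shape might look like the following sketch. The field names and structure here are illustrative, not the tool's actual schema:

```python
# Hypothetical data mapping: each entry targets one field of one table.
# The "expression" is a SQL fragment naming the source tables, join
# conditions, and transformations that produce the target value.
data_mapping = [
    {
        "target_table": "dim_customer",
        "target_field": "full_name",
        "expression": "SELECT c.first_name || ' ' || c.last_name "
                      "FROM stg_customers c",
    },
    {
        "target_table": "fact_sales",
        "target_field": "order_total",
        "expression": "SELECT SUM(o.amount) FROM stg_orders o "
                      "JOIN dim_customer c ON o.customer_id = c.customer_id",
    },
]

targets = {m["target_table"] for m in data_mapping}
print(sorted(targets))  # ['dim_customer', 'fact_sales']
```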

Data mappings can be specified in CSV or JSON format. Transformation Generator will parse the data mappings and transformation expressions into Abstract Syntax Trees (ASTs) to facilitate further analysis and code generation.
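The real tool parses each expression into a full AST; as a much-simplified sketch of the analysis it enables, the snippet below pulls the source table names out of a transformation expression with a regular expression (the expression text is hypothetical):

```python
import re

def source_tables(expression: str) -> list[str]:
    """Toy extraction of table names following FROM/JOIN keywords.

    A real SQL parser builds an AST and handles aliases, subqueries,
    and quoting; this regex is only for illustration.
    """
    return sorted(set(re.findall(r"\b(?:FROM|JOIN)\s+(\w+)", expression, re.I)))

expr = ("SELECT SUM(o.amount) FROM stg_orders o "
        "JOIN dim_customer c ON o.customer_id = c.customer_id")
print(source_tables(expr))  # ['dim_customer', 'stg_orders']
```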

One key analysis step is dependency analysis, which builds a graph of the data flow within all the data mappings. A dependency graph tells you precisely how data tables or objects depend on each other. Transformation Generator creates a dependency graph from all of the data mappings so that it can make intelligent decisions about scheduling and orchestrating the transformation activities.
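The scheduling idea can be sketched with Python's standard-library `graphlib`: given table-level dependencies (the table names below are hypothetical), a topological sort yields batches that must run in order, while everything within a batch can run in parallel:

```python
from graphlib import TopologicalSorter

# Hypothetical dependencies: each target table maps to the set of
# source tables its transformation reads from.
deps = {
    "stg_orders": {"raw_orders"},
    "stg_customers": {"raw_customers"},
    "dim_customer": {"stg_customers"},
    "fact_sales": {"stg_orders", "dim_customer"},
}

ts = TopologicalSorter(deps)
ts.prepare()
batches = []
while ts.is_active():
    ready = sorted(ts.get_ready())  # everything here can run in parallel
    batches.append(ready)
    ts.done(*ready)

for i, batch in enumerate(batches, 1):
    print(f"batch {i}: {batch}")
```

Here `fact_sales` is correctly deferred until both of its inputs are built, while the two staging tables can load side by side.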

Once all parsing and analysis is complete, the code can be generated. The transformation code generator aggregates all data mappings that target the same table into a single SQL query. The orchestration code generator produces JSON files which can be uploaded directly to Microsoft Azure. Leveraging the dependency graph allows the orchestration code to define dependencies precisely, so that all transformation activities are performed in a safe order while exploiting parallel execution where possible. These are some of the ways Transformation Generator streamlines the coding process for the developer.
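The transformation side of that generation step can be sketched as follows: group the field-level mappings by target table and emit one query per table. The mapping structure, field names, and helper below are illustrative, not the tool's actual API:

```python
# Hypothetical field-level mappings for a single target table.
mappings = [
    {"target_table": "dim_customer", "target_field": "customer_id",
     "expression": "c.id"},
    {"target_table": "dim_customer", "target_field": "full_name",
     "expression": "c.first_name || ' ' || c.last_name"},
]

def generate_sql(mappings, source_clause):
    """Aggregate all mappings for each target table into one query."""
    by_table = {}
    for m in mappings:
        by_table.setdefault(m["target_table"], []).append(m)
    queries = {}
    for table, fields in by_table.items():
        select_list = ",\n  ".join(
            f"{f['expression']} AS {f['target_field']}" for f in fields)
        queries[table] = (f"INSERT INTO {table}\nSELECT\n  "
                          f"{select_list}\n{source_clause}")
    return queries

sql = generate_sql(mappings, "FROM stg_customers c")
print(sql["dim_customer"])
```

Both field mappings land in a single `INSERT ... SELECT`, which is the aggregation-per-target-table behavior described above.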

We have tested Transformation Generator with Azure Databricks and Data Factory.

Transformation Generator Roadmap

In addition to general improvements to code quality, testing, and functionality, we are currently working on the following enhancements to Transformation Generator:

· Migrating all command-line functionality to our FastAPI REST services

· A new, modular CLI using the Typer library

· A cleaner interface to the code generators

· Additional code generators for orchestration, such as Airflow or Control-M

Closing Thoughts

Hopefully we’ve given a good overview of what Transformation Generator can do. By specifying the desired transformations in a declarative fashion, you can let the automatic analysis and code generation figure out the implementation details. This gives developers more time to concentrate on other aspects of their projects while the generator does the tedious work for them. Automating these aspects of development also reduces the opportunity for human error, which can save significant time and money and prevent other negative consequences. Users can further extend the system through custom code generators, allowing a single code base of data mappings to target many different platforms.

We are always looking to fix bugs, improve usability, and add functionality so please let us know if you have thoughts or suggestions.


Andrew is a Software Developer and Data Engineer with over 20 years of professional experience.