Data Warehouse Automation

Ganesh Nathan
9 min read · Feb 15, 2022


Introduction

Data warehouse automation (DWA) has become a household name in the data warehouse space. As a consultant, I see several clients explore DWA capabilities with varied degrees of success. As this space takes shape and matures, some core capabilities are consolidating as the pillars of DWA, and off-the-shelf tools from various vendors have started to enter the market to meet the demand. This article attempts to explain DWA components and discusses Fishtown Analytics' (now dbt Labs) open-source solution, the Data Build Tool (DBT), in detail.

Why

The need for data warehouse automation comes to light when we compare application development with data warehouse development. Modern app developers use serverless infrastructure, build the application in parallel, and perform continuous integration and continuous deployment. Being agile and nimble allows app developers to innovate and iterate quickly, and a change in scope or logic has less impact because of the incremental, agile development. Most app developers write unit tests during the build phase, creating test suites concurrently. This results in seamless automated testing.

In the data warehouse space, the data engineer is bound by the data model. A data warehouse strives to publish data in an optimized format so the business can quickly get answers to its questions, and that optimization is achieved through the data model. Business questions can span several modules or even several applications. “What are my sales for this quarter?” may have a simpler data model than “Why are my sales for this quarter low?”. Adding AI/ML use cases to the equation gets complex quickly. Changing a ‘one to many’ relationship to ‘many to many’ in an existing warehouse is not for the faint of heart. It is hard to bring agile, test-driven development into the data warehouse.

What

I see DWA as a set of processes put in place to automate the data warehouse lifecycle. DWA focuses on improving productivity, quality, and cost, which it achieves through code generation, effective change management, enhanced documentation, and granular testing.

Lifecycle

Staging data, transforming it, and publishing it in a query-optimized fashion are the typical steps performed in a warehouse. The heart of the warehouse is the transformation of data with a data model that aligns with the business need. DWA establishes processes and provides tools to automate the data warehouse lifecycle.

Productivity

One of the common complaints about data warehouses is how slowly they embrace change. When new requirements come in, or existing logic changes, adoption takes considerable time depending on the complexity and the historical implications of the change. DWA paves the path for a faster response to change: it supports agile development and targets quick adoption, thereby improving overall productivity.

Cost

DWA relies on automation. This reduces reliance on expensive consulting resources and empowers data engineers with tools to quickly deliver solutions for the most common use cases, which has a significant impact on cost. The focus shifts from infrastructure to data, resulting in faster, cheaper solutions delivered iteratively and incrementally.

Code Generation

DWA is all about code generation. DWA consumes the data model and generates code reflecting it. This allows data modelers to focus on and tweak the model faster, and helps data engineers see the big picture during development. Well-established best practices are embedded in the generated code, elevating the overall quality.

Change Management

DWA manages code changes and supports continuous integration and continuous deployment. The proven best practices of application development are now open for data engineers to embrace.

Documentation

DWA allows data engineers to document while developing the solution and provides automated documentation generation. A byproduct of this automation is data lineage, which has always been a battle in traditional data warehouses. Since lineage is generated, it can be scheduled and versioned, resulting in effective governance.

Quality/Testing

DWA opens up the opportunity for Test Driven Development (TDD). Data engineers can build unit tests while building the solution and compile them into test suites. This helps validate their assumptions and continuously checks for quality, thereby improving the overall reliability of the solution.

How

There are a few vendors and tools in the marketplace, such as WhereScape, Astera, and the Data Build Tool (DBT), offering compelling DWA solutions. In this article I will explore the open-source DWA tool DBT deployed on Amazon Web Services (AWS). AWS has changed the way we handle infrastructure when building software solutions. With DBT's features and serverless infrastructure capabilities from AWS, one can build a scalable and reliable data warehouse solution.

DBT is an open-source framework that provides several DWA capabilities. I will explore a couple of its core features with a simple use case.

dbt is a transformation workflow that lets teams quickly and collaboratively deploy analytics code following software engineering best practices like modularity, portability, CI/CD, and documentation. Now anyone who knows SQL can build production-grade data pipelines.

These features are not new to software development; they have all been tried and vetted in typical application development. Bringing them to the data warehouse, with model automation and a simple SQL interface for automating pipelines, is truly a game changer.
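As a minimal sketch of what this looks like in practice (the table, column, and file names below are invented for illustration and are not from the article's project), a dbt model is just a SELECT statement saved as a .sql file; dependencies are declared with ref(), and dbt materializes the results in the warehouse:

```sql
-- models/staging/stg_orders.sql (illustrative names)
-- A dbt model is only a SELECT; dbt creates the table or view for you.
select
    order_id,
    customer_id,
    order_date,
    amount
from {{ source('raw', 'orders') }}

-- models/marts/daily_revenue.sql (illustrative names)
-- ref() declares the dependency, so dbt builds stg_orders before this model.
select
    order_date,
    sum(amount) as total_revenue
from {{ ref('stg_orders') }}
group by order_date
```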

Model as Code

DWA allows data modelers to build a Star Schema, ODS, Data Vault 2.0, 3NF, Wide Format, etc., whichever best fits the business requirement. Here the pipeline, infrastructure, and programming language are abstracted. Standards are enforced, and documentation/lineage is generated and persisted with versions of the model.

In the Data Build Tool (DBT), the modeler or the data engineer analyzes the source data, which is loaded into the staging area. The modeler identifies the business keys and, optionally, the columns to be tracked for changes in the dataset. Exploring what a business key is and how to track slowly and rapidly changing datasets is beyond the scope of this article; DBT can track these changes elegantly in an automated fashion, and more information about modeling in DBT can be found in the references section. From here, the code (SQL statements) can be generated from the data model, taking into account the keys mentioned above.
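As a hedged sketch of how that change tracking can look in DBT (the source table, key, and column names below are hypothetical), a snapshot declares the business key and the columns to watch, and dbt maintains the change history automatically:

```sql
-- snapshots/customer_snapshot.sql (illustrative; table and column names are assumptions)
{% snapshot customer_snapshot %}

{{
    config(
        target_schema='snapshots',
        unique_key='customer_id',        -- the business key identified by the modeler
        strategy='check',
        check_cols=['name', 'address']   -- columns tracked for changes
    )
}}

select * from {{ source('raw', 'customers') }}

{% endsnapshot %}
```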

Let's take a use case everyone can relate to: a worldwide COVID patient dataset. Assume we receive world COVID patient data every day and want to generate reports for analytical needs. We get seed data for countries as a one-time load and patient data every day. The first step is to build a data model on the given patient dataset. Following is the first iteration of the data model for the patient data. Here I am using Data Vault 2.0, a Kimball star schema, and a wide format to showcase DBT's capabilities in handling complex models.

The above model has four layers: the ingestion layer, where the data is ingested in its raw format from the source; the curation layer, where the data is transformed with business rules and the Data Vault 2.0 model is implemented; the distribution layer, which reflects the information mart per DV 2.0 modeling, in simple terms Kimball's star schema; and the publishing layer, where data is transformed into a wide format to help with reporting.

This model can be generated as SQL statements in DBT. I used AWS Lambda to generate the SQL statements for DBT as a starting point for the data engineers. The models were persisted as individual SQL files, each representing a table. Here is a screenshot of the generated model.

The models folder in the screenshot represents the DBT models, and the folders under it, such as 0_raw_sources, 1_raw_vault, 2_bus_vault, etc., represent the data model shared above.

The generated code follows the data model as designed by the modeler, and the sequence number represents the layer in the data model.

For example, 0_raw_sources represents the ingestion layer, holding the raw data sources in their original format. Here the data format is preserved, reflecting the source.

1_raw_vault represents the curation layer, and its subdirectories are created to reflect the sequence of data flow, i.e., 1_hub generates the hub tables, 2_link generates the link tables, 3_sat generates the satellite tables, and so on.
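To make the folder layout concrete, a generated hub model might look roughly like the sketch below. The hashing approach, staging model name, and record source are my assumptions for illustration, not the article's actual generated output:

```sql
-- models/1_raw_vault/1_hub/hub_patient.sql (illustrative sketch of a generated hub)
-- A Data Vault hub keeps one row per business key, plus a hash key and load metadata.
select distinct
    md5(upper(trim(patient_id))) as patient_hkey,    -- hash key derived from the business key
    patient_id,                                      -- business key from the source
    current_timestamp            as load_datetime,
    'covid_daily_feed'           as record_source    -- hypothetical source system name
from {{ ref('stg_patient') }}
where patient_id is not null
```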

Generating the code from the model expedites the development process for the data engineer. The generated code follows organizational standards and enforces governance.

Test Driven Development (TDD)

The generated code is handed over to the data engineer to add transformation rules that curate the dataset. As in application development, data engineers can choose to start with test cases and build test suites. Initial execution will result in failure messages; this is by design, allowing the data engineer to work their way to a success message, essentially validating the business rule and/or transformation. The benefit of this approach is that for any new change the entire test suite is executed, ensuring all existing business logic is preserved and the code change is bound to the new logic. This helps improve quality and deliver functionality in quick iterations.

In our patient data, let's assume the data from Brazil and Australia has no ‘age’ attribute but does have a ‘birth_year’ attribute. The business rule is to calculate the patient's age using birth_year. By modifying the SQL statement in DBT, the data engineer can calculate the age.

This is an example of a simple business rule. However, complex validations and transformations can be achieved in DBT using the same approach. As data is transformed, it can be validated for accuracy.
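A minimal sketch of that rule inside a dbt model follows; the model, column, and file names are assumed for illustration:

```sql
-- models/2_bus_vault/sat_patient_details.sql (illustrative excerpt)
-- Fall back to birth_year when the source (e.g. Brazil, Australia) does not provide age.
select
    md5(upper(trim(patient_id))) as patient_hkey,
    case
        when age is not null        then age
        when birth_year is not null then extract(year from current_date) - birth_year
    end                          as age,
    current_timestamp            as load_datetime
from {{ ref('stg_patient') }}
```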

The unit test case for this age calculation can be added alongside the SQL statement, and DBT executes this unit test along with the other test cases, validating the code holistically.

The screenshot to the right provides an example of a unit test script that checks that the column patient_hkey in the hub_patient table has no nulls and contains only unique values. Complex test cases can also be created using DBT. Please find the link to custom generic tests in the references section below for additional details.
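In dbt, this particular check is typically declared with the built-in unique and not_null generic tests in a properties file; an equivalent singular test written as plain SQL is sketched below (the file and model names are assumptions). dbt treats any rows returned by the query as test failures:

```sql
-- tests/assert_hub_patient_hkey_unique_and_not_null.sql (illustrative singular test)
-- Return hash keys that are null or duplicated; zero rows returned means the test passes.
select
    patient_hkey,
    count(*) as occurrences
from {{ ref('hub_patient') }}
group by patient_hkey
having patient_hkey is null
    or count(*) > 1
```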

Data engineers are armed with the DBT test framework to validate their solution and verify their assumptions during development. Test suites can be compiled to test the entire solution in an automated fashion, and DBT can execute the test suites on a schedule if needed.

Code Management

Concurrent development has always been an elusive feature for data engineers, and in traditional data warehouse development it remains a challenge. DBT allows data engineers to work on the same dimension, fact, hub, satellite, etc. at the same time with confidence, as they can merge their changes effectively.

As part of code management, version control is enabled and each data engineer can create a feature-driven branch to work independently on the baseline code and then merge their changes with others'. This promotes parallel development and allows team members to deliver solutions in quick iterations. Critical features are easily promoted to production as a single unit of change and can be rolled back if desired.

Enabling continuous integration empowers data teams to deliver solutions incrementally and in quick iterations. This is a game changer for the data warehouse space.
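A hedged example of what that feature-branch workflow can look like for a dbt project (the branch, model, and commit names are invented for illustration):

```bash
# Work on a change in isolation, build and test only what changed, then merge via pull request.
git checkout -b feature/age-from-birth-year
# ...edit the relevant model SQL...
dbt run --select sat_patient_details      # build just the changed model
dbt test --select sat_patient_details     # run its tests before review
git add models/ tests/
git commit -m "Derive age from birth_year where age is missing"
git push origin feature/age-from-birth-year
```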

Orchestration

Pipelines that execute code to transform and load data can be auto-generated and scheduled. DWA provides a single framework to build, schedule, and execute the pipeline. DBT has a built-in framework to execute the SQL statements individually or as a collection, based on the need, and the execution sequence can also be scheduled.

dbt lets you transform data where it lives, using just SQL SELECT statements to infer dependency graphs and run transformation models in order, without writing DDL & DML.
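In practice this orchestration is driven by a handful of dbt CLI commands. A few representative invocations are shown below; the model and folder names in the selectors are assumptions, and the selection syntax assumes a reasonably recent dbt version:

```bash
dbt run                                        # build every model in dependency order
dbt run --select path:models/1_raw_vault       # build only the raw vault layer
dbt run --select +wide_patient_report          # build a model plus everything it depends on
dbt test                                       # execute the project's test suite
dbt build                                      # run models and tests together, in DAG order
```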

Continuous deployment allows data teams to deliver in small iterations and get feedback from the consumers quickly. This opens up the opportunity to be nimble and agile. This is a game changer in the data warehouse space.

Doc Generation

Documenting code has always been a challenge in data warehouse development. Since DWA enables model-driven development, data lineage is a byproduct. The generated lineage documentation provides deeper insights into dataset relationships.

The following diagram shows the lineage generated from DBT, outlining the relationships between the tables. This gives data engineers and data modelers an opportunity to validate the code and the model concurrently.

When a new change is desired, the nested relationships of dependent objects are displayed visually to help forecast the impact of the change. Version controlling the documentation and lineage makes it possible to track historical changes to the data model.
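For reference, dbt produces and serves this documentation and lineage with two commands:

```bash
dbt docs generate   # compile the project and write the catalog and lineage metadata
dbt docs serve      # browse the generated documentation site and DAG locally
```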

Conclusion

I hope this article provides insight into the data warehouse automation space. I have successfully implemented several data lakes at petabyte scale and have deep understanding and experience in deploying enterprise data solutions. Please feel free to reach out to me if you need any additional details.

References

Originally published at https://www.linkedin.com.
