Data warehouse test automation strategy in Redshift

Ismail Addou
Doctolib
Jun 13, 2022 · 12 min read
Wind tunnel testing
image source: https://www.plm.automation.siemens.com/global/fr/webinar/wind-tunnel-testing/59234

At Doctolib’s data core team, we aim to apply changes and deliver new parts of our datamarts every day. Our challenge is therefore to keep the workflows resilient despite the high delivery rate of Redshift SQL queries.

This blog post presents how we manage to ensure a daily SQL code integration that runs “safely” in Redshift, in order to keep both data analysts and data engineers happy 😃 (actually, all the dashboard users)!

We implemented a testing pipeline for that. In this article, we describe the progression from a basic solution to a CI pipeline that automatically rejects bad SQL queries.

A CI adapted for SQL?

CI (Continuous Integration) was born in web development teams to automate testing and code integration. In general, tests are suited to deterministic applications: they should run quickly and block bad Git pull requests from being merged.

However, for SQL code there is no standard or common way to test, because scripts cannot be anticipated. This non-deterministic aspect makes things more challenging, especially when each task in the workflow contains hundreds of lines of analytical queries.

In this blog post, we are going to focus on how we manage to integrate new SQL scripts or changes and keep the workflows from failing, without tackling the data quality problem.

Our goal is to make sure that the pipeline goes green, independently of how accurate your SQL query results are!

Pareto principle

Most issues occur after code is deployed to production, and the bulk of them are either syntactical or database administration issues! The remaining part is caused by changes coming from external data sources, which are hard to anticipate. Fortunately, they are not frequent!

Pareto principle applied to data warehouse issues

Technical context

The solution works mainly for Airflow and Redshift. We also use a database migration tool and a CI tool, both of which could be replaced, since the logic stays the same.

We will walk through the small steps that took us from a basic solution to the one we currently use at Doctolib’s data team.

The Flyway and Airflow combo

(You can skip this part if your datawarehouse is a NoSQL tool)

Redshift is a strongly typed database, which is why workflows are sensitive to changes in the structure of incoming data. It is therefore mandatory to anticipate those changes by upgrading the Redshift schemas accordingly.

For this purpose, we use a database migration tool called Flyway as a central service that executes the migrations on our environments. It also acts as database version control, keeping the migration history on track.

Flyway and Airflow work together to execute migrations and scheduled queries. For that reason, both codebases live in the same Git repository, so they are reviewed, tested and pushed synchronously, because they only make sense together.

If, for example, you have a new table to add to your workflow, you would first create this table in Redshift through Flyway, then fill it with data via Airflow every day. It won’t work if the PR contains only the Airflow change.
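Below is a minimal sketch of the combo, with purely hypothetical names, assuming the Postgres provider is used to talk to Redshift: a Flyway migration (e.g. `V42__create_daily_appointments.sql`) creates the table, and an Airflow task fills it every day. Both changes ship in the same PR.

```python
# Hypothetical Airflow DAG; the table it fills is created by a Flyway
# migration (e.g. V42__create_daily_appointments.sql) shipped in the same PR.
from datetime import datetime

from airflow import DAG
from airflow.providers.postgres.operators.postgres import PostgresOperator

with DAG(
    dag_id="fill_daily_appointments",         # hypothetical DAG id
    start_date=datetime(2022, 6, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    fill_table = PostgresOperator(
        task_id="fill_daily_appointments",
        postgres_conn_id="redshift_default",  # hypothetical connection id
        sql="""
            INSERT INTO analytics.daily_appointments
            SELECT appointment_date, COUNT(*) AS nb_appointments
            FROM raw.appointments
            GROUP BY appointment_date;
        """,
    )
```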

Step 1: Re-run everything in Preproduction

We can’t call it a CI; it is rather a workflow double check.

We have a pre-prod environment which has both Airflow and Redshift. The first approach was to deploy a release candidate to pre-production and trigger all the critical DAGs in it. Then, if everything goes green, the “duty” is allowed to push this same release to production. (Duty: a team member who takes care of the release process for a week.)

Code reviews are not sufficient to prevent human mistakes before pushing, because we need to test all the new commits together. For example, a column renaming in one commit, or a simple typo, is enough to ruin the next night’s run. This double check at least enhances our confidence that the different changes are compatible.

In addition, we test the collection steps using real data. The advantage of this practice is to anticipate structural data changes coming from external sources and fix them before going to prod.

All the eggs are in the same basket. The eggs here are the “bad” and “good” commits, and the basket is pre-prod!

However, if problems are detected in the release candidate, this represents a heavy workload for the “duty”. They have to do a lot of manual work to fix the bugs, restart the failed steps, cherry-pick the fixes 🍒 and push the repaired release to production. In general, we do not restart the entire test workflow after fixing pre-production, because this testing process takes a few hours to complete.

So far, this solution is considered a quick win! We’ve only gained more confidence in the versions that go into production, but it’s not perfect, because it means a lot of manual work for the “duty”.

Pros :

  • 👍 Decreases crash rate on production
  • 👍 Anticipates structural data changes coming from external sources

Cons :

  • 👎 Tedious job for the duty to repair the pre-prod and deliver the same day
  • 👎 The fixed release-candidates are not tested from end-to-end

Among all the problems previously encountered in the workflows, three kinds of issues come up a lot:

  • 🚀 Unpredictable data sources changes are mostly resolved 🚀
  • ⚠️ Human mistakes (syntactical errors)
  • ⚠️ Coordination mistakes (database conflicts like column renaming, table renaming …)

In the following sections, we will explain how to avoid the other two types of problems: “human mistakes” and “coordination issues”. These two problems are only related to data structures, which is good news, because we won’t need data for the solution we are building.

Step 2: Introducing the lightweight tests

The burden of testing with data is the long duration of the tests. The tests need to be fast in order to detect and correct problems quickly. Without data, we are able to run quick tests several times in the same day.

Each lightweight test generates a database similar to production in terms of structure, but with zero data. Hence we became able to run six light tests during the morning, then start the tests with data explained earlier before going to prod.

The light database is recreated before every new test iteration. So we had more time to fix and revert bad PRs during the morning, in order to be ready for the ultimate “test with data”. Let’s call it the heavy test!

To create a database clone with zero data in Redshift, we can generate DDLs from the production database and use them to create the assets in the new database, as sketched below.
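As an illustration, here is a minimal sketch of that generation step, assuming the AWS Labs view `admin.v_generate_tbl_ddl` (from the amazon-redshift-utils repository) is installed on the source cluster; connection details and schema filters are hypothetical.

```python
# Minimal sketch: dump the CREATE TABLE statements of the source database
# into a single SQL script that can recreate the structure with zero data.
import psycopg2

conn = psycopg2.connect(
    host="prod-cluster.example.eu-west-1.redshift.amazonaws.com",  # hypothetical
    port=5439,
    dbname="prod",
    user="ddl_reader",
    password="change-me",
)

with conn.cursor() as cur:
    # Each row is one line of a CREATE TABLE statement; concatenating the
    # rows in order gives a script that recreates all tables.
    cur.execute(
        """
        SELECT ddl
        FROM admin.v_generate_tbl_ddl
        WHERE schemaname NOT IN ('pg_catalog', 'information_schema')
        ORDER BY schemaname, tablename, seq;
        """
    )
    ddl_script = "\n".join(row[0] for row in cur.fetchall())

with open("recreate_assets.sql", "w") as f:
    f.write(ddl_script)
```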

Results

The duty has more time to detect bugs earlier, thanks to the multiple light test iterations. However, the heavy repair work remains on their shoulders.

More importantly, we expected a drastic decrease in test duration when running the workflows without data. It actually takes about an hour. This is much less than the heavy tests, but it should be in the order of minutes to be satisfactory. This longer-than-expected duration is due to the waiting time between tasks, which can be longer than the execution itself.

It is actually the Airflow scheduler that is behind this latency. By design, Airflow’s scheduler constantly loops over the metadata database as well as the DAGs directory, in order to update task states and decide which ones to trigger. So this latency will keep increasing as the workflows grow bigger.

Pros:

  • 👍 More time to react to have a safe pre-prod
  • 👍 Queries run very quickly on Redshift (in seconds)

Cons:

  • 👎 Waiting time between tasks is not optimised
  • 👎 Test workflow duration exceeds one hour
  • 👎 The scheduler’s latency can only be reduced by improving the infrastructure; thus, running more tests in less time systematically comes at a high cost.

Step 3: Using a CI tool

Let’s transform this “double check process” into an efficient integration pipeline using a CI tool.

The idea here is to isolate the test environment for each development branch, so that the CI tool triggers the same test pipeline for each new commit in the repository.

With this tool, tests run in isolated environments, in parallel, at the pull request level. This way, “bad code” cannot reach the main branch, and then the release, as long as the tests are not green. Now most of our releases are good and, last but not least, it no longer depends on the “duty” to make it work.

The key to our “CI success” is environment isolation. It allows multiple tests to run for multiple commits at the same time. To achieve that, we create an isolated and ephemeral Redshift database and Airflow environment for each new commit in the PRs.

Ephemeral databases

To create an isolated Redshift environment, we create a database that we drop once the test is complete. Once the database is instantiated, we clone all the assets we already have on the production database into this ephemeral database, without copying the data! The entire recreation script is generated with DDL queries, which dynamically produce the SQL needed to clone the production database structure.

I advise separating the CI process from the production environment; it is therefore not recommended to query the production database directly from the CI to get the DDLs. So what do we do? We generate this big SQL query that recreates all the assets (schemas, tables, views, UDFs …) after each Flyway execution, because that is the only moment when the database structure changes.

For that, we created a dc2.large Redshift cluster dedicated to tests, which is the lowest configuration available in Redshift. The good news: even with a starter configuration, it is really quick to instantiate and populate a new database. 15 seconds is enough to create hundreds of schemas and thousands of tables! But beware of quota limits: the large instance type has a limit of 9,900 tables per cluster. To raise this quota, an xlarge instance type or bigger is required.
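To give an idea of the lifecycle, here is a minimal sketch, with hypothetical connection details, that creates the ephemeral database, replays the generated DDL script, and drops everything at the end.

```python
# Minimal sketch of the ephemeral database lifecycle on the test cluster.
import uuid
import psycopg2

db_name = f"ci_{uuid.uuid4().hex[:8]}"   # one database per commit

# CREATE/DROP DATABASE cannot run inside a transaction, hence autocommit.
admin = psycopg2.connect(host="test-cluster.example.eu-west-1.redshift.amazonaws.com",
                         port=5439, dbname="dev", user="ci", password="change-me")
admin.autocommit = True
with admin.cursor() as cur:
    cur.execute(f"CREATE DATABASE {db_name}")

try:
    ci = psycopg2.connect(host="test-cluster.example.eu-west-1.redshift.amazonaws.com",
                          port=5439, dbname=db_name, user="ci", password="change-me")
    try:
        # Recreate every schema, table, view and UDF, with zero data.
        with ci, ci.cursor() as cur:
            cur.execute(open("recreate_assets.sql").read())
        # ... run the Flyway migrations and the workflows against db_name here ...
    finally:
        ci.close()
finally:
    # Always clean up, even when the test fails.
    with admin.cursor() as cur:
        cur.execute(f"DROP DATABASE {db_name}")
```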

Hacking the Airflow scheduler

Now that we have found a solution to isolate the database, we are going to find a trick to optimise the workflow duration instead of upgrading the machines.

The trick is to get rid of the scheduler, because we don’t really need it to run our tasks. So we refactored the orchestration by implementing a custom script. It navigates through the Airflow code to find the DAG objects and detect task dependencies, so that it can execute the tasks in the right order.

Here is a code sample showing how the DAG object can be used to retrieve the workflow elements, followed by a representation of a task instance.

Code pattern to retrieve elements from DAG object

Technically, the script’s logic is to import the DAG objects to be tested. Each DAG object contains all the elements needed to reconstruct the workflow: task instances, task downstreams and upstreams, external sensors and external triggers.
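As a rough illustration of that pattern, here is a minimal sketch assuming Airflow 2.x and a hypothetical DAG id: the DAG objects are loaded straight from the repository with a DagBag, without any running scheduler.

```python
# Minimal sketch: load DAG objects directly from the dags/ folder.
from airflow.models import DagBag

dag_bag = DagBag(dag_folder="dags/", include_examples=False)
dag = dag_bag.get_dag("daily_datamarts")   # hypothetical DAG id

# Every task carries its own dependency information.
for task in dag.tasks:
    print(task.task_id,
          "upstream:", sorted(task.upstream_task_ids),
          "downstream:", sorted(task.downstream_task_ids))
```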

Task instance representation

Each task instance carries the information about its next and previous tasks, similar to the concept of a “linked list”. This makes it easier to explore tasks recursively. It is also possible to use multiprocessing or threads to parallelise some executions.

The execution part is normally handled by the scheduler, which provides the execution context to the worker. This context is in fact a dictionary required by the execute(context) method defined in the parent class BaseOperator. This dictionary contains all the execution information that is only known at runtime, such as the execution date or the results of previous tasks (XCom). But in our case we only need the execution_date, so we can provide it ourselves without the need for a scheduler.
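Reusing the `dag` object loaded above, executing a single task outside the scheduler could then look like this minimal sketch (the task id is hypothetical):

```python
# Minimal sketch: run one task with a hand-built context. We only provide
# execution_date; a real scheduler context carries many more keys
# (task instance, dag_run, XCom helpers, ...).
from datetime import datetime

context = {"execution_date": datetime(2022, 6, 13)}
task = dag.get_task("fill_daily_appointments")   # hypothetical task id
task.execute(context=context)
```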

To summarise how it works, the diagram below is a simple representation of the recursive algorithm at the task instance level. Only a root task needs to be passed as an argument, and then the entire workflow is executed thanks to the links between task instances.

Recursive algorithm that explores and executes an Airflow DAG
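In code, the recursive runner could look like the following minimal sketch, relying on the standard upstream_list / downstream_list properties of Airflow operators; the root task id is hypothetical.

```python
# Minimal sketch of the recursive runner over linked task instances.
def run_from(task, context, done=None):
    done = set() if done is None else done
    if task.task_id in done:
        return
    # A task runs only once all of its upstream tasks have been executed;
    # otherwise it will be revisited when its last upstream completes.
    if any(t.task_id not in done for t in task.upstream_list):
        return
    task.execute(context=context)
    done.add(task.task_id)
    for downstream in task.downstream_list:
        run_from(downstream, context, done)

# Start from a root task; the links between task instances do the rest.
run_from(dag.get_task("start"), context)   # hypothetical root task id
```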

These tips have saved us a huge amount of time!
- Each CI database is created in only 15 seconds.
- The whole lightweight testing process is done in 5–7 minutes!
- 90% of the time used to be wasted between task executions; now we save it.

The final process: CI, heavy tests, deploy

The full process in one picture: CI, heavy tests and deployment

For each commit in a PR that impacts the data warehouse, a CI process is triggered to perform the following steps. If the CI fails, the logs contain an explicit Redshift SQL error, and the CI database is dropped anyway. Here are the big steps:

  • 🤖 CI: Create a new database in Redshift
  • 🤖 CI: Generate data structures and assets based on the pre-production state
  • 🤖 CI: Run new migrations using Flyway
  • 🤖 CI: Run the workflows through our script (scheduler bypass)
  • 🤖 CI: Drop the database before exiting
  • 👌 Manual: Merge the PR to the main branch if the tests are good 👮🏻‍♂️
  • Scheduled: Create and push a release candidate to the pre-prod environment
  • Scheduled: Launch heavy tests on pre-production
  • 👌 Manual: Deploy to production if heavy tests are green 🚀

Results

Human mistakes are rejected at the PR stage: this concerns all the avoidable mistakes like typos, syntactical errors in SQL queries, table ownerships …

No more database conflicts: every change is tested in an isolated and up-to-date environment, because recent commits on the main branch are integrated before the tests begin.

Test velocity improved from 1 hour to 7 minutes: quick CIs = more commits released = enhanced productivity.

Flyway integration: this was a difficult part to implement, as Flyway would have replayed the full migration history with its default behaviour. So we made Flyway read only the new migrations on the current branch and the ones recently pushed to the main branch, i.e. the ones that have not been migrated yet. I won’t go into details, as this part deserves to be explained in another post dedicated to the Flyway implementation.

Pain points

  • All PRs should be re-tested just before merging into the main branch to ensure the testing is still valid.
  • The Redshift instance type limits the total number of tables across all the cluster’s databases, which limits the number of parallel CIs that the cluster can support at the same time.
  • As Redshift is an AWS-managed data warehouse, there is no way to run it offline or in a standalone container in the CI: AWS does not provide an official Docker image for Redshift.
  • Our Airflow hack does not support the use of Variables or XComs, as the context is provided manually.

Conclusion

The story of this CI hasn’t come to an end yet, because the process will keep being optimised as long as the data warehouse grows. The system will face new limitations, or simply new features will be released that offer opportunities for new optimisations. These could be new features in Redshift, Airflow, GitHub Actions or Flyway. For example, Amazon is working on a new feature, “Redshift Serverless”, which is still in preview at the time of writing; it could be useful for hosting the ephemeral databases.

This post comes to its end here. I was very happy to share the way we test SQL in our team; it helped us a lot to industrialise code delivery! I hope you’ll find inspiration in it, or do exactly what we achieved, because it works well!

Thank you for reading!
