Amazon Redshift CI/CD — How we did it and why you should do it too

Databases CI/CD

Database development should not be handled differently from application development. Database changes should be subject to the same activities (source control, continuous integration and continuous delivery) as application changes.

The goal of continuous integration and continuous delivery (CI/CD) is to reduce time-to-market and create a steady stream of end-user value with frequent, high-quality software releases.

Organizations without a Database CI/CD pipeline go through a very slow and manual process for each database change that needs to be made.

In addition, scripts are often lost because they are stored (if they are kept at all) on each user’s or the DBA’s local filesystem, making it hard to reproduce changes, keep track of them and debug issues.

The aim of Database CI/CD is to bring the same integration, testing and deployment best practices to the database and enable SQL scripts to flow through the software release pipeline, synchronized with application code, enabling safer and faster deployments.

At HERE Mobility, we wanted to enjoy all these benefits and went on a journey to implement a CI/CD pipeline for our Redshift code changes. This article describes what our Data Engineers had to build, what our Analysts had to change, the challenges we faced, and how we approached them.

Our current CI/CD pipeline

The Analytics team is in charge of writing PySpark ETLs that read from the Data Lake (S3) and write into Redshift, our Data Warehouse.

They are also in charge of writing the SQL scripts to create the relevant tables, views, UDFs, and stored procedures in Redshift. As in any other DWH, these can range from dim/fact tables to enrichment or reporting-tool-ready views (Tableau in our case).

Applications’ CI/CD

Spark code has a proper CI/CD pipeline.

Database’s CI/CD

But Redshift SQL scripts were just pushed to Gerrit (our VCS) without testing or versioning, and were manually applied by the analysts (most of the time skipping the Redshift instance in DEV, creating a discrepancy between the DEV and PROD environments).

ETL Projects’ Structure

Each ETL project corresponds to a Redshift schema; therefore, each project contains both Spark code and Redshift scripts.

The scripts folder is divided into tables and views folders, where each folder contains the scripts that create the relevant database objects.

For example:
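
A simplified, illustrative layout (folder and object names here are made up, not our real ones):

my_etl_project/
    spark_jobs/                 # PySpark ETL code
    scripts/
        tables/
            fact_rides.sql      # CREATE TABLE, with every later fix appended to the same file
        views/
            rides_report.sql    # CREATE VIEW definition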

As you can see, in addition to not having a proper CI/CD pipeline for the scripts, all table fixes are appended to the same file without an easy way of keeping track of changes.

The journey begins

The application code’s CI/CD pipeline was already robust so we thought about integrating Redshift’s scripts into the same CI/CD process.

We came up with a list of requirements and goals for this to happen:

1. Database code is version controlled

Database code changes should be tracked in the same version control system as application code. Some immediate benefits this will give us:

  • Easier to keep track of changes.
  • Single source of truth.
  • Every change to the database is stored, allowing easy audit if any problems occur.
  • Prevent deployments where the database is out of sync with the application.
  • Changes can be easily rolled back.
  • New environments can be created from scratch by applying these scripts.

All database changes are migrations; therefore, each code change needs a unique identifier.

We will need to keep track of which migrations have been applied to each database.

2. Database Code Validation

Once database code is checked in, a series of automated tests are immediately triggered. That way, analysts can get immediate feedback on SQL code, just as they do with application code. Some of the things we want to automatically test are:

  • Check that all tables/views/functions follow the naming conventions set by the team and align with the requirements of the organization.
  • Check correct dependencies between tables and views (column names, views referencing table columns, etc.).
  • Validate Redshift/PSQL Syntax.
  • Run unit tests for complex views.
  • Test code changes in a clean instance before testing them on a real one, to make sure the deployment will work as intended.

In addition, everybody gets their own database instance, so the same tests can also be run locally while developing, for even faster feedback.

3. CI/CD pipeline

After the database changes are validated, an immutable artifact should be created together with the application code, so automated DB deployments can be as consistent, repeatable and predictable as application code releases. This whole flow should be integrated into our current Jenkins pipeline.

4. Automatic Deployments

Analysts used to have Redshift permissions to run DDL queries manually; these permissions should be removed and replaced with automated DB deployments.

The automated deployment should be applied first on our Redshift DEV instance so the database changes and ETLs can be tested before deploying the changes to production.

In addition, this should allow us to keep both DEV and PROD instances in sync.

Meet redCI — our Redshift CI/CD tool

We looked for a product that could answer the requirements above, and though some nice options came up, none of them was quite what we were looking for. That’s how redCI was born.

redCI is a tool that runs unit and integration tests on PSQL-compliant queries, and deploys the SQL scripts to Redshift in the relevant environment.

This is how redCI changed the way analysts at HERE Mobility test and deploy their database objects.

The Tool

redCI is a tool written in Python that can:

  • Connect to Redshift and Postgres.
  • Read and run PSQL-compliant script files.
  • Keep track of the latest script deployed for each table, view and function in the database.
  • Convert Redshift syntax into PSQL.
  • Run unit tests for complex views.

The tool is used across all ETL projects, together with some changes to each project’s Jenkins pipeline and Makefile. The following are the changes we made in the projects to meet the requirements listed above.

1. Database code is version controlled

New changes used to be appended to the SQL scripts that create or modify tables. Not any more! From now on, each table has its own folder and each change is a new script with a version number:
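
Something like the following (table and script names are illustrative):

scripts/
    tables/
        fact_rides/
            001_create_table.sql
            002_add_cancellation_reason.sql
        dim_drivers/
            001_create_table.sql
    views/
        ...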

Because database changes are migrations, each script name is prefixed with a sequential number to indicate the order in which code changes were made.

Every time the need to deploy a database change arises, we simply create a new SQL script with a version number higher than the current one. The next time redCI starts, it will find it and upgrade the database accordingly.

redCI tracks which migrations have been applied to each database by keeping a changelog table in Redshift.
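
A minimal sketch of what that bookkeeping can look like (the table name, columns and DSN below are hypothetical, not redCI’s actual schema), using psycopg2:

import psycopg2

# Placeholder DSN; in practice this points at the relevant Redshift/Postgres instance.
conn = psycopg2.connect("postgresql://user:password@redshift-host:5439/db")
conn.autocommit = True
cur = conn.cursor()

CHANGELOG_DDL = """
CREATE TABLE IF NOT EXISTS redci_changelog (
    object_name  VARCHAR(256),
    script_name  VARCHAR(256),
    applied_at   TIMESTAMP
);
"""

def applied_scripts(cur, object_name):
    """Return the set of migration scripts already applied for one table/view/function."""
    cur.execute(CHANGELOG_DDL)
    cur.execute(
        "SELECT script_name FROM redci_changelog WHERE object_name = %s",
        (object_name,),
    )
    return {row[0] for row in cur.fetchall()}

def record_script(cur, object_name, script_name):
    """Mark a migration script as applied so it is never run twice."""
    cur.execute(
        "INSERT INTO redci_changelog VALUES (%s, %s, CURRENT_TIMESTAMP)",
        (object_name, script_name),
    )

Deployment then boils down to running every script whose name is not yet in the changelog, in ascending version order, and recording it right after it succeeds.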

2. Database Code Validation

To allow automated testing of the Redshift changes, we needed a new Redshift instance for each deployment. Ideally, we would have used a Redshift Docker image, but one doesn’t exist (all the images with that name are just Postgres images). We decided to use a Postgres Docker image instead, since Redshift is based on Postgres 8. This brings some issues of its own, which we will talk about later.

We changed our ETL projects so that each time they are tested (locally or in Jenkins), the Postgres Docker container is started and redCI deploys all the scripts in the project into this new, sterile environment, catching SQL bugs as early as possible.
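
Conceptually, that deploy-to-a-sterile-instance step looks something like the sketch below (simplified: the DSN is the build’s local Postgres container, and the real tool also handles versioning and syntax conversion):

import glob

import psycopg2

def apply_all_scripts(dsn, scripts_root="scripts"):
    """Apply every SQL script, tables first and in sorted order, to a throwaway Postgres."""
    conn = psycopg2.connect(dsn)
    conn.autocommit = True
    cur = conn.cursor()
    for kind in ("tables", "views"):  # tables first, so views can reference them
        for path in sorted(glob.glob(f"{scripts_root}/{kind}/**/*.sql", recursive=True)):
            with open(path) as f:
                cur.execute(f.read())  # any broken SQL fails the build right here
    cur.close()
    conn.close()

# The Postgres container started by the build, with placeholder credentials.
apply_all_scripts("postgresql://postgres:postgres@localhost:5432/postgres")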

Catching bugs early

Many bugs are caught at this stage (in the past they were caught in production), for example:

  • Broken dependencies between tables and views.
  • Views referencing table columns that don’t exist.
  • Views referencing tables or columns with a typo.
  • Objects that don’t follow the naming conventions.
  • Tables that don’t have the columns required by the company (a minimal example of both checks follows this list).
  • Redshift SQL syntax errors.
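
For instance, the naming-convention and required-column checks can be as small as the sketch below (the snake_case rule and the created_at/updated_at columns are made-up examples, not our real conventions):

import re

TABLE_NAME_RE = re.compile(r"^[a-z][a-z0-9_]*$")  # hypothetical rule: table names are snake_case
REQUIRED_COLUMNS = {"created_at", "updated_at"}   # hypothetical list of required audit columns

def check_table(name, columns):
    """Return a list of convention violations for a single table."""
    problems = []
    if not TABLE_NAME_RE.match(name):
        problems.append(f"table '{name}' is not snake_case")
    missing = REQUIRED_COLUMNS - set(columns)
    if missing:
        problems.append(f"table '{name}' is missing required columns: {sorted(missing)}")
    return problems

# Example: check_table("FactRides", ["ride_id"]) reports both a naming and a missing-column issue.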

But wait… Redshift is not Postgres

That’s right! Redshift is based on Postgres 8, but there are many differences you should be aware of. Some Postgres functions are not supported by Redshift, and vice versa.

Take for example:

datediff(second, d1, d2)

This Redshift function doesn’t exist in Postgres, but there’s an equivalent expression:

extract(epoch from (d2::timestamp - d1::timestamp))::int8

Or for example:

DISTSTYLE
DISTKEY
SORTKEY

These Redshift keywords do not exist in Postgres.

This all means that Redshift-SQL scripts can’t be run as-is on our Postgres Docker. That’s where redCI comes into the picture: it parses every SQL script before it is executed in Postgres and converts Redshift-only syntax to its Postgres equivalent where possible (the first example), or comments it out when there’s no equivalent (the second example).
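
A stripped-down sketch of that conversion step, covering just the two examples above (the real tool handles many more cases):

import re

def redshift_to_postgres(sql):
    """Rewrite a couple of Redshift-only constructs so a script can also run on Postgres."""
    # datediff(second, d1, d2)  ->  extract(epoch from (d2::timestamp - d1::timestamp))::int8
    sql = re.sub(
        r"datediff\s*\(\s*second\s*,\s*([^,]+?)\s*,\s*([^)]+?)\s*\)",
        r"extract(epoch from (\2::timestamp - \1::timestamp))::int8",
        sql,
        flags=re.IGNORECASE,
    )
    # Comment out DISTSTYLE / DISTKEY / SORTKEY lines, which have no Postgres equivalent.
    # (Simplified: assumes these keywords start their own line in the script.)
    sql = re.sub(
        r"^\s*(DISTSTYLE|DISTKEY|SORTKEY).*$",
        lambda m: "-- " + m.group(0),
        sql,
        flags=re.IGNORECASE | re.MULTILINE,
    )
    return sql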

Testing complex views

Part of the suite of automated tests is composed of unit tests for database views.

Some of our database views can be very complex, and each change should be carefully checked. To help analysts with this hard challenge we added support for redCI to run special unit tests.

These tests are written as SQL scripts and can be used in one of the following three ways:

  • An SQL script that runs a view and returns “TEST_SUCCEEDED” or “TEST_FAILED”. If running the view returns the latter, redCI marks the test as failed (a sketch of this flavor follows the list).
  • An SQL script that fills an “expected output” table and some “input tables”. The view under test reads from these input tables, and its output is compared with the “expected output” table. If they differ, the test fails, with an indicative error.
  • An SQL script that writes sample data into the tables to verify column types and schemas.
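
How redCI consumes the first kind of test can be pictured with the sketch below (simplified; the file name is made up, and the cursor comes from the Postgres connection used during the build):

def run_view_test(cur, test_sql_path):
    """Execute a test script and fail loudly unless it returns TEST_SUCCEEDED."""
    with open(test_sql_path) as f:       # e.g. tests/rides_report_test.sql (made-up name)
        cur.execute(f.read())
    (result,) = cur.fetchone()
    if result != "TEST_SUCCEEDED":
        raise AssertionError(f"{test_sql_path} returned {result!r}")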

This adds a new layer to our redCI tool, since we can also catch logic errors in database views before they are deployed to Redshift.

3. CI/CD pipeline and Automatic Deployments

Pipeline Integration

As explained above, the existing application code’s CI/CD pipeline was upgraded to include the Postgres Docker execution and the testing of database changes.

If the Jenkins pipeline finishes successfully and the code passes a code review by one of the team members, an artifact containing both the application and the database code is deployed.

Deploying to Redshift

Each project has an equivalent Airflow DAG that runs the different ETLs inside the project and fills the relevant tables.

We added a new operator, called redci-operator, to each of these DAGs. It’s the first task that runs in each DAG, and all other tasks depend on it.

This operator just calls our redCI package and deploys any new table or updates any edited view/function in Redshift.
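
In Airflow terms, each DAG roughly looks like the sketch below (the DAG name, task names and callables are illustrative placeholders, not the real operator’s API):

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def deploy_database_objects():
    # Placeholder: in our pipeline this step calls the redCI package, which applies
    # any new table scripts and any updated views/functions to Redshift.
    pass

def load_fact_rides():
    # Placeholder for one of the project's PySpark ETL tasks.
    pass

with DAG(
    dag_id="rides_etl",                 # illustrative name
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    redci_deploy = PythonOperator(task_id="redci_deploy", python_callable=deploy_database_objects)
    load_rides = PythonOperator(task_id="load_fact_rides", python_callable=load_fact_rides)

    redci_deploy >> load_rides          # the deploy step runs first; every ETL task depends on it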

Our Airflow instance running in DEV takes the latest deployed ETL project artifact. After deploying our changes to the DEV Redshift instance and making sure no errors are found, we can update the ETL project’s artifact version for the Airflow instance in PROD and run the same redci-operator there.

This change brings many benefits:

  • Both environments remain in sync all the time.
  • Redshift deployments are always checked in DEV before being applied to PROD.
  • Individual users no longer have permission to manually run DDL queries in PROD, reducing the risk of mistakes.

Conclusion

It is important to treat database changes like application code changes: they should be tested and integrated into a robust CI/CD pipeline.

We hope this article together with redCI can help you solve your own challenges.

We’d love to hear about your own challenges and how you approached them! And of course, every suggestion is welcome!
