Building a Staging Environment for Data Teams

Orit Mansur
Riskified Tech
7 min read · Jan 25, 2023


“What happened? Why is the system down?”

All you did was make a slight change in a script or a query. It wasn’t supposed to be a big deal, but you broke something anyway. It could have been a table deleted by mistake, a script duplicating records, or a snapshot that didn’t run.

These things happen when you don’t have a staging environment, especially when you work with data.

What is a staging environment and why do we need one?

A staging environment is like a sandbox: you play with the data and its logic in a separate environment. Only once you're sure you've met the requirement and haven't broken anything along the way do you deploy the changes to production.

A staging environment enables you to collaborate with your stakeholders and deploy changes only once they have been reviewed and approved. It makes your work process more organized and your deliverables much more reliable.

Building a staging environment for data teams is challenging. It's complicated because, unlike code, the data in tables has no simple version control system. Changing scripts isn't the only thing you need to do; you also need a way to change data in tables without affecting production.

A bit about our team

I’m a BI team leader at Riskified. The BI team’s role is to develop data flows and deliverables to support decision-making. We develop tables, reports, dashboards, predictions, and alerts.

Business teams are our stakeholders — they define the business requirements and consume the data we provide.

Our development stack includes the following tools and methodologies:

  • We use Snowflake as our main database
  • Our ETL processes are stored as SQL and Python scripts
  • We use Git as a version control tool and manage our scripts in a Git repository
  • We use CI/CD processes as part of our deployment
  • We schedule and run our processes using Airflow
  • Our BI frontend tools are Tableau and Looker

Our development process and its pain points

Before we integrated the staging environment, we didn’t have an easy way to test an entire data flow before we rolled it out to production.

To test our work, we had to build a process parallel to the one in production. All tables, scripts, and pipelines had to be duplicated and renamed. Then we had to run the new pipelines in parallel with the production pipelines. The reports were also duplicated and adapted to the new model.

This process was so time-consuming and tedious that we couldn’t support it over time. As a result, we often deployed changes to production before thoroughly testing them.

The outcome was that our stakeholders couldn’t check the changes before we deployed them to production. They took screenshots or downloaded reports before we made the changes so that they would be able to compare “before and after”. They couldn’t check the complete functionality or review edge cases.

As developers, when we worked on a new requirement, we created a new branch, made our changes, and merged it into the main branch. Each change in the main branch triggered a CI/CD process that built an image from the main branch. Airflow was configured to run only the main-branch image. Changes in other branches didn't trigger any CI/CD processes, and there was no way to run them.

We took into consideration the constraints of the current process and defined the requirements for the new solution.

Our main requirements

Cost-effectiveness

We didn’t want to replicate all our tables in a separate environment. It would be hard to maintain, consume a lot of storage and processing power, and be very expensive. On the other hand, we wanted our staging environment to be as close as possible to our production environment. It would help us know we were on the right track.

Ease of Use

A staging environment must be easy to use. No one would use it if it required a lot of time and preparation to maintain.

Supporting Multiple Developers

We wanted an environment that would enable several developers to work in parallel. Each developer should be able to run their processes without affecting others, and no one should have to wait in line for the environment to become available.

Ease of Deployment

The deployment process had to be as easy and frictionless as possible. We wanted most of our deployments to end with a single merge, nothing more. Any manual change that has to be repeated in both environments is a risk we wanted to avoid as much as possible.

The main challenges and solutions

Designing and developing a staging environment isn't an easy task. It requires changes in many aspects of the development flow. Before making any of them, we first needed to define the development workflow we were aiming for.

(Diagram: the development workflow we were aiming for)

The second phase was to outline the changes required in every component.

Step 1: Creating a new staging branch

We open a staging branch at the beginning of every development cycle. Each branch name begins with a STAGING prefix.

Step 2: Developing and pushing changes to the remote repository

We make all the required changes and then commit and push them to the remote repository.

Step 3: Creating an image per staging branch update

Once a branch with the STAGING prefix is pushed to the remote repository, a CI/CD process is triggered. The CI/CD process builds an image named after the branch. This enables us to run staging branches via Airflow, not just the main branch.
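
The post doesn't show the CI configuration itself, but the tagging rule is simple enough to sketch in Python. This is our reading of the logic, not the team's actual pipeline; the function name is ours, and the only assumption is that images are tagged with the branch name:

```python
from typing import Optional

def image_tag_for_branch(branch: str) -> Optional[str]:
    """Decide which Docker image tag (if any) to build for a pushed branch.

    Assumption: images are tagged with the branch name itself.
    """
    if branch == "main":
        return "main"                 # production image
    if branch.startswith("STAGING"):
        return branch                 # one image per staging branch
    return None                       # other branches trigger no build
```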

Step 4: Running a staging branch via Airflow staging

We run our staging processes in a dedicated Airflow staging environment. Each developer triggers tasks or DAGs through the Airflow UI. We identify the user running Airflow staging and run the branch they are currently working on.

We created a mapping file containing a branch name per developer. Each developer updates this file with the branch they’re currently working on. We use this file to let Airflow know which branch to run per user.
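
The post doesn't specify the mapping file's format, so here is a hedged sketch using a JSON file keyed by username. The file name, usernames, and branch names are all made up:

```python
import json

# Hypothetical contents of branch_mapping.json, committed to the repo:
# {
#     "orit":   "STAGING-revenue-model",
#     "shahar": "STAGING-alerts-refactor"
# }

def branch_for_user(username: str, path: str = "branch_mapping.json") -> str:
    """Return the staging branch a developer is currently working on,
    falling back to main if the user has no entry."""
    with open(path) as f:
        mapping = json.load(f)
    return mapping.get(username, "main")
```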

In the Airflow configuration file, we specify which image Airflow should run. When someone triggers a task from the staging Airflow environment, we check who is running the process and run the image of the branch mapped to that username (using the mapping file). In production Airflow, we always run the image built from the main branch.
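
Airflow doesn't hand tasks the triggering user out of the box, so this sketch assumes the username is passed in when the DAG is triggered (for example via the UI's "Trigger DAG w/ config", which populates dag_run.conf). It reuses branch_for_user from the sketch above; the environment flag and registry path are assumptions as well:

```python
import os

REGISTRY = "our-registry/bi-etl"  # hypothetical image repository

def resolve_image(dag_run_conf: dict) -> str:
    """Pick the image for a task run: in staging, the image of the branch
    mapped to the triggering user; in production, always the main image."""
    if os.environ.get("AIRFLOW_ENV") == "staging":        # assumed env flag
        username = dag_run_conf.get("username", "")       # assumed to be supplied at trigger time
        return f"{REGISTRY}:{branch_for_user(username)}"  # mapping-file lookup from above
    return f"{REGISTRY}:main"
```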

Step 5: Creating and updating tables in the staging database

Since we work with data, a big part of our job is updating tables in a Snowflake DB. We have complex processes that update many tables with many dependencies. When we develop in staging, we need to make sure we don't affect production tables. We also want to avoid changing table names or schema prefixes in our scripts.

It would be great if we could create a version of the same table for each branch we run, but unfortunately we can't (at least not with the tools we're currently using).

The solution we chose was to replicate the tables we use in the current development cycle to another database — the staging DB.

In our project configuration file, we defined that a process run from Airflow staging refers only to tables in the staging database. The production database isn't affected at all.
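
In practice, this comes down to resolving fully qualified table names from a single environment setting. A minimal sketch of that idea; the database names and environment variable are assumptions, not the team's actual configuration:

```python
import os

# Assumed database names; the real project configuration may differ.
DATABASES = {"production": "ANALYTICS", "staging": "ANALYTICS_STAGING"}

def qualified_name(schema: str, table: str) -> str:
    """Build a fully qualified Snowflake table name, so the same script
    reads and writes the staging DB when run from Airflow staging."""
    env = os.environ.get("AIRFLOW_ENV", "production")
    return f"{DATABASES[env]}.{schema}.{table}"

# In staging: qualified_name("MARTS", "ORDERS") -> "ANALYTICS_STAGING.MARTS.ORDERS"
```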

To replicate the required tables to the staging DB, we created a script that clones the source tables feeding the tables we need to change: the "read-only" tables. Now, when we want to work on a project in staging, we first run a script that creates the relevant tables in the staging database.
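
Snowflake's zero-copy cloning makes such a script cheap to run: CREATE TABLE ... CLONE copies metadata rather than data. A minimal sketch using the snowflake-connector-python package; the database names, table list, and credentials are placeholders:

```python
import snowflake.connector

# Placeholder list of tables needed for the current development cycle.
TABLES = [("MARTS", "ORDERS"), ("MARTS", "CHARGEBACKS")]

def clone_to_staging(conn, prod_db="ANALYTICS", staging_db="ANALYTICS_STAGING"):
    """Zero-copy clone the required tables into the staging database."""
    cur = conn.cursor()
    for schema, table in TABLES:
        cur.execute(f"CREATE SCHEMA IF NOT EXISTS {staging_db}.{schema}")
        cur.execute(
            f"CREATE OR REPLACE TABLE {staging_db}.{schema}.{table} "
            f"CLONE {prod_db}.{schema}.{table}"
        )

if __name__ == "__main__":
    conn = snowflake.connector.connect(
        account="...", user="...", password="..."  # placeholder credentials
    )
    clone_to_staging(conn)
```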

Step 6: User Acceptance Test

Once we complete our development, we ask our stakeholders to run their tests on the dashboards that rely on the staging DB tables. Once the changes are approved, we can move forward to the deployment stage.

Step 7: Deployment to production

The deployment is effortless: we create any new tables in the production DB and merge the staging branch we worked on into the main branch. The revised process then runs on production tables and scripts.

Outputs and business impact

Following the rollout of the new staging environment, our development process is structured and clear, and our work processes have improved significantly:

  • We can develop big projects for long periods without affecting production
  • Each of us can work on several projects at the same time
  • Development is much easier — no need to duplicate tables or scripts
  • Deployment is super easy — a quick merge away
  • Data is more reliable, and our end users no longer need to compare screenshots. They have two live dashboards (staging and production) instead

As a final note, it does take planning and infrastructure work to create an easy-to-use staging environment. The complexity only increases when you're updating data stored in a database.

But although it’s time-consuming, it’s worth it.

Using a staging environment makes a significant impact on your data deliverables. It improves collaboration with your stakeholders, optimizes your work processes, and assures data trust.

Kudos to Shahar Lustig, who managed and developed this project, and to Lea Ozer Hempel for taking a key role in making it alive and kicking.

Special thanks to Ran Salmona, Topaz Fishman, Assaf Levy, Amit Ehud, Dor Polovincik, and Or Sagiv for your priceless contribution to the project.
