Building CI/CD Pipelines for Data Applications at Snowflake

“Continuous improvement is better than delayed perfection.” -Mark Twain

This post was written in collaboration with Shubham Goel, Sudhir Tekavade, and Praveen Padige.

We are part of the Data Platform Operations team at Snowflake, which builds and maintains the environment for data pipelines across multiple data teams. Over the past few months we have implemented a CI/CD pipeline at Snowflake that is now used by all of those teams. With its help, users can integrate and deploy their code in the repository frequently. Moreover, we have seen a significant reduction in code failures since going live with this implementation.

In this article, we’ll discuss:

  • Why do we need a CI/CD pipeline?
  • What is CI/CD?
  • What are we testing?
  • How did we implement CI/CD pipelines?
  • How have we benefited from using CI/CD?

Why do we need a CI/CD pipeline?

At Snowflake, data teams (data scientists, data engineers, and data analysts) from different departments (Product, Finance, Sales, Marketing, etc.) integrate their respective application logic (data pipelines and machine learning models) into a GitHub mono-repository.

Earlier, because code was only tested locally, minor errors such as syntax errors occasionally leaked into production and caused workload failures. These errors were sometimes missed even in peer reviews, precisely because of their itsy-bitsy nature. Moreover, even after a failure, it was difficult to debug the error and report it to the responsible developer; the reporting itself was a manual process. We needed automated testing that would scan and check the code before it was pushed to production.

Second, in the absence of continuous deployment, developers had to wait for releases that happened only once a day. Developers needed a way to deploy even small features more frequently, and with confidence.

What is Continuous Integration and Deployment?

Continuous Integration (CI) is a development practice that requires developers to integrate code into a shared repository several times a day. Each check-in is then verified by an automated build, allowing teams to detect problems early.

Continuous integration reduces costs because businesses spend fewer resources resolving outages [1]. Ultimately, the team avoids exhausting one of its most important resources: time.

“Continuous Integration doesn’t get rid of bugs, but it does make them dramatically easier to find and remove.” -Martin Fowler.

Continuous integration enables continuous delivery and deployment. Regularly deploying new features and patching bugs not only keeps customers happy, but also helps you grow your customer base and stay ahead of the competition. Feedback from stakeholders becomes an opportunity for improvement, and the resulting discussions encourage transparency and accountability for every developer's work. This helps developers work efficiently, which in turn improves product quality and the business.

Continuous Deployment is the release into production of software that passes the automated tests [2]. This enables early releases and helps the software development team stay aligned with user requirements.

CI/CD helps developers integrate regularly, detect errors quickly, and locate them more easily. However, its application spans well beyond IT and product teams: developers in Sales, Marketing, Finance, and other business functions benefit from the approach as well. For instance, phases like planning, bundling, supplying, distributing, and monitoring in Sales and Marketing can be modeled as a CI/CD process.

What are we testing?

By far the most important question. Well… the code? Certainly. But a more precise question is: what do we “test for” in the code?

We automated the testing for:

  • Syntactical Errors (Python and SQL code)
  • Airflow Workflow Validation

Syntactical Errors

Minor issues like syntax errors and small bugs are among the major contributors to workload failures. In our CI pipeline, we use linters to catch them.

“Thou shalt run lint frequently and study its pronouncements with care, for verily its perception and judgement oft exceed thine.” — Henry Spencer (University of Toronto, Zoology) in one of the ten commandments for C programmers

Linters are static analysis tools that identify syntax errors and other warnings in code and report them to the user along with their location in the code: for example, a missing import statement for a package, a missing colon, or a missing argument. The majority of our code is in Python and SQL, so we use pylint for linting Python and SQLFluff for SQL.
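
As a rough illustration, the sketch below exercises both linters through their Python APIs (in CI we run them as command-line steps; the sample SQL, the buggy snippet, and the file handling are invented purely for demonstration):

```python
"""Illustrative only: the kind of problems pylint and SQLFluff report."""
import tempfile

import sqlfluff
from pylint import lint

# SQLFluff's simple API returns one violation per problem, each with a rule
# code and a human-readable description of what is wrong and where.
for violation in sqlfluff.lint("SELECT a from tbl WHERE a=1\n", dialect="snowflake"):
    print(violation["code"], violation["description"])

# A Python snippet with a typical "itsy-bitsy" bug: json is never imported.
buggy = "def parse(raw):\n    return json.loads(raw)\n"
with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as handle:
    handle.write(buggy)

# exit=False stops pylint from calling sys.exit(), so a CI wrapper could
# aggregate results; pylint reports the undefined name with its line number.
lint.Run([handle.name], exit=False)
```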

Airflow Workflow Validation

We use Airflow [3] to schedule and orchestrate our data pipelines. In Airflow, data pipelines are Directed Acyclic Graphs (DAGs): collections of tasks with dependencies between them.

What should be validated in DAGs? First, a DAG has to be, by definition, acyclic, so we validate that the task dependencies contain no cycles; if a cycle exists, it is an error. We also check for missing dependencies.

Second, every DAG file has to have certain parameters defined. Our custom tests check for the following:

  • Cyclic dependencies
  • Owner
  • Email ID and email on failure
  • on_failure_callback on failure
  • SLA
  • Execution timeout

Together, these checks ensure that there are no cyclic dependencies in workflows, that an email and a Slack notification are sent to the DAG owner whenever there is a failure, and that all prerequisites are met for every Airflow workflow.
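
A minimal, pytest-style sketch of such checks follows; the DAG folder path and the exact parameter conventions are assumptions for illustration, not our production test suite:

```python
"""Illustrative DAG validation tests, runnable with pytest against dags/."""
from airflow.models import DagBag

dag_bag = DagBag(dag_folder="dags/", include_examples=False)

def test_dags_import_cleanly():
    # DagBag refuses to load a DAG whose file fails to import or whose task
    # dependencies form a cycle; both problems surface in import_errors.
    assert not dag_bag.import_errors, f"Broken DAG files: {dag_bag.import_errors}"

def test_required_dag_parameters():
    for dag_id, dag in dag_bag.dags.items():
        args = dag.default_args or {}
        assert args.get("owner"), f"{dag_id}: no owner set"
        assert args.get("email"), f"{dag_id}: no alert email set"
        assert args.get("email_on_failure"), f"{dag_id}: email_on_failure is off"
        # In our setup, the failure callback posts the Slack notification.
        assert args.get("on_failure_callback"), f"{dag_id}: no on_failure_callback"
        for task in dag.tasks:
            assert task.sla is not None, f"{dag_id}.{task.task_id}: no SLA"
            assert task.execution_timeout is not None, (
                f"{dag_id}.{task.task_id}: no execution_timeout"
            )
```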

How did we implement CI/CD pipelines?

1. CI/CD for Airflow DAGs

We used GitHub Actions to implement the CI part of the pipeline. GitHub Actions is easy to use, as there is no installation required and it is tightly integrated with GitHub. Because workflows are event-driven, they trigger immediately on events such as a developer creating a pull request.

The linters and the DAG validation checks run as separate GitHub Actions workflows.
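
For example, the lint workflow might look roughly like this; the action versions, Python version, and repository layout here are illustrative assumptions, not our exact configuration:

```yaml
name: lint
on: pull_request

jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install pylint sqlfluff
      # Any non-zero exit code fails the pull-request check immediately.
      - run: pylint $(git ls-files '*.py')
      - run: sqlfluff lint --dialect snowflake .
```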

This also facilitates CD, since developers can deliver code to production at any time after merging into the master branch.

Workflow

  • When a developer changes something in a test branch or adds a new feature in a feature branch and raises a pull request, the GitHub Actions workflows trigger immediately.
  • If all the workflows pass, the code can move on to peer review; otherwise the developer must make corrections then and there and repeat the process.
  • If the peer review is successful, the branch is merged. The merged code is then deployed to the production environment after passing the Jenkins build, as shown in the CI/CD Pipeline Flowchart figure above. The master branch is always in sync with the code running in production.

2. CI/CD for SnowSQL Code

Being a data engineer at Snowflake is fun, as you can easily ingest, transform, and deliver all your data for faster and deeper insights. Over the last few years, data engineering has been going back to its roots: SQL. Data engineers here spend little or no time managing infrastructure and most of their time building tools that deliver data pipelines with great speed and accuracy. Snowflake provides cutting-edge features like external functions, which enable developers to connect to external remote services (e.g., calling a REST API via AWS Lambda) and extract, transform, and load data into target tables. Hence, most of the code we end up writing is SnowSQL.

However, without a CI/CD system to integrate and deploy this SQL code into a Snowflake database, it becomes cumbersome to manage as the codebase grows. Keeping version control and deploying SQL code manually every time is not a good option, as it is prone to human error.

To overcome this, we implemented a CI/CD system using Sqitch (an open-source database change management application) and Jenkins. The high-level architecture of the system is shown in the diagram below:

SnowSQL CI/CD Architectural flow
  • Data engineers push SQL code, bundled as a Sqitch project, to a GitHub repository for source control. As soon as a pull request is created, a GitHub Actions workflow triggers and immediately checks for syntactical errors using the SQL linters described earlier.
  • Once all the checks pass, the code can be peer reviewed. If a check fails, the developer fixes the errors before pushing the code again.
  • Once the code is merged into the master branch, a webhook triggers a CD build on the Jenkins server.
  • The Jenkins build pipeline pulls an image from Snowflake's internal Docker Hub registry, which is used to run the sqitch deploy/verify/revert commands in a Docker container (a sketch of this step follows the list).
  • Secrets required to access the Snowflake account can be stored in secure storage such as HashiCorp Vault or Jenkins credentials, and used to connect to Snowflake while deploying the code.
  • With Sqitch you can deploy, verify, and revert SQL code very easily. The full change history is captured automatically in a database schema called SQITCH inside the Snowflake database, making it easy to track changes and build insights on that data.
  • You can find the CI/CD documentation at [4].
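
The deploy step that the Jenkins build executes looks roughly like the following; the image name, mount paths, connection URI, and credential wiring are illustrative assumptions, not our exact setup:

```bash
# Run Sqitch from a container against a Snowflake target, using the Sqitch
# project checked out in the current directory.
# SNOWSQL_PWD is injected from Vault or Jenkins credentials, never hard-coded.
docker run --rm \
  -e SNOWSQL_PWD \
  -v "$PWD":/repo -w /repo \
  sqitch/sqitch \
  deploy --verify db:snowflake://ci_user@my_account/analytics_db

# If a change fails verification, it can be rolled back just as easily:
#   sqitch revert --to @HEAD^ -y
```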

Conclusion

Whether it is an improvement, a bug fix, or an entirely new feature, a developer will always need to integrate their work frequently. There is no end to ongoing modifications and advancements: improvement never terminates, and there is no such thing as “the best version.” The frameworks described above illustrate the pipelines we implemented at Snowflake and serve as a practical example of CI/CD. Because we deployed this pipeline into a new environment, we were able to make the most of CI/CD by integrating and deploying frequently. Moreover, we reduced failures by a significant amount after the migration.

References

[1] https://dzone.com/articles/benefits-of-continuous-integration-for-businesses

[2] https://www.thoughtworks.com/continuous-integration

[3] https://airflow.apache.org

[4] https://snowflake-sqitch-ci-cd.readthedocs.io/en/latest/index.html


Sarvesh Khire
Snowflake Builders Blog: Data Engineers, App Developers, AI/ML, & Data Science
