Airflow on Snowpark Container Services

Orchestrating data workflows is a vital aspect of data engineering, and Apache Airflow stands out as a top choice for many organizations thanks to its flexibility and robust features. However, deploying Airflow in a scalable, secure, and efficient manner can be quite challenging. That’s where Snowpark Container Services (SPCS) comes into play. SPCS offers a cutting-edge solution that allows for the seamless integration of containerized applications within Snowflake’s powerful platform. In this article, we will delve into running Apache Airflow on Snowpark Container Services, discussing its advantages and providing a detailed step-by-step deployment guide.

What is Apache Airflow®?

Apache Airflow is an open-source platform to programmatically author, schedule, and monitor workflows. Using Directed Acyclic Graphs (DAGs), Airflow allows users to define workflows as code, ensuring flexibility, scalability, and maintainability. It’s widely used in various scenarios, from ETL processes and data pipeline automation to machine learning model training and deployment.
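
To make this concrete, here is a minimal example of a workflow defined as code. The DAG name, schedule, and task logic are purely illustrative:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    # Placeholder for an extract step, e.g. pulling rows from a source system
    print("extracting data")


def load():
    # Placeholder for a load step, e.g. writing rows into Snowflake
    print("loading data")


# A simple two-task pipeline that runs once a day
with DAG(
    dag_id="example_etl",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)

    extract_task >> load_task
```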

What is Snowpark Container Services?

Snowpark Container Services (SPCS) is a feature provided by Snowflake that allows users to run containerized workloads within the Snowflake environment. It is designed to enable the execution of custom code and applications in a scalable and efficient manner, leveraging the Snowflake data platform’s infrastructure.
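
As a rough sketch of how SPCS workloads are provisioned, the snippet below uses the Snowflake Python connector to create a compute pool and start a service from a specification file stored on a stage. The object names (pool, service, stage, and spec file) are placeholders, not values from the quickstart:

```python
import snowflake.connector

# Connect with a role that has privileges to create SPCS objects
conn = snowflake.connector.connect(
    account="<account_identifier>",
    user="<user>",
    password="<password>",
    role="SYSADMIN",
    database="AIRFLOW_DB",
    schema="AIRFLOW_SCHEMA",
)
cur = conn.cursor()

# Create a small CPU compute pool for the containers (name and sizing are illustrative)
cur.execute("""
    CREATE COMPUTE POOL IF NOT EXISTS airflow_pool
      MIN_NODES = 1
      MAX_NODES = 1
      INSTANCE_FAMILY = CPU_X64_XS
""")

# Start a service from a specification file stored on an internal stage
cur.execute("""
    CREATE SERVICE IF NOT EXISTS airflow_server
      IN COMPUTE POOL airflow_pool
      FROM @airflow_stage
      SPECIFICATION_FILE = 'airflow_server.yaml'
""")
```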

Why Choose Snowpark Container Services?

Running Airflow on Snowpark Container Services offers several advantages over traditional methods:

  1. Ease of Deployment & Maintenance: Traditionally, deploying production-grade Airflow on platforms like Kubernetes requires significant collaboration with cloud engineering teams, which can delay both deployment and ongoing maintenance of the platform. With Snowpark Container Services, any data engineer with basic containerization knowledge can deploy production-grade Airflow without needing to be a Kubernetes expert, which simplifies the process and speeds up both deployment and maintenance.
  2. Integrated Ecosystem: Seamlessly integrates with Snowflake’s ecosystem, providing a unified environment for your data workflows.
  3. Scalability: Scaling Airflow Celery workers dynamically is critical for handling fluctuating workloads. With SPCS, you can configure auto-scaling policies to ensure optimal resource utilization (see the scaling sketch after this list).
  4. Compute Pools: Easy access to a wide variety of SPCS compute pools (including GPU instances) for your data pipelines and ML models.
  5. SnowGit Integration: Automatically deploy new DAGs to your Airflow environment as soon as they are merged into the repository.
  6. Security: One of the paramount concerns when deploying any application at scale is security. Snowpark Container Services ensures that your Airflow instance and associated workflows operate within Snowflake’s secure environment.
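
As an illustration of point 3, a worker service can be created with minimum and maximum instance counts so SPCS can scale it between those bounds. The pool, stage, spec file, and service names are placeholders, and the snippet reuses the cursor from the earlier example:

```python
# Assumes an existing cursor `cur` (see the earlier snippet), an existing
# compute pool, and a worker specification file; all names are illustrative.
cur.execute("""
    CREATE SERVICE IF NOT EXISTS airflow_worker
      IN COMPUTE POOL airflow_worker_pool
      FROM @airflow_stage
      SPECIFICATION_FILE = 'airflow_worker.yaml'
      MIN_INSTANCES = 1
      MAX_INSTANCES = 3
""")

# The bounds can later be adjusted without recreating the service
cur.execute("ALTER SERVICE airflow_worker SET MIN_INSTANCES = 2 MAX_INSTANCES = 5")
```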

Architecture Overview

Below is the architecture diagram for Airflow with Celery Executor on SPCS:

Airflow on SPCS Architecture

Components Breakdown

  • DAGs Repo: A Git repository that stores DAG files and data pipeline logic. It is integrated with SnowGit so the Airflow service can seamlessly access the DAGs and pipeline code.
  • Airflow Server Service: This service runs two containers, the Airflow webserver and the Airflow scheduler, both spun up from the same Airflow Docker image (a rough sketch of such a service specification appears after this list).
  • Airflow Webserver: The user interface for managing, monitoring, and configuring Airflow workflows.
  • Airflow Scheduler: The component that triggers tasks to run at scheduled intervals based on workflow definitions.
  • Airflow Worker Service: Executes tasks distributed by the Airflow scheduler, using Celery for parallel processing. This service is spun up from the same Airflow image used by the Airflow server service. You can run more than one instance of the worker service so it can scale in and out based on workload.
  • Redis Service: Acts as a message broker to facilitate communication between the Airflow scheduler and Airflow Celery workers.
  • Postgres Service: Serves as the metadata database, storing information about DAGs, tasks, and their states. The database files live on block storage mounted as a volume to this service, so data is not lost when the Postgres service restarts.
  • Snowflake External Access Integration: By default, egress traffic from SPCS services is restricted. With a Snowflake External Access Integration (EAI) you can manage egress rules at a more granular level.
  • Snowflake Secrets Object: In Snowflake, a secret object securely stores and manages sensitive information, such as API keys or credentials, within the Snowflake environment. All secrets required by the data pipelines and the Airflow services are stored in Snowflake secret objects.
  • Snowflake Internal Stage: All service specification files and Airflow task log files are stored in a Snowflake internal stage.
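
To show how the server service’s two containers might be described, here is a rough sketch of a service specification built as a Python dict and rendered to YAML. The image path, secret name, environment variables, and endpoint are placeholders for illustration, not the exact specification from the quickstart:

```python
import yaml  # pip install pyyaml

# Illustrative SPCS service specification for the Airflow server service:
# two containers (webserver and scheduler) built from the same Airflow image,
# plus a Snowflake secret surfaced to the webserver as an environment variable.
AIRFLOW_IMAGE = "/airflow_db/airflow_schema/airflow_repo/airflow:latest"  # placeholder

spec = {
    "spec": {
        "containers": [
            {
                "name": "webserver",
                "image": AIRFLOW_IMAGE,
                "command": ["airflow", "webserver"],
                "secrets": [
                    {
                        # Placeholder generic-string secret holding the Fernet key
                        "snowflakeSecret": "airflow_db.airflow_schema.airflow_fernet_key",
                        "secretKeyRef": "secret_string",
                        "envVarName": "AIRFLOW__CORE__FERNET_KEY",
                    }
                ],
            },
            {
                "name": "scheduler",
                "image": AIRFLOW_IMAGE,
                "command": ["airflow", "scheduler"],
            },
        ],
        "endpoints": [
            # Expose the Airflow UI on the webserver's port
            {"name": "webserver", "port": 8080, "public": True},
        ],
    }
}

# Write the spec locally; in practice it is uploaded to the Snowflake internal
# stage referenced by the CREATE SERVICE statement.
with open("airflow_server.yaml", "w") as f:
    yaml.safe_dump(spec, f, sort_keys=False)
```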

Step-by-Step Deployment Guide

Follow → Snowflake Quickstart guide to run Airflow on SPCS
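
Once the services are created, you can check their status, inspect container logs, and find the public endpoint that serves the Airflow UI. The service and container names below are placeholders, and the snippet reuses the cursor from the earlier examples:

```python
# Assumes an existing cursor `cur` from the earlier connection snippet;
# service, instance, and container names are illustrative.

# Check the status of the Airflow server service
cur.execute("SELECT SYSTEM$GET_SERVICE_STATUS('airflow_server')")
print(cur.fetchone()[0])

# Fetch the last 100 lines of the webserver container's logs
cur.execute("SELECT SYSTEM$GET_SERVICE_LOGS('airflow_server', '0', 'webserver', 100)")
print(cur.fetchone()[0])

# List the service endpoints to find the public URL of the Airflow UI
cur.execute("SHOW ENDPOINTS IN SERVICE airflow_server")
for row in cur.fetchall():
    print(row)
```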

Conclusion

Running Airflow on Snowpark Container Services simplifies the process of setting up a robust, production-ready orchestration environment. By leveraging Snowflake’s integrated ecosystem and scalable infrastructure, you can focus on building and managing your data workflows without the overhead of traditional deployment complexities. Start leveraging the power of Airflow on SPCS today and experience the benefits of a seamless, efficient deployment process.
