Implementing CI/CD in Cloud Composer Using Cloud Build and GitHub — Part 1

Amarachi Ogu
5 min read · Feb 21, 2023


Continuous Integration and Continuous Deployment (CI/CD) have become integral to modern software development. The main aim of these practices is to streamline the development process and make it faster and easier to produce high-quality code.

In the previous blog post, we explored an approach to using Google Cloud Composer, Google Cloud Storage (GCS), and BigQuery to build a stock data workflow. We used a simple method to deploy the Directed Acyclic Graphs (DAGs) to the Cloud Composer environment: uploading the Python file to a Cloud Storage bucket. This method is straightforward and convenient when starting out.

However, if you have multiple people working on the same project, there’s a risk of version control conflicts, and the manual process of uploading the file to GCS can also lead to errors. Also, if you have a large number of DAGs to deploy, it becomes challenging to keep track of all the changes, making it difficult to deploy updates efficiently.

Further concerns when deploying DAGs without proper controls and processes in place may include the following:

  1. Accidentally dropping a harmful or unintended file into the GCS bucket and getting yourself into trouble.
  2. Losing track of what you have and haven't tried while getting your DAG to work.
  3. Deploying a DAG only to see the big red ‘broken DAG’ error in your Airflow environment.
  4. Looking through your environment only to realize that the changes you made to your DAG were never actually deployed.

So what can we do to fix these issues?

In this blog, we are going to explore how to automate them away.

Potential solutions to these concerns include:

  1. Using a version control system
  2. Using DAG validation tests
  3. Having multiple Airflow environments

Version control

Using version control simplifies the process of tracking changes made to your DAGs, facilitates collaboration with other team members, and provides a clear and organized deployment history.

Version control systems, such as Git, allow you to keep track of changes made to your DAGs over time. This makes it easy to revert to previous versions if necessary and to see who made changes and when. It also helps to avoid conflicting changes between team members by providing a centralized repository where each member can work on their own branch and merge changes into the main branch when they are ready.

Additionally, it’s advisable to establish a code review process, where a team member reviews your DAGs before they are deployed to the production environment. This way, you can catch any potential issues before they make their way into your production pipelines.

DAG validation tests

In addition to using version control, DAG validation tests can help you avoid some common issues that affect all DAGs. By implementing DAG tests, you can ensure that your DAGs meet specific criteria and that the data flowing through your pipeline is valid. This helps you catch issues such as syntax errors, missing dependencies, or invalid data types before they become problems in production.

You can write a wide variety of tests to validate a DAG. These may include:

  • Python static analysis. This will catch things like missing import statements.
  • Unit/integration tests on custom operators or plugins.
  • Unit tests on DAGs and Tasks. Validating your DAGs with unit tests helps to prevent incorrect code. For example, you can perform a DAG Loader Test to verify that all your DAGs load without any issues (a minimal sketch follows this list).
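
As an illustration, here is a minimal sketch of a DAG loader test written with pytest. It assumes your repository keeps DAG files in a local dags/ folder and that apache-airflow and pytest are installed in the test environment; the file and function names are purely illustrative.

```python
# test_dag_validation.py - a minimal sketch of a DAG loader test (pytest).
# Assumes DAG files live in a local "dags/" folder; adjust to your repo layout.
from airflow.models import DagBag


def test_dags_load_with_no_import_errors():
    # DagBag parses every file in the folder and records any import errors,
    # so an empty import_errors dict means all DAGs loaded cleanly.
    dag_bag = DagBag(dag_folder="dags/", include_examples=False)
    assert dag_bag.import_errors == {}, f"DAG import errors: {dag_bag.import_errors}"


def test_dags_have_tasks():
    # A DAG that loads but contains no tasks is almost certainly a mistake.
    dag_bag = DagBag(dag_folder="dags/", include_examples=False)
    for dag_id, dag in dag_bag.dags.items():
        assert len(dag.tasks) > 0, f"DAG '{dag_id}' has no tasks"
```

Running python -m pytest against this file locally or in CI catches broken imports before a DAG ever reaches the Composer bucket.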

Incorporating these tests into your deployment process can mitigate DAG parsing errors and improve the stability of your web server. By performing these checks, you can catch invalid DAGs early and prevent issues that could otherwise cause the server to crash, especially when dealing with a large number of faulty DAGs.

Multiple environments

In addition to your local environment, you can have, for instance, two Airflow environments: development and production.

Local environment — Initially, you can use your local environment to test and experiment with your DAG by running a specific version of Airflow on your computer. This allows you to troubleshoot and refine your DAG. However, when your DAG outgrows the capabilities of your local environment, you can move on to the development environment.

Development environment — The development environment provides a more realistic setup for working on and deploying DAGs. It allows for debugging as the configuration of this environment is similar to the production environment, including the same version of Airflow, Python packages, and plugins. This makes it easier to identify and resolve issues before deploying to production.

In this environment, it is possible to experiment with changes to the environment itself. For instance, if there is uncertainty about the impact of a new version of Airflow on DAGs, it can be tested here before being introduced to production.

Also, by keeping a change log in the development environment, one can track the modifications made to the environment and the debugging process of the DAGs. This serves as a valuable resource for future reference and team collaboration.

Production environment — The production environment is the most critical aspect of your Airflow setup. It is where your production pipelines run, so it is important to maintain the highest level of security and stability. It is recommended to follow the principle of least privilege and only deploy DAGs that have undergone thorough testing and vetting in both the local and development environments.

When deploying DAGs to the production environment, it is important to be strategic and choose a time that is least disruptive if things were to go wrong. It is also important to have a rollback procedure in place in case of any mistakes. Having a plan in place beforehand can make it easier to rectify the situation.

Putting it all together

Having explored various approaches for preventing errors in a Cloud Composer workflow, let’s now apply these strategies to our stock data workflow.

The workflow process follows this pattern:

After the code is written and pushed to a branch in a GitHub repository from a local environment, a pull request to merge to the main branch triggers a Cloud Build job to run validation tests on the DAG. Once the tests pass and the code is merged, another Cloud Build job is triggered to sync the DAGs with the development Cloud Composer environment.
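
To make this concrete, below is a hypothetical sketch of what the post-merge Cloud Build configuration could look like. It assumes the repository keeps DAGs in a dags/ folder and tests in a tests/ folder; the requirements-test.txt file and the _DAGS_BUCKET substitution (the development Composer environment’s bucket) are placeholders, and Part 2 walks through the actual setup.

```yaml
# cloudbuild.yaml - a minimal, hypothetical sketch of the post-merge sync job.
steps:
  # Re-run the DAG validation tests before syncing anything.
  # requirements-test.txt is an assumed file listing airflow + pytest.
  - name: 'python:3.8-slim'
    entrypoint: 'bash'
    args: ['-c', 'pip install -r requirements-test.txt && python -m pytest tests/']
  # Sync the dags/ folder to the Composer environment's GCS bucket.
  - name: 'gcr.io/google.com/cloudsdktool/cloud-sdk'
    entrypoint: 'gsutil'
    args: ['-m', 'rsync', '-r', '-c', 'dags/', 'gs://${_DAGS_BUCKET}/dags']
substitutions:
  _DAGS_BUCKET: 'your-composer-environment-bucket'  # placeholder value
```

The pull-request trigger would run only the test step, while the trigger on the main branch would run both, so nothing reaches the development environment until the checks pass.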

Part 2 will be hands-on, where we’ll dive deeper into the technical details to build our stock data CI/CD workflow.

Thanks for reading. You are welcome to follow me on LinkedIn and Twitter.
