A production-grade CI/CD Pipeline for Serverless Applications

Motivation

To move to the cloud, companies must change the way they build software. On-premises systems were shipped in yearly, half-yearly, or quarterly releases. Customers often skipped multiple versions and postponed upgrading to some date in the future. Today, however, customers want to consume even enterprise software as SaaS.

To achieve the necessary cloud qualities (especially scalability and high availability), software is split into hundreds of microservices. Microservices make high availability and scalability easier to achieve, but the trade-off is a landscape of hundreds of loosely coupled services without ACID-like transactional consistency. To operate such a fleet of microservices, an end-to-end automated CI/CD pipeline is a prerequisite.

In this article, we, two architects working for SAP, present a production-grade pipeline for testing and deploying Serverless microservices, including canary deployment to production with automatic rollback in case of failures.

Disclaimer: This post describes our personal experience and our personal views. The views expressed are our own and do not necessarily represent the views of SAP SE. This post is not an official guidance or recommendation of SAP SE.

Introduction

First, we want to briefly define what we mean by continuous integration and continuous deployment.

By continuous integration, we mean the various testing activities that run for every commit (or pull request). Those include unit tests, code style checks, and integration tests with other services. Additionally, there can be any test that is considered necessary to ensure a high level of SLAs, such as availability, or to support the transition from one version of the software to another. This broader definition seems to have become very common lately.

By continuous deployment, we mean pushing merged changes on the master branch into production. If all tests pass, the code is ready to be used by customers. Therefore, intensive testing is essential.

However, we all agree that rare situations sometimes lead to unforeseen states or nasty bugs that only occur under production traffic. With a fleet of hundreds of microservices, this is unavoidable. To avoid downtime and achieve 99.999% availability, we need a system for gradual deployment that limits the impact of defects. Additionally, we need a mechanism for aborting the deployment in case of failures. The rollback must be fully autonomous.

All in all, at least the following steps are required. Of course there can be reasons to include more steps.

  • testing every commit (or pull request) before merging to master
  • executing unit tests, code checks and so on
  • running integration tests with other services in a QA environment
  • optionally deploying to staging (only on master branch)
  • deploying to production (only on master branch)
  • automatically rolling back in case of failure during deployment
  • all steps above must be completely automated

Below, you can see an overview of the pipeline we want to build.

Overview of the target pipeline

Serverless Applications

If you are already familiar with Serverless applications, you can skip this section. If not, here is a 10,000-foot overview.

Serverless is a concept for operating cloud applications, in contrast to VM-based or container-based architectures. The code is executed by Function-as-a-Service (FaaS) offerings like AWS Lambda. You only provide your source code (think Node.js files and libraries); AWS takes care of execution, scaling, and so on. But FaaS is only the core of Serverless. The idea of Serverless architectures is to use additional managed services as well, such as S3 for storage, DynamoDB as a database, and Cognito for user management. All these services scale with your demand, you pay only for the resources you use (not for idle capacity), and they are all operated by your cloud vendor, e.g. AWS.

Serverless applications can be a cost saver, but above all, they remove most of the operational complexity. Serverless makes it easy to write highly available and scalable cloud applications.

Serverless 101 architecture

The “Hello World” Serverless web application is shown above. Your static web assets are stored in S3 (1). Your website makes a REST call to API Gateway (2). To handle the HTTP (REST) request, API Gateway invokes your Lambda function (3). The Lambda function retrieves data from DynamoDB (4) and sends a (JSON) response back to API Gateway, which answers the HTTP request.

You only have to deal with your custom business logic in the Lambda function. Everything else is operated and provisioned by AWS. You can influence the behavior through configuration (such as the keys of the tables in DynamoDB). With this simple setup, you get a highly available, highly scalable, elastic, and low-cost solution without investing time in any of these capabilities or in infrastructure operations. That is why Serverless applications are a game changer.

Building the Pipeline

Even though we really like AWS’ services, we have to admit that CodeCommit (Git) and CodePipeline (CI/CD) are not ready for production use yet. The problem with CodePipeline is that you need to create a pipeline for every branch, and you cannot easily create one automatically without running CloudFormation. Additionally, if a pipeline runs for two consecutive commits concurrently, you lose track of the state: you cannot distinguish between the two runs.

Because of that, we happily used GitLab and GitLab CI. However, we assume the same can be achieved with GitHub, Bitbucket, and so on, combined with Jenkins, Travis CI, or one of the many other CI services.

Before you can build the pipeline, your infrastructure must be fully automated. We rely on CloudFormation, which is AWS’ own Infrastructure as Code offering, but there are others, like HashiCorp’s Terraform. We created a SAM template that automates everything.

Note: All code can be found on GitHub. The repository contains multiple versions of the same file, e.g. 1.gitlab-ci.yml and 2.gitlab-ci.yml. The first one is the initial file, the second one the updated file. All changes are explained in this article.

The following SAM template creates a Lambda function and a REST API that invokes the Lambda function for all GET requests on the root.

Below you can see a simple Hello World Lambda function. We added a dependency to show how to install dependencies in the pipeline.
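
The original listing lives in the GitHub repository. As a rough idea of what such a function might look like, here is a minimal sketch of a handler behind an API Gateway proxy integration. The helper name buildResponse and the response payload are our assumptions for illustration, and the third-party dependency is left out so that the snippet runs without installing anything.

```javascript
// Hypothetical sketch of a Hello World handler, not the repository version.
// The third-party dependency mentioned in the article is omitted on purpose.

// Builds the response object that API Gateway expects from a proxy integration.
function buildResponse (statusCode, payload) {
  return {
    statusCode,
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(payload)
  }
}

// Lambda entry point: answers every request with a Hello World message.
async function handler (event, context) {
  return buildResponse(200, { message: 'Hello World' })
}

module.exports = { handler, buildResponse }
```

Keeping the response construction in a small helper makes the function easy to unit test without any AWS resources.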

We do a very simple unit test by using the following test.js file. It simply checks whether the function returns a valid Hello World response:

For linting, we use the linter StandardJS.

What do we have so far? A SAM/CloudFormation template for creating our stack, the code for the Lambda function, a unit test, and a linter. Let us start building the pipeline. GitLab uses a YAML configuration file in the root of the project (.gitlab-ci.yml) for defining the pipeline.

GitLab organizes pipelines in stages. A stage consists of one or multiple jobs. The jobs of a stage are executed concurrently. The next stage is only started if all jobs of the previous stage finished successfully.
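
To make these semantics concrete, here is a small model of them in plain JavaScript. It is only an illustration of the rules described above, not how GitLab is implemented; the function name runPipeline and the shape of the stage objects are our assumptions.

```javascript
// Simplified model of the stage semantics, not GitLab's implementation.
// A pipeline is an ordered list of stages; each stage has a name and a list
// of jobs; a job is an async function that rejects when it fails.
async function runPipeline (stages) {
  const finished = []
  for (const stage of stages) {
    // All jobs of one stage are started together and run concurrently.
    const results = await Promise.allSettled(stage.jobs.map((job) => job()))
    if (results.some((result) => result.status === 'rejected')) {
      // A failed job stops the pipeline; later stages are never started.
      return { passed: false, failedStage: stage.name, finished }
    }
    finished.push(stage.name)
  }
  return { passed: true, failedStage: null, finished }
}

// Usage with two hypothetical stages: the failing lint job stops the pipeline.
const ok = async () => 'ok'
const failing = async () => { throw new Error('lint error') }
runPipeline([
  { name: 'install', jobs: [ok] },
  { name: 'test', jobs: [ok, failing] }
]).then((result) => console.log(result))
```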

For starting with the pipeline, we will have two simple stages. The first stage installs all dependencies. The second stage is the test stage.

Add the file as .gitlab-ci.yml to your repository, commit it, and push it to your GitLab repository. Next, you can see the pipeline running on GitLab. If all jobs pass, it should look like the following:

Great! We already have basic testing. Next, we want to deploy your web application to AWS.

We need an S3 bucket for your source code. Create a new S3 bucket, e.g. pipeline-demo-your-name.

Additionally, you need an AWS user with programmatic access and the administrator policy attached. It is good practice to restrict the rights in production use to the bare minimum. For this demo purpose we are fine with administrator access. Add the AWS_ACCESS_KEY_ID and the AWS_SECRET_ACCESS_KEY as secret variables to GitLab. Go to your repository, click settings, click CI/CD, and then secret variables.

Great! All the annoying configuration work is done, and we can focus on the pipeline again. We want to deploy your application to AWS, so we extend your .gitlab-ci.yml file. Change the S3_BUCKET variable to your S3 bucket and choose a region. It must be the region of your bucket!

First, we need to package your app. Basically, that means the source code must be uploaded to your S3 bucket. Next, we can deploy your app using the generated template file. Commit and push your changes to your repository:

If you go to the API Gateway configuration in the AWS Console and open the API, you see that there are two stages: prod and Stage. Unfortunately, Stage is created due to a bug in AWS SAM [Link]. Click on the stage prod and then on the “invoke URL”. You should see your Hello World response.

Next, we want to add another stage to your pipeline. We want to create the full stack, test against the REST API, and check whether the result of the REST API is valid. Once the tests have finished, we want to delete the stack again.

To be able to test your API, we need to export the API Gateway URL. Because of that, we extend your template.yml file:

Next, we add three stages to your .gitlab-ci.yml: qaCreate, qaTest, and qaRemove. In the first stage, we create a full copy of your stack; in the second, we test against this test stack; and in the last, we remove the stack. Thanks to Serverless, we don’t pay for the second stack. We pay only for the invocations during testing.
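
What the test in the qaTest stage does is up to you. As one possible shape, the following sketch requests the endpoint and checks the status code and the body. The function name checkEndpoint and the environment variable API_URL are hypothetical; the pipeline would have to fill API_URL with the exported API Gateway URL of the QA stack. The fetch function is passed in as a parameter so the check can also be exercised against a fake; on Node.js 18 or later the built-in fetch can be used.

```javascript
// Sketch of a QA check against the deployed REST API. `fetchFn` is injected
// so that the check can be run against a fake in tests.
async function checkEndpoint (url, fetchFn) {
  const response = await fetchFn(url)
  if (response.status !== 200) {
    throw new Error(`Expected HTTP 200 but got ${response.status}`)
  }
  const body = await response.text()
  if (!body.includes('Hello World')) {
    throw new Error(`Unexpected response body: ${body}`)
  }
  return true
}

// API_URL is a hypothetical variable holding the URL of the QA stack.
// The check only runs when it is set (requires the global fetch of Node.js 18+).
if (process.env.API_URL) {
  checkEndpoint(process.env.API_URL, globalThis.fetch)
    .then(() => console.log('QA check passed'))
    .catch((err) => {
      console.error(err.message)
      process.exitCode = 1
    })
}
```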

Apply the changes, commit them and push them to your repository. Next, watch your pipeline running:

Wow, that’s already a pretty good pipeline. But we can do better. Sooner or later, we will deploy a bug to production despite the comprehensive testing. We want to deploy gradually and watch how the new version performs under real-world traffic. Therefore, we want to use the Gradual Code Deployment feature of AWS.

The idea is the following: instead of shifting all traffic to the updated Lambda function at once, the old and the new version run concurrently. At the beginning, a small fraction of the traffic, e.g. ten percent, is handled by the new version and the rest by the old. The share is then increased gradually until all traffic is sent to the new version.
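
The arithmetic behind such a shift is simple. The sketch below assumes a linear schedule that adds ten percent every minute; these values and both function names are our illustrative assumptions, not settings taken from the template in the repository.

```javascript
// Share of traffic (in percent) that the new version receives after
// `elapsedMinutes`, for a linear shift of `stepPercent` every `intervalMinutes`.
// The first step is applied immediately at the start of the deployment.
function newVersionShare (elapsedMinutes, stepPercent = 10, intervalMinutes = 1) {
  const appliedSteps = Math.floor(elapsedMinutes / intervalMinutes) + 1
  return Math.min(100, appliedSteps * stepPercent)
}

// Picks the version that handles one request. `random` is a number in [0, 1)
// and is a parameter so that the routing decision is reproducible.
function pickVersion (sharePercent, random = Math.random()) {
  return random * 100 < sharePercent ? 'new' : 'old'
}

// Usage: print the share of the new version for the first three minutes.
for (let minute = 0; minute < 3; minute++) {
  console.log(`minute ${minute}: ${newVersionShare(minute)}% on the new version`)
}
```

Because each request is routed independently, a caller who reloads the page during the shift may see responses from both versions.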

During the deployment, CloudWatch alarms can be checked. If one of the alarms changes its state to alarm, the deployment is aborted and all traffic is handled only by the old version again.
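
The control flow of this abort can be modeled in a few lines. The following is a toy simulation under our own assumptions, not CodeDeploy's implementation: shares is a hypothetical list of traffic shares per step, and alarmFiredAt is a hypothetical callback that reports whether any watched alarm is in the alarm state at a given step.

```javascript
// Toy simulation of a gradual deployment that is guarded by alarms.
function simulateDeployment (shares, alarmFiredAt) {
  const history = []
  for (let step = 0; step < shares.length; step++) {
    if (alarmFiredAt(step)) {
      // Abort: all traffic is handled by the old version again.
      history.push({ step, newVersionShare: 0 })
      return { status: 'ROLLED_BACK', history }
    }
    history.push({ step, newVersionShare: shares[step] })
  }
  return { status: 'SUCCEEDED', history }
}

// Usage: the alarm fires at the second step, so the deployment is rolled back.
console.log(simulateDeployment([10, 50, 100], (step) => step === 1))
```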

In order to achieve this, we need to change three things in your template.yml. First, we add a CloudWatch alarm, which triggers once there are more than zero errors. Second, we enable the gradual code shifting feature. Third, we need to change the API endpoint to invoke the live alias.

To see some changes, we add a version output to your Lambda function:
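
The exact change is in the repository. One way to do it, sketched here as an assumption on our side, is to read the functionVersion property that the Lambda runtime provides on the context object and add it to the response:

```javascript
// Sketch of a handler with a version output. `context.functionVersion` is
// provided by the Lambda runtime and holds the version that is executing.
async function handler (event, context) {
  return {
    statusCode: 200,
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      message: 'Hello World',
      version: context.functionVersion
    })
  }
}

module.exports = { handler }
```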

Push to your repository and wait for the deployment. Note that the first deployment will not use traffic shifting because this is the first time we use the alias.

Now open your endpoint URL. It shows your message and the current version.

Next, we introduce a small bug. See the code below. Unfortunately, the bug won’t be identified by your testing:
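
The original listing is in the repository. A bug of this kind could look like the following sketch, which is our assumption and only matches the crash=true query parameter used later in this article: the function throws whenever that parameter is present, which none of the tests ever send.

```javascript
// Sketch of a handler with a deliberate bug that the tests do not cover.
async function handler (event, context) {
  // API Gateway passes null when the request has no query string.
  const query = event.queryStringParameters || {}
  if (query.crash === 'true') {
    // The bug: an uncaught error. Lambda counts it as a function error and
    // API Gateway answers the request with an Internal Server Error.
    throw new Error('Simulated crash')
  }
  return {
    statusCode: 200,
    body: JSON.stringify({
      message: 'Hello World',
      version: context.functionVersion
    })
  }
}

module.exports = { handler }
```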

Commit your changes and push to your repository. While you wait, you can open CloudFormation and CloudWatch. Wait until the pipeline executes the deployProd job. Click on your production stack, e.g. demo-pipeline-prod-stack, and open the Events tab. During the deployment, you will see a link to CodeDeploy in the “Status Reason” column (see image below):

CodeDeploy shows you the current status of your traffic shifting:

Now invoke your endpoint and append ?crash=true to the URL. Reload the page several times. Sometimes you will see your Hello World message and the old version number, sometimes you will see Internal Server Error.

Next, open CloudWatch and wait until the alarm state changes to Alarm. The error needs roughly one or two minutes to propagate. Once the alarm state has changed, you can see in CodeDeploy that the deployment was aborted.

Open your API Gateway endpoint, i.e. the invoke URL of the prod stage, again. You will only get answers from the old version of your Lambda function.

Congratulations, you’ve built an end-to-end CI/CD pipeline for a self-healing Serverless service.

Last but not least, you could add only: master as a property to each job of the pipeline that should only be executed on the master branch. You can find GitLab’s documentation here. Additionally, in the settings of your repository, you can configure that branches can only be merged if the pipeline succeeded. Finally, if you mark the master branch as a protected branch, no one is allowed to push directly to it. Thereby, you ensure that each commit on the master branch has passed the pipeline before it is merged.

Closing Remarks

This post explained how to build an end-to-end CI/CD pipeline for a Serverless microservice. To operate high-quality cloud apps, i.e. to ensure high availability and scalability, you must automate everything. This pipeline can help you achieve 99.999% availability by detecting errors as early as possible and stopping the deployment once an error has been identified.

If you use this pipeline in production, great. But consider adding more alarms to monitor the health of your application, and use the pre-hooks and the post-hooks. And, of course, restrict the permissions of the deployment user to the bare minimum.

I would like to thank my colleagues at SAP, especially Carsten Ziegler, for reviewing this blog post and providing feedback.