This article shares how our team safely deployed new versions of a serverless stack to production with zero downtime. Our aim is to show how to identify the scope for canary-based deployment in an application. The design discussed here is not a generic approach; it explains how we found the scope for Blue/Green deployment in our own giant application. One caveat before going into detail: this design works only with an Infrastructure as Code (IaC) model.
It all started with breaking down one big, fat serverless stack into multiple loosely coupled stacks that share resources. Small, easily manageable applications are generally preferable to a single monolithic application. We use AWS Service Discovery and SSM to share resource attributes across stacks.
Our entire infrastructure is a serverless IaC stack, built with AWS CDK and TypeScript and organized along DDD (domain-driven design) lines. Stacks can be deployed with a single cdk deploy command. CDK is a developer-friendly layer over CloudFormation: it synthesizes CloudFormation templates from code.
The core stack in our project is called ‘Transformer’. It performs batch ETL jobs on Athena, EMR Spark, and ES. The upstream (calling) application starts a Transformer job by invoking a Pipeline Lambda by its ARN, which kicks off an ETL process and creates an orchestrated workflow, called a StateMachine, in AWS Step Functions. This StateMachine runs steps in series and in parallel, performs the batch processing, and finishes with a final status.
We designed an efficient CI/CD pipeline for dev builds and for staging and production releases. Our release approach is tagging: tagging a branch releases a new version of the stack. We follow semantic versioning for tags to bring some sanity to managing rapidly moving software release targets. The CloudFormation stack name is suffixed with the tag on every deployment. We use CodeBuild and CodePipeline for the CI/CD stack. In CodeBuild we filter the GitHub webhook down to PR and tag events only, and the build logic lives in a buildspec YAML file.
Feature/BugFix Branch PR → CodeBuild in dev account
Staging branch Tag → CodeBuild in dev → Artifact to S3 → CodePipeline → Build & Deploy in Staging Account
Master branch Tag → CodeBuild in dev → Artifact to S3 → CodePipeline → Build & Deploy in Production Account
Post-build commands in the CI CodeBuild project copy the artifacts into the S3 artifact bucket, and that upload event triggers CodePipeline’s source action. The bucket needs a bucket policy that grants cross-account access.
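The tag-based stack naming mentioned earlier can be sketched as a small helper. This is an illustration under stated assumptions, not our actual build code: the function name stackNameForTag, the optional "v" prefix, and the choice to replace dots with hyphens are all hypothetical. The one hard constraint it encodes is that CloudFormation stack names may contain only letters, digits, and hyphens, must start with a letter, and are limited to 128 characters, so a tag such as v1.4.2 cannot be appended verbatim.

```typescript
// Hypothetical helper: derives a CloudFormation stack name from a release tag.
// Assumption: tags are plain MAJOR.MINOR.PATCH, optionally prefixed with "v".
const SEMVER_TAG = /^v?(\d+)\.(\d+)\.(\d+)$/;

// CloudFormation stack names: letters, digits, and hyphens only,
// starting with a letter, at most 128 characters.
const STACK_NAME = /^[A-Za-z][A-Za-z0-9-]{0,127}$/;

function stackNameForTag(baseName: string, tag: string): string {
  const match = SEMVER_TAG.exec(tag);
  if (match === null) {
    throw new Error(`Tag "${tag}" is not a MAJOR.MINOR.PATCH version`);
  }
  const [, major, minor, patch] = match;

  // Dots are not allowed in stack names, so join the version parts with hyphens.
  const name = `${baseName}-v${major}-${minor}-${patch}`;
  if (!STACK_NAME.test(name)) {
    throw new Error(`"${name}" is not a valid CloudFormation stack name`);
  }
  return name;
}
```

With this sketch, stackNameForTag("Transformer", "v1.4.2") yields Transformer-v1-4-2, and a non-version tag such as "latest" is rejected before any deployment starts.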
This design gives us faster, seamless deployments and even lets multiple versions of an application run at the same time. To implement immutable infrastructure, we release a new version of the stack on every deployment. Since we have adopted IaC, deploying a stack through CI/CD is fairly simple and straightforward. But an immutable design brings challenges of its own: human error, risk to production from identical applications running side by side, and inconsistencies with other dependent applications. These challenges pushed us to adopt canary deployment for our immutable Transformer stack.
Given the lifecycle of an ETL request in the Transformer stack described above, we saw decent scope for achieving safe deployments with Blue/Green. We were even willing to change the Transformer's design to make safe deployments possible.
To implement safe deployment, we designed our own canary router: an API endpoint backed by a Lambda proxy that controls traffic between multiple versions. We made sure the upstream (calling) applications needed only a one-time change: they now read the router API endpoint and its key from AWS Service Discovery based on environment attributes.
Previous Design: Request → FrontEnd App → Transformer
Canary Router: Request → FrontEnd App → Router API → Transformer
Two SSM parameters store the data pipeline Lambda ARNs of the current and latest stacks. Normally, the current-stack parameter holds the ARN of the current stack and the latest-stack parameter is blank; when a new version of Transformer is released, the release overwrites the latest-stack parameter with the new ARN. The router Lambda reads both SSM parameters on every invocation to check whether a latest version of the stack is available. If the router finds a valid ARN in the latest-stack parameter, it becomes active and starts routing a small amount of traffic (1–5 requests) to the latest stack at random. The router records the details of every routed request in a DynamoDB table.
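The per-invocation check can be sketched as follows. This is a minimal illustration, not our production router: the names RouterParams, isCanaryActive, and pickTarget are hypothetical, the parameter values are passed in rather than read from SSM, and the 50% canaryShare is an assumed stand-in for "at random". The ARN pattern only checks the general shape of a Lambda function ARN.

```typescript
type StackVersion = "current" | "latest";

// Values of the two SSM parameters, as read on each invocation.
interface RouterParams {
  currentArn: string; // pipeline Lambda ARN of the current stack
  latestArn: string;  // pipeline Lambda ARN of the latest stack, blank if none
}

// General shape of a Lambda function ARN, with an optional version or alias.
const LAMBDA_ARN =
  /^arn:aws[a-z-]*:lambda:[a-z0-9-]+:\d{12}:function:[A-Za-z0-9_-]+(:[A-Za-z0-9_$-]+)?$/;

// The router is "active" only when the latest-stack parameter holds a valid ARN.
function isCanaryActive(params: RouterParams): boolean {
  return LAMBDA_ARN.test(params.latestArn.trim());
}

// Picks the Lambda to invoke. canaryShare (assumed value) is the probability
// of sending a request to the latest stack while the canary is active.
// random is injectable so the choice can be made deterministic in tests.
function pickTarget(
  params: RouterParams,
  canaryShare: number = 0.5,
  random: () => number = Math.random,
): { version: StackVersion; arn: string } {
  if (isCanaryActive(params) && random() < canaryShare) {
    return { version: "latest", arn: params.latestArn.trim() };
  }
  return { version: "current", arn: params.currentArn };
}
```

When the latest-stack parameter is blank, pickTarget always returns the current stack, which matches the dormant behaviour of the router between releases.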
The router records enough detail to track each ETL StateMachine’s final status before it routes more traffic than the allowed threshold. At present, we allow only 2 requests to reach the latest stack. The router performs no validation for the first 2 requests: it points them at the latest stack and records them in the DynamoDB table with an initial status of ‘RUNNING’. Each request has a Transform Id, which is the StateMachine Id. The router looks up the status of the corresponding StateMachines by their Transform Ids and allows no more traffic to the latest stack until the status of both of the first 2 requests is updated to ‘SUCCEEDED’. Since Transform jobs are batch processes by nature, these 2 requests may take a while to complete. As long as the aggregated status of the 2 routed requests is not ‘SUCCEEDED’, traffic simply goes to the current stable version as usual.
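The threshold gate described above can be sketched as a pure function over the recorded canary requests. The names CanaryRecord, GateDecision, and decide are hypothetical, and the records are assumed to carry statuses already refreshed from Step Functions. The article does not say what happens when a canary request fails; in this sketch a failed, timed-out, or aborted execution simply keeps traffic on the current stack, which is an assumption.

```typescript
// Execution statuses reported by AWS Step Functions.
type ExecutionStatus = "RUNNING" | "SUCCEEDED" | "FAILED" | "TIMED_OUT" | "ABORTED";

// One row per request routed to the latest stack.
interface CanaryRecord {
  transformId: string; // the StateMachine Id of the routed request
  status: ExecutionStatus;
}

type GateDecision = "ROUTE_TO_LATEST" | "HOLD_ON_CURRENT" | "PROMOTE_LATEST";

// Number of requests allowed to reach the latest stack before validation.
const CANARY_THRESHOLD = 2;

function decide(
  records: readonly CanaryRecord[],
  threshold: number = CANARY_THRESHOLD,
): GateDecision {
  // Below the threshold, requests go to the latest stack without validation.
  if (records.length < threshold) {
    return "ROUTE_TO_LATEST";
  }
  // At the threshold, the aggregated status must be SUCCEEDED to move on.
  const allSucceeded = records.every((record) => record.status === "SUCCEEDED");
  return allSucceeded ? "PROMOTE_LATEST" : "HOLD_ON_CURRENT";
}
```

Keeping the decision separate from the DynamoDB and Step Functions calls makes the gate easy to unit test with hand-written records.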
Once the StateMachines complete successfully, the router is satisfied with the latest stack's performance and initiates the flip from latest to current. It simply overwrites the current-stack SSM parameter with the latest stack's ARN and resets the latest-stack SSM parameter to blank. With no value in the latest-stack parameter, the router goes back to sleep and performs no further canary routing.
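The flip itself reduces to a small transformation of the two parameter values. The sketch below is illustrative only: RouterParams, BLANK, and flip are hypothetical names, and the function computes the new values without writing them to SSM. Representing "blank" as an empty string is also an assumption; Parameter Store may not accept an empty value, in which case a placeholder string would be needed.

```typescript
// Values of the two SSM parameters.
interface RouterParams {
  currentArn: string; // pipeline Lambda ARN of the current stack
  latestArn: string;  // pipeline Lambda ARN of the latest stack, blank if none
}

// Assumption: how "blank" is represented. Swap in a placeholder string
// if the parameter store rejects empty values.
const BLANK = "";

// Returns the parameter values after promoting the latest stack to current.
function flip(params: RouterParams): RouterParams {
  const latest = params.latestArn.trim();
  if (latest === BLANK) {
    // Nothing to promote: leave both parameters unchanged.
    return { currentArn: params.currentArn, latestArn: BLANK };
  }
  return { currentArn: latest, latestArn: BLANK };
}
```

When persisting the result, writing the current-stack parameter before clearing the latest-stack parameter means an interruption between the two writes cannot lose the new ARN.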
Conclusion: As mentioned, this is not a generic approach that can simply be applied to other serverless stacks. A serverless stack may have a group of Lambdas for various activities such as eventing, queuing, polling, copying, processing, and performing ETL, and every application may be completely different from the next. With all of this in mind, we built our own custom canary router to fit our project's scope. We recommend reviewing your own application's scope and starting on safe deployments as a step toward immutable infrastructure.