POC of zero downtime blue/green deployment with AWS CodePipeline and CloudFormation.

Pablo Perez · Nov 14, 2022 · 35 min read

Zero-downtime deployment means that our website never becomes impaired or unavailable during an update. To incorporate this into our solution, the application load balancer won't serve the new code until all the tests and reviews have been completed in environments like development and staging, and a manual approval has been confirmed before going to production.

The new code is served immediately after the swap, and an instant rollback is possible by swapping the traffic back between the blue (production) and green (non-production) services.

This approach can be shared and customized across different teams based on their needs, because we use infrastructure as code, which lets us capture the whole model in a few templates that can be reproduced in other environments within minutes to set up a CI/CD pipeline for a microservice.

Security is implemented at the highest level, using encryption at rest and in transit, and permissions are hardened based on the principle of least privilege.

Encryption in transit prevents an attacker from sniffing sensitive data in our operations by means of the SSL/TLS protocol, while encryption at rest means that stored data can only be read or written by users who hold the corresponding decryption key.

Apart from aiming for speed when releasing software, we prevent undesired releases by adding a manual approval step with an automatic email notification to the designated person.

By leveraging several cloud accounts, each corresponding to a specific application environment, we isolate the workloads and avoid any interference or security gap.

A blue/green deployment consists of keeping the currently running version (blue) in service while the new version (green) is provisioned; the load balancer then changes the routing of requests to the newer version. In our case, we will do this with a Python script triggered during the pipeline execution.

Multi-account approach

First, you can see that in the tool account we have AWS CodePipeline, a workflow management tool where your code can be built, tested, and deployed in different stages to different environments.

If any problems occur with any step during the process, the pipeline is halted, and a notification is sent via email.

It’s a good practice to have different environments and workloads isolated in different accounts, where different permissions are granted or limited. Also, integration and functional tests must be performed automatically in the development and staging environments separately. Each account has its own set of resources to run the tests, and permissions differ between test and production.

Another important point is that changes in a development pipeline should not block production pipelines. The pipeline deployed in the tool account reacts to any updates in the AWS CodeCommit repo development branch.

Workflow steps are:

1.- Developer pushes a change in the CodeCommit repository development branch which triggers CodePipeline pipeline execution.

2.- Next, during the build stage, the public base Docker image is pulled, the image is built, tagged with the unique CodeBuild job execution ID, and pushed to the ECR repository in the tool account.

3.- A script executed in AWS CodeBuild uses CodePipeline environment variables to identify which target group of the load balancer is linked to listener 443 and which to 8443, i.e. which is production and which is non-production; for each target group it also reads the tag that holds the Docker image tag currently running on it.

4.- CodeBuild sends the tag values containing the image IDs of the blue and green services to the next deploy stage, which uses them as CloudFormation parameter values. The green AWS Fargate service receives the new image tag, which causes the Fargate service to roll out the new task definition.

5.- A Lambda function is executed; it assumes a role in each account, swaps each listener rule to point to the other target group, and updates the Boolean tag value that indicates which target group is in production after the swap.

In production environments, there's a manual approval action associated with an AWS SNS topic, and the subscriber receives an email with a link to the pipeline console to approve.

Steps 3, 4 and 5 are repeated on each environment.

With regard to security, the CodePipeline service itself assumes a role created beforehand in every account that allows it to perform AWS CodeBuild, CloudFormation, and AWS Lambda actions, along with actions on the other services involved, such as Elastic Container Service, Elastic Load Balancing, S3, and CloudWatch.

Moreover, for the pipeline to work securely, a customer master key (CMK) is created to encrypt, at rest and in transit, the data shared in the S3 artifact and scripts buckets and the Docker images in the central ECR repository. The accounts' roles will have permissions to use the CMK to decrypt.

Target account

In the diagram above we can see that the infrastructure sits within a virtual private network where we define a private IP CIDR block. The VPC has an internet gateway, which is the component that allows communication between your VPC and the internet. By default it is highly available and scales automatically, as per the AWS documentation.

Also, it spans different Availability Zones for fault-tolerance purposes; in them you can create different subnets, public or private, to isolate different workloads.

By default, the application load balancer distributes requests evenly across the registered containers in all the enabled Availability Zones. One benefit of Application Load Balancers is that they support path-based routing and rules, which means multiple services can run in our Fargate cluster and we can direct listener ports and traffic to each specific service.

We will configure Fargate auto scaling, which can increase or decrease the number of tasks (that is, the number of running containers) by using the ECS scheduler on Fargate together with CloudWatch alarms and Application Auto Scaling. When a CloudWatch alarm triggers an auto scaling policy we have defined, it increases or decreases the task count.

EFS storage will allow us to mount the same file system on our containers. It supports encryption at rest by specifying our customer master key, and encryption of data in transit with TLS.

Furthermore, we will use AWS PrivateLink, a service that lets you add VPC endpoints, which allow you to privately connect your VPC to supported AWS services. The Fargate cluster will be in private subnets; its tasks communicate with other AWS services through VPC endpoints, so traffic stays within the Amazon network instead of going over the internet. This also allows your tasks to pull private images from Amazon ECR.

Finally, in order to communicate with CloudWatch, we configure the awslogs driver in the containers in your tasks to send log information to CloudWatch Logs. This allows you to view the logs from the containers in your Fargate tasks.

In order to benefit from the advantages described earlier of having different environments for better testing, security, and management, we will create separate accounts by means of AWS Organizations. AWS Organizations facilitates the sharing of resources across accounts, reducing permission-management effort and resource duplication, and allows us to separate workloads into different accounts with all the security and auditing benefits this provides.

The final structure is as below: aside from our management account, we create an organizational unit containing four new accounts: the tool account, where we deploy the pipeline, and the development, staging, and production accounts, where the web application and the resources it needs are deployed.

Organizational units allow us to organize our accounts by hierarchy and environment type, so we can apply specific permissions and controls to each.

Furthermore, we can organize accounts not only by the infrastructure resources deployed on them but also by the software development life cycle: having development, staging, and production accounts with different workloads hosts all the stages required for the SDLC.

Example’s organization layout

Moreover, we will enable CloudFormation StackSets trusted access on our AWS organization accounts.

Defining and deploying the resources and logic required for the solution.

Deploying the Network resources stack on the Dev, Stg and Prod accounts.

For this, we will use a single template and leverage the CloudFormation StackSets feature to deploy it from the management account.

Stack Sets allow you to create, update, or delete stacks across multiple accounts and regions with a single operation. A stack set ensures consistent deployment of the same resources, with the same settings, to all specified target accounts.

First, it should be outlined that we are creating a whole virtual network, logically isolated from other virtual networks within the account and, of course, in the cloud.

Network isolation has traditionally been implemented physically in on-premises environments. In order to get better security, the ability to scale further in the future, and better management of resources, routing, and security rules, I have decided to deploy a new VPC per environment. This lets me deploy my resources in logically isolated virtual networks entirely defined by me, where I can set all the requirements to secure outgoing and incoming traffic.

Below we can see that the first section of the template is Parameters. This section allows you to input strings, comma-delimited lists, and other values into the template in order to set resource property values dynamically. For further insight and accuracy regarding the Parameters section in CloudFormation we refer to the CloudFormation User Guide.

StackSet Parameters Section

On each parameter of the whole project we use the constraint attribute AllowedPattern; here I define a regex expression to accept only a valid CIDR block.

The Default attribute I define means I don't have to input anything when launching the template from the CloudFormation graphical interface or when just executing a script in a terminal.

We define the whole CIDR block for the network and then the different subnets: the ones intended for the databases and the containers have no internet access, while the public ones host the load balancer and the NAT gateways, a Network Address Translation (NAT) service used to reach the internet from the private subnets.
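
As a rough sketch of what this Parameters section can look like (parameter names, default values, and the exact regex are illustrative, not necessarily the ones in the original template):

```yaml
Parameters:
  pVpcCidr:
    Type: String
    Description: CIDR block for the whole VPC
    Default: 10.0.0.0/16
    # Accept only something shaped like an IPv4 CIDR block
    AllowedPattern: '^(\d{1,3}\.){3}\d{1,3}/\d{1,2}$'
  pFargateSubnet1Cidr:
    Type: String
    Description: Private subnet for the containers (AZ 1)
    Default: 10.0.1.0/24
    AllowedPattern: '^(\d{1,3}\.){3}\d{1,3}/\d{1,2}$'
  pPublicSubnet1Cidr:
    Type: String
    Description: Public subnet for the load balancer and NAT gateway (AZ 1)
    Default: 10.0.100.0/24
    AllowedPattern: '^(\d{1,3}\.){3}\d{1,3}/\d{1,2}$'
```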

Next, within the Resources section we define what we are going to deploy; the main items are the network itself along with an internet gateway to provide internet access in the public subnets. It can also be observed that we use the intrinsic function !Ref to refer to the parameter input values.

We will apply tags to all the resources of the solution to further work with them in future automations.

It’s paramount to refer to the CloudFormation resources reference in the user guide to understand what resources can be defined and what properties and values are allowed on each.

CloudFormation StackSet Resources foundational components

For the databases we will create two different private subnets, each with a route table that allows connectivity within the whole VPC. We create two because we want a fault-tolerant database in case one data center goes down.

Each will be in a different Availability Zone, that is, a separate data center. For this we use the intrinsic function Fn::Select, which returns a specific value from an existing list given an index. Along with Select, we use the intrinsic function Fn::GetAZs, which retrieves the list of Availability Zones.

For a list of all available intrinsic functions in CloudFormation we have referred to the intrinsic function reference in the CloudFormation user guide.

Db Network Resources
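
As an illustration (logical names and the CIDR parameter are mine, not the exact ones from the screenshot), one of these database subnets and its route table could be written like this:

```yaml
Resources:
  rDbSubnet1:
    Type: AWS::EC2::Subnet
    Properties:
      VpcId: !Ref rVpc                            # VPC defined earlier in the template
      CidrBlock: !Ref pDbSubnet1Cidr              # from the Parameters section
      AvailabilityZone: !Select [0, !GetAZs '']   # first AZ of the current region
      Tags:
        - Key: Name
          Value: !Sub '${AWS::StackName}-db-subnet-1'
  rDbRouteTable1:
    Type: AWS::EC2::RouteTable
    Properties:
      VpcId: !Ref rVpc
  rDbSubnet1RouteTableAssociation:
    Type: AWS::EC2::SubnetRouteTableAssociation
    Properties:
      SubnetId: !Ref rDbSubnet1
      RouteTableId: !Ref rDbRouteTable1
```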

The serverless container cluster network will have the same approach as the previous databases subnets.

Fargate Network Resources

Next we define the NAT gateway resources. Each NAT gateway needs an Elastic IP allocated to it, as per the AWS documentation on this topic; this is a reserved public IP address that we own until we release it when we no longer need it. This public IP is needed because traffic from the private subnets is translated to this specific IP when going out to the internet.

We use the intrinsic function !Sub below in the AllocationId property to link the NAT gateway with the Elastic IP; the function accesses the attribute that the AWS::EC2::EIP resource type exposes.

StackSet Elastic IPs and NAT Gateways
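
A minimal sketch of the Elastic IP / NAT gateway pair, using !Sub to read the AllocationId attribute as described above (logical names are illustrative):

```yaml
  rNatEip1:
    Type: AWS::EC2::EIP
    Properties:
      Domain: vpc
  rNatGateway1:
    Type: AWS::EC2::NatGateway
    Properties:
      # Fn::Sub can read resource attributes with the ${Logical.Attribute} syntax
      AllocationId: !Sub '${rNatEip1.AllocationId}'
      SubnetId: !Ref rPublicSubnet1          # NAT gateways live in a public subnet
  rPrivateRoute1:
    Type: AWS::EC2::Route
    Properties:
      RouteTableId: !Ref rFargateRouteTable1
      DestinationCidrBlock: 0.0.0.0/0        # default route out through the NAT gateway
      NatGatewayId: !Ref rNatGateway1
```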

Next we define the application load balancer network components: the necessary public subnets (the load balancer is exposed to the internet) and the corresponding route tables.

StackSet Routing and subnets resources

Next we define the security groups for the network. Security groups are similar to Linux iptables firewalls, where we filter ingress and egress traffic by protocol and port, as described by AWS.

We allow ports 80 and 443; the former is redirected to the latter. We also allow 8080 and 8443 in the same way; 443 and 8443 each serve either the production or the non-production version of the website from a Docker container.

We can also see the specific port that allows access to PostgreSQL, and port 2049, which is used for the shared container storage, in this case AWS Elastic File System, which works much like an NFS file system that grows on demand and can be shared among different Docker containers.

Moreover, when defining these rules we should take into account that security groups are stateful, meaning responses to traffic sent through a given port are allowed regardless of the ingress rules.

Sgs I
Sgs II
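
For instance, the load balancer and EFS security groups could look roughly like this, with the ports mentioned above (logical names and the exact rule layout are illustrative):

```yaml
  rAlbSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: Ingress rules for the application load balancer
      VpcId: !Ref rVpc
      SecurityGroupIngress:
        - IpProtocol: tcp
          FromPort: 80            # redirected to 443
          ToPort: 80
          CidrIp: 0.0.0.0/0
        - IpProtocol: tcp
          FromPort: 443           # production listener
          ToPort: 443
          CidrIp: 0.0.0.0/0
        - IpProtocol: tcp
          FromPort: 8443          # non-production listener
          ToPort: 8443
          CidrIp: 0.0.0.0/0
  rEfsSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: NFS access for the containers
      VpcId: !Ref rVpc
      SecurityGroupIngress:
        - IpProtocol: tcp
          FromPort: 2049          # EFS / NFS
          ToPort: 2049
          SourceSecurityGroupId: !Ref rFargateSecurityGroup
```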

Then, the last group of resources we define in this template are the VPC endpoints along with their own security groups. VPC endpoints are virtual network devices that allow connections to other AWS services, like S3 object storage or AWS Secrets Manager, to stay within the virtual private cloud instead of reaching the public endpoints of those services over the internet.

The advantages of them are basically two:

- Internal traffic has no cost, unlike traffic through a NAT gateway, which does.

- Higher security, as traffic is not exposed to the internet.

We use several of them, such as Elastic File System (NFS-like), Secrets Manager to store secrets, and CloudWatch for the monitoring metrics and logs managed and stored by AWS.

VPC Endpoints I
VPC Endpoints II
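
An interface endpoint can be sketched like this, taking Secrets Manager as an example (logical names are illustrative; the service name follows the standard com.amazonaws.<region>.<service> pattern):

```yaml
  rSecretsManagerEndpoint:
    Type: AWS::EC2::VPCEndpoint
    Properties:
      VpcEndpointType: Interface
      ServiceName: !Sub 'com.amazonaws.${AWS::Region}.secretsmanager'
      VpcId: !Ref rVpc
      PrivateDnsEnabled: true                  # resolve the public DNS name to the endpoint
      SubnetIds:
        - !Ref rFargateSubnet1
        - !Ref rFargateSubnet2
      SecurityGroupIds:
        - !Ref rVpcEndpointSecurityGroup       # allows 443 from the container subnets
```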

Finally, we define an Outputs section, where we declare output values taken from the attributes of the previously defined resources, which we will import into other stacks.

We use the CloudFormation pseudo parameter AWS::StackName, fetched through the intrinsic function !Sub, because each output export has to have a unique name within the region of our account, which is required in order to create it.

Defining Output Section Resources
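
A couple of outputs exported with the stack name as a prefix could look like this (output names are illustrative):

```yaml
Outputs:
  oVpcId:
    Description: Id of the VPC created by this stack
    Value: !Ref rVpc
    Export:
      Name: !Sub '${AWS::StackName}-VpcId'        # export names must be unique per region
  oFargateSubnet1:
    Description: First private subnet for the containers
    Value: !Ref rFargateSubnet1
    Export:
      Name: !Sub '${AWS::StackName}-FargateSubnet1'
```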

Deploying Sources stack in the tool account

The template below deploys the following:

- An S3 bucket to store the Python scripts executed by the build-type stages in the pipeline, which identify which load balancer listener is acting as production so the listener swap can be performed properly to serve the new version of the Docker image.

- Another S3 bucket used by the pipeline to store its artifacts.

- An AWS CodeCommit Repository that is deployed as part of this stack. This is a Git repository solution available in AWS.

- An AWS IAM group with permissions restricted to the project CodeCommit repository.

- An AWS KMS (Key Management Service) key to encrypt all the artifacts.

- An AWS SNS (Simple Notification Service) topic to notify on the CodePipeline approval actions.

- AWS IAM roles and policies needed by the pipeline to execute actions across the accounts of the organization leveraging other services.

I’ll describe in detail each excerpt composing the template that will create a Stack in CloudFormation thus deploying the resources.

Below, I have defined an AWS::CloudFormation::Interface metadata key; this controls the grouping and ordering of input parameters when they are displayed in the CloudFormation console.

It’s convenient because, by default, the console alphabetically sorts parameters.

Within the interface there's a single parameter group defined, and each label creates a subgroup where we specify which parameters from the Parameters section are shown as part of this layout.

Parameter Metadata
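
A hedged sketch of such a metadata block, assuming the pCodeCommitRepoName and pEmail parameters mentioned later in the article and one illustrative group label:

```yaml
Metadata:
  AWS::CloudFormation::Interface:
    ParameterGroups:
      - Label:
          default: Pipeline source and notification settings
        Parameters:
          - pCodeCommitRepoName
          - pEmail
    ParameterLabels:
      pEmail:
        default: Email address that receives the approval notifications
```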

In the next step we will define the parameters section.

The Parameters section allows us to pass values into the template in order to set resource property values dynamically.

On each parameter of the whole project we use the constraint attribute AllowedPattern; here I define a regex expression so that only matching input values can be passed, for example a string representing an email address.

In this template all parameters are of type String and have the Default attribute, which sets a predefined value instead of requiring new input when the stack is launched.

The following are a series of excerpts containing the resources defined along an explanation for each.

Source Stack CodeCommit resources

Above we create an AWS CodeCommit Git Repository.

The !Ref function points to the value of the parameter provided, in this case pCodeCommitRepoName.

Also, we create an IAM group named "devs"; all the users added to this group get the defined policy document applied, which only allows them to perform the API calls needed to operate Git branches and to do push and pull Git actions.

We can see there's an intrinsic function called !Sub. In this case !Sub gets the value of the resource's "Arn" attribute, the unique identifier of the resource, so these actions are only valid when working with this repository.
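
Roughly, the repository and the restricted group could be declared like this (the exact action list of the original policy may differ):

```yaml
  rCodeCommitRepo:
    Type: AWS::CodeCommit::Repository
    Properties:
      RepositoryName: !Ref pCodeCommitRepoName
      RepositoryDescription: Dockerfile and application sources for the pipeline
  rDevsGroup:
    Type: AWS::IAM::Group
    Properties:
      GroupName: devs
      Policies:
        - PolicyName: devs-codecommit-access
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action:
                  - codecommit:GitPull
                  - codecommit:GitPush
                  - codecommit:CreateBranch
                  - codecommit:ListBranches
                Resource: !Sub '${rCodeCommitRepo.Arn}'   # only this repository
```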

Sources stack ECR repository

Above we define a managed container registry to store Docker images and artifacts for containers.

We enable encryption at rest on the registry through the EncryptionType property; for this we use a key that we define later in the template and refer to in the KmsKey property.

We also define a policy for the registry so that only the tool account and a specific role in the dev, stg and prod accounts can upload and download Docker images from it.

Sources Stack artifact S3 bucket

Above we define an object storage container, that is, an S3 bucket to store the CodePipeline artifacts.

We set up encryption at rest using the same key as specified in the previous excerpt, which we define later in the template.

To give the S3 bucket a unique name we use a CloudFormation pseudo parameter, which fetches the account number and composes a unique name for the bucket after adding a suffix.

We block all public access to the bucket: the bucket will reject public calls even if public access were otherwise granted, therefore restricting access to AWS services and authorized users only.

The VersioningConfiguration property keeps different variants of the same object: each time an object is uploaded it gets a version ID, so we can go back to a previous version.
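
A condensed sketch of that bucket definition (bucket name, key reference, and logical names are illustrative):

```yaml
  rArtifactBucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketName: !Sub 'codepipeline-artifacts-${AWS::AccountId}'   # account id keeps the name unique
      BucketEncryption:
        ServerSideEncryptionConfiguration:
          - ServerSideEncryptionByDefault:
              SSEAlgorithm: aws:kms
              KMSMasterKeyID: !GetAtt rPipelineKey.Arn              # CMK defined later in the template
      PublicAccessBlockConfiguration:
        BlockPublicAcls: true
        BlockPublicPolicy: true
        IgnorePublicAcls: true
        RestrictPublicBuckets: true
      VersioningConfiguration:
        Status: Enabled
```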

Sources Stack S3 bucket policy
Sources Stack S3 bucket policy II

Above I define a permissions policy for the S3 bucket, specifying which API calls can be invoked against it and by whom.

In order to understand what each API call on each service does, we can refer to the Boto3 SDK API reference, where we will find all the services and available actions; we use this guide to create any script with Python.

It denies unencrypted object uploads, non-TLS/HTTPS connections, and allows only the required AWS S3 API actions from specific roles.

The tool account will be able to invoke the Get and PutObject API calls, along with a CloudFormation service role created in the dev, stg and prod accounts, since I will store in this bucket other templates, known as nested stacks, that I'll go into detail on later.

I'm using the intrinsic functions !Sub and !Ref, along with the pseudo parameter AWS::Partition, which for standard geographical regions (China would be non-standard) is "aws".

Moreover, aside from this pseudo parameter, in order to dynamically compose the string that specifies who is allowed to invoke the API calls, I do string substitution with ${parameter}, naming the parameter whose value I want to fetch.
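
The deny statements described above could be sketched roughly like this; the allow statements for the tool account and the target account roles would follow in the same policy document (names are illustrative):

```yaml
  rArtifactBucketPolicy:
    Type: AWS::S3::BucketPolicy
    Properties:
      Bucket: !Ref rArtifactBucket
      PolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Sid: DenyInsecureTransport            # reject any non-TLS request
            Effect: Deny
            Principal: '*'
            Action: 's3:*'
            Resource:
              - !Sub 'arn:${AWS::Partition}:s3:::${rArtifactBucket}'
              - !Sub 'arn:${AWS::Partition}:s3:::${rArtifactBucket}/*'
            Condition:
              Bool:
                aws:SecureTransport: 'false'
          - Sid: DenyUnencryptedUploads           # force KMS server-side encryption on PutObject
            Effect: Deny
            Principal: '*'
            Action: 's3:PutObject'
            Resource: !Sub 'arn:${AWS::Partition}:s3:::${rArtifactBucket}/*'
            Condition:
              StringNotEquals:
                s3:x-amz-server-side-encryption: aws:kms
          # ... Allow statements for the pipeline role and the CloudFormation
          # service roles of the dev, stg and prod accounts go here
```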

Pipeline KMS Key
Pipeline KMS Key II

Above we define the AWS KMS key that we will use to encrypt the pipeline artifacts, the S3 bucket, and the container registry.

We add the dev, stg and prod account roles to the key policy to allow them to use the key; the pipeline actions require these roles to have permissions to encrypt and decrypt with it in order to perform the operations of each stage.
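
A trimmed-down idea of that key and its cross-account policy (the account-id parameters, role names, and exact action list are illustrative):

```yaml
  rPipelineKey:
    Type: AWS::KMS::Key
    Properties:
      Description: CMK for pipeline artifacts, scripts bucket and ECR images
      EnableKeyRotation: true
      KeyPolicy:
        Version: '2012-10-17'
        Statement:
          - Sid: AllowKeyAdministration
            Effect: Allow
            Principal:
              AWS: !Sub 'arn:${AWS::Partition}:iam::${AWS::AccountId}:root'
            Action: 'kms:*'
            Resource: '*'
          - Sid: AllowTargetAccountRolesToUseKey
            Effect: Allow
            Principal:
              AWS:
                - !Sub 'arn:${AWS::Partition}:iam::${pDevAccountId}:role/cross-account-role'
                - !Sub 'arn:${AWS::Partition}:iam::${pStgAccountId}:role/cross-account-role'
                - !Sub 'arn:${AWS::Partition}:iam::${pProdAccountId}:role/cross-account-role'
            Action:
              - kms:Encrypt
              - kms:Decrypt
              - kms:ReEncrypt*
              - kms:GenerateDataKey*
              - kms:DescribeKey
            Resource: '*'
```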

SNS Resources

Above we define two AWS Simple notification service resources, a topic and a subscription to the topic of type email. This is required to add a manual approval step before the application is deployed to production. An email is sent to the address provided above in the parameter pEmail.
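
These two resources are short enough to sketch in full (the topic name is illustrative):

```yaml
  rApprovalTopic:
    Type: AWS::SNS::Topic
    Properties:
      TopicName: pipeline-manual-approval
  rApprovalSubscription:
    Type: AWS::SNS::Subscription
    Properties:
      TopicArn: !Ref rApprovalTopic
      Protocol: email
      Endpoint: !Ref pEmail        # address that receives the approval request
```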

Next we will create a service role for AWS CodePipeline itself. This is like giving a role to the entity that, under the hood, processes the pipeline and performs operations on our resources; unless we grant the permissions to this entity, the pipeline won't work.

The permissions are defined in the form of an AWS IAM (Identity and Access Management) policy that is linked to the service role by using the intrinsic function !Ref in the corresponding property of the policy definition in CloudFormation.

Pipeline Role I
Pipeline Role II
Pipeline Role III

Above we can see that for every service related to the pipeline, only the minimum permissions required are granted, following the least-privilege strategy. Whenever possible we select only the required API calls instead of granting all permissions.

This role is assumed by the dev, stg and prod accounts as well, so we need to grant permissions for that action too.

Above, at the beginning of the policy definition we have added a DependsOn attribute, listing some resources in a specific order. We need this in this particular case to avoid dependency errors when deploying the policy: CloudFormation is supposed to take care of dependencies, but for some resources, like policies defined in a property, it doesn't always resolve the order correctly.

CodeBuild Policy I
CodeBuild Policy II
CodeBuild Policy III

Above I define the AWS IAM role for AWS CodeBuild. This service executes the scripts that build the Docker image, push it to the registry, and afterwards check the load balancer listeners to find out which one is in production, along with the logic that makes this possible.

Among other permissions that we have already discussed, we can find permissions to publish logs to AWS CloudWatch Logs, so we can troubleshoot any issue with the script executions.

Furthermore, there are permissions to work with the registry, the S3 buckets, and encryption and decryption with the defined key, plus EC2 permissions to apply tags to the load balancer target groups, which underpins the logic that identifies which one is in production.

Roles in the dev, stg and prod accounts will be given permissions as well, to allow AWS CodeBuild to execute the logic in each account.

Next, we can see a role and its policy definition that will be used in each of the dev, stg and prod accounts; it swaps the listeners and marks the pipeline task as success or failure when the Lambda function is executed.

IAM Role for Lambda
Policy for Lambda Role

Once the AWS CloudFormation stack is created, resource values that we will use in other stacks can be accessed by specifying them in the Outputs section. The Outputs section declares output values that you want to return for cross-stack referencing, where you can export and import values, or simply to provide descriptions.

When we define other resources that are needed, like the pipeline itself and its stages for example, all the attributes’ values of resources already deployed will be imported, like the Key identifier or a role identifier, etc…

Sources stack Outputs section I
Sources stack Outputs section II

The exports have to have unique names; that's why, again, we use the !Sub intrinsic function with CloudFormation pseudo parameters.

There is one caveat here. The role policies refer to principals from the target accounts that don't exist yet, such as the roles in those accounts that the CodePipeline role will assume to execute actions there, and the KMS key policy refers to the target account roles that must decrypt artifacts for the pipeline operations to succeed. Because we haven't created them yet (we will in the next step), we comment out these principals in the policies. We save the template file with a name that indicates these principals are commented out, and after the next step is completed we update this same stack with the uncommented template.

Deploying Application Load balancer and IAM Roles stack in the target accounts

We start deploying resources on the dev, stg and prod accounts. First we define a template containing the application load balancer and the different roles needed to use the services and tools that compose the solution and are integrated into the pipeline.

ALB-Roles stack parameters, mappings, and conditions

Above we can see that we are using some parameters with values that we need for our stack. These values cannot be imported from another stack; for example, the tool account key identifier can't be imported because exports and imports don't work across accounts.

We are defining two new sections, Mappings and Conditions; combining both, we assign private subnets to the load balancer in the development environment and public subnets in staging and production.

A Mappings object declares nested levels of name-value pairs that provide a resource property with the correct value by means of the Fn::FindInMap intrinsic function. This allows you to implement some conditional logic. Conditions define when a resource is created or when a property is set; they are evaluated based on input parameter values that you specify when you create or update a stack. At stack creation or update, CloudFormation evaluates all the conditions in your template before creating any resources.

An example can be seen below where, with the mappings and conditions already defined, we define an application load balancer and use the !If function to assign a value depending on whether a condition is met.

Application Load Balancer definition with conditions

Above it can be noticed that we are using the intrinsic function Fn::ImportValue, which allows us to import from the network stack the values of the subnets and security groups that we provide to the load balancer configuration. That's the reason why we were asking for the network stack name in the Parameters section: that stack was created previously by the StackSets functionality, which generates a random name with a specific prefix that we make sure is present with a regex rule.
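
Putting those pieces together, a hedged sketch of the mappings, the condition, and the load balancer that uses them; pEnvironment, pNetworkStackName, the export names, and the mapping values are illustrative assumptions, not the exact ones of the original template:

```yaml
Mappings:
  mEnvironmentMap:
    dev:
      Scheme: internal
    stg:
      Scheme: internet-facing
    prod:
      Scheme: internet-facing

Conditions:
  cIsDev: !Equals [!Ref pEnvironment, dev]

Resources:
  rApplicationLoadBalancer:
    Type: AWS::ElasticLoadBalancingV2::LoadBalancer
    Properties:
      Type: application
      Scheme: !FindInMap [mEnvironmentMap, !Ref pEnvironment, Scheme]
      Subnets: !If
        - cIsDev
        # private subnets exported by the network StackSet instance in this account
        - - Fn::ImportValue: !Sub '${pNetworkStackName}-FargateSubnet1'
          - Fn::ImportValue: !Sub '${pNetworkStackName}-FargateSubnet2'
        # public subnets for staging and production
        - - Fn::ImportValue: !Sub '${pNetworkStackName}-PublicSubnet1'
          - Fn::ImportValue: !Sub '${pNetworkStackName}-PublicSubnet2'
      SecurityGroups:
        - Fn::ImportValue: !Sub '${pNetworkStackName}-AlbSecurityGroup'
```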

The rest of the template contains definitions of roles that will be used by the Pipeline:

  • A CloudFormation service role to deploy stacks that will be passed to the pipeline for each target account to be assumed in order to deploy the defined resources.
CloudFormation Service Role for the target accounts
  • A cross-account role that will allow the tool account to access resources and services in the target accounts (a sketch follows after this list).
Role to grant the tool account permissions on target accounts
  • Both task and execution role for the Elastic Container Service, in this case we are using Fargate serverless service to launch and maintain the containers. As per the documentation of AWS, the execution role executes actions related to the cloud service that manages the containers while the task role allows the container to invoke other AWS services when needed.
ECS Task Role
ECS Execution Role
  • An AWS Lambda role that allows the tool account to execute the AWS lambda function in each target account with permissions to invoke API calls related to the load balancer service and EC2 tags as well as notifying the status to the pipeline, all of this allows the logic in the function to be executed correctly.
Lambda Role and Policy
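
As an illustration of the cross-account trust involved, the role from the second bullet could be shaped roughly like this (policy heavily trimmed; the role name, pToolAccountId, and action list are illustrative):

```yaml
  rCrossAccountRole:
    Type: AWS::IAM::Role
    Properties:
      RoleName: cross-account-role
      AssumeRolePolicyDocument:              # only the tool account may assume this role
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              AWS: !Sub 'arn:${AWS::Partition}:iam::${pToolAccountId}:root'
            Action: sts:AssumeRole
      Policies:
        - PolicyName: cross-account-access
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action:
                  - elasticloadbalancing:Describe*
                  - elasticloadbalancing:AddTags
                  - cloudformation:DescribeStacks
                  - s3:GetObject
                Resource: '*'
```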

Finally, we define another KMS Key, so each account will use it to encrypt its own resources. Within the policy we will grant permissions to the tool account to decrypt the data from each target account as well.

Target accounts KMS key definition

Defining and deploying the pipeline along with its scripts in the tool account.

First I'm going to show how the pipeline looks once it is deployed. In the pictures below we can see the pipeline working smoothly at every step; however, at this point we will not release any code to our Git repo yet, as another stack is still needed in the target accounts.

Then we will go thoroughly through the stack where we define all the pipeline stages and steps, which results in a cross-account CI/CD pipeline with different isolated environments.

Deployed Pipeline I
Deployed Pipeline II
Deployed Pipeline III
Deployed Pipeline IV
Deployed Pipeline V
Deployed Pipeline VI
Deployed Pipeline VII

As can be seen above, in summary there's a stage named Source in the tool account that triggers the pipeline execution whenever any asset of the three "roots" is updated: the AWS CodeCommit repository, where the developers push their Dockerfile along with the files it needs, and the AWS S3 buckets, from which the other CloudFormation templates needed and the scripts implementing the blue/green deployment logic are fetched. All of these are in the tool account.

From here, there is a Build stage that uses AWS CodeBuild to log in to Docker, build a new Docker image, and push it to our Amazon ECR repository, while generating an output artifact with the tag ID set on the new image. We use the unique build ID for this.

Then, in the next stage, we perform the following steps in the dev account; they are executed later in staging and production if nothing fails and we approve:

- Discovery. AWS CodeBuild executes a script that verifies that the application load balancer used by our solution exists, checks the tags to determine which target group is forwarding the traffic to the Docker container in production, and finds out which image it is using.

As per the AWS documentation, a target group is a logical entity that acts as a router, distributing incoming requests from the load balancer listener to the specified targets based on a defined rule and its conditions.

- Deploy. AWS CloudFormation updates the AWS Fargate cluster stack (we will deploy it later) and passes the parameter with the new Docker tag to the cluster service that is not in production, which is running an old Docker image. For this to succeed we use the input artifact resulting from the previous step, which contains the new Docker image tag in the specific parameter corresponding to the service to be updated.

- Swap Target Group AWS Lambda function. We have created an AWS Lambda function that executes a Python script in the specified environment or account; it swaps the target group by modifying the load balancer listener rule. This forwards the traffic to the new AWS Fargate service running the container with the new Docker image.

And finally, only in the staging and production accounts, the pipeline has a manual approval step. This manual approval sends an email to the address we defined in the previous "sources" stack and keeps the pipeline stopped for up to 7 days, giving us time to review our new release carefully before approving it and executing the Lambda function that swaps the load balancer target group.

We receive an email with a notice regarding the approval of the blue/green deployment, that is, a load balancer target group swap to the container that runs the newer version of the Docker image.

Manual Approval notification

We review the manual approval and approve it so the workflow can go ahead, which triggers the Lambda function that performs the target group swap in the load balancer, using the AWS SDK to implement the logic.

In the picture below we can see the target group is correctly swapped to the AWS Fargate Service that contains the task with the docker container that has the new version of the website.

Lambda function execution log

We will discuss now the most relevant parts of the pipeline template.

Source Stage

In the ArtifactStore property we specify that the input and output artifacts used by our pipeline will be stored in an AWS S3 bucket, which is encrypted with a specific key, both created previously in our Sources CloudFormation stack in the tool account; we can therefore import these values with the intrinsic function Fn::ImportValue and do variable substitution with !Sub "${x}".

Pipeline Source Stage I

Then we define the steps, as explained some paragraphs above in this document. We start by specifying the Source category and the CodeCommit repository and branch that will trigger the pipeline. We have enabled the option for the pipeline to poll for any changes in the repo, and we specify the name of the output artifact, as we will use it later in another step.

Pipeline Source Stage II

Within the same stage there are two other sources, which store the CloudFormation templates needed to update our AWS container cluster and the scripts for the blue/green logic regarding the load balancer traffic rule. Because the RunOrder attribute is "1" for the three sources, they are polled in parallel.
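
A compressed sketch of the pipeline skeleton with the artifact store and one of the three RunOrder 1 source actions (export, artifact, and parameter names are illustrative):

```yaml
  rPipeline:
    Type: AWS::CodePipeline::Pipeline
    Properties:
      RoleArn:
        Fn::ImportValue: !Sub '${pSourcesStackName}-PipelineRoleArn'
      ArtifactStore:
        Type: S3
        Location:
          Fn::ImportValue: !Sub '${pSourcesStackName}-ArtifactBucketName'
        EncryptionKey:
          Type: KMS
          Id:
            Fn::ImportValue: !Sub '${pSourcesStackName}-PipelineKeyArn'
      Stages:
        - Name: Source
          Actions:
            - Name: AppSource
              RunOrder: 1
              ActionTypeId:
                Category: Source
                Owner: AWS
                Provider: CodeCommit
                Version: '1'
              Configuration:
                RepositoryName: !Ref pCodeCommitRepoName
                BranchName: development
                PollForSourceChanges: true
              OutputArtifacts:
                - Name: AppSourceOutput
            # ... two more RunOrder 1 source actions pull templates.zip and
            # scripts.zip from the S3 sources bucket in the same way, and the
            # Build, Deploy, Approval and Swap stages follow
```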

Build Stage

Next, the build stage points to the CodeBuild project defined in the same template for this purpose. Again, we define the input and output artifact names.

Build Stage

Below we can find the project definition used to build the Docker image, which is linked from the first build stage shown in the previous screenshot. It is executed only once, in the tool account, because the Elastic Container Registry is centralized there and shared with the rest of the accounts and environments.

CodeBuild Docker Image project

As with all the resources defined, the project is encrypted with our custom key, for encryption both at rest and in transit. We use a small Linux container type and pass the variables needed, like the one pointing to the AWS Secrets Manager object that contains my Docker Hub credentials, needed when building the image to pull the Apache base image from Docker Hub; in this case we use the CloudFormation resolve syntax specific to Secrets Manager objects.

The service role with the proper permissions is imported from the Sources stack we deployed previously. Then, in BuildSpec, we define the different phases of execution of our script: we first update the operating system and the AWS CLI, then we compose, from the CodeBuild logs and the repo path, the final tag containing the build ID, which we will use as the Docker image tag and, in the next pipeline stage, to discover the tag set on the last image pushed to our private Docker repository.

For the Docker login we use variable substitution to pass our Docker credentials; then we log in to our AWS ECR repository and, with those temporary session credentials, we push the image.

The output artifact will be inspected as described in the previous paragraph.
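
A condensed sketch of such a project with an inline buildspec; the secret name, environment variables, and exact commands are illustrative, and this sketch uses CodeBuild's SECRETS_MANAGER variable type rather than the resolve syntax mentioned above:

```yaml
  rDockerBuildProject:
    Type: AWS::CodeBuild::Project
    Properties:
      Name: docker-image-build
      EncryptionKey:
        Fn::ImportValue: !Sub '${pSourcesStackName}-PipelineKeyArn'
      ServiceRole:
        Fn::ImportValue: !Sub '${pSourcesStackName}-CodeBuildRoleArn'
      Artifacts:
        Type: CODEPIPELINE
      Environment:
        Type: LINUX_CONTAINER
        ComputeType: BUILD_GENERAL1_SMALL
        Image: aws/codebuild/standard:5.0
        PrivilegedMode: true                       # required to run the Docker daemon
        EnvironmentVariables:
          - Name: ECR_REPO_URI
            Value: !Ref pEcrRepositoryUri
          - Name: DOCKERHUB_USER
            Type: SECRETS_MANAGER                  # resolved from Secrets Manager at build time
            Value: dockerhub/credentials:username
          - Name: DOCKERHUB_PASSWORD
            Type: SECRETS_MANAGER
            Value: dockerhub/credentials:password
      Source:
        Type: CODEPIPELINE
        BuildSpec: |
          version: 0.2
          phases:
            pre_build:
              commands:
                - IMAGE_TAG=$(echo "$CODEBUILD_BUILD_ID" | cut -d':' -f2)
                - echo "$DOCKERHUB_PASSWORD" | docker login --username "$DOCKERHUB_USER" --password-stdin
                - aws ecr get-login-password | docker login --username AWS --password-stdin "${ECR_REPO_URI%%/*}"
            build:
              commands:
                - docker build -t "$ECR_REPO_URI:$IMAGE_TAG" .
                - docker push "$ECR_REPO_URI:$IMAGE_TAG"
            post_build:
              commands:
                - printf '{"tag":"%s"}' "$IMAGE_TAG" > build.json   # consumed by the discovery stage
          artifacts:
            files:
              - build.json
```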

Build Discovery Stage

The same happens with the logic that checks the load balancer target group tags: we specify the CodeBuild project, which in this case fetches the scripts.zip object and produces an output artifact with the parameter values needed to update the Docker image on the specific service that is not in production.

Build Discovery stage

Now we will go through the AWS CodeBuild project definition used to discover the actual status of the traffic forwarding in the application load balancer. For each environment we define one CodeBuild project like the one below, because each uses a specific account role to execute the script in that account.

CodeBuild project definition for build discovery stage

In the Environment section we specify the image the build instance will run, in this case a Linux image with Python 3.6. We also specify variables, like the role of the account that we created previously, which the tool account assumes to obtain the permissions required for the script to invoke the necessary API calls on the target account.

The BuildSpec property defines the logic we need: the jq package is installed, which is needed to export new environment variables for the current session with the keys and session token of the role the script will assume. The rest of the environment variables are used by the script, which runs in CodeBuild.

Once we have upgraded Boto3, the AWS SDK for Python, we execute the script, which was passed as an input artifact in the CodePipeline stage definition that links to each project.
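
The role-assumption part of that buildspec can be sketched like this (the script and output file names are assumptions for illustration; the actual Python script is shown next in the article):

```yaml
# buildspec sketch for the discovery CodeBuild project
version: 0.2
phases:
  install:
    commands:
      - yum install -y jq                    # used to parse the assume-role JSON output
      - pip install --upgrade awscli boto3
  build:
    commands:
      # assume the cross-account role of the target environment and export its credentials
      - CREDS=$(aws sts assume-role --role-arn "$TARGET_ACCOUNT_ROLE_ARN" --role-session-name discovery)
      - export AWS_ACCESS_KEY_ID=$(echo "$CREDS" | jq -r '.Credentials.AccessKeyId')
      - export AWS_SECRET_ACCESS_KEY=$(echo "$CREDS" | jq -r '.Credentials.SecretAccessKey')
      - export AWS_SESSION_TOKEN=$(echo "$CREDS" | jq -r '.Credentials.SessionToken')
      # run the discovery script fetched from the scripts source artifact
      - python discovery.py
artifacts:
  files:
    - cf_inputs.json                         # parameter values consumed by the deploy stage
```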

Defining the Discovery Build Stage AWS Boto3 script

The following script produces an output JSON artifact (cf_inputs) with the unique tags that we will pass to the CloudFormation stack containing the Fargate services, in order to update the non-production service so that it launches a new container pulling the new image from the repository with this tag.

First it gets the build ID from the last execution of the build stage where the Docker image is built, and then it gets the tag from the artifact produced in that execution (/tmp/build.json). This tag will be passed to the target group associated with port 443 (production) or 8443 (non-production), as appropriate.

Discovery script I
Discovery Script II
Discovery Script III

For the above to work we need the project that builds the image beforehand. This is also an embedded CodeBuild project definition in the pipeline template, which we will describe hereafter.

To add more insight: above is the Python script executed in the second build stage, right after the Docker image is built and pushed to the registry. It uses Boto3, the AWS SDK for Python, and by invoking the proper API calls in each defined function it implements the logic that sets the proper tags on each application load balancer target group, the ones holding the green and the blue services. We get the Docker image tag pushed in the previous step by fetching the previous execution ID and inspecting the output artifact it produced, and then we read the current tags on the target groups so as to pass the newest tag to the proper target, the one that is not running the image in production, and the current tag to the production target group. This causes no update on the "blue" service, but the "green" service is updated by terminating its container and launching a new one with the new image.

This script will get session variables for each account, dev, stg or prod and gather the right tags to pass to the deploy step, as parameters of the CloudFormation stack that updates the target group tags and the AWS Fargate service.

Updating the green service in the AWS Fargate Cluster

The step of updating the CloudFormation stack in the dev, stg or prod account is very interesting.

Pipeline Deploy Stage

We specify CloudFormation as the provider; the action mode is CREATE_UPDATE, which means the stack is created if it doesn't exist and updated if it does. We provide the capabilities CloudFormation needs to work with AWS IAM identity resources, and also CAPABILITY_AUTO_EXPAND, which refers to the ability of the CloudFormation stack to create nested stacks, as we define them like any other resource.

The ParameterOverrides property allows CodePipeline to update the parameter values passed to the stack, which in this case are the corresponding Docker image tags for the blue and green services as per the logic executed before, the Docker repository name we passed in the parameters, and the S3 bucket containing the CloudFormation templates for the operation. Moreover, we pass the name of the network resources stack that was created previously in that account with a StackSet.

In order for the pipeline to know where to deploy this, we specify the role created in the specific account that the tool account CodePipeline role assumes, with permission to deploy CloudFormation stacks.
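
A hedged sketch of one such deploy action, using Fn::GetParam to read the discovery output artifact (stack, role, artifact, and parameter names are illustrative):

```yaml
        # fragment of the Stages list of the pipeline
        - Name: DeployDev
          Actions:
            - Name: UpdateFargateStack
              RunOrder: 1
              RoleArn: !Sub 'arn:${AWS::Partition}:iam::${pDevAccountId}:role/cross-account-role'
              InputArtifacts:
                - Name: TemplatesOutput
                - Name: DiscoveryOutputDev
              ActionTypeId:
                Category: Deploy
                Owner: AWS
                Provider: CloudFormation
                Version: '1'
              Configuration:
                ActionMode: CREATE_UPDATE
                StackName: fargate-cluster-dev
                Capabilities: CAPABILITY_NAMED_IAM,CAPABILITY_AUTO_EXPAND
                RoleArn: !Sub 'arn:${AWS::Partition}:iam::${pDevAccountId}:role/cloudformation-service-role'
                TemplatePath: TemplatesOutput::fargate-cluster.yaml
                ParameterOverrides: |
                  {
                    "pBlueImageTag": {"Fn::GetParam": ["DiscoveryOutputDev", "cf_inputs.json", "blue_tag"]},
                    "pGreenImageTag": {"Fn::GetParam": ["DiscoveryOutputDev", "cf_inputs.json", "green_tag"]}
                  }
```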

Defining the Manual Approval step before a Lambda function swaps the Application Load Balancer traffic.

This excerpt shows the Approval action; we specify the AWS Simple Notification Service topic that we defined previously in the Sources stack by importing its ARN, its unique identifier.

Approve Step
Step where lambda performs the swap

We can see the step to forward the traffic to the new AWS Fargate service container that runs the new Docker image.

We specify AWS Lambda as the provider, with the Lambda function defined in this template, passing the name of the application load balancer for the AWS SDK Python script and the specific account role that allows the pipeline to execute the Lambda function in a cross-account manner.
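
The approval and invoke actions could be sketched roughly like this (function name, user parameters, and export names are illustrative):

```yaml
        # fragment of the Stages list of the pipeline
        - Name: ApproveAndSwapProd
          Actions:
            - Name: ManualApproval
              RunOrder: 1
              ActionTypeId:
                Category: Approval
                Owner: AWS
                Provider: Manual
                Version: '1'
              Configuration:
                NotificationArn:
                  Fn::ImportValue: !Sub '${pSourcesStackName}-ApprovalTopicArn'
                CustomData: Approve to swap the blue/green target groups in production
            - Name: SwapTargetGroups
              RunOrder: 2
              ActionTypeId:
                Category: Invoke
                Owner: AWS
                Provider: Lambda
                Version: '1'
              Configuration:
                FunctionName: !Ref rSwapFunction
                UserParameters: !Sub '{"alb": "${pAlbName}", "role": "arn:${AWS::Partition}:iam::${pProdAccountId}:role/lambda-cross-account-role"}'
```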

One of the scripts that we have uploaded to our source AWS S3 bucket is the one that AWS Lambda executes once the approval has been given, when the workflow requests it after the deploy stage has run.

AWS Lambda executes this Python script through its handler method. The handler gets some information from the AWS Lambda event: the event object argument is a JSON dictionary that contains the information we are going to use in our logic, namely details about the pipeline job and inputs like the application load balancer name and the role to assume in order to start a new AWS session within the Lambda function.

Then, in the method that swaps the target groups, we use the AWS Boto3 SDK to invoke the proper API calls to describe the application load balancer listeners and each rule for each target group, and we modify the traffic destination, that is, the target group of each rule.

Finally, we set a Boolean value for the tag that tells us if the target group is in production or not and we send a failure or success signal back to the AWS CodePipeline pipeline.

Lambda Function Code I
Lambda Function Code II
Lambda Function Code III
Lambda Function Code IV
Lambda Function Code V

Defining and deploying the AWS Elastic Container Fargate Cluster

It should be noted that when we deploy the solution, we won't run the pipeline until the container cluster is up and running.

For this we will first of all need to clone the AWS CodeCommit Git repo with a user that belongs to the devs group, in order to be allowed to push a first version of our Docker image.

Images in ECR

Once we have pushed our Dockerfile to the repo, we will create the CloudFormation stack that creates an AWS Fargate cluster and at the same time launches two separate nested stacks, each running one Fargate service.

Nested stacks allow you to link a template within another template by defining stack-type resources in the parent one. Nested templates are uploaded to S3, and you specify the resource type AWS::CloudFormation::Stack with the S3 URL of the nested template in the TemplateURL property.

How do they work? As per the CloudFormation documentation, CloudFormation treats the nested stack as a resource of the main stack: if you update or delete the main stack, CloudFormation updates or deletes the nested stack. Also, if for example you create two main stacks that both include a nested stack, each nested stack is independent, and you can customize each one with input parameters.

Below we can find the definition of each nested stack in the parent template, in this case both the green and the blue services:

Green Service Nested Stack definition
Blue Service Nested stack

You can notice that the difference is in the ports, 443 being production and 8443 non-production, so the non-production traffic is forwarded to 8443; this is what changes every time the Lambda function swaps the target group rules. The pTag property acts as a parameter for the nested stack, like all the properties defined on the nested stack resource. This parameter contains the tag that was set on the image during the build and causes only the non-production service to be updated, launching a container with the new image and terminating the old one.
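
For reference, one of these nested stack resources can be sketched as follows; pTag follows the description above, while the remaining names are illustrative:

```yaml
  rGreenServiceStack:
    Type: AWS::CloudFormation::Stack
    Properties:
      # nested template previously uploaded to the S3 sources bucket
      TemplateURL: !Sub 'https://${pTemplateBucket}.s3.amazonaws.com/fargate-service.yaml'
      Parameters:
        pTag: !Ref pGreenImageTag          # image tag coming from the discovery stage
        pListenerPort: '8443'              # non-production listener at deployment time
        pClusterName: !Ref rFargateCluster
        pNetworkStackName: !Ref pNetworkStackName
```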

We also define an encrypted log group to keep track of all ongoing events in the cluster, and two Elastic File System mount points, so the Docker containers can read and write to the same highly available file system. The file system is also encrypted with the key we defined at the beginning.

Fargate Log Group definition
EFS Definition

In the nested stacks we have the following resources:

Two listeners: the HTTP one redirects to HTTPS by default, and the HTTPS one uses the FindInMap intrinsic function to get the certificate identifiers that correspond to each domain; these are part of my own registered domains, stg.pabloperfer.net and prod.pabloperfer.net.

Listener definition

We also define a rule to forward traffic to each target group

Listener rule definition

The target group itself, attached to the application load balancer previously deployed by the other stacks.

The health check protocol is HTTP because the containers are exposed on port 80, and, since the redirection to 443 is implemented by default, the decryption of the connection (TLS termination) is carried out by the load balancer itself.

Target Group Definition
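
In outline, the target group and listener rule in each nested stack can look like this (priority, path pattern, and logical names are illustrative):

```yaml
  rTargetGroup:
    Type: AWS::ElasticLoadBalancingV2::TargetGroup
    Properties:
      TargetType: ip                      # required for Fargate (awsvpc networking)
      Protocol: HTTP
      Port: 80                            # containers listen on 80; TLS ends at the ALB
      VpcId:
        Fn::ImportValue: !Sub '${pNetworkStackName}-VpcId'
      HealthCheckProtocol: HTTP
      HealthCheckPath: /
  rListenerRule:
    Type: AWS::ElasticLoadBalancingV2::ListenerRule
    Properties:
      ListenerArn: !Ref rHttpsListener    # the 443 or 8443 listener defined in the same stack
      Priority: 1
      Conditions:
        - Field: path-pattern
          Values: ['/*']
      Actions:
        - Type: forward
          TargetGroupArn: !Ref rTargetGroup
```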

Then we define the Elastic Container Service service with the LaunchType property set to FARGATE, because we want it serverless; we specify the target group with which the service is registered and the private subnets we defined in the first stack.

The service guarantees that we will always have the desired number of tasks running and implements health checks, launching a new task if needed.

Elastic Container Service definition

Where do we define the container? In an Elastic Container Service task definition. Each AWS Fargate service can run different tasks, and in the task definition we define the volume to mount on the container, the port exposed by the container, the role used by the container itself, the log driver, and resources like CPU and memory.

Elastic Container Task definition
Fargate cluster stack deployed resources

To minimize management of the infrastructure we will set up simple auto scaling based on the Elastic Container Service CPU usage: in this case, whenever the containers use more than 50 percent of CPU, a new container is started, up to a maximum of 10. We set one policy for the blue service and another for the green service. Whenever the number of visits to our website increases, the load balancer scales automatically and so does the number of containers, driven by CPU usage.

Autoscaling
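
A minimal sketch of that scaling setup, expressed here as a target-tracking policy (which creates the CloudWatch alarms for you; the original may instead define explicit alarms with a step policy). Logical names are illustrative, and in the article the service itself lives in a nested stack:

```yaml
  rServiceScalableTarget:
    Type: AWS::ApplicationAutoScaling::ScalableTarget
    Properties:
      ServiceNamespace: ecs
      ScalableDimension: ecs:service:DesiredCount
      ResourceId: !Sub 'service/${rFargateCluster}/${rGreenService.Name}'
      MinCapacity: 1
      MaxCapacity: 10                      # never more than 10 tasks
  rServiceCpuScalingPolicy:
    Type: AWS::ApplicationAutoScaling::ScalingPolicy
    Properties:
      PolicyName: cpu-target-tracking
      PolicyType: TargetTrackingScaling
      ScalingTargetId: !Ref rServiceScalableTarget
      TargetTrackingScalingPolicyConfiguration:
        PredefinedMetricSpecification:
          PredefinedMetricType: ECSServiceAverageCPUUtilization
        TargetValue: 50                    # add tasks when average CPU exceeds 50 percent
        ScaleInCooldown: 60
        ScaleOutCooldown: 60
```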

Testing

Once the Docker image definition has been modified and pushed to the Git repo, the pipeline is triggered, and in the first stage the CodeBuild logs show how the building, tagging and pushing of the image is automated.

CodeBuild Image logs

Then comes the second build phase, where we check the target group tags to see what the image ID is and which target group is in production, so that only the non-production target group and the AWS Fargate service linked to it are updated with CloudFormation.

CodeBuild Target Group Tags logs

During the deploy stage, each account has an AWS Fargate cluster stack deployed with two nested stack resources, each containing an AWS Fargate service with a Docker container task. Based on the current tag of the load balancer target group, the service that is not in production at that moment gets the new Docker image running, as it receives a different parameter that triggers the update on it.

Nested Stacks with a service each

Before approving the swap of the load balancer target groups, we can see an example of the load balancer and the rules that redirect the traffic to the HTTPS port, as well as the two ports used, 8443 and 443, one for the blue and the other for the green service. And the most important part: the target groups associated with the ports the containers expose.

Application Load Balancer Listeners
Application load balancer target groups

When we approve the notification sent to our email, the Lambda function swaps the target groups, and the ALB port 443 redirects traffic to the target group whose Fargate service task container runs the latest version of the Docker image.

The screenshot below illustrates the tags we use to implement the logic that indicates which target group connected to a specific Fargate Service Task Container is currently acting as production and with which image tag.

The target group swap is achieved by modifying the ALB listener rule

Listener Rule

Therefore, the rule for the application load balancer has changed.

Listener Rule after Swap
