Automating Blue/Green Deployments for ECS Fargate with AWS CodeDeploy

Avni Patel
Published in disney-streaming
6 min read · Mar 25, 2021

The Engagement and Messaging team at Disney+ is responsible for engaging with subscribers through email, push, and in-app messages. The team also handles the messaging associated with the Disney+ co-viewing feature, GroupWatch.

For background, GroupWatch is a feature that lets subscribers co-view any series or movie on Disney+ with their family and friends. GroupWatch messages were originally handled with AWS Lambda, but we needed to minimize cold-start latency to keep everyone in a group in sync. To overcome the cold starts, we wanted to migrate some of the functionality to AWS ECS Fargate.

When switching GroupWatch to ECS Fargate, we needed to reproduce the deployment strategy already in place for our existing Lambda functions so that future iterations of the application would be deployed safely.

We wanted:

  • Automation with CloudFormation
  • A Blue/Green deployment model
  • The ability to run tests against production traffic
  • Automatic rollback on failure

To achieve these goals, our team decided to use AWS CodeDeploy: it met the criteria, and we were already using it with AWS Lambda. However, implementing CodeDeploy for Fargate in CloudFormation was not as straightforward as we expected, so we needed to devise a solution.

Research

The first step was to see what solutions were available in CloudFormation. AWS provides a CloudFormation hook for ECS Blue/Green deployments, but it only supports non-nested stacks, so it could not work with our current CloudFormation stack. It also creates an External deployment instead of a Blue/Green deployment type, which meant we would lose some of the benefits of CodeDeploy.

There were a few other solutions and discussions about CodeDeploy for ECS with CloudFormation. Some use CloudFormation to build out some of the CodeDeploy resources and a separate tool to automate the actual deployment. Though this could work, it meant adding a step to our current CI/CD pipeline to support the additional tool, as well as a step to roll back the CloudFormation stack to the old stack if the CodeDeploy deployment failed. We wanted to avoid having two separate resource management tools.

The next step was to manually create a CodeDeploy deployment to see what resources were required, then build out as many of them as possible in CloudFormation. The following resources were required (a minimal sketch of the core configuration follows the list):

Required CodeDeploy resources
  • An ECS Service with the Deployment Controller type set to CODE_DEPLOY: setting the service's deployment controller to CODE_DEPLOY allows new deployments to be handled by CodeDeploy.
  • A CodeDeploy Application: CodeDeploy uses applications to keep track of deployments and deployment groups.
  • A Blue Target Group and a Listener: in production, a listener on the load balancer routes traffic to the active (blue) target group, which points to the production tasks.
  • A Green Target Group and a Listener: when a deployment occurs, the new tasks come up in the passive (green) target group, and a test listener on a separate port routes traffic to that target group.
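
For reference, here is a minimal boto3 sketch of the core configuration these resources express. The names, ports, and ARNs are illustrative placeholders; in our setup these resources are declared in CloudFormation rather than created through the API:

import boto3

ecs = boto3.client('ecs')
cd = boto3.client('codedeploy')

# Illustrative placeholder; the blue/green target groups and their listeners
# are ordinary ELBv2 resources on the load balancer.
blue_target_group_arn = 'arn:aws:elasticloadbalancing:...:targetgroup/blue/...'

# ECS service that hands its deployments off to CodeDeploy.
ecs.create_service(
    cluster='groupwatch-cluster',
    serviceName='groupwatch-service',
    taskDefinition='groupwatch-task',
    desiredCount=2,
    launchType='FARGATE',
    deploymentController={'type': 'CODE_DEPLOY'},
    loadBalancers=[{
        'targetGroupArn': blue_target_group_arn,
        'containerName': 'groupwatch',
        'containerPort': 8080
    }],
    networkConfiguration={
        'awsvpcConfiguration': {
            'subnets': ['subnet-...'],
            'securityGroups': ['sg-...']
        }
    }
)

# CodeDeploy application that tracks the deployment groups and deployments.
cd.create_application(
    applicationName='groupwatch-codedeploy',
    computePlatform='ECS'
)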

The last part to automate was the deployment group and the actual deployment. Blue/Green deployments for ECS are not currently supported by the CloudFormation Deployment Group and Deployment resources, so we couldn't simply declare them in CloudFormation.

Attempt 1: Automating Deployments and Deployment Groups with AWS Lambda

We first attempted to kick off deployments with an AWS Lambda function. The Lambda would run on every stack update, create the deployment, and report the deployment's status. If an error occurred, it would roll back the entire stack.

At first, this solution seemed to work: we were able to run the CodeDeploy deployment and roll back if it failed. However, we found two issues. First, scaling tasks during a deployment could cause the Lambda function to time out and roll back the stack. Second, the deployment was not aware of the state of the rest of the stack. When an update occurs, the Lambda function kicks off the deployment while other resources update simultaneously. If an error occurs while updating a different resource, the deployment could complete successfully while the rest of the stack rolls back.

Even though the Lambdas were part of the CloudFormation stack, the deployment itself was still independent of the actual stack.
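
For illustration, here is a simplified sketch of that first attempt — a Lambda invoked on every stack update. The event fields and names are assumptions, not our exact code:

import boto3

cd = boto3.client('codedeploy')

def handler(event, context):
    # Kick off the Blue/Green deployment for the new task definition.
    deployment = cd.create_deployment(
        applicationName=event['ApplicationName'],
        deploymentGroupName=event['DeploymentGroupName'],
        revision={
            'revisionType': 'AppSpecContent',
            'appSpecContent': {'content': event['AppSpecContent']}
        }
    )

    # Block until CodeDeploy reports success or failure. Waiting synchronously
    # inside the Lambda is what hit the timeout when task scaling slowed the
    # deployment down, and the deployment stays unaware of how the rest of the
    # stack update is going.
    waiter = cd.get_waiter('deployment_successful')
    waiter.wait(deploymentId=deployment['deploymentId'])
    return deployment['deploymentId']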

Attempt 2: Automating Deployments and Deployment Groups with Custom Resources

Next, we tried custom resources. We created two custom resources, one for the deployment and one for the deployment group. The custom resources use Lambda functions to perform stack actions.

Using the crhelper library, we were able to handle the custom resource lifecycle events. The deployment group Lambda creates a deployment group if the resource does not exist, updates it when the resource's properties change, and deletes it when a delete occurs.
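
The handler snippets below leave out their module-level boilerplate. Both custom resource Lambdas assume setup roughly like the following sketch (the logger setup, CfnResource options, and entry point name are assumptions based on crhelper's typical usage):

import json
import logging

import boto3
from crhelper import CfnResource

logger = logging.getLogger(__name__)

# crhelper wraps the custom resource request/response protocol and, for the
# deployment Lambda, schedules the @helper.poll_update polling shown later.
helper = CfnResource(json_logging=False, log_level='INFO')

cd = boto3.client('codedeploy')


def handler(event, context):
    # Lambda entry point: crhelper dispatches to the registered
    # @helper.create / @helper.update / @helper.delete functions.
    helper(event, context)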

@helper.update
def update(event, context):
    logger.info("Got Update: " + json.dumps(event))

    print(json.dumps(event))

    application_name = event['ResourceProperties']['ApplicationName']
    service_role_arn = event['ResourceProperties']['ServiceRoleArn']
    cluster_name = event['ResourceProperties']['ClusterName']
    service_name = event['ResourceProperties']['ServiceName']
    tg1_name = event['ResourceProperties']['TG1Name']
    tg2_name = event['ResourceProperties']['TG2Name']
    listener_arn = event['ResourceProperties']['ListenerArn']

    deployment_group_name = event['ResourceProperties']['GroupName']
    deployment_style = event['ResourceProperties'].get(
        'DeploymentStyle', 'BLUE_GREEN')

    response = cd.update_deployment_group(
        applicationName=application_name,
        deploymentGroupName=deployment_group_name,
        serviceRoleArn=service_role_arn,
        autoRollbackConfiguration={
            'enabled': True,
            'events': ['DEPLOYMENT_FAILURE']
        },
        deploymentStyle={
            'deploymentType': deployment_style,
            'deploymentOption': 'WITH_TRAFFIC_CONTROL'
        },
        blueGreenDeploymentConfiguration={
            "terminateBlueInstancesOnDeploymentSuccess": {
                "action": "TERMINATE",
                "terminationWaitTimeInMinutes": 0
            },
            "deploymentReadyOption": {
                "actionOnTimeout": "CONTINUE_DEPLOYMENT",
                "waitTimeInMinutes": 0
            }
        },
        loadBalancerInfo={
            "targetGroupPairInfoList": [
                {
                    "targetGroups": [
                        {"name": tg1_name},
                        {"name": tg2_name}
                    ],
                    "prodTrafficRoute": {
                        "listenerArns": [listener_arn]
                    }
                }
            ]
        },
        ecsServices=[
            {
                "serviceName": service_name,
                "clusterName": cluster_name
            }
        ]
    )

    print(response)
    helper.Data.update({"Name": deployment_group_name})
    cd_group_id = response['deploymentGroupId']
    return cd_group_id

@helper.create
def create(event, context):
    logger.info("Got Create: " + json.dumps(event))

    application_name = event['ResourceProperties']['ApplicationName']
    service_role_arn = event['ResourceProperties']['ServiceRoleArn']
    cluster_name = event['ResourceProperties']['ClusterName']
    service_name = event['ResourceProperties']['ServiceName']
    tg1_name = event['ResourceProperties']['TG1Name']
    tg2_name = event['ResourceProperties']['TG2Name']
    listener_arn = event['ResourceProperties']['ListenerArn']
    test_listener_arn = event['ResourceProperties']['TestListenerArn']
    deployment_group_name = event['ResourceProperties']['DeploymentGroupName']
    deployment_style = event['ResourceProperties'].get(
        'DeploymentStyle', 'BLUE_GREEN')

    response = cd.create_deployment_group(
        applicationName=application_name,
        deploymentGroupName=deployment_group_name,
        serviceRoleArn=service_role_arn,
        autoRollbackConfiguration={
            'enabled': True,
            'events': ['DEPLOYMENT_FAILURE']
        },
        deploymentStyle={
            'deploymentType': deployment_style,
            'deploymentOption': 'WITH_TRAFFIC_CONTROL'
        },
        blueGreenDeploymentConfiguration={
            "terminateBlueInstancesOnDeploymentSuccess": {
                "action": "TERMINATE",
                "terminationWaitTimeInMinutes": 0
            },
            "deploymentReadyOption": {
                "actionOnTimeout": "CONTINUE_DEPLOYMENT",
                "waitTimeInMinutes": 0
            }
        },
        loadBalancerInfo={
            "targetGroupPairInfoList": [
                {
                    "targetGroups": [
                        {"name": tg1_name},
                        {"name": tg2_name}
                    ],
                    "prodTrafficRoute": {
                        "listenerArns": [listener_arn]
                    },
                    "testTrafficRoute": {
                        "listenerArns": [test_listener_arn]
                    }
                }
            ]
        },
        ecsServices=[
            {
                "serviceName": service_name,
                "clusterName": cluster_name
            }
        ]
    )
    print(response)
    helper.Data.update({"Name": deployment_group_name})
    cd_group_id = response['deploymentGroupId']
    return cd_group_id


@helper.delete
def delete(event, context):
    logger.info("Got Delete: " + json.dumps(event))
    try:
        application_name = event['ResourceProperties']['ApplicationName']
        deployment_group_name = event['ResourceProperties']['GroupName']
        response = cd.delete_deployment_group(
            applicationName=application_name,
            deploymentGroupName=deployment_group_name
        )
        print(response)
    except Exception as e:
        print(str(e))

The deployment custom resource creates a new CodeDeploy deployment for each stack update. Because a deployment can take a while to complete, crhelper's poll update is used to report its status back to CloudFormation. No actions happen during create or delete: when an ECS service is first created, its tasks do not go through a CodeDeploy deployment, so a create action is not required.

Note: rollbacks do not use CodeDeploy's automatic rollback feature. Instead, if the deployment fails, the stack rolls back and a new deployment is kicked off with the previous task definition. If another resource causes the stack to roll back, a new deployment is likewise kicked off to return to the previous task definition.

@helper.update
def update(event, context):
    logger.info("Got Update: " + json.dumps(event))

    application_name = event['ResourceProperties']['ApplicationName']
    deployment_group_name = event['ResourceProperties']['DeploymentGroupName']
    task_definition = event['ResourceProperties']['TaskDefinition']
    container_name = event['ResourceProperties']['ContainerName']
    after_allow_test_traffic = event['ResourceProperties']['AfterAllowTestTraffic']
    safe_deploy = event['ResourceProperties']['SafeDeploy']

    dg = cd.get_deployment_group(
        applicationName=application_name,
        deploymentGroupName=deployment_group_name
    )

    dgStatus = dg['deploymentGroupInfo'].get('lastSuccessfulDeployment', {}).get('status', '')
    if dgStatus == 'Succeeded' and safe_deploy:
        inp = f"""\
version: 0.0
Resources:
  - TargetService:
      Type: AWS::ECS::Service
      Properties:
        TaskDefinition: {task_definition}
        LoadBalancerInfo:
          ContainerName: {container_name}
          ContainerPort: 8080
Hooks:
  - AfterAllowTestTraffic: {after_allow_test_traffic}
"""
    else:
        inp = f"""\
version: 0.0
Resources:
  - TargetService:
      Type: AWS::ECS::Service
      Properties:
        TaskDefinition: {task_definition}
        LoadBalancerInfo:
          ContainerName: {container_name}
          ContainerPort: 8080
"""

    deployment_created = cd.create_deployment(
        applicationName=application_name,
        deploymentGroupName=deployment_group_name,
        revision={
            'revisionType': 'AppSpecContent',
            'appSpecContent': {
                'content': inp,
            }
        },
        ignoreApplicationStopFailures=False,
        autoRollbackConfiguration={
            'enabled': False
        }
    )
    deployment_id = deployment_created['deploymentId']
    helper.Data.update({"deploymentId": deployment_id})
    return deployment_id

@helper.poll_update
def poll_update(event, context):
    logger.info("Got Update Poll: " + json.dumps(event))
    deployment_id = event['CrHelperData']['deploymentId']
    deployment = cd.get_deployment(deploymentId=deployment_id)
    status = deployment['deploymentInfo']['status']
    if status == 'Succeeded':
        return deployment_id
    elif status == 'Failed':
        raise Exception("failed to complete deployment")
    else:
        return False

Conclusion

We were able to successfully use AWS CodeDeploy to manage our ECS Blue/Green deployments from our existing CloudFormation stack. The current native CloudFormation solution did not fit our use case because its prerequisites were not compatible with our application. We went through a few iterations of automation before landing on a solution that leverages AWS CloudFormation custom resources. We were also able to pull the custom resources out into their own repository so they can be reused by future ECS projects.
