Simplifying Docker Deployments on AWS EC2 Instances: A GitHub Actions and AWS SSM Approach

Utkarsh Singh
10 min read · Feb 18, 2024


In my previous blog, we delved into the intricacies of updating Docker containers via CloudFormation helper scripts, namely cfn-init and cfn-hup. However, amidst the discussion of streamlined deployments, one critical aspect remained unaddressed: the absence of a robust rollback strategy.

Consider this scenario: what happens if, during deployment, the Docker run or pull command encounters a failure? Or worse, if the Docker rm command fails unexpectedly? In such instances, the absence of a rollback strategy leaves our deployment pipeline vulnerable and prone to disruptions.

Streamlining Docker Deployments with GitHub Actions and AWS SSM

AWS Systems Manager (SSM) commands enable remote execution of tasks on EC2 instances, while GitHub Actions automate deployment workflows triggered by events. By integrating SSM commands within GitHub Actions, we can execute deployment tasks and handle failures effectively, ensuring smoother deployments and enhanced system reliability in AWS environments.

What is AWS SSM?

AWS Systems Manager (SSM) commands serve as a powerful tool for managing and executing tasks on EC2 instances without requiring direct access. With SSM, users can remotely run scripts, install software, and perform system updates across fleets of instances, all from a centralized management console. This eliminates the need for manual intervention and simplifies operational tasks, enhancing efficiency and security in cloud environments. SSM also offers features like Parameter Store for securely storing configuration data and Session Manager for interactive shell access to instances without exposing SSH/RDP ports. Overall, SSM commands streamline infrastructure management and enable robust automation capabilities in AWS environments.

1. Sending a Command:

To send a command to EC2 instances, we use the aws ssm send-command CLI command. Let’s say you want to list all running Docker containers on an EC2 instance. You can simply send the docker ps command using the aws ssm send-command CLI command. This command tells SSM to execute docker ps on the specified instance:

aws ssm send-command \
  --instance-ids ${YOUR_INSTANCE_ID} \
  --document-name "AWS-RunShellScript" \
  --comment "List Docker containers" \
  --parameters 'commands=["docker ps"]' \
  --output text

2. Waiting for Command Execution:

We can wait for the command execution to complete using the aws ssm wait command-executed command:

aws ssm wait command-executed \
  --command-id ${COMMAND_ID} \
  --instance-id ${YOUR_INSTANCE_ID}

3. Retrieving Command Output:

To retrieve the output of the command, we use the aws ssm get-command-invocation command:

aws ssm get-command-invocation \
  --command-id ${COMMAND_ID} \
  --instance-id ${YOUR_INSTANCE_ID}
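Taken together, the three commands above form a simple remote-execution helper: send, wait, then read the result. A minimal bash sketch of that pattern (the instance ID and command in the usage line are illustrative, and error handling is kept to the minimum):

```shell
#!/usr/bin/env bash
# Sketch: run a shell command on an EC2 instance via SSM and
# print the final invocation status ("Success", "Failed", ...).
run_remote() {
  local instance_id="$1" remote_cmd="$2" command_id
  command_id=$(aws ssm send-command \
    --instance-ids "$instance_id" \
    --document-name "AWS-RunShellScript" \
    --parameters "commands=[\"$remote_cmd\"]" \
    --query 'Command.CommandId' --output text)
  # The waiter exits non-zero when the remote command fails; suppress
  # that so we can still read the detailed status afterwards.
  aws ssm wait command-executed \
    --command-id "$command_id" --instance-id "$instance_id" || true
  aws ssm get-command-invocation \
    --command-id "$command_id" --instance-id "$instance_id" \
    --query 'Status' --output text
}
```

Usage would look like `status=$(run_remote i-0123456789abcdef0 "docker ps")`, after which `$status` can be branched on.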

Implementation Steps:

1. Utilizing Docker Image Tagged with GITHUB_NUMBER

jobs:
  deploy:
    name: Deploy
    runs-on: ubuntu-latest

    env:
      ECR_IMAGE_TO_DEPLOY: ""
      YOUR_ECR_REPOSITORY_NAME: "test-application"

    steps:
      - name: Setting and updating Base Environment Variables
        env:
          # assumes an earlier aws-actions/amazon-ecr-login step with id: login-ecr
          ECR_REGISTRY: ${{ steps.login-ecr.outputs.registry }}
        run: |
          echo "Create ECR image with GitHub run number as tag"
          echo "ECR_IMAGE_TO_DEPLOY=$ECR_REGISTRY/$YOUR_ECR_REPOSITORY_NAME:$GITHUB_RUN_NUMBER" >> $GITHUB_ENV
          echo "ECR_LATEST_IMAGE=$ECR_REGISTRY/$YOUR_ECR_REPOSITORY_NAME:latest" >> $GITHUB_ENV

      - name: Build and Push to ECR
        run: |
          echo "Building ECR image tagged with GitHub run number..."
          docker build -t $ECR_IMAGE_TO_DEPLOY .
          echo "Tagging Docker image as latest as well..."
          docker tag $ECR_IMAGE_TO_DEPLOY $ECR_LATEST_IMAGE
          echo "Pushing Docker image to ECR..."
          docker push $ECR_IMAGE_TO_DEPLOY
          docker push $ECR_LATEST_IMAGE

Explanation:

  • Setting and updating Base Environment Variables: This step sets up the base environment variables required for the deployment process. It retrieves the ECR registry URL and concatenates it with the GitHub run number to create a unique ECR image tag. This ensures that each deployment gets a distinct version identifier based on the GitHub run number.
  • Build and Push to ECR: Here, the Docker image is locally built using the repository’s Dockerfile. It’s then tagged with a unique ECR image tag from earlier steps and pushed to Amazon ECR. This process deploys the latest application version to ECR with a GitHub run number-based tag.

Using the GitHub run number to version the builds allows for easy rollback to previous versions when needed. Each deployment generates a unique version identifier, making it simple to track and revert to specific versions in case of issues or changes. This versioning strategy enhances the reliability and maintainability of the deployment process.
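As a concrete illustration, the versioned image reference is plain string concatenation, and rolling back simply means deploying an image tagged with an earlier run number. The registry and run number below are made-up placeholder values; in the workflow they come from the ECR login step and the GitHub Actions runtime:

```shell
#!/usr/bin/env bash
# Placeholder values standing in for the real workflow context.
ECR_REGISTRY="123456789012.dkr.ecr.eu-west-1.amazonaws.com"
YOUR_ECR_REPOSITORY_NAME="test-application"
GITHUB_RUN_NUMBER="42"

# Image pushed by this run:
ECR_IMAGE_TO_DEPLOY="$ECR_REGISTRY/$YOUR_ECR_REPOSITORY_NAME:$GITHUB_RUN_NUMBER"

# A rollback target is just an earlier tag (previous run shown for illustration):
PREVIOUS_IMAGE="$ECR_REGISTRY/$YOUR_ECR_REPOSITORY_NAME:$((GITHUB_RUN_NUMBER - 1))"

echo "$ECR_IMAGE_TO_DEPLOY"
echo "$PREVIOUS_IMAGE"
```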

2. Execute Docker Commands on AWS Instance:

In this step, we orchestrate the execution of Docker commands on the EC2 instance to deploy the application, using GitHub Actions and AWS Systems Manager (SSM) commands.

- name: Execute Docker Commands on AWS Instance to deploy the application
  run: |
    command_id=$(aws ssm send-command \
      --instance-ids ${{ env.INSTANCE_ID }} \
      --document-name "AWS-RunShellScript" \
      --comment "Executing Docker operations" \
      --parameters commands='[
        "set -e",
        "trap '\''echo DOCKER_STOP_FAILURE 1>&2'\'' ERR; if docker ps --format '\''{{.Names}}'\'' | grep -q ${{ env.CONTAINER_NAME }}; then docker stop ${{ env.CONTAINER_NAME }}; fi",
        "trap '\''echo DOCKER_RENAME_FAILURE 1>&2'\'' ERR; if docker ps -a --format '\''{{.Names}}'\'' | grep -q ${{ env.CONTAINER_NAME }}; then docker rename ${{ env.CONTAINER_NAME }} ${{ env.CONTAINER_NAME }}-${GITHUB_RUN_NUMBER}; fi",
        "trap '\''echo DOCKER_LOGIN_FAILURE 1>&2'\'' ERR; aws ecr get-login-password --region eu-west-1 | docker login --username AWS --password-stdin ${{ env.ECR_IMAGE_TO_DEPLOY }}",
        "trap '\''echo DOCKER_PULL_FAILURE 1>&2'\'' ERR; docker pull ${{ env.ECR_IMAGE_TO_DEPLOY }}",
        "trap '\''echo DOCKER_RUN_FAILURE 1>&2'\'' ERR; docker run -d -p 80:80 --name ${{ env.CONTAINER_NAME }} ${{ env.ECR_IMAGE_TO_DEPLOY }}"
      ]' --query 'Command.CommandId' --output text)

    echo "command_id=$command_id" >> $GITHUB_ENV

Explanation:

  • DOCKER_STOP_FAILURE trap: This trap catches errors from the docker stop command. If a container named ${{ env.CONTAINER_NAME }} is running, it is stopped before the new version is deployed; should the stop fail, the trap writes DOCKER_STOP_FAILURE to stderr so the workflow can detect it later.
  • DOCKER_RENAME_FAILURE trap: Before the new container is started, the existing container is renamed to ${{ env.CONTAINER_NAME }}-${GITHUB_RUN_NUMBER}. This renaming serves as a backup mechanism, preserving the existing container's state and configuration: if a subsequent operation such as pull or run fails, the renamed container can be restored for recovery.
  • DOCKER_LOGIN_FAILURE, DOCKER_PULL_FAILURE, DOCKER_RUN_FAILURE traps: These traps handle errors in the Docker login, image pull, and container run operations, respectively. Each writes a distinct marker to stderr so that failures in these critical steps can be detected and handled appropriately in later workflow steps.

3. Retrieve and Store SSM Command Results:

After sending the SSM command, the workflow waits for the command to be executed on the EC2 instance. The aws ssm wait command-executed command is used to monitor the execution status of the SSM command. Once the command execution is complete, the output and status of the command invocation are retrieved using the aws ssm get-command-invocation command.

- name: Retrieve and Store SSM Command Results
  run: |
    aws ssm wait command-executed \
      --command-id ${{ env.command_id }} \
      --instance-id ${{ env.INSTANCE_ID }} || true
    ssm_command_output=$(aws ssm get-command-invocation \
      --command-id ${{ env.command_id }} \
      --instance-id ${{ env.INSTANCE_ID }})

    standard_error_content=$(echo "$ssm_command_output" | jq -r '.StandardErrorContent')
    standard_error_content=${standard_error_content//$'\n'/' '}
    status=$(echo "$ssm_command_output" | jq -r '.Status')

    echo "status=$status" >> $GITHUB_ENV
    echo "standard_error_content=$standard_error_content" >> $GITHUB_ENV

This step ensures that the workflow proceeds only after the Docker operations on the EC2 instance have been successfully executed or handled with appropriate error recovery mechanisms.
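One detail worth noting: values written to $GITHUB_ENV with the simple name=value syntax must be single-line, which is why the step flattens newlines in StandardErrorContent before exporting it. The substitution can be checked in isolation (the error text below is a made-up sample):

```shell
#!/usr/bin/env bash
# Multi-line stderr, as it might come back from get-command-invocation:
standard_error_content=$'DOCKER_PULL_FAILURE\nmanifest unknown'

# Replace every newline with a space so the value fits the
# single-line name=value syntax expected by $GITHUB_ENV:
standard_error_content=${standard_error_content//$'\n'/ }
echo "$standard_error_content"
```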

4. Check Docker Command Success and Remove Old Container:

In this step, we handle the aftermath of Docker command execution, ensuring the smooth progression of our deployment pipeline. Let’s delve into the details:

- name: Check Docker Command Success and Remove Old Container
  id: handle_docker_results
  run: |
    if [[ "${{ env.status }}" == "Success" ]]; then
      echo "Commands completed successfully."
      echo "Deleting renamed old container"
      removeOldContainerCommandId=$(aws ssm send-command \
        --instance-ids ${{ env.INSTANCE_ID }} \
        --document-name "AWS-RunShellScript" \
        --comment "Remove renamed old container" \
        --parameters commands='[
          "set -e",
          "trap '\''echo DOCKER_RM_FAILURE 1>&2'\'' ERR; docker rm ${{ env.CONTAINER_NAME }}-${GITHUB_RUN_NUMBER}"
        ]' --query 'Command.CommandId' --output text)
      echo "Waiting for removal of renamed old container to complete..."
      aws ssm wait command-executed \
        --command-id "$removeOldContainerCommandId" \
        --instance-id ${{ env.INSTANCE_ID }} || true

      removeContainerInvocation=$(aws ssm get-command-invocation \
        --command-id "$removeOldContainerCommandId" \
        --instance-id ${{ env.INSTANCE_ID }})
      removeContainerInvocationStatus=$(echo "$removeContainerInvocation" | jq -r '.Status')

      if [[ $removeContainerInvocationStatus == "Success" ]]; then
        echo "Removal of renamed old container completed successfully."
      else
        error_content=$(echo "$removeContainerInvocation" | jq -r '.StandardErrorContent')
        echo "Container removal failed with error: ${error_content}"
      fi
    fi
  • Evaluation of Command Success: We kick off by checking the outcome of the Docker commands executed in the preceding steps, reading the status value the previous step stored in $GITHUB_ENV to determine whether the Docker commands concluded successfully.
  • Removal of Renamed Old Container: Upon confirming the success of the Docker commands, our next course of action involves the removal of the previously renamed old container. This container was renamed during deployment as a safety measure. Now, it’s not needed anymore, so we will remove it to keep the system clean.
  • Execution of Removal Command via SSM: With SSM send command, we dispatch a directive to the target EC2 instance, commanding the removal of the designated old container. The SSM command execution is monitored to ensure completion and validate the removal of the container.
  • Verification of Removal Status: After we send the SSM command, we wait to see if the removal is done properly. If everything goes well and the removal is successful, we consider it a win.
  • Handling Removal failure: If there’s an unexpected issue while we’re removing the container, the workflow will let us know by printing out the error details. This helps us understand why the container removal failed, so we can take the right steps to fix it.
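A defensive variant of the removal logic would first confirm that the renamed backup container actually exists before calling docker rm, avoiding a spurious DOCKER_RM_FAILURE when nothing was renamed. A small sketch of that guard (the container name in the usage line is hypothetical):

```shell
#!/usr/bin/env bash
# Sketch: remove the renamed backup container only if it exists.
remove_old_container() {
  local name="$1"
  # grep -x demands an exact whole-line match on the container name
  if docker ps -a --format '{{.Names}}' | grep -qx "$name"; then
    docker rm "$name"
  else
    echo "no container named $name; nothing to remove"
  fi
}
```

Usage: `remove_old_container "test-application-41"`.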

5. Check Docker Command Failures and rollback if needed:

In this phase of our deployment workflow, we address potential setbacks encountered during Docker command execution and swiftly respond with robust recovery mechanisms. Let’s break down this step-by-step process:

- name: Check Docker Command Failures and rollback if needed
  if: env.status == 'Failed'
  run: |
    echo "error_content: ${{ env.standard_error_content }}"
    if echo "${{ env.standard_error_content }}" | grep -Eq "DOCKER_RUN_FAILURE|DOCKER_PULL_FAILURE"; then
      echo "Docker run/pull command failed"
      echo "Handling Docker run failure..."
      echo "Rollback initiated..."
      rollbackId=$(aws ssm send-command \
        --instance-ids ${{ env.INSTANCE_ID }} \
        --document-name "AWS-RunShellScript" \
        --comment "Rollback to previous container version" \
        --parameters commands='[
          "set -e",
          "trap '\''echo DOCKER_RENAME_FAILURE 1>&2'\'' ERR; docker rename ${{ env.CONTAINER_NAME }}-${GITHUB_RUN_NUMBER} ${{ env.CONTAINER_NAME }}",
          "trap '\''echo DOCKER_RESTART_FAILURE 1>&2'\'' ERR; docker restart ${{ env.CONTAINER_NAME }}"
        ]' --query 'Command.CommandId' --output text)

      echo "Waiting for rollback to complete..."
      aws ssm wait command-executed \
        --command-id "$rollbackId" \
        --instance-id ${{ env.INSTANCE_ID }} || true

      echo "Fetching rollback invocation details..."
      rollbackInvocation=$(aws ssm get-command-invocation \
        --command-id "$rollbackId" \
        --instance-id ${{ env.INSTANCE_ID }})

      rollbackStatus=$(echo "$rollbackInvocation" | jq -r '.Status')

      if [[ $rollbackStatus == "Success" ]]; then
        echo "Rollback completed: $rollbackStatus"
      else
        error_content=$(echo "$rollbackInvocation" | jq -r '.StandardErrorContent')
        echo "Rollback failed with error: ${error_content}"
      fi
      exit 1

    elif echo "${{ env.standard_error_content }}" | grep -Eq "DOCKER_STOP_FAILURE|DOCKER_RENAME_FAILURE"; then
      echo "Docker stop/rename command failed"
      echo "Handling Docker stop/rename failure..."
      echo "Deploying application to new environment"
      deployId=$(aws ssm send-command \
        --instance-ids ${{ env.INSTANCE_ID }} \
        --document-name "AWS-RunShellScript" \
        --comment "Deploy application to new environment" \
        --parameters commands='[
          "set -e",
          "trap '\''echo DOCKER_LOGIN_FAILURE 1>&2'\'' ERR; aws ecr get-login-password --region eu-west-1 | docker login --username AWS --password-stdin ${{ env.ECR_IMAGE_TO_DEPLOY }}",
          "trap '\''echo DOCKER_PULL_FAILURE 1>&2'\'' ERR; docker pull ${{ env.ECR_IMAGE_TO_DEPLOY }}",
          "trap '\''echo DOCKER_RUN_FAILURE 1>&2'\'' ERR; docker run -d -p 80:80 --name ${{ env.CONTAINER_NAME }} ${{ env.ECR_IMAGE_TO_DEPLOY }}"
        ]' --query 'Command.CommandId' --output text)

      echo "Waiting for application to be deployed to new environment..."
      aws ssm wait command-executed \
        --command-id "$deployId" \
        --instance-id ${{ env.INSTANCE_ID }} || true

      echo "Fetching new environment deployment details..."
      newEnvironmentInvocation=$(aws ssm get-command-invocation \
        --command-id "$deployId" \
        --instance-id ${{ env.INSTANCE_ID }})

      newEnvironmentDeploymentStatus=$(echo "$newEnvironmentInvocation" | jq -r '.Status')

      if [[ $newEnvironmentDeploymentStatus == "Success" ]]; then
        echo "New environment deployment completed: $newEnvironmentDeploymentStatus"
      else
        error_content=$(echo "$newEnvironmentInvocation" | jq -r '.StandardErrorContent')
        echo "New environment deployment failed with error: ${error_content}"
        exit 1
      fi

    else
      echo "Command did not complete successfully."
      echo "Error details: ${{ env.standard_error_content }}"
      exit 1
    fi
  1. Handling Docker Run or Pull Failures:
  • Identifying Docker Run or Pull Failures: If errors related to Docker run or pull commands are found, the workflow recognizes that the deployment process may have encountered issues.
  • Initiating Rollback Procedure: To rectify deployment problems, the workflow triggers a rollback process using AWS Systems Manager (SSM) commands.
  • Rollback Steps: The rollback renames the backed-up container (the one suffixed with the run number) back to its original name and restarts it, restoring the previously running version of the application.
  • Waiting for Rollback Completion: After initiating the rollback, the workflow waits for the rollback command to execute and complete its task.
  • Verifying Rollback Status: Once the rollback command finishes execution, the workflow verifies its status to ensure that the rollback procedure was successful.
  • Handling Rollback Failure: If the rollback encounters errors or fails to execute successfully, the workflow logs the error details and acknowledges the rollback failure.
  • Ensuring Deployment Stability: To maintain system stability and prevent further complications, the workflow exits with an error status until the rollback issues are resolved, ensuring a smooth and stable deployment process.

2. Handling Docker Stop or Rename Failures:

  • Handling Docker Stop/Rename Failure: Upon detecting failures in Docker stop or rename commands, the workflow initiates steps to address the issue.
  • Deployment to New Environment: If Docker stop or rename commands fail, it suggests that there is no existing container running, indicating a need for a new environment deployment. This approach ensures that the application is deployed to a fresh environment.
  • Execution of Deployment Commands: The workflow utilizes SSM to send deployment commands to the instance. These commands include logging into the AWS Elastic Container Registry (ECR), pulling the required Docker image, and running the container with specified configurations.
  • Monitoring Deployment Process: After sending the deployment commands, the workflow waits for the deployment process to complete by using the aws ssm wait command-executed function, ensuring that the new environment deployment progresses smoothly.
  • Verification of Deployment Status: Once the deployment process concludes, the workflow retrieves the status of the deployment invocation to check whether the new environment deployment was successful.
  • Handling Deployment Outcome: If the deployment to the new environment succeeds, the workflow logs the completion status and acknowledges the successful deployment. However, if the deployment encounters errors, the workflow logs the error details and acknowledges the deployment failure, ensuring transparency and accountability in the deployment process.

3. Fallback for unexpected errors:

  • The last part of the script (else part) functions as a catch-all mechanism for handling unexpected errors.
  • If none of the previously mentioned failure scenarios occur and the script encounters an unexpected error, it prints a message indicating that the command did not complete successfully.
  • Additionally, it logs details of the error content, allowing developers to review and troubleshoot the issue directly from the workflow logs.
  • This ensures that any unforeseen errors are properly documented and visible within the workflow environment, facilitating effective debugging and resolution processes.
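The branching logic of this whole step can be distilled into a small, locally testable helper that maps a marker found in the error output to the recovery action taken, mirroring the grep -E branches in the workflow above:

```shell
#!/usr/bin/env bash
# Maps an SSM error marker to the recovery action the workflow takes.
classify_failure() {
  local err="$1"
  if echo "$err" | grep -Eq "DOCKER_RUN_FAILURE|DOCKER_PULL_FAILURE"; then
    echo "rollback"       # restore and restart the renamed old container
  elif echo "$err" | grep -Eq "DOCKER_STOP_FAILURE|DOCKER_RENAME_FAILURE"; then
    echo "fresh-deploy"   # no old container to preserve; deploy anew
  else
    echo "unknown"        # falls through to the catch-all branch
  fi
}
```

For example, `classify_failure "DOCKER_PULL_FAILURE manifest unknown"` prints rollback, while an unrecognized error string prints unknown. Factoring the decision out like this makes the recovery policy easy to unit-test outside the workflow.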
