Migrating from AWS Opsworks Stacks to AWS Systems Manager

Jason Westbrook
Jun 9, 2023 · 9 min read

On May 26, 2023, AWS emailed every account using AWS Opsworks Stacks to announce that the service will be shut down one year later, on May 26, 2024. AWS provided a blog post to help users migrate their stacks to AWS Systems Manager (https://aws.amazon.com/blogs/mt/migrate-your-aws-opsworks-stacks-to-aws-systems-manager/). That post links to a migration script that is supposed to help you move your currently running Opsworks Stacks to Systems Manager. It might help you migrate a simple stack with minimal lifecycle recipes, but it is by no means a turnkey solution.

First, the documented limitations:

  • No Windows or CentOS instances — to be clear, the SSM document "AWS-ApplyChefRecipes" does not support the Windows platform at all. You will need to completely rethink any stacks/layers that are Windows based.
  • Does not migrate the built in Chef 11 layers
  • Any recipes that reference Chef attributes or data bags will break
  • No details or information from Apps are carried over. This includes code deployment sources, integrated data sources, environment variables, SSL cert information, and domains.
  • No IAM user to Opsworks SSH/RDP user linking
  • No time-based, load-based, or CloudWatch-alarm-based instance rules
  • If one Stack has multiple layers, the migration script treats them as separate applications
  • Only the Setup lifecycle recipes are automatically run; no other lifecycle recipes are run.

Now the undocumented limitations and problems:

  • All of the resources are dumped into the same CloudFormation Stack. Nested stacks would have avoided repeating the same resources over and over.
  • IAM roles get policies attached directly to them. Security Hub starts blinking red when you do this manually, so why is AWS providing a CloudFormation Template that does it?
  • If you have more than one layer, you get a lot of duplicated IAM Roles and Policies. This is probably a side effect of how resources are referenced, but it could have been done a lot better.
  • The Stack config, which includes the custom cookbook repo, the recipe run list, and the custom JSON, gets dumped into one SSM parameter as a double- and triple-escaped JSON string. This is unmaintainable.
  • The CloudFormation Template out of the box has a race condition that sometimes fails to trigger the Setup lifecycle. I had to decouple this to make it work reliably (more details below).
  • The CloudFormation Template only sends logs to S3; I had to dig and Google a bit to find the config option that sends the logs to CloudWatch Logs instead.
  • Opsworks Chef is more forgiving of loosely formatted cookbooks. The SSM Document ApplyChefRecipes requires all of your cookbooks to be correctly formatted, even if they are not used in your recipes.
  • Any cookbook dependencies in the cookbook metadata need to exist or be removed.
  • SSM ApplyChefRecipes pulls the custom cookbooks every time it runs; it does not rely on a local cache. This is just a change to be aware of — it's neither bad nor good in my opinion.
  • Recipe timeouts are ignored in favor of SSM document execution timeouts, and even then, if you set an execution timeout longer than an hour, that gets ignored as well.
  • If you look at the SSM Documents, there is a Python script that replaces the text "ssm-securestart" and "ssm-secureend" with "{{ssm-secure:}}" so that the document run processes the SSH private key value correctly. This should cause some alarm: a random AWS developer wrote a Python script that has access to your secure SSM parameter. If you don't dig in and look at it, who knows what that Python script might do with those values.
  • The Readme.md file in the downloaded script has a lot more documentation than the web-based documentation, so make sure you read it.
  • The images in the migration ZIP file show the Auto Scaling groups as part of the Application Manager page; however, tag-based Resource Groups do not support Auto Scaling Groups. The OpsWorksCFNTemplate.yaml file creates the Resource Group as a tag-based Resource Group, so your Auto Scaling Group will not be included.
  • The images in the migration ZIP file also show an Instances Table screen, which I have been unable to find anywhere in Systems Manager.
  • Migrating App information to a single SSM Parameter is not feasible because SSM parameters are limited to 8192 characters. As soon as you include the SSL cert, private key, and chain, you either exceed that limit or leave no room for any other data. If you move the SSL cert to Secrets Manager, you now incur a monthly cost that you were not charged for with Opsworks (thanks, AWS).
  • There is no built in mechanism to define the instance hostname.
  • Custom cookbooks are not rotated out or cleaned up, so disk space can fill up quickly with large cookbook repos. There are open issues against the SSM agent ( https://github.com/aws/amazon-ssm-agent/issues/94 , https://github.com/aws/amazon-ssm-agent/issues/308 , https://github.com/aws/amazon-ssm-agent/issues/471 , https://github.com/aws/amazon-ssm-agent/issues/425 ) which do not appear to have been resolved, despite devs/contributors saying they have been.
  • If you have instances in your Opsworks Stack that use Elastic IPs, you have two options:
    #1 Don't launch that instance into an ASG; launch it directly with CloudFormation. You lose the ASG monitoring in case the instance fails, so you need to add monitoring manually.
    #2 Launch the instance into its own ASG and, as part of the instance startup, force-disassociate the Elastic IP and then associate it with the newly launched instance (idea from this Medium post: https://lakshman301195.medium.com/elastic-ip-in-an-auto-scaling-group-a43c3bc9e74 ). A rough boto3 sketch of this is shown just after this list.
  • In Opsworks you can tell your entire stack to run Setup, Configure, or Deploy, and each layer will run its own set of recipes for those lifecycle events. In SSM you need to create a parent Setup/Configure/Deploy Automation document that looks up which instances are attached to the stack or layer and then sends the specific Automation command to those instances, which is a lot more to build.
  • Auto Scaling groups and Launch Templates don't partition and format any additional EBS volumes you attach to your instances. You need to handle this in your Setup commands (I show the recipe I use further down).
  • When running your new deployment lifecycle recipes from SSM, the StartAutomationExecution API is not well documented, so it takes a lot of trial and error. I'll list what I did below.
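
As a rough sketch of Elastic IP option #2 above, the newly launched instance can take over the address at boot. This is not part of the migration tool; the allocation ID lookup and the plain IMDSv1 metadata call are assumptions you would adapt to your environment.

# Hypothetical sketch of Elastic IP takeover at instance boot (option #2 above).
# Assumes the Elastic IP allocation ID is known, e.g. baked into user data or a tag.
import boto3
import urllib.request

def reassociate_eip(allocation_id):
    # Instance ID of the machine this runs on, from instance metadata (IMDSv1 for brevity)
    instance_id = urllib.request.urlopen(
        'http://169.254.169.254/latest/meta-data/instance-id', timeout=2
    ).read().decode()

    ec2 = boto3.client('ec2')

    # Break any existing association for this Elastic IP
    address = ec2.describe_addresses(AllocationIds=[allocation_id])['Addresses'][0]
    if 'AssociationId' in address:
        ec2.disassociate_address(AssociationId=address['AssociationId'])

    # Attach the Elastic IP to this newly launched instance
    ec2.associate_address(
        AllocationId=allocation_id,
        InstanceId=instance_id,
        AllowReassociation=True
    )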

So what have I done to make it better?

I separated the foundational resources (the Automation Role the SSM Documents run as, the JSON config parameters, and the recipe run lists) into their own CloudFormation Stack, since these resources do not change across layers. I also added another JSON config parameter so the recipes have a flag to know they are running from SSM instead of Opsworks.

So my CloudFormation Stacks look like this:

— Foundation
— — Automation Role
— — Base Stack JSON (custom cookbook source)
— — Custom Config JSON
— — Layer 1 Setup Recipe Run List
— — Layer 1 Configure Recipe Run List

— Instances
— — Layer 1
— — — Launch Template
— — — Auto Scaling Group
— — — Layer 1 Setup Document
— — — Layer 1 Configure Document

In the OpsWorksCFNTemplate.yaml file, the Auto Scaling Group and the EventBridge Rule are created at the same time, so the ASG can spin up an instance before the Rule exists, and then your Setup Document won't run. To combat this, I removed the Rule and added an SNS notification to the ASG instead. The SNS topic sends to a Lambda function that #1 checks whether the instance is marked Healthy and InService, and if not, waits 5 minutes and checks again, then #2 sends the Setup Document to the instance. This has been 100 times more reliable than hoping the race condition doesn't trigger. The Lambda is not part of the CloudFormation Stack, because I'll be using it for all of my stack scaling handling.
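
Here is a minimal sketch of that Lambda, assuming the ASG launch notification arrives via SNS and that the generated Setup document takes an instanceId parameter like the documents shown later in this post. The document name and the wait/retry values are placeholders.

# Minimal sketch of the Lambda behind the ASG SNS notification.
# The document name, the 'instanceId' parameter name, and the wait values are assumptions.
import json
import time
import boto3

autoscaling = boto3.client('autoscaling')
ssm = boto3.client('ssm')

SETUP_DOCUMENT = 'Layer1-Setup-Document'  # placeholder document name

def lambda_handler(event, context):
    # The ASG notification arrives wrapped in an SNS message
    message = json.loads(event['Records'][0]['Sns']['Message'])
    if message.get('Event') != 'autoscaling:EC2_INSTANCE_LAUNCH':
        return
    instance_id = message['EC2InstanceId']
    asg_name = message['AutoScalingGroupName']

    # 1. Wait until the ASG reports the instance as Healthy and InService
    #    (the Lambda timeout has to allow for this wait)
    for attempt in range(2):
        groups = autoscaling.describe_auto_scaling_groups(AutoScalingGroupNames=[asg_name])
        state = next((i for g in groups['AutoScalingGroups'] for i in g['Instances']
                      if i['InstanceId'] == instance_id), None)
        if state and state['HealthStatus'] == 'Healthy' and state['LifecycleState'] == 'InService':
            break
        if attempt == 0:
            time.sleep(300)  # wait 5 minutes and check again

    # 2. Send the Setup document to the new instance
    ssm.start_automation_execution(
        DocumentName=SETUP_DOCUMENT,
        TargetParameterName='instanceId',
        Targets=[{'Key': 'ParameterValues', 'Values': [instance_id]}],
        MaxConcurrency='1',
        MaxErrors='1'
    )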

To combat the timeouts, I added retries to some of the commands that need more than an hour to run. Why would a recipe need more than an hour? It needs to compile from source — on smaller instances that takes time.

In the SSM Document, add this value to the inputs object:

"inputs": {
"CloudWatchOutputConfig":{"CloudWatchOutputEnabled":true}

This sends your command output to CloudWatch Logs. You can send to S3 as well if you like.

I was already using my own Chef recipes to add instances to my Load Balancers, so I did not configure the ASG to do so. Because of this I have not explored any problems there.

I added a JSON config value "run_source":"SSM" so that in my recipes I can add a logic block:

if node[:run_source] != nil && node[:run_source] == "SSM"

Since most of the values I have been retrieving from Chef attributes or Chef data bags are values about the stack or instance, I can get those values from the EC2 metadata like this:

chef_gem 'aws-sdk-ec2' do
  action :install
  compile_time true
end

require 'aws-sdk-ec2'

ec2_metadata = Aws::EC2Metadata.new
aws_instance_id = ec2_metadata.get('/latest/meta-data/instance-id')

This allows my current recipes to continue working in Opsworks while I continue to explore.

At this point I have only been able to run my Setup lifecycle. I still need to update my Configure recipes and Deploy recipes, and add my App configurations.

Because of the length limitation of SSM Parameters, I'm using a combination of SSM Parameters and Secrets Manager. With Secrets Manager, you can replicate the same secret to any Region where you use the SSL cert.

Here is the App configuration JSON I'm storing in an SSM Parameter. It replicates the Chef App JSON structure. Both the SSH key and the SSL cert are stored in Secrets Manager, because of the SSM Parameter length limit and because secrets can be replicated.

{
  "domain": "www.domain.com",
  "app_source": {
    "type": "git",
    "url": "ssh://",
    "revision": "master",
    "ssh_key": "arn:aws:secretsmanager::secret:"
  },
  "ssl_certificate": "arn:aws:secretsmanager::secret:"
}
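
To pull that configuration back apart in a recipe or helper script, something along these lines works. The parameter name is a placeholder, and the ARNs above are elided, so substitute your own values.

# Rough sketch of resolving the app configuration; the parameter name is a placeholder.
import json
import boto3

ssm = boto3.client('ssm')
secrets = boto3.client('secretsmanager')

def load_app_config(parameter_name='/myapp/app-config'):
    # The app JSON itself lives in an SSM parameter
    raw = ssm.get_parameter(Name=parameter_name)['Parameter']['Value']
    config = json.loads(raw)

    # The bulky and sensitive pieces live in Secrets Manager, referenced by ARN
    config['app_source']['ssh_key'] = secrets.get_secret_value(
        SecretId=config['app_source']['ssh_key'])['SecretString']
    config['ssl_certificate'] = secrets.get_secret_value(
        SecretId=config['ssl_certificate'])['SecretString']
    return config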

To prevent the disk from filling up with custom cookbooks, I added another Chef recipe that puts an SSM agent config file at /etc/amazon/ssm/amazon-ssm-agent.json. The file only needs a bit of JSON in it:

{
  "Ssm": {
    "OrchestrationDirectoryCleanup": "clean-success-failed"
  }
}

The SSM document cleanup should be working, but at the moment it does not — I'll be adding a cleanup recipe to clean up the folder /var/lib/amazon/ssm/<instance id>/document/orchestration.
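
The cleanup recipe isn't written yet, but here is the rough shape of what I have in mind as a sketch. The seven-day threshold and the exact directory layout under /var/lib/amazon/ssm are assumptions to verify on your own instances.

# Rough sketch of cleaning old orchestration folders; threshold and layout are assumptions.
import os
import shutil
import time

SSM_ROOT = '/var/lib/amazon/ssm'
MAX_AGE_SECONDS = 7 * 24 * 3600  # keep roughly the last week of document runs

def clean_orchestration_dirs():
    now = time.time()
    for instance_dir in os.listdir(SSM_ROOT):
        orchestration = os.path.join(SSM_ROOT, instance_dir, 'document', 'orchestration')
        if not os.path.isdir(orchestration):
            continue
        for run_dir in os.listdir(orchestration):
            path = os.path.join(orchestration, run_dir)
            # Each sub-folder corresponds to one document execution; drop the old ones
            if os.path.isdir(path) and now - os.path.getmtime(path) > MAX_AGE_SECONDS:
                shutil.rmtree(path, ignore_errors=True)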

At this point I only have my DR region replicated, but it is not in service yet, as I still need to figure out the DNS issues of stack instances talking to each other.

In Opsworks, you can easily select your entire stack and run Configure or Deploy. To replicate that, I created an Automation document that runs an executeScript step. The script takes all of the selected instance IDs and splits that list into separate per-layer lists. It looks at the tags on each instance (which come from the EC2 Launch Template) to determine which layer the instance belongs to.

def script_handler(events, context):
    import boto3

    ec2_client = boto3.client('ec2')

    instanceids = events['InstanceList']
    print(instanceids)

    # Pre-seed the layers I know about; any other SSM-LAYER tag values get added below
    output = {'layers': {'Layer1': [], 'Layer2': []}}

    instances = ec2_client.describe_instances(InstanceIds=instanceids)

    if 'Reservations' in instances and len(instances['Reservations']) > 0:
        for res in instances['Reservations']:
            for instance in res['Instances']:
                if 'Tags' in instance:
                    for tag in instance['Tags']:
                        # Tag values like "Layer-1" become keys like "Layer1"
                        tag['Value'] = tag['Value'].replace('-', '')
                        if tag['Key'] == 'SSM-LAYER' and tag['Value'] not in output['layers']:
                            output['layers'][tag['Value']] = []
                        if tag['Key'] == 'SSM-LAYER':
                            output['layers'][tag['Value']].append(instance['InstanceId'])

    output['Layer1Count'] = len(output['layers']['Layer1'])
    output['Layer2Count'] = len(output['layers']['Layer2'])

    return output

To handle extra EBS volumes beyond the root volume, I added another recipe

device_id = "/dev/nvme1n1"
mount_point = "/data"

ruby_block "wait for #{device_id}" do
  block do
    count = 1
    loop do
      if File.blockdev?(device_id) or count >= 10
        break
      else
        Chef::Log.info("device #{device_id} not ready - sleeping 10s")
        sleep 10
        count += 1
      end
    end
  end
end

# create file system
execute "mkfs #{device_id}" do
  command "mkfs -t ext4 -m 0.25 #{device_id} && partprobe #{device_id}"
  # only if it's not mounted already
  not_if "grep -qs #{mount_point} /proc/mounts"
end

# mount
directory mount_point do
  mode "0775"
  owner "root"
  group "root"
  recursive true
  action :create
end

mount mount_point do
  device device_id
  fstype 'ext4'
  options 'rw,noatime'
  action [:enable, :mount]
end

The SSM StartAutomationExecution API call requires the DocumentName parameter, and many of the other parameters are conditional. Here's what I did to automate my deployments from my CI/CD pipelines. First, list the instance IDs with the tag name "SSM-LAYER" and value "Layer-1" and put them into an array of strings, e.g. instanceids = ["i-xxxx","i-yyyy"]. I pass that array into the Targets object list:

{
  'DocumentName': "DOCUMENT NAME",
  'TargetParameterName': 'instanceId',
  'Targets': [
    {
      'Key': 'ParameterValues',
      'Values': instanceids
    }
  ],
  'MaxConcurrency': "1",
  'MaxErrors': "1"
}

If you are using the automation documents generated by the tool, instanceId is the parameter the document expects for the list of instance IDs to run against.
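
Putting that together, the call my pipeline makes looks roughly like this with boto3. The document name and the Layer-1 tag value are placeholders for your own.

# Sketch of the CI/CD deployment call; the document name and tag value are placeholders.
import boto3

ec2 = boto3.client('ec2')
ssm = boto3.client('ssm')

def deploy_layer(document_name='Layer1-Deploy-Document', layer_tag='Layer-1'):
    # 1. Find the running instances tagged SSM-LAYER = Layer-1
    reservations = ec2.describe_instances(
        Filters=[
            {'Name': 'tag:SSM-LAYER', 'Values': [layer_tag]},
            {'Name': 'instance-state-name', 'Values': ['running']}
        ])['Reservations']
    instanceids = [i['InstanceId'] for r in reservations for i in r['Instances']]

    # 2. Start the automation document against those instances, one at a time
    response = ssm.start_automation_execution(
        DocumentName=document_name,
        TargetParameterName='instanceId',
        Targets=[{'Key': 'ParameterValues', 'Values': instanceids}],
        MaxConcurrency='1',
        MaxErrors='1'
    )
    return response['AutomationExecutionId']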

My Thoughts

  • The Readme.md file contains a reference to an Opsworks V2? If AWS is building a new version of Opsworks, why put us through all these migration headaches now, only to go through them again when V2 launches (November 2023 at re:Invent, maybe)?
  • What happens to instances that are still running in Opsworks after May 2024? When the Opsworks Chef service goes away, will the chef agent process start throwing errors and cause other problems on those instances? When the Opsworks Chef service is turned off, is the default last command "shutdown/terminate"?
  • Are there any big companies that called their account manager when they got the email and tore them a new one? Would that delay or reverse the decision to kill the service, in favor of migrating customers to the new service instead?
