Using AWS Simple Systems Manager (and Lambda) to replace Cron in an EC2 Auto Scaling Group

Simon Rand
Aug 24, 2017 · 6 min read

We recently streamlined our frontend server fleet (on EC2, leveraging Auto Scaling Groups) with one goal being to make all instances equal and as throwaway as any other — this meant we couldn’t have one special server in a group that ran our scheduled jobs (using Cron, or whatever).

Enter Amazon EC2 (Simple) Systems Manager (SSM), essentially a tool to help manage large fleets of systems. One feature available in SSM is Run Command, which allows you to securely run commands on one or more instances. Combine this with a scheduled Lambda function and we can run specific commands at a specific time.

Here’s an overview of how to implement it:

1. Install SSM Agent on all instances you want to use to run commands

  • Ensure instances have the correct IAM permissions to send and receive messages with SSM. Do this by attaching the AmazonEC2RoleforSSM managed policy to the instance role.

2. Create the Run Command Documents

Run Command leverages SSM Documents to execute certain actions on instances. We can use these to define the commands we want to run:

  • Select Documents under SYSTEMS MANAGER SHARED RESOURCES in the EC2 Console, and select Create Document:
  • Enter a Name for the Document (you’ll use this as a reference to the Document in the Lambda function)
  • Leave Document Type as Command
  • To run a shell script use the following template:
    {
      "schemaVersion": "1.2",
      "description": "Test SSM.",
      "parameters": {},
      "runtimeConfig": {
        "aws:runShellScript": {
          "properties": [
            {
              "id": "0.aws:runShellScript",
              "runCommand": ["echo 'shell script here'"]
            }
          ]
        }
      }
    }

Note: these commands are run as root on your instances, so bear this in mind when creating Documents.

(You can read more about Documents in the AWS docs: https://docs.aws.amazon.com/systems-manager/latest/userguide/sysman-ssm-docs.html)
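If you end up with several scheduled jobs, it can help to generate Documents from a list of shell commands rather than editing the JSON by hand. The following is only a minimal sketch: buildShellDocument is a hypothetical helper (not part of any AWS SDK), the example commands are made up, and it simply wraps the commands in the schema 1.2 template shown above.

```javascript
'use strict'

// Hypothetical helper: wraps shell commands in the schema 1.2 template from this post.
function buildShellDocument (description, commands) {
  if (!Array.isArray(commands) || commands.length === 0) {
    throw new Error('At least one shell command is required')
  }
  return {
    schemaVersion: '1.2',
    description: description,
    parameters: {},
    runtimeConfig: {
      'aws:runShellScript': {
        properties: [
          {
            id: '0.aws:runShellScript',
            // Each array entry becomes one line of the script
            runCommand: commands
          }
        ]
      }
    }
  }
}

// Example with made-up commands; paste the printed JSON into the Create Document form
const exampleDocument = buildShellDocument('Nightly cleanup (example).', [
  'cd /var/app/current',
  './bin/cleanup --older-than 7d'
])
console.log(JSON.stringify(exampleDocument, null, 2))
```

Remember that whatever you put in runCommand is executed as root, so a generated Document deserves the same review as a hand-written one.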

  • After saving the Document, select Run Command to test your Document on an instance
  • Select your Document from the list
  • Then select the instance(s) you want the Document to be run on
  • When you Run you should see a confirmation message and, on return to the main Run Command screen, a new entry in the Run Command list with the status of your command.
You can view details about a command (including captured output) by selecting it.

Once you are happy everything is working, the next step is to create a Lambda function to run these commands on your instance(s).

3. Implement a Lambda function to run commands via SSM

In order to run these commands on instances we need some way of triggering them. For what we wanted to achieve, Lambda was the best choice, as we wanted to be able to run commands across any current server in the Auto Scaling Group (ASG). The Lambda function therefore needed to:

  1. Query an ASG and get a current set of in-service and healthy instances
  2. Select one of these instances (potentially based on some pre-defined criteria)
  3. Run the command on the selected instance (using SSM)

To do this we used the following:

    'use strict'

    const AWS = require('aws-sdk')
    const autoscaling = new AWS.AutoScaling()
    const ssm = new AWS.SSM()

    // Entry point: event.environment is the ASG name, event.documentName the Document to run.
    // The promise chain is returned so promise-aware callers can wait for it to settle.
    exports.handler = (event) => {
      return fetchInstance(event.environment)
        .then(instance => runCommand(event.documentName, instance))
        .catch(err => reportFailure(err))
    }

    // Look up the ASG and resolve with the ID of one usable instance
    const fetchInstance = (environment) => {
      return new Promise((resolve, reject) => {
        autoscaling.describeAutoScalingGroups({
          AutoScalingGroupNames: [ environment ]
        }, (err, data) => {
          if (err) {
            return reject(JSON.stringify(err))
          }

          // An unknown ASG name comes back as an empty list rather than an error
          const group = data.AutoScalingGroups[0]
          const instance = group ? selectInstance(group.Instances) : undefined
          if (instance) {
            resolve(instance)
          } else {
            reject('No instances are available to run commands')
          }
        })
      })
    }

    // Publish the failure to SNS if a topic is configured, otherwise just log it
    const reportFailure = (failureMessage) => {
      const failureSnsTopic = process.env.FAILURE_SNS_TOPIC

      if (failureSnsTopic) {
        // SNS messages must be strings
        return reportFailureToSns(failureSnsTopic, String(failureMessage))
          .catch(err => console.log('Could not publish failure to SNS:', err))
      }

      console.log('Warning: no failure SNS defined.')
      console.log('Scheduled Job failed:', failureMessage)
    }

    const reportFailureToSns = (topic, message) => {
      const sns = new AWS.SNS()

      return new Promise((resolve, reject) => {
        sns.publish({
          Message: message,
          Subject: 'Scheduled Job Failed',
          TopicArn: topic
        }, (err, data) => {
          if (err) {
            reject(err)
          } else {
            resolve(data)
          }
        })
      })
    }

    // Send the Document to the chosen instance; rejections are reported by the handler
    const runCommand = (documentName, instance) => {
      return new Promise((resolve, reject) => {
        ssm.sendCommand({
          DocumentName: documentName,
          InstanceIds: [ instance ],
          TimeoutSeconds: 3600
        }, (err, data) => {
          if (err) {
            reject(JSON.stringify(err))
          } else {
            console.log(data)
            resolve(data)
          }
        })
      })
    }

    const selectInstance = (instances) => {
      // Find all healthy and in service instances
      instances = instances.filter(instance => {
        return instance.HealthStatus === 'Healthy' && instance.LifecycleState === 'InService'
      })

      if (instances.length === 0) return

      // For now just select a random instance
      return instances[Math.floor(Math.random() * instances.length)].InstanceId
    }

This allows us to pass a Document to the Lambda function and run it in an environment (i.e. the name of the ASG). It will also publish a message to an SNS topic if we can’t run the command; for this you’ll need to ensure the FAILURE_SNS_TOPIC environment variable on the Lambda function is set to the topic’s ARN.
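The function reads two fields from the incoming event, documentName and environment, and does not check either of them. If you want a clearer failure when a rule is misconfigured, a small guard could run before any AWS call is made. This is a sketch only: validateEvent is a hypothetical addition, not part of the function we run.

```javascript
'use strict'

// Hypothetical guard, not part of the original function: checks that the event
// carries both fields the handler reads, as non-empty strings.
function validateEvent (event) {
  const missing = ['documentName', 'environment'].filter(key => {
    return !event || typeof event[key] !== 'string' || event[key].length === 0
  })
  if (missing.length > 0) {
    throw new Error('Event is missing: ' + missing.join(', '))
  }
  return { documentName: event.documentName, environment: event.environment }
}

// The event shape the handler expects (values are examples)
const exampleEvent = { documentName: 'documentToRun', environment: 'production' }
console.log(validateEvent(exampleEvent))
```

An error thrown here could be passed to reportFailure so that a badly configured rule shows up on the same SNS topic as a failed command.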

(Note: we just randomly select an instance to run the command on for now, you could do something more clever here, like checking the load of the servers for example, but randomly selecting a host works well enough for us)
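As an illustration of what "more clever" might look like, here is a sketch of a load-aware selection. It assumes you already have a load figure per instance (for example CPU utilisation fetched from CloudWatch beforehand); the loadByInstanceId map and the instance IDs are hypothetical, and we have not run this in production.

```javascript
'use strict'

// Sketch of a load-aware alternative to the random pick. `loadByInstanceId` is a
// hypothetical map of InstanceId -> load figure (e.g. CPU %), gathered elsewhere.
function selectLeastLoaded (instances, loadByInstanceId) {
  const candidates = instances.filter(instance => {
    return instance.HealthStatus === 'Healthy' && instance.LifecycleState === 'InService'
  })
  if (candidates.length === 0) return undefined

  // Instances without a load figure are treated as the most loaded
  const loadOf = instance => {
    const load = loadByInstanceId[instance.InstanceId]
    return typeof load === 'number' ? load : Infinity
  }

  return candidates.reduce((best, instance) => {
    return loadOf(instance) < loadOf(best) ? instance : best
  }).InstanceId
}

// Made-up data: i-ccc has the lowest load but is unhealthy, so i-bbb is chosen
const exampleInstances = [
  { InstanceId: 'i-aaa', HealthStatus: 'Healthy', LifecycleState: 'InService' },
  { InstanceId: 'i-bbb', HealthStatus: 'Healthy', LifecycleState: 'InService' },
  { InstanceId: 'i-ccc', HealthStatus: 'Unhealthy', LifecycleState: 'InService' }
]
console.log(selectLeastLoaded(exampleInstances, { 'i-aaa': 72, 'i-bbb': 18, 'i-ccc': 3 }))
```

The trade-off is an extra lookup (and extra permissions) on every invocation, which is why the random pick has been good enough for us.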

You’ll need to set up the correct IAM permissions for this to run. The Lambda function uses a role that can:

  • Interact with the Auto Scaling Group: autoscaling:DescribeAutoScalingGroups, autoscaling:DescribeScalingActivities and ec2:DescribeInstances
  • ssm:SendCommand on both the Document and instance resources in your account (i.e. arn:aws:ssm:<region>:<account-id>:document/* and arn:aws:ec2:<region>:<account-id>:instance/*)
  • Publish to the SNS failure topic (sns:Publish on the relevant topic)
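Put together, the permissions above could be expressed as a policy document along these lines. Treat it as a sketch rather than the exact policy we use: buildLambdaPolicy is a hypothetical helper, the region, account ID and topic ARN are placeholders, and it assumes Documents are addressed under the ssm service prefix and instances under ec2.

```javascript
'use strict'

// Sketch of an IAM policy document for the Lambda role. region, accountId and
// topicArn are placeholders; the wildcards can be narrowed to specific
// Documents or instances.
function buildLambdaPolicy (region, accountId, topicArn) {
  return {
    Version: '2012-10-17',
    Statement: [
      {
        // Read-only lookups used to find instances in the ASG
        Effect: 'Allow',
        Action: [
          'autoscaling:DescribeAutoScalingGroups',
          'autoscaling:DescribeScalingActivities',
          'ec2:DescribeInstances'
        ],
        Resource: '*'
      },
      {
        // SendCommand needs access to both the Document and the target instance
        Effect: 'Allow',
        Action: 'ssm:SendCommand',
        Resource: [
          `arn:aws:ssm:${region}:${accountId}:document/*`,
          `arn:aws:ec2:${region}:${accountId}:instance/*`
        ]
      },
      {
        Effect: 'Allow',
        Action: 'sns:Publish',
        Resource: topicArn
      }
    ]
  }
}

// Placeholder values only
const examplePolicy = buildLambdaPolicy(
  'eu-west-1',
  '123456789012',
  'arn:aws:sns:eu-west-1:123456789012:scheduled-job-failures'
)
console.log(JSON.stringify(examplePolicy, null, 2))
```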

4. Implement CloudWatch Event Rule to trigger the Lambda function

The final part of this is to use CloudWatch Events to trigger the Lambda function as required. Simply create event rules for the schedule/times we want to run our commands and set the target to the Lambda function:

  • Go to Events > Rules in the CloudWatch console and select Create Rule
  • Under Event Source choose Schedule to invoke event targets according to a fixed schedule (there’s obviously nothing stopping you running these based off an Event Pattern)
  • Enter your schedule either using a Fixed rate or Cron expression
  • Under Target select Add Target, choose Lambda function and select your function from the list.
  • We will pass the required Document name and environment to the Lambda function using a JSON string: under Configure Input select Constant (JSON text) and enter the correct values, e.g. {"documentName": "documentToRun", "environment": "production"}
  • You can add more than one target so you can run multiple commands using the same event rule if required.
  • Click Configure Details to enter a name and an optional description for the rule. Once you click Create Rule, your event will start triggering the Lambda function and running the command on your servers.
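If you would rather script this than click through the console, the same rule and target can be described as parameters for the CloudWatch Events PutRule and PutTargets API calls. The sketch below only builds those parameter objects. buildScheduledRule is a hypothetical helper, and the rule name, schedule and Lambda ARN are placeholders.

```javascript
'use strict'

// Sketch: the rule and target from the steps above, expressed as parameters for
// the CloudWatch Events PutRule and PutTargets calls. Names and ARNs are placeholders.
function buildScheduledRule (ruleName, scheduleExpression, lambdaArn, input) {
  return {
    rule: {
      Name: ruleName,
      ScheduleExpression: scheduleExpression,
      State: 'ENABLED'
    },
    targets: {
      Rule: ruleName,
      Targets: [
        {
          Id: ruleName + '-target',
          Arn: lambdaArn,
          // Equivalent of Constant (JSON text): a JSON string, not an object
          Input: JSON.stringify(input)
        }
      ]
    }
  }
}

const nightlyRule = buildScheduledRule(
  'nightly-cleanup',
  'cron(0 3 * * ? *)', // six fields, evaluated in UTC: every day at 03:00
  'arn:aws:lambda:eu-west-1:123456789012:function:run-scheduled-job',
  { documentName: 'documentToRun', environment: 'production' }
)
console.log(JSON.stringify(nightlyRule, null, 2))
```

Note that when a rule is created outside the console, the Lambda function also needs a resource-based permission allowing events.amazonaws.com to invoke it; the console adds this for you.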

There’s probably more we could add to the Lambda function in terms of error handling, but we’ve been using this approach for over two months across all environments with almost 100% success (only 3 failed commands compared to thousands of successful ones). So far this is proving to be a great solution (on AWS) to our original problem of detaching our scheduled jobs from our frontend servers.


Written by Simon Rand, Software Engineer @ Bergamotte
