Mary Had a Little Lambda

Dean Fiala
The Startup
Nov 5, 2020

Exploring the goodies Amazon’s Web Services (AWS) offers is a kid-in-the-candy-store experience. From Machine Learning gewgaws to Text Extraction baubles, the variety of services is both enticing and overwhelming. There are so many things to sample, it is impossible to know where to start.

Photo by Viktor Talashuk on Unsplash

Training and tutorials have their place, but I have found that needing to make a tool do something useful is the most effective way to learn it. Fortunately, I had a likely project available.

I inherited a Jenkins server that performs infrastructure tasks, an inordinate number of which call into our AWS environment to create manual database snapshots that we share with another AWS account. Every time a new database is added to the environment, past practice had been to copy an existing Jenkins task, update the script by changing three variables in far too many places, and call it a day. While this might have been “acceptable” when we had three or four databases, it is unwieldy with twenty-plus.

Since we are dealing with AWS databases, and each script makes calls to the AWS RDS service, it made sense to bring the snapshot functionality into AWS.

But how?

One option would have been a dedicated server instance hosting a snapshot creation service. From an implementation standpoint, this would have been the simplest approach, but keeping a server running for something that executes once a day would be an expensive solution. And trading a Jenkins server for another server that did the same thing, even in a DRYer way, made no sense.

“What about Lambdas?”

So I started thinking, “What about Lambdas?” Lambda functions in AWS provide serverless, on-demand code execution: no server required, the code runs as needed, and it can easily be invoked via a cron-based trigger. Lambda functions can also be written in a number of languages, including Ruby. That sounded like a promising path.
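A Ruby Lambda is just a handler method that AWS calls with the triggering event and an invocation context. As a minimal sketch (the response shape mirrors the handlers later in this post; everything else is illustrative):

require 'json'

# Minimal Ruby Lambda handler: AWS invokes this method with the trigger's
# event payload (a Hash) and a context object describing the invocation.
def lambda_handler(event:, context:)
  puts "Received event: #{JSON.generate(event)}"
  { statusCode: 200, body: "hello from Lambda" }
end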

Guided by the existing scripts, my initial naive design had everything in a single Lambda function. In pseudocode…

def create_snapshots
  snapshots = get_snapshots_create
  snapshots.each do |snapshot|
    create_snapshot(snapshot.cluster_id, snapshot.temp_id)
    status = get_status(snapshot.temp_id)
    while status != 'available' do
      status = get_status(snapshot.temp_id)
      sleep
    end
    copy_snapshot(snapshot.temp_id, snapshot.final_id)
    status = get_status(snapshot.final_id)
    while status != 'available' do
      status = get_status(snapshot.final_id)
      sleep
    end
    share_snapshot(snapshot.final_id)
  end
end

While it was straightforward, I didn’t care for all the sleeping. Depending on the size of the database, creating the temporary snapshot and copying it to the final snapshot can each take anywhere from two minutes to upwards of ten. The existing Jenkins scripts kick off these steps and then poll each snapshot to check its status. Placing sleep statements between polling calls in the Lambda function would have meant that 99% of the time was spent sleeping, and since Lambda usage is charged by execution time, that would have been wasteful.

Photo by Matheus Farias on Unsplash

This approach would also mean the snapshots were processed serially, so each additional snapshot would make the whole run longer. AWS offers this beautiful, scalable infrastructure, and this method took zero advantage of it.

A little searching pointed to a solution — Step Functions. An AWS Step Function is a state machine that can call out to other AWS services including Lambdas, can wait for a set amount of time, and can branch based on a value. And importantly, usage is charged per state transition, so sleep time could be removed from the execution cost.

Translating an individual script into a state machine resulted in this…

The Start step invokes a Lambda function to delete any existing snapshots and create a new one.

The Creating step waits 60 seconds.

The CheckStatus step invokes a Lambda function to retrieve the snapshot status.

HasCreated simply checks the snapshot status returned by the CheckStatus step. If it is “available”, the Step Function moves on to the Copy step; otherwise it returns to the Creating step.

The Copy step invokes a Lambda function to copy the temporary snapshot to a final version with the shared key necessary to provide access to another account. (The reason we do all this in the first place.)

As in the Creating loop, the Copying loop has three steps…

  • Copying is a wait step
  • CheckCopyStatus invokes a Lambda function to check the final snapshot status
  • HasCopied checks the snapshot status and either continues the loop or moves on to the ShareSnapshot step

ShareSnapshot sets the restore attribute on the snapshot to the target account, and it’s Ready.

Here’s the JSON that defines the state machine…

{
  "StartAt": "Start",
  "States": {
    "Start": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": {
        "FunctionName": "snapshot_create",
        "Payload": {
          "cluster_id.$": "$.cluster_id",
          "final_snapshot_id.$": "$.final_snapshot_id",
          "temp_snapshot_id.$": "$.temp_snapshot_id"
        }
      },
      "ResultPath": "$.result",
      "ResultSelector": {
        "status.$": "$.Payload.body"
      },
      "Next": "Creating"
    },
    "Creating": {
      "Type": "Wait",
      "Seconds": 60,
      "Next": "CheckStatus"
    },
    "CheckStatus": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": {
        "FunctionName": "snapshot_status",
        "Payload": {
          "cluster_id.$": "$.cluster_id",
          "snapshot_id.$": "$.temp_snapshot_id"
        }
      },
      "ResultPath": "$.result",
      "ResultSelector": {
        "status.$": "$.Payload.body"
      },
      "Next": "HasCreated"
    },
    "HasCreated": {
      "Type": "Choice",
      "Choices": [
        {
          "Variable": "$.result.status",
          "StringEquals": "available",
          "Next": "Copy"
        }
      ],
      "Default": "Creating"
    },
    "Copy": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": {
        "FunctionName": "snapshot_copy",
        "Payload": {
          "temp_snapshot_id.$": "$.temp_snapshot_id",
          "final_snapshot_id.$": "$.final_snapshot_id"
        }
      },
      "ResultPath": "$.result",
      "ResultSelector": {
        "status.$": "$.Payload.body"
      },
      "Next": "Copying"
    },
    "Copying": {
      "Type": "Wait",
      "Seconds": 60,
      "Next": "CheckCopyStatus"
    },
    "CheckCopyStatus": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": {
        "FunctionName": "snapshot_status",
        "Payload": {
          "cluster_id.$": "$.cluster_id",
          "snapshot_id.$": "$.final_snapshot_id"
        }
      },
      "ResultPath": "$.result",
      "ResultSelector": {
        "status.$": "$.Payload.body"
      },
      "Next": "HasCopied"
    },
    "HasCopied": {
      "Type": "Choice",
      "Choices": [
        {
          "Variable": "$.result.status",
          "StringEquals": "available",
          "Next": "ShareSnapshot"
        }
      ],
      "Default": "Copying"
    },
    "ShareSnapshot": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": {
        "FunctionName": "snapshot_share",
        "Payload": {
          "cluster_id.$": "$.cluster_id",
          "snapshot_id.$": "$.final_snapshot_id"
        }
      },
      "ResultPath": "$.result",
      "ResultSelector": {
        "status.$": "$.Payload.body"
      },
      "Next": "Ready"
    },
    "Ready": {
      "Type": "Pass",
      "Result": "Hello",
      "End": true
    }
  }
}

Tying it all together, we end up with the following…

A daily event invokes the manual_db_snapshot Lambda which spawns a snapshot_builder state machine for each snapshot. These use the remaining Lambda functions to manipulate the snapshots.
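The daily event at the top of this chain is just a scheduled CloudWatch Events (EventBridge) rule that targets the manual_db_snapshot function. Creating that rule isn’t shown here; as a hedged sketch, it could be set up with the Ruby SDK along these lines (the rule name, schedule time, and ARNs are placeholders):

require 'aws-sdk-cloudwatchevents'
require 'aws-sdk-lambda'

events = Aws::CloudWatchEvents::Client.new
lambda_client = Aws::Lambda::Client.new

# Fire once a day at 06:00 UTC.
rule = events.put_rule(name: 'daily-db-snapshots', schedule_expression: 'cron(0 6 * * ? *)')

# Point the rule at the manual_db_snapshot Lambda...
events.put_targets(rule: 'daily-db-snapshots',
                   targets: [{ id: '1', arn: 'arn:aws:lambda:us-east-1:ACCTNUM:function:manual_db_snapshot' }])

# ...and allow CloudWatch Events to invoke that function.
lambda_client.add_permission(function_name: 'manual_db_snapshot',
                             statement_id: 'allow-daily-db-snapshots',
                             action: 'lambda:InvokeFunction',
                             principal: 'events.amazonaws.com',
                             source_arn: rule.rule_arn)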

This approach no longer has sleeping Lambdas and takes advantage of the AWS infrastructure to run it all in parallel!

Let’s take a look at the underlying code…

The Lambdas

As mentioned above, I used Ruby to create the Lambda functions. In this implementation, the Lambda functions are entry wrappers for the real work that goes on in the SnapshotService class.

Our first Lambda function is manual_db_snapshot, which starts the whole process. As shown above, it is invoked by a cron-scheduled CloudWatch event.

require 'aws-sdk-states'
require 'json'
require_relative './snapshot_service'

def lambda_handler(event:, context:)
  state_machine_arn = "arn:aws:states:us-east-1:ACCTNUM:stateMachine:snapshot_builder"
  snapshots = SnapshotService.get_snapshots
  step_function_client = Aws::States::Client.new()

  snapshots.each do |snapshot|
    puts snapshot[:final_snapshot_id]
    name = "#{snapshot[:final_snapshot_id]}_#{DateTime.now.strftime("%Y%m%d_%H%M")}"
    step_function_client.send(:start_execution, { :name => name,
                                                  :state_machine_arn => state_machine_arn,
                                                  :input => JSON.generate(snapshot) })
  end

  { statusCode: 200, body: JSON.generate(snapshots) }
end

The important bits here…

require 'aws-sdk-states'

This loads the AWS Step Functions Ruby SDK, so we can call our state machine…

step_function_client.send(:start_execution, {:name => name, :state_machine_arn => state_machine_arn, :input => JSON.generate(snapshot) })

Here state_machine_arn is, well, the ARN of our state machine. The name is a little more interesting: it is the name of this particular state machine execution. For state machines, start_execution is idempotent; to quote the AWS API documentation:

If StartExecution is called with the same name and input as a running execution, the call will succeed and return the same response as the original request. If the execution is closed or if the input is different, it will return a 400 ExecutionAlreadyExists error. Names can be reused after 90 days.

So we make a unique name (e.g. cool-cluster-final_20201105_0600), which also allows us to easily view and log the execution for each individual snapshot in case there are any issues.
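Because each name includes a timestamp down to the minute, collisions are unlikely, but if a name were ever reused with different input, start_execution would raise rather than start a duplicate run. The handler above doesn’t guard against that; a hedged sketch of what a guard could look like:

begin
  step_function_client.send(:start_execution, { :name => name,
                                                :state_machine_arn => state_machine_arn,
                                                :input => JSON.generate(snapshot) })
rescue Aws::States::Errors::ExecutionAlreadyExists => ex
  # The name was already used with different input (or that run has finished);
  # log it and move on to the next snapshot rather than failing the whole batch.
  puts "Skipping #{name}: #{ex.message}"
end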

The last bit is the input, which is the JSON payload that tells the state machine what snapshot to create, and looks like this…

{
  "cluster_id": "cool-cluster",
  "final_snapshot_id": "cool-cluster-final",
  "temp_snapshot_id": "cool-cluster-tmp"
}

The first thing the state machine does is invoke the snapshot_create Lambda.

require_relative './snapshot_service'

def lambda_handler(event:, context:)
  result = SnapshotService.start_snapshot(event["cluster_id"], event["final_snapshot_id"], event["temp_snapshot_id"])
  { statusCode: 200, body: result }
end

The real work happens in the SnapshotService class, which does all its work with the AWS RDS Ruby SDK…

require 'aws-sdk-rds'
require_relative './aws_helper'

class SnapshotService
  def self.rds_client
    Aws::RDS::Client.new()
  end

  def self.start_snapshot(cluster_id, final_snapshot_id, temp_snapshot_id)
    delete_snapshot(final_snapshot_id)
    delete_snapshot(temp_snapshot_id)
    create_snapshot(cluster_id, temp_snapshot_id)
  end

  def self.delete_snapshot(snapshot_id)
    existing_snapshot_ids = self.list_cluster_snapshots.map { |cs| cs[:ID] }
    if existing_snapshot_ids.include?(snapshot_id)
      rds_client.send(:delete_db_cluster_snapshot, { :db_cluster_snapshot_identifier => snapshot_id })
    end
  end

  def self.create_snapshot(cluster_id, snapshot_id)
    rds_client.send(:create_db_cluster_snapshot, { :db_cluster_identifier => cluster_id, :db_cluster_snapshot_identifier => snapshot_id })
  end

  # ... the status, copy, and share methods are shown below ...
end

After waiting for 60 seconds the state machine invokes the snapshot_status Lambda to see if the snapshot has been created…

require_relative './snapshot_service'

def lambda_handler(event:, context:)
  result = SnapshotService.get_snapshot_status(event["snapshot_id"])
  { statusCode: 200, body: result }
end

using a method in the aforementioned SnapshotService class…

def self.get_snapshot_status(snapshot_id)
  begin
    result = rds_client.send(:describe_db_cluster_snapshots, { :db_cluster_snapshot_identifier => snapshot_id })
  rescue Aws::RDS::Errors::DBClusterSnapshotNotFoundFault => ex
    return "no snapshot"
  end
  result["db_cluster_snapshots"][0]["status"]
end
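In practice the status walks through the RDS snapshot lifecycle, so successive CheckStatus passes see something like this (values illustrative):

SnapshotService.get_snapshot_status("cool-cluster-tmp")   # => "creating"
# ...a Wait/CheckStatus loop or two later...
SnapshotService.get_snapshot_status("cool-cluster-tmp")   # => "available"
SnapshotService.get_snapshot_status("no-such-snapshot")   # => "no snapshot"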

Once the snapshot is available, the state machine invokes the snapshot_copy Lambda…

require_relative './snapshot_service'

def lambda_handler(event:, context:)
  result = SnapshotService.copy_snapshot(event["temp_snapshot_id"], event["final_snapshot_id"])
  { statusCode: 200, body: result }
end

which in turn calls…

def self.copy_snapshot(temp_snapshot_id, final_snapshot_id)
  result = rds_client.send(:copy_db_cluster_snapshot, { :source_db_cluster_snapshot_identifier => temp_snapshot_id,
                                                        :target_db_cluster_snapshot_identifier => final_snapshot_id,
                                                        :kms_key_id => SHARED_KEY_ARN })
end
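SHARED_KEY_ARN here (and SHARED_ACCOUNT_ID below) are not shown in the listings; presumably they live in the aws_helper.rb that SnapshotService requires. A hypothetical sketch:

# aws_helper.rb (hypothetical): account-specific values kept in one place.
# Both values are placeholders.
SHARED_KEY_ARN = "arn:aws:kms:us-east-1:ACCT_NUM:key/key".freeze  # KMS key the target account can use
SHARED_ACCOUNT_ID = "TARGET_ACCT_NUM".freeze                      # account the snapshots are shared with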

After more waiting and checking the status of the copied final snapshot, the state machine makes one last Lambda invocation to snapshot_share.

require_relative './snapshot_service'

def lambda_handler(event:, context:)
  SnapshotService.share_snapshot(event["snapshot_id"])
  { statusCode: 200, body: "shared" }
end

All this does is set the snapshot to be accessible by the shared account.

def self.share_snapshot(final_snapshot_id)
  result = rds_client.send(:modify_db_cluster_snapshot_attribute, { :db_cluster_snapshot_identifier => final_snapshot_id,
                                                                    :attribute_name => "restore",
                                                                    :values_to_add => [SHARED_ACCOUNT_ID] })
end

That’s it — one CloudWatch event, five Lambda functions, one state machine and use of the AWS RDS Ruby SDK later — creating a snapshot for a new cluster is as simple as adding a hash to our cluster list…

def self.clusters_to_snapshot
  [
    { id: "cluster-a", final_snapshot_id: "cluster-a-final", temp_snapshot_id: "cluster-a-tmp" },
    { id: "new-cluster", final_snapshot_id: "new-cluster-final", temp_snapshot_id: "new-cluster-tmp" }
  ]
end
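How that list becomes the {cluster_id, final_snapshot_id, temp_snapshot_id} hashes that manual_db_snapshot feeds to each state machine execution isn’t shown; presumably SnapshotService.get_snapshots does a mapping along these lines (a hypothetical sketch):

# Hypothetical: turn the cluster list into the input payloads the
# snapshot_builder state machine expects.
def self.get_snapshots
  clusters_to_snapshot.map do |cluster|
    {
      cluster_id: cluster[:id],
      final_snapshot_id: cluster[:final_snapshot_id],
      temp_snapshot_id: cluster[:temp_snapshot_id]
    }
  end
end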

Permissions

AWS resources are locked down by default, so to make all of this work we need to create execution roles and grant them permission to access the needed services.

I created two IAM roles. The manual_db_snapshot_role is a Lambda basic execution role that has additional policies that allow it to:

  • invoke our state machine
  • list the database clusters and snapshots in the account
  • create, copy and delete snapshots, and alter their attributes
  • read the shared key used in the Copy step

The permissions JSON…

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "VisualEditor0",
      "Effect": "Allow",
      "Action": [
        "kms:GetPublicKey",
        "rds:CopyDBClusterSnapshot",
        "rds:DeleteDBClusterSnapshot",
        "rds:DescribeDBClusterSnapshots",
        "states:StartExecution",
        "kms:DescribeKey",
        "rds:CreateDBClusterSnapshot",
        "rds:ModifyDBClusterSnapshotAttribute",
        "rds:DescribeDBClusters"
      ],
      "Resource": [
        "arn:aws:states:us-east-1:ACCT_NUM:stateMachine:snapshot_builder",
        "arn:aws:kms:us-east-1:ACCT_NUM:key/key",
        "arn:aws:rds:*:ACCT_NUM:cluster-snapshot:*",
        "arn:aws:rds:*:ACCT_NUM:snapshot:*",
        "arn:aws:rds:*:ACCT_NUM:db:*",
        "arn:aws:rds:*:ACCT_NUM:cluster:*"
      ]
    }
  ]
}

This role is used by all the Lambda functions.

The state_machine_snapshot_builder_role has the permissions necessary to invoke our four Lambda functions. This role is used by our state machine.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "VisualEditor1",
      "Effect": "Allow",
      "Action": "lambda:InvokeFunction",
      "Resource": [
        "arn:aws:lambda:us-east-1:ACCT_NUM:function:snapshot_create",
        "arn:aws:lambda:us-east-1:ACCT_NUM:function:snapshot_copy",
        "arn:aws:lambda:us-east-1:ACCT_NUM:function:snapshot_status",
        "arn:aws:lambda:us-east-1:ACCT_NUM:function:snapshot_share"
      ]
    }
  ]
}

Conclusion

Two AWS services down, one, two…twenty…three hundred — a lot more to go — but not only have I learned how to leverage new AWS services to do my bidding, adding the next cluster to the snapshot process will now take me two minutes. And I am also twenty-plus tasks closer to replacing our Jenkins server.

Photo by Viktor Talashuk on Unsplash

Now, I really want to reach up on the top shelf and pull down the Machine Learning jar, but I will need to dream up a compelling case before I take my next trip to the candy store.
