Serverless Image Processing on AWS

Robert Hook · Published in The Startup · Feb 7, 2021

Or, “How I built Flickr with some events and a pub/sub queue”


In my last note, I talked about how I set up some serverless code build pipelines on AWS to assemble Go code into a Lambda function. At the time, I mentioned that I would be ploughing on into writing up how I used those Lambdas to build — or at least move toward — a personal photo archiving service.

To recap what I want the service to do: I want a place in S3 where I can drop a collection of photos, then have The System look at the date of each photo in its EXIF metadata, archive them into a year/month/day hierarchy, build a thumbnail image, and make a second copy on Wasabi.
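
To make that layout concrete, here is a minimal sketch of how a capture date can be mapped onto a year/month/day key. The “archive/” prefix and the helper name are my illustration, not the actual Lambda code.

import (
    "fmt"
    "time"
)

// buildArchiveKey maps a photo's capture time onto a year/month/day
// hierarchy under an archive prefix. The prefix and helper name are
// illustrative only.
func buildArchiveKey(taken time.Time, filename string) string {
    return fmt.Sprintf("archive/%04d/%02d/%02d/%s",
        taken.Year(), taken.Month(), taken.Day(), filename)
}

A photo taken on 7 February 2021 lands under archive/2021/02/07/, which is easy to browse and easy to generate index pages from later.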

Longer term, I’m looking to automatically generate webpages that display the thumbnails and click through to the original full-size images. As I said, roughly what Flickr does, without the social aspect.

My initial architecture was as shown below. I assembled all this just to verify that all the pieces worked and there were not too many surprises. Obviously there’s a fair bit of detail missing from here around security controls, and the logging to CloudWatch is not shown, but you get the idea. You might notice Secrets Manager in the mix — I definitely don’t want the Wasabi access key and secret key hard-wired into any code, and Secrets Manager lets me decouple that secret from the code and manage it out-of-band from the rest of the environment management.

Initial Architecture

When I sat down a week or so ago to write this up, though, I realised that there was a bad smell. Right there in the middle is a single Lambda function that archived the photos and made the thumbnail and copied to Wasabi. Yeah, not a good smell. As soon as you have something in your design that has several different behaviours that you describe with “and”, you know you have something that needs to be broken up into pieces.

So then, architecture version 2 was simply breaking the Lambda up:

Second Architecture

Now each Lambda does exactly one thing. The overall flow is:

  1. S3 notices that a file has been added;
  2. It notifies the “wasabi” Lambda that there’s a file to copy to Wasabi;
  3. It notifies the “thumbnail” Lambda that there’s a file to make a thumbnail of;
  4. It notifies the “archive” Lambda that there’s a file to archive.

Sweet! A cleaner architecture, and all three steps now happen in parallel rather than sequentially. Now let’s build it.

First step was to break the Lambda up — this proved to be very fast, and I’m still amazed at how quickly Go lets me iterate toward a solid solution. I did need to refactor the code pipelines in Terraform a little, but I made a Terraform module to encapsulate all that, and I can just pass in a list of Git project names (which made it really easy to add an additional Lambda, as you will see below).

Next step was to rejig the Terraform code to define the new Lambdas. Quick, easy, painless. Right. Final step: tell the S3 bucket to make the notifications.

And that’s where it fell apart. It turns out there’s a behaviour of the S3-to-Lambda notifications that is not clearly documented: the notifications are configured in terms of “this bucket and this key prefix and this key suffix trigger that Lambda”. Oops. The same combination of bucket, prefix and suffix cannot be used to trigger two different Lambdas. More concretely, something like this is disallowed:

resource "aws_s3_bucket_notification" "photos" {
bucket = aws_s3_bucket.photos.id
lambda_function {
lambda_function_arn = aws_lambda_function.archive_photo.arn
events = ["s3:ObjectCreated:*"]
filter_prefix = local.import_prefix
filter_suffix = ".jpeg"
}
lambda_function {
lambda_function_arn = aws_lambda_function.thumb_photo.arn
events = ["s3:ObjectCreated:*"]
filter_prefix = local.import_prefix
filter_suffix = ".jpeg"
}
}

If you attempt this, you will be told that the “Configuration is ambiguously defined”, which is itself a rather ambiguous message. What it actually means is that the combination of event type, prefix and suffix must be unique. You cannot use S3 Lambda notifications to fan out to different Lambdas for the same event. End of story.

Sigh. Time to iterate on the architecture again. When in doubt, add a queue, or in this case an SNS publish/subscribe topic, which in retrospect I should have been thinking of doing in the first place.

Final Architecture

As I said, sometimes all you need is a pub/sub queue. This version worked like a charm, although you will notice I added an additional Lambda: I was unsure of the format of the messages the Lambdas would now receive, so I built a simple Lambda that just logged the stream of notifications, which proved to be a boon for debugging some mystery behaviours in the implementation.
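
For reference, that logger is barely more than a registered handler that dumps whatever arrives. A minimal sketch along these lines (not the exact code I deployed):

package main

import (
    "log"

    "github.com/aws/aws-lambda-go/events"
    "github.com/aws/aws-lambda-go/lambda"
)

// logHandler prints every SNS record it receives, which is a cheap way to
// discover what the notification payloads actually look like.
func logHandler(snsEvent events.SNSEvent) {
    for _, record := range snsEvent.Records {
        log.Printf("message id %s, subject %q, body: %s",
            record.SNS.MessageID, record.SNS.Subject, record.SNS.Message)
    }
}

func main() {
    lambda.Start(logHandler)
}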

As I said above, I’d re-worked the code pipeline Terraform code so that it was a module: to add this fourth Lambda, I added the name of the project to the list I was passing to the module, added the code, and a few minutes later the Lambda artifact popped out the other side into my build bucket. I love it when a plan comes together, and I do love being lazy even when it requires some hard work to be lazy.

You will notice that the “archive” Lambda is still fired directly from an S3 event. I want to make sure that the archive happens even if the other steps don’t, rather than risk a file being dropped part-way through. As it stands, the file is not removed from the “import” location until it has been confirmed to be safely in the “archive” location. I then use an S3 notification to publish to an SNS topic, and the remaining steps happen downstream, in parallel and asynchronously.
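
The heart of that archive step is a copy-verify-delete sequence against the S3 API. Here is a sketch of its shape, using the raw SDK client; the helper name and the error handling are my illustration rather than the real code.

import (
    "github.com/aws/aws-sdk-go/aws"
    "github.com/aws/aws-sdk-go/service/s3"
)

// archiveObject copies an imported photo to its archive key, waits until the
// copy is visible, and only then deletes the original from the import prefix.
func archiveObject(svc *s3.S3, bucket, srcKey, dstKey string) error {
    _, err := svc.CopyObject(&s3.CopyObjectInput{
        Bucket:     aws.String(bucket),
        CopySource: aws.String(bucket + "/" + srcKey), // "source-bucket/source-key"
        Key:        aws.String(dstKey),
    })
    if err != nil {
        return err
    }

    // Don't remove the original until the archived copy definitely exists.
    if err := svc.WaitUntilObjectExists(&s3.HeadObjectInput{
        Bucket: aws.String(bucket),
        Key:    aws.String(dstKey),
    }); err != nil {
        return err
    }

    _, err = svc.DeleteObject(&s3.DeleteObjectInput{
        Bucket: aws.String(bucket),
        Key:    aws.String(srcKey),
    })
    return err
}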

The notifications configuration is really simple now:

resource "aws_s3_bucket_notification" "photos" {
bucket = aws_s3_bucket.photos.id
lambda_function {
lambda_function_arn = aws_lambda_function.archive_photo.arn
events = ["s3:ObjectCreated:*"]
filter_prefix = local.import_prefix
filter_suffix = ".jpeg"
}
topic {
topic_arn = aws_sns_topic.photo_created.arn
events = ["s3:ObjectCreated:*"]
filter_prefix = local.archive_prefix
filter_suffix = ".jpeg"
}
}

If you are paying attention, you might notice that I am using a single bucket for the import location, the archive location, and the thumbnail location. This is not necessarily good practice, and if I were building this for enterprise use I would split the three different data use cases across different buckets for just that extra bit of safety. For my purposes it’s probably overkill… and I anticipate that I will do it anyway, because it makes me itchy not to do things the right way.

Wiring in the SNS topic is simple — the only complicated part is my instinct to use the principle of least privilege, and ensure only S3 can publish to the topic, and only in relation to the bucket of interest:

resource "aws_sns_topic" "photo_created" {
name = local.thumb_topic
policy = data.aws_iam_policy_document.photo_created.json
tags = merge({ "Name" = local.thumb_topic }, var.tags)
}
data "aws_iam_policy_document" "photo_created" {
statement {
sid = "photoCreated"
actions = ["SNS:Publish"]
resources [
"arn:aws:sns:${var.aws_region}:${var.aws_account_id}:${local.thumb_topic}"
]
principals {
type = "Service"
identifiers = ["s3.amazonaws.com"]
}
condition {
test = "ArnLike"
variable = "aws:SourceArn"
values = [aws_s3_bucket.photos.arn]
}
}
}

Similarly, Terraform makes defining the Lambda very simple:

resource "aws_lambda_function" "topic_logger" {
description = "Log SNS messages"
s3_bucket = aws_s3_bucket.build.id
s3_key = local.lambda[local.topic_logger]
package_type = "Zip"
function_name = local.topic_logger
handler = local.topic_logger
role = aws_iam_role.topic_logger.arn
runtime = "go1.x"
memory_size = "256"
timeout = "60"
publish = true
tags = merge({ "Name" = "TopicLogger",
"Version" = local.lambda[local.topic_logger]}, var.tags)
}
resource "aws_iam_role" "topic_logger" {
name = local.topic_logger
assume_role_policy =
data.aws_iam_policy_document.lambda_assume_role_policy.json
force_detach_policies = true
tags = merge({ "Name" = local.topic_logger }, var.tags)
}
resource "aws_iam_role_policy_attachment" "topic_logger" {
for_each = toset([
"arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole",
])
role = aws_iam_role.topic_logger.name
policy_arn = each.value
}

Here you can see that we define the characteristics of the Lambda — the runtime language, where to find the code, how much memory to use and so on — and then associate an IAM role with it. This role defines the permissions that the Lambda has while running. For the topic logger, I don’t need anything other than the basic permissions, but for the other Lambdas I have additional custom policies that constrain where they can read S3 objects from and where they can write S3 objects to (and in the case of the Wasabi Lambda, permission to read the secrets).
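
On the code side of that last point, the Wasabi Lambda only needs a handful of SDK calls to turn the secret into a usable client, since Wasabi speaks the S3 API. A sketch, where the secret’s JSON field names, the region and the endpoint are all assumptions of mine rather than what the real solution uses:

import (
    "encoding/json"

    "github.com/aws/aws-sdk-go/aws"
    "github.com/aws/aws-sdk-go/aws/credentials"
    "github.com/aws/aws-sdk-go/aws/session"
    "github.com/aws/aws-sdk-go/service/s3"
    "github.com/aws/aws-sdk-go/service/secretsmanager"
)

// wasabiClient fetches the Wasabi key pair from Secrets Manager and builds
// an S3 client pointed at Wasabi's S3-compatible endpoint. The secret layout
// and endpoint here are illustrative assumptions.
func wasabiClient(sess *session.Session, secretID string) (*s3.S3, error) {
    out, err := secretsmanager.New(sess).GetSecretValue(&secretsmanager.GetSecretValueInput{
        SecretId: aws.String(secretID),
    })
    if err != nil {
        return nil, err
    }

    var keys struct {
        AccessKey string `json:"access_key"`
        SecretKey string `json:"secret_key"`
    }
    if err := json.Unmarshal([]byte(aws.StringValue(out.SecretString)), &keys); err != nil {
        return nil, err
    }

    wasabiSess, err := session.NewSession(&aws.Config{
        Region:      aws.String("us-east-1"),
        Endpoint:    aws.String("https://s3.wasabisys.com"),
        Credentials: credentials.NewStaticCredentials(keys.AccessKey, keys.SecretKey, ""),
    })
    if err != nil {
        return nil, err
    }
    return s3.New(wasabiSess), nil
}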

The only slightly opaque bit is the “assume role policy” — this is the IAM policy that says “this service is allowed to assume an IAM role on our behalf”, and for our Lambda functions looks like this:

data "aws_iam_policy_document" "lambda_assume_role_policy" {
statement {
actions = ["sts:AssumeRole"]
principals {
type = "Service"
identifiers = ["lambda.amazonaws.com"]
}
}
}

Assume Role policies are a bit confusing, and are best thought of in this fashion — you want some AWS service to act on your behalf, so you define a role with the attached policies that grant permissions, just like you do when granting permissions to a user or a group. For AWS services, there’s just the additional little step where you also have to grant the service permission to use (or assume) that role. That’s where the Assume Role policy comes in.

To wire this into SNS, first you give permission for SNS to invoke the Lambda:

resource "aws_lambda_permission" "topic_logger" {
statement_id = local.topic_logger
action = "lambda:InvokeFunction"
function_name = aws_lambda_function.topic_logger.arn
principal = "sns.amazonaws.com"
source_arn = aws_sns_topic.photo_created.arn
}

and then you subscribe the Lambda to the pub/sub queue:

resource "aws_sns_topic_subscription" "topic_logger" {
topic_arn = aws_sns_topic.photo_created.arn
protocol = "lambda"
endpoint = aws_lambda_function.topic_logger.arn
}

Job done, and this pattern can just be endlessly repeated — I intend to do a minor refactor of the code and turn that pattern into a module, just so that I’m not repeating code through cut-and-paste.

For the Lambda that is invoked directly from an S3 event, the permission is very similar, really just specifying a different AWS service principal and notification source. Naturally, I don’t subscribe this Lambda to the pub/sub topic.

resource "aws_lambda_permission" "archive_photo" {
statement_id = local.lambda_archive
action = "lambda:InvokeFunction"
function_name = aws_lambda_function.archive_photo.function_name
principal = "s3.amazonaws.com"
source_arn = aws_s3_bucket.photos.arn
source_account = var.aws_account_id
}

That’s most of the solution, really. Of course there’s a bunch of additional Terraform code doing things like:

  • setting up buckets and their security;
  • setting up various IAM policies to control what different parts of the solution are allowed to do — the principle of least privilege should always be our friend;
  • defining the Secrets Manager secret;
  • setting up logging.

Just a quick note on logging — the Lambda functions will log to CloudWatch automatically, without you needing to do anything further. I do like to “adopt” these log groups into Terraform though, for the simple reason that this gives me the chance to specify the retention period of log messages, which otherwise never expire:

resource "aws_cloudwatch_log_group" "archive_photo" {
for_each = toset([
aws_lambda_function.archive_photo.function_name,
aws_lambda_function.wasabi_photo.function_name,
aws_lambda_function.thumb_photo.function_name,
aws_lambda_function.topic_logger.function_name
])
name = "/aws/lambda/${each.value}"
retention_in_days = 14
}

Let’s turn to the Lambda code itself, just so that I can call out some of the important parts. Before doing so, I will say that the AWS SDK for Go is really good, and makes writing Lambda code almost effortless. The other thing you need is the companion library aws-lambda-go.

I’ll start with the main “archive” Lambda, which receives a direct notification from S3. First up, your go.mod:

module github.com/TheBellman/photo-lambda

go 1.15

require (
    github.com/aws/aws-lambda-go v1.21.0
    github.com/aws/aws-sdk-go v1.36.14
    github.com/rwcarlsen/goexif v0.0.0-20190401172101-9e8deecbddbd
)

I’m including the excellent EXIF library from Robert Carlsen here, which saved a lot of effort around cracking open the JPEG files to find the date the photo was taken.
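
Pulling the date out is pleasingly small with that library. A minimal sketch, assuming the object has already been fetched from S3 and is available as an io.Reader:

import (
    "io"
    "time"

    "github.com/rwcarlsen/goexif/exif"
)

// photoDate reads the capture timestamp out of a JPEG's EXIF block. A fuller
// implementation would want a fallback for images with missing or broken EXIF.
func photoDate(r io.Reader) (time.Time, error) {
    x, err := exif.Decode(r)
    if err != nil {
        return time.Time{}, err
    }
    return x.DateTime()
}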

We import the various bits that we need in our main code (along with some other standard packages, elided here):

import (
    "github.com/aws/aws-lambda-go/events"
    "github.com/aws/aws-lambda-go/lambda"
    "github.com/aws/aws-sdk-go/aws"
    "github.com/aws/aws-sdk-go/aws/session"
    "github.com/aws/aws-sdk-go/service/s3"
    // ... other imports elided ...
)

and define an interface to encapsulate the S3 service

type s3Service interface {
    GetObject(input *s3.GetObjectInput) (*s3.GetObjectOutput, error)
    CopyObject(input *s3.CopyObjectInput) (*s3.CopyObjectOutput, error)
    WaitUntilObjectExists(input *s3.HeadObjectInput) error
    DeleteObject(input *s3.DeleteObjectInput) (*s3.DeleteObjectOutput, error)
    PutObject(input *s3.PutObjectInput) (*s3.PutObjectOutput, error)
}

Note that this is not necessary for running the code, but it’s super handy for being able to “mock out” the S3 service during tests, which I will touch on below.

I use init to grab some environment variables, which are injected into the Lambda configuration by my Terraform code:

var params *runtimeParameters

func init() {
    params = &runtimeParameters{
        SourcePrefix:      validatePrefix(os.Getenv("SOURCE_PREFIX"), DefaultSrcPrefix),
        DestinationPrefix: validatePrefix(os.Getenv("DESTINATION_PREFIX"), DefaultDestPrefix),
        DestinationBucket: validateDestination(os.Getenv("DESTINATION_BUCKET")),
        Region:            validateRegion(os.Getenv("AWS_REGION")),
    }
}

Don’t worry about those validation methods; they just make sure that I’ve got something sensible to glue into the struct that I use to contain all the runtime parameters.
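
For what it’s worth, each validator is only a few lines. Something like this hypothetical validatePrefix, which falls back to a default and normalises the trailing slash:

import "strings"

// validatePrefix is a hypothetical sketch of the kind of check involved: fall
// back to a default when the variable is unset, and make sure the prefix ends
// with "/" so key construction stays predictable.
func validatePrefix(value, fallback string) string {
    if value == "" {
        value = fallback
    }
    if !strings.HasSuffix(value, "/") {
        value += "/"
    }
    return value
}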

About half of the Lambda magic occurs in the main function, which is invoked by AWS Lambda when it starts up an instance. Initially I was creating the AWS session and S3 client in the init function, but I quickly came to my senses and realised that made testing a pain in the butt, so I moved that boilerplate here:

func main() {
    sess, err := session.NewSession(&aws.Config{
        Region: aws.String(params.Region),
    })
    if err != nil {
        log.Fatal("Error starting session", err)
    }
    params.Session = sess
    params.S3service = s3.New(sess)

    log.Println("Registering handler for photo-lambda...")
    lambda.Start(HandleLambdaEvent)
}

The important part is the lambda.Start() call, which registers the handler that will be used to, well, handle all the incoming requests. I won’t go into details of what’s inside this particular handler, beyond sketching out what it receives:

func HandleLambdaEvent(request events.S3Event) (int, error) {
    cnt := 0
    for _, event := range request.Records {
        // ... stuff happens with the S3 bucket and object in here ...
    }
    return cnt, nil
}

You can see that there’s a well-defined struct passed in from the surrounding Lambda framework. This is a common pattern across the SDK, and exactly what is in the event will vary from service to service. For some services it might contain metadata about the event, but in the case of S3, all it has is a slice of events.S3EventRecord. Each of those has metadata about the event itself — when it happened, what kind of event it was, and so on — and an enclosed events.S3Entity. Unsurprisingly, this contains all you need to know about the bucket and the object.

This is a very common pattern across the SDK — a well-defined struct that wraps a slice of data objects, all of which are well documented and align exactly with the published API for the service. As I said, the combination of the SDK and Go makes dealing with the APIs so easy it’s almost cheating.
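
One small wrinkle worth knowing about: the object key inside an S3 event notification arrives URL-encoded, so keys containing spaces or other special characters need to be unescaped before you use them against the API. A small sketch (the helper is mine, not part of the SDK):

import (
    "net/url"

    "github.com/aws/aws-lambda-go/events"
)

// objectLocation pulls the bucket and key out of a single S3 event record.
// The key is unescaped because S3 delivers it URL-encoded in notifications.
func objectLocation(record events.S3EventRecord) (string, string, error) {
    key, err := url.QueryUnescape(record.S3.Object.Key)
    return record.S3.Bucket.Name, key, err
}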

I mentioned that I’d created an interface to help with mocking in tests. In my test I create a struct and implement some functions on it. This is just enough to match the interface, although the actual implementation contains some logic so that it can return different responses depending on what is in the request input. That in turn allows me to create various failure/success scenarios in my test:

type mockS3 struct{}

func (f *mockS3) PutObject(input *s3.PutObjectInput) (*s3.PutObjectOutput, error)         { /* ... */ }
func (f *mockS3) DeleteObject(input *s3.DeleteObjectInput) (*s3.DeleteObjectOutput, error) { /* ... */ }
func (f *mockS3) WaitUntilObjectExists(input *s3.HeadObjectInput) error                   { /* ... */ }
func (f *mockS3) CopyObject(input *s3.CopyObjectInput) (*s3.CopyObjectOutput, error)      { /* ... */ }
func (f *mockS3) GetObject(input *s3.GetObjectInput) (*s3.GetObjectOutput, error)         { /* ... */ }

Using the mock in tests is trivial (although you do have to remember to structure the code under test so that the service is injected):

func Test_getImageReader(t *testing.T) {
    mock := mockS3{}

    _, err := getImageReader(&mock, "bucket", "key/good.jpeg")
    if err != nil {
        t.Errorf("Received an unexpected error: %v", err)
    }

    _, err = getImageReader(&mock, "bucket", "key/bad.jpeg")
    if err == nil {
        t.Errorf("Did not get an error when expected")
    }
}

where the function being tested looks like

func getImageReader(service s3Service, bucket string, key string) (io.Reader, error) {}

This is an example of the rather delightful strongly typed duck typing that Go offers: both the real S3 client and my mock test client implement the interface, so using them is entirely interchangeable.

For the Lambdas that are receiving notifications, the pattern is very similar:

func handler(snsEvent events.SNSEvent) {
    for _, record := range snsEvent.Records {
        snsRecord := record.SNS
        message, err := parseMessage(snsRecord.Message)
        if err != nil {
            log.Fatal(err)
        }

        for _, s3 := range message.Records {
            log.Printf("ObjectCreate event observed for: %s/%s",
                s3.S3.Bucket.Name, s3.S3.Object.Key)
        }
    }
}

func main() {
    lambda.Start(handler)
}

The only complicated thing is that the content of the SNS event is not well defined in the AWS documentation. Each event contains a slice of SNSEventRecord, which in turn contains an SNSEntity that describes a notification, with an arbitrary Message string which is the notification payload. All very simple, but what on earth is the format of the Message when S3 has raised an SNS notification? Good luck finding that in the documentation!

Writing my event logging Lambda was enough to crack the mystery: it’s a chunk of JSON that is the same events.S3EventRecord that we’ve already seen. I can parse the JSON, and just sail on:

type snsMessage struct {
    Records []events.S3EventRecord `json:"Records"`
}

func parseMessage(messageBody string) (*snsMessage, error) {
    var message snsMessage

    err := json.Unmarshal([]byte(messageBody), &message)
    if err != nil {
        return nil, fmt.Errorf(
            "failed to parse the message body: %q = %v",
            messageBody, err)
    }
    return &message, nil
}

That’s really about all there is to get started with if you want to do something similar. The other two pieces of magic in play to do the actual photo manipulation were:

  1. An excellent EXIF library from Robert Carlsen that I use to pull apart the JPEG and find a useful date to organise it with;
  2. The disintegration/imaging library from Grigory Dryapak, which made creating the scaled-down thumbnail trivially easy (sketched below).
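
To give a flavour of that second library, the thumbnail generation itself boils down to very little code. A sketch, where the 256×256 bounding box and the JPEG output are my illustrative choices rather than necessarily what the real Lambda does:

import (
    "bytes"
    "io"

    "github.com/disintegration/imaging"
)

// makeThumbnail decodes an image and scales it down to fit inside a 256x256
// box, returning JPEG bytes ready to be written back to S3.
func makeThumbnail(r io.Reader) ([]byte, error) {
    img, err := imaging.Decode(r)
    if err != nil {
        return nil, err
    }
    thumb := imaging.Fit(img, 256, 256, imaging.Lanczos)

    var buf bytes.Buffer
    if err := imaging.Encode(&buf, thumb, imaging.JPEG); err != nil {
        return nil, err
    }
    return buf.Bytes(), nil
}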

There are some future enhancements I want to make.

First of course is to write the code that periodically builds the webpages for me, and to wire those into my personal website.

It would also be good to support formats other than JPEG, and to split the import, archive and thumbnail files across three different buckets.

Finally, I’d like an activity monitor that does not require me to log in to the AWS console and scan through logs, although I’m holding off on that until I have a better sense of how I would like it delivered.

The impressive thing about cloud services in general from the big three, and from AWS in particular, is that it’s so damned easy to build robust solutions. Working on my own, for perhaps a man-week in total, I’ve knocked together a solution that is arbitrarily scalable, ridiculously cheap to run and maintain, and the equal of any multi-million-dollar data acquisition and processing pipeline that I’ve seen in the enterprise. General platform-oriented cloud services like AWS, GCP and Azure, when coupled with specialist services like Wasabi, put incredible computing power into the hands of anyone who takes the time to learn how to assemble them. Be brave. Build things. Have fun.

Engineering Manager at Pleo, with a belief that technology can be simple, easy and fun. 35+ years building robust, secure data driven solutions.