Building a real-time image processing solution at the edge

Khoa Le
Engineering @ Homes.com.au
13 min read · Nov 17, 2021

Background

Homes.com.au is a proptech startup based in Australia. Our goal is to make it easy for Australians to buy or rent their next home.

We currently process over 3 million unique images each month, most of which are of properties listed on our site. These images are sourced from different channels — mainly uploaded by our customers — and they differ in quality and size. We needed to ensure these images were presented consistently to users of the website, taking into consideration the device on which they were viewing the site.

Our initial attempt was to do this using an existing provider, but we found ourselves unsatisfied with the user experience we delivered, so we set ourselves an engineering challenge to “do better”. The following is an overview of our journey in exploring and delivering on that challenge.

What’s the problem we’re trying to solve?

We had been using a popular 3rd-party image service provider, Cloudinary, to transform and serve images to the end user. When images were requested, Cloudinary fetched them from our self-maintained S3 bucket and processed them before responding to users. One of the benefits of using Cloudinary was the ability to cache “already fetched” images, but what we soon found was that first-time fetching of images was surprisingly slow, resulting in users spending a non-trivial amount of time waiting for images to load. Pre-caching these images as they came in to our platform was an option, but that came at a financial cost.

We found Cloudinary to be rather expensive; its linear pricing model did not fit well with our forecasted growth in website traffic. It would have cost us ~$400 USD per 1 million images on just initial transformations. Any additional transformations and fetches would have only increased that cost.

So… we set ourselves some goals:

  • Achieve fast image load time with p99 < 100ms
  • Cut current costs significantly
  • Improve data governance; we control how, where and when our data is stored
  • Connected infrastructure; we wanted to pre-cache our images using existing AWS services

Proposed solution

It basically came down to two options:

  • Consider a different 3rd party service, or
  • Implement an in-house solution

We knew that we would not need to do anything fancy with the images and hence, we felt, processing them ourselves should be relatively easy to achieve. Picking another third-party solution would not have reduced the cost by much. Therefore we decided to build our own image service.

After a little bit of research we concluded that the most straightforward solution would be to store the images, process them and host them on a CDN. In the AWS world, the solutions for those three can be mapped to S3, Lambda@Edge and CloudFront.

Knowing that we weren’t really “breaking any new ground” with our chosen approach, some further research quickly led us to a few articles and blog posts about how other people have solved similar problems.

Although the principles and ideas can be applied to other platforms, we chose to focus on AWS for this implementation. CloudFront + S3 was a no-brainer when it came to choosing a cost-effective and reliable CDN solution. Real-time image transformation is listed by AWS themselves as one of the use cases of Lambda@Edge.

We are fans of infrastructure as code and therefore used Terraform to manage and automate deployments of these infrastructure components (e.g. CloudFront, S3, Lambda).

Architecture overview

Overview of the architecture

This is roughly what the architecture looks like. CloudFront as a CDN provides 4 trigger points during the life cycle of a request (viewer request, origin request, origin response and viewer response) to which developers can hook their own Lambda@Edge functions and manipulate the request/response. We have two Lambda@Edge functions and one normal Lambda:

  • One that intercepts requests for images to validate the request and determine appropriate image formats
  • One to process images on the fly when the image with desired size and format is not found
  • One to process images asynchronously when a new image is available

The third lambda was needed due to the way images are made available to our system: we receive them from an external system that we don’t control.
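For illustration, the viewer-request function might look something like this (a minimal sketch covering the validation half; the URI pattern and size whitelist are assumptions rather than our exact implementation):

'use strict';

// Viewer-request Lambda@Edge: validate the requested size and reject
// anything outside the whitelist before CloudFront checks its cache.
const ALLOWED_WIDTHS = [32, 64, 128, 384, 640, 1080, 1920];

exports.handler = async (event) => {
  const request = event.Records[0].cf.request;

  // Expecting URIs like /img-homes-com-au/640/photo.jpg
  const match = request.uri.match(/^\/img-homes-com-au\/(\d+)\/(.+)$/);
  if (!match || !ALLOWED_WIDTHS.includes(Number(match[1]))) {
    // Rejecting unknown sizes keeps the cache key space bounded
    return {
      status: '400',
      statusDescription: 'Bad Request',
      body: 'Unsupported image size',
    };
  }

  return request; // pass the request through to the cache / origin
};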

Pros and cons

Pros:

  • We fully own the solution and will be able to customise it as we’d like
  • Eliminate features that we don’t need and their associated cost
  • Better performance and lower cost
  • Finally, as always, expanding our technical knowledge in this area

Cons:

  • Development and maintenance cost. However, we see this as acceptable since the benefits far outweigh the cost.

There are also limitations and restrictions imposed on Lambda@Edge (the full list is here) as of this writing:

  • Only Python and Node.js runtimes are supported
  • No support for dead letter queues
  • No support for environment variables
  • It can’t be configured to access resources inside a VPC
  • Limit on the request body size
  • Limited to 10,000 requests/second (can be increased)
  • The function, along with some related resources, must be deployed to the us-east-1 region
  • Often a numbered version of the lambda must be used, not the convenient $LATEST alias. This may not sound like a big deal but it has been a deal breaker in the past for a fully-automated IaC solution.

Fortunately, none of these limitations were a major issue for us. Having said that, we would need to monitor the solution closely, especially the request traffic when our user base grows. Any throttling indicates we are hitting that limit of requests per second.

Cost analysis

In order to compare the cost between solutions, we made some assumptions about our usage for a month:

  • Average image size: 50KB
  • Traffic — 10 million requests for images; 10% require the original images to be processed while the other 90% should be served via the CDN cache
  • Network — 500GB bandwidth used
  • Storage: 50GB * 7 different sizes = 350GB

Also for the new solution, we assumed:

  • AWS pricing for the Sydney region
  • An average duration of 5ms for the viewer request lambda
  • An average duration of 100ms to process an image
  • We preprocess 7 different sizes of an image and store them all in S3, resulting in 7 million images stored in S3
Cost comparison

The AWS solution is a clear winner here at just ~20% of the cost.
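For a rough sense of the Lambda@Edge line items, here’s a back-of-envelope calculation under the assumptions above (the rates are the published Lambda@Edge prices at the time of writing, roughly $0.60 per million requests and $0.00005001 per GB-second; treat the numbers as illustrative):

// Back-of-envelope Lambda@Edge compute cost for one month.
const GB_SECOND = 0.00005001;  // USD per GB-second (Lambda@Edge rate)
const PER_REQUEST = 0.6 / 1e6; // USD per invocation (Lambda@Edge rate)

// Viewer-request lambda: 10M invocations, 5ms each at 128 MB
const viewer = 10e6 * (0.005 * 128 / 1024) * GB_SECOND + 10e6 * PER_REQUEST;

// Real-time image processor: 1M invocations (10% of traffic), 100ms at 1769 MB
const processor = 1e6 * (0.1 * 1769 / 1024) * GB_SECOND + 1e6 * PER_REQUEST;

console.log(viewer.toFixed(2));    // ~6.31
console.log(processor.toFixed(2)); // ~9.24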

Please note that there are other cost items that we intentionally did not calculate due to their complexity, and which we know shouldn’t affect our comparison much. Some of those are logging and CloudFront data transfer out to origin.

What did we need to consider?

As always, there were several considerations to make for our solution.

  • How will we process images on the fly? This is required for existing images as we transition to the new approach.
  • How will we process and store images we receive going forward? Should we process them all in advance so that we wouldn’t need to do it again in real time? A trade-off would have to be made between storage cost vs real-time performance.
  • Which image formats are we going to support? We also need to consider various browsers on both desktop and mobile.

Prototype

Choosing the language

At the time of building this, Lambda@Edge only supported Node.js and Python runtimes. While it’s possible to use one language for the Lambda@Edge functions and another, such as Go, for the normal Lambda, we decided to be consistent and use a single language for all of our Lambda functions for now. Any premature optimisation is undesirable until properly benchmarked.

Node.js was selected since it is a popular language of choice for edge functions such as Cloudflare Workers and Akamai EdgeWorkers. Many of our engineers were also more familiar with Node.js than Python, so it was a natural choice for us.

Choosing the image processing library

While looking at existing Node.js libraries for image processing, sharp came up as a strong candidate. It uses libvips under the hood and is praised for its high performance and its support for several image formats and transformation operations. There is an existing benchmark of its performance against other popular libraries here.

Building the prototype

We started building the prototype by creating the two lambdas following this article:
https://aws.amazon.com/blogs/networking-and-content-delivery/resizing-images-with-amazon-cloudfront-lambdaedge-aws-cdn-blog/

We chose it for a few reasons:

  • It was authoritative (i.e. written by AWS), even though it was three years old
  • The implementation was written in Node.js
  • They also used sharp to process images

It’s almost as if we were copying their solution and modifying it to fit our scenario!

Instead of CloudFormation, we used Terraform for infrastructure as code. There used to be an issue with Terraform (and also the Serverless Framework) not handling the full deployment lifecycle of Lambda@Edge. For instance, when a new version of a Lambda@Edge function was published, developers would have to attach the newer version to the CloudFront distribution manually. That would be a real deal-breaker for our continuous deployment setup. Luckily, that issue had been solved in Terraform by then, enabling us to fully automate the deployment pipeline.

The core of the resizing logic looks roughly like the snippet below (a simplified sketch; the bucket name, quality setting and error handling are illustrative rather than our exact code):
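'use strict';

const AWS = require('aws-sdk');
const sharp = require('sharp');

const s3 = new AWS.S3();
const BUCKET = 'example-images-bucket'; // illustrative bucket name

// Fetch the original from S3, resize it to the requested width, re-encode
// as JPEG with the mozjpeg engine, then persist the variant so subsequent
// requests are served straight from S3/CloudFront.
async function resizeImage(originalKey, width, resizedKey) {
  const original = await s3
    .getObject({ Bucket: BUCKET, Key: originalKey })
    .promise();

  const resized = await sharp(original.Body)
    .resize({ width, withoutEnlargement: true }) // never upscale
    .jpeg({ mozjpeg: true, quality: 80 })        // quality is illustrative
    .toBuffer();

  await s3
    .putObject({
      Bucket: BUCKET,
      Key: resizedKey,
      Body: resized,
      ContentType: 'image/jpeg',
      CacheControl: 'max-age=31536000',
    })
    .promise();

  return resized;
}

module.exports = { resizeImage };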

Testing image quality

One thing we needed to verify was the image quality of the new solution. Obviously we wanted to keep the image quality as close to the original as possible; any degradation would affect the user experience. We compared image quality using the Structural Similarity Index (SSIM) method.

  • Size — The new solution’s JPEG output is comparable to Cloudinary’s WebP; ours is smaller for bigger images (±20%) while Cloudinary’s WebP is smaller for smaller images (±20%)
  • Blockiness — Both solutions performed relatively well
  • Ringing — The old solution has considerably more ringing artefacts
  • Aliasing — None for either (no upsizing)
Image quality comparison
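A comparison along these lines can be scripted; below is a sketch of how SSIM might be computed for a pair of outputs, assuming the ssim.js npm package (an assumption on our part) and inputs of identical dimensions:

const sharp = require('sharp');
const { ssim } = require('ssim.js'); // assumption: the open-source ssim.js package

// Decode an image into the raw RGBA ImageData shape that ssim.js expects.
async function toImageData(path) {
  const { data, info } = await sharp(path)
    .ensureAlpha() // ssim.js works on 4-channel RGBA data
    .raw()
    .toBuffer({ resolveWithObject: true });
  return { data: new Uint8ClampedArray(data), width: info.width, height: info.height };
}

// Compare two outputs of the same dimensions; an mssim of 1.0 means identical.
async function compare(referencePath, candidatePath) {
  const [reference, candidate] = await Promise.all([
    toImageData(referencePath),
    toImageData(candidatePath),
  ]);
  const { mssim } = ssim(reference, candidate);
  console.log(`SSIM: ${mssim.toFixed(4)}`);
}

compare('cloudinary-640.jpg', 'new-solution-640.jpg');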

Implementation

Feature flag

In order to test the new solution in staging and production, we used a feature flag to control the behaviour on the client side. We use LaunchDarkly as our feature flag provider. It allows us to enable the flag per environment, and at the individual user level. This came in handy when we tested the feature in production by turning it on for all of our internal users, including the engineering and product teams. Of course, we would need to log in with our accounts so the app could determine our user type from the account email.

We also tested for backward compatibility by switching back and forth between the old and new approaches.
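On the client, the flag check itself is straightforward; here is a sketch using the LaunchDarkly JavaScript SDK (the flag key, user object and URL helpers are hypothetical):

import { initialize } from 'launchdarkly-js-client-sdk';

// Initialise with the logged-in user so flags can be targeted by email.
const client = initialize('YOUR_CLIENT_SIDE_ID', {
  key: user.id,
  email: user.email, // lets us target internal team accounts
});

await client.waitForInitialization();

// 'new-image-service' is a hypothetical flag key; default to the old path.
const useNewImageService = client.variation('new-image-service', false);

const imageSrc = useNewImageService
  ? newImageUrl(photo, 640)    // CloudFront + Lambda@Edge solution
  : cloudinaryUrl(photo, 640); // legacy Cloudinary URL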

Choosing the S3 bucket directory and URL structure

We tried to structure our S3 bucket in the way most pragmatic for our current situation, while also considering future extensions, such as hosting images for a different product or other types of static assets using the same solution. Therefore, we chose to place all images for our current site in the img-homes-com-au folder.

.
└── img-homes-com-au
    ├── 32
    ├── 64
    ├── 128
    ├── 384
    ├── 640
    ├── 1080
    └── 1920

At the same time, each image will exist in various fixed sizes. We decided to store each version of a particular size in a particular sub-folder. One small benefit of doing so is the ability to keep the exact original image name in each sub-folder, and we can easily check for discrepancies in the number of images in each of those sub-folders without having to do fancy filtering with S3 operations.
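A hypothetical helper makes the convention concrete:

// Hypothetical helper: build the S3 key for a given size variant.
// Keeping the original file name inside each size folder makes it easy
// to diff the folders and spot missing variants.
const SITE_PREFIX = 'img-homes-com-au';

function variantKey(width, originalName) {
  return `${SITE_PREFIX}/${width}/${originalName}`;
}

variantKey(640, '12-example-st.jpg'); // => 'img-homes-com-au/640/12-example-st.jpg'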

Choosing the image formats

At the beginning, we considered using WebP as it is a modern image format that provides superior lossless and lossy compression for images on the web. However, we had to take into account its support across different browsers and devices. On iOS, WebP is only supported in Safari from iOS 14.0 onwards, which could be a big problem for us: users on older iOS versions would have trouble using our site effectively.

This article https://siipo.la/blog/is-webp-really-better-than-jpeg thoroughly analyses the differences between the two formats under various circumstances. Long story short, it depends on the use case.

We decided to stick with JPEG for the time being; device compatibility was more vital than performance to us at this stage. By using the mozjpeg engine for processing, we were able to preserve high image quality with performance comparable to WebP in our initial testing.

Processing images in the background

When images are made available to us, we preprocess them so that they can be served to users faster when requested.

Existing images, however, are not preprocessed and hence need to be converted on the fly first. The real-time image processor then triggers the background processor lambda to generate the remaining sizes — in the background — so they will be ready for future requests.

Real-time image processing flow
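The trigger itself can be a simple asynchronous invocation; here is a sketch using the AWS SDK (the region, function name and payload shape are illustrative):

const AWS = require('aws-sdk');

const lambda = new AWS.Lambda({ region: 'ap-southeast-2' }); // illustrative region

// Fire-and-forget: InvocationType 'Event' queues the background processor
// without making the user-facing request wait for the remaining sizes.
async function triggerBackgroundProcessing(originalKey, processedWidth) {
  await lambda
    .invoke({
      FunctionName: 'background-image-processor', // illustrative name
      InvocationType: 'Event',
      Payload: JSON.stringify({ originalKey, skipWidth: processedWidth }),
    })
    .promise();
}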

Hardening the infrastructure

We believe in good security and do our best to learn and follow best practices in the industry. The following are what we did to enhance our infrastructure:

  • Enabled public access blocking for the S3 bucket
  • Only allowed access to the S3 bucket via the designated CloudFront distribution (see the policy sketch after this list)
  • Enabled server-side encryption for S3 data by default, including the images and CloudFront access logs
  • Disabled the default root object in the CloudFront distribution to prevent listing objects in an S3 directory
  • Redirected all CloudFront traffic to HTTPS
  • Restricted HTTP methods to GET, HEAD and OPTIONS only
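As an example of the second point, restricting S3 access to the distribution is typically done with a CloudFront origin access identity (OAI) in the bucket policy. A sketch of that policy as a JavaScript object (the OAI ID and bucket name are placeholders; in our setup Terraform manages this):

// Allow only the CloudFront OAI to read objects; everything else is
// blocked by the bucket's public access block.
const bucketPolicy = {
  Version: '2012-10-17',
  Statement: [
    {
      Sid: 'AllowCloudFrontOAIReadOnly',
      Effect: 'Allow',
      Principal: {
        AWS: 'arn:aws:iam::cloudfront:user/CloudFront Origin Access Identity E1EXAMPLE',
      },
      Action: 's3:GetObject',
      Resource: 'arn:aws:s3:::example-images-bucket/*',
    },
  ],
};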

Final architecture

Final architecture diagram

Performance benchmarking

Measuring latency

We used k6 to test the performance of the new solution against the old one. Essentially, we wrote a script to fire requests continuously at a given URL and measure the latency. The script looked roughly like this (a sketch; the target URL is a placeholder):
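import http from 'k6/http';
import { check } from 'k6';

// 10 virtual users sharing 100 iterations, i.e. roughly 10 requests each,
// matching the test configuration described below.
export const options = { vus: 10, iterations: 100 };

export default function () {
  const res = http.get('https://images.example.com/img-homes-com-au/640/photo.jpg');
  check(res, { 'status is 200': (r) => r.status === 200 });
}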

The full result provided good detail on all the metrics:

k6 test result

The final result was:

  • New solution: P95 = 67ms
  • Old solution: P95 = 750ms

Looking back at the result, we realised that the test configuration wasn’t at a desirable size. The script simulated 10 (virtual) users sending 10 requests each, which is relatively small for a proper benchmarking exercise. This is a lesson learned to improve our benchmarking approach in the future.

Real-life performance

We also did a “realistic test” by simply browsing the website and inspecting the network requests by eye 👀. By observing the image load time from the user’s perspective, we could tell a significant improvement had been made.

Further optimisation

From the user interaction point of view, we knew users were likely to perform certain actions. For example, when users open the image carousel of a property, they will almost certainly scroll through the images. We optimised our implementation to cater for such behaviour (see the preloading sketch after this list):

  • Increased react-virtualized total slide counts to preload more images
  • Prioritised loading images when users enter gallery mode
  • Prioritised loading images of those properties that are visible in the carousel of the current page
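Preloading in the browser can be as simple as creating detached Image objects for the next few slides; a sketch, independent of the react-virtualized specifics:

// Preload the next few carousel images so they are already cached by the
// time the user scrolls to them. `urls` would come from the carousel state.
function preloadImages(urls, count = 3) {
  urls.slice(0, count).forEach((url) => {
    const img = new Image(); // a detached <img> still fires the request
    img.src = url;
  });
}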

We also looked at optimising the Lambda functions:

  • Increased max memory for the real-time image processor to 1769 MB to fully utilise the power of 1 vCPU (see doc for reference). While lambdas for viewer requests and viewer responses are limited to 128 MB of memory, lambdas attached to origin requests and responses don’t have that limitation; their timeout is also higher, at 30s. Unfortunately, provisioned concurrency is not supported for Lambda@Edge
  • We did, however, set reserved concurrency for the background image processor, so that a sudden influx of new images wouldn’t consume the entire per-region quota for concurrent executions

Monitoring

We strongly believe no project is ever complete without proper monitoring and alerting. Observability of a system is just as important as building it; without it, we wouldn’t be able to understand how our system is performing or know when things go wrong once Murphy’s Law kicks in. We used Sumo Logic for centralised logging and metrics dashboards, as well as CloudWatch for alerts. This setup is worth an article of its own.

Others

Real cost after going live

This is the daily cost breakdown by service for the last month, obtained with AWS Cost Explorer.

Cost breakdown for Oct 2021 from AWS Cost Explorer

It cost us ~$135 for the month, with about 3 million images processed, stored and served to users. It could have been $2,000+ if we were still using Cloudinary, due to its billing structure. As we can see, S3 and Lambda each account for about half of the total cost, while CloudFront costs virtually nothing.

Extra-large image causing out-of-memory errors

Processing images can take a lot of memory, especially when the image dimensions are large. For example, we once had an image of 6717 x 14133 pixels (what a size!), which exceeded the pixel limit. Moreover, it would result in an out-of-memory error when we attempted to resize the image into 7 different sizes in parallel.

We could observe this error from the logs:

width: 6717 [Error: Input image exceeds pixel limit]

If the lambda execution is terminated due to out-of-memory, the log would look like this:

Error: Runtime exited with error: signal: killed
Runtime.ExitError

After this incident, we switched the non-realtime image processor lambda to resize images sequentially, and also increased the lambda’s memory and timeout (to 60s) to accommodate those edge cases. This eliminated most of our errors, but we may need to refactor the lambda to be more performant if that turns out not to be enough.
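The change amounts to trading throughput for a bounded memory footprint; a sketch, reusing the resizeImage and variantKey sketches from earlier:

const SIZES = [32, 64, 128, 384, 640, 1080, 1920];

// Before: all sizes at once. Peak memory is roughly the sum of every
// concurrent decode/encode, which is what killed the lambda on huge inputs.
// await Promise.all(SIZES.map((width) => resizeImage(key, width, variantKey(width, name))));

// After: one size at a time. Slower overall, but peak memory is bounded
// by a single resize.
async function processSequentially(key, name) {
  for (const width of SIZES) {
    await resizeImage(key, width, variantKey(width, name));
  }
}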

Conclusion

In the end, we were very happy with the outcome. We set out to address not only a technical challenge, but also a problem affecting the experience of our users. We were able to build out a solution from scratch that is cheaper, faster and more customisable.

We also plan to open-source our solution at some point in the future. Image hosting and transformation are common tasks that many developers will come across, and we hope to help the developer community with our experience in this area.

References

Lambda@Edge design best practices https://aws.amazon.com/blogs/networking-and-content-delivery/lambdaedge-design-best-practices/

Optimizing S3 performance https://docs.aws.amazon.com/AmazonS3/latest/userguide/optimizing-performance.html

Special thanks to Chris Chan for working with me on this project and contributing to this article, and to Benny Thomas for reviewing and editing this article.

We are also actively looking for awesome engineers to join our engineering team. If you are interested in solving similar challenges, please get in touch https://www.linkedin.com/feed/update/urn:li:activity:6861441278274629632/.
