Image serving with Content Addressable Storage, S3, and ImgIx

Marketplacer’s websites used to serve user-provided, web-optimised images from the same web servers as the application itself, and stored both the original image and the web-optimised versions on disk. Each image’s filename was stored in a column on the related model, and uploads were handled with the Ruby gem carrierwave.

Marketplacer’s websites would often have many duplicate images stored against database records and on disk:

  • a templating system allowed users to copy manufacturer-provided images to their own adverts
  • users would duplicate existing records between stores of the same franchise
  • different sites with the same models would use the same original images

These duplicate images, along with handling the web-optimisation of the images and storage server-side, presented problems:

  • storing images on disk made it difficult for us to horizontally scale without using shared filesystems
  • development & test environments using production data would refer to images and assets that did not exist in the local environment
  • copying images from templates or between users could take time and use additional disk space
  • changing our web-optimisation method (eg, image size, compression ratio) required us to re-process every single uploaded image

To correct these problems we decided to rebuild our image processing & serving architecture based on ImgIx, S3 and a concept called “Content Addressable Storage”.

ImgIx allowed us to optimise our images for the web on the fly by specifying the desired resolution, compression and image format in the URL provided to the user. For added security, we opted to sign these URLs so that a user could not retrieve the original asset by modifying the URL.
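As a rough illustration of what signing involves, here is a minimal sketch. It assumes ImgIx's documented scheme, in which the signature is the MD5 hex digest of the source's secure token followed by the path and query string, appended as an `s` parameter. The helper name `signed_imgix_url` and the token value are hypothetical, and this is not necessarily how our own code is written.

```ruby
require "digest/md5"
require "uri"

# Hypothetical helper (an assumption for illustration only).
# Signs path + query with the source's secure URL token and appends
# the result as the `s` parameter.
def signed_imgix_url(host, key, params, token)
  path = "/#{key}"
  query = URI.encode_www_form(params)
  unsigned = query.empty? ? path : "#{path}?#{query}"
  signature = Digest::MD5.hexdigest(token + unsigned)
  joiner = query.empty? ? "?" : "&"
  "https://#{host}#{unsigned}#{joiner}s=#{signature}"
end

url = signed_imgix_url(
  "marketplacer.imgix.net",
  "Cg/qfKmdylCVXq1NV12r0Qvj2XgE",
  { "auto" => "format", "fit" => "max", "w" => 1600 },
  "example-token" # placeholder; a real token comes from the ImgIx source settings
)
puts url
```

Because the signature covers the whole path and query string, changing or removing any parameter produces a URL whose `s` value no longer matches, so the request is rejected.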

We created a single S3 bucket for ImgIx to use as the source of the image originals. This S3 bucket would be used by Marketplacer websites, across development, test, staging and production, and for all asset types. We were able to do this without risk of corrupting production assets due to the nature of Content Addressable Storage.

Content Addressable Storage means that every image’s location in S3 is determined by its contents. Two files with different contents will always have a different path & filename, and two files with the same contents will always have the same path & filename.

Here’s how it works:

  • a user uploads an image to a web application
  • the image is stored in a temporary directory
  • we create a SHA digest of its content (eg CgqfKmdylCVXq1NV12r0Qvj2XgE)
  • we infer the image’s location in S3 based on that digest (Cg/qfKmdylCVXq1NV12r0Qvj2XgE)
  • we check S3 to determine whether we already have an image in this location
  • if we don’t, we upload the image to the location
  • we store the image’s location in the database on the relevant database record
  • we later serve a link to a processed version of that image via ImgIx (eg https://marketplacer.imgix.net/Cg/qfKmdylCVXq1NV12r0Qvj2XgE?auto=format&fm=pjpg&fit=max&w=1600)
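The digest, location and upload-if-missing steps above can be sketched as follows. This is an illustration under assumptions, not our production code: it assumes a SHA-1 digest encoded as URL-safe, unpadded Base64 (which is consistent with the 27-character example above), and it uses a hypothetical `InMemoryBucket` class in place of a real S3 client so the example runs on its own.

```ruby
require "digest/sha1"

# Stand-in for the S3 bucket (assumption): a Hash keyed by object path.
class InMemoryBucket
  def initialize
    @objects = {}
  end

  def exists?(key)
    @objects.key?(key)
  end

  def put(key, body)
    @objects[key] = body
  end

  def size
    @objects.size
  end
end

# URL-safe Base64 of the SHA-1 digest with padding removed:
# 20 bytes of digest become 27 characters.
def content_digest(bytes)
  [Digest::SHA1.digest(bytes)].pack("m0").tr("+/", "-_").delete("=")
end

# The first two characters become a directory-like prefix (Cg/qfKmd...).
def storage_key(digest)
  "#{digest[0, 2]}/#{digest[2..-1]}"
end

# Uploads only when the key is not already present, and returns the
# key to store on the database record.
def store_asset(bucket, bytes)
  key = storage_key(content_digest(bytes))
  bucket.put(key, bytes) unless bucket.exists?(key)
  key
end

bucket = InMemoryBucket.new
first  = store_asset(bucket, "same image bytes")
second = store_asset(bucket, "same image bytes")
puts first == second
puts bucket.size
```

Uploading the same bytes twice yields the same key and leaves a single object in the bucket, which is what makes duplicate uploads cheap.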

During the first month of rolling this mechanism out to production, 5% of all uploads turned out to be duplicates of images we had already stored.

We don’t only use this method for images — since early 2017, all user-provided assets served by Marketplacer applications are served via this method. Non-image assets are now served via signed CloudFront links connecting to this S3 bucket.

We never remove assets from this S3 bucket (as storing small things in S3 is relatively cheap). The entire bucket is backed up to a server external to AWS every day.

Using SHA means that it is effectively impossible for two different assets to end up with the same name. As a result, an uploaded asset will never be replaced with a different one in the same location.
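To put a rough number on "effectively impossible", the standard birthday-bound approximation estimates the chance of any collision among n randomly distributed digests of b bits as about n² / 2^(b+1). The sketch below assumes a 160-bit digest (consistent with the 27-character example above) and a purely hypothetical figure of one billion stored assets; it covers accidental collisions only, not deliberately crafted ones.

```ruby
# Birthday-bound approximation: p ≈ n^2 / 2^(bits + 1).
# Rational keeps the huge integers exact until the final conversion.
def collision_probability(asset_count, digest_bits)
  Rational(asset_count**2, 2**(digest_bits + 1)).to_f
end

# Hypothetical scale: one billion assets, 160-bit digest (assumption).
p_collision = collision_probability(1_000_000_000, 160)
puts p_collision
```

Even at that scale the estimate comes out far below one in a billion billion billion.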

As we rely on signed links, and as the S3 bucket is private, we can even use this method to serve & control access to paid content.

In conclusion:

  • using S3 to store images, and ImgIx & CloudFront to serve them, allows us to scale horizontally without using NFS to share files between servers
  • as images are stored in a single bucket, development & test environments can link to them without worrying about missing content
  • there is no risk of development or test corrupting production data, as each image uploaded to S3 is named based on its content, making each individual asset effectively immutable
  • copying images from templates or between customers is now as simple as copying a string between database records
  • we only need to change the query parameters sent to ImgIx to change our web-optimisation method

Originally published at mipearson.id.au.