Handling thousands of image uploads per second with Amazon S3
At uDroppy we work a lot with images, and being able to process large volumes of them (both uploading and serving) is a key part of our platform, so we can give our users a fluid UX.
In this article I am going to discuss one way we deal with large volumes of uploads. I would like to note that, as always in IT, there’s no “best way” or “right way”: each problem has multiple solutions, and the one described in this article is just one of many.
About image (file) uploading
When it comes to SaaS, one feature you will surely implement at some point is image (or any other file) uploading.
Even though at first this seems like an easy task, this part of your application, if done badly, will quickly become a pain point and a very expensive one.
But let’s get straight to the point: how can you let your users upload an image and then save it?
Let’s start by splitting the main problem into smaller ones. Here are the things we want to address:
- Choose where to store the images
- Store the image
Where to store the image
There are plenty of possibilities. In this article I am going to discuss the 3 most popular choices:
- Database
- File System
- A cloud platform
Let’s say you are using MongoDB as one of your DBMS and the first thing that comes to your mind is to save the images in a collection.
Even though I have seen some scenarios where it makes sense to store some images in the DB (I’ll write another article about it), I think in most cases this is the worst choice you can make, and the reason is that it is VERY expensive.
Let’s do some math. Suppose you are using mLab as your provider. mLab charges $15 per GB of storage as of today.
And let’s suppose the average size of the uploaded images is 1 MB. So it costs you $15 per month for every one thousand images!
That’s quite expensive, and trust me, it is actually even more expensive, as we are not counting hidden costs (DB CPU usage, server usage, etc.).
So if you are going to host thousands of images, this is for sure not the best solution.
So the next thing that might come to your mind is saving the image on your server in the file system.
At first it makes perfect sense: you have a server where your app is running, you receive the image, and you store it locally on the hard drive.
In the pre-cloud era, this was the most common way to deal with image uploads. The flow is simple: the client sends the image to the server, and the server writes it to its local disk.
However there is a big issue with this method, which is scaling your application horizontally.
Let’s say your application starts doing very well, and you increase your user-base day by day. At some point you will have to scale your app to handle all the requests. Can you see the problem?
What if a major news magazine in your field writes an article about you? You are going to experience a LOT of traffic from different parts of the world. And you will need to scale up VERY quickly.
Each time you have to spawn a new instance of your application, you will also have to copy all the images to the new instance. And if you have tens of thousands of images, this is not going to be quick for sure, so scaling your app is not going to be easy.
Here we are finally, it’s 2019 and the world is ruled by the big cloud. Almost everything is stored and executed in the cloud.
There are plenty of providers you can choose from. The most popular are:
- Amazon Web Services (AWS)
- Google Cloud Platform
- Microsoft Azure
I am going to talk about AWS, but you can replicate the same logic on any other cloud platform (as long as it has the features I am going to talk about).
In particular I am going to talk about Amazon Simple Storage Service, also known as S3. Let’s break it down and see how S3 solves the problems of pricing and scalability:
First, pricing. We calculated that with the DB approach we had to pay $15 per GB each month. There’s not much more to say other than that S3 charges you $0.023 per GB each month. Not bad.
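To make the comparison concrete, here is a small sketch of the arithmetic. It only uses the two prices quoted in this article and the 1 MB average image size; the function name and the image counts are illustrative assumptions, and it counts storage only (no requests, transfer, or other fees).

```python
def monthly_storage_cost(num_images: int, avg_size_mb: float, price_per_gb: float) -> float:
    """Monthly storage cost in USD, counting 1 GB as 1000 MB as the article does."""
    return num_images * avg_size_mb / 1000 * price_per_gb


DB_PRICE_PER_GB = 15.0    # database price quoted in this article
S3_PRICE_PER_GB = 0.023   # S3 price quoted in this article

if __name__ == "__main__":
    for n in (1_000, 100_000):  # illustrative image counts
        db = monthly_storage_cost(n, 1.0, DB_PRICE_PER_GB)
        s3 = monthly_storage_cost(n, 1.0, S3_PRICE_PER_GB)
        print(f"{n:>7} images: database ${db:,.2f}/month, S3 ${s3:,.3f}/month")
```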
As for scalability, you don’t have to worry about it: with its huge network of servers around the globe, Amazon takes care of it for you. You just have to set a few options in your AWS console, and you are done.
Now that we have found our storage partner, things start to get interesting and we can finally move to the main part of this article: how are we going to store images on S3 in a super-scalable way?
Storing the image
Now that we have decided to store the image on S3 we have to understand how to upload the image to S3.
Do we first upload the image from the client to the server and then upload it to S3 from the server?
Uploading the image from the server can be very expensive
If you do a quick search online, you will find plenty of guides that address this “problem” by first uploading the image to the server, and then uploading it to S3.
This solution works perfectly, but having the server receive an image and hold it for some time in a temp file (or just in memory) is VERY expensive in terms of costs.
Pick a language/framework and write a simple API that receives an image and stores it locally in a temp file. You will notice a spike in CPU usage when you receive the image and store it. On the MacBook Pro I am using at the moment, CPU usage increases by 4%.
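If you want to try a rough version of this experiment without a web framework, the sketch below measures the CPU time spent writing fake uploads to temp files. It is only an approximation under my own assumptions: the function names are made up, the payload is random bytes, and it ignores the network and request parsing that a real API would also pay for, so the numbers will differ from machine to machine.

```python
import os
import tempfile
import time


def store_upload(data: bytes) -> str:
    """Write an uploaded payload to a temp file and return its path."""
    with tempfile.NamedTemporaryFile(delete=False, suffix=".img") as f:
        f.write(data)
        return f.name


def cpu_seconds_for_uploads(count: int, size_bytes: int = 1_000_000) -> float:
    """CPU time (not wall time) used to store `count` fake uploads."""
    payload = os.urandom(size_bytes)  # stand-in for a ~1 MB image
    start = time.process_time()
    for _ in range(count):
        path = store_upload(payload)
        os.remove(path)  # clean up so the test does not fill the disk
    return time.process_time() - start


if __name__ == "__main__":
    print(f"CPU seconds for 100 uploads: {cpu_seconds_for_uploads(100):.3f}")
```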
Now try to imagine what would happen if your APIs received 10 or even more images per second.
You might be thinking “yeah, but my app will never have that kind of volume”. Well, who knows, maybe your app is going to be the next Instagram, or maybe not.
But I’ll give you another reason, which is cost. To keep it short, the more CPU you use, the more you pay. That’s a pretty good reason to optimize this process. And by the way, Instagram handles roughly 900 images per second.
How do we handle image upload then?
The solution to all our problems is called a presigned URL.
A presigned URL gives you access to the object identified in the URL, provided that the creator of the presigned URL has permissions to access that object. You can learn more about them in the Amazon S3 documentation.
Long story short:
- The client calls an API to tell the server that it needs to upload a file, without sending the file itself.
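- The server, using its AWS credentials, creates a presigned URL for that upload. The file name, type, and other constraints are bound to the URL in this phase.
- With the AWS SDKs this is a signing step done on the server itself, so no round trip to S3 is needed to obtain the URL.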
- The server sends back the URL to the client.
- The client performs a PUT request against the URL.
- The image (file) is uploaded to S3.
Let’s break it down into simpler parts.
At some point the client needs to upload an image but, as we discussed before, we don’t want to send the file to the server, as it would be expensive for the CPU.
So the client just “tells” the API that it wants to upload an image.
The API, for its part, receives the request and, using its private AWS credentials, creates a presigned URL for S3: basically a URL where a file can be uploaded, but only if all the requirements the server specified are satisfied.
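Here is a minimal sketch of what such an API handler could look like, kept framework-free on purpose. Everything in it is an assumption made for illustration: the JSON field names (filename, content_type), the allowlist of types, the uploads/ key prefix, and the presign callable, which stands for whatever function your server uses to create the presigned URL. The point is only that the handler works with a few bytes of metadata and never touches the file.

```python
import json
from typing import Callable, Tuple

# Hypothetical allowlist: only these content types may get an upload URL.
ALLOWED_TYPES = {"image/jpeg", "image/png"}


def handle_upload_request(body: bytes, presign: Callable[[str, str], str]) -> Tuple[int, bytes]:
    """Turn a small JSON request into (HTTP status, JSON response body).

    `presign(key, content_type)` is injected by the caller and returns the URL.
    """
    try:
        payload = json.loads(body)
        filename = payload["filename"]
        content_type = payload["content_type"]
    except (ValueError, KeyError, TypeError):
        return 400, json.dumps({"error": "expected filename and content_type"}).encode()

    if content_type not in ALLOWED_TYPES:
        return 415, json.dumps({"error": "unsupported content type"}).encode()

    # No file data is involved here: only metadata goes in, only a URL goes out.
    url = presign(f"uploads/{filename}", content_type)
    return 200, json.dumps({"upload_url": url}).encode()
```

In a real service you would probably also generate the object key yourself instead of trusting the client's file name.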
If the presigned URL is generated for an image, it can’t be used to upload a video. If it was signed for a JPEG (Content-Type image/jpeg), a request that declares a PNG is rejected. If there is a CORS policy on the bucket, that has to be satisfied too. So there’s a lot of security you can implement in this layer.
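To show where those requirements live, here is a sketch of how a presigned PUT URL can be built with AWS Signature Version 4 query signing, using only the Python standard library. In practice you would let an AWS SDK do this for you; I am writing it by hand only to show that the content type, the expiry, and the object key are all part of what gets signed. The function name, bucket, key, and credentials are made up for the example, and the code has not been run against a live bucket, so treat it as an illustration and not as a replacement for the SDK.

```python
import datetime
import hashlib
import hmac
from urllib.parse import quote


def _hmac(key: bytes, msg: str) -> bytes:
    return hmac.new(key, msg.encode("utf-8"), hashlib.sha256).digest()


def presign_put_url(bucket, key, content_type, access_key, secret_key,
                    region="us-east-1", expires=300, now=None):
    """Build a SigV4 presigned URL for PUT that signs Content-Type and Host."""
    now = now or datetime.datetime.now(datetime.timezone.utc)
    amz_date = now.strftime("%Y%m%dT%H%M%SZ")
    date = now.strftime("%Y%m%d")
    host = f"{bucket}.s3.{region}.amazonaws.com"
    scope = f"{date}/{region}/s3/aws4_request"
    path = "/" + quote(key, safe="/")

    query = {
        "X-Amz-Algorithm": "AWS4-HMAC-SHA256",
        "X-Amz-Credential": f"{access_key}/{scope}",
        "X-Amz-Date": amz_date,
        "X-Amz-Expires": str(expires),  # validity in seconds
        "X-Amz-SignedHeaders": "content-type;host",
    }
    canonical_query = "&".join(
        f"{quote(k, safe='')}={quote(v, safe='')}" for k, v in sorted(query.items())
    )
    # The uploader must send exactly this Content-Type, or the signature will not match.
    canonical_headers = f"content-type:{content_type}\nhost:{host}\n"
    canonical_request = "\n".join([
        "PUT", path, canonical_query, canonical_headers,
        "content-type;host", "UNSIGNED-PAYLOAD",
    ])
    string_to_sign = "\n".join([
        "AWS4-HMAC-SHA256", amz_date, scope,
        hashlib.sha256(canonical_request.encode("utf-8")).hexdigest(),
    ])
    # Derive the signing key from the secret key: date, region, service, terminator.
    signing_key = ("AWS4" + secret_key).encode("utf-8")
    for part in (date, region, "s3", "aws4_request"):
        signing_key = _hmac(signing_key, part)
    signature = hmac.new(signing_key, string_to_sign.encode("utf-8"),
                         hashlib.sha256).hexdigest()
    return f"https://{host}{path}?{canonical_query}&X-Amz-Signature={signature}"
```

Because the content type is one of the signed headers, a URL created for image/jpeg produces a different signature than one created for image/png, which is what makes the restriction enforceable.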
At this point the server sends the presigned URL back to the client.
Finally the client sends the file via a PUT HTTP request to S3, and if all requirements are satisfied the file is correctly uploaded.
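On the client side, the upload is just a plain HTTP PUT with the file as the body. The sketch below uses Python's standard library as a stand-in for whatever your client is written in (in a browser it would be a fetch or XMLHttpRequest call); the function names are my own. It assumes the URL was signed for a specific Content-Type, so the same value has to be sent as a header.

```python
import urllib.request


def build_upload_request(presigned_url: str, data: bytes, content_type: str) -> urllib.request.Request:
    """Prepare the PUT request; the Content-Type must match the one that was signed."""
    return urllib.request.Request(
        presigned_url,
        data=data,
        method="PUT",
        headers={"Content-Type": content_type},
    )


def upload(presigned_url: str, data: bytes, content_type: str) -> int:
    """Send the file straight to the storage service and return the HTTP status.

    urlopen raises urllib.error.HTTPError for 4xx/5xx responses, for example
    when the URL has expired or the signed requirements are not met.
    """
    request = build_upload_request(presigned_url, data, content_type)
    with urllib.request.urlopen(request) as response:
        return response.status
```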
The benefit of this approach is that our server has to handle just a simple API call where there’s no file data. The upload itself is processed by the client, leaving our server free and ready to process the next request very quickly.
As you can imagine this method is very scalable, and at the same time not very expensive.
Just remember that S3 is a storage service and not a CDN. In the next article I will discuss how to deliver resources from S3 in the fastest possible way!