I currently work for a media company focussing on sports news and entertainment. The majority of our readable content is crowdsourced. Our content creators get paid based on the revenue their articles generate. It is a very transparent system and one that scales well. This growing scale is good for business but brings a lot of challenges for us on the engineering team. We are therefore always focussed on building systems that can handle large volumes of traffic without any downtime for our users and content creators.
Today, a significant percentage of content consumption has moved from text to video. Keeping up with changing times, we plan on extending our content creation platform to handle video uploads as well. Well… video is a different ball game altogether. Owing to their huge file sizes, videos need to be stored and managed separately.
While brainstorming on the architecture of the system, we came up with a couple of approaches:
The user uploads the video to our server, and then we upload it to S3 and pass that S3 bucket URL on for encoding.
Well, this might sound simple enough, but this approach has an unnecessary overhead which could potentially lead to a major catastrophe (for us at least). How, you ask? Let me tell you. One of our microservices serves client-facing requests as well as content creation requests. In this architecture, it is possible that a lot of video upload requests choke the server (by hogging CPU or server bandwidth), and any further requests time out.
But what if I have an autoscaled environment?
For those who don’t know what autoscaling is, let me give you an overview. If your server can handle 1000 requests/second and you start getting 2000 requests/second, then your server will not be able to serve all of those requests and some of them will time out for the user. To solve this, we can run 2 servers in parallel and distribute traffic across them using a load balancer. A reasonable question here would be: then why don’t we always run 2 servers in parallel? Well, we could, but if the number of requests grows up to 5000/sec we’ll need more than 2 servers. Also, there is a cost associated with each server that is running. The ideal solution is to spin up new servers only when there is an actual need for one. AWS provides such an autoscaling solution out of the box. You can specify one or more trigger conditions (e.g. CPU over 80% and/or RAM usage over 90%) which, when met, will autoscale your service.
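The arithmetic behind that reasoning can be sketched in a few lines. This is only an illustration: the function names are made up for this post, the 1000 requests/second capacity is the figure from the example above, and a real autoscaler (AWS's included) evaluates its triggers over time windows rather than on a single reading.

```python
import math


def servers_needed(requests_per_second: int, capacity_per_server: int) -> int:
    """Smallest number of servers that can absorb the given load."""
    if capacity_per_server <= 0:
        raise ValueError("capacity_per_server must be positive")
    # Always keep at least one server running, even with no traffic.
    return max(1, math.ceil(requests_per_second / capacity_per_server))


def should_scale_out(cpu_percent: float, ram_percent: float,
                     cpu_limit: float = 80.0, ram_limit: float = 90.0) -> bool:
    """Hypothetical trigger: scale out when either limit is exceeded."""
    return cpu_percent > cpu_limit or ram_percent > ram_limit


for load in (1000, 2000, 5000):
    print(load, "req/s ->", servers_needed(load, 1000), "server(s)")
```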
Now that we know the basics of autoscaling, let’s consider a probable scenario in our case. Suppose there are a lot of incoming requests to the server from both content creators (uploading videos) and end users. Serving both kinds of request is critical for business. With autoscaling in place, an additional server will be added automatically when the trigger conditions are met. Once the new server (or node) is up and running, it’ll start receiving and serving requests. Both nodes now receive video uploads and store them locally on their respective hard disks. Ideally, both servers will keep uploading to S3. Works fine, right? Well, not really. If the traffic were to subside a bit and one server was sufficient to serve the requests, then one server would be killed (while downscaling). When this happens, we can’t be 100% sure that the videos on that node’s filesystem have been uploaded to S3 yet.
To overcome these potential issues, we thought of skipping the middleman and uploading videos directly to S3 from the client’s browser. With this approach, we don’t have to worry about upload requests choking the server and users reporting issues at 3 o’clock at night. Since we are moving everything to the client side, there is another challenge: security. We don’t want our private S3 credentials to be exposed to the public. So the challenge is to upload files securely to S3 directly from the browser.
How to Set Up
- Create a new bucket
A new bucket is preferred, so that you can restrict the user’s access to other buckets.
- Create a new user with minimum privileges
Create a new user who can write to your new bucket. Save this user’s accessKey and secretKey; we’ll be using them later. The upload policy described below limits what a client can do with a signed request, and a user with minimum privileges limits the damage if the accessKey and secretKey ever leak. Therefore it’s better to create a new user and not use the root user.
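As a rough sketch of what "minimum privileges" could look like, the snippet below builds an IAM policy document that only allows writing objects into one bucket. The bucket name `my-video-uploads` and the function name are placeholders made up for this example; your own policy may need further actions depending on how you upload.

```python
import json


def write_only_policy(bucket_name: str) -> dict:
    """IAM policy document that only permits putting objects into one bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:PutObject"],
                # The trailing /* targets objects inside the bucket,
                # not the bucket itself.
                "Resource": f"arn:aws:s3:::{bucket_name}/*",
            }
        ],
    }


# "my-video-uploads" is a placeholder bucket name.
print(json.dumps(write_only_policy("my-video-uploads"), indent=2))
```

The printed JSON can be attached to the new user as an inline policy.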
- Handle CORS request
Cross-Origin Resource Sharing (CORS) is a mechanism that uses additional HTTP headers to let a user agent gain permission to access selected resources from a server on a different origin (domain) than the site currently in use.
<?xml version="1.0" encoding="UTF-8"?>
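To illustrate the kind of rule the bucket needs, here is a hedged sketch of a CORS rule set expressed as a Python dictionary, in the shape that boto3's `put_bucket_cors` accepts as its `CORSConfiguration` argument. The origin `https://www.example.com` and the specific values are assumptions for illustration, not the configuration used in our setup.

```python
import json


def cors_rules(site_origin: str) -> dict:
    """CORS rule set allowing browser POST uploads from a single origin."""
    return {
        "CORSRules": [
            {
                # Only pages served from this origin may call the bucket.
                "AllowedOrigins": [site_origin],
                # The browser upload described in this post is a POST.
                "AllowedMethods": ["POST"],
                "AllowedHeaders": ["*"],
                # How long the browser may cache the preflight response.
                "MaxAgeSeconds": 3000,
            }
        ]
    }


# "https://www.example.com" is a placeholder for your site's origin.
print(json.dumps(cors_rules("https://www.example.com"), indent=2))
```

Restricting `AllowedOrigins` to your own site, rather than `*`, keeps other websites from posting to the bucket from a visitor's browser.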
How the upload works
In HTTP terms, the upload is a simple POST request to an S3 endpoint. The request contains the file, a filename (key, in S3 terms), some metadata, a signed policy and a signature.
Order of flow
- The client sends an HTTP GET request with the filename and content type to your server. The actual file is not sent in this request.
- The server responds with the fields which the client will use to send its request to S3. Based on your requirements, this is where you could restrict who can upload files to S3, or the number of uploads a person can make in, say, 24 hours.
- The client consumes the server response and sends an HTTP POST request to the S3 bucket. This request contains the actual file along with the various params in the request body.
- The request is authenticated at AWS. If authentication succeeds, the file is uploaded; if not, an error is returned.
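The server's part in step two can be sketched as follows. This is a minimal sketch that assumes AWS Signature Version 4 and a virtual-hosted-style bucket URL; the function name, the 15-minute expiry, and the example credentials are all made up for illustration and are not taken from the sample repository.

```python
import base64
import hashlib
import hmac
import json
from datetime import datetime, timedelta, timezone


def _hmac_sha256(key: bytes, message: str) -> bytes:
    return hmac.new(key, message.encode("utf-8"), hashlib.sha256).digest()


def build_upload_fields(access_key, secret_key, bucket, region, key,
                        content_type, now=None,
                        expires_in=timedelta(minutes=15)):
    """Return the URL and form fields a browser needs to POST a file to S3."""
    now = now or datetime.now(timezone.utc)
    date_stamp = now.strftime("%Y%m%d")
    amz_date = now.strftime("%Y%m%dT%H%M%SZ")
    credential = f"{access_key}/{date_stamp}/{region}/s3/aws4_request"

    # The policy lists the conditions the upload request must satisfy.
    policy = {
        "expiration": (now + expires_in).strftime("%Y-%m-%dT%H:%M:%S.000Z"),
        "conditions": [
            {"bucket": bucket},
            {"key": key},
            {"Content-Type": content_type},
            {"x-amz-algorithm": "AWS4-HMAC-SHA256"},
            {"x-amz-credential": credential},
            {"x-amz-date": amz_date},
        ],
    }
    encoded_policy = base64.b64encode(
        json.dumps(policy).encode("utf-8")).decode("ascii")

    # Signature Version 4 key derivation: date, region, service, terminator.
    signing_key = _hmac_sha256(("AWS4" + secret_key).encode("utf-8"), date_stamp)
    for part in (region, "s3", "aws4_request"):
        signing_key = _hmac_sha256(signing_key, part)
    signature = hmac.new(signing_key, encoded_policy.encode("ascii"),
                         hashlib.sha256).hexdigest()

    return {
        "url": f"https://{bucket}.s3.{region}.amazonaws.com/",
        "fields": {
            "key": key,
            "Content-Type": content_type,
            "x-amz-algorithm": "AWS4-HMAC-SHA256",
            "x-amz-credential": credential,
            "x-amz-date": amz_date,
            "policy": encoded_policy,
            "x-amz-signature": signature,
        },
    }


# Placeholder credentials and names, for illustration only.
response = build_upload_fields("AKIDEXAMPLE", "not-a-real-secret",
                               "my-video-uploads", "us-east-1",
                               "videos/clip.mp4", "video/mp4")
print(json.dumps(response, indent=2))
```

Note that the secret key never leaves the server: the browser only receives the encoded policy and its signature, and the file field goes last in the form the client posts.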
Note: If your file is pretty large (>10MB), you can split it into chunks and repeat the whole process for each chunk. When all chunks have been uploaded to S3, you can send a merge request to S3. This merge request tells AWS that all parts of the file have been received and the file can now be merged.
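Splitting a file into chunks comes down to computing byte ranges. Here is a small sketch, with a made-up function name and a 10 MiB default chunk size chosen to match the note above. If the merge step is done with S3's multipart upload, keep in mind that every part except the last must be at least 5 MiB.

```python
def chunk_ranges(file_size: int, chunk_size: int = 10 * 1024 * 1024):
    """Return (start, end) byte offsets for each chunk; end is exclusive."""
    if file_size < 0:
        raise ValueError("file_size must not be negative")
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [(start, min(start + chunk_size, file_size))
            for start in range(0, file_size, chunk_size)]


MIB = 1024 * 1024
# A 25 MiB file yields two full 10 MiB chunks and a final 5 MiB chunk.
for start, end in chunk_ranges(25 * MIB):
    print(f"bytes {start}-{end - 1} ({(end - start) // MIB} MiB)")
```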
Sample server response
How an AWS request is authenticated
On the client side
- Construct a request to AWS.
- Calculate the signature using your secret access key.
- Send the request to Amazon S3. Include your access key ID and the signature in your request. Amazon S3 performs the next three steps.
On the AWS server
- Amazon S3 uses the access key ID to look up your secret access key.
- Amazon S3 calculates a signature from the request data and the secret access key using the same algorithm that you used to calculate the signature you sent in the request.
- If the signature generated by Amazon S3 matches the one you sent in the request, the request is considered authentic. If the comparison fails, the request is discarded, and Amazon S3 returns an error response.
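The recompute-and-compare idea in those steps can be sketched as below. This only illustrates the principle and is not AWS's actual implementation: it assumes Signature Version 4 key derivation, and the in-memory `secrets_by_access_key` lookup and all names are hypothetical.

```python
import hashlib
import hmac


def sign(secret_key: str, date_stamp: str, region: str,
         string_to_sign: str) -> str:
    """Derive a Signature Version 4 signing key and sign the given string."""
    key = ("AWS4" + secret_key).encode("utf-8")
    for part in (date_stamp, region, "s3", "aws4_request"):
        key = hmac.new(key, part.encode("utf-8"), hashlib.sha256).digest()
    return hmac.new(key, string_to_sign.encode("utf-8"),
                    hashlib.sha256).hexdigest()


def is_authentic(secrets_by_access_key: dict, access_key: str,
                 date_stamp: str, region: str, string_to_sign: str,
                 received_signature: str) -> bool:
    """Recompute the signature from the stored secret and compare."""
    # Step 1: use the access key ID to look up the secret access key.
    secret_key = secrets_by_access_key.get(access_key)
    if secret_key is None:
        return False
    # Step 2: calculate the signature with the same algorithm the sender used.
    expected = sign(secret_key, date_stamp, region, string_to_sign)
    # Step 3: compare in constant time; a mismatch means the request is rejected.
    return hmac.compare_digest(expected, received_signature)


# Hypothetical key store and request data, for illustration only.
store = {"AKIDEXAMPLE": "not-a-real-secret"}
signature = sign("not-a-real-secret", "20240102", "us-east-1", "encoded-policy")
print(is_authentic(store, "AKIDEXAMPLE", "20240102", "us-east-1",
                   "encoded-policy", signature))
print(is_authentic(store, "AKIDEXAMPLE", "20240102", "us-east-1",
                   "tampered-policy", signature))
```

The first call prints `True` and the second `False`: changing any signed data, or signing with a different secret, produces a different signature.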
Show me the code
You can find a working sample for the server and client in the secureBrowserUploadsToS3 repository on GitHub.
https://docs.aws.amazon.com/AmazonS3/latest/API/ErrorResponses.html (Useful while debugging)
https://docs.aws.amazon.com/AmazonS3/latest/dev/S3_Authentication2.html (Understanding AWS request authentication)