Migrating 15 years of Meetup photos to the cloud
Meetup has been around for quite a while, at least since 2002, and our members have always been fond of uploading photos: profile pictures and countless memories of every type of Meetup. While it was perhaps uncommon 15 years ago to carry a digital camera, taking pictures has since become ubiquitous, and photos are now multi-megapixel things, far different (and heavier) than what was called a "digital" photo back then. So, after 15 years, we have hundreds of terabytes of beautiful pixels to take good care of, and to move with us wherever our platform goes.
Before our AWS migration last month, we were using MogileFS, a distributed file system that we didn't want to replicate on AWS with EC2 instances and EBS volumes: the cost would have been unsustainable, when instead we could use S3, which is infinitely scalable and far, far cheaper for this purpose.
The serverless photo scaler
So we built, almost from scratch, a serverless, Lambda-based photo resizing platform, backed by S3 and with all the assets tracked in DynamoDB: a cloud-native solution for an internet-scale problem.
Its design is inspired by similar projects (Lambda and photo resizing have been a thing since Lambda was announced in 2014), but we also wanted to make it more efficient at saving space, storing assets in S3 under a path derived from the SHA-256 (because SHA-1 is not that good) of their binary content, so that duplicate content does not take extra storage. To determine where to store them, we use this simple function:
var crypto = require('crypto');
var hash = crypto.createHash('sha256').update(buffer).digest('hex');
// e.g. 'ab/cd/ef/abcdef…': the first three byte pairs become directories
var location = hash.slice(0, 2) + '/' + hash.slice(2, 4) + '/' + hash.slice(4, 6) + '/' + hash;
This builds a path from the hexadecimal representation of the content's hash, splitting the first three groups of two digits into directories (just because it looks neater).
On a side note, speaking of duplicate content, it's interesting to see how many people used to upload the very same Windows XP wallpaper (Bliss was the most common) as their profile photo. That was obviously before the era of the selfie and Snapchat.
In our new platform, a photo is still available at its original path in the S3 bucket (Meetup uses a simple hashing scheme to store a photo based on its numeric id), but that object is now a zero-byte reference to the actual asset, with the object metadata property x-amz-website-redirect-location set to cause a 301 redirect to the actual location. So, dropping a file on a source bucket triggers an S3 event, causing a Lambda invocation that creates the required crops for the photo (a lot of them), each of which points to the actual asset:
No, that’s not an actual member profile picture, but the diagram should explain the approach.
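The pointer objects that the Lambda writes can be sketched like this. This is a minimal sketch using the AWS SDK for JavaScript conventions; the bucket, key names, and the pointerParams helper are all hypothetical, not our actual code:

```javascript
// Sketch: build the parameters for a zero-byte "pointer" object whose
// metadata makes S3's website endpoint answer with a 301 redirect to
// the content-addressed asset. All names here are illustrative.
function pointerParams(bucket, cropKey, assetKey) {
  return {
    Bucket: bucket,
    Key: cropKey,                       // e.g. a crop at the photo's id-based path
    Body: '',                           // zero-byte object
    // S3 serves a 301 to this location from the website endpoint
    WebsiteRedirectLocation: '/' + assetKey
  };
}

// In the Lambda this would be passed to s3.putObject(params).promise()
const params = pointerParams('photos-bucket',
                             'member/1234/photo_600x400.jpg',
                             'ab/cd/ef/abcdef0123');
```

With WebsiteRedirectLocation set, requests for the crop key get a 301 pointing at the shared, deduplicated asset, which is what makes identical uploads free after the first one.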
We also have to support on-demand, dynamic processing through our REST-style API, with synchronous invocation from the AWS API Gateway, and for that we take advantage of S3 routing rules, a trick that we learned from this project. Let's assume, simplifying a bit, that a request for on-demand resizing of 123456.jpg looks like this:

GET /dynamic/300x200/123456.jpg HTTP/1.1

If there's no object at the path /dynamic/300x200/123456.jpg in the S3 bucket, the routing rule maps the 404 response code to a 307 response (a temporary redirect) with the Location header pointing to the API Gateway:

307 Temporary Redirect
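A routing rule that does this on an S3 website endpoint looks roughly like the following; the hostname and key prefixes here are made up for illustration, not our actual configuration:

```xml
<RoutingRules>
  <RoutingRule>
    <Condition>
      <KeyPrefixEquals>dynamic/</KeyPrefixEquals>
      <HttpErrorCodeReturnedEquals>404</HttpErrorCodeReturnedEquals>
    </Condition>
    <Redirect>
      <Protocol>https</Protocol>
      <HostName>api.example.com</HostName>
      <ReplaceKeyPrefixWith>resize/dynamic/</ReplaceKeyPrefixWith>
      <HttpRedirectCode>307</HttpRedirectCode>
    </Redirect>
  </RoutingRule>
</RoutingRules>
```

When a GET under dynamic/ would return a 404, S3 instead redirects the client to the API Gateway endpoint, carrying the requested key in the path.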
The request to the API Gateway will, in turn, invoke the Lambda to create the requested resized asset (you may have guessed it: a 300x200 picture) at a path based on the hash of the resulting binary, along with a zero-byte "pointer" to it at the expected path. Finally, the Lambda responds with a JSON object that the API Gateway translates into:

301 Moved Permanently
If you are feeling a bit confused by this beautiful dance of redirects and wondering how it will perform in the real world, hang on. Here's a picture that may explain it a bit better (or not):
The neat thing about this architecture is that the second time the same request comes along, the asset is served directly from the bucket, and there’s no Lambda invocation: It’s just a straight request for a file.
Finally, we realized that we could also hide the redirects altogether if our CDN followed them at the backend, so our visitors would only ever see 200 responses. More on that in another post.
We had all of this built, and we were left with just one little, enormous problem: how to transfer the hundreds of terabytes of all our members’ photos from our old datacenter to S3.
We did some back-of-the-envelope estimates and realized that, no matter what bandwidth we allocated to it, if we wanted to keep serving our site during the copy, the transfer would have taken several months to complete.
AWS had just announced Snowball, and we thought it was just the perfect thing for us — petabyte scale data transfer — so we ordered one to try, and we installed it and started pushing gigabytes of data to it.
Unfortunately, maybe because it was still an early release, or maybe because it was a defective unit, or maybe because of user error (we are not perfect either), that didn’t work out for us. We were getting errors that were making us feel not very safe about our data.
With a couple of months left before the cutover, we redid the math, this time assuming that we would copy only the originals, and not all the crops, over the network, and let Lambda do the heavy lifting of transforming all the photos into the required sizes and crops. After all, Lambda scales to infinity, and our problem felt pretty close to infinite.
It turned out that, limiting the copy to just the high-res originals, and with a little bit of luck, we could do it in about a month and a half. That would still leave us 15 days to do all the resizing. Our principal systems engineer wrote the most efficient Perl code that could be written by anybody but Larry Wall to copy all the photos from MogileFS to S3, and we just had to hope it would work.
In reality, it took even less time than we thought, and all we had to do now was start the massive scaling job. AWSLabs had just published this project, using Sharp as the resizing engine (instead of ImageMagick), and looking at the performance numbers, we decided it was worth a try: in the Lambda world you pay for execution time, so a four- to five-fold performance increase would cut our costs considerably.
The Node.js sharp module must be compiled natively for the platform on which it is going to run, but Docker came to the rescue: by using a base image matching the Lambda execution environment in a Dockerfile, we could install the node module natively and deploy it to Lambda. All that was left to do was start the job to trigger Lambda for every asset in the buckets containing the originals.
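A build setup along these lines could look like the following. The base image, Node version, and installation method shown here are assumptions for illustration, not the exact ones we used:

```dockerfile
# Sketch of the idea: compile sharp's native bindings inside a container
# that matches the Lambda execution environment. Base image and Node
# setup below are illustrative assumptions.
FROM amazonlinux:latest
RUN yum install -y gcc-c++ make tar gzip
# Install Node.js (method and version are illustrative)
RUN curl -sL https://rpm.nodesource.com/setup_6.x | bash - && \
    yum install -y nodejs
WORKDIR /build
COPY package.json ./
RUN npm install --production
# The resulting node_modules (with sharp's native binaries) gets zipped
# together with the function code and uploaded to Lambda.
```

Building inside a container that matches Lambda's runtime is what guarantees the native binary actually loads once deployed.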
We had already had our Lambda limits raised as high as AWS would allow, we cranked up DynamoDB to 40,000 writes per second (we track the assets in DynamoDB, so each generated crop means a write in DynamoDB), and we let the job start.
In the CloudWatch screenshot above, you can see what happened. DynamoDB (the graph on the top) happily handled 20 million writes per hour, but our error rate on Lambda (the red line in the graph on the bottom) spiked as soon as we went above 1 million invocations per hour, even though we were not being throttled. Looking at the logs, we quickly understood what was happening: we were overwhelming the S3 bucket with PUT requests, and we were receiving this response back:
"message": "Please reduce your request rate.",
We opened a ticket with AWS and quickly discovered that S3 will limit your write throughput to a bucket, especially if the keys you are writing happen to be in alphabetical, sequential order, which was exactly what we were doing. It was a back-to-square-one-with-the-clock-ticking type of moment. We requested a manual partitioning of the bucket, as recommended, but we were also advised that it could take up to a week to complete.
And it didn't help that, at that time, S3 experienced a serious outage in the us-east-1 region. You can see it reflected as the absence of data points on 2/28. For a second, we thought we had caused it.
Also, with all these execution errors our data was now in an inconsistent state, and all we could do at this point was modify our migration script to use a sharded pattern for the keys in our requests, add exponential back-off logic, clean up the bad data, start again, and hope for the best.
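The back-off part of that script can be sketched like this; putWithBackoff is a hypothetical helper for illustration, not our actual migration code, and it assumes the AWS SDK for JavaScript surfaces S3's 503 as an error with code 'SlowDown':

```javascript
// Sketch: retry S3 PUTs with exponential back-off and jitter when S3
// answers "Please reduce your request rate." (error code 'SlowDown').
async function putWithBackoff(s3, params, maxRetries = 8) {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await s3.putObject(params).promise();
    } catch (err) {
      if (err.code !== 'SlowDown' || attempt === maxRetries) throw err;
      // full jitter: wait a random slice of an exponentially growing window
      const delayMs = Math.random() * 100 * Math.pow(2, attempt);
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}
```

Combined with sharding the key prefixes so writes stop landing in sequential order, this keeps the job moving instead of failing outright under throttling.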
And the best happened. While we were rewriting the script, the AWS S3 team repartitioned the bucket, even though it was during a week when you would have thought they had bigger problems to deal with. All the remaining photos were processed in less than 24 hours, with no errors and no throttling.
The lesson learned is that even with the best tools available (and AWS DynamoDB, S3, and Lambda are clearly unparalleled tools in the cloud universe), there are still details, some would call them laws of physics, that need to be taken into account. That was a good lesson for us.
Pictured above are the sizes of the target bucket and of the source bucket from which the photos were being resized. From February 26th to March 3rd we had already processed hundreds of terabytes, but the throttling errors were blocking us from continuing at that rate. That is when we realized we had to rethink our approach; March 8th is when we started again.
It was not what you could call a stroll in the park, but the idea that we took care of countless photos and memories of all our members makes us proud of the task we accomplished in so little time.
The project for the hashed-storage serverless scaler platform will soon be published as open source on our GitHub, so feel free to contribute to it.