AWS Lessons Learned for Data Processing Pipelines

EG Tech
Expedia Group Technology
Nov 3, 2016

Last year my team and I embarked on a mission to migrate Expedia’s Media Services image processing pipeline to AWS. Media Services ingests images from hoteliers and image providers, optimizes them, and distributes them for use by our sites. The new queue-based cloud pipeline was put in place to replace an aging batch system.

Moving to a cloud infrastructure has brought many advantages: we can easily scale when the load is heavier, deploy faster, and (mostly) stop worrying about ordering more disk space to store the increasing number of images flooding in. However, that doesn’t mean we haven’t encountered any problems. Here are some lessons we learned over the past year that might save you some time.

SQS — Default Visibility Timeout

Amazon Simple Queue Service (SQS) has a visibility timeout. This means a message can be picked up by a second consumer if the first consumer didn’t complete its task within the timeout period. Typically when a message is put on a queue it’s meant to be consumed once and only once. Other queuing systems such as RabbitMQ put messages back on a queue when the consuming process is no longer connected. Since we were used to that behavior, it took us some time to figure out why some messages were consumed multiple times. SQS relies on the timeout, not on the connected consumer: when a consumer takes longer than the visibility timeout to delete a message, SQS makes it visible again and another consumer can pick it up. The first consumer does eventually delete the message once it finishes, so at least it doesn’t get reprocessed over and over again.
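To make the mechanics concrete, here is a minimal sketch of an SQS consumer using boto3. The queue URL and the process_image step are hypothetical placeholders, not our actual pipeline code; the point is that deleting the message is the acknowledgement, and it has to happen before the visibility timeout expires.

```python
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/media-ingest"  # hypothetical


def process_image(body):
    # Stand-in for the real image processing work.
    print("processing", body)


while True:
    resp = sqs.receive_message(
        QueueUrl=QUEUE_URL, MaxNumberOfMessages=1, WaitTimeSeconds=20
    )
    for msg in resp.get("Messages", []):
        # Must finish before the visibility timeout expires, or another
        # consumer will receive the same message.
        process_image(msg["Body"])
        # Deleting the message is the acknowledgement; only then is it gone for good.
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```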

When choosing SQS, to keep messages from being consumed multiple times, you need to figure out the maximum amount of time a message could take to be consumed and set the visibility timeout to a greater value. Keep in mind that you may have to adjust the timeout over time as you add more features to your consumers. Our processes typically finish in less than 10 seconds but sometimes take longer than the default 30 seconds. We initially increased the timeout to a minute, but even then we would sometimes get messages that were processed twice. So we pushed the timeout to an hour to be on the safe side until we had better metrics about our processing times. This has served us well so far, but we need to revisit the timeout because it’s too long for messages that need to be picked up again after a consumer fails.
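Adjusting the queue-level default is a one-line change. A sketch with boto3, where the queue URL and the five-minute value are illustrative rather than a recommendation:

```python
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/media-ingest"  # hypothetical

# Raise the queue-wide default from 30 seconds to something comfortably above
# the slowest observed processing time.
sqs.set_queue_attributes(
    QueueUrl=QUEUE_URL,
    Attributes={"VisibilityTimeout": "300"},  # seconds, passed as a string
)
```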

One possibility we haven’t explored, but that may be useful for you, is the ability to change the visibility timeout on individual messages. If you have a message that’s taking longer to consume and your application knows how much time it may still need to complete the task, the SQS ChangeMessageVisibility action might be for you. Of course, this adds some complexity to your code and some responsibilities you weren’t planning on handling.
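Since we haven’t used this ourselves, treat the following as a rough sketch of what a per-message extension could look like with boto3; the helper name and the way you would call it mid-task are assumptions:

```python
import boto3

sqs = boto3.client("sqs")


def extend_visibility(queue_url, receipt_handle, extra_seconds):
    # Resets the timeout for this one message, counted from now,
    # without touching the queue-wide default.
    sqs.change_message_visibility(
        QueueUrl=queue_url,
        ReceiptHandle=receipt_handle,
        VisibilityTimeout=extra_seconds,
    )
```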

S3 — Path Names

Be careful with path (or key) names in S3. Renaming a folder in S3 is not like renaming a folder on a conventional file system. On a conventional file system, renaming a folder applies to all its subfolders, and the files automatically end up where they ought to be. In S3, the “folder” is part of the object key, so every object under that path has to be renamed. Individually.

We transferred over a million files onto a Snowball from our SAN storage to S3. Because of a typo in the paths we wanted to use, we had to write a script and run it concurrently on several machines to rename/re-key each file to the right location. This added extra time to our migration. Double and triple check your root path; otherwise you’ll need to rename your keys if the file path is important for your application.
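The shape of such a re-keying script is simple enough, even if running it over a million objects isn’t. Here is a minimal, single-threaded sketch with boto3 (bucket name and prefixes are made up; our actual script ran concurrently across several machines):

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "media-images"       # hypothetical
OLD_PREFIX = "imges/prod/"    # the mistyped root path (illustrative)
NEW_PREFIX = "images/prod/"   # where the keys should have been

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=OLD_PREFIX):
    for obj in page.get("Contents", []):
        old_key = obj["Key"]
        new_key = NEW_PREFIX + old_key[len(OLD_PREFIX):]
        # "Renaming" in S3 is really a copy followed by a delete, per object.
        s3.copy_object(
            Bucket=BUCKET,
            CopySource={"Bucket": BUCKET, "Key": old_key},
            Key=new_key,
        )
        s3.delete_object(Bucket=BUCKET, Key=old_key)
```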

S3 — Cross Region Replication

Trying to replicate to multiple regions using S3’s Cross Region Replication didn’t work well for us. In our efforts to migrate files and create a failover, we tried to move files to three different buckets using cross region replication. The first replication, from the first bucket to the second, worked just as expected, but the second replication, from the second bucket to the third, did not. This is exactly what the documentation says: “Objects in the source bucket that are replicas, created by another cross-region replication, are not replicated.” So… RTFM… always good advice.
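For reference, setting up a single replication rule looks roughly like the sketch below (bucket names and the IAM role ARN are placeholders, and both buckets need versioning enabled). The gotcha is in the comment: a second rule on the destination bucket won’t carry replicated objects any further.

```python
import boto3

s3 = boto3.client("s3")

# Replicate media-bucket-one -> media-bucket-two. Objects that land in
# bucket two as replicas will NOT be picked up by another replication
# rule chained from bucket two to a third bucket.
s3.put_bucket_replication(
    Bucket="media-bucket-one",
    ReplicationConfiguration={
        "Role": "arn:aws:iam::123456789012:role/s3-replication-role",
        "Rules": [
            {
                "ID": "replicate-all",
                "Prefix": "",
                "Status": "Enabled",
                "Destination": {"Bucket": "arn:aws:s3:::media-bucket-two"},
            }
        ],
    },
)
```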

Dynamo — Spikes

If you’re expecting a steady data flow to your DB, Dynamo might work for you. However, if you have activity spikes, as we do pretty much every day with media, there are some things to be aware of. Every so often we get spikes of new incoming media, and sometimes these spikes go past our highest expected capacity. They produced throttling errors, and we would have lost data if it weren’t for our applications’ retry policies. We had to raise the read and write capacity several times.
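As an illustration of the kind of retry policy that saved us, here is a minimal application-level retry around a DynamoDB write, on top of whatever retries the SDK already performs. The table name and backoff schedule are assumptions for the example:

```python
import time

import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.client("dynamodb")


def put_with_retry(table_name, item, max_attempts=5):
    # Retry throttled writes with exponential backoff; without something like
    # this, writes arriving during a spike are simply lost.
    for attempt in range(max_attempts):
        try:
            return dynamodb.put_item(TableName=table_name, Item=item)
        except ClientError as err:
            if err.response["Error"]["Code"] != "ProvisionedThroughputExceededException":
                raise
            time.sleep(2 ** attempt)  # 1s, 2s, 4s, ...
    raise RuntimeError("write still throttled after %d attempts" % max_attempts)
```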

There are ways to mitigate this, of course. One of them is to programmatically raise the capacity once a peak starts and lower it back once the spike is over. We can increase the capacity as many times as we want, but can only lower it four times every 24 hours. Changing the capacity also takes some time before it comes into effect. This can get complicated to manage.
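The capacity change itself is a single call to update the table’s provisioned throughput; the table name and numbers below are illustrative:

```python
import boto3

dynamodb = boto3.client("dynamodb")

# Bump capacity when a spike starts. Lowering it back down later uses the same
# call, but decreases are limited (four per table per 24 hours at the time of
# writing), and each change takes a while before the new capacity is active.
dynamodb.update_table(
    TableName="media-metadata",
    ProvisionedThroughput={"ReadCapacityUnits": 200, "WriteCapacityUnits": 400},
)
```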

As an alternative to building the scaling rules into your application, there is an open source tool called Dynamic DynamoDB, available as a CloudFormation template, that can be helpful.

Another way is to enable autoscaled capacity in Dynamo for reads and writes, but that requires a minimum of 10 minutes of breach time, so by the time it kicks in it could be too late for many incoming requests during the peak. Since our peaks are not predictable in quantity (we know more or less when they happen, but the number of requests can vary wildly), we decided to move off Dynamo and use Aurora instead.

I hope some of the lessons we learned will help others avoid the headaches we’ve had.
