Migration of our video encoder to AWS

Fabrice Baumann
May 11, 2018 · 14 min read

As a streaming site, the videos we receive every day are the core of our business. Without these videos we would have no traffic, so we need to give a lot of love to our video encoding platform.

We have been running our own video encoder for quite a while now. Over time we released many different versions of the encoder, and the encoding team spent quite a few weeks tweaking ffmpeg parameters to get the best quality-to-file-size ratio.

At first we used a simple PHP application and 40 dedicated servers to execute the jobs. Then we migrated to a Docker container running on Mesos, with jobs orchestrated by Chronos on a cluster of 80 dedicated servers.
Instead of using Chronos as a scheduler, we were using it as a queue, and that did not work very well for our usage: Chronos handled the thousands of jobs we received every day poorly.

The encoding team created their own queuing tool, called Aeon: a simple Python application, with Redis as its database, that receives resource offers from Mesos and finds the job matching each offer. It also has a priority system, so we can process important videos faster.
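Aeon itself is internal, but the offer-matching idea can be illustrated with a toy in-memory priority queue (this sketch uses Python's heapq instead of Redis, and the job/offer fields are assumptions, not Aeon's actual code):

```python
import heapq

class JobQueue:
    """Toy priority queue illustrating Aeon-style offer matching (hypothetical)."""

    def __init__(self):
        self._heap = []
        self._counter = 0  # tie-breaker keeps FIFO order within a priority

    def push(self, job_id, cpus, priority):
        # Lower number = higher priority
        heapq.heappush(self._heap, (priority, self._counter, job_id, cpus))
        self._counter += 1

    def match(self, offered_cpus):
        """Return the highest-priority job that fits the resource offer."""
        skipped, result = [], None
        while self._heap:
            prio, n, job_id, cpus = heapq.heappop(self._heap)
            if cpus <= offered_cpus:
                result = job_id
                break
            skipped.append((prio, n, job_id, cpus))
        # Put back the jobs that didn't fit this offer
        for item in skipped:
            heapq.heappush(self._heap, item)
        return result

q = JobQueue()
q.push("video-a", cpus=8, priority=5)
q.push("video-b", cpus=2, priority=1)   # important video
assert q.match(offered_cpus=4) == "video-b"   # only the small job fits
assert q.match(offered_cpus=16) == "video-a"  # next offer takes the big one
```

The real tool stores this state in Redis so it survives restarts; the matching logic is the interesting part here.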

The main issue is that our Mesos cluster has a fixed size of 80 servers, and it doesn't scale well. New videos have higher resolutions and higher encoding complexity, so they require more resources, while our servers get older every day. And just as the cluster can't easily scale up, it can't scale down either: we pay for 80 servers even when we don't need them 24/7.

The total resources of the Mesos cluster

We needed modern servers with a modern CPU architecture, the ability to get as many of them as possible, and the option to stop paying for them when we no longer needed them.

That’s when we decided to migrate to AWS.


AWS Batch or EC2/ECS/SQS/CloudWatch

People have been using AWS for batch processing for years, and AWS has been generous enough to build a managed service for it: AWS Batch. Released in December 2016, Batch handles everything for you if you want to do batch processing.

It handles scaling your processing cluster up and down, your queues, retries, etc. And the service itself is free: you only pay for the resources you use, not for Batch. Pretty cool.
We have used it a lot for other projects, but for this one we decided to build our own system. (The real reason is that we thought Batch compute environments were limited to 256 CPUs, with a maximum of 18 compute environments total, which was just not enough for us. In fact, a compute environment can have far more than 256 CPUs; we only realised that once we had finished our own setup.)

We made a list of all the AWS services we needed:
* EC2 for the servers (Spot fleet, for reduced costs)
* ECS for the docker orchestration system
* S3 as temporary storage
* SQS for the message queuing
* CloudWatch for the monitoring and the auto-scaling logic
* IAM obviously for the permissions and roles
* CloudFormation to create the entire stack (CF template provided at the end of this article)
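To make the SQS part of this stack concrete, here is a minimal sketch of producing and consuming job messages with boto3. The message schema and field names are assumptions for illustration, not our exact format:

```python
import json

# Hypothetical job message format (field names are assumptions)
def make_job_message(video_id, s3_input, s3_output, priority):
    return json.dumps({
        "video_id": video_id,
        "input": s3_input,      # S3 key of the uploaded source
        "output": s3_output,    # S3 prefix for the encoded renditions
        "priority": priority,
    })

# Worker loop sketch: long-poll SQS, encode, delete the message on success.
# `sqs` is a boto3 SQS client; `encode` runs the actual ffmpeg job.
def worker_loop(sqs, queue_url, encode):
    while True:
        resp = sqs.receive_message(
            QueueUrl=queue_url,
            MaxNumberOfMessages=1,
            WaitTimeSeconds=20,  # long polling to reduce empty receives
        )
        for msg in resp.get("Messages", []):
            job = json.loads(msg["Body"])
            encode(job)
            # Only delete after a successful encode; otherwise the message
            # becomes visible again after the visibility timeout.
            sqs.delete_message(
                QueueUrl=queue_url,
                ReceiptHandle=msg["ReceiptHandle"],
            )

msg = make_job_message("abc123", "s3://in/abc123.mp4", "s3://out/abc123/", 1)
assert json.loads(msg)["video_id"] == "abc123"
```

If the worker crashes mid-encode, SQS's visibility timeout re-delivers the message automatically, which is the main reason to delete only after success.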

It worked well at first sight, but when we looked closer, we hit a few issues here and there.

First, CloudWatch only offers basic alarms, which wasn't enough for good auto-scaling logic on our cluster. We used the ApproximateNumberOfMessagesVisible metric from SQS to scale up, and NumberOfEmptyReceives, also from SQS, to scale down. Scaling down worked great, but scaling up was either too slow or too fast.
You don't want to boot the same number of servers when you have 50 messages waiting in the queue and 1 server running as when you already have 100 servers running.
So we created a new, simple metric:
ApproximateNumberOfMessagesPerServer = ApproximateNumberOfMessagesVisible / SpotFleetTargetCapacity
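Publishing such a derived metric is a one-liner with boto3's put_metric_data; the computation itself just needs a guard against an empty fleet. The namespace and metric name below are assumptions for illustration:

```python
def messages_per_server(visible_messages, fleet_capacity):
    # Guard against dividing by zero when the fleet is scaled to nothing
    return visible_messages / max(fleet_capacity, 1)

def publish_metric(cloudwatch, value):
    # `cloudwatch` is a boto3 CloudWatch client; namespace is hypothetical
    cloudwatch.put_metric_data(
        Namespace="Encoder",
        MetricData=[{
            "MetricName": "ApproximateNumberOfMessagesPerServer",
            "Value": value,
            "Unit": "Count",
        }],
    )

assert messages_per_server(50, 100) == 0.5
assert messages_per_server(10, 0) == 10  # empty fleet counts as one server
```

A small cron or Lambda can read the SQS queue attributes and the Spot Fleet target capacity, compute this ratio, and push it every minute so the alarm has fresh data.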

Autoscaling values

We set the alarm threshold at 0.2 on our new metric, with step adjustments keyed to how far the value rises above it.

The second issue: a worker being scaled down needs to send its in-flight message back to the queue so it can be picked up by another worker. And because we use Spot Instances, AWS might shut our servers down at any time, in which case we also need to push these messages back to the queue.

We added a bit of Python code to our worker to detect when the instance is about to go down and send the message back to the queue. Each EC2 instance exposes an HTTP metadata endpoint that can be polled to check whether the instance is about to be shut down: it returns a 200 once a shutdown is imminent. We run this monitoring logic in a background thread so it doesn't block the main encoding worker.

import logging
import os
import threading
import time

import requests

log = logging.getLogger(__name__)

# The spot instance-action endpoint returns 404 until AWS schedules an
# interruption, then 200 with the termination details.
SPOT_ACTION_URL = 'http://169.254.169.254/latest/meta-data/spot/instance-action'


class InstanceActionThread(object):
    def __init__(self, interval=3, receipt_handle=None,
                 queue_url=None, sqs_client=None):
        self.interval = interval
        self.sqs_client = sqs_client
        self.receipt_handle = receipt_handle
        self.queue_url = queue_url
        thread = threading.Thread(target=self.run, args=())
        thread.daemon = True  # don't keep the process alive on its own
        thread.start()

    def set_receipt_handle(self, receipt_handle):
        self.receipt_handle = receipt_handle

    def run(self):
        while True:
            r = requests.get(SPOT_ACTION_URL)
            if r.status_code == 200:
                # The instance is about to be reclaimed: make the in-flight
                # message immediately visible again so another worker can
                # pick it up, then exit the parent process.
                log.warning("Scaling down EC2 instance...")
                self.sqs_client.change_message_visibility(
                    QueueUrl=self.queue_url,
                    ReceiptHandle=self.receipt_handle,
                    VisibilityTimeout=0,
                )
                log.warning("Message visibility set to 0...")
                os._exit(2)
            time.sleep(self.interval)

Scaled-down jobs represent ~3% of all our jobs, meaning 3% of our jobs might be processed twice. Right now we don't check anything about a job before pushing it back to the queue. A better approach would be to check the job's progress and let it finish if it is at 75% or more, or if its ETA is under 3 minutes.
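That improvement would just be a guard in front of the re-queue logic. A minimal sketch, using the thresholds mentioned above (the function name and inputs are hypothetical; in practice the progress and ETA would come from parsing ffmpeg's output):

```python
def should_let_finish(progress_pct, eta_seconds):
    """Decide whether a nearly-done encode should finish instead of
    being returned to the queue on scale-down.

    Thresholds from the article: at least 75% done, or an ETA
    under 3 minutes.
    """
    return progress_pct >= 75 or eta_seconds <= 180

assert should_let_finish(80, 600) is True    # far along, let it finish
assert should_let_finish(10, 120) is True    # almost done time-wise
assert should_let_finish(10, 600) is False   # re-queue this one
```

The trade-off: letting a job finish risks losing it entirely if AWS reclaims the instance before the encode completes, since Spot only gives a two-minute warning.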

We have already processed about 15,000 videos on our new cluster without noticing any errors, so I think we can call this a success.

Our new CloudWatch dashboard
