FERMI: A self-balancing, chain-reaction video processor built for scale

Abhishek Sinha
Doubtnut
Jul 17, 2021

A requirement came to us from the business team: process our in-house videos for distribution across other channels. And it had to be done ASAP.

In startups, the term ASAP is a grey area with an unnerving deadline attached. It forces you to think very hard about the problem and reach an optimal solution. This is where it got interesting. We estimated the time it would take to process 1.5M videos using our existing methods, and it did not meet the ASAP timeline. So we got thinking, keeping future ASAPs in mind as well.

It was the right time to build a scalable service. We knew we needed a system that could process or reprocess videos on demand, and no matter how many videos we threw at it, its throughput should remain roughly constant. With these requirements in mind, we bulleted a few features we wanted this service to have.

  • As the number of videos to be processed grows, the service should estimate how many processors to add so that throughput keeps pace with the delivery rate. And vice versa.
  • A processor (in our case, an EC2 instance) should be self-sufficient, like a single cell: it should recreate one of its own kind when demand surges and terminate itself once demand is met. This would greatly reduce maintenance costs and manual intervention.

And thus, FERMI was born. Leveraging the AWS SDK, spot EC2 instances, a simple MySQL table, Node.js, and Redis, we started on the first pipeline.

Version-1: Birth of cells

A single EC2 machine, aka the cell, had to be able to recreate machines of its own type; the cell had to be complete in itself. We put an upper limit on the number of videos to be processed per cell, aka the "mitosis factor". When the backlog crossed that threshold, the mitosis factor kicked in and the machine replicated itself. We also put a generous upper limit on the total number of machines that could be spun up, to safeguard our EC2 vCPU quota. Let's talk about the components of a cell:

  • EC2: a c5.2xlarge machine, one of the best-suited instance types for processing videos via FFmpeg, matched our video delivery rate. An AMI of the machine is used to recreate other machines.
  • Codebase: a simple Node.js script fetches a video from S3, runs the commands specified in the table (here, FFmpeg commands), and uploads the result to a different bucket, also as defined in the table. The script also has methods to commit harakiri once demand is met, and to create a new machine via the AWS SDK before terminating itself when its job is done. Using pm2, the application auto-starts every time a machine is spun up.
  • Redis and MySQL: a MySQL table holds the status of each video to be processed. Redis is used to check whether a machine has just finished processing and is ready to take the next batch. A batch must not be shared with other machines, so while a Redis lock is present, every other machine in the same state waits until the lock is released. (A sketch of this loop follows the list.)
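
To make the moving parts concrete, here is a minimal sketch of such a cell loop in Node.js. The table and column names, Redis key, AMI ID, and shell commands are illustrative assumptions, not the actual Fermi codebase:

```js
// cell.js: a minimal sketch of a Fermi v1 cell (all names are illustrative).
const { execSync } = require('child_process');
const http = require('http');
const mysql = require('mysql2/promise');
const Redis = require('ioredis');
const AWS = require('aws-sdk');

const MITOSIS_FACTOR = 500; // assumed upper limit of videos per cell
const ec2 = new AWS.EC2({ region: 'ap-south-1' });
const redis = new Redis(process.env.REDIS_URL);
const sleep = (ms) => new Promise((r) => setTimeout(r, ms));

// Claim the next batch under a Redis lock so two cells never share videos.
async function claimBatch(db) {
  const locked = await redis.set('fermi:batch-lock', '1', 'EX', 60, 'NX');
  if (!locked) return null; // another cell holds the lock: wait and retry
  try {
    const [rows] = await db.query(
      "SELECT id, s3_key, ffmpeg_cmd, out_bucket FROM videos WHERE status='pending' LIMIT ?",
      [MITOSIS_FACTOR]);
    if (rows.length)
      await db.query("UPDATE videos SET status='processing' WHERE id IN (?)",
        [rows.map((r) => r.id)]);
    return rows;
  } finally {
    await redis.del('fermi:batch-lock');
  }
}

// Mitosis: if the remaining backlog exceeds one cell's share, spawn a sibling
// (the check against the global machine cap is omitted here for brevity).
async function maybeReplicate(db) {
  const [[{ n }]] = await db.query(
    "SELECT COUNT(*) AS n FROM videos WHERE status='pending'");
  if (n > MITOSIS_FACTOR) {
    await ec2.runInstances({
      ImageId: 'ami-xxxxxxxx', // the AMI baked from this machine
      InstanceType: 'c5.2xlarge',
      MinCount: 1, MaxCount: 1,
      InstanceMarketOptions: { MarketType: 'spot' },
    }).promise();
  }
}

// Harakiri: terminate this very instance (instance id from the metadata service).
async function harakiri() {
  const id = await new Promise((resolve) =>
    http.get('http://169.254.169.254/latest/meta-data/instance-id', (res) => {
      let body = '';
      res.on('data', (c) => (body += c));
      res.on('end', () => resolve(body));
    }));
  await ec2.terminateInstances({ InstanceIds: [id] }).promise();
}

async function main() {
  const db = await mysql.createConnection(process.env.MYSQL_URL);
  for (;;) {
    const batch = await claimBatch(db);
    if (batch === null) { await sleep(5000); continue; } // lock contention
    if (batch.length === 0) return harakiri();           // demand met: the cell dies
    await maybeReplicate(db);
    for (const v of batch) {
      // Download, run the FFmpeg command from the table, upload, mark done.
      execSync(`aws s3 cp s3://input-bucket/${v.s3_key} in.mp4`);
      execSync(v.ffmpeg_cmd.replace('{in}', 'in.mp4').replace('{out}', 'out.mp4'));
      execSync(`aws s3 cp out.mp4 s3://${v.out_bucket}/${v.s3_key}`);
      await db.query("UPDATE videos SET status='done' WHERE id=?", [v.id]);
    }
  }
}

main();
```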

This way, we processed almost 1.5M videos in 3 days using 50–100 EC2 machines. We used Fermi to add intros and outros to our videos and to convert them to 1-minute clips, as required by the first pipeline.

But this was not over. More requirements kept coming, and the pipeline had a few drawbacks.

Version-2: The rise of super-cells

The first version of Fermi had some serious challenges that required manual intervention. Among them:

  • The pipeline had to be started manually every time, i.e. the first cell had to be created by hand.
  • The distribution of videos among cells was uneven due to varying video lengths, so some cells finished their share of the job before others.
  • Since every cell locked a set of videos beforehand, a cell terminating mid-batch (a high probability, since we used spot instances) left a chunk of videos stuck in the processing state.
  • Any change to the mitosis factor, instance type, or codebase required a new AMI to be created.

To resolve these drawbacks, we moved the cell-launching mechanism out of the EC2 instances and into a Lambda function backed by RabbitMQ, solving the mitosis factor/instance type problem and removing the need for a database to maintain status and Redis to maintain locks. The Lambda is triggered by S3 events and initiates or scales the Fermi pipeline: a message is published from Lambda to RabbitMQ, whose consumers run on the EC2 cells. Thus we managed to strike an efficient balance between time and cost. The cell termination mechanism remained the same, and there was no longer any need to create the first cell manually.
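
Below is a hedged sketch of that launch path: an S3 event triggers the Lambda, which publishes one job per object to RabbitMQ and bootstraps the first cell if none is running. The queue name, tags, and AMI ID are assumptions for illustration:

```js
// handler.js: a sketch of the v2 Lambda (queue/tag/AMI names are illustrative).
const amqp = require('amqplib');
const AWS = require('aws-sdk');

const ec2 = new AWS.EC2({ region: 'ap-south-1' });

exports.handler = async (event) => {
  const conn = await amqp.connect(process.env.RABBITMQ_URL);
  const ch = await conn.createChannel();
  await ch.assertQueue('fermi-jobs', { durable: true });

  // One message per uploaded object; consumers run on the EC2 cells.
  // (Note: object keys arrive URL-encoded in real S3 events.)
  for (const record of event.Records) {
    const job = { bucket: record.s3.bucket.name, key: record.s3.object.key };
    ch.sendToQueue('fermi-jobs', Buffer.from(JSON.stringify(job)), { persistent: true });
  }
  await ch.close();
  await conn.close();

  // If no cell is alive, launch the first one: no manual bootstrap needed.
  const { Reservations } = await ec2.describeInstances({
    Filters: [
      { Name: 'tag:role', Values: ['fermi-cell'] },
      { Name: 'instance-state-name', Values: ['pending', 'running'] },
    ],
  }).promise();
  if (Reservations.length === 0) {
    await ec2.runInstances({
      ImageId: 'ami-xxxxxxxx',
      InstanceType: 'c5.2xlarge',
      MinCount: 1, MaxCount: 1,
      InstanceMarketOptions: { MarketType: 'spot' },
      TagSpecifications: [{
        ResourceType: 'instance',
        Tags: [{ Key: 'role', Value: 'fermi-cell' }],
      }],
    }).promise();
  }
};
```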

Version-3: Cells on steroids

As an organization grows, more and more requirements come up, and this is very frequent in startups. Our next requirement was a multi-pipeline setup catering to multiple use cases: PDF creation, thumbnail creation, screenshot capture, DASH/HLS packaging, DRM encryption, and so on. The Fermi pipeline could no longer be confined to video processing; it needed to be generic enough to serve multiple purposes, which meant frequent code changes. The second version still required a fresh AMI on every code change, so we started pulling the latest code from the repository whenever a cell launches (see the sketch below).
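
Something like this hypothetical launch hook, run before pm2 starts the workers, makes the point; the repository path and branch are assumptions:

```js
// bootstrap.js: a hypothetical pre-start hook run on every boot, so AMIs
// never need rebaking for code changes (path and branch are assumptions).
const { execSync } = require('child_process');

const REPO_DIR = '/opt/fermi'; // assumed checkout location baked into the AMI
execSync('git pull origin main', { cwd: REPO_DIR, stdio: 'inherit' });
execSync('npm ci --production', { cwd: REPO_DIR, stdio: 'inherit' });
```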

Additionally, a new challenge arose as we started transcoding more types of videos: long videos vs. short videos. Since a long video can take 50x as long to transcode as a short one, it's very difficult to find one perfect mitosis factor; we would either pay more in the case of short videos or take more time in the case of long ones. The same goes for screenshot/thumbnail creation compared with video processing. Multiple triggers, multiple queues, and a mitosis-calculator function (sketched below) solved this problem.
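
Here is one way such a calculator could look. The queue names, per-job durations, and target drain time are illustrative assumptions, not Fermi's actual numbers:

```js
// mitosisCalculator.js: a sketch of a per-queue mitosis calculator.
const PROFILES = {
  'fermi-short-videos': { avgJobSec: 30 },
  'fermi-long-videos': { avgJobSec: 1500 }, // ~50x the short-video cost
  'fermi-thumbnails': { avgJobSec: 2 },
};
const TARGET_DRAIN_SEC = 3600; // drain each backlog within an hour (assumed)
const MAX_CELLS = 100;         // global safeguard on machine count

// How many cells should a queue have, given its current backlog?
function desiredCells(queue, backlog) {
  const { avgJobSec } = PROFILES[queue];
  const totalWorkSec = backlog * avgJobSec;
  return Math.min(MAX_CELLS, Math.max(1, Math.ceil(totalWorkSec / TARGET_DRAIN_SEC)));
}

// The same backlog size implies very different fleet sizes per queue:
console.log(desiredCells('fermi-long-videos', 100));  // 42
console.log(desiredCells('fermi-short-videos', 100)); // 1
```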

Version-4: The mutation

A unique requirement came up: transcoding live streams. This came with its own challenges. A live stream can be transcoded only while it is live, so Fermi cells need to be available before the stream starts, and RabbitMQ won't work in this case. Also, CPU usage must stay below 90% while transcoding is going on, or frames get dropped.

To solve this, we added a mutation to our cells, creating a separate target group that isn't made up of RabbitMQ consumers; instead, these cells expose on-demand APIs (sketched below). A cron spins up Fermi cells just before a stream goes live, and start/stop APIs serve as the triggers. As for CPU usage, after some trial and error we found that xlarge instances stay well below 90% with the FFmpeg configuration we were using, but only when one cell processes exactly one stream. So we attached these cells to our microservices framework for auto-discovery and load balancing.
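
A minimal sketch of such a mutated cell, assuming an Express server and a generic FFmpeg HLS command; the endpoint names, port, and flags are illustrative:

```js
// liveCell.js: a sketch of a v4 cell exposing start/stop APIs instead of
// consuming RabbitMQ (endpoints, port, and FFmpeg flags are assumptions).
const express = require('express');
const { spawn } = require('child_process');

const app = express();
let ffmpeg = null; // one cell handles exactly one stream, so one child process

app.post('/start', (req, res) => {
  if (ffmpeg) return res.status(409).json({ error: 'already transcoding' });
  const { input, output } = req.query; // e.g. RTMP in, HLS out
  ffmpeg = spawn('ffmpeg', ['-i', input, '-c:v', 'libx264', '-preset', 'veryfast',
                            '-f', 'hls', output]);
  ffmpeg.on('exit', () => { ffmpeg = null; });
  res.json({ status: 'started' });
});

app.post('/stop', (req, res) => {
  if (!ffmpeg) return res.status(409).json({ error: 'not transcoding' });
  ffmpeg.kill('SIGINT'); // let FFmpeg finalize its output cleanly
  res.json({ status: 'stopping' });
});

// The port is registered with the microservices framework for auto-discovery.
app.listen(3000);
```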

Infrastructure provisioning and cost management

We ran c5-type instances on spot in AWS, and the experiment worked out quite well: we could spawn dozens of machines within seconds. At peak we had almost 100 c5.2xlarge machines running on spot at a cost of $285 for 24 hours, saving around $530 against the roughly $815 the same fleet would have cost on-demand. With spot usage, we also have to monitor spot capacity utilization using Service Quotas. The problem with 2xlarge machines, though, is that FFmpeg doesn't make full use of 8 vCPUs, leaving about 40% of the CPU unutilized. So we changed the instance type to c4.xlarge, reducing the spot cost by over 60% compared to c5.2xlarge, with better availability as well.

Co-authored by Dipankar Arhat
