nginx-rtmp on EC2

Davide Bertola
Published in Crowd Emotion
Dec 23, 2014

Here at CrowdEmotion.co.uk we use OSS to stream and save users’ webcam video from the browser to a cloud server infrastructure. The server-side component that receives and saves the RTMP stream is nginx-rtmp.

nginx-rtmp is supposed to be quite efficient and to support a few gigabits of traffic, although our setup is limited to a single master nginx process because that is the only working mode that actually supports saving the webcam stream to disk. This is the kind of nginx configuration we used for our tests.
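A minimal sketch of that kind of setup follows; the application names and the record path are just examples, the directives come from the nginx-rtmp-module documentation.

# single worker process: the only mode that supports recording in our setup
worker_processes 1;

events {
    worker_connections 1024;
}

rtmp {
    server {
        listen 1935;

        # endpoint that only receives the live stream
        application live {
            live on;
        }

        # endpoint that receives the stream and also records it to disk
        application rec {
            live on;
            record all;
            record_path /var/rec;    # illustrative path
        }
    }
}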

As you can see, one application endpoint just receives the video stream while the other also saves it to disk.

We created sample content with a 1 Mbit/s video using GStreamer.
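A pipeline along these lines produces a roughly 1 Mbit/s H.264 test file (the filename and the 60-second duration are arbitrary):

# 60 s of test pattern, H.264 at ~1000 kbit/s, muxed into an FLV file
gst-launch-1.0 videotestsrc num-buffers=1500 ! video/x-raw,width=640,height=480,framerate=25/1 ! x264enc bitrate=1000 ! h264parse ! flvmux ! filesink location=sample.flv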

Simulating users was done by running multiple instances of gst-launch, along the lines of the sketch below.
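The server address and stream name here are placeholders, and each simulated user would push to a different stream name:

# one simulated user: re-stream the pre-encoded sample to the RTMP endpoint
gst-launch-1.0 filesrc location=sample.flv ! flvdemux ! h264parse ! flvmux streamable=true ! rtmpsink location="rtmp://server-ip/rec/user1 live=1"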

In these conditions one stream weighs about 150 kilobytes/s, i.e. 1.2 megabits/s. Therefore 100 users consume roughly 15 megabytes/s, meaning 120 megabits/s.

What happened

First of all, let me say I would have expected to be able to receive and record a few thousand webcam streams. But things are never easy. Let’s dive into it.

We fired up 2x m1.large instances (Ubuntu 14.04 AMI), one set up to transmit and the other to receive and save RTMP streams. We were able to receive and save 50 streams before the CPU (one single core, actually) hit 100% and the server started losing connections. I was surprised, because I would have expected almost no CPU to be involved in receiving and saving to disk.

Flooding system calls

Further analysis showed that most of the time was spent in system (kernel) time. So I used strace/ltrace to see which syscalls were taking the time.
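The measurement itself is simple: attach strace in counting mode to the nginx worker and let it run for a while (the pgrep lookup is just one way to find the PID):

# count syscalls made by the nginx worker; stop with Ctrl-C for a summary
strace -c -p $(pgrep -f "nginx: worker")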

Basically nginx-rtmp reads every RTMP packet of every socket by calling the recvfrom() syscall each time (every few bytes!). It never tries to read a larger block and split it later. I don’t know if this is how efficient servers are supposed to work, but for sure this was causing a lot of time to be spent switching from application context to kernel context and back. The same thing happened for disk writes: nginx-rtmp calls pwrite() for every single piece of data it receives and wants to write.

Note that nginx-rtmp does have a “chunk_size” configuration parameter, but that only works for sending packets: its purpose is exactly to limit the number of calls used while sending the stream to viewer clients, thereby lowering CPU consumption. In our case we need to receive streams, though.
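For reference, a sketch of how that parameter is set (4096 is the module’s default value):

rtmp {
    server {
        chunk_size 4096;    # affects outgoing RTMP chunks only
    }
}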

Solving these “problems” means getting our hands into the nginx-rtmp source code, which we decided to skip for the moment… But if you want to know more about this, I found a couple of interesting articles to start from.

Network interrupts

The fact that a hundred clients were sending packets to the server means that the server NIC has to receive each and every one of them (Mr. Obvious). In a virtualized environment each packet triggers a software interrupt that is perceived as “stolen” time.

What’s even worse is that by default on Amazon EC2 machines NIC interrupts are handled on one single core. You can read about this in many posts like this, that or that.

To summarize, the partial solutions to these problems are the following.

Never use paravirtualized (PV) instances

If you need performance, use HVM instances. They just have less overhead and perform better.

Simply upgrading our machines from m1.large to m3.large allowed us to handle 100–150 streams before the CPU reached 100% (single core).

Enable RPS

RPS (Receive Packet Steering) defers NIC interrupt handling to different CPU cores, even if the NIC does not support multiple queues (one for each core):

echo f > /sys/class/net/eth0/queues/rx-0/rps_cpus
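(The value “f” is a hexadecimal CPU bitmask: 0xf is binary 1111, i.e. receive processing is spread over cores 0–3. On a two-core instance, “3” would cover both cores.)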

The balancing of work between two cores works, but it is sort of jumpy. Often you see spikes in the usage of one core or the other.

This grew the number of supported streams to 200–250.

Alternatively, use Amazon Enhanced Networking

Enhanced networking is better not just because it balances interrupts better, but also because the transfer of data from the NIC(s) to the VM memory does not steal CPU time from your machine.

But note that the driver does not build cleanly on Ubuntu 14.04; you need patches. The easiest way to get it working is to use the official Amazon AMI images. On those you will see that “/sys/class/net/eth0/queues/” contains many queues, one for each core. This means that you have to do nothing to balance traffic between the cores: it’s all taken care of. (Sadly, not even the Ubuntu 14.10 AMIs have this ready out of the box.)
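A quick way to check, assuming the driver is loaded, is to list the queue directories; with enhanced networking you should see several rx-N/tx-N entries instead of just rx-0 and tx-0:

ls /sys/class/net/eth0/queues/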

With this we were able to handle 350–400 streams on the same m3.large instance (probably also thanks to disk I/O improvements; read on).

Use interrupt coalescing

Coalescing is a technique to limit the frequency of interrupts handled by the OS. My understanding is that interrupts are deferred and grouped so that the system handles batches of them at once and does not have to wake up too often. See here.

I found that using the maximum value of 1000 can reduce overall CPU usage by 20% in our case. Of course it adds latency, but for our application that’s not a problem.

ethtool -C eth0 rx-usecs 1000
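You can verify the coalescing settings currently in effect with the lowercase variant of the same flag:

ethtool -c eth0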

This grew the number of handled streams by about 20%, so we can expect to safely handle 400–500 streams on a single m3.large instance.

Last note about disk i/o

There are many posts talking about the storage performance of Amazon EC2. Like this one, this other one, that, oh and this too.

These posts talk about how to set up RAID 0 disks to achieve better performance and how to set sysctl parameters to better fit disk usage to your specific application needs. While this is an option we will consider, for the moment we are using one single SSD disk with enough provisioned IOPS to handle the number of streams we need.

Not so many posts talk about the fact that even disk access is something that may weigh on one single core. So if you have a lot of time spent on disk writes, as in our case, all of that time will be wasted on one single core. This is something I could see using the Ubuntu AMIs: as soon as data was flushed to disk, one core showed high usage. A useful tool for spotting this is iostat.

iostat -x 2
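iostat shows device-level statistics; to see whether the work actually lands on a single core, per-CPU numbers help. mpstat, from the same sysstat package, is one way to get them:

mpstat -P ALL 2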

There is a recent kernel feature called “blk-mq” that enables handling of storage I/O using multiple cores. This is supposed to be available in 3.13.* Linux kernels, but it does not actually work with all storage: the system has to detect the SSD disk and its driver must be supported.

I have no hard evidence of this, but it seems to me that the Amazon AMI wins again. Using it, I never saw spikes on one single core due to disk access. I suspect Amazon has worked to have that feature ready ASAP.

Conclusions? Well, for sure this has been harder than expected, but we were able to get things into better shape. I hope this is useful for someone. Back to work.
