globo.com’s live video platform for the 2014 FIFA World Cup

Juarez Bochi
Oct 1, 2015

I recently gave a talk about globo.com’s live video platform at nginx.conf with my co-worker and friend Leandro Ribeiro. He has already shared the slides, but I thought it might be useful to publish this article for those who are interested but couldn’t see the presentation. I initially wrote this document as a speech, so that if I panicked I would at least have something to read. It was the first time I spoke at an international event, in English. In the end, I didn’t read anything I had written, but I guess the talk was not that bad, and we were very happy to receive some positive feedback.


globo who?

To explain how we got there, we should first go back to 2010, when the previous World Cup took place. Better yet, to 1925, the year Globo was founded in Rio de Janeiro by Irineu Marinho as a publisher of two newspapers. In 1944, the group launched a radio station and, in 1965, a TV network. Today, Grupo Globo is the largest mass media group in Latin America. globo.com was launched in 2000 and is an independent company responsible for the group’s internet operations.

If you are not from Brazil, you may never have heard of Globo before, but it’s really huge there. In 2010, neither Leandro nor I had joined globo.com yet, but it was already a big company with a big audience, and it broadcast the 2010 FIFA World Cup to 285 thousand simultaneous users. For that, we used Flash Media Server (FMS, now known as Adobe Media Server, or AMS). The player was written in Flash. The best quality delivered was 800kbps, which is considered bad quality nowadays but was a lot for the time, especially for the typical broadband connection in Brazil. The numbers were very impressive back then, but in reality it was not a good experience for users: the video kept buffering, people were disconnected all the time, and the average bitrate was really bad.

growing pains

The protocol we were using was RTMP, a stateful protocol, which makes it hard to keep the load evenly distributed across servers. If a server failed, its users would try to connect to another server and would not go back to the original one when it restarted. Under heavy load, a server would suddenly crash, and the other servers would fail in cascade because of the additional load. This happened several times, as can be seen in the chart below, which represents the total audience for a stream.

[Chart: total audience for a stream]

We updated our servers, we contacted Adobe, we tried different topologies. We did everything we could, but every major event was a pain. Any event with more than 60k concurrent users was a huge stress. Basically, any football match. Moreover, since FMS is proprietary software, we did not have many tools to debug what was going on. For troubleshooting, all we had were FMS’s logs and Wireshark to inspect RTMP, which is really hard.

While we were struggling with that, the world was watching the exponential growth of mobile phones, and we were still unable to stream to them. The most popular mobile operating systems, iOS and Android, do not support the RTMP protocol. We decided to tackle this problem first, since we were going to deliver a paid stream for the Big Brother Brazil show and the company had already announced it would support mobile devices.

mobile support


In a few weeks, we were able to create a working prototype using FMS and Nginx. We used FMS to segment the video streams and create the playlist files, and Nginx for caching. Later on, we added Lua and C modules to Nginx for authentication and access control (we restrict the number of simultaneous streams per account and per geolocation).
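To give an idea of how those pieces fit together, here is a minimal sketch of that kind of front end. Everything in it is an assumption for illustration: the upstream address, cache zone, timings, and the check_session.lua access check are hypothetical, not our production configuration.

# Hypothetical HLS front end: Nginx caches playlists and segments produced
# by the segmenter (FMS in our case); a Lua hook handles access control.
# All names, addresses, and timings below are illustrative only.

proxy_cache_path /var/cache/nginx/hls levels=1:2 keys_zone=hls:100m max_size=10g inactive=1m;

upstream segmenter {
    server 10.0.0.10:8080;    # hypothetical FMS/segmenter address
}

server {
    listen 80;

    # Playlists change every few seconds, so cache them very briefly.
    location ~ \.m3u8$ {
        access_by_lua_file /etc/nginx/lua/check_session.lua;   # hypothetical auth/limits check
        proxy_cache hls;
        proxy_cache_valid 200 1s;
        proxy_pass http://segmenter;
    }

    # Segments are immutable once written, so they can be cached for longer.
    location ~ \.ts$ {
        proxy_cache hls;
        proxy_cache_valid 200 1m;
        proxy_pass http://segmenter;
    }
}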

BBB was a success. We had no major issues with the 10 different streams that ran 24/7 during the 3 months of the show. We had a couple of issues with the RTMP stream under load, but the HLS stream was much more resilient. Even better, we could easily see and measure what was going on, because it was just plain HTTP and Nginx.

one protocol to rule them all

We had some other goals at the time, but we started working on the player in our own spare time, and after a few weeks we had a working prototype we could show to management to convince them that this could work. The only problem with HLS is that the delay is huge: from 2s-5s with RTMP, it grows to 10s-20s with HLS. By keeping the segments short, we were able to minimize the problem, and the additional number of requests was not a problem for Nginx in our tests.


We did some A/B tests, and the users on HLS watched the streams for longer periods, with better quality and fewer switches between bitrates. For the Copa das Confederações (the Confederations Cup), we replaced FMS with EvoStream and invested a lot in instrumentation and monitoring. It was another huge success: we were able to deliver video to 380k concurrent users.

dvr

DVR, short for Digital Video Recorder, is the technical term used when users have the ability to pause, rewind, and fast-forward live video. For video streams, of course, the server records the video and the player simply plays the stream from a different starting point.


We built a simple Python application that moved the video segments from the segmenter to Redis, and we developed a Lua application in Nginx to create the playlists dynamically and serve the chunks from Redis. We also used Redis to store video thumbnails for every couple of seconds, so the player could show a thumbnail in the seek bar.
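To make the idea more concrete, here is a rough sketch of what a dynamic playlist endpoint could look like in Nginx with Lua, using the lua-resty-redis client. The Redis key layout (a list under segments:<stream>), the segment duration, and the URL scheme are my own assumptions for illustration, not the actual implementation.

# Hypothetical DVR playlist built on the fly from segment names stored in Redis.
location ~ ^/dvr/(?<stream>[a-z0-9_]+)/playlist\.m3u8$ {
    default_type application/vnd.apple.mpegurl;

    content_by_lua_block {
        local redis = require "resty.redis"
        local red = redis:new()
        red:set_timeout(100)    -- milliseconds

        local ok, err = red:connect("127.0.0.1", 6379)
        if not ok then
            ngx.log(ngx.ERR, "redis connect failed: ", err)
            return ngx.exit(ngx.HTTP_SERVICE_UNAVAILABLE)
        end

        -- Fetch the last few segment names for this stream (assumed key layout).
        local segments = red:lrange("segments:" .. ngx.var.stream, -10, -1)
        if not segments or #segments == 0 then
            return ngx.exit(ngx.HTTP_NOT_FOUND)
        end

        -- A real sliding-window playlist also needs an increasing
        -- EXT-X-MEDIA-SEQUENCE; that bookkeeping is omitted here.
        ngx.say("#EXTM3U")
        ngx.say("#EXT-X-VERSION:3")
        ngx.say("#EXT-X-TARGETDURATION:4")    -- illustrative segment length
        for _, name in ipairs(segments) do
            ngx.say("#EXTINF:4.0,")
            ngx.say("/dvr/" .. ngx.var.stream .. "/" .. name)
        end

        red:set_keepalive(10000, 100)    -- return the connection to the pool
    }
}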

load testing

We use Avalanche for capacity testing, but it’s hard to simulate the full scale of a real transmission. Luckily, we were going to stream some big events before the World Cup, such as the Champions League, so we used that opportunity to battle-test our platform. Our strategy was to use the smallest number of servers that could handle the traffic, and only add more servers when we identified a bottleneck.

One of the things we noticed in one of those matches was that the upstream layer sometimes received a burst of requests and the response time started to increase. And the higher the response time got, the more often we saw those bursts. At some point it got so bad that we had to turn DVR off for one of these matches. The day after, we finally understood what was going on: we had not configured the cache lock properly. When the cached playlist expired on a front-end server, several requests were sent to the upstream concurrently until one of them returned. We had to explicitly tell Nginx to deliver the “stale” cache while one request was “updating” it. The solution is very simple. Just use the following directive:

proxy_cache_use_stale updating;
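For context, here is a hypothetical sketch of how that directive can sit in a playlist location, together with proxy_cache_lock, which collapses concurrent cache misses into a single upstream request. The cache zone, timings, and upstream name are illustrative assumptions, not our exact configuration.

# Illustrative playlist caching with request collapsing and stale-while-updating.
location ~ \.m3u8$ {
    proxy_cache hls;
    proxy_cache_valid 200 1s;

    # Only one request per cache key goes upstream on a miss...
    proxy_cache_lock on;
    proxy_cache_lock_timeout 2s;

    # ...and the slightly stale playlist keeps being served while it is refreshed.
    proxy_cache_use_stale updating error timeout;

    proxy_pass http://segmenter;    # hypothetical upstream from the earlier sketch
}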

Another thing we noticed in one of those matches was that the core handling all the network interrupts was very busy, and we started to drop some packets when we were streaming 6Gbps per server. We enabled irqbalance and increased the network card buffers. After that, we reached 9Gbps on a single node.


full throttle

The estimate was that we would have a total link capacity of 1.6Tbps, but we had only 80 machines (each with 64GB of RAM, 24 cores, and two bonded 10Gbps network cards). The math was very simple: to deliver the full capacity of our network links, each node would have to stream 20Gbps. The issue was that we had never crossed 10Gbps before. We went back to Avalanche, used CPU affinity to manually pin IRQs to specific CPUs, and finally reached 19Gbps, almost the full throughput of the network cards.

By the way, both NICs were bonded, listening on the same IP address. We used direct server return (DSR) on our load balancers: the server listens on the public IP address, the load balancer only routes the incoming traffic to the server, and the server sends the response directly to the user.

One last thing we noticed was that the server was gzipping the playlists on demand every single time, and this also became a bottleneck. We had to configure a cache for the compressed content.
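We don’t reproduce our exact setup here, but one common way to do this with Nginx is to cache the already-compressed response, keyed by a normalized Accept-Encoding header, so each playlist is gzipped only once per cache period. A minimal sketch, with illustrative names (the map block belongs in the http context):

# Normalize Accept-Encoding so the cache holds at most two variants per playlist.
map $http_accept_encoding $gzip_key {
    default    "";
    "~*gzip"   "gzip";
}

location ~ \.m3u8$ {
    proxy_cache hls;
    proxy_cache_key "$scheme$host$uri|$gzip_key";
    proxy_cache_valid 200 1s;

    # Let the tier that does the gzipping see only the normalized header.
    proxy_set_header Accept-Encoding $gzip_key;
    proxy_pass http://segmenter;    # hypothetical gzipping upstream
}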

sorry, you are not on the list

The solution we came up with was to measure the capacity of each link in near real-time and put new users arriving through full links on a waiting list. That way, we would save the bandwidth for the users who were already streaming, instead of accepting more users than we could support and hurting the experience for everybody. Once again, we used Nginx for that. The player was modified to ask the waiting room service whether a new user was allowed to stream before it started streaming. The waiting room API was also developed with Nginx and Lua. A Ruby script dumped the exabgp database, with the IP routes for each link, into a Redis database. We used a Redis fork with interval sets to make queries by IP fast. The link capacities were obtained by talking to the routers over SNMP and were persisted in the same database. The waiting room API just queried the database to check which link served the user’s IP and what its capacity was. All of this is summarized in the diagram below:

[Diagram: waiting room architecture]
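A very rough sketch of what such a check could look like in Nginx and Lua is shown below. The JSON shape, key names, status code, and the fail-open behaviour are all assumptions; in particular, the real IP-to-link lookup used the interval-set Redis fork mentioned above, which is replaced here by a plain key lookup.

# Hypothetical waiting room check, called by the player before it starts streaming.
location = /waiting_room/check {
    default_type application/json;

    content_by_lua_block {
        local redis = require "resty.redis"
        local red = redis:new()
        red:set_timeout(100)

        if not red:connect("127.0.0.1", 6379) then
            -- Fail open if the database is unavailable (illustrative choice).
            return ngx.say('{"allowed": true}')
        end

        -- In production the client IP was mapped to a link via interval sets;
        -- here we pretend a plain key already holds that mapping.
        local link = red:get("link_by_ip:" .. ngx.var.remote_addr)
        if not link or link == ngx.null then
            red:set_keepalive(10000, 100)
            return ngx.say('{"allowed": true}')
        end

        local usage = tonumber(red:get("usage:" .. link)) or 0
        local capacity = tonumber(red:get("capacity:" .. link)) or 0
        red:set_keepalive(10000, 100)

        if capacity > 0 and usage >= capacity then
            ngx.status = 429    -- link is full: ask the player to wait
            ngx.say('{"allowed": false}')
        else
            ngx.say('{"allowed": true}')
        end
    }
}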

the world cup

aftermatch

We had to write a Cassandra driver for Nginx Lua, which has seen some adoption. It was used to build Kong, which has some very impressive benchmarks.

future plans

One of the things they plan to work on is support for the MPEG-DASH protocol. We still need a Flash player to handle the HLS protocol on desktop, and with DASH it’s possible to create a pure HTML5 player.

Another change that was made was the adoption of nginx-rtmp to segment the streams. It is an awesome open source library that could completely replace EvoStream for us. There are just a couple of features it still lacks.

This year they are going to stream the Olympics in Rio. Let’s stay tuned!
