A lot of collaboration across the BBC’s Design & Engineering teams goes into bringing the World Cup and Wimbledon to the BBC: from the application itself, through the scheduling behind the scenes, to how the video transmission makes its way into iPlayer and BBC Sport and lets users watch a live broadcast.
This year we are also broadcasting for the first time in UHD (Ultra-High Definition) resolution for the World Cup and selected Wimbledon matches.
Let’s start with how most people watch a match: they tune in to the broadcast. BBC One is a pretty good bet if you’re in the UK, and if you’re lucky you might see one of our ‘UHD Calls To Action’ on the screen and hear one of the commentators tell you this:
You can watch this match in Ultra HD via the BBC iPlayer
Someone decided that some of the World Cup matches would be broadcast on the BBC in HD and UHD. There is more on the production side, and on how our UHD trials worked this summer, in this blog post from BBC R&D (Research and Development).
Once we know the schedule, we co-ordinate the schedule for our Audio/Video encoders to deliver the correct streams for the correct events at the correct times. For the World Cup, the UHD streams are actually completely separate episodes in the BBC programme structure, but embargoed in the system so they do not show up anywhere, including programme guides.
For Wimbledon, we also can’t use the same episodes as on broadcast TV since the broadcasts sometimes switch between channels, and are split up, so the UHD broadcast is attached to the equivalent Webcasts.
For the UHD streams, the encoder, packagers and backups, as well as the CDN (Content Delivery Network) configuration, are controlled manually. This means that, prior to the match, BBC Sport staff in Salford co-ordinate with their Media Services and OTG (Online Technology Group) colleagues in London, who are responsible for the stream delivery, and with the broadcast staff in Moscow, who are responsible for the contribution feed coming into London, to turn everything on in the right order and start the stream.
This needs to happen before the broadcast begins because as soon as the broadcast episode becomes ‘available’ on iPlayer, the TV app starts to offer the UHD version — and if the encoder isn’t turned on, that does not look pretty.
Once everything is running, our CDNs are warmed up and we see no issues, we are getting closer to the match. The ‘Call to Action’ (CTA) for users to switch to iPlayer for UHD actually comes from the Outside Broadcast truck in Russia.
When everything is prepared, we are ready to serve the live streams to the audience, and that is co-ordinated across the different iPlayer teams.
For the app on smart TVs and set-top boxes, the basic architecture is a pretty typical client-server one. We have a common client application which is used for all of the BBC’s Connected TV products and numerous backend services. These services often take the form of a Node.js application deployed to Amazon EC2.
Our usual scaling process is to scale in response to AWS CloudWatch alarms which are hooked up to appropriate metrics (e.g. memory utilisation, CPU utilisation, network throughput) for the service in question. This approach works well with our usual traffic patterns, as they are reasonably gradual increases. When an on-air CTA is made, however, our traffic increases at a rate far too great for AWS AutoScaling to have an effect. We have seen the network traffic through our NAT Gateways increase by 150% in the space of 60 seconds, and the traffic we send to the API that powers the iPlayer homepage triple in less than 20 seconds.
As soon as we find out which games are due to be broadcast on BBC One (and therefore in UHD) we start analysing the times to determine when the peak traffic points are expected to be. We liaise with our colleagues from across the whole of the BBC to determine if there are any on-air CTAs planned and if so, when they are likely to occur.
Once we’ve got all the necessary information we begin planning our scaling rules. Unfortunately this process is (for now) rather manual as we make use of AWS’ Scheduled Scaling. Internally we’ve started looking at various other solutions that would allow automated scaling to be effective in these situations — hopefully we can write another blog post about this in the future.
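As a rough illustration of what planning a scheduled scaling rule can look like, here is a minimal sketch that builds a scale-up action ahead of kick-off and a scale-down action afterwards. The group name, capacities, lead time and kick-off time are all hypothetical, not our real configuration. The dictionary keys match the parameters of the AWS Auto Scaling `put_scheduled_update_group_action` call, so each dictionary could be passed to a boto3 `autoscaling` client, but nothing here talks to AWS.

```python
from datetime import datetime, timedelta, timezone


def plan_scheduled_scaling(group_name, kick_off, peak_capacity, normal_capacity,
                           max_capacity, lead_minutes=30, hold_minutes=180):
    """Return a scale-up and a scale-down scheduled action around one event.

    All values are illustrative assumptions. Each dict uses the parameter
    names of put_scheduled_update_group_action.
    """
    scale_up_at = kick_off - timedelta(minutes=lead_minutes)
    scale_down_at = kick_off + timedelta(minutes=hold_minutes)
    return [
        {
            "AutoScalingGroupName": group_name,
            "ScheduledActionName": f"{group_name}-event-scale-up",
            "StartTime": scale_up_at,
            # Raise the floor so instances are already running before the CTA.
            "MinSize": peak_capacity,
            "MaxSize": max_capacity,
            "DesiredCapacity": peak_capacity,
        },
        {
            "AutoScalingGroupName": group_name,
            "ScheduledActionName": f"{group_name}-event-scale-down",
            "StartTime": scale_down_at,
            # Drop the floor again; reactive scaling takes over from here.
            "MinSize": normal_capacity,
            "MaxSize": max_capacity,
            "DesiredCapacity": normal_capacity,
        },
    ]


# Hypothetical usage: an illustrative kick-off time and made-up capacities.
actions = plan_scheduled_scaling(
    "tv-homepage-api",
    datetime(2018, 7, 7, 14, 0, tzinfo=timezone.utc),
    peak_capacity=40,
    normal_capacity=8,
    max_capacity=60,
)
# With boto3 installed and credentials configured, each action could be applied with:
#   boto3.client("autoscaling").put_scheduled_update_group_action(**action)
```

Scaling up before the expected peak, rather than in response to it, is the whole point: the instances need to be in service before the on-air CTA lands.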
To enforce a limit on the number of UHD streams, we couldn’t rely on previously used measurement systems, as their latency is too high. At kick-off and on-screen CTAs, we have seen the number of viewers increase by a third within 60 seconds. We need to ensure that we can “close the gate” reliably when we reach the limit, as there is a real possibility of reaching the limits of the UK internet infrastructure due to the bandwidth requirements of UHD, which are roughly 8x those of HD.
The “gate” is linked to a Counting Service built by our Live Sport team, which uses AWS Kinesis Firehose to collect the viewer-count statistics and is also heavily used in the Live Pages. Mark Woosey has blogged about the details separately here.
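The idea of “closing the gate” can be sketched as a counter with a hard limit. This is a simplified, single-process illustration only: the class and method names are hypothetical, and the real gate is driven by viewer counts collected across many clients rather than by an in-memory counter.

```python
import threading


class StreamGate:
    """Admit viewers until a fixed limit of concurrent streams is reached."""

    def __init__(self, limit):
        self._limit = limit
        self._active = 0
        self._lock = threading.Lock()

    def try_admit(self):
        """Return True and count the viewer if there is room, otherwise False."""
        with self._lock:
            if self._active >= self._limit:
                return False  # Gate is closed: fall back to the HD stream.
            self._active += 1
            return True

    def release(self):
        """Free a slot when a viewer stops watching."""
        with self._lock:
            if self._active > 0:
                self._active -= 1
```

Checking the limit and incrementing the count under the same lock is what makes the gate reliable: two viewers arriving at the same moment cannot both take the last slot.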
In order to allow users to see iPlayer or Sport content, we ask you to sign in to your BBC account. BBC account is a collection of applications which opens the gateway to user personalisation for the whole of the BBC, and not just for iPlayer and our live streams.
As with many of the modern BBC services, our applications are a set of Node.js microservices, which are deployed using AWS cloud services, and communicate with other BBC services via HTTP requests. We have an isomorphic React application which displays the sign in and registration forms and then securely handles your credentials to confirm your identity. We’ve spent a lot of time ensuring that the applications can be scaled effectively for large events and we do so using AWS autoscaling. Our amazing Quality Assurance (QA) team had already spent a lot of time performance testing our systems in the past in order to understand the expected traffic patterns and server load for our usual “big” events, when we authenticate users during the live online votes for Strictly Come Dancing and Sports Personality Of The Year. That said, we knew that our audience would be keen to watch both the World Cup and Wimbledon, so we performance tested some different traffic profiles, and then added contingency in the shape of additional virtual server instances to support potentially higher loads.
Needless to say you didn’t disappoint us!
As you can see from the tweet of one of our graphs during the England vs Sweden game, we had huge peaks of traffic as you signed into, or registered for, your BBC accounts at the start of the game. In fact, we had the most registrations we’ve ever had during a single day! You can also see where the two England goals occurred from the second and third peaks. People were quick to sign in and watch the goals again on our iPlayer/BBC Sport streams.
While you just want to watch the World Cup goals on our live streams, our goal is to verify as quickly as possible that you have a valid BBC account, which allows you to watch the BBC content that you love. During the peak periods of the England vs Sweden game, we had around 24 times the number of users signing in that we would normally expect during weekend usage, so our additional application resources helped us to serve requests as quickly as we could. Our applications also help to support the BBC account mobile team, whose framework is used by all of the BBC mobile applications, such as iPlayer on your phone or tablet.
Once signed in, we authorise users to access various BBC services by means of an access token which, to ensure our and your security, refreshes on a regular basis. During the peak traffic of this game, we saw approximately 5 times the normal amount of traffic as people turned on their mobile devices, we refreshed their access tokens, and they then watched the World Cup goals shortly after they went in.
We try to be as unintrusive as possible so once we’ve signed you into your BBC account, we send you back to the page you clicked “sign in” from. For the World Cup and Wimbledon this was probably the BBC iPlayer or BBC Sport Live pages.
Every press of a play button triggers a call to our Media Selector API. This service uses a range of factors including the following to decide whether your device is authorised to play a stream, and if so, where it should play it from.
- Device type
- Your location
- The rights of the asset being requested
- Current business rules for load balancing across CDNs based on capacity and performance
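To make the shape of that decision concrete, here is a deliberately simplified sketch. It is not how Media Selector is implemented: the data classes, field names and the weight-based ordering are assumptions made for illustration, and the real service takes more factors into account.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlayRequest:
    device_type: str  # e.g. "smart-tv" (hypothetical value)
    country: str      # derived from the user's location


@dataclass(frozen=True)
class Asset:
    allowed_devices: frozenset    # device types the rights permit
    allowed_countries: frozenset  # territories the rights permit


def select_media(request, asset, cdn_weights):
    """Return CDN names in preference order, or an empty list if not authorised.

    cdn_weights is a hypothetical mapping of CDN name to a load-balancing
    weight; a weight of zero takes that CDN out of rotation.
    """
    if request.country not in asset.allowed_countries:
        return []
    if request.device_type not in asset.allowed_devices:
        return []
    available = [name for name, weight in cdn_weights.items() if weight > 0]
    # Highest weight first, so the device tries the preferred CDN first.
    return sorted(available, key=lambda name: cdn_weights[name], reverse=True)
```

Returning an ordered list rather than a single answer leaves the device with alternatives to try if its first choice has problems.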
You can find out more about how we built Media Selector 6, our current iteration, in Henry’s post here.
Media Selector generally serves between a few hundred and a few thousand requests every second, with regular but often unpredictable spikes in load. Before the World Cup, its traffic peak was when BBC News published a push notification after the stage invasion at Eurovision.
With the Media Selector response in hand, the device then selects a playback URL from the options provided to it by Media Selector, and starts streaming from a CDN.
In addition to our own CDN, BIDI, the BBC engages with two commercial CDNs to be able to handle the users and bitrates our streams require. As well as providing huge amounts of network capacity and substantial cache offload, the CDNs also have a lot of intelligent logic that tries to protect users from transient issues such as slow network paths to the content origin.
Each CDN has its own set of rules around retrying requests to our origin servers, timing them out, or trying to route around specific parts of the network. Tuning these rules is a very complex and nuanced process, where every timeout value changed needs to be considered in terms of the impact it will have on every other part of the distribution chain. In the event of a CDN having issues providing a stream to a user, our media players, Standard Media Player (SMP) and TAP, will automatically attempt to use another CDN rather than showing an error to the user.
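The failover behaviour can be sketched in a few lines. This is an illustration of the general pattern rather than the SMP or TAP code: `StreamError`, `play_with_failover` and the `fetch` callable are all hypothetical names.

```python
class StreamError(Exception):
    """Raised by the (hypothetical) fetch function when a CDN cannot serve the stream."""


def play_with_failover(playback_urls, fetch):
    """Try each playback URL in order and return the first one that works.

    fetch is any callable that takes a URL and either returns the stream
    (or its manifest) or raises StreamError.
    """
    failures = []
    for url in playback_urls:
        try:
            return url, fetch(url)
        except StreamError as error:
            # Remember why this CDN failed, then move on to the next one.
            failures.append((url, str(error)))
    # Only surface an error once every CDN has been tried.
    raise StreamError(f"all CDNs failed: {failures}")
```

The user only sees an error when the list of alternatives is exhausted, which is why it matters that the player is handed more than one playback URL.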
Each CDN has a quota of UK distribution capacity that it allocates to the BBC for use: how many bits it thinks it can push before performance begins to degrade. Throughout the World Cup, performance and capacity were reviewed across different platforms and services as we broke and re-broke our peak throughput records, regularly re-routing traffic and tweaking configurations as we hit new barriers that, if not addressed, would have caused quality to degrade before we hit our total caps on the CDNs.
There is good news with capacity planning though: football fans are predictable. The number of users and amount of traffic delivered at kick-off for a football match is usually around a factor of √2 out from the amount of traffic we’ll be handling at full time. This essentially means that as soon as the timer starts, we know what, if any, traffic engineering will need to have taken place by half time to keep things running smoothly.
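As a worked example of that rule of thumb, the sketch below projects a full-time figure from a kick-off figure and compares it with available capacity. It assumes the full-time figure is the larger of the two (kick-off multiplied by √2), and the traffic and capacity numbers are made up for illustration.

```python
import math


def projected_full_time(kick_off_level):
    """Project full-time traffic, assuming it is about sqrt(2) times kick-off."""
    return kick_off_level * math.sqrt(2)


def needs_traffic_engineering(kick_off_level, capacity):
    """Return True if the projected full-time traffic would exceed capacity."""
    return projected_full_time(kick_off_level) > capacity


# Illustrative numbers only: 2.0 Tbit/s at kick-off projects to roughly 2.83 Tbit/s.
projection = projected_full_time(2.0)
```

With these made-up figures, a CDN allocation of 2.5 Tbit/s would need traffic re-routed before full time, while an allocation of 3.0 Tbit/s would not.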
One exception to this rule has been UHD. We initially expected it to follow the same pattern but the traffic levels have been far flatter — it looks like the CTA which stated that there were a limited number of viewers allowed caused everyone to try and join early and stay on. Whilst for HD there’s a dip in concurrent viewers at half time whilst they pop the kettle on, UHD stays stable (perhaps because viewers know that if they close their UHD stream, they might not get back on after half time due to the limit on concurrent viewers).
Digital 24/7 Operations
What it’s like to be on shift — a personal insight
Operations, I like to think, is the glue that holds this whole process together. We are the eyes and ears of the BBC.
Prior to one of the matches starting, we prepare. And once we’ve prepared, we prepare some more!
This includes everything from checking that the UHD encoders have started and that the Wimbledon & World Cup promos have been scheduled ready to be displayed on iPlayer, to chairing our tier 1 event stand-ups, ensuring our RAG status is green across our broadcast, online & network services, and sending out comms to all our stakeholders.
We constantly monitor.
Now, we love our monitoring in 24/7, as you can see from our monitoring wall image below!
We monitor everything from our content delivery network, live video & audio output for iPlayer & iPlayer Radio, live sport pages and RedButton to a whole heap of other alarms, products & services.
I work with very talented people in 24/7, some of whom have built bespoke monitoring tools to solve problems we have had here.
For example, for AWS alarms we have LookOut. LookOut has a UI that uses colours to help identify the state of alerts at a glance and exposes key elements to members of the Ops team (e.g. Wormhole links and an SQS queue fetcher). It also gives us the flexibility to customise the visibility of alarms in the UI.
For monitoring our content delivery network we have Sawyer. It consists of a script that consumes the output of an API and performs calculations to determine the difference between the CDNs, compared against a customisable threshold, and a UI that displays the desired metrics in a quick-to-digest dashboard view, with changing colours to draw attention to deviations between our third-party CDN providers. We like to avoid the buffer faces for our Audience Members!
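The kind of calculation described for Sawyer can be sketched as follows. This is not Sawyer’s code: the function name, the choice of the mean as a baseline and the example metric are assumptions for illustration.

```python
from statistics import mean


def flag_deviations(metric_by_cdn, threshold):
    """Return CDNs whose metric deviates from the cross-CDN mean by more than threshold.

    metric_by_cdn maps a CDN name to one metric (for example an error count);
    threshold is a fraction, so 0.25 means "more than 25% away from the mean".
    """
    if not metric_by_cdn:
        return {}
    baseline = mean(metric_by_cdn.values())
    if baseline == 0:
        return {}  # Nothing to compare against when every value is zero.
    flagged = {}
    for cdn, value in metric_by_cdn.items():
        deviation = abs(value - baseline) / baseline
        if deviation > threshold:
            flagged[cdn] = deviation
    return flagged


# Hypothetical readings: cdn_c stands out against a 25% threshold.
flagged = flag_deviations({"cdn_a": 100, "cdn_b": 100, "cdn_c": 160}, threshold=0.25)
```

Comparing the CDNs with each other, rather than against a fixed number, means a problem on a single provider stands out even when overall traffic is unusually high.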
Our monitoring tools ensure that we are able to identify & resolve incidents fast!
And when the match is over?
Did I not mention we have another ~1,400 components that we support on top of the components being utilised for the World Cup and Wimbledon?
Now that both the World Cup and Wimbledon are over, and with them our special events UHD live trial, it’s probably good to reflect on the month. There were some initial hiccoughs: co-ordinating the CTAs with the Moscow team, the impact of unprecedented traffic spikes on a new iPlayer homepage architecture, and event-based scaling of the underlying infrastructure.
But by the end of it, we have broken several records multiple times:
- Over 5.5 Terabit/s in live video streams
- 3.8m concurrent streaming requests for the England vs Sweden match
- 15,000 requests for the VR version of the match stream
- 1.6 million requests for the UHD version over the course of the tournaments
Of course there were some other problems. A large proportion of non-UHD viewers will have noticed the stream cutting out just before the end of the England vs Sweden match, which was swiftly dealt with. Simon Thompson wrote another post summarising the teething problems we had, and learned from, with UHD.
Thanks to everyone involved in writing this blog post and to Jenny Wong on Twitter for suggesting that this would be an interesting topic. Thanks also to Quinn Cowper for providing us with some of the pictures.
Ask the teams any questions and we’ll be happy to answer as many as we can.