Things to consider when building a large-scale interactive live streaming platform

Jeroen Mol · Published in Livery Video
Jan 10, 2022 · 12 min read

In 2021, more and more companies started experimenting with interactive live streaming — and their boldness was rewarded with extraordinary results. Livery customers who combined live video with interactive elements saw a user engagement boost of 80–95% compared to traditional second-screen activities. An on-screen interactive layer opens the door for customers to engage in new ways: static viewers become active participants, which increases retention, reach, and the average revenue per viewer.

While these results are certainly impressive, a lack of end-to-end interactive streaming solutions is limiting the adoption rate for organizations big and small. At Livery, we have seen firsthand how organizations are struggling with workarounds: they either use existing communication tools like Zoom, Google Meet, and Teams, or they combine live streaming solutions like Vimeo and YouTube with collaboration tools. The problem is, none of these cobbled-together systems were designed with interactive live video in mind.

In this post, we will walk you through the thought process behind our interactive live video platform, Livery, including the many challenges that come with building a large-scale interactive streaming solution. If you want to start streaming ASAP, I recommend the following post covering the basics of interactive live streaming.

First, a quick definition. When we talk about an interactive live stream, we are referring to:

“Real-time broadcasts in which the structured actions of the participants affect the content of the live stream.”

See how popular online video-streaming tool Taobao Live is being used by merchants to reach and engage with their consumers.

This video from Taobao is a great example of an interactive live stream and will be used as a reference throughout.

The video shows how Taobao Live broadcasts use an interactive layer on top of an ultra-low latency live stream during e-commerce events. The interactive layer allows viewers to actively participate: chatting, submitting live sentiments, and engaging with on-screen product offers.

While this concept was pioneered in Asia, European e-commerce platforms have started hopping on board. A growing number of Livery customers are using our interactive streaming solution to achieve similar results, eager to capitalize on the increasingly popular shoppable video trend.

Intime Mall’s staff used Taobao Live to sell products remotely during the coronavirus outbreak (https://www.alizila.com/).

In the Taobao example, the hosts use multiple interaction types to collect feedback from the viewers. Asking questions via the chat or initiating structured interactions like polls provides data to guide the host's approach. Product offers are presented on top of the stream the moment they are mentioned by the host, and direct deeplinks allow viewers to buy the items without leaving the stream. Thanks to this setup, Taobao was able to hit $7.5 billion in sales during the first 30 minutes of last year's Singles' Day presale event.

If you want to create a full feedback loop like this one, you will need to understand the method behind the magic. Let’s dive into the technical building blocks and challenges you can expect to face when designing an interactive live stream experience.

A full interactive live streaming feedback loop
00:00:000 A host standing in front of a camera or mobile phone is showcasing a given product. They ask the viewers — who are tuning in on mobile and desktop — which colour they prefer. As the question is asked, a structured poll interaction is triggered on-screen.

00:03:000 The video is broadcast with a delay of 1–3 seconds, delivered via a scalable video CDN to all 100k viewers. The poll interaction has a similar delay.

00:03:010 The data generated by the viewers is delivered in (almost) real time to the interaction server, which is then able to visualize the data for the host.

00:23:010 The answer window for the poll stays open for 20 seconds, allowing the host to discuss the incoming data while the poll is open.

00:23:020 The answer window is closed and all user data is processed, allowing the host to reveal the results right there in the studio.
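
To make the loop concrete, the messages flowing through it can be modeled as three event types: the poll broadcast, the individual answers, and the aggregated result pushed back to the host. Below is a minimal TypeScript sketch of those shapes; all type and field names are illustrative rather than part of any specific API.

```typescript
// Hypothetical message shapes for the poll feedback loop described above.

// 1. Broadcast to all viewers, synced to the video timeline.
interface PollStart {
  type: "poll:start";
  pollId: string;
  question: string;           // "Which colour do you prefer?"
  options: string[];          // predefined answers, e.g. ["Red", "Blue", "Green", "Black"]
  showAtStreamTime: number;   // stream timestamp (ms) at which to render the poll
  windowMs: number;           // answer window, e.g. 20_000
}

// 2. Sent by each viewer to the interaction server.
interface PollAnswer {
  type: "poll:answer";
  pollId: string;
  viewerId: string;
  optionIndex: number;
}

// 3. Aggregated server-side and pushed back to the host (and viewers).
interface PollResult {
  type: "poll:result";
  pollId: string;
  counts: number[];           // one tally per option
  totalAnswers: number;
}
```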

Based on this flow, we can begin laying out a set of requirements. At the highest level, they can be split into video and interaction requirements.

A video solution

  • Glass-to-glass latency cannot exceed 5 seconds
  • Needs to support 100k+ viewers
  • Option to control the video latency
  • Support for iOS, Android, and web
  • Cloud encoder able to receive an SRT or RTMP video stream

An interaction engine

  • Needs to support 100k+ requests per second
  • Option to sync the latency of the interaction with the video latency
  • Support for 2-way communication
  • Able to accommodate a high volume of requests per second (the majority of participants will answer the on-screen question simultaneously)

The expected scale and total number of concurrent participants are key limiting factors when determining the possible options.

What is an interaction engine?

The interaction engine is responsible for powering the interactive layer on top of the video stream. As a developer, you can build your own interactive layer using an iframe, a WebView, or another solution. You can also use the Livery player (available for iOS, Android, and web), which comes with a built-in interactive JavaScript layer.

The different layers of the Livery Video players for Web, iOS & Android

In the Taobao example, no player controls are visible. However, this is not always the case, so it is important to be aware of potential dead zones — areas where no interactive elements can be loaded. Aside from the player controls, keep an eye on native UI elements and platform limitations. Both Apple and Google publish their guidelines online.

When building an interaction engine or searching for a white label solution, you’ll need to keep the following points in mind.

Structured vs. unstructured interactions

When dealing with small groups of 2 to 20 participants, unstructured real-time interactions like chat, direct voice communication, and online whiteboards are good options. These groups are small enough for a facilitator to structure the incoming information and allow direct communication between participants. Glass-to-glass latency becomes noticeable above 250ms and unacceptable above 500ms when unstructured interactions are used, which makes WebRTC the most effective live streaming protocol for this specific situation.

Structured interactions are required for audiences of more than 20 participants. The main difference between structured and unstructured interactions is the use of predefined answers and a bounded answer window. For example, a poll interaction where the participants have 15 seconds to select 1 of 4 answers, or a Q&A where participants can upvote a question. The communication phase during structured interactions is reduced, meaning a higher glass-to-glass latency is acceptable. In general, a glass-to-glass latency of up to 5 seconds will not affect the fluidity of the broadcast.
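
To make the "predefined answers plus a bounded answer window" idea concrete, here is a minimal sketch of how a server might tally a structured poll. It assumes a simple in-memory store on a single node; all names are illustrative.

```typescript
// Minimal in-memory tally for a structured poll (a sketch, not production code).
class PollTally {
  private counts: number[];
  private closed = false;

  constructor(optionCount: number, windowMs: number) {
    this.counts = new Array(optionCount).fill(0);
    // Close the answer window after e.g. 15 seconds.
    setTimeout(() => { this.closed = true; }, windowMs);
  }

  submit(optionIndex: number): boolean {
    if (this.closed || optionIndex < 0 || optionIndex >= this.counts.length) {
      return false; // late or invalid answers are ignored
    }
    this.counts[optionIndex]++;
    return true;
  }

  results(): number[] {
    return [...this.counts];
  }
}
```

A real implementation would shard these counters across nodes and merge them, but the core contract stays the same: reject answers outside the window, aggregate the rest.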

Metadata vs. server interactions

The most cost-effective way to make a stream interactive is to include the interaction data in the video's metadata (ID3 tags). This removes the need for an interaction server by integrating the interaction data into the video feed. While the major advantage is cost reduction, this approach comes with an important limitation: it only facilitates one-way communication and therefore cannot create a full feedback loop. Amazon's IVS uses this approach — you can learn more about it in the following post.
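
As a sketch of what consuming such metadata can look like in the browser: where the player exposes timed ID3 as a "metadata" text track (Safari's native HLS playback does this; players such as hls.js emit comparable events), the interactive layer simply listens for cues. The payload shape and the showProductOffer helper below are assumptions for illustration.

```typescript
// Sketch: driving one-way interactions from timed (ID3) metadata cues.
// Assumes the player surfaces ID3 as a "metadata" text track, as Safari's
// native HLS playback does.
const video = document.querySelector("video")!;

for (const track of Array.from(video.textTracks)) {
  if (track.kind !== "metadata") continue;
  track.mode = "hidden"; // receive cue events without rendering anything

  track.addEventListener("cuechange", () => {
    for (const cue of Array.from(track.activeCues ?? [])) {
      // The cue payload format is stream-specific; JSON is a common choice.
      const payload = JSON.parse(String((cue as any).value?.data ?? "{}"));
      if (payload.type === "product:offer") {
        showProductOffer(payload); // hypothetical render function
      }
    }
  });
}

declare function showProductOffer(offer: unknown): void;
```

Because this channel is one-way, viewer answers still need a separate path back, which is where the interaction server comes in.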

When interactions — whether structured or unstructured — require a two-way channel, a server becomes mandatory. The server capacity required, as well as the related costs, will depend on the type of interaction and its functionality.

Take, for example, Taobao giving away prizes based on trivia questions. Trivia is more complicated than a poll due to the associated scores and leaderboards. This means that servers are needed to make real-time score calculations, and the more users you have, the more servers you need. This increases costs while decreasing the risk of failure.

Serverless vs. dedicated environments

Historically, setup and maintenance costs for servers have been quite significant. Nowadays, thanks to serverless setups, costs and time to market have been drastically reduced. Major players like AWS, Google, and Microsoft all offer serverless solutions that can be used to build your own interaction engine. These setups are a great way to get started — until you reach around 10k concurrent participants.

Let’s look at an example of a typical traffic pattern during a Livery interactive live stream:

A common traffic pattern during a Livery interactive live stream

The challenge with serverless setups is the infrastructure's cold start: the system is idle and cannot scale as quickly as users join or start to participate. When an interactive event is triggered (on top of the live stream), all participants see the same CTA at the same time, and 60–85% will interact within the same 1-second window. For 100k viewers, that means 60,000 to 85,000 requests arriving within a single second.

This challenge can be addressed by pre-warming the server infrastructure so it can handle the large influx of data, which requires additional components: an up/down scaling system, a load balancer, gatekeepers, and more. Handling these influxes is one of the reasons why we moved away from a serverless setup and built a fully optimized custom platform: Livery gives us full control, removing any black-box parts from the delivery process.
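
Pre-warming is a server-side measure. A common complementary trick on the client (not specific to Livery, just a generic thundering-herd mitigation) is to add a small random delay and retry with backoff when submitting answers, which flattens the first second of the spike without any user-visible effect. A minimal sketch:

```typescript
// Sketch: smoothing a synchronized burst of submissions with jitter + backoff.
// This complements, rather than replaces, server-side pre-warming.
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function submitWithJitter(url: string, body: unknown): Promise<Response> {
  // Spread the initial burst across ~500 ms instead of one instant.
  await sleep(Math.random() * 500);

  let delayMs = 250;
  for (let attempt = 0; attempt < 4; attempt++) {
    try {
      const res = await fetch(url, {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify(body),
      });
      if (res.ok) return res;
    } catch {
      // Network error: fall through and retry.
    }
    // Exponential backoff with jitter before the next attempt.
    await sleep(delayMs + Math.random() * delayMs);
    delayMs *= 2;
  }
  throw new Error("Submission failed after retries");
}
```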

Multi vs. single tenant

A multi-tenant setup is more cost-effective than a single-tenant setup. In a multi-tenant setup, a server is used by multiple customers, or "tenants." Not all customers broadcast at the same time, which allows them to share resources. When a broadcast is responsible for a significant share of the requests within the cluster, it is best to isolate that customer on their own setup. This prevents a single tenant from affecting other tenants on the same infrastructure. The Livery team likes to know when a customer expects more than 100k concurrent participants, since this allows us to determine whether a single-tenant cluster is best and whether the fee per user hour can be reduced.

Load testing

Every part of Livery's interaction engine is load tested for different scenarios, as are all of our new releases. That way, we can be sure we meet customer expectations and prevent a situation where the Livery platform becomes a bottleneck for their success. If you choose to build your own platform, we strongly recommend load testing early in the development cycle. Livery uses its own load test setup, which is provided by Ex Machina and trusted by major broadcasters. It is also possible to use an existing tool — the following article will walk you through a few open-source load testing tools.
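
If you build your own test setup, even a crude script that replays the spike pattern shown earlier will surface most scaling problems. A minimal sketch (the endpoint and payload are placeholders):

```typescript
// Crude burst test: fire `total` poll answers spread over `windowMs`,
// mimicking the spike after an on-screen call to action.
async function burstTest(url: string, total: number, windowMs: number): Promise<void> {
  const latencies: number[] = [];
  const requests = Array.from({ length: total }, async (_, i) => {
    await new Promise((resolve) => setTimeout(resolve, Math.random() * windowMs));
    const start = Date.now();
    try {
      const res = await fetch(url, {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ pollId: "load-test", viewerId: `v${i}`, optionIndex: i % 4 }),
      });
      latencies.push(Date.now() - start);
      return res.status;
    } catch {
      return 0; // network-level failure
    }
  });

  const statuses = await Promise.all(requests);
  const failures = statuses.filter((s) => s === 0 || s >= 400).length;
  latencies.sort((a, b) => a - b);
  console.log(`failures: ${failures}/${total}`);
  console.log(`p95 latency: ${latencies[Math.floor(latencies.length * 0.95)]} ms`);
}

// Example: 5,000 answers within one second against a staging endpoint.
// burstTest("https://staging.example.com/poll/answer", 5_000, 1_000);
```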

Live video streaming parameters

Underneath the interactive layer, live video is streamed to the participants. In the Taobao example, the host stands in front of a camera and uses the input from their participants to enrich the video and make it more personal. This only works because the interactive elements are in sync with the video and the glass-to-glass delay is minimal.

Ex Machina has years of experience building interactive experiences for both TV and live-streamed events. Based on their research, they pinpointed the ideal latency for structured interactions to be 0 to 5 seconds. A latency higher than 5 seconds starts affecting the feedback loop. More important than achieving the lowest possible latency is synchronicity: the call-to-action and related participation windows cannot be too far apart for different users. An offset longer than 1 second will affect the end-user experience.

If the latency is close to zero, there is no need to sync users — but even with the fastest video protocols, drifting can occur, which hurts the interactive video experience. When the stream is synced using an NTP time source, all participants watch the same moment of the stream at the same time. With the proper optimization, it is possible to reach almost frame-accurate synchronization across browsers and device types. If the interaction data is embedded in the metadata of the video, it is possible to achieve a frame-accurate sync with the video, but that cannot guarantee that all participants see the same moment and related interactions at the same time, creating spoilers and breaking the feedback loop.
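
One common way to implement this kind of wall-clock sync, sketched below under assumptions (the fetchClockOffset helper is hypothetical, and real player logic is more involved), is to derive a clock offset from an NTP-style time server, compute where every client should be in the stream, and gently adjust the playback rate until it converges:

```typescript
// Sketch: wall-clock playback sync. Assumes the stream manifest (or a side
// channel) tells us the wall-clock time at which stream position 0 started,
// and that we can query a time server to correct for local clock drift.
async function syncPlayback(
  video: HTMLVideoElement,
  streamEpochMs: number,   // wall-clock start of the stream (from metadata)
  targetLatencyMs: number, // e.g. 3000 for a 3-second glass-to-glass target
) {
  const clockOffsetMs = await fetchClockOffset(); // NTP-style offset (hypothetical helper)

  setInterval(() => {
    const nowMs = Date.now() + clockOffsetMs;
    const targetPos = (nowMs - streamEpochMs - targetLatencyMs) / 1000;
    const drift = targetPos - video.currentTime; // seconds behind (+) or ahead (-)

    if (Math.abs(drift) > 3) {
      video.currentTime = targetPos;      // too far off: hard seek
    } else if (Math.abs(drift) > 0.05) {
      // Nudge the playback rate; small enough to be invisible and inaudible.
      video.playbackRate = 1 + Math.sign(drift) * 0.05;
    } else {
      video.playbackRate = 1;
    }
  }, 1000);
}

declare function fetchClockOffset(): Promise<number>;
```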

Based on the latency requirement of 0 to 5 seconds, protocols like HLS or DASH — including the tuned (small-segment) versions — do not fit the business requirements.

Live streaming protocols and their latency

The fastest kids on the block are WebRTC-based (UDP) protocols, followed by WebSocket-based (TCP) technologies, which can achieve sub-second latency. In the 1-to-3-second latency range, you can find HTTP-based technologies like ULL-CMAF and LL-HLS. There are more niche protocols available, but their global adoption is still limited.

WebRTC-based and WebSocket-based technologies are mainly used for tools focused on small audiences (like the meeting tools mentioned earlier), whereas HTTP-based technologies are created with large audiences in mind.

If you would like to know more technical details about WebRTC and ULL-CMAF, check out the following post.

Video quality

Both WebRTC and HTTP streams can support a form of adaptive bitrate. In WebRTC-based solutions, this is done client-side with simulcast. For HTTP-based streams, it is done via ABR, which is handled server-side. When it is done client-side, you depend on the streamer's hardware, while server-side means the platform is in control.

Using the ABR logic we know from VOD streams with ULL-CMAF and LL-HLS is an option, but it is not as accurate as with segment-based streams like HLS or DASH. The use of video chunks in ULL-CMAF complicates the measurements needed for a proper ABR algorithm: the chunks are not required to be delivered at line speed, and their size is not known in advance. Apple's LL-HLS improved matters for chunk-based ABR algorithms by signaling the chunks (partial segments) in the manifest.
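
To see why chunked delivery complicates this, consider the naive throughput estimate a segment-based player can make, sketched below. With whole HLS or DASH segments, bytes divided by download time approximates link speed; with CMAF chunks trickling in at encode pace, the same division mostly measures the encoder rather than the network.

```typescript
// Naive ABR bandwidth probe: valid for whole segments, misleading for
// chunked CMAF, where data arrives at encode pace rather than line speed.
async function measureThroughputKbps(segmentUrl: string): Promise<number> {
  const start = performance.now();
  const res = await fetch(segmentUrl);
  const bytes = (await res.arrayBuffer()).byteLength;
  const seconds = (performance.now() - start) / 1000;
  return (bytes * 8) / 1000 / seconds; // kilobits per second
}
```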

The buffer available for HTTP and socket-based streams allows the players to hide possible hiccups or perform retries on chunks that are not properly received.


A WebSocket-based solution uses TCP and confirms the reception of data, whereas WebRTC uses UDP, which does not. This makes WebRTC ideal for real-time video, with the tradeoff of possible artifacts during the stream.

More details about how to measure video quality can be found here.

Cost

One of the major advantages of HTTP-based technologies is that they scale at the standard VOD pricing of the known major CDNs. WebRTC and WebSocket-based technologies require a dedicated delivery infrastructure once you reach the peer limits of the browser, making WebRTC on average 2–8X more expensive than HTTP-based technologies for an audience of 500 viewers.

Availability

The capacity and availability of HTTP-based technologies depend on the CDN used for video delivery. Major organizations like Akamai have invested large amounts of money into building a global infrastructure able to deliver video to all corners of the world, including developing countries. This infrastructure is used by well-known VOD solutions and is prepared for a 4K and 8K future.

WebRTC and WebSocket-based technologies require you to rent dedicated machines on which the required tools need to be deployed. It is expected that major CDNs will extend their WebRTC support in the future, and the rise of dedicated WebRTC CDNs with global reach will tackle one of its inherent scaling and availability challenges.

The information above provides important insights to help you decide whether you should build or buy an off-the-shelf solution. If you have any questions or comments on this topic, I would love to hear from you — please use the comment section below to drop me a line. I may also update the article based on your feedback and comments!

We created Livery Interactive next to Livery Video because many of our customers didn't have the knowledge or resources required to build an interactive streaming platform. They wanted a single trusted point of contact responsible for delivering both the video and the interactive experience, so they could focus on their own challenges. To understand how the Livery system works, check out the component breakdown below.

A high-level overview of the Livery platform.

Our development teams are constantly working to extend both the video and interaction platforms. All of our features are built in-house, which allows us to provide a high-quality solution at a reasonable price compared to our competitors. To make things easy, we've even created an online calculator to determine what the cost of your custom solution would be. If you'd like to use an off-the-shelf solution, please reach out to us. You can also contact me directly on LinkedIn.


Jeroen Mol
Livery Video

VP of Innovation @Livery Video. A creative problem solver with an educational background including an MA in Art Management and a BA in Media Management.