Improving Low-Latency Livestreaming on Site Visibility

Arthur Huang
Published in Samsara R&D · Aug 16, 2022

While “livestreaming” might sound like there’s no delay, in reality, a small amount of latency is often the norm when we watch video. Cutting down latency to get it to near-real-time is already a challenge when you control the hardware and software at both ends of the connection — it’s even harder when you’re working with thousands of different customer networks and video cameras.

This is the problem we faced at Samsara when building Site Visibility, which lets Samsara customers connect their cameras and monitor their work sites and facilities. Livestreaming is one of our key features — and it’s easy to see why low-latency livestreaming in particular is important for many of our customers. For a security guard actively monitoring entrances and exits or an employee checking their surroundings before locking up for the night, the difference between viewing a livestream that’s 30 seconds behind and one that’s real-time is critical to operational security and peace of mind.

This past quarter, we focused on improving the delivery rate of low-latency livestreaming for our customers. We more than doubled that delivery rate from 40% to 82%. This post will highlight how we used our core metrics to understand the problem, define realistic goals, and make impactful changes to deliver value to our customers.

Background: Video streaming protocols

Before diving in, it’s important to give some background on video streaming protocols. There are many options in this constantly evolving field: many articles compare and contrast the different technologies, and whole companies are built around offering out-of-the-box video platform solutions. Here, we’ll give a brief overview of the protocols we use (and why).

HLS

HTTP Live Streaming (HLS) was developed by Apple and released in 2009. As its name suggests, it’s built on top of the HTTP protocol. HLS works by “chunking” the video feed into individual segments on the server and having the client wait until it has fetched some number of segments before playing the video, which allows for a smoother playback experience.
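To make the “chunking” model concrete, here’s a minimal sketch of browser-side HLS playback using the open-source hls.js library. The library choice, playlist URL, and buffer setting are illustrative assumptions, not necessarily what our player uses.

```typescript
// A minimal sketch of segment-based HLS playback in the browser using the
// open-source hls.js library. The URL and config values are illustrative.
import Hls from "hls.js";

const video = document.querySelector<HTMLVideoElement>("#player")!;
const playlistUrl = "https://example.com/live/stream.m3u8"; // hypothetical URL

if (Hls.isSupported()) {
  const hls = new Hls({
    // Start playback roughly three segments behind the live edge; a larger
    // buffer means smoother playback but higher latency.
    liveSyncDurationCount: 3,
  });
  hls.loadSource(playlistUrl); // fetch the playlist, then its video segments
  hls.attachMedia(video);      // feed the segments into the <video> element
  hls.on(Hls.Events.MANIFEST_PARSED, () => video.play());
} else if (video.canPlayType("application/vnd.apple.mpegurl")) {
  // Safari can play HLS natively without the library.
  video.src = playlistUrl;
  video.play();
}
```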

HLS is highly scalable and widely supported across many platforms, and thus is ideal for video broadcasting use cases like sports and games streaming — in fact, this is what Twitch primarily uses. When we first launched Site Visibility, we prioritized HLS because of its stability and because we could use it to stream both live and historical video.

HLS was built with quality and resiliency in mind; as a result, however, it isn’t optimized for low latency. HLS latency typically ranges from 10–30s. So, to address near-real-time use cases, we implemented webRTC.

webRTC

webRTC is an open-source standard designed for real-time communication. It differs from HLS in that it optimizes for minimal latency; to achieve this, webRTC embraces UDP’s inherent unreliability, minimizes buffering, and can operate peer-to-peer. It’s also capable of operating over TCP, but the preferred connection is a direct one over UDP from client to client.

webRTC streams often achieve interactive latencies (<250ms), and the standard is widely used by applications like Google Hangouts, Facebook Messenger, Discord, etc. It’s supported by all major browsers with no additional extensions required. By minimizing buffering and sending frames as soon as they’re available, webRTC gives users an almost real-time video feed. This also means, however, that webRTC is more sensitive to network fluctuations, so video playback may not be as smooth as HLS.
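For reference, here’s a minimal sketch of the receiving side of a webRTC stream using the browser’s standard RTCPeerConnection API. The STUN server and the sendOfferToGateway signaling helper are hypothetical stand-ins for whatever signaling a real application uses.

```typescript
// A minimal sketch of receiving a webRTC video stream in the browser.
const pc = new RTCPeerConnection({
  iceServers: [{ urls: "stun:stun.example.com:3478" }], // hypothetical server
});

// Render frames as soon as they arrive; there is no segment buffer like HLS,
// which is what keeps latency low.
pc.ontrack = (event) => {
  const video = document.querySelector<HTMLVideoElement>("#player")!;
  video.srcObject = event.streams[0];
  video.play();
};

// We only receive video; we don't send any media back to the camera.
pc.addTransceiver("video", { direction: "recvonly" });

async function connect(sendOfferToGateway: (sdp: string) => Promise<string>) {
  const offer = await pc.createOffer();
  await pc.setLocalDescription(offer);
  // Signaling is application-specific: send our offer, get the gateway's answer.
  const answerSdp = await sendOfferToGateway(offer.sdp!);
  await pc.setRemoteDescription({ type: "answer", sdp: answerSdp });
}
```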

Why we chose webRTC over LLHLS

It’s worth noting that webRTC is not the only option for low-latency streaming; in between these two options sits Low-Latency HLS (LLHLS), an extension of HLS that brings latency down to ~5–10s while preserving many of the same reliability benefits as HLS.

We chose to implement webRTC rather than LLHLS as our low-latency solution because of the customer and engineering benefits:

  • Near-real-time latency: <250ms is much better than 5–10s in security-critical situations.
  • Automatic local streaming: In onsite use cases, the video never leaves the local network if we’re streaming over webRTC, saving the customer’s uplink bandwidth. This was very important to customers who wanted to continuously stream multiple livestreams locally without sacrificing their bandwidth (as seen in the screenshot below, we support views of up to 6 concurrent streams, which can potentially outpace their network bandwidth). webRTC supports this and prefers the local peer-to-peer connection when it’s available. While this is also possible to implement over HLS, it would be more complicated to set up, whereas it comes automatically with webRTC.
  • Internal precedent: webRTC was already implemented for our Safety product, which meant a smaller technical lift and assurances that this would be feasible. It also meant we could share the underlying infrastructure, allowing us to provide customer value more quickly.

Setting a goal: Improve webRTC delivery rate

To provide customers with a low-latency option, we rolled out webRTC for our streaming platform shortly after launching Site Visibility. In the UI, we make it clear when a user is viewing a low-latency stream (webRTC) and can expect it to closely match what’s happening in real-time. Since webRTC can take some time to connect, we always start playback with the HLS livestream and switch over to webRTC once we’ve successfully connected.

Switching from an HLS to a webRTC livestream
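A rough sketch of that HLS-first, webRTC-when-ready behavior might look like this (the startHls and stopHls helpers are hypothetical, not our actual player code):

```typescript
// Start on the HLS stream immediately, then swap the video element over to
// webRTC once the peer connection is actually up.
function playHlsFirstThenWebRtc(
  video: HTMLVideoElement,
  startHls: (video: HTMLVideoElement) => void, // hypothetical helper
  stopHls: () => void,                         // hypothetical helper
  pc: RTCPeerConnection,
  webRtcStream: MediaStream,
): void {
  // 1. Show something right away: HLS connects quickly and reliably.
  startHls(video);

  // 2. Once webRTC reports a live connection, switch the element's source.
  pc.addEventListener("connectionstatechange", () => {
    if (pc.connectionState === "connected") {
      stopHls();                      // stop fetching HLS segments
      video.srcObject = webRtcStream; // the low-latency feed takes over
    }
  });
}
```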

However, we heard from some users that they were still seeing a high-latency livestream — in other words, they weren’t getting the webRTC stream. As a small team working towards a product launch, we had originally focused much more on feature development and velocity — getting an MVP into customers’ hands was our top priority so we could start iterating and learning from their feedback. Now, it was time to prioritize operational excellence, and we decided to focus on improving our webRTC delivery rate.

Establishing our core metrics

When we started on this initiative, our biggest challenge was a lack of clarity and understanding — when webRTC failed, we didn’t know why. In fact, we lacked the metrics to answer critical questions like “how much livestreaming is currently done through webRTC vs. HLS?” as a baseline stat.

Luckily, we had great data analytics tools already built for us by our Data Platform team, so our first step was to instrument and start collecting metrics around livestreaming to improve our understanding of the customer experience:

  • How long are users waiting for a stream to load?
  • What percentage of the time is a stream healthy vs. unhealthy? (We use an FPS metric to approximate stream health; see the sketch after this list)
  • When users watch a livestream, what percentage of the time are they viewing a low-latency stream via webRTC vs. the higher-latency HLS stream?
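As a rough illustration of the stream-health question above, here’s how an FPS sample can be pulled from a webRTC connection with the standard getStats() API. The threshold, sampling interval, and reportMetric helper are assumptions for the sketch, not our actual pipeline.

```typescript
// Sampling a frames-per-second health metric from a webRTC stream.
declare function reportMetric(name: string, value: number): void; // hypothetical

async function sampleVideoFps(pc: RTCPeerConnection): Promise<number | null> {
  const stats = await pc.getStats();
  let fps: number | null = null;
  stats.forEach((report) => {
    // The inbound-rtp report for the video track exposes framesPerSecond.
    if (report.type === "inbound-rtp" && report.kind === "video") {
      fps = report.framesPerSecond ?? null;
    }
  });
  return fps;
}

function monitorStreamHealth(pc: RTCPeerConnection): void {
  const HEALTHY_FPS_THRESHOLD = 10; // assumed cutoff, not our actual value
  setInterval(async () => {
    const fps = await sampleVideoFps(pc);
    if (fps !== null) {
      reportMetric("livestream_fps", fps);
      reportMetric("livestream_healthy", fps >= HEALTHY_FPS_THRESHOLD ? 1 : 0);
    }
  }, 5_000);
}
```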

The last of these questions gave us our baseline: only 40% of livestreaming was done through webRTC, with HLS accounting for the rest.

The next question was: how much could we improve this? In other words, what should our target delivery rate be? In engineering, we often think in terms of percentile-based goals (p95, p99) — but was it realistic to expect 99% of livestreaming to go through webRTC?

Learning from investigating customer behavior

To set an effective delivery rate target, we needed to look at real-world usage to see what was realistic. Our new instrumentation let us investigate our most problematic cases and uncover insights about customer configurations. Because webRTC uses a direct peer-to-peer connection, its performance is highly dependent on the networking environment it’s deployed in.

The insights we gained, described below, led us to realize that 99% is likely not a realistic — nor necessary — delivery rate for low-latency streaming. That said, our investigations gave us new ideas for ways we could improve the customer experience, even if we couldn’t deliver webRTC in certain scenarios.

Focusing on longer sessions

We bucketed each livestream session into individual 5-second duration intervals, and found that a surprisingly large portion of sessions — 38% when we last sampled — last less than 15 seconds.
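As a small illustration, here’s a sketch of that 5-second bucketing; the session shape is a simplified assumption, not our actual event schema.

```typescript
// Assigning sessions to 5-second duration buckets so webRTC usage can be
// compared across session lengths.
interface LivestreamSession {
  durationSeconds: number;
  webRtcSeconds: number; // seconds of the session actually served over webRTC
}

const BUCKET_SIZE_SECONDS = 5;

function bucketLabel(durationSeconds: number): string {
  const start = Math.floor(durationSeconds / BUCKET_SIZE_SECONDS) * BUCKET_SIZE_SECONDS;
  return `${start}-${start + BUCKET_SIZE_SECONDS}s`;
}

// For each duration bucket, what fraction of viewing time went through webRTC?
function webRtcShareByBucket(sessions: LivestreamSession[]): Map<string, number> {
  const totals = new Map<string, { webRtc: number; all: number }>();
  for (const s of sessions) {
    const label = bucketLabel(s.durationSeconds);
    const t = totals.get(label) ?? { webRtc: 0, all: 0 };
    t.webRtc += s.webRtcSeconds;
    t.all += s.durationSeconds;
    totals.set(label, t);
  }
  return new Map([...totals].map(([label, t]) => [label, t.all > 0 ? t.webRtc / t.all : 0]));
}
```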

When we correlate the webRTC streaming percentages with these duration buckets, there’s a clear trend: the lower durations (0–15s) experience significantly less low-latency streaming.

This makes sense! As mentioned earlier, webRTC can take some time to connect (up to 15 seconds). If a user navigates away from the page before the connection completes, they’ll only ever see the higher-latency HLS stream.

In our initial work, we’ve focused on improving the delivery rate for customers who view livestreams for at least 15 seconds (thus giving webRTC the chance to connect). However, we’re also exploring ways to reduce the initial connection time by implementing trickle ICE, sketched below.
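For context, trickle ICE means exchanging ICE candidates as soon as they’re discovered instead of waiting for candidate gathering to finish before sending the offer. A minimal sketch, with a hypothetical signaling helper, looks like this:

```typescript
// Trickle ICE: forward candidates immediately rather than batching them.
declare function sendToGateway(message: object): void; // hypothetical signaling channel

async function connectWithTrickleIce(pc: RTCPeerConnection): Promise<void> {
  // Each candidate is sent to the other side as soon as it's discovered.
  pc.onicecandidate = (event) => {
    if (event.candidate) {
      sendToGateway({ type: "ice-candidate", candidate: event.candidate.toJSON() });
    }
  };

  // The offer goes out right away; candidates "trickle" in afterwards.
  const offer = await pc.createOffer();
  await pc.setLocalDescription(offer);
  sendToGateway({ type: "offer", sdp: offer.sdp });
}
```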

Implementing an adaptive fallback from UDP to TCP

We looked at some of the worst-performing cases: streams where 100% of livestreaming was done over HLS and 0% over webRTC. For certain customers, we noticed that when webRTC used UDP, we saw consistent packet loss and failed to stream successfully, but when we switched to webRTC over TCP (via a TURN server), performance was much better.

We understood this as the UDP network capacity differing from the TCP capacity on the customer’s network — the same stream may experience 5–10% packet loss over UDP, but play smoothly with 0% packet loss over TCP due to higher capacity. In theory, webRTC is supposed to choose between UDP and TCP based on whichever has the best performance, but in practice, we noticed it was choosing to maintain the first connection regardless of performance. Since UDP has lower overhead, it usually connects first.

Our solution here was to introduce an adaptive fallback mechanism. When we start a livestream, we request both the TCP and UDP streams simultaneously, but keep the TCP one paused in the background and serve the UDP stream. If or when we detect persistent packet loss above an unacceptable threshold on the UDP stream, we then unpause the TCP stream and switch over, while pausing the UDP stream. This allows us to ensure that we’re not straining the client bandwidth unnecessarily while providing the best customer experience.

An example livestream with packet loss — note the stuttering & frame freezes
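Here’s a simplified sketch of that fallback logic. The TURN server, pause/resume signaling helpers, loss threshold, and check interval are illustrative assumptions rather than our production values.

```typescript
// Adaptive UDP-to-TCP fallback: negotiate both legs up front, serve the
// direct (usually UDP) leg, and switch to the TURN/TCP leg on sustained loss.
declare function pauseStream(pc: RTCPeerConnection): void;  // hypothetical signaling
declare function resumeStream(pc: RTCPeerConnection): void; // hypothetical signaling

const LOSS_THRESHOLD = 0.05;     // assumed: switch above ~5% packet loss
const CHECK_INTERVAL_MS = 2_000; // assumed check interval

function makeTcpRelayConnection(): RTCPeerConnection {
  // Forcing the relay policy with a TCP TURN URL keeps this leg off UDP.
  return new RTCPeerConnection({
    iceServers: [{
      urls: "turn:turn.example.com:443?transport=tcp", // hypothetical TURN server
      username: "user",
      credential: "secret",
    }],
    iceTransportPolicy: "relay",
  });
}

async function packetLossRatio(pc: RTCPeerConnection): Promise<number> {
  const stats = await pc.getStats();
  let lost = 0;
  let received = 0;
  stats.forEach((report) => {
    if (report.type === "inbound-rtp" && report.kind === "video") {
      lost += report.packetsLost ?? 0;
      received += report.packetsReceived ?? 0;
    }
  });
  // Note: these counters are cumulative; a real implementation would diff them
  // over a window to detect persistent loss rather than total loss.
  return received + lost > 0 ? lost / (received + lost) : 0;
}

function watchAndFallBackToTcp(udpPc: RTCPeerConnection, tcpPc: RTCPeerConnection): void {
  const timer = setInterval(async () => {
    if ((await packetLossRatio(udpPc)) > LOSS_THRESHOLD) {
      resumeStream(tcpPc); // unpause the TCP leg...
      pauseStream(udpPc);  // ...and stop pulling the lossy UDP stream
      clearInterval(timer);
    }
  }, CHECK_INTERVAL_MS);
}
```

Requesting both legs up front means the switch is just an unpause, rather than a fresh (and slow) connection attempt.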

Identifying firewall restrictions

Our investigations also showed that a small percentage of gateways consistently failed to stream over webRTC. Digging deeper, we found that ~2% of gateways had such restrictive firewall rules that the client was unable to form a direct connection with the gateway over webRTC. This isn’t too surprising, considering these gateways are deployed in corporate networks.

For now, we’ve decided to proactively explain what’s going on to customers that have firewalls set up. However, one alternative fix in the future is to implement a Selective Forwarding Unit (SFU) on our backend to proxy the connections, since presumably these gateways have allowed connections to our Samsara backend.

Setting appropriate bitrates

We ingest the camera’s encoded video directly, so the bitrate is set entirely on the camera side, out of our control. We noticed that some of the worst-performing streams consistently streamed at much higher bitrates (>20 Mbps) than what we recommend. Combined with insufficient network capacity on either the gateway or the client side, this meant we would consistently drop packets, and the video would not render smoothly, if at all.

Our support team reached out individually to these customers and fixed the configured bitrate. However, we recognize this isn’t a scalable solution and have discussed additional tooling to help customers better understand and debug their own streaming issues.

Broad reliability improvements

Outside of the improvements inspired by particular customer scenarios, we worked on some larger reliability improvements that affected all customers across the board.

Retries

webRTC’s initial connection attempt can occasionally fail due to adverse networking effects, so we added retry logic to automatically re-initiate a connection. We retry up to 10 times, and 46% of retry attempts eventually succeed.
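A simplified sketch of that retry loop; the connectWebRtc helper and the backoff schedule are illustrative assumptions (only the 10-attempt cap comes from the numbers above).

```typescript
// Retry the initial webRTC connection attempt up to a fixed cap.
declare function connectWebRtc(): Promise<RTCPeerConnection>; // hypothetical

const MAX_ATTEMPTS = 10;

async function connectWithRetries(): Promise<RTCPeerConnection | null> {
  for (let attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
    try {
      return await connectWebRtc();
    } catch (err) {
      console.warn(`webRTC connection attempt ${attempt} failed`, err);
      // Brief, growing pause so transient network issues have time to clear.
      await new Promise((resolve) => setTimeout(resolve, 1_000 * attempt));
    }
  }
  return null; // caller keeps showing the HLS stream if every attempt fails
}
```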

Standard-definition webRTC

Our streaming platform supports both standard-definition (SD) and high-definition (HD) streams. SD is useful if the user has bandwidth constraints. It’s also the default when watching a panel of multiple streams at once (up to 6, as mentioned earlier), where the difference in quality is less noticeable due to the smaller size of each video. If no SD stream is available from the camera, the gateway ingests the original HD stream from each camera and re-encodes it in standard definition.

When we first launched webRTC, we launched it in HD only. While this was less than ideal, a small minority of gateways suffered from reliability issues that, for some time, prevented us from enabling standard-definition webRTC. Since 37% of livestreaming is done in SD, we knew this was a major contributor to our lower webRTC delivery rate.

Our firmware team was able to make dramatic improvements to fix the reliability problems. On the frontend, we implemented an adaptive fallback mechanism where we switch to HD webRTC if we notice a problem with the SD webRTC stream. Together, these changes allowed us to enable standard-definition webRTC, resulting in the largest jump in our webRTC delivery rate.

Results

Putting all these changes together — from the targeted customer outreach, to our UDP vs. TCP fallback mechanism, to our broader reliability improvements — we were able to improve our average webRTC delivery rate from the baseline of 40% to 82%. Now, when a customer starts a livestream to monitor their facilities, they’re more than twice as likely to see the feed in real-time (as opposed to a 10–30s delay) and be assured they can respond to any threats or unsafe practices immediately after they happen.

Breaking this down further, the chart below shows how much streaming is done via webRTC vs. HLS. p50 is the median per-user delivery rate, p90 is the delivery rate for the bottom 10% of users, and the average is our mean delivery rate across all users per day.

Before, our median delivery rate per user was 30% — it’s now 99%.

After our changes, we also looked at webRTC delivery rates specifically for in-network users, which represent 58% of livestreaming use cases overall. We found webRTC delivery rates are 5% better for in-network users. This makes sense due to fewer networking hops and less complexity, but it’s good to see, since users viewing the livestream for security reasons are most likely onsite and in-network.

Conclusion

This is an ongoing effort for our team, since streaming reliability will always be a priority for us, and we have a set of concrete next steps we want to take to even further improve the customer experience. We’ve talked about many of these ideas throughout the article, such as trickle ICE and SFU. We’re exploring whether to move our HLS streams to LLHLS to reduce our overall latency, and we’re looking into Adaptive Bitrate Streaming (ABR), which is similar to the adaptive fallbacks we already have in place.

However, we’re also happy with the progress we’ve made thus far, and can now reprioritize and shift the team’s focus as needed. Our goal is always to deliver the most value to our customers, even when the result might not make for a flashy demo. Indeed, some customers might not notice the changes — but for those with use cases where latency is vital, they can feel secure knowing they’re actively monitoring their sites in real time.

If tackling these types of challenges sounds exciting to you, we’re hiring for many engineering roles at Samsara and would love to hear from you. You can see our open positions here.
