WebRTC connection times and the power of playing around with data
Inspired by my good friend @fippo’s blog post on failing WebRTC calls, I wanted to play around with some data too. However, I’m far too lazy to dig around with SQL all the time, so I wanted something a bit more powerful.
The software we use to extract features from WebRTC calls made in appear.in stores that data in DynamoDB before it is put into Redshift for long-term storage and analytics. This is mostly an implementation detail, but DynamoDB has some really nice features. Using DynamoDB Streams, we can capture table activity and have it piped into something useful. In this case, I used some simple AWS Lambda scripts to pipe the data from DynamoDB into Elasticsearch. Together with Kibana, I’m now able to query all the features of every WebRTC call made in appear.in, in real time, filter them, and create powerful visualizations and dashboards, all with a few points and clicks. And the best part? This whole infrastructure is powered by Amazon, and I was able to set it all up in mere hours, over the course of a boring evening.
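As a rough sketch, the Lambda glue can be as small as unwrapping DynamoDB Streams’ typed attribute values into plain JSON documents and indexing them. The field names and the "webrtc-calls" index below are hypothetical, for illustration only, not our actual schema:

```javascript
// Hypothetical sketch of the DynamoDB -> Elasticsearch Lambda glue.
// Field names and the "webrtc-calls" index are assumptions.

// DynamoDB Streams deliver attributes in a typed wrapper
// ({ S: "...", N: "..." }); unwrap them into plain JSON values.
function unwrap(attr) {
  if ('S' in attr) return attr.S;
  if ('N' in attr) return Number(attr.N);
  if ('BOOL' in attr) return attr.BOOL;
  if ('M' in attr) {
    const out = {};
    for (const [k, v] of Object.entries(attr.M)) out[k] = unwrap(v);
    return out;
  }
  if ('L' in attr) return attr.L.map(unwrap);
  return null;
}

// Turn one stream record into the document we'd index.
function toDocument(record) {
  const doc = {};
  for (const [key, value] of Object.entries(record.dynamodb.NewImage)) {
    doc[key] = unwrap(value);
  }
  return doc;
}

// Lambda entry point: in the real function each document would be sent
// to Elasticsearch (e.g. a bulk request against the "webrtc-calls" index).
async function handler(event) {
  return event.Records
    .filter((r) => r.eventName === 'INSERT')
    .map(toDocument);
}
```

The heavy lifting (sharding, indexing, querying) all happens on the Elasticsearch side; the Lambda function only has to reshape and forward records.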
Kibana offers two key features, Discover and Visualize. Discover lets us interactively explore our data, applying various filters and searches to narrow it down, and pick out key fields we want to explore further.
Each call can be expanded to show all the data about that particular call, letting you spot possible correlations.
How long does it take to connect to a peer?
Through this exploration, I came across the connectionTime feature, which measures how long a call took from peer connection creation until it was connected. This is a pretty useful metric for the end-user-perceived performance of our application. In layman's terms, it measures the time from when a client detects that a new user has joined the call until media is flowing between them.
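On the client, a metric like this can be captured by timestamping peer connection creation and the ICE transition to connected. The sketch below is a guess at the general shape, not our actual instrumentation; `report` is an assumed callback that ships the feature to the backend:

```javascript
// Hypothetical client-side sketch of a connectionTime-style metric:
// note when the RTCPeerConnection is created, and report the delta once
// ICE reaches "connected" (or "completed"). `report` is an assumed
// callback that ships the feature off for storage.
function trackConnectionTime(pc, report) {
  const createdAt = Date.now();
  let reported = false;
  pc.addEventListener('iceconnectionstatechange', () => {
    const state = pc.iceConnectionState;
    if (!reported && (state === 'connected' || state === 'completed')) {
      reported = true; // only measure the initial connection, not ICE restarts
      report('connectionTime', Date.now() - createdAt);
    }
  });
}
```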
With the help of Visualize, I was able to easily calculate the average connection time of all calls we have in our system. All graphs and numbers in this blog post are based on a random sample of 100 000 recent calls, to get a decent sample size.
However, averages can be dangerous. Do we have a lot of outliers? What does the distribution look like? So, a few clicks later, I had a graph showing the distribution of connection times, bucketed by 100ms.
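For intuition, the 100ms bucketing behind the histogram (which Elasticsearch computes server-side via its histogram aggregation) amounts to something like this toy sketch:

```javascript
// Toy version of 100 ms histogram bucketing: each connection time is
// assigned to the bucket floor(t / interval) * interval.
function bucketCounts(timesMs, intervalMs = 100) {
  const buckets = new Map();
  for (const t of timesMs) {
    const key = Math.floor(t / intervalMs) * intervalMs;
    buckets.set(key, (buckets.get(key) || 0) + 1);
  }
  return buckets;
}
```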
As we can clearly see, most of our calls actually connect below the average connection time calculated earlier. This is to be expected: our distribution has a long tail, and heavy outliers skew the average upward. So how do we calculate better metrics? We use percentiles, of course! Luckily, Kibana easily lets us calculate percentiles for our distribution.
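A toy example (made-up numbers, not our data) shows why the median is more representative than the mean on a long-tailed distribution:

```javascript
// Nearest-rank percentile on a sorted copy of the sample.
function percentile(values, p) {
  const sorted = [...values].sort((a, b) => a - b);
  const idx = Math.min(sorted.length - 1, Math.floor((p / 100) * sorted.length));
  return sorted[idx];
}

// Made-up connection times (ms) with a couple of heavy outliers.
const times = [300, 320, 350, 400, 450, 500, 600, 800, 4000, 12000];
const mean = times.reduce((a, b) => a + b, 0) / times.length;
console.log(mean);                  // 1972 -- dragged up by the tail
console.log(percentile(times, 50)); // 500  -- what a typical call sees
```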
As we can clearly see, our average didn’t paint a very accurate picture for most of our calls. The long tail in our distribution clearly shows up in the 90th and 95th percentiles, but these generally contain a lot of extreme outliers. A more actionable metric would be the 50th percentile: if it suffers a performance degradation of, say, 20%, then half of our calls are affected, and that is something we have to investigate.
One final question we want answered: How does the candidate type selected affect the connection times? Does a relayed call take longer to establish? Let’s look at our data again, and this time, bucket our calls by connection type.
We can clearly see that host and server-reflexive candidates are about equally fast, while relayed and peer-reflexive calls are noticeably slower, by about 50%. This is to be expected, of course, but it’s nice to be able to verify assumptions like this with hard data.
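The breakdown above is a plain group-by-and-average. Sketched below with assumed field names (`candidateType`, `connectionTime`), which may not match the actual feature names we store:

```javascript
// Group calls by the selected candidate type and average the
// connection time per group. Field names are assumptions.
function averageByType(calls) {
  const groups = {};
  for (const { candidateType, connectionTime } of calls) {
    const g = groups[candidateType] || (groups[candidateType] = { total: 0, n: 0 });
    g.total += connectionTime;
    g.n += 1;
  }
  const averages = {};
  for (const [type, g] of Object.entries(groups)) {
    averages[type] = g.total / g.n;
  }
  return averages;
}
```

In Kibana this is just a terms sub-bucket on the candidate type under an average metric; no code required.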
One last thing: how does our distribution look for server-reflexive versus relayed candidates?
The longer connection time for relayed calls stems from the fact that relay candidates have to be gathered from a TURN server first, which takes two round trips between the client and the TURN server (we measure this too!). Server-reflexive candidates, meanwhile, require only a single round trip to gather.
The peaks for relayed calls can probably be explained by TURN/UDP versus TURN/TCP; TURN/TLS probably has too little data to show a significant peak in this distribution. But I’ll reserve that investigation for another blog post.
However, this still doesn’t tell the whole story. Are there other factors at play? Why do some users experience connection times above 5 seconds? Is this due to poor TURN server coverage in some of the countries appear.in is used from? We don’t currently have an estimate of where calls originate, so these questions are hard to answer, but it is something we will be looking into in the future. Meanwhile, we can search for calls with a connection time above 5 seconds and look at samples of those calls to see if we can spot anything obviously wrong.
I’m very excited about what these powerful tools and rich data sets might be able to tell us in the future. We have barely scratched the surface of what Elasticsearch and Kibana can do, and we look forward to exploring and sharing more.