I’ve previously written about how we measure and analyze WebRTC traffic in appear.in. Following up on Philipp’s blog post on where to deploy TURN servers, I wanted to write one on latency and how it differs across the world in our service.
In our datasets we have, amongst other things, the mean send audio RTT for each call, as well as which TURN server was used during the call. Kibana allows me to combine these two into a neat little bar chart that shows, for all the calls I’ve selected (around 100 000 relayed calls in this case), the average of the per-call mean RTT for each TURN server during that time.
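As a rough sketch of the aggregation behind that chart: assuming each call is a record with a `turn_server` field and a `mean_audio_rtt_ms` field (both names are hypothetical, as are the sample values, and this is not our actual schema or Kibana's implementation), the per-server average could be computed like this:

```python
from collections import defaultdict

def average_rtt_per_server(calls):
    """Return {turn_server: average of the per-call mean RTT in ms}."""
    totals = defaultdict(lambda: [0.0, 0])  # server -> [sum of RTTs, call count]
    for call in calls:
        entry = totals[call["turn_server"]]
        entry[0] += call["mean_audio_rtt_ms"]
        entry[1] += 1
    return {server: total / count for server, (total, count) in totals.items()}

# Made-up sample records, for illustration only.
sample_calls = [
    {"turn_server": "tokyo", "mean_audio_rtt_ms": 200.0},
    {"turn_server": "tokyo", "mean_audio_rtt_ms": 220.0},
    {"turn_server": "seoul", "mean_audio_rtt_ms": 400.0},
]
per_server = average_rtt_per_server(sample_calls)
```

Note that this is an average of averages: every call counts equally, no matter how long it lasted.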
And while we don’t know the geolocation of each caller, we can use this, together with the fact that each caller will only resolve their nearest TURN server, to infer something about where each call took place. Singapore, for example, is most likely dominated by South-East Asia. Telenor has a lot of operations there, and sometimes the Internet connectivity can take some really weird routes, causing higher than normal RTT for our calls. Mumbai is also no shocker: India has poor Internet connectivity, leading to higher than average RTT. In fact, if we look at the Internet connection speeds as reported by Akamai, they paint a pretty clear picture of where our problems are, which is further confirmed by our own data. Neat!
Now the nice thing about Kibana is that it lets you quickly filter the data set you are looking at. In my dashboard, I have distribution graphs for audio RTT and connection times; just look at what happens when I click one of the TURN servers.
All of this helps us diagnose problems and improve our service quality quickly and effectively. And while we can’t stare at graphs all day, this will help us define normal values for our TURN servers and monitor for outliers that can identify problems, which in turn helps us scale automatically to meet demand later on.
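To give an idea of what "normal values" and "outliers" could mean in practice, here is a minimal sketch that flags a new RTT reading when it sits more than three standard deviations away from a server's historical mean. The function name, the threshold and the sample numbers are all assumptions for illustration, not something we run in production:

```python
import statistics

def is_outlier(history, latest, threshold=3.0):
    """Flag `latest` if it is more than `threshold` standard deviations
    away from the mean of `history` (needs at least two historical values)."""
    mean = statistics.mean(history)
    spread = statistics.stdev(history)
    if spread == 0:
        # No variation in the history: anything different counts as an outlier.
        return latest != mean
    return abs(latest - mean) / spread > threshold

# Hypothetical average RTT readings (ms) for one TURN server.
history = [250.0, 260.0, 240.0, 255.0, 245.0]
normal = is_outlier(history, 258.0)  # close to the usual range
spike = is_outlier(history, 400.0)   # far above the usual range
```

A fixed threshold like this is the simplest possible rule; a real monitor would probably also need to account for time of day and traffic volume.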
But there is one more interesting thing about our data. Why does Seoul have almost double the RTT of Tokyo? Is there a potential problem with scaling in this region, or could it be something else?
First of all, let’s understand the context of that number. How many calls are we talking about for each TURN server in this data set? Preserving the order, we have the following graph.
It does indeed look like some TURN servers are handling more load than others. Seoul is only handling around one tenth of the traffic of the Virginia region, and Virginia has much lower RTT! This also impacts the relevance of the data we get: less data means more uncertainty. We know South Korea has some of the best Internet speeds and connectivity in the world, so it’s reasonable to assume that most traffic in that region is peer to peer, and that the few calls that aren’t probably run behind proxies or other nasty stuff that impacts our service quality. More investigation needed!
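To make the "less data means more uncertainty" point concrete: the standard error of a mean shrinks roughly with the square root of the number of samples. A minimal sketch, using made-up RTT values rather than our real data:

```python
import math
import statistics

def standard_error(samples):
    """Standard error of the mean: sample standard deviation / sqrt(n)."""
    return statistics.stdev(samples) / math.sqrt(len(samples))

# Hypothetical per-call RTT values in ms, with the same spread at two sample sizes.
few_calls = [300.0, 400.0, 500.0, 600.0]
many_calls = few_calls * 10  # ten times as many calls

se_few = standard_error(few_calls)
se_many = standard_error(many_calls)
```

With ten times as many calls, the standard error here drops to less than a third of its original value, so an average from a low-traffic server deserves noticeably less trust than one from a busy server.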
One last thing: how does our relayed latency compare to our non-relayed traffic? For a dataset of 100k calls, the average mean RTT is 367 ms for relayed calls and 227 ms for non-relayed. Breaking this down further is for another blog post, but it does tell us something interesting. The Tokyo TURN servers are actually below the average latency for non-relayed calls globally, meaning they perform pretty well, whereas the European region with Frankfurt and Dublin sits at about the average for relayed calls, as expected. Also, the difference between relayed and non-relayed is not that large, which is good!
I hope this showed you the power you can have with the right tools and people. Want to work with us on WebRTC? We have an open position!