What kind of TURN server is being used?

TURN servers are an essential part of the WebRTC infrastructure as they help with NAT traversal. But how often are they used and what does that usage tell us about the cost of operating a WebRTC service?

There is lots of ICE in Ymerbukkta on Svalbard

Two weeks ago, my friend Gustavo Garcia asked for statistics on how many WebRTC calls get relayed.

Since I like data nerding, he got an answer from me, but it is quite hard to give enough detail in 140 characters. So in this blog post I am trying to answer the following questions:

  • What is the failure rate?
  • What percentage of calls get relayed via TURN servers?
  • What is the relative proportion of the UDP, TCP and TLS TURN variants?

These metrics are quite important for running a WebRTC platform or deciding development priorities.

First, how are such metrics measured? We’re running rtcstats for appear.in to get data for support cases. One of the features extracted across all calls is the type and local type preference of the first active ICE candidate pair, as shown here.

The local type preference (explained in RFC 5245) has implementation-specific values which depend on the browser. For Chrome, a value of 2 means TURN/UDP, 1 means TURN/TCP and 0 (least preferred) means TURN/TLS. Firefox uses 5 for TURN/UDP and 0 for both TURN/TCP and TURN/TLS (which unfortunately means I cannot see the impact of the addition of TURN/TLS in Firefox 54).
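To make the type preference a bit more concrete: RFC 5245 packs it into the top 8 bits of the 32-bit candidate priority, so it can be recovered with a shift. The Python sketch below assumes the feature is derived that way. The relay labels are the Chrome values given above, while 126, 110 and 100 are the RFC's recommended values for host, peer-reflexive and server-reflexive candidates, which is an assumption about Chrome rather than something measured here. The names `type_preference` and `describe` are made up for this illustration.

```python
# Type preference values. The relay values (2, 1, 0) are the Chrome ones
# quoted in the text; 126/110/100 are the RFC 5245 recommended values and
# are assumed (not measured) to apply to Chrome as well.
CHROME_TYPE_PREFERENCE = {
    126: "host",
    110: "prflx",
    100: "srflx",
    2: "TURN/UDP",
    1: "TURN/TCP",
    0: "TURN/TLS",
}


def type_preference(priority: int) -> int:
    """Return the top 8 bits of a 32-bit ICE candidate priority."""
    return (priority >> 24) & 0xFF


def describe(priority: int) -> str:
    """Map a candidate priority to a label, or 'unknown' for other values."""
    return CHROME_TYPE_PREFERENCE.get(type_preference(priority), "unknown")


# RFC 5245: priority = 2^24 * type preference
#                    + 2^8  * local preference
#                    + (256 - component id)
host_priority = (126 << 24) + (65535 << 8) + (256 - 1)
relay_udp_priority = (2 << 24) + (65535 << 8) + (256 - 1)

print(describe(host_priority))       # host
print(describe(relay_udp_priority))  # TURN/UDP
```

Only the top byte matters for this analysis; the local preference and component id in the lower bits are ignored.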

Measuring these statistics when the ICE connection state changes to connected or completed is a potential source of error, as the active candidate pair might still change afterwards.

What is the failure rate?

“Big data” is fun, so let's use a data set of ten million records, restricted to peer-to-peer calls using Chrome.

snoop=# create or replace view dataset as 
select * from features_permanent where (
iceconnectedorcompleted = 't' or
(icefailure = 't' and icefailuresubsequent = 'f')) and
usingicelite = 'f' and
browsername = 'Chrome'
order by datetime desc limit 10000000;
CREATE VIEW

Revisiting last year's post, how many of those calls failed?

snoop=# select count(*), iceconnectedorcompleted
from dataset
group by iceconnectedorcompleted;
  count  | iceconnectedorcompleted 
---------+-------------------------
 9949800 | t
   50200 | f

The failure rate is roughly 0.5 percent. This looks a bit better than last time!
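The rate follows directly from the two counts in the query output. A quick check in Python, with the numbers copied from the table and variable names chosen just for this sketch:

```python
# Counts copied from the query output above.
connected = 9_949_800  # iceconnectedorcompleted = 't'
failed = 50_200        # iceconnectedorcompleted = 'f'

failure_rate = failed / (connected + failed)
print(f"{failure_rate:.2%}")  # 0.50%
```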

What percentage of calls get relayed via TURN servers and what is the relative proportion of TURN/UDP, TURN/TCP and TURN/TLS?

Next, how many of the calls that connected were using TURN? Let's change the dataset slightly to only include calls that got connected:

snoop=# create or replace view dataset as
select * from features_permanent where
iceconnectedorcompleted = 't' and
usingicelite = 'f' and
browsername = 'Chrome' and
firstcandidatepairlocaltypepreference is not null
order by datetime desc limit 10000000;

And run a select:

snoop=# select count(*), firstcandidatepairlocaltypepreference 
from dataset
group by firstcandidatepairlocaltypepreference
order by firstcandidatepairlocaltypepreference desc;
  count  | firstcandidatepairlocaltypepreference 
---------+---------------------------------------
 2046351 |                                   126
  702720 |                                   110
 5477486 |                                   100
 1218542 |                                     2
  507401 |                                     1
   47500 |                                     0
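These raw counts can be turned into shares with a few lines of Python. This is only a sketch: the counts are copied from the output above, and the relay labels use the Chrome mapping described earlier (2, 1 and 0 for TURN/UDP, TURN/TCP and TURN/TLS).

```python
# Counts per local type preference, copied from the query output above.
counts = {
    126: 2_046_351,
    110: 702_720,
    100: 5_477_486,
    2: 1_218_542,
    1: 507_401,
    0: 47_500,
}

# Chrome's relay type preferences as described in the text.
RELAY = {2: "TURN/UDP", 1: "TURN/TCP", 0: "TURN/TLS"}

total = sum(counts.values())

# Share of each TURN variant among all connected calls.
for pref, label in RELAY.items():
    print(f"{label}: {counts[pref] / total:.1%}")

# Share of calls relayed via any TURN variant.
relayed = sum(counts[pref] for pref in RELAY)
print(f"relayed: {relayed / total:.1%}")  # relayed: 17.7%
```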

This shows 12.2% of calls with a local type preference of 2 for TURN/UDP, 5% TURN/TCP and a bit less than 0.5% TURN/TLS. Combined, that is 17.7% of calls that get relayed, slightly higher than the number I gave on Twitter because I sliced the data a bit differently after discovering that the numbers for P2P and our SFU are quite different. And since everyone loves pie charts:

Repeating the analysis for our SFU

Running the same analysis for our SFU was rather surprising. We still use TURN servers because, until recently, ICE-TCP was not enabled in Firefox (this changed in Firefox 54).

snoop=# create or replace view dataset as 
select * from features_permanent where
iceconnectedorcompleted = 't' and
usingicelite = 't' and
browsername = 'Chrome' and
firstcandidatepairlocaltypepreference is not null and
firstcandidatepairlocaltypepreference <= 2
order by datetime desc limit 100000;

The dataset is now restricted to calls that use TURN to make it easier to see what is going on. It's also smaller since only 4% of the SFU calls end up using TURN.

snoop=# select count(*), firstcandidatepairlocaltypepreference
from dataset
group by firstcandidatepairlocaltypepreference
order by firstcandidatepairlocaltypepreference desc;
 count | firstcandidatepairlocaltypepreference 
-------+---------------------------------------
 19650 |                                     2
 78447 |                                     1
  1903 |                                     0
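Since this dataset only contains relayed calls, the shares are not directly comparable to the peer-to-peer numbers. A rough way to put them on the same scale is to multiply by the 4% of SFU calls that use TURN at all. The Python sketch below does that; the 4% figure is the approximate value mentioned above, so the results are ballpark estimates only.

```python
# Approximate share of SFU calls that end up using TURN (from the text).
relayed_fraction = 0.04

# Counts copied from the query output above (relayed calls only).
relayed_counts = {"TURN/UDP": 19_650, "TURN/TCP": 78_447, "TURN/TLS": 1_903}

total_relayed = sum(relayed_counts.values())

for label, count in relayed_counts.items():
    within_relayed = count / total_relayed            # share among relayed calls
    of_all_calls = within_relayed * relayed_fraction  # rough share among all SFU calls
    print(f"{label}: {within_relayed:.1%} of relayed, ~{of_all_calls:.2%} of all SFU calls")
```

On this estimate, well under 1% of all SFU calls use TURN/UDP.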

So 20% of the relayed calls use TURN/UDP while 78% use TURN/TCP and 2% use TURN/TLS. Why is this so different?

The main reason is that our SFU is a server running on the internet and therefore has pretty good UDP connectivity. Also, we allow direct connections via UDP. This means that the remaining 20% TURN/UDP traffic comes from clients which cannot connect directly to the SFU (on a random UDP port) but need to go through the TURN servers, which listen on UDP port 443. That suggests there are quite a few networks which block random UDP ports but still allow UDP traffic on port 443. This is one of the many fascinating surprises I stumble upon while data nerding.
