Gone in 1100 seconds — The weirdest bug I have ever met
A story of hunting bugs on the Edge of WebRTC.
Two days ago we got a strange report from one of our customers: 18 minutes into a video call he was having with one of his clients, the call mysteriously disconnected. And then it happened again — after 18 minutes. Was it a coincidence?
Not only was it a strange bug, it was a highly embarrassing one. Our goal is to make a video conversation tool that is so easy to use that a doctor or psychologist would want to subscribe to our service. Needless to say, calls consistently disconnecting after 18 minutes isn’t in line with that goal.
WebRTC has always meant dealing with bugs
Bugs are not an unknown phenomenon to us. Our product is based on WebRTC, a fairly new web standard that allows two browsers to transmit data, such as real-time video and audio, directly between themselves.
In September 2017, WebRTC finally got support in all major modern browser platforms. We have battled our fair share of bugs these last few months in an attempt to make WebRTC work for everyone. (Working with bleeding edge technologies means it’s a good idea to get familiar with writing good bug reports to browser vendors.)
However, this last bug is the strangest I’ve met in my five years of working with WebRTC.
The tools you need to track down a WebRTC bug
Luckily, thanks to some excellent contributors to the WebRTC community, we have a lot of tools at our disposal when hunting down bugs experienced by customers. You can’t expect a doctor to create a dump of chrome://webrtc-internals and send it to you together with a perfectly reproducible case.
Confirming the bug
We quickly found the relevant files from the two failing conversations. The video conversation was between Chrome 64 and Edge 16. Chrome failed after 18 minutes and 20 seconds. The subsequent ICE restart also failed, which caused us to close the RTCPeerConnection automatically.
The puzzling thing was that Edge did not seem to agree with Chrome, and its iceConnectionState remained connected until it got the ICE restart, finally closing the peer connection as the restart failed. What was going on here?
Reproducing the bug: 18 minutes and 20 seconds is the magic number
A few months ago, we bought a cheap Windows laptop, so that we could run Windows 10 with Edge 16 and see bugs in action for ourselves. We had our Windows laptop enter a video conversation with a Mac running Chrome 64.
Then we waited. And sure enough, after exactly 18 minutes and 20 seconds, the call’s ICE Connection state went to failed, the ICE restart failed, and we closed the peer connection. Never had we seen a bug similar to this before, we were floored.
So what do you do next? Where do you start? We didn’t have some magic timer in our code that would cause the video conversation to fail after 18 minutes and 20 seconds in certain browsers. It felt like an insurmountable task, but there was only one thing we could do.
One step at a time — methodically testing to narrow down the scope of the bug
We had found a reproducible case, but needed to narrow down the scope of our search. We devised a series of tests that would test our assumptions.
First, did this happen between two Chrome browsers, were we’ve typically had the fewest WebRTC issues? On the same two machines, under the same circumstances, we tested Chrome to Chrome. After 18 minutes and 20 seconds, the call was still running, and the ICE Connection state remained stable. So the bug didn’t happen to everyone.
However, testing with Firefox to Edge, it failed again.
The big question was whether this was a bug in Edge, or was it a bug in our code? appear.in is also built on WebRTC and our service shares a lot of characteristics in the way it’s built, since half our team previously worked at appear.in. We decided to test Chrome to Edge in appear.in on the same machines, using peer-to-peer (free) mode. But it didn’t fail!
We didn’t know whether to laugh or cry. Having narrowed down the scope of the bug, the fault seemed to lie in our own code. Confrere couldn’t deal with Chrome to Edge.
When your assumptions break
As every test ended up taking 20 minutes to complete, we decided to call it a day and continue from home.
To double check our assumptions, we decided to retest Chrome to Edge at home. To our big surprise, after 40 minutes of testing, the call was still connected. What!?
We checked everything over and over again. The one thing we noticed was that Chrome reported that the connection was relay <-> stun when we were in the office. This means that one side of the call needed to be forwarded (relayed) through one of our servers, while the other side could send data directly (STUN). This isn’t an uncommon scenario with modern network setups.
Due to the unique network setup at our office, my machine, running Chrome, was connected via cable to a network setup allowing direct connections, and the computer running Edge was connected through a highly unstable WiFi connection that needed to use a relay server. However, on the home network, we were connecting through local candidates, meaning we didn’t have to leave the home network at all.
IceTransportPolicy to the rescue!
Our next step was figuring out why the test, which failed at the office, now suddenly worked. Could it have to do with the stun or relay candidates? Thankfully, we can force a PeerConnection to use relay candidates only by setting iceTransportPolicy to relay when creating our PeerConnection. And surely, 18 minutes and 20 seconds in, the call fails.
Was it just a coincidence that Chrome to Edge worked on appear.in, perhaps we got connected using STUN on appear.in but relay won on our service? After a quick test on appear.in, we concluded that this was indeed our fault, as forcing relay (TURN) on appear.in did not trigger the bug.
At a loss, we decided to try one more thing to narrow things down. We built a custom version of Confrere, and with permission got the ICE credentials from appear.in to use their relay servers in our service for a one-off test. 25 minutes later we could conclude that our client-side application code was not at fault, as running the same code with different relay servers did not trigger the bug.
Logs to the rescue!
Having narrowed down the problem to our TURN server infrastructure, we could finally start fixing the issue, right? Wrong. We still had no idea what was causing the problem.Our TURN infrastructure is even heavily based on what we learned in the four years we built appear.in, using the same open source component, running on the same cloud provider, with the same configuration.
When you are at a complete loss of where to go, you can always turn to logs. Well made logging is the ultimate tool when chasing a bug, as it allows you to take a peep into the real-time state of a running system. We fired up another call, and this time, paid considerable attention to the logs.
At around the 10-minute mark, we started noticing an increase in log messages like this:
turnserver: 1054: session 000000000000000061: realm <confrere.com> user <confrere:1520714985>: incoming packet message processed, error 438: Stale nonce
When the call failed after 18 minute and 20 seconds, we went back through our logs and found the first instance of the Stale nonce message, and saw the following.
turnserver: 1053: session 000000000000000034: refreshed, realm=<confrere.com>, username=<redacted>, lifetime=600
turnserver: 1053: session 000000000000000034: realm <confrere.com> user <redacted>: incoming packet REFRESH processed, success
turnserver: 1053: handle_udp_packet: New UDP endpoint: local addr <redacted>:443, remote addr <redacted>:56031
turnserver: 1053: session 000000000000000061: realm <confrere.com> user <redacted>: incoming packet message processed, error 438: Wrong nonce
turnserver: 1053: session 000000000000000061: realm <confrere.com> user <redacted>: incoming packet message processed, error 438: Stale nonce
Why were we seeing so many messages of Stale nonce? What is a nonce, and how can it be stale?
Understanding the bug: Say hi to “stale nonce”
First of all, if you, like me earlier today, have no idea what a nonce is:
In cryptography, a nonce is an arbitrary number that can only be used once. It is similar in spirit to a nonce word, hence the name. It is often a random or pseudo-random number issued in an authentication protocol to ensure that old communications cannot be reused in replay attacks.
And if a nonce is stale:
A stale nonce is more of a warning than an error. In SIP your authorization credentials are encoded in the SIP headers. To prevent people from capturing that data and using it later to make calls on your account a nonce is used.
The SIP RFC requires that the nonce randomly change periodically. If the client uses a nonce that was expired it is considered a ‘stale nonce’. The client should then get the current nonce and use that instead. This message lets you know that the client tried to use a stale nonce, which can indicate someone trying a replay attack (using captured data from a previous session) or a client that isnt properly
getting the new nonce
At Confrere, we use the coturn open source software to run our TURN servers. If you read the documentation closely, there is a configuration parameter concerning stale nonces:
# Uncomment if extra security is desired,
# with nonce value having limited lifetime.
# By default, the nonce value is unique for a session,
# and has unlimited lifetime.
# Set this option to limit the nonce lifetime.
# It defaults to 600 secs (10 min) if no value is provided. After that delay,
# the client will get 438 error and will have to re-authenticate itself.
In our configuration, we had enabled stale-nonce, which had worked flawlessly for us up until our Edge user base started to grow.
The stale nonce logs rang a bell for Philipp Hancke, who was helping us out understanding the bug. A similar connection issue had been reported to appear.in by Microsofts Shijun Sun in May 2017, causing appear.in to be able to avoid the issue.
We were able to edit the configuration of our TURN server and remove the stale-nonce flag, which would mean that nonces now had an unlimited lifetime.
After 50 minutes in our test setup, we were finally ready to conclude: we had found the bug!
So why 18 minutes and 20 seconds?
You may have noticed the title of this blog post. Gone in 1100 seconds. Throughout this post I have written 18 minutes and 20 seconds quite a lot, but do you know how many seconds that equals? 1100.
We know that after 600 seconds the nonce we had at the beginning of the conversation is invalid. And 500 seconds after that initial stale nonce message, the connection is dropped. Where do the 500 seconds come from?
It seems there is a built-in sleep of 500 seconds after which some kind of check is performed on the connection. I’m not too familiar with the inner workings of the coturn software, but this tells me that if you use an invalid nonce 500 seconds after it gets invalidated, coturn will effectively stop forwarding your packets to the other peer. This will of course cause the other peer to see the connection as failed, as no data is being received.
The coturn server no longer forwarding packets explains the Chrome side of things, as it will observe no more packets coming from the other peer. But why doesn’t Edge also detect this?
A tale of two packets
Connections can be two types in WebRTC: UDP and TCP. Depending on how much you remember from networking class, UDP packets require no confirmation upon reception, is not guaranteed delivery, and may not come in order. In real time communications, this isn’t such a big issue, as codecs far and wide handle packet loss quite well.
TCP is another story. It’s useful when you require all data to be transmitted to the other party, in the order you sent them, and with confirmation of reception. Most traffic in WebRTC is UDP.
In the case of our tale, Edge was transmitting UDP packets through the TURN server to the other party. When the WebRTC server stopped forwarding packets, Edge had no way of knowing whether Chrome was receiving packets or not, so it just happily kept on sending data. After all, Chrome was behaving nicely and sent packets through the STUN address. So when all of a sudden an iceRestart happened Edge had no idea what to do and just silently failed.
The conclusion to the story is that we will be filing bugs towards the Edge bug tracker, as well as deploying a new TURN server stack without the stale-nonce flag for the time being. New servers should ensure that our customers will be able to have long and good video conversations in our service, without disruptions in the future.
I’m also incredibly grateful to be part of a great community of people willing to share knowledge and help each other out. From the browser teams responding and fixing the bugs we report, to individual contributors providing excellent tooling to the WebRTC community free of charge.
A special thank you to Philipp Hancke for pointing us in the direction of the stale-nonce flag, and keeping up with my incessant complaining when things didn’t seem to make any sense at all. In addition, a shoutout to Microsoft’s Shijun Sun for discovering this issue back in May 2017.