The Evolution of Calling

Published in

TextNow Engineering Blog

Jul 17, 2018

As we’re releasing our final version of Elastic Calling to TextNow this week, we thought we’d take a moment to look back on Armando Murga’s deep dive into the architecture from last year, originally published in March 2017. (Ed.)

Big telecoms tend to have a few things in common:

They’re hard to get a hold of for support and customer care
They bury their customers in piles of fees
They often use the public switched telephone network (PSTN) to transport phone calls

TextNow is an alternative to these “big guys”. We are a carrier in the US, similar to AT&T, Verizon, or Sprint, but we combine the affordability and flexibility of VoIP, the feature-rich ecosystem of mobile applications, and the wide availability of the PSTN to bring our customers a unique offering. We have created the ability to place calls across VoLTE, traditional cellular voice, and WiFi networks seamlessly, so that our users can take advantage of every available network.

My name is Armando Murga, Architect @ TextNow, and I’m part of the team working to revolutionize the calling experience at TextNow.

The Basics: Voice over Internet Protocol (VoIP)

VoIP is the idea of transmitting voice audio over the Internet as opposed to using traditional circuit-switched transmissions. This simple idea has transformed the telecommunications industry, bringing greater flexibility and affordability; think of voice applications like Skype.

VoIP has been around for a while, but it wasn’t popularized in the mobile world until the last decade or so. A few practical limitations kept VoIP away from the mobile world:

Cellphones were expensive and not the powerful mini computers we see today
Networks had to learn how to transport digital audio across the PSTN
Cell towers supporting 2G and audio compression algorithms did not exist

Thankfully we live in a better world now…

Calling v1.0: The Beginning

How does it work? For most of the day, we find ourselves close to a WiFi access point, and TextNow takes advantage of this fact. In our first iteration of calling, when a user on WiFi triggered an intent to dial a number, we quickly connected the call via WiFi. We prioritized WiFi calling because it removes the need for cellular coverage, which isn’t always available: think of the underground parking at the office, your basement at home, or being on vacation outside the US, where traditional cellphones require roaming to work.

WiFi has its own difficulties: the device can move out of range, the spectrum can become congested when there are many routers nearby (making the signal unreliable), or the connection can become dreadfully slow when everyone in the office is watching the soccer World Cup (hypothetically speaking, of course!).

Although WiFi calling often provides higher-quality audio, we noticed that relying on WiFi as the only network sometimes introduced other problems: calls that took a long time to connect, delayed or garbled audio, and dropped calls.

Calls Take Forever To Connect

Since we’re a carrier, we can take full advantage of the traditional cellular network, so our problem wasn’t that we had no other way to connect the call, but rather that we didn’t know when we needed to fail over to the PSTN.

This problem wasn’t too difficult to solve. Initially we thought that if the WiFi network had no Internet access, or our SIP library failed to register with our servers in time, then we could simply fail over to the PSTN network.
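The fail-over rule above is simple enough to express directly. The sketch below is illustrative only: the `NetworkStatus` fields, the `choose_route` function, and the 2-second `SIP_REGISTRATION_TIMEOUT_MS` budget are hypothetical names and values, since the post only says registration has to happen in time.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Route(Enum):
    WIFI = "wifi"
    PSTN = "pstn"


@dataclass
class NetworkStatus:
    has_internet: bool                     # result of a reachability check
    sip_registration_ms: Optional[float]   # None if registration never completed


# Hypothetical budget; the post does not give TextNow's actual value.
SIP_REGISTRATION_TIMEOUT_MS = 2000.0


def choose_route(status: NetworkStatus) -> Route:
    """v1.0-style rule: fail over to the PSTN when WiFi has no Internet
    or SIP registration did not complete within the budget."""
    if not status.has_internet:
        return Route.PSTN
    if (status.sip_registration_ms is None
            or status.sip_registration_ms > SIP_REGISTRATION_TIMEOUT_MS):
        return Route.PSTN
    return Route.WIFI
```

Note that the rule only looks at whether the network works at all, not at how well it works, which is exactly the gap the next two problems expose.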

Audio Delays or Garbled Audio

Bad audio is a symptom of a bad network, so we thought that before placing a call, we needed to test the network to ensure that the call could survive. This could also be a recurring test, so that the audio stayed consistently good throughout the call.

We started with basic ping tests. The idea was that if the ping test came back within appropriate thresholds, the network was good and the call would be good too. And it worked! We had our first quick way to test whether the network could sustain a call, and it also helped with our call connection time, as we could set the ping test to fail quickly if necessary.
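A ping-based gate of this kind might look like the following sketch. It assumes the round-trip times have already been collected by whatever ping mechanism the platform offers (a timed-out ping is recorded as `None`), and the thresholds `MAX_AVG_RTT_MS` and `MAX_LOSS_RATIO` are hypothetical placeholders rather than values from TextNow.

```python
from statistics import mean
from typing import Optional, Sequence

# Hypothetical thresholds; the post does not say which values TextNow used.
MAX_AVG_RTT_MS = 150.0
MAX_LOSS_RATIO = 0.1


def ping_test_passes(rtts_ms: Sequence[Optional[float]]) -> bool:
    """Return True if the ping replies came back within the thresholds.

    Each entry is one ping's round-trip time in ms, or None if that ping
    timed out. Short per-ping timeouts keep the whole test fast.
    """
    if not rtts_ms:
        return False
    replies = [r for r in rtts_ms if r is not None]
    loss = 1 - len(replies) / len(rtts_ms)
    if not replies or loss > MAX_LOSS_RATIO:
        return False
    return mean(replies) <= MAX_AVG_RTT_MS
```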

Dropped Calls

Remember that a call is only as good as the network that carries it. I’m sure you’ve tried to download a large file from the cloud on a poor, unreliable connection, and it totally sucks! The download fails over and over because the connection drops before the download can complete.

This problem was an interesting one. A complete call drop wasn’t the average case, but it was still a significant case that we needed to solve. This is a Calling 3.0 problem.

We are teaching our app how to get to know its environment better

Calling v2.0: Buckle Up!

In v1.0, we solved the obvious problems introduced when the WiFi connection is really bad, but we kept questioning whether a ping test was really sufficient. The short answer is no. The long answer is also no, followed by a lot of complex networking and VoIP-related reasons.

For example, a ping test uses a very specific packet size, it travels over a different protocol than audio packets (ICMP rather than the UDP that typically carries VoIP audio), and it can be treated differently by certain networks. I’m having a “same same, but different” moment. You may notice that I haven’t mentioned the mobile data network yet; the reason is that we can usually treat it similarly to WiFi. However, we started to notice that under some conditions a ping test would behave quite differently on the cellular network, for better or for worse. Think of a really slow connection with a pretty sweet ping response time, or vice versa. This really threw a wrench into our goal of not placing a call on a bad network.

Meet the Packet Test

So we created a QoS back-end service that can talk to our clients in a language similar to VoIP’s, to accurately calculate the jitter and latency of their network. This QoS service receives packets from the client and sends packets back to it, so that the client can run a quick analysis of its connection. The packets also look more like audio packets (timed and sized precisely), which put us much closer to the “same same” side of the “same same, but different” quote.
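On the client side, the analysis of such a packet test could be sketched as below. This is an assumption-laden illustration, not TextNow’s service: the network I/O is left out, `analyze_probe` works only on send and echo timestamps taken from the client’s own clock, latency is approximated as half the round-trip time, and jitter uses the interarrival-jitter estimator from RFC 3550 applied to round-trip times instead of one-way transit times.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional, Sequence


@dataclass
class ProbeResult:
    latency_ms: float   # estimated one-way latency (RTT / 2)
    jitter_ms: float    # RFC 3550-style smoothed jitter
    loss_ratio: float   # fraction of packets whose echo never came back


def analyze_probe(sent_ms: Sequence[float],
                  received_ms: Sequence[Optional[float]]) -> ProbeResult:
    """Analyze one packet test (hypothetical helper).

    sent_ms[i] is when audio-sized packet i left the client; received_ms[i]
    is when the server's echo arrived, or None if it was lost. Both come
    from the client's clock, so no clock sync with the server is needed.
    """
    pairs = [(s, r) for s, r in zip(sent_ms, received_ms) if r is not None]
    loss = 1 - len(pairs) / len(sent_ms) if sent_ms else 1.0
    if not pairs:
        return ProbeResult(float("inf"), float("inf"), loss)

    rtts = [r - s for s, r in pairs]
    # RFC 3550 section 6.4.1: J += (|D| - J) / 16, where D is the change in
    # transit time between consecutive received packets.
    jitter = 0.0
    for prev, cur in zip(rtts, rtts[1:]):
        jitter += (abs(cur - prev) - jitter) / 16
    return ProbeResult(mean(rtts) / 2, jitter, loss)
```

Sending packets on a fixed, audio-like cadence (for example every 20 ms) is what makes the jitter figure meaningful: any spread in the echo times then comes from the network, not from the sender.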

This is quite a simplistic introduction to a service that needed to be very fast. The benefit of the ping test was that it was already available locally on the client and could be configured to fail quickly if necessary. With the QoS back-end service, by contrast, the client would need to contact it, download packets, and upload packets, all within a few seconds.

One of our golden rules is that our users should not have to wait longer than 3 seconds, on average, for their call to connect, which means that we have to decide whether the call needs to be connected through WiFi or PSTN within that time.
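One way to enforce a budget like this is to race the network test against a deadline and fall back to the PSTN when the test cannot finish in time. The sketch assumes an asynchronous `run_packet_test` coroutine that returns `(latency_ms, jitter_ms)`; that name, the quality thresholds, and the idea of giving the test only part of the 3-second budget are illustrative assumptions.

```python
import asyncio
from typing import Awaitable, Callable, Tuple

# Hypothetical quality gate for the packet test.
MAX_LATENCY_MS = 150.0
MAX_JITTER_MS = 30.0


async def decide_route(
    run_packet_test: Callable[[], Awaitable[Tuple[float, float]]],
    budget_s: float,
) -> str:
    """Pick "wifi" or "pstn" without ever waiting longer than budget_s."""
    try:
        latency_ms, jitter_ms = await asyncio.wait_for(
            run_packet_test(), timeout=budget_s)
    except asyncio.TimeoutError:
        # A test that can't finish in time is treated as a failed test.
        return "pstn"
    if latency_ms <= MAX_LATENCY_MS and jitter_ms <= MAX_JITTER_MS:
        return "wifi"
    return "pstn"
```

Treating a slow test as a failed test keeps the worst case bounded: the user waits at most `budget_s` for the routing decision, whatever the network does, leaving the rest of the 3 seconds for actually connecting the call.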

This proved to be very effective. We quickly noticed improvements in how accurately we were determining whether a network was good. We measured these improvements using call ratings and call duration.

Meet the Call State Machine (CSM)

Let’s talk about the heart of the CSM. Since audio packets are transported through the network, we realized that we could apply basic networking techniques to measure latency, jitter, and packet loss. And there’s a formula that combines them into a single number: the Mean Opinion Score (MOS).

The MOS is a real-time, objective way of measuring VoIP call quality: it estimates how a human would rate the quality of the network and the call.

Receiving a 1 means that the call is really bad
Receiving a 4.4+ means that the call is really great

The formula for calculating the MOS score is rather complicated if you read academic papers, but there are short forms online that aren’t too bad, and can be a good starting point.
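As one such starting point, the sketch below uses a simplified form of the ITU-T G.107 E-model that circulates widely online: it folds latency, jitter, and packet loss into an R-factor and then maps R to a MOS. It is not necessarily the formula TextNow uses, and the constants belong to that common approximation rather than to this post. On a perfect network it tops out at about 4.4, which lines up with the scale above.

```python
def estimate_mos(latency_ms: float, jitter_ms: float, loss_pct: float) -> float:
    """Estimate a MOS (1.0 to about 4.4) from one-way latency, jitter and loss.

    Uses a widely circulated simplification of the ITU-T G.107 E-model;
    the constants come from that approximation, not from TextNow.
    """
    # Jitter is weighted double, plus 10 ms for codec/protocol overhead.
    effective_latency = latency_ms + 2 * jitter_ms + 10

    # Delay costs little up to 160 ms, then degrades the R-factor faster.
    if effective_latency < 160:
        r = 93.2 - effective_latency / 40
    else:
        r = 93.2 - (effective_latency - 120) / 10

    # Each percentage point of packet loss costs 2.5 R-factor points.
    r -= 2.5 * loss_pct

    if r <= 0:
        return 1.0
    # E-model R-to-MOS mapping. For non-negative inputs R stays below 93.2,
    # so no upper clamp is needed; max() guards the slight dip below 1.0
    # that the cubic term produces for small R.
    return max(1.0, 1 + 0.035 * r + 7e-6 * r * (r - 60) * (100 - r))
```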

We calculate the MOS every 500 to 1000 ms throughout the call, and this opened up big possibilities, because we can now optimize the quality of the call and the network in real time.

Say a user starts a call on good WiFi, but then the MOS deteriorates. We taught the CSM to remember the state of the call and the available network interfaces at all times, so it knows to test the cellular data network when the MOS dips too low and to transfer the call to data seamlessly. Speed matters here: in the worst cases, we may have only 3 seconds of poor audio before we lose the call completely, or before we lose the user’s interest in it.
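The CSM behaviour described above can be sketched as a small state machine that consumes one MOS sample at a time. Everything here is hypothetical: the `CallStateMachine` class, the 3.5 MOS threshold, the 1.5-second tolerance for poor audio (chosen to act well inside the 3-second worst case), and the `available` set, which stands in for whichever networks currently pass a quick test.

```python
from enum import Enum
from typing import Dict, Optional, Set


class Network(Enum):
    WIFI = "wifi"
    DATA = "data"
    PSTN = "pstn"


# Hypothetical tuning values, not TextNow's.
MOS_THRESHOLD = 3.5       # below this, a sample counts as poor audio
SAMPLE_INTERVAL_S = 0.5   # one MOS sample every 500 ms
MAX_POOR_AUDIO_S = 1.5    # move the call well before ~3 s of poor audio

# Preferred fallback order: WiFi, then mobile data, then PSTN.
FALLBACK: Dict[Network, Network] = {
    Network.WIFI: Network.DATA,
    Network.DATA: Network.PSTN,
}


class CallStateMachine:
    """Remembers the current network and which networks are usable."""

    def __init__(self, network: Network, available: Set[Network]) -> None:
        self.network = network
        self.available = set(available)
        self.poor_for_s = 0.0

    def on_mos_sample(self, mos: float) -> Optional[Network]:
        """Feed one MOS sample; return the new network if the call moved."""
        if mos >= MOS_THRESHOLD:
            self.poor_for_s = 0.0  # good audio resets the clock
            return None
        self.poor_for_s += SAMPLE_INTERVAL_S
        if self.poor_for_s < MAX_POOR_AUDIO_S:
            return None
        # Walk down the fallback chain, skipping networks that aren't usable.
        target = FALLBACK.get(self.network)
        while target is not None and target not in self.available:
            target = FALLBACK.get(target)
        if target is None:
            return None  # nowhere better to go; stay on the current network
        self.network = target
        self.poor_for_s = 0.0
        return target
```

Requiring several consecutive poor samples, rather than reacting to a single one, is a simple guard against moving the call on a momentary blip.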

The Last Frontier

If a call starts on good WiFi, we can transition it to data seamlessly; but what if we’re already on data? The only other network available to carry the call is the traditional cellular network. No problem, we’re a carrier! We have the ability to join a call between the data network and the traditional cellular network without a blip.

This transition wasn’t technically difficult, but it needed to be clever, because a lot is happening on the device at the moment of transition: both the fact that the device is transitioning the call and the transition of the call itself (i.e. the audio) have to stay invisible to the user. The transition needs to be:

Fast. PSTN networks aren’t fast networks, and they can introduce a delay in the transition of the call. Our main focus was to optimize our data usage and network monitoring during a phone call to ensure that we really limited the number of PSTN transitions because PSTN delays are largely outside of our control.
Reliable. When the transition to the PSTN network takes place, we cannot drop the call. Remember how we use SIP for our calls? Well, SIP is a bit chatty, and when the network is becoming degraded, SIP will try to reconnect, there will be socket timeouts for the media servers, etc. So before we tear down SIP, we must be sure that the call is completely transferred to the PSTN so that we don’t accidentally drop the call.
Seamless. Our black magic cannot interfere too much with the user’s calling experience or device; the device may already be busy with other apps asking for network connectivity and updating its own network signal strength. Doing too much in our app, even in background threads, can worsen what may already be a bad situation for the call.
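The "Reliable" requirement amounts to make-before-break: the SIP leg is torn down only after the PSTN leg is confirmed to be carrying the call. The following sketch shows that ordering; the `CallLeg` protocol, its methods, the 5-second timeout, and `FakeLeg` (a stand-in used only to exercise the logic) are hypothetical and do not reflect TextNow’s actual APIs.

```python
from typing import Protocol


class CallLeg(Protocol):
    """Hypothetical interface for one leg of a call (SIP over data, or PSTN)."""

    def connect(self, timeout_s: float) -> bool: ...
    def is_carrying_audio(self) -> bool: ...
    def hang_up(self) -> None: ...


def hand_off_to_pstn(sip_leg: CallLeg, pstn_leg: CallLeg,
                     timeout_s: float = 5.0) -> CallLeg:
    """Make-before-break: tear down SIP only once PSTN carries the call."""
    if pstn_leg.connect(timeout_s) and pstn_leg.is_carrying_audio():
        sip_leg.hang_up()
        return pstn_leg
    # PSTN leg failed: keep the degraded SIP leg rather than dropping the call.
    pstn_leg.hang_up()
    return sip_leg


class FakeLeg:
    """Stand-in used only to exercise the handoff logic."""

    def __init__(self, name: str, connects: bool = True, up: bool = False) -> None:
        self.name, self.connects, self.up = name, connects, up

    def connect(self, timeout_s: float) -> bool:
        self.up = self.connects
        return self.connects

    def is_carrying_audio(self) -> bool:
        return self.up

    def hang_up(self) -> None:
        self.up = False
```

Keeping the old leg alive until the new one is confirmed costs a moment of duplicated signalling, but it means a failed PSTN setup leaves the user on a degraded call instead of no call at all.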

We are teaching our app how to monitor all available networks and to be ready to transfer the call quickly.

Calling v3.0: We’re Growing Up

We have gone from a naive WiFi calling application to a far more sophisticated calling service that can take full advantage of all networks. The evolution to v1.0 and v2.0 was a huge accomplishment.

The application seemed to behave well in great networking conditions or rapidly fading networks: we were holding calls and transitioning to the PSTN as a last resort. However, as the app matured, so did our testing, and we quickly found “room for improvement”.

We were asking tougher questions:

What happens when we go from a good network condition to a sudden loss of network?
In slowly degrading networks, can we transfer the call from WiFi to mobile data faster?
How can we minimize call transitions between networks? (i.e. guess the best network more accurately from the beginning)
Is the MOS score good enough to determine that a call is good?

We’re getting closer to the crux of the problem. The problems are getting tougher, and the solutions more exciting as we discover new ways to evolve the calling experience for our customers.

If you find these types of challenges particularly compelling and want to get involved, check out some of our job openings. Who are we? Check out the video below — we can’t wait to meet you!