Why Weave Uses a Hybrid Cloud

Weave Lab · Published in Weave Labs · 7 min read · Apr 12, 2018

At our last board meeting, one of our investors asked why we didn't have 100% of our infrastructure on "the cloud", in this case AWS. I quickly quipped that it was because of reliability, which wasn't convincing enough for him (and rightly so). It caused me to reflect over the next couple of weeks, and I decided to write this post to share why we originally chose a hybrid cloud approach, and to take the time to re-evaluate the assumptions and reasons that got us here.

Reason #1: Our History

When Weave started selling hosted VoIP service over six years ago, the consensus in the open source VoIP community, in particular the FreeSWITCH community (the software we use for our hosted PBX), was that virtualization technology wasn't good enough for low-latency, real-time media needs. Even Barracuda, the company that offered enterprise support for FreeSWITCH at the time, would not offer support for FreeSWITCH running on virtual environments. RingCentral, Vonage, Ooma, and many others were all running on bare metal at the time as well.

Because of this, we decided not to risk compromising call quality and built out Weave's infrastructure on bare metal from the beginning. This forced us to create a solid plan for scaling our hardware and to make scaling easy and fast. Tools like Salt helped us achieve super fast server provisioning, and heavy load testing allowed us to stay ahead of the curve on our hardware needs.

Because of our history, we already own a big chunk of hardware and have a solid scaling plan in place. That is the first reason we run a hybrid cloud infrastructure. But do the time spent managing our own infrastructure, its stability, and its cost give us a reason to migrate entirely to the cloud? I'll address each of those shortly.

Reason #2: Networking Control & Regionality

Fast forward to today, and virtualization has proven that it can handle real-time media in most cases. Vonage acquired a phone company run entirely on AWS, Twilio has shown it can be done, and Amazon support reps use an in-house phone system built on a large FreeSWITCH cluster that runs entirely on AWS. Containerization is bringing additional optimizations by getting services even closer to the metal. At Weave we have spun up two data centers on AWS, in Virginia and California, with great success.

But occasionally things go wrong. The internet is a fickle beast, and BGP routes are always changing and adjusting. In particular, after a major DDoS attack on several of the major DNS providers in the US last year, public internet routes changed all over the place, causing widespread call quality issues for our customers. To avoid incidents like this, we are looking at managing our own BGP routes and creating peering agreements with major ISPs to avoid unnecessary latency, instead of relying on best-effort internet routes. Unfortunately, AWS doesn't give us enough control over the networking to accomplish either of those, so we are left manually moving customers to different data centers when they experience issues.

NOTE: We are looking at installing hardware routers in AWS data centers and direct-connecting to our AWS instances, which would let us gain more control over our networking with Anycast and peering agreements while keeping the benefits of cloud infrastructure.

Also, until we work out peering agreements (which I don't believe can be done directly with AWS anyway), regionality makes a big difference for us. Spinning up our data center on the east coast drastically reduced call quality issues for east coast customers, and the same was true for our west coast data center. Unfortunately, Amazon doesn't have a presence anywhere in the Mountain or Central regions of the US, so our own Utah data center has been great for customers in those regions with bad routes to the coasts or to AWS.

Reason #3: Cost

I know this one is controversial, because the time spent on managing infrastructure is sometimes hard to measure, but I will make my case. Weave has two key differences that make our costs higher on cloud providers than most other software companies:

  1. HIPAA
  2. VoIP

Because we maintain a HIPAA-compliant infrastructure, we are required to have dedicated tenancy on AWS. This means we pay $2 per hour per region just to get dedicated tenancy, and then we pay an additional premium on each machine we spin up. Between our two AWS data centers we pay over $40k more per year just for dedicated tenancy.
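For anyone curious where that figure comes from, here is a back-of-the-envelope sketch of the regional fee alone (the $2/hour figure is from this post; the per-machine premium varies by instance type and is not modeled here):

```python
# Back-of-the-envelope sketch of the dedicated-tenancy regional fee.
HOURS_PER_YEAR = 24 * 365  # 8,760

regional_fee_per_hour = 2.00  # dedicated tenancy fee, per region
regions = 2                   # Virginia and California

base_fee = regional_fee_per_hour * regions * HOURS_PER_YEAR
print(f"Regional dedicated-tenancy fees alone: ${base_fee:,.0f}/year")
```

The per-machine dedicated premium on top of this base fee is what pushes the total past $40k per year.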

In addition, because we are a hosted VoIP provider and need low-latency for our real-time media, we are also forced to over-provision resources to overcome what appears to be a Packets Per Second limit on smaller EC2 instances. Engineers at Flowroute, a SIP provider for VoIP companies like ours, told me that their rule of thumb is to never let the processors on EC2 instances go over 50% or else they see call quality degradation. This means we have to pay for more resources than other companies to maintain quality.
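To make the impact of that rule of thumb concrete, here is a small sketch of what keeping CPU under 50% means for provisioning (the workload numbers are hypothetical, for illustration only; they are not our actual fleet):

```python
import math

# Hypothetical workload, for illustration only.
peak_vcpus_needed = 8.0      # vCPUs of real work at peak
max_utilization = 0.50       # Flowroute's rule of thumb: stay under 50%
vcpus_per_instance = 2

# Capacity we must provision so utilization never crosses the ceiling.
provisioned_vcpus = peak_vcpus_needed / max_utilization
instances = math.ceil(provisioned_vcpus / vcpus_per_instance)

print(f"Provision {provisioned_vcpus:.0f} vCPUs across {instances} instances")
print(f"Over-provisioning factor: {provisioned_vcpus / peak_vcpus_needed:.1f}x")
```

In other words, a 50% ceiling effectively doubles the compute we have to pay for relative to a workload that could safely run instances hot.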

NOTE: Our architecture doesn’t allow us to automatically spin up more servers during busy hours, and turn them off during low traffic times. We are actively working on this but until then, we can’t see the savings of on-demand resources.

EDIT: I was just informed about this feature in AWS, which may help us alleviate the need to over-provision resources: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/enhanced-networking.html

Here is a basic overview of our costs:

Utah Datacenter

We have an equipment lease which includes a PureStorage SAN with disks, 6 rack mount servers (128 gigs of RAM each), a high performance switch, and all the cabling and other needs necessary to run the entire stack. The lease is for three years and also includes a three year extended support agreement to replace any failed hardware. At the end of the lease there is a $1 buyout and we own the equipment.

NOTE: The Utah datacenter not only runs a full hosted VoIP system, it also runs our entire SaaS offering which includes a high memory redundant Postgres cluster and dozens of extra services that we don’t run on AWS.

Total monthly cost for equipment: $4,305
Total monthly cost for colo, including a full rack, redundant ISP service (300 Mbps dedicated), and a direct connect to AWS: $2,604

Total direct expenses for self-managed colo: $6,909

NOTE: One other advantage with owning hardware is that when the lease is up, we essentially use the equipment “for free” until end of life, which gives us extra cost savings.

Virginia Datacenter (AWS)

In our Virginia AWS datacenter, we only run our hosted VoIP service. This includes 2 redundant SIP proxies, 2 redundant SBCs/Registrars, 10 media servers, and 3 Couchbase servers, all running on dedicated tenancy in a VPC.

EC2: $5,970
Data Transfer: $1,373
Support: $734

Total direct expenses for a VoIP-only datacenter on AWS: $8,077

So direct expenses are cheaper on our own hardware, and when you factor in that AWS is only running our VoIP system, it is significantly cheaper. When we ran the AWS price calculator to see how much it would cost to run our Postgres cluster on AWS RDS with matching specs (HA Postgres with master and slave, 64 gigs of RAM on each instance), it came out to another $1,500 per month just for running our database.
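Putting the monthly figures from this post side by side makes the gap easy to see (the RDS number is the calculator estimate mentioned above, not a billed cost):

```python
# Monthly figures quoted in this post.
utah_colo = 4305 + 2604            # equipment lease + colo/ISP/direct connect
aws_voip_only = 5970 + 1373 + 734  # EC2 + data transfer + support
rds_postgres_estimate = 1500       # AWS calculator estimate for HA Postgres

aws_with_database = aws_voip_only + rds_postgres_estimate

print(f"Utah colo (VoIP + full SaaS stack): ${utah_colo:,}/month")
print(f"AWS (VoIP only):                    ${aws_voip_only:,}/month")
print(f"AWS (VoIP + RDS Postgres):          ${aws_with_database:,}/month")
```

And remember the Utah figure covers the full SaaS stack, dozens of extra services, and the Postgres cluster, while the AWS figures cover VoIP (plus, in the last line, only the database).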

Reason #4: Flexibility & Redundancy

I have heard several people say at conferences recently that Kubernetes is "democratizing infrastructure management." I couldn't agree more. One of the best examples of how this helps companies is how much easier it makes running hybrid clouds. Because the vast majority of our services run on Kubernetes, we can easily run them on every major cloud provider (all of which now offer a hosted Kubernetes service) or on-premise. Shifting workloads is trivial. If our on-premise systems are having issues, we can shift resources to Google Cloud or AWS and redirect traffic there. If AWS is having an S3 failure, we can move traffic back to our on-premise Kubernetes cluster running Minio. Kubernetes and a handful of other open source tools allow us to essentially run a basic cloud hosting provider on our own servers.

EDIT: A new article was just posted with a section about going "multi-cloud" with Kubernetes, further illustrating how it is bringing extra flexibility to infrastructure. https://www.softwaredaily.com/#/post/5a5a2387f43c8d000457a110

For new and exciting cloud features that we can't run on our own hardware, like Google's Spanner service, we can access the service through our direct connect and take advantage of everything the cloud has to offer from our on-premise servers.

Finally, our customers are small, local businesses that depend on their phone system to run their business. Knowing we don’t have all our eggs in one cloud provider basket is an extra comfort that allows me to sleep better at night.

Reasons to go 100% cloud

I wanted to finish with some of the compelling reasons to ditch our hybrid infrastructure and go 100% to the cloud.

  • Kubernetes makes true “multi-cloud” a possibility and would allow us to get flexibility and redundancy by having two cloud providers if we wanted.
  • The major cloud providers are adding so many features to help with debugging, tracing, logging, etc., which would save time across the entire department at little to no cost.
  • Even though we have a fast and simple hardware scaling system, it will never be as fast and as flexible as a cloud provider. If we start to grow significantly faster, we may find it is just easier to continue to scale on the cloud and stop purchasing and setting up hardware.
  • The major cloud providers are allowing more and more networking flexibility and services are starting to pop up to help optimize cloud networks for low-latency needs. If there was a simple way to get peering agreements with the major carriers and the ISPs right to our servers running on the cloud providers, that might solve most of our needs going forward.

Conclusion

Because of the reasons listed, I am happy with our decision to implement a hybrid cloud infrastructure. But I am also seeing that the cloud providers are actively solving all the problems that keep us on a hybrid system, in particular with the networking layer. The motto of our development department is “Never Satisfied”, and so we will continue to experiment and rethink our decisions. If the time comes that we see the benefits outweighing the costs, we will happily migrate 100% of our services to cloud providers.
