How to Beat Internet Latency?
Monkey around the world to learn about cloud network latency
TL;DR
We conducted network latency tests among a large number of virtual machines provisioned around the globe, across multiple cloud providers. The results show that with smart routing we may be able to build fast networks, maybe even faster than the internet as we know it.
What is this project about?
Can we build a faster global network by combining multiple cloud providers? Can we route network traffic more efficiently around the globe? Can we surf the web faster? Can we build the network infrastructure for a fast, globally distributed datastore? Can companies provide faster access to their online services? Can we build exciting new services that are of great value to customers?
To start answering those questions, we started a project called Relay Monkey.
How did we do it?
The basic idea is to build a mesh of virtual machines around the globe that ping each other and record the results. We threw Fortune 500 websites into the mix to have some fun.
The infrastructure needed to produce the results was provisioned with Terraform and configured with Ansible. The results were analysed using Wolfram Mathematica.
Terraform
Terraform was used to provision 172 virtual machines running Debian 10, spanning 7 cloud providers: Alibaba Cloud, Amazon Web Services (AWS), DigitalOcean, Google Cloud Platform (GCP), Hetzner, Linode and Microsoft Azure.
We covered the majority of available regions and availability zones with minimal configuration, but some zones simply refused to cooperate:
- The Great Firewall of China (GFW) heavily throttles the github.com domain in some regions, and we needed GitHub to download support applications. This affected Alibaba instances.
- GCP returned The zone '...australia-southeast1-c' does not have enough resources available to fulfill the request.
- Azure makes it difficult to deploy to all regions listed by az account list-locations -o table, and so on.
So-called Terraform providers, wrappers for managing a given provider's infrastructure, are not all implemented the same way. Documentation is sufficient in all cases, but the same functionality is not present in all of them. To see how elegant it was to deploy 60+ instances across all GCP regions and availability zones, feel free to check out the repository. Using AWS wasn't that elegant.
Additionally, proper respect goes to DigitalOcean, Hetzner and Linode for ease of use. For DigitalOcean and Hetzner, much of the simplicity comes from data blocks that list all available regions. In general, these blocks return data that can be used to automate parts of the process, e.g. the list of available regions, available instance types and other provider-specific data. Alibaba Cloud, AWS, Linode and Microsoft Azure don't provide this functionality. Worth mentioning: Alibaba Cloud is fast when its APIs work as expected.
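To illustrate, here is roughly what such a data block looks like for DigitalOcean (a minimal sketch based on the digitalocean provider's documented schema; double-check the docs for your provider version):

```hcl
# List all currently available DigitalOcean regions and expose their
# slugs, so droplet resources can be fanned out across every region.
data "digitalocean_regions" "available" {
  filter {
    key    = "available"
    values = ["true"]
  }
}

output "region_slugs" {
  value = data.digitalocean_regions.available.regions[*].slug
}
```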
When provisioning larger infrastructure, you may want to use the -parallelism flag to speed up the process. We had mixed success, but it helped in general. If the flag is set too high, some API requests fail. The trick is to find the sweet spot, which may differ between terraform apply and terraform destroy.
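For example (the values below are just starting points; the default is 10):

```sh
terraform apply -parallelism=30    # more concurrent resource operations
terraform destroy -parallelism=50  # the sweet spot for destroy may differ
```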
Ansible
Once the infrastructure was provisioned, Ansible jumped in as a tool that helped us configure those instances.
To run Ansible against the desired instances, we had to build a so-called inventory of instances. Terraform output was defined in a way that provided all the information needed to build the inventory, and it was exported as JSON to be consumed by Jinja. We had some fun with Jinja, reformatting the Terraform output to match a recognisable Ansible inventory format. Jinja is a powerful templating engine that basically allows you to convert input into the desired output, e.g. JSON into an arbitrary data structure.
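A minimal sketch of that conversion, assuming a hypothetical Terraform output named inventory that maps instance names to public IPs (the real output shape and host variables differ):

```python
# Render an Ansible INI inventory from `terraform output -json inventory`.
import json
import subprocess

from jinja2 import Template

TEMPLATE = """\
[relay_monkey]
{% for name, ip in instances.items() -%}
{{ name }} ansible_host={{ ip }} ansible_user=root
{% endfor %}
"""

raw = subprocess.run(
    ["terraform", "output", "-json", "inventory"],
    capture_output=True, check=True, text=True,
).stdout
instances = json.loads(raw)  # e.g. {"gcp-us-west1-b": "203.0.113.7", ...}

print(Template(TEMPLATE).render(instances=instances))
```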
The Ansible playbook ended up with 11 tasks.
We've learned that tweaking Ansible can significantly speed up running the playbook, which becomes very noticeable at larger scale. We would like to share some tips, together with output tweaks that give a better overview of the current status while running.
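A minimal sketch of the kind of ansible.cfg tuning we mean; the values are illustrative and the available options depend on your Ansible version:

```ini
[defaults]
# Run tasks on many hosts in parallel (the default is only 5).
forks = 50
# Freshly provisioned hosts have unknown SSH host keys.
host_key_checking = False
# More readable per-task output.
stdout_callback = yaml
# Print total runtime and per-task timings while running.
callback_whitelist = timer, profile_tasks

[ssh_connection]
# Reduce the number of SSH operations per task.
pipelining = True
```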
Relay Monkey
The most important step in our Ansible playbook is Run Relay Monkey Agent container. Relay Monkey Agent is one of the two components of the Relay Monkey microservice system; the second is Relay Monkey Jungle. The Agent was deployed on all instances, ran tests and reported back to the Jungle. Both are written in Python.
Relay Monkey Agent runs as a Docker container on every instance. During application bootstrap, the agent registers the instance with a unique ID via the Jungle API. Scheduled tasks pull information about all registered agents and the sites to be tested, and then execute the tests. Results based on 3 consecutive pings are reported back to the Jungle API.
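One agent round looks roughly like the sketch below; the endpoint paths and payloads are our illustration here, not the actual Relay Monkey API:

```python
# One round of a Relay Monkey-style agent: register, fetch targets,
# ping each one 3 times, report the average RTT.
import re
import subprocess
import uuid

import requests

JUNGLE = "https://jungle.example.com/api"  # hypothetical base URL
AGENT_ID = str(uuid.uuid4())

def avg_rtt_ms(host):
    """Average RTT of 3 consecutive pings, or None on failure."""
    proc = subprocess.run(["ping", "-c", "3", host],
                          capture_output=True, text=True)
    # Parse the trailing "min/avg/max" summary line of ping's output.
    match = re.search(r"= [\d.]+/([\d.]+)/", proc.stdout)
    return float(match.group(1)) if match else None

# Register this instance once at bootstrap.
requests.post(f"{JUNGLE}/agents", json={"id": AGENT_ID}, timeout=10)

# A scheduled round: ping every known target and report the results.
targets = requests.get(f"{JUNGLE}/targets", timeout=10).json()
for target in targets:
    latency = avg_rtt_ms(target["host"])
    if latency is not None:
        requests.post(f"{JUNGLE}/results", timeout=10,
                      json={"source": AGENT_ID,
                            "destination": target["id"],
                            "latency_ms": latency})
```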
Relay Monkey Jungle is the API service used by agents to fetch information about other agents and to report test results back. It uses PostgreSQL, provisioned as AWS RDS, as its backend datastore. We had to quickly scale the Jungle API service containers so they could process the load produced by the agents; otherwise they started producing errors.
A good friend of mine, Ivan Lakovic, was kind enough to help me create this system. As this was a project where we could take a full swing at technologies we wanted to try, FastAPI was one of them. Ivan briefly described it as “Flask on steroids”. It worked in our case: fast for creating the Jungle API endpoints and easy to scale.
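To give a flavour of why it felt like “Flask on steroids”, here is a minimal Jungle-style endpoint; the models and routes are illustrative, not the project's actual API:

```python
# Minimal FastAPI sketch of a results-reporting service.
# Run with: uvicorn jungle:app
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
results = []  # the real service persists to PostgreSQL instead

class PingResult(BaseModel):
    source: str
    destination: str
    latency_ms: float

@app.post("/results")
def report_result(result: PingResult):
    results.append(result)
    return {"stored": True}

@app.get("/results")
def list_results():
    return results
```

Request validation and interactive OpenAPI docs come for free from the type annotations, which is a big part of the appeal.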
Results were later analysed using Wolfram Mathematica, specifically their cloud solution. Since 500+ vertices and 80,000+ edges make up the weighted directed graph, the open-source solution wasn't sufficient, so we went for the Home Edition plan (£14), which gave us 5 minutes of computational time per cell. This was enough to produce a searchable graph to which we could apply the FindShortestPath function to examine our data. The notebook is available here.
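The core of that analysis fits in a few lines of Wolfram Language; the sketch below uses made-up latencies just to show the shape of the computation:

```wolfram
(* Build a weighted directed graph from (source -> destination) edges,
   weighted by latency in milliseconds, then query the fastest path. *)
edges = {DirectedEdge["mymac", "gcp-europe-west2-b"],
         DirectedEdge["gcp-europe-west2-b", "droplet-lon1"],
         DirectedEdge["droplet-lon1", "www.salesforce.com"]};
weights = {25.0, 0.56, 113.9};  (* illustrative latencies, not real data *)
g = Graph[edges, EdgeWeight -> weights];

FindShortestPath[g, "mymac", "www.salesforce.com"]
GraphDistance[g, "mymac", "www.salesforce.com"]  (* total latency along it *)
```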
What did we find?
We are glad to see positive indicators that could be leveraged by SDNs or similar technologies to build fast and flexible networks. That could be the platform for exciting new products and workflows, all on top of existing cloud infrastructure.
Cisco is doing it. Aviatrix is doing it. Fortinet wants to protect your SD-WAN. Many are doing it.
Why is this possible? Different cloud providers use different data centers with different ISPs. By combining them, traffic routes can be optimised.
Raw test results are available here. Basic data points are:
- mymac is my computer, located in Zagreb, Croatia (Europe).
- www.* are Fortune 500 websites.
- Unique instance IDs are named based on provider and deployment region/zone/dc.
- Latency is in milliseconds between source and destination.
You may notice that there are fewer tests than expected; this is because some of them failed. Regardless, the current data corpus is more than sufficient to surface important conclusions.
Results show that when searching for the shortest path between a source and a destination (the lowest sum of latencies), there are often multiple instances between them. In other words, the route through them can be faster than connecting the two directly over the internet.
Let’s take a few examples (D = direct latency, O = optimised latency):
- mymac ⟶ www.salesforce.com: D = 141.48 ms, O = 139.42 ms
- ali-eu-west-1b ⟶ ali-ap-southeast-2a: D = 313.91 ms, O = 243.88 ms
- linode-us-east ⟶ aws-ap-southeast-1a: D = 893.53 ms, O = 212.64 ms
Conclusions we can draw from the results:
- Aggregated data gives us a level of transparency that enables us to make decisions based on real-time global network performance.
- The ability to find the shortest path among instances enables us to optimise global network routes.
- Direct connections are not always that bad. The shortest direct connection from the US West Coast to Asia-Pacific would be 88.65 ms, using gcp-us-west1-b ⟶ linode-ap-northeast.
- Latency below 25 ms is not worth optimising, but only from mymac. This is because mymac is 25 ms away from the closest provisioned instance. Some websites respond faster because edge nodes sit closer to my computer, in some data center not covered during our testing.
- Some regions and availability zones are the best for connecting multiple cloud providers, e.g. gcp-europe-west2-b ⟶ droplet-lon1 (0.56 ms), aws-ap-southeast-1b ⟶ droplet-sgp1 (0.67 ms), gcp-europe-west2-c ⟶ aws-eu-west-2a (0.79 ms)…
- Possibly faster access to some websites, e.g. mymac ⟶ www.salesforce.com takes 141.48 ms with a direct connection and 139.42 ms when optimised. This website case may not be very significant, but it means the network can be designed for the same, if not faster, speeds. Much more interesting is aws-ap-southeast-2b ⟶ www.mohawkind.com, which falls from a direct latency of 1431.20 ms to 211.44 ms, a 1.2 s difference (this could be a direct-connection glitch).
What’s next?
Throw VPC networking in the mix
Cloud providers use their privately optimised backbones to connect multiple regions without necessarily using public internet routes. By creating VPCs to connect those instances, we expect latency to drop further. As already mentioned, Cisco is doing this to some extent. Google is investing in connecting the world with fiber optics, which is exciting news, though the latest news about connecting Hong Kong is not optimistic.
“Cloud VPC Network Peering lets you privately connect two VPC networks, which can reduce latency, cost and increase security.” — Google
Introduce new providers
IBM and Oracle would be great additions. Both are big companies investing in their infrastructure, so we would expect them to perform well.
Future of Relay Monkey
Relay Monkey may become available as an open-source project. Feel free to reach out if you have any questions or would like to use the tools developed for this project.
Add new endpoints
We could create API endpoints that expose results meeting given criteria. Those results could be used by tools for managing cloud infrastructure, by network management tools, by managed services that are parametrised and auto-provisioned, for continuous website latency testing and alerting, or similar. Endpoints could return the fastest routes, routes with the highest throughput, global website benchmark tests, etc., as sketched below.
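Purely speculative, one such endpoint could look like this; the route, graph backend (networkx here) and field names are all illustrative:

```python
# Hypothetical "fastest route" endpoint over the collected latency graph.
import networkx as nx
from fastapi import FastAPI

app = FastAPI()
graph = nx.DiGraph()  # a real service would load this from the datastore
graph.add_edge("gcp-us-west1-b", "linode-ap-northeast", latency=88.65)

@app.get("/routes/fastest")
def fastest_route(source: str, destination: str):
    path = nx.shortest_path(graph, source, destination, weight="latency")
    total = nx.path_weight(graph, path, weight="latency")
    return {"path": path, "latency_ms": total}
```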
Simply put, there is a sense that a lot can be done for customers of all sorts, just by knowing some simple facts like global network latency.
Additional metrics
Ping is not necessarily the best way to benchmark network performance, especially website performance. Time to first byte (TTFB) is an interesting metric in that regard, and we would like to implement it to complement the existing results.
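A rough way to approximate TTFB from Python is to time the gap between sending the request and the first body byte arriving (a proper implementation would separate DNS, connect, TLS and server time, the way curl's time_starttransfer does):

```python
# Approximate time-to-first-byte for a URL using the requests library.
import time

import requests

def ttfb_ms(url):
    start = time.monotonic()
    with requests.get(url, stream=True, timeout=10) as response:
        next(response.iter_content(chunk_size=1))  # block until the first byte
        return (time.monotonic() - start) * 1000

print(f"{ttfb_ms('https://www.salesforce.com'):.1f} ms")
```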
There are many more factors in measuring website performance that should be taken into account, like CDNs. Often, static content is served from caching servers at the edge, which are fast at delivering this type of data. Total page load time can also vary based on dynamic JavaScript content and backend API calls.
Some businesses use multiple providers to secure their internet connectivity and pay for the bandwidth. Since those businesses have high bandwidth needs, we should also measure throughput between instances to get better insight into this topic!
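We haven't done this yet, but a standard way to measure it would be a tool like iperf3 between a pair of instances:

```sh
# On the destination instance: start an iperf3 server.
iperf3 -s

# On the source instance: measure throughput to the destination
# for 10 seconds (203.0.113.10 is a placeholder address).
iperf3 -c 203.0.113.10 -t 10
```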
To conclude, we need “good enough” metrics, as exact and precise as they can be, but above all sufficient to surface new knowledge. ping was therefore sufficient to give us some clues about the answers to the questions from the beginning.
Personally, I’m excited to see how we can innovate on top of available multi-provider cloud networks.
And yes, we spent some money, but it was all good fun. We monkey around.
Appendix #1 (March 15th, 2021):
I’ve just learned about Microsoft’s Pingmesh paper. Our approach with Relay Monkey Jungle and Agent seems similar to the Pingmesh Controller and Agent approach.
“Network Monitor: A Tale of ACKnowledging an Observability Gap” by John Arthorne from Shopify talks about how to use eBPF to extract network data.