Improving ARK's snapshot distribution — faster rebuilds, less downtime
Database snapshots are a staple of ARK: both delegate and relay nodes rely on them regularly. Rebuilding a node’s database from a snapshot is the fastest way to recover from crashes, local forks, rollbacks, or a node stuck at an older block height.
Since delegates miss their forging slots when their nodes aren’t synced, we should recover our nodes as quickly as possible to keep the network healthy. This is where https://snapshots.ark.land can help.
The current snapshot standard followed by ARK delegates works, but it can be improved.
Currently, snapshot nodes take backups (dumps) of their local blockchain database, and expose the files to the internet with a webserver, so other delegates can download them. While this approach works, there are a few flaws:
- There’s no way to check the height and integrity of the snapshot before downloading it. You might download a corrupted or outdated snapshot without knowing, only to have to download another one. The same issue can happen again when you download from different snapshot servers.
- The most recent snapshot is simply a file called current. If you’re downloading this file and it gets overwritten by a newer snapshot, you’ll end up with an incomplete snapshot and have to download another one.
- If you’re a delegate in one of the scenarios above, the longer your node stays offline, the more blocks you miss. That affects the network’s health, as blocks take longer to confirm.
- Snapshot nodes aren’t fault tolerant, meaning that if a snapshot node is offline, you won’t be able to download anything from it.
- The farther away the snapshot node is from you, the slower the download. Latency has a large impact on downloading large files, and in a mission-critical environment you’ll want to be very close to the snapshot node.
So how can we improve on this? For starters, distributing the snapshots over a CDN, or Content Delivery Network, will eliminate three of the issues above. How so?
What is a CDN?
A CDN refers to a geographically distributed group of servers which work together to provide fast delivery of Internet content. Since it is composed of multiple servers around the globe, you can download snapshots from the server that’s closest to you. Multiple servers and replication make the network fault-tolerant, so snapshots will always be there for you when you need them.
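The two properties described above, proximity and fault tolerance, can be illustrated with a toy model. This is only a sketch: real CDN routing happens at the network level, and the server names, latencies, and the `EdgeServer` and `pick_server` names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EdgeServer:
    name: str          # hypothetical location label
    latency_ms: float  # hypothetical round-trip time from the client
    online: bool


def pick_server(servers: list[EdgeServer]) -> Optional[EdgeServer]:
    """Return the lowest-latency server that is online, or None if all are down."""
    candidates = [s for s in servers if s.online]
    return min(candidates, key=lambda s: s.latency_ms, default=None)


# Hypothetical latencies as seen from a client on the US west coast.
servers = [
    EdgeServer("san-francisco", 5.0, True),
    EdgeServer("new-york", 70.0, True),
    EdgeServer("western-europe", 150.0, True),
]

print(pick_server(servers).name)  # closest server wins

servers[0].online = False         # simulate an outage of the closest server
print(pick_server(servers).name)  # next-closest server takes over
```

With a single snapshot node there is nothing to fall back to, which is the fault-tolerance flaw listed earlier.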
Let’s look at some data. I’ve spun up a $5 DigitalOcean VM in each of 6 locations around the globe, and downloaded a snapshot from ARK’s official snapshot server on each of them.
San Francisco 02 DC
root@ubuntu-s-1vcpu-1gb-sfo2-01:~# wget https://snapshots.ark.io/current
100%[==============================>] 1.35G 13.7MB/s in 1m 42s
(13.6 MB/s) - ‘current’ saved [1454445115/1454445115]
root@ubuntu-s-1vcpu-1gb-sfo2-01:~# wget https://snapshots.ark.land/current
100%[==============================>] 1.35G 152MB/s in 9.0s
‘current.1’ saved
If you had a node in San Francisco and downloaded a snapshot from the official snapshot node, it would take 102 seconds to complete. This is largely because there’s a whole lot of land and ocean between San Francisco and Western Europe, where ARK’s snapshot server is hosted.
Downloading it from the ARKLand CDN connects you to a server in San Francisco, so the same snapshot arrives in only 9 seconds. That’s about 11 times as fast, or 1033% faster.
New York 01 DC
root@ubuntu-s-1vcpu-1gb-nyc1-01:~# wget https://snapshots.ark.io/current
100%[==============================>] 1.35G 26.8MB/s in 53s
(26.1 MB/s) - ‘current’ saved [1454385146/1454385146]
root@ubuntu-s-1vcpu-1gb-nyc1-01:~# wget https://snapshots.ark.land/current
100%[==============================>] 1.35G 218MB/s in 6.5s
‘current.1’ saved
A node in NYC is closer to the official server, so the baseline download is faster (53 seconds), but the CDN still delivers the snapshot over 7 times faster (6.5 seconds). Here are all the results:
The improvement rates give us a clear picture of how distance affects download speeds. The difference for servers in London and France isn’t as large as for India and Singapore, but there’s still a measurable improvement. This doesn’t mean that ARK’s snapshot server is bad or slow; it just shows how important low latency is for fast downloads.
There’s also a bandwidth limitation. During a network outage, when many delegates download a snapshot from the official server at the same time, downloads will be much slower for everyone. CDNs are built to handle traffic spikes and will still deliver fast downloads. Serving snapshots across the globe will also encourage the community to host more nodes in other parts of the world, rather than the usual NA/EU configuration.
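The improvement figures quoted for San Francisco and New York follow directly from the wget timings above. Here is a small sketch of the arithmetic; the function names are my own, and only the two locations whose timings appear in this post are included.

```python
def speedup(slow_seconds: float, fast_seconds: float) -> float:
    """How many times as fast the quicker download is."""
    return slow_seconds / fast_seconds


def percent_faster(slow_seconds: float, fast_seconds: float) -> float:
    """Improvement expressed as 'X% faster' (0% means no change)."""
    return (speedup(slow_seconds, fast_seconds) - 1) * 100


# (location, official server seconds, CDN seconds), taken from the wget output
timings = [
    ("San Francisco", 102.0, 9.0),
    ("New York", 53.0, 6.5),
]

for city, origin, cdn in timings:
    print(f"{city}: {speedup(origin, cdn):.2f}x as fast, "
          f"{percent_faster(origin, cdn):.0f}% faster")
# San Francisco: 11.33x as fast, 1033% faster
# New York: 8.15x as fast, 715% faster
```

Note the difference between "times as fast" and "times faster": 8.15 times as fast is the same as 7.15 times (715%) faster.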
Now back to the first flaw in my list: not knowing the height and integrity of a snapshot before downloading it. By visiting the Snapshot CDN’s home page, we can see the height, size, and creation time of each snapshot:
We only store a new snapshot if both its height and its size are greater than those of the last snapshot, and snapshot nodes are kept in check by Noah.
Other delegates have rolled their own CDNs before, but maintaining your own CDN servers can be time-consuming and costly. ARKLand CDN runs on professional, reliable infrastructure maintained by Google, on Google Cloud.
Our edge servers, which manage requests and route you to the CDN, run behind CloudFlare’s load balancers, so if an edge server goes offline, traffic is redirected to a backup server. CloudFlare also provides Anycast, which routes you to the edge server closest to you. The next steps are to provide devnet snapshots and to support Core v2. The costs are funded by our delegate share.
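The storage rule above can be sketched in a few lines. This is a minimal illustration of the rule as stated, not the CDN’s actual code: `SnapshotInfo`, `should_store`, the heights in the example, and the choice to always accept the very first snapshot are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SnapshotInfo:
    height: int      # block height the snapshot was taken at
    size_bytes: int  # size of the dump file


def should_store(candidate: SnapshotInfo, last: Optional[SnapshotInfo]) -> bool:
    """Accept a snapshot only if it is both higher and larger than the last one."""
    if last is None:
        # Assumption: with nothing stored yet, the first snapshot is accepted.
        return True
    return (candidate.height > last.height
            and candidate.size_bytes > last.size_bytes)


# Sizes are taken from the wget output earlier in the post; heights are made up.
last = SnapshotInfo(height=6_000_000, size_bytes=1_454_385_146)

print(should_store(SnapshotInfo(6_000_500, 1_454_445_115), last))  # True
print(should_store(SnapshotInfo(5_999_000, 1_454_445_115), last))  # False: older height
print(should_store(SnapshotInfo(6_000_500, 1_000_000), last))      # False: smaller dump
```

Requiring both conditions rejects outdated snapshots (lower height) as well as dumps that came out smaller than the previous one, which may be truncated.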
I'd like to invite all delegates and node managers to test out the Snapshot CDN!
This is part 1 of 2 posts about snapshots. Decreasing the TTD of snapshots by such wide margins is great, but it can still take about 2 to 10 minutes to actually restore a snapshot, depending on the server. That could mean a couple more missed blocks while the node is offline. Fixing this is what we’ll tackle in the next post.
This new snapshot system has been in testing over the last few months, and it’s looking amazing. Snapshots take a mere second to dump. In this sneak peek you can see a modified version of Noah rebuilding a node from block 1500 to block 6194127 in 5 seconds.
Stay tuned for the next post 😊