Our biweekly updates written by the entire Prysmatic Labs team on the Ethereum Serenity roadmap.
Lessons We Learned from Our Public Testnet
Goerli ETH1 Hard-Fork Post-Mortem
On October 30th, the Goerli test network conducted a successful hard fork for the Istanbul upgrade. Unfortunately, Prysmatic Labs' ETH1 Goerli nodes were not ready for this upgrade and became incompatible with the newly forked blockchain. This caused our ETH1 subscriptions that monitor the deposit contract to fail, triggering a cascading series of failures. The end result was that all of Prysmatic's validators went offline at once, and the problem could not be rectified in an adequate amount of time without major manual intervention. Rather than hack the chain back to life, we are leaving it offline until our next scheduled restart on November 4th, which will include the spec v0.9 changes. We learned a lot from this incident about single points of failure and have improved many aspects of the testnet to prevent and detect issues like this in the future. Read the full incident report here.
Monitoring Critical Conditions
Monitoring everything is key to testing any complex system, but more importantly, it's all about measuring and capturing the right context. Say we monitor every instance of the following error, which can occur while a block is being received via p2p:
[2019-10-29 10:31:20] DEBUG forkchoice: Executing state transition on block slot=112849
panic: could not process block from fork choice service: could not execute state transition: could not process block: could not process block header: parent root 0x6a4f34d5192245e2ea5d30ba19860ed433fbcffd3981b4a6a0ca53fc486e5fd9 does not match the latest block header signing root in state 0xfc1c108b23cf44872e9ae2238fe88b3b823d47396f6b2148192a4699496f3fc9

goroutine 221 [running]:
github.com/prysmaticlabs/prysm/beacon-chain/sync/initial-sync.(*InitialSync).Start(0xc000a54030)
	beacon-chain/sync/initial-sync/service.go:106 +0x97d
created by github.com/prysmaticlabs/prysm/shared.(*ServiceRegistry).StartAll
	shared/service_registry.go:44 +0x23e
…but what does this mean? What exactly does it tell us? At first sight, we can tell that the parent of the block being processed was never processed into our beacon chain's state…but how did the node get into this situation in the first place? Naively, we might imagine we received a wrong block from a forked node, but how can we gain confidence in that hypothesis? We start with a few hypotheses about the cause of the failure, and then look at the lowest-hanging fruit of our monitoring infrastructure. We use the awesome Kibana logging platform to understand and visualize logs across nodes in our production cluster, using their context and frequency to correlate observations with each other appropriately.
Further inspection of the error, for example, could show that only one or two nodes received it. We can then look at logs of the current block roots on our different nodes to determine whether the origin of the error was a fork. Oftentimes, these hypotheses turn out to be wrong: in this particular case, we obtained the error above due to a bug in our serialized data structure cache rather than a fork scenario! Good devops is about tracking the right information and understanding its context above all else. Good errors should be clear, but having enough information to test beliefs about an error's context is far more important for a healthy system.
Problems With Caching
The first thing developers are inclined to do when a runtime operation is highly repetitive and expensive is to cache it, duh! Unfortunately, caches and distributed systems don’t often play nicely…
There are only two hard things in Computer Science: cache invalidation and naming things.
— Phil Karlton
We've had an incredible number of problems when naïvely using even the simplest caches in our public testnet nodes the moment network instability kicks in or inconsistency arises between nodes, such as during naturally occurring short-range forks. It's very tempting to make every expensive function call retrieve cached data instead of data persisted on disk, but it's more important to consider how to invalidate those caches upon any drastic network change to prevent data corruption. Moreover, if a node restarts with nothing stored in memory, we need to ensure it runs just as it did when everything was cached. We are currently working on a better caching strategy that will prevent fatal conditions upon any sort of network data corruption and properly invalidate data when needed. In the meantime, we're placing all caches behind runtime flags until we have full confidence in the runtime stability of each one.
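As a rough illustration of the flag-guarded approach, here is a minimal sketch in Go. The flag name `enable-state-cache` and the `rootCache` type are hypothetical, not Prysm's actual implementation; the point is that every read path must tolerate the cache being disabled or empty, and a single invalidation call forces callers back onto the persistent-disk path.

```go
package main

import (
	"flag"
	"fmt"
	"sync"
)

// enableStateCache is a hypothetical runtime flag; caches stay off
// by default until their stability is proven.
var enableStateCache = flag.Bool("enable-state-cache", false,
	"Enable the experimental state root cache")

// rootCache is a simple guarded cache. Callers must always have a
// slower disk path to fall back on when get returns a miss.
type rootCache struct {
	mu    sync.RWMutex
	items map[uint64][32]byte
}

func newRootCache() *rootCache {
	return &rootCache{items: make(map[uint64][32]byte)}
}

// get returns a cached root, or a miss when the cache is disabled.
func (c *rootCache) get(slot uint64) ([32]byte, bool) {
	if !*enableStateCache {
		return [32]byte{}, false // fall through to the disk path
	}
	c.mu.RLock()
	defer c.mu.RUnlock()
	r, ok := c.items[slot]
	return r, ok
}

// put stores a root only when the cache is enabled.
func (c *rootCache) put(slot uint64, root [32]byte) {
	if !*enableStateCache {
		return
	}
	c.mu.Lock()
	defer c.mu.Unlock()
	c.items[slot] = root
}

// invalidate wipes everything on a reorg or other drastic network
// change, exactly as if the node had just restarted.
func (c *rootCache) invalidate() {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.items = make(map[uint64][32]byte)
}

func main() {
	flag.Parse()
	c := newRootCache()
	c.put(1, [32]byte{0xaa})
	_, hit := c.get(1)
	fmt.Println("cache hit:", hit)
}
```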
Merged Code, Pull Requests, and Issues
Testnet Data Analysis — Kafka Data Extraction
As part of our ongoing testnet data analysis tooling, we're focusing on ways to stream data in real time to other services for ingestion. This functionality allows us and other developers to build advanced tooling for abnormality detection, such as slashable events, impossible block state transitions, validator liveness monitoring, and more.
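To give a flavor of what such a streamed message might look like, here is a small sketch of a block event being serialized for publication to a topic. The `BlockEvent` type, its field names, and the topic idea are all illustrative assumptions, not Prysm's actual Kafka schema, and the producer client itself is omitted.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// BlockEvent is a hypothetical payload shape for downstream
// consumers such as slashing detectors or liveness monitors.
type BlockEvent struct {
	Slot       uint64 `json:"slot"`
	BlockRoot  string `json:"block_root"`
	ParentRoot string `json:"parent_root"`
}

// encodeEvent serializes an event into JSON bytes, ready to hand to
// whatever producer (e.g. a Kafka client) publishes the stream.
func encodeEvent(e BlockEvent) ([]byte, error) {
	return json.Marshal(e)
}

func main() {
	b, err := encodeEvent(BlockEvent{
		Slot:       112849,
		BlockRoot:  "0xabcd",
		ParentRoot: "0x1234",
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(string(b))
}
```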
Revamped Simple Serialize Caching
Simple Serialize remains one of the largest bottlenecks in our project when increasing the validator count in our public and local testnet runs. Eth2 calculates Merkle roots of its data structures, such as the beacon chain state object, so that they can be recomputed quickly when only certain elements change. When a single block is received, only a few fields in the state are modified, so recomputing the root of the full object should take only marginal time. A major issue arises when recomputing arrays of roots, such as the `BlockRoots` vector stored in the beacon state, which holds 8,192 32-byte roots.
Every time a state transition occurs, an element in that array is modified, and instead of keeping around the Merkleization for that element, we end up recomputing the entire Merkle trie from the leaves on every instance, also causing high memory utilization.
Instead, we can opt to keep around the specific Merkle trie for fields that are arrays of roots and only recompute modified branches if they change on each iteration. Doing this had a massive improvement on our runtime, giving us a lot more confidence in our go-ssz repo’s ability to handle larger state sizes in our testnet runs.
BenchmarkSSZ_NoCache-4     18   58928587 ns/op
BenchmarkSSZ_WithCache-4   56   22740673 ns/op
We obtain almost a 62% improvement on our current benchmark!
Integrating Slick New BLS Library by Herumi into Prysm
During Devcon, our team finally got to meet Herumi, which led him to develop a compiled, static BLS library here that can be used in Go projects. We had previously spent a lot of time trying to get Herumi's BLS library to work with Prysm, but we ran into many integration issues: using it led to a negative developer experience, as it did not play well with our IDEs and made Prysm no longer "go-gettable". The compiled static library makes our job much easier, as users no longer need to install additional dependencies such as OpenSSL and GMP in their local environment.
Shigeo Mitsunari, Herumi on Github, is a respected Japanese cryptographer who has been working on pairing schemes for 20 years, and we are confident in his ability to keep supporting the eth2 effort through his incredible expertise.
Herumi's library led to a very big improvement in signature verification, bringing a performance gain of nearly 3x. This matters because signature verification is a major bottleneck for us: we constantly verify multiple attestation signatures in every slot. Herumi made the library compatible with the eth2 spec and also implemented a faster cofactor algorithm, which sped up MapToG2 and consequently yielded a 2x improvement in signing messages. The PR to integrate it into Prysm is here.
Aligning to Spec v0.9 Release — Tonkatsu
As pointed out in Danny's latest quick update, we have been hard at work updating our runtime from spec v0.8.4 to v0.9.0, and it has been a blast.
Removing shard and crosslink information from the beacon state transition shaved off 3,000 lines of code, made the state transition simpler and easier to reason about, and reduced the scope of runtime bugs. We are almost done with the v0.9.0 transition: we started by implementing the new proposer logic and seed calculation, then made minor adjustments such as beacon chain configs and rewards calculations, and lastly fixed all our tests. Now we are working toward aligning with the spec tests. You can track our progress in this mega tracking issue.
Current Status of Slashing
Continuing work on the “Hash Slinging Slasher” for catching slashable votes, some teams had a call with protolambda from the Ethereum Foundation this week to discuss possible designs for the slashing police watchtowers on the beacon chain. For those who don’t know, this is a service that users can optionally run in order to catch slashable votes and receive whistleblower rewards. While slashable votes should rarely occur, it’s important for people to be running these if they have the resources.
Going over protolambda's new design (you can see the details here), it mainly revolves around storing the distance (target epoch - source epoch) of each validator's attestation per epoch. By checking the distances a validator has already voted for, we can compare them with the distances of new attestations coming in and easily see whether new votes surround old votes, or are surrounded by them. Naively implemented, this detection method may take ~60GB of storage for 300K validators, but since there is a lot of repetitive data, chances are we can compress this down to less than 1GB!
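Once source and target epochs are known, the surround check itself is a pair of comparisons. Here is a minimal sketch; the `vote` type and function names are illustrative, not the actual slasher code.

```go
package main

import "fmt"

// vote distills an attestation to the two epochs the surround rule
// cares about. Its distance is target - source.
type vote struct {
	source, target uint64
}

// isSurround reports whether an incoming vote surrounds a previous
// vote or is surrounded by it; both cases are slashable.
func isSurround(prev, incoming vote) bool {
	surrounds := incoming.source < prev.source && incoming.target > prev.target
	surroundedBy := prev.source < incoming.source && prev.target > incoming.target
	return surrounds || surroundedBy
}

func main() {
	prev := vote{source: 3, target: 4}
	incoming := vote{source: 2, target: 5} // wider span: surrounds prev
	fmt.Println(isSurround(prev, incoming))
}
```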
Interested in Contributing?
We are always looking for devs interested in helping us out. If you know Go or Solidity and want to contribute to the forefront of research on Ethereum, please drop us a line and we’d be more than happy to help onboard you :).
Official, Prysmatic Labs Ether Donation Address
Official, Prysmatic Labs ENS Name