How to Share a Secret in Production
We first deployed our Torus Network DKG back in 2018, and we launched out of beta in February of this year with a new set of node operators. We also recently ran a key refresh protocol to upgrade the node software for existing operators and to ensure that their key shares are regenerated.
With our launch out of beta, we have also been able to focus more on products that build on top of the Torus Network like improvements to the Torus Wallet, the DirectAuth SDK, the tKey framework, and more.
All of this was only possible with the stability and scalability guarantees that we achieved through a year of testing and debugging back in 2019. We built several internal tools to enable these tests (torus-manager, torus-tester), and we document some of the processes we have in place here so that future DKG networks can benefit from some of the lessons we’ve learned.
The architecture of a Torus Node can be split roughly into three separate systems:
- a DKG / PSS protocol that is managed by a canonical Ethereum smart contract
- a Tendermint layer between nodes that is used for mapping user identities to keys (and other metadata)
- a (local) verification system that processes user-submitted authentication proofs and returns shares
The easiest part to test and to scale was the verification system, since it was local and did not require communication with other nodes. We initially hard-coded the OAuth types into the deployed operator software, but to support other types of verifiers we had to allow for dynamic updates. To do that, we deployed a separate verifier-list contract on Ethereum that allows us to update information about new verifiers (e.g. new Google authentication clientIDs).
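As a rough illustration of the dynamic-update idea, the sketch below keeps an in-memory verifier list that can be extended at runtime instead of being compiled into the node. The class name `VerifierRegistry`, its methods, and the example client ID are all hypothetical; in the real system the updates would come from the verifier-list contract, which is not modelled here.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class VerifierRegistry:
    """Hypothetical in-memory stand-in for a verifier list read from a contract."""

    # Maps a verifier name (e.g. "google") to the client IDs it accepts.
    client_ids: Dict[str, Set[str]] = field(default_factory=dict)

    def update(self, verifier: str, client_id: str) -> None:
        # Assumed to be called whenever a new entry is observed on the contract.
        self.client_ids.setdefault(verifier, set()).add(client_id)

    def is_allowed(self, verifier: str, client_id: str) -> bool:
        # Unknown verifiers and unknown client IDs are both rejected.
        return client_id in self.client_ids.get(verifier, set())


registry = VerifierRegistry()
registry.update("google", "example-client-id")
```

Because the lookup is purely local, a check like `registry.is_allowed("google", "example-client-id")` can be unit tested without any other node running.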
The DKG/PSS protocol had unit tests and had pluggable transports which allowed us to run local tests easily on things like dynamic operator sets or even simulate things like offline nodes and slow nodes. Our Tendermint layer also had unit tests around transaction validation and consistency. Integration tests were easy to build since we had rewritten most parts of our backend into services that communicated via an internal service bus (which could also accept external RPC calls), and we were able to mock services through simulating responses on the service bus. However, we quickly realized that unit tests and integration tests on just a single node’s software were insufficient for network stability during migrations.
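To make the pluggable-transport idea concrete, here is a minimal sketch of an in-process transport that can simulate slow and offline nodes. It is not the Torus transport interface; `InMemoryTransport`, its fields, and the tick-based clock are assumptions made for illustration.

```python
from typing import Callable, Dict, List, Set, Tuple

Handler = Callable[[str, bytes], None]  # called as handler(sender, payload)


class InMemoryTransport:
    """Hypothetical test transport: delivers messages in-process on tick()."""

    def __init__(self) -> None:
        self.handlers: Dict[str, Handler] = {}
        self.offline: Set[str] = set()        # nodes that drop all traffic
        self.extra_delay: Dict[str, int] = {}  # extra ticks before delivery
        self.queue: List[Tuple[int, str, str, bytes]] = []
        self.now = 0

    def register(self, node_id: str, handler: Handler) -> None:
        self.handlers[node_id] = handler

    def send(self, sender: str, recipient: str, payload: bytes) -> None:
        # Messages to or from an offline node are silently dropped.
        if sender in self.offline or recipient in self.offline:
            return
        deliver_at = self.now + 1 + self.extra_delay.get(recipient, 0)
        self.queue.append((deliver_at, sender, recipient, payload))

    def tick(self) -> None:
        # Advance the simulated clock and deliver everything that is due.
        self.now += 1
        due = [m for m in self.queue if m[0] <= self.now]
        self.queue = [m for m in self.queue if m[0] > self.now]
        for _, sender, recipient, payload in due:
            self.handlers[recipient](sender, payload)


received = {"a": [], "b": [], "c": []}
transport = InMemoryTransport()
for node in received:
    transport.register(node, lambda s, p, node=node: received[node].append((s, p)))

transport.extra_delay["b"] = 2  # "b" is a slow node
transport.offline.add("c")      # "c" is offline
transport.send("a", "b", b"share")
transport.send("a", "c", b"share")
transport.tick()
```

After the single tick above, nothing has reached the slow node yet; two more ticks deliver the message to "b", while "c" never receives anything. Driving time with explicit ticks keeps such tests deterministic.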
Although parts of the system worked well in local tests, simulated networks showed strange behaviour that was not always reproducible locally. For instance, a 5-node cluster under load was often observed to contain a single node that would “lag” behind its peers. This happened because only 4 nodes are required for consensus, and 4 nodes reach consensus faster than 5, so the four fastest nodes would confirm transactions faster than the last node could catch up, leading to a widening gap between the lagging node and the rest of the network. Moreover, simulations differed from local tests due to external dependencies, like API availability of Infura, gas price changes during smart contract updates, or even just network connectivity under rate limiting, and all of these minute differences chipped away at assumptions we had. For example, it was not possible to (directly) depend on external Ethereum smart contract state for Tendermint consensus in production, since reading that state was non-deterministic: the Ethereum connection might not have 100% uptime, so different nodes could see different results.
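The arithmetic behind the lagging node can be sketched as follows. The quorum function assumes equal-weight validators and the usual Tendermint rule of more than two thirds of voting power; `height_gap` is a deliberately crude toy model with made-up block times, not a measurement from our network.

```python
def quorum(n: int) -> int:
    """Smallest number of equal-weight validators that exceeds 2/3 of n."""
    return (2 * n) // 3 + 1


def height_gap(elapsed_ms: int, quorum_block_ms: int, slow_block_ms: int) -> int:
    """Toy model: blocks committed by the quorum minus blocks the slow node has processed."""
    return elapsed_ms // quorum_block_ms - elapsed_ms // slow_block_ms


# With 5 validators, 4 are enough to commit, so one node can be left behind.
needed = quorum(5)

# Hypothetical timings: the quorum commits every 1000 ms, the slow node needs 1200 ms per block.
gap_after_1_min = height_gap(60_000, 1000, 1200)
gap_after_2_min = height_gap(120_000, 1000, 1200)
```

In this model the gap doubles from the first minute to the second, which is the widening behaviour described above: as long as the slow node takes longer per block than the quorum does, it never catches up.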
Another problem was backward compatibility. In many cases, later versions of the Torus node had completely new data structures and features that were not conceived of in previous versions, which required us to write custom patches to convert the old data structures to the new ones. Furthermore, in some cases the conversion might introduce compatibility issues that only appeared in the next migration, which meant that we needed to test chained data migrations from v1 → v2 → v3. These issues applied equally to the Tendermint system and the DKG/PSS system, and so we needed to write network-level end-to-end tests.
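A minimal sketch of testing chained migrations is shown below. The record fields and the two migration functions are invented for illustration and do not reflect the actual Torus data structures; the point is only that each step is applied in sequence, so a problem introduced by one conversion surfaces when the next one runs.

```python
from typing import Callable, Dict

Record = dict


def migrate_v1_to_v2(record: Record) -> Record:
    # Hypothetical change: v2 renames "key" to "pub_key".
    out = dict(record)
    out["pub_key"] = out.pop("key")
    out["version"] = 2
    return out


def migrate_v2_to_v3(record: Record) -> Record:
    # Hypothetical change: v3 nests the verifier under "metadata".
    out = dict(record)
    out["metadata"] = {"verifier": out.pop("verifier")}
    out["version"] = 3
    return out


# Each entry upgrades a record from the given version to the next one.
MIGRATIONS: Dict[int, Callable[[Record], Record]] = {
    1: migrate_v1_to_v2,
    2: migrate_v2_to_v3,
}


def migrate(record: Record, target_version: int) -> Record:
    """Apply single-step migrations in order until the target version is reached."""
    while record["version"] < target_version:
        record = MIGRATIONS[record["version"]](record)
    return record


v1_record = {"version": 1, "key": "0xabc", "verifier": "google"}
v3_record = migrate(v1_record, 3)
```

Each step copies its input rather than mutating it, so the original v1 fixture can be reused to test other migration paths.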
But how do you write end-to-end tests for something like a network? Was it even possible to automate this through CI?
(Answer: Kubernetes. And yes, it is possible.)
Since the Torus Node software is deployable as a Docker container, we were able to spin up test clusters easily through Kubernetes. We then used Cobra with custom configuration to set up networking and automate the process of spinning up clusters based on builds from AWS ECR (which are updated by CircleCI whenever a commit is pushed to GitHub). We built a management tool around this called torus-manager, and it is the same tool that is used by node operators today to spin up new Torus nodes.
To separate the infrastructure code from the actual network-level end-to-end tests, we decided to write a separate tool called torus-tester, which contained test scenarios. Basic scenarios tested core functionality like the ability to generate distributed keys via DKG, retrieving them to reconstruct the correct key, and migrating keys via the key refresh PSS protocol. We then had more advanced scenarios that tested things like malicious node behaviour (where we write custom code that generates random input, or delays input), as well as focused tests on particular parts of the protocol like testing backwards compatibility via different commit numbers and testing dynamic node sets. A lot of these tests turned out to be rather expensive to run even with auto-scaling, and so we decided to move most of the focused tests to run only on the master branch.
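The shape of a basic scenario (generate shares, reconstruct the key, then repeat with a misbehaving node) can be sketched with plain Shamir secret sharing. This is a simplification: it uses a single trusted dealer rather than a DKG, and the prime, threshold, node count, and function names are all assumptions chosen for illustration, not values from torus-tester.

```python
import random

PRIME = 2**127 - 1  # a Mersenne prime, used here purely for illustration


def split(secret: int, threshold: int, n: int, rng: random.Random):
    """Dealer-based Shamir split: evaluate a random polynomial at x = 1..n."""
    coeffs = [secret] + [rng.randrange(PRIME) for _ in range(threshold - 1)]

    def evaluate(x: int) -> int:
        acc = 0
        for c in reversed(coeffs):  # Horner's rule
            acc = (acc * x + c) % PRIME
        return acc

    return [(i, evaluate(i)) for i in range(1, n + 1)]


def reconstruct(shares) -> int:
    """Lagrange interpolation at x = 0 over the given (x, y) shares."""
    secret = 0
    for j, (xj, yj) in enumerate(shares):
        num, den = 1, 1
        for m, (xm, _) in enumerate(shares):
            if m != j:
                num = (num * -xm) % PRIME
                den = (den * (xj - xm)) % PRIME
        # Modular inverse via Fermat's little theorem, valid because PRIME is prime.
        secret = (secret + yj * num * pow(den, PRIME - 2, PRIME)) % PRIME
    return secret


rng = random.Random(0)  # fixed seed so the scenario is reproducible
secret = rng.randrange(PRIME)
shares = split(secret, threshold=3, n=5, rng=rng)

# Basic scenario: any 3 honest shares recover the key.
honest_result = reconstruct(shares[:3])

# Malicious scenario: one node returns a tampered share.
x3, y3 = shares[2]
tampered_result = reconstruct([shares[0], shares[1], (x3, (y3 + 1) % PRIME)])
```

A scenario would then assert that `honest_result` equals the original secret and that `tampered_result` does not, so that a bad share is caught rather than silently accepted.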
With that in place, we then added load tests and hooked up the metrics from these tests to our internal Grafana dashboards to analyse memory usage, CPU usage, discrepancies in message timings etc. In the process, we fixed some issues in other existing open-source libraries, got an audit done on our DKG system, and open-sourced our codebase so that other users could take a look at what we were working on.
More importantly, these testing tools helped us lay the foundation for building on top of the existing code base without affecting network functionality. For example, in the latest operator migration, we added a standalone web-server proxy in front of the Torus node, to ensure that DDoS attacks don’t require a node restart. Network routing underwent a massive rework (especially since we also use libp2p for inter-node communication), but we were still able to proceed confidently with the migration since our end-to-end tests were passing even with the new routing changes. Many of these tools will continue to be used in the next version of the Torus Network, and we aim to keep improving them to keep our systems robust and secure.