Launching Cosmos with Double-Sign Protection in Hardware and Resilience against Host Compromise

In preparation for the launch of the Cosmos Network, planned for Week 9, and of its sister network IRISnet, our priority at Cryptium Labs during Game of Stakes was to test our physical infrastructure and the whole Cosmos stack: not only the protocol, but also every component that validators would use.

This post explains the options we considered for mainnet and how they led Adrian Brink to integrate the KMS with the Ledger Nano S. The integration enforces double-signing protection in hardware for all of Tendermint’s slashing conditions, which makes our validator resilient even against host compromise.

Background

From our experience running a Tezos validator (baker) in production and offering a public validation service since July 2018, we knew that launching with double-signing protection in hardware was a requirement, especially for Tendermint consensus networks, where safety faults are severely punished and delegations are slashable.

As Game of Stakes was finishing, it seemed that launching with in-hardware double-sign protection would not be a priority for the Tendermint team (it was tagged as non-launch critical). Although some solutions developed by validators were available, their double-signing protection did not meet our security standards.

Why Prioritise Safety Over Liveness?

At the time of writing, Cosmos’ exact PoS parameters are not yet defined, which makes it hard to estimate the actual losses and gains in the event of safety or liveness faults. In the case of Tezos, it is cheaper to go down for 24 hours and lose the rewards than to risk double-signing even once because of a misconfigured automatic failover system or a host compromise (see the double-signing article on Tezos). Tendermint consensus networks such as Cosmos prioritise safety over liveness, and in Cosmos the delegated tokens are at stake, so the losses from a single safety fault will be larger than those from being offline for a short time and performing a manual failover.

All the solutions that have automatic failover have some trade-offs between liveness and safety. Even with a correct configuration, the validator might be vulnerable to a single host getting compromised and being used to generate bad signatures.

Available Signing Options for Validators

The starting point was the Key Management System (KMS). At a high level, it is an interface that communicates with Tendermint Core and is compatible with multiple signing solutions. The current version of the KMS does not provide any kind of double-sign protection. At the time, the options were:

  1. Software Keys + KMS
  2. YubiHSM + KMS
  3. Ledger Nano S + KMS

We eliminated option 1, as it is the most vulnerable option in the event of host compromise due to the possibility of key exfiltration. The plausible alternatives were to use the YubiHSM or the Ledger Nano S.

The YubiHSM

During Game of Stakes, our validator produced its signatures with the YubiHSM, which we tested in a dedicated server placed in a datacenter environment.

When considering the options, Adrian started with the YubiHSM, as it seemed to be the preferred option for most validators, and he looked into how the integration with the KMS worked. It quickly became clear that there was no double-sign protection in either software or hardware: the YubiHSM is not programmable and signs whatever it receives without verifying it.

Here’s a practical example of how risky the current YubiHSM setup is. Last week, game-of-stakes-6 halted due to spam attacks. Although our validator did not halt, a bug in the code crashed our gaiad process and left us with a corrupted WAL file. The quick fix, which most participants followed without a second thought, was to delete data/cs.wal. We were reluctant to do so: we were using the YubiHSM, which offers no protection against accidental safety faults, so deleting the file meant running the risk of double-signing. We solved this by emergency patching the KMS with basic double-signature protection, then deleting the corrupted .wal file and restarting the node. Everyone who was using the YubiHSM and just deleted data/cs.wal took a massive risk of producing a double-signature. We strongly advise against following Riot chat advice on mainnet if you don’t understand the consequences.
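To make the idea of basic double-signature protection concrete, here is a minimal sketch of a signing guard. It is not the patch we applied to the KMS: the class name, the step names and their ordering are assumptions made for illustration. The guard remembers the last consensus position it signed and refuses anything at the same or an earlier position.

```python
from dataclasses import dataclass

# Step ordering within a round. Names and values are assumptions for this sketch.
STEP_ORDER = {"proposal": 0, "prevote": 1, "precommit": 2}


@dataclass
class DoubleSignGuard:
    """Hypothetical guard: tracks the last signed position and never goes backwards."""
    last_height: int = 0
    last_round: int = 0
    last_step: int = -1  # -1 means nothing has been signed yet

    def check_and_update(self, height: int, round_: int, step: str) -> bool:
        """Return True and advance the mark if signing is safe, otherwise False."""
        candidate = (height, round_, STEP_ORDER[step])
        last = (self.last_height, self.last_round, self.last_step)
        if candidate <= last:
            # Same or earlier position: a second signature here could be a double-sign.
            return False
        self.last_height, self.last_round, self.last_step = candidate
        return True


guard = DoubleSignGuard()
first = guard.check_and_update(10, 0, "prevote")   # accepted
second = guard.check_and_update(10, 0, "prevote")  # refused: same position again
```

The sketch keeps its state in memory only, so it would forget everything on restart. A usable guard has to persist the mark somewhere that survives crashes and file deletions, which is exactly why keeping it inside a hardware device is attractive. This strict version also refuses to re-sign an identical message, which a production guard might choose to allow.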

Another vulnerability of the YubiHSM is its backup strategy. You generate the signing key on the device, then generate a symmetric key on the host, upload it to the YubiHSM and export the encrypted signing key. At this point the symmetric encryption key has been on a potentially internet-connected device and should be considered compromised. The Ledger Nano S offers a much simpler backup strategy: you only need to back up the 24 words correctly, and no toxic key material ever touches an internet-connected device.

The Ledger Nano S

After eliminating the YubiHSM option, Adrian started considering the Ledger Nano S, which is programmable. There was already a first version of the Ledger app written by Juan Leni (from ZondaX), but it was not yet integrated with the KMS. Furthermore, we knew that it was neither audited nor tested, as it was not a priority for any of the validators or for the Tendermint team.

Integrating the Ledger App with the KMS

Here are the components that Adrian used to integrate the Ledger app with the KMS, with Juan Leni’s support.

He started with the lowest level of the stack, ledger-rs (repository), the communication interface that knows how to exchange messages via USB with the Ledger Nano S. The next component was ledger-tendermint-rs (repository), which understands the behaviour of ledger-validator-app (repository). Then came signatory (repository), a library that implements connections to different backends behind a public API that is consumed by the KMS. signatory-ledger-tm (repository) implements that public API with the Ledger device as the backend. Lastly, the KMS (repository) consumes signatory and signatory-ledger-tm.
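The layering can be pictured with a small sketch. The real components are Rust crates; the Python classes below are hypothetical stand-ins that only mirror the role of each layer, and none of the names or methods correspond to the actual APIs. The fake transport returns a hash instead of a real signature.

```python
import hashlib


class FakeUsbTransport:
    """Stands in for the USB layer (the role ledger-rs plays)."""

    def exchange(self, payload: bytes) -> bytes:
        # A real device would return a signature; a hash is only a placeholder.
        return hashlib.sha256(payload).digest()


class ValidatorAppClient:
    """Knows the app's message format (the role of ledger-tendermint-rs)."""

    def __init__(self, transport):
        self.transport = transport

    def sign_vote(self, vote_bytes: bytes) -> bytes:
        # The b"SIGN" prefix is an invented framing, not the app's protocol.
        return self.transport.exchange(b"SIGN" + vote_bytes)


class LedgerSigner:
    """Adapts the client to a generic signer interface (role of signatory-ledger-tm)."""

    def __init__(self, client):
        self.client = client

    def sign(self, msg: bytes) -> bytes:
        return self.client.sign_vote(msg)


class Kms:
    """Consumes any object exposing sign(msg) (the role of the KMS)."""

    def __init__(self, signer):
        self.signer = signer

    def handle_sign_request(self, msg: bytes) -> bytes:
        return self.signer.sign(msg)


kms = Kms(LedgerSigner(ValidatorAppClient(FakeUsbTransport())))
signature = kms.handle_sign_request(b"vote at height 1")
```

Because the top layer only depends on a generic signing interface, the same KMS can sit in front of software keys, a YubiHSM or a Ledger, which is what made the list of signing options possible in the first place.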

Properly Testing and Debugging the Integration

It took many days of work and debugging. For example, Adrian found out that the Ledger backend was not on par with Tendermint’s votes and proposals. It turned out that the specification for the application was wrong: the Ledger did not know exactly how to decode SignProposal and SignVote requests, because the test vectors that Juan had received were out of date.

After multiple PRs, such as this one, the integration was closer to production readiness. Adrian’s pull request: Ledger integration into KMS.

The Latency Issue

The integration of the Ledger with the KMS ran, but none of its signatures made it into blocks. Initially, the main concern was that the Ledger was not fast enough in providing the signature.

When Adrian deployed it on gaia-11000, not a single signature was included by other validators. Benchmarking showed that it takes ~423ms to get a signature from the Ledger, compared with ~150ms from the YubiHSM. The KMS itself adds almost no delay.
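A measurement like this can be reproduced with a small timing harness. The sketch below is not the benchmark we ran: fake_sign is a hypothetical stand-in that sleeps to imitate device latency, and in practice you would pass in a function that performs one real signing round-trip.

```python
import statistics
import time


def benchmark_signer(sign, message: bytes, runs: int = 20) -> float:
    """Return the median wall-clock latency of sign(message) in milliseconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        sign(message)
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def fake_sign(message: bytes) -> bytes:
    """Hypothetical stand-in for a hardware signer; sleeps to imitate latency."""
    time.sleep(0.005)
    return b"\x00" * 64


median_ms = benchmark_signer(fake_sign, b"vote", runs=5)
```

Using the median rather than the mean keeps a single slow call, such as the first request after device initialisation, from distorting the figure.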

Unable to rule out that the Ledger Nano S was simply too slow, Adrian talked with Nicolas Bacca, CTO of Ledger, and Juan Leni, and learned that the latest firmware had halved the signing speed of the Ledger Nano S. He tried to downgrade the Ledger’s firmware and to find another Ledger with the old firmware, both without success.

Still, with 5-6 seconds between blocks, latency should not have been an issue whether a signature took 400ms or 200ms. And if latency had been the issue, it was even stranger that 0 signatures were broadcast to the network, rather than only some of them arriving late.
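The arithmetic behind that intuition is short. Using the figures quoted above (~423ms and ~150ms per signature, and the lower bound of 5 seconds between blocks), the share of a block interval spent waiting for one signature is:

```python
def latency_share(sign_ms: float, block_time_s: float) -> float:
    """Fraction of one block interval spent waiting for a single signature."""
    return sign_ms / (block_time_s * 1000.0)


ledger_share = latency_share(423, 5)   # Ledger Nano S, 5 s blocks
yubihsm_share = latency_share(150, 5)  # YubiHSM, 5 s blocks
```

That is roughly 8.5% of the interval for the Ledger and 3% for the YubiHSM. Even allowing for several signatures per block, the Ledger stays well within the block time, so slowness alone could not explain a total absence of signatures.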

The Actual Issue

Testing on gaia-11000 was difficult: we had very few coins, every validator was running auto-bonding scripts, and the unbonding time was 3 weeks. Luckily, gaia-12000 launched, which gave us more coins to test with.

Then Juan discovered a bug in Tendermint Core. If the KMS returns a valid signature only after more than 10 seconds, which can happen during the Ledger’s initialisation, Tendermint Core still receives that signature and keeps asking the KMS for future ones, but it does not broadcast any of them. This is what had happened when the integration was tested on gaia-11000.

Everything was swiftly patched: Upgrading validator app to 0.6.0.

Conclusion and Reflections

Among all the signing options, we chose to integrate the Ledger Nano S with the KMS because, unlike the YubiHSM, it is programmable, which makes in-hardware double-sign protection possible for all of Tendermint’s slashing conditions.

With the new integration, all the slashable faults are covered in hardware. The Ledger Nano S verifies all the slashing conditions before producing a signature. Because this check happens in hardware, the solution is resilient against host compromise: to cause a double-signature, an attacker would have to break into the host and also into the secure element on the Ledger.

Besides Cosmos, the integration will be usable for IRISnet and for any future chain running on Tendermint Core.

We ❤ Open-Source

At the moment, the integration is being tested further, and we will continue to work with Juan to improve its user experience for validators. Initially, we thought of keeping the integration closed-source, but it is clearly in the best interest of the overall Cosmos network to have more secure validators at genesis, so we decided to open-source it.

If you want to launch with in-hardware double-sign protection, try the Ledger + KMS integration. If you need help getting it to work, message Juan and Adrian on Twitter, as they are happy to assist. As a token of appreciation, please consider delegating atoms to Cryptium Labs or inviting Juan for a beer when you’re around Zug.