At 1:14pm Pacific time, May 15th, the Stellar network halted for 67 minutes due to an inability to reach consensus. During that time no ledgers were closed and no transactions were processed — basically, Stellar stopped.
However, the ledger state remained safe and consistent across the network. Stellar has roughly 150,000 users every day and over 3 million total accounts. No one lost their money; no one’s balances were confused by a fork. At 2:21, ledgers began closing where they left off, and the network is healthy this morning.
Needless to say, an outage like this is highly undesirable, and it uncovered a few improvements we need to make. Here are the main takeaways, which we expand upon below.
1) The halt wasn’t because Stellar’s Consensus Protocol failed — in fact, it worked as intended. For a system like Stellar, a temporary halt is preferable to the permanent confusion of a fork. But yesterday shows that Stellar needs better tooling around uptime. We need better status monitoring for validators, and it needs to be easier to restart a validator after it goes down.
2) We’ve seen claims that Stellar is “over-centralized” and that somehow a failure with SDF’s nodes dragged down the whole network. Ironically, the opposite is true. Stellar has added many new nodes recently. In retrospect, some new nodes took on too much consensus responsibility too soon. We need better community standards around maintenance timings, quorumset building, and validator configuration.
The Protocol’s Role in the Halt
As a fundamental design choice, Stellar prefers consistency and partition resilience over liveness. In other words, when faced with consensus uncertainty, the Stellar Consensus Protocol (SCP) prefers to halt rather than operate in a potentially inconsistent state. This is different from other blockchains, in which “the chain must go on” even at the price of soft forks.
Financial institutions prefer downtime to inconsistent data, which is why they choose Stellar. It’s much better for a financial network to go offline temporarily than to produce permanent false or disputed results.
Still, with the right tooling, Stellar shouldn’t need to halt. Here’s how we will mitigate future risk:
- Better monitoring and alerting. People running validators need to know when nodes in their quorumsets are missing. We are doing several things to improve this. We are changing stellar-core to make it much more obvious when nodes are missing and to let operators receive alerts when an important node is down. We will also work with Stellarbeat to make it more obvious when critical nodes are missing from the network, and we’ll make a bot that posts to our public validators channel any time a node goes down. All of this should make it much less likely that the network slips into the fragile state it was in yesterday.
- Quicker restarts. The other important focus is helping operators recover quickly if the network does halt. Right now the process imposed by stellar-core is cumbersome and requires too much coordination among participants. There are a few ways to let validators recover quickly and easily from a stuck state, and we are now prioritizing getting these into stellar-core. This outage would have been much shorter with these in place.
Even before this halt, we’d been working on improving the reporting capabilities of stellar-core. Stellar-core 11.1.0RC already contains a command for getting a full transitive quorum set report. Other monitoring commands will be prioritized.
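To make the monitoring idea concrete, here is a minimal sketch of the kind of check an operator alert could run. It is not stellar-core code. The `QuorumSet` type, the `health` and `alert_level` functions, and the node names are all hypothetical, and it assumes a flat quorumset (a threshold over a list of validators), whereas real quorumsets can be nested.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QuorumSet:
    """Hypothetical flat quorumset: need `threshold` of `validators` to agree."""
    threshold: int
    validators: frozenset


def health(qset, reachable):
    """Return (missing validators, margin).

    margin is how many more validators can disappear before the
    threshold can no longer be met; a negative margin means it is
    already unreachable.
    """
    missing = sorted(qset.validators - set(reachable))
    present = len(qset.validators) - len(missing)
    return missing, present - qset.threshold


def alert_level(margin):
    """Map the remaining margin to a coarse alert level (names are illustrative)."""
    if margin < 0:
        return "BLOCKED"   # threshold cannot be met right now
    if margin == 0:
        return "FRAGILE"   # one more failure blocks this node
    return "OK"


# Hypothetical example: 3-of-5 quorumset with two validators unreachable.
qset = QuorumSet(3, frozenset({"node-a", "node-b", "node-c", "node-d", "node-e"}))
missing, margin = health(qset, {"node-a", "node-b", "node-c"})
print(missing, margin, alert_level(margin))  # ['node-d', 'node-e'] 0 FRAGILE
```

The point of reporting a margin rather than a simple up/down flag is that the dangerous condition is the one before the halt, when the network still works but cannot absorb another failure.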
Stellar’s Increasing Decentralization
For the past few months the Stellar community has been hard at work setting up new validators and building diverse quorum sets, so Stellar works without SDF’s direct involvement. You can read more about this effort in SatoshiPay’s recent post.
Many of these new nodes are still working toward the standard of availability that the network expects. In the past few weeks we repeatedly saw misconfigured or down validators hampering consensus. This left the network’s liveness fragile: an additional failure or two at the wrong time could bring the whole network to a halt. And that’s exactly what happened yesterday: Keybase took down their validator for maintenance at a time when other validators were shaky or down, and Stellar stopped.
Here’s how we’ll keep this from happening again:
- Better onboarding for new validators. Users need published standards and explorers to help them create “good” quorumsets. There should be tools to allow simple “what-if” scenarios. Reasoning about quorum intersection isn’t easy, and there needs to be more public guidance.
- Better operator standards. We will increase operator coordination so that maintenance schedules are publicly communicated. We will also help operators keep their nodes and their quorum choices up-to-date.
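A simple “what-if” tool of the kind described above could look something like the following sketch. It answers one question: if a given set of validators goes down, can the remaining ones still form a quorum? The `max_quorum` function and the five-node configuration are hypothetical, and the sketch assumes flat quorumsets in which each node lists itself among its validators.

```python
def max_quorum(qsets, up):
    """Return the largest quorum that the nodes in `up` can form, or an empty set.

    qsets maps node -> (threshold, set of validators).
    A set of nodes is a quorum when every member sees at least its
    threshold of its own validators inside the set. Repeatedly dropping
    nodes that cannot meet their threshold leaves the largest such set.
    """
    alive = set(up) & set(qsets)
    changed = True
    while changed:
        changed = False
        for node in list(alive):
            threshold, validators = qsets[node]
            if len(alive & validators) < threshold:
                alive.discard(node)
                changed = True
    return alive


# Hypothetical network: five nodes, each requiring 4 of the same 5.
nodes = {"a", "b", "c", "d", "e"}
qsets = {n: (4, set(nodes)) for n in nodes}

# What if one node goes down for maintenance? The other four still form a quorum.
print(sorted(max_quorum(qsets, nodes - {"a"})))       # ['b', 'c', 'd', 'e']

# What if a second node fails at the same time? No quorum is left: the network halts.
print(sorted(max_quorum(qsets, nodes - {"a", "b"})))  # []
```

Running this before scheduled maintenance shows whether taking a node down leaves any room for an unplanned failure elsewhere.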
The State of the Network
In response to yesterday’s halt, key validators on the network coordinated a configuration change in which quorumsets were reduced to only include highly available validators. Within an hour, the network was alive, processing transactions and closing ledgers.
UPDATE: The outage on May 15 left the Stellar network in a fragile state, with only 4 parties as the core validators. Later, we experienced a brief additional problem as the quorum sets of two parties no longer had sufficient overlap with the other two. The network was taken down briefly while we repaired this.
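The overlap problem can be illustrated with a brute-force quorum intersection check on a four-node network. This is a sketch under stated assumptions, not the actual configuration involved or the checker in stellar-core. The function names and both example configurations are hypothetical, quorumsets are assumed flat with each node listing itself, and the enumeration is exponential in the number of nodes, so it only suits very small networks.

```python
from itertools import combinations


def is_quorum(qsets, members):
    """A non-empty set is a quorum if every member meets its threshold inside it."""
    return bool(members) and all(
        len(members & qsets[n][1]) >= qsets[n][0] for n in members
    )


def all_quorums(qsets):
    """Enumerate every quorum by checking every non-empty subset of nodes."""
    nodes = sorted(qsets)
    return [
        frozenset(c)
        for size in range(1, len(nodes) + 1)
        for c in combinations(nodes, size)
        if is_quorum(qsets, frozenset(c))
    ]


def disjoint_quorum_pair(qsets):
    """Return two quorums with no node in common, or None if all quorums overlap."""
    for q1, q2 in combinations(all_quorums(qsets), 2):
        if not (q1 & q2):
            return q1, q2
    return None


everyone = frozenset({"a", "b", "c", "d"})

# Hypothetical healthy configuration: every node requires 3 of the 4.
healthy = {n: (3, everyone) for n in everyone}

# Hypothetical broken configuration: two pairs that only listen to each other.
split = {
    "a": (2, frozenset({"a", "b"})),
    "b": (2, frozenset({"a", "b"})),
    "c": (2, frozenset({"c", "d"})),
    "d": (2, frozenset({"c", "d"})),
}

print(disjoint_quorum_pair(healthy))        # None
print(disjoint_quorum_pair(split) is None)  # False: {a, b} and {c, d} never overlap
```

When two quorums share no node, each side can reach agreement without the other, which is exactly the situation in which a fork becomes possible.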