May 15th Network Halt

At 1:14pm Pacific time, May 15th, the Stellar network halted for 67 minutes due to an inability to reach consensus. During that time no ledgers were closed and no transactions were processed — basically, Stellar stopped.

However, the ledger state remained safe and consistent across the network. Stellar has roughly 150,000 users every day and over 3 million total accounts. No one lost their money; no one’s balances were confused by a fork. At 2:21, ledgers began closing where they left off, and the network is healthy this morning.

Needless to say, an outage like this is highly undesirable, and it uncovered a few improvements we need to make. Here are the main takeaways, which we expand upon below.

1) The halt wasn’t because Stellar’s Consensus Protocol failed — in fact, it worked as intended. For a system like Stellar, a temporary halt is preferable to the permanent confusion of a fork. But yesterday shows that Stellar needs better tooling around uptime. We need better status monitoring for validators, and it needs to be easier to restart a validator after it goes down.

2) We’ve seen claims that Stellar is “over-centralized” and that somehow a failure with SDF’s nodes dragged down the whole network. Ironically, the opposite is true. Stellar has added many new nodes recently. In retrospect, some new nodes took on too much consensus responsibility too soon. We need better community standards around maintenance timings, quorumset building, and validator configuration.

The Protocol’s Role in the Halt

Financial institutions prefer downtime to inconsistent data, which is why they choose Stellar. It’s much better for a financial network to go offline temporarily than to produce permanent false or disputed results.

Still, with the right tooling, Stellar shouldn’t need to halt. Here’s how we will mitigate future risk:

  • Better monitoring and alerting. People running validators need to know when nodes in their quorumsets are missing. We are doing several things to improve this. We are changing stellar-core to make missing nodes much more obvious and to let operators receive alerts when an important node is down. We will work with Stellarbeat to make it more obvious when critical nodes are missing from the network. We’ll also make a bot that posts to our public validators channel anytime a node goes down. All of this should make it much less likely that the network slips into the fragile state it did yesterday.
  • Quicker restarts. The other important focus is to help operators recover quickly if the network does halt. Right now the process imposed by stellar-core is cumbersome and requires too much coordination among participants. There are a few ways to let validators recover quickly and easily from a stuck state, and we are now prioritizing getting these into stellar-core. This outage would have been much shorter with these in place.
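
To make the monitoring idea concrete, here is a minimal sketch of the kind of check an operator could run. It is not how stellar-core does it: the flat `QuorumSet` type (real quorum sets can be nested), the node names, and the alert levels are all assumptions made for illustration. It reports which configured validators are missing and how many more can fail before the threshold can no longer be met.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QuorumSet:
    """Hypothetical flat quorum set: `threshold` of `validators` must agree."""
    threshold: int
    validators: frozenset


def check_quorum_set(qset, seen_nodes):
    """Return (missing, margin) for one validator's quorum set.

    missing: configured validators that were not seen recently.
    margin:  how many more validators can drop before the threshold
             can no longer be met (negative means it is already unmet).
    """
    missing = sorted(qset.validators - set(seen_nodes))
    available = len(qset.validators) - len(missing)
    margin = available - qset.threshold
    return missing, margin


def alert_level(margin):
    """Map the remaining margin to an illustrative alert level."""
    if margin < 0:
        return "CRITICAL"  # threshold unmet: this node cannot make progress
    if margin == 0:
        return "WARNING"   # one more failure would stall this node
    return "OK"


# Illustrative data: node "D" has not been heard from recently.
qset = QuorumSet(threshold=3, validators=frozenset({"A", "B", "C", "D"}))
missing, margin = check_quorum_set(qset, seen_nodes={"A", "B", "C"})
print(missing, margin, alert_level(margin))
```

A zero margin is the fragile state described above: nothing has failed yet from this node's point of view, but one more missing validator would stop it.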

Even before this halt, we’d been working on improving the reporting capabilities of stellar-core. stellar-core 11.1.0RC already contains a command for getting a full transitive quorum set report. Other monitoring commands will be prioritized.

Stellar’s Increasing Decentralization

Many of the recently added nodes are still working toward the standard of availability that the network expects. In the past few weeks we repeatedly saw misconfigured or down validators hampering consensus. This left the network in a fragile state in which an additional failure or two at the wrong time could bring everything to a halt. And that’s exactly what happened yesterday: Keybase took down their validator for maintenance at a time when other validators were shaky or down, and Stellar stopped.

Here’s how we’ll keep this from happening again:

  • Better onboarding for new validators. Users need published standards and explorers to help them create “good” quorumsets. There should be tools to allow simple “what-if” scenarios. Reasoning about quorum intersection isn’t easy, and there needs to be more public guidance.
  • Better operator standards. We will increase operator coordination so that maintenance schedules are publicly communicated. We will also help operators keep their nodes and their quorum choices up-to-date.
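
A simple “what-if” tool of the kind mentioned above could look like the following sketch. It assumes a simplified model that is not taken from stellar-core: each node has one flat quorum set given as `(threshold, validators)`, and each node lists itself among its validators. The node names and thresholds are made up. The function asks whether the nodes that remain up still contain a quorum, meaning a group in which every member's threshold is met by the group itself.

```python
def slice_satisfied(node, config, group):
    """True if `group` contains at least `threshold` of the node's validators."""
    threshold, validators = config[node]
    return len(validators & group) >= threshold


def largest_quorum(config, alive):
    """Largest subset of `alive` in which every member's threshold is met
    by the subset itself. An empty result means no quorum can form."""
    quorum = set(alive) & set(config)
    changed = True
    while changed:
        changed = False
        for node in sorted(quorum):
            if not slice_satisfied(node, config, quorum):
                quorum.discard(node)  # this node cannot agree; drop it and re-check
                changed = True
    return quorum


def what_if(config, down):
    """Simulate taking the nodes in `down` offline."""
    return largest_quorum(config, set(config) - set(down))


# Illustrative network: five nodes, each requiring 4 of the same 5.
config = {name: (4, frozenset("ABCDE")) for name in "ABCDE"}

print(sorted(what_if(config, {"E"})))       # one node down for maintenance
print(sorted(what_if(config, {"D", "E"})))  # a second failure at the same time
```

In this toy network one node can be taken down safely, but a second simultaneous failure leaves no quorum, which is the sort of result an operator would want to see before scheduling maintenance.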

The State of the Network

Overall, yesterday’s outage was a stress test — one that Stellar passed in terms of user safety but failed in terms of uptime. We have already learned from that failure and will learn more in the days to come. Taking the above steps to prevent an outage like this will be our immediate priority. Thank you to the many community members and partners who helped Stellar get back online as quickly as it did.

UPDATE: The outage on May 15 left the Stellar network in a fragile state, with only 4 parties as the core validators. Later, we experienced a brief additional problem as the quorum sets of two parties no longer had sufficient overlap with the other two. The network was taken down briefly while we repaired this.
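
To illustrate what insufficient overlap means, here is a brute-force sketch that lists every quorum in a small network and looks for pairs that share no node. The two four-party configurations are hypothetical and are not the actual quorum sets involved. As a simplification, each node has one flat `(threshold, validators)` set that includes itself. Enumerating all subsets is only practical for a handful of nodes.

```python
from itertools import combinations


def is_quorum(candidate, config):
    """A non-empty set is a quorum if every member's threshold is met within it."""
    return bool(candidate) and all(
        len(config[node][1] & candidate) >= config[node][0] for node in candidate
    )


def all_quorums(config):
    """Yield every quorum by checking each non-empty subset of nodes."""
    nodes = sorted(config)
    for size in range(1, len(nodes) + 1):
        for combo in combinations(nodes, size):
            candidate = frozenset(combo)
            if is_quorum(candidate, config):
                yield candidate


def disjoint_quorum_pairs(config):
    """Return pairs of quorums with no node in common (each pair could diverge)."""
    quorums = list(all_quorums(config))
    return [(a, b) for a, b in combinations(quorums, 2) if not (a & b)]


# Hypothetical split: two pairs of parties that only rely on each other.
split = {
    "A": (2, frozenset("AB")), "B": (2, frozenset("AB")),
    "C": (2, frozenset("CD")), "D": (2, frozenset("CD")),
}
# Hypothetical repair: every party requires 3 of all 4.
repaired = {name: (3, frozenset("ABCD")) for name in "ABCD"}

print(disjoint_quorum_pairs(split))
print(disjoint_quorum_pairs(repaired))
```

In the split configuration the two pairs can each reach agreement without the other, so the check reports them as disjoint. In the repaired configuration every quorum shares at least two nodes with every other, and the list of disjoint pairs is empty.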
