Post Mortem: Tuesday 2 May 2023

Sun Hyuk Ahn
Published in Harmony
May 12, 2023

Summary

On May 2, 2023, Shard 0 (S0) consensus was interrupted for approximately 3 hours, leaving internal validator nodes unable to add block 41256348. The issue was resolved by reconfiguring the DNS node to point to the explorer node, allowing the internal validators to synchronize with external nodes. Further investigation into the code hash issue is underway to prevent future incidents.

Impact

Shard 0 consensus was interrupted and no transactions were processed for approximately 3 hours. Shard 0 consensus was restored at 2023-05-02 22:18:46 UTC.

Timeline

  • May 2, 07:12 PM (UTC): PagerDuty alerts for consensus stuck on shard 0
  • May 2, 08:54 PM (UTC): Revert from block 4125637 to 4125636
  • May 2, 10:14 PM (UTC): DNS is updated
  • May 2, 10:17 PM (UTC): Nodes are all restarted
  • May 2, 10:26 PM (UTC): Issue is resolved - consensus resumes

Root Cause

Internal S0 validator nodes could not add block 41256348 due to an error loading the code hash of a smart contract. Internal non-consensus nodes, explorer nodes, and external validators (not yet upgraded) were not impacted and successfully added block 41256348. As a result, internal and external nodes were at different heights and could not reach consensus.

Action Taken

To resolve the problem, the internal S0 validator nodes were restarted. In addition, the DNS node was reconfigured to point to the explorer node, which allowed the internal S0 validator nodes to synchronize to the latest block and rejoin consensus with the external nodes.

May 10 Outage Details

The Problem: The protocol team noticed that a number of internal validators were unable to sync to block 41613066. The issue was reported as a repeating error, where the validators were not signing blocks as expected. The corresponding non-signing BLS keys were identified, and their associated IP addresses were extracted.

Diagnostics: The team used a Python script (non-signing-keys.py) to identify the non-signing keys and associated validators. The output indicated that all non-signing keys were linked to internal Harmony validators. No external validators were involved in this issue.
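The actual non-signing-keys.py script is not published in this post, but its core logic can be sketched as follows. This is an illustrative assumption of what such a diagnostic might do: given a shard committee listing and the set of BLS keys that signed the last block, report the members that did not sign, together with their IP addresses. The `committee` data layout and field names here are hypothetical, not the real Harmony API.

```python
# Hypothetical sketch of a diagnostic like non-signing-keys.py:
# diff the expected committee against the keys that actually signed.

def find_non_signing_keys(committee, signed_keys):
    """Return (bls_key, ip) pairs for committee members that did not sign.

    committee: list of dicts with illustrative "bls_key" and "ip" fields.
    signed_keys: iterable of BLS keys observed in the block's commit signature.
    """
    signed = set(signed_keys)
    return [(m["bls_key"], m["ip"]) for m in committee if m["bls_key"] not in signed]


if __name__ == "__main__":
    # Illustrative sample data, not real keys or addresses.
    committee = [
        {"bls_key": "0xaaa...", "ip": "10.0.0.1"},
        {"bls_key": "0xbbb...", "ip": "10.0.0.2"},
        {"bls_key": "0xccc...", "ip": "10.0.0.3"},
    ]
    signed_keys = ["0xaaa..."]
    for key, ip in find_non_signing_keys(committee, signed_keys):
        print(f"non-signing key {key} at {ip}")
```

In this incident, such a diff would have returned only internal Harmony validator addresses, matching the team's conclusion that no external validators were affected.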

Comparison with a Previous Similar Issue: The team noted that this issue had a different behavior compared to a previous similar one. In the past, restarting the nodes had resolved the issue. This time, even though all explorer nodes (which were DNS nodes) had already reached block 41613066, the internal validators were still stuck.

The Error Messages: The main error encountered during this issue was related to a missing code hash. The error message read: "can't load code hash 422c55ce5f5e1ece63fb9347a319ccffb476da0a4e422082ab6714fe1285b3c2: not found". This message was appearing in the logs of the nodes that were unable to sync.
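A quick way to survey how widespread such an error is across a node's logs is to scan for the message and collect the distinct code hashes reported. The sketch below is an assumption built only from the message quoted above; the surrounding log layout is not the exact Harmony format.

```python
import re

# Pattern built from the error message quoted above; the rest of each
# log line's format is an assumption, so we search rather than match.
CODE_HASH_ERR = re.compile(r"can't load code hash ([0-9a-f]{64}): not found")

def missing_code_hashes(log_lines):
    """Return the sorted set of distinct code hashes the error reports."""
    hashes = set()
    for line in log_lines:
        m = CODE_HASH_ERR.search(line)
        if m:
            hashes.add(m.group(1))
    return sorted(hashes)
```

If every stuck node reports the same single hash, that points at one contract's code record being absent from the state database rather than broader corruption, which is consistent with the cache hypothesis discussed later.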

The Restart: The team decided to restart some nodes while keeping one node that couldn't sync for further investigation. After the restart, some nodes were still logging errors related to the missing code hash. However, they started writing new blocks to the explorer DB, while the chain block number stayed behind.

Explorer DB vs Chain DB: The team noticed that the explorer DB and the chain DB were behaving differently: the explorer DB was being updated with new blocks, but the chain DB lagged behind.
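The divergence described above can be checked with a simple height comparison on each node. This is a minimal sketch of that check; how the two heights are actually queried from the explorer DB and chain DB is left as a placeholder, since the post does not describe those interfaces.

```python
# Minimal sketch: compare the latest block height recorded in the explorer
# DB against the chain DB head on the same node. The heights are assumed to
# be obtained by whatever query each database supports.

def check_db_divergence(explorer_height, chain_height):
    """Return a human-readable verdict on whether the two DBs agree."""
    if explorer_height == chain_height:
        return "in sync"
    return f"diverged: explorer at {explorer_height}, chain head at {chain_height}"
```

On the stuck nodes, such a check would have shown the explorer height advancing past 41613066 while the chain head stayed behind it, which is the asymmetry the team observed.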

Further Investigations: The team decided not to restart the remaining non-syncing nodes until the problem was isolated. They also noticed that the beacon-ness of the sync was wrong on one of the nodes, which could be unrelated to the main issue but was still worth investigating.

Logging Issues: The team found that debug logging could not be enabled on any of the three nodes without a restart, which hindered their ability to diagnose the issue.

Flag Setting Issue: The team noticed that a flag was wrongly set to false, irrespective of the Shard ID. The code responsible for this was identified as part of the old syncing code.

Persistent Error: The missing code hash error was persisting on some nodes even after a restart. The error logs indicated that the problem was related to the addition of a new block (41613066) to the blockchain.

Incorrect Beacon Flag: A point of confusion was the isBeacon flag, which was set to false. The team speculated that it should be true and identified a related fix in a pull request (PR#4428).
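The suspected bug in the two preceding points can be illustrated in miniature. This sketch is not the actual Harmony code: a correctly derived beacon flag should depend on the shard ID (Harmony's beacon chain is shard 0), whereas the stuck nodes behaved as if the flag were hardcoded to false regardless of shard.

```python
# Illustrative sketch of the suspected flag bug, not the real implementation.

BEACON_SHARD_ID = 0  # Harmony's beacon chain is shard 0

def is_beacon(shard_id):
    """How the flag should plausibly be derived: true only for shard 0."""
    return shard_id == BEACON_SHARD_ID

def is_beacon_buggy(shard_id):
    """Mirrors the reported symptom: always false, irrespective of shard ID."""
    return False
```

Since the stuck validators were on shard 0, a flag computed as in the buggy variant would incorrectly report a non-beacon sync, matching the team's speculation that it should have been true and the fix proposed in PR#4428.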

Concurrency Issue: The team discussed a potential issue with the SetCurrentCommitSig function being called concurrently by two goroutines. This could lead to inconsistencies due to the indeterminate order of goroutine execution. A fix for this issue was proposed in a separate pull request (PR#4429).
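The race described above is in Go (two goroutines calling SetCurrentCommitSig concurrently), but the shape of the problem and of the likely fix can be sketched as a Python threading analogy. The class name and method below are illustrative stand-ins, not the Harmony types; the lock mirrors the kind of serialization a fix such as PR#4429 would introduce.

```python
import threading

# Python analogy of the Go race: without serialization, two concurrent
# writers to the current commit signature can interleave in an
# indeterminate order. Guarding the write with a lock ensures each
# update is applied atomically.

class BlockState:
    """Illustrative stand-in for the structure holding the commit signature."""

    def __init__(self):
        self._lock = threading.Lock()
        self.commit_sig = None

    def set_current_commit_sig(self, sig):
        # The lock plays the role of the mutex a concurrency fix would add.
        with self._lock:
            self.commit_sig = sig
```

With the lock, the final value is still whichever writer ran last, but each write is atomic and the stored signature can never be a partially applied mix of two updates.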

Cache Issue: The team also considered the possibility of a cache-related problem, as the same error persisted across all stuck nodes.

To summarize, the team is dealing with an issue where internal validators are unable to sync to a certain block. The problem appears to be related to a missing code hash, and the usual fix of restarting the nodes has not fully resolved it. Further investigation is ongoing, focused on the diverging behavior of the explorer DB and chain DB and on the incorrect beacon-ness of the sync.

What’s Next

  • Fix the code hash issue (In Progress)
  • Hardfork (v2023.2.0) postponed until the root cause is formally fixed

Follow Up

For more questions and discussion, refer to our talk forum.
