BlockCypher Ethereum Outage Post-Mortem 2020-11-11

Quentin Le Sceller
Published in BlockCypher Blog
Nov 18, 2020

On Wednesday, November 11, BlockCypher's Ethereum API experienced a severe service disruption as a consequence of an unplanned hard fork on the Ethereum network. We understand that many platforms and projects rely on BlockCypher APIs, and we acknowledge that such a long interruption is not acceptable. We would like to apologize to all our customers for this situation.

In order to move forward, we would like to share the details of this incident: what happened, what we did, and what actions we are taking to build a more resilient platform so that you can build even more confidently with BlockCypher. Note that this post-mortem applies only to our Ethereum service; no other services were impacted.

Timeline

  • 11/11/20 07:08 UTC: Our Ethereum nodes begin to encounter an error during Merkle root verification of block 11234873. Synchronization is halted.
  • 11/11/20 07:23 UTC: Our alerting system is triggered: the chain is stuck about 88 blocks behind the tip (see the monitoring sketch after this timeline). Notifications are sent to our engineers.
  • 11/11/20 12:47 UTC: The cause of the sync error is identified: a Merkle root mismatch at block 11234873.
  • 11/11/20 13:58 UTC: We receive information that some service providers such as Infura are also down. We realize we are not the only ones experiencing this consensus error.
  • 11/11/20 14:30 UTC: Initial attempts to debug our state fail.
  • 11/11/20 14:56 UTC: A newer version of go-ethereum (v1.9.19), which we were already working on, is deployed as an emergency measure. This version contains the fix for the consensus issue.
  • 11/11/20 15:00 UTC: Another issue emerges with Ethereum chain synchronization: a “nonce too low” error during block verification.
  • 11/11/20 17:00 UTC: Several attempts are made to fix this issue, including obtaining a freshly synced Ethereum state.
  • 11/11/20 17:47 UTC: We identify the issue and deploy a hotfix. The chain is syncing again. The status page is updated accordingly.
  • 12/11/20 14:00 UTC: We become aware that constant (read-only) calls on our contract API are failing. We begin syncing another Ethereum node from scratch on the latest version in order to compare states and check the API calls.
  • 16/11/20 14:00 UTC: We are waiting for the geth node to sync in order to fix the contract API. In the meantime, we realize that some internal transactions are not being correctly reported.
  • 16/11/20 22:00 UTC: The root cause of the errors is identified and fixed; we update our infrastructure.
  • 17/11/20 6:49 UTC: Another issue with missing internal transactions is fixed. The hotfix is immediately deployed.
  • 17/11/20 16:30 UTC: The contract issue is identified and fixed. Incident marked as resolved.
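A note on the alerting step at 07:23 UTC: conceptually, the check that flagged the stalled sync simply compares our node's head block against the network tip and raises an alarm once the gap exceeds a threshold. The Go sketch below illustrates that idea only; the endpoints, the 50-block threshold, and the polling interval are assumptions for illustration, not our production monitoring code.

package main

import (
    "context"
    "log"
    "math/big"
    "time"

    "github.com/ethereum/go-ethereum/ethclient"
)

// Hypothetical endpoints; substitute your own node and an independent reference.
const (
    localNode = "http://localhost:8545"
    reference = "https://cloudflare-eth.com"
    maxLag    = 50 // alert once the local node falls this many blocks behind
)

// headNumber returns the latest block number reported by the given endpoint.
func headNumber(ctx context.Context, url string) (*big.Int, error) {
    client, err := ethclient.Dial(url)
    if err != nil {
        return nil, err
    }
    defer client.Close()

    header, err := client.HeaderByNumber(ctx, nil) // nil means "latest"
    if err != nil {
        return nil, err
    }
    return header.Number, nil
}

func main() {
    for {
        ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
        local, err1 := headNumber(ctx, localNode)
        remote, err2 := headNumber(ctx, reference)
        cancel()

        switch {
        case err1 != nil || err2 != nil:
            log.Printf("check failed: local=%v reference=%v", err1, err2)
        case new(big.Int).Sub(remote, local).Int64() > maxLag:
            log.Printf("ALERT: local node is %v blocks behind the tip (%v vs %v)",
                new(big.Int).Sub(remote, local), local, remote)
        default:
            log.Printf("ok: local=%v reference=%v", local, remote)
        }
        time.Sleep(time.Minute)
    }
}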

What was the root cause?

According to Péter Szilágyi, team lead at Ethereum:

Geth v1.9.7 (released 7th November, 2019) broke the EIP-211 implementation, whereby a memory area was shallow-copied, allowing it to be overwritten out of bounds. The bug was reported by John Youngseok Yang on the 15th July, 2020 and was silently fixed and shipped 5 days later in Geth v1.9.17 (20th July, 2020). This fix brought Geth back into consensus with Besu, Nethermind and OpenEthereum (and the Ethereum specification itself); however it broke consensus with earlier Geth releases.

Since BlockCypher was running go-ethereum v1.9.9 as a dependency, the sync error was triggered and our nodes started to stall.
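To make the failure mode Szilágyi describes more concrete: in Go, assigning one slice to another only copies the slice header, so the “copy” shares the original backing array and writes through it silently corrupt the original. The snippet below is a generic illustration of this class of aliasing bug, not the actual geth EIP-211 code.

package main

import "fmt"

func main() {
    // The EVM return-data buffer is conceptually a byte slice.
    returnData := []byte{1, 2, 3, 4}

    // Shallow "copy": both slice headers point at the same backing array.
    shallow := returnData

    // Deep copy: a distinct backing array, unaffected by later writes.
    deep := make([]byte, len(returnData))
    copy(deep, returnData)

    // Overwriting through the shallow copy silently mutates the original,
    // which is the kind of aliasing bug described in the quote above.
    shallow[0] = 0xFF

    fmt.Println(returnData) // [255 2 3 4]  (original corrupted)
    fmt.Println(deep)       // [1 2 3 4]    (independent copy intact)
}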

Why was BlockCypher using geth v1.9.9 when the latest version is v1.9.24?

As a policy, we do not systematically update to the latest version unless strictly required, as doing so is generally risky; we only update when the go-ethereum maintainers flag a release as required. In this case the risk of forking was completely unannounced, so we noticed it, along with everyone else, only when it happened.

We run a heavily patched version of go-ethereum that is tricky to maintain, and it takes us a long time to ensure that everything is correct.

Luckily, we already had an upgraded version (v1.9.19) almost ready. We patched and re-tested the update as quickly as possible. However, this version had to be rushed out and some features were not fully tested, hence the subsequent issues that we encountered.

What were the lessons learnt and how do we prevent this from happening again?

  1. Knowing that consensus issues might arise in older versions of go-ethereum, we will track the latest releases more closely, while still prioritizing stability over novelty. Along with this, we are deploying new environments to test the validity of our Ethereum state (a sketch of such a cross-check follows this list).
  2. Improve our incident recovery. As a result of the regression on internal transactions, some webhooks were not delivered. We will develop a procedure to allow the re-generation of internal transactions and the re-delivery of the corresponding webhooks.
  3. Review other areas of our infrastructure that cannot currently be simply replayed on failure.
  4. Investigate additional automated testing on our Ethereum systems.
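As a rough illustration of the state-validity testing mentioned in point 1, one simple signal is whether a candidate node and an independent reference node agree on the hash of a given block; a persistent mismatch indicates a consensus divergence of the kind that occurred at block 11234873. The Go sketch below uses hypothetical endpoints and is not our actual tooling.

package main

import (
    "context"
    "fmt"
    "log"
    "math/big"

    "github.com/ethereum/go-ethereum/ethclient"
)

// Hypothetical endpoints: the node under test and an independent reference.
const (
    candidateNode = "http://localhost:8545"
    referenceNode = "https://cloudflare-eth.com"
)

// sameBlockHash reports whether both nodes agree on the hash of block n.
// A mismatch at some height signals a consensus divergence like the one
// that stalled our sync at block 11234873.
func sameBlockHash(ctx context.Context, n int64) (bool, error) {
    a, err := ethclient.Dial(candidateNode)
    if err != nil {
        return false, err
    }
    defer a.Close()

    b, err := ethclient.Dial(referenceNode)
    if err != nil {
        return false, err
    }
    defer b.Close()

    ha, err := a.HeaderByNumber(ctx, big.NewInt(n))
    if err != nil {
        return false, err
    }
    hb, err := b.HeaderByNumber(ctx, big.NewInt(n))
    if err != nil {
        return false, err
    }
    return ha.Hash() == hb.Hash(), nil
}

func main() {
    ok, err := sameBlockHash(context.Background(), 11234873)
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println("nodes agree on block 11234873:", ok)
}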

This issue was the most severe since March 2019. While Ethereum is our newest set of APIs, our mission is to guarantee 99.999% uptime while smoothing out the wrinkles of blockchain development for our customers. In the fast-moving landscape of cryptocurrencies, this can often be a challenge.

We acknowledge that we are the interface between the bleeding edge and the rock-solid infrastructure we strive to offer. In this case, we failed. But based on our performance across all our services and the improvements we have in preparation, we trust you will continue to rely on us to power your applications for the years to come.
