Switcheo Exchange Downtime in August 2018

Ivan Poon
Aug 17, 2018

Hi everyone,

As many are aware, Switcheo Exchange launched our V2 smart contract on the NEO blockchain on 25 July. Shortly after the launch, our web interface and API server experienced an unexpected increase in traffic, which caused unplanned downtime. Our engineers worked round the clock to upgrade our infrastructure and fix problems that stemmed from two main issues: newly developed queries that were not yet well tested, and the extra traffic from our long-awaited launch. These events lasted about 24 hours, and we were able to deliver a stable experience for a good number of days after resolving those issues.

However, to our dismay, we then encountered more issues, which caused multiple downtimes over the past two weeks, including a long 14-hour downtime yesterday, 16 August 2018.

In this article, I will be sharing with the Switcheo community more about the reasons behind the recent downtimes.

Trading Architecture

As I’m sure most know, Switcheo is a non-custodial decentralized exchange. This means that funds sent to the exchange smart contract are not accessible without the depositor’s private key, and similarly, trades cannot be made without the appropriate signatures from this private key. The tradeoff of this additional trustless behavior, however, is that every trade needs to be broadcast on the NEO blockchain for the proper exchange of tokens (balance state transitions) to occur.

Switcheo Exchange V2 has a scheduling engine that ensures that these trades are eventually broadcast and committed on the NEO blockchain. This allows users to confirm trades instantly, relying on Switcheo’s order matching engine to pick the correct offers to exchange their tokens against. The appropriate transactions are then put into a priority queue and broadcast asynchronously to the blockchain. So far, we have not encountered any fundamental issues where the system has failed to perform correctly.
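
As a rough illustration of the queueing step described above, here is a minimal Python sketch of a priority queue that hands transactions to a broadcaster in priority order. This is not Switcheo's actual engine: the names `QueuedTx`, `BroadcastQueue`, `enqueue` and `drain`, and the convention that a lower number means higher priority, are all assumptions made for the example. It is also synchronous for brevity, whereas the real engine broadcasts asynchronously.

```python
import heapq
import itertools
from dataclasses import dataclass, field


@dataclass(order=True)
class QueuedTx:
    priority: int                      # assumption: lower value is broadcast first
    seq: int                           # tie-breaker that preserves insertion order
    tx_id: str = field(compare=False)  # hypothetical transaction identifier


class BroadcastQueue:
    """Hypothetical priority queue of trade transactions awaiting broadcast."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()

    def enqueue(self, tx_id, priority):
        heapq.heappush(self._heap, QueuedTx(priority, next(self._counter), tx_id))

    def drain(self, broadcast, batch_size):
        """Pop up to batch_size transactions and pass each id to `broadcast`."""
        sent = []
        while self._heap and len(sent) < batch_size:
            tx = heapq.heappop(self._heap)
            broadcast(tx.tx_id)
            sent.append(tx.tx_id)
        return sent


queue = BroadcastQueue()
queue.enqueue("trade-b", priority=2)
queue.enqueue("trade-a", priority=1)
queue.enqueue("trade-c", priority=2)

broadcasted = []                       # stand-in for a real network call
queue.drain(broadcasted.append, batch_size=2)
```

In this example the two highest-priority transactions are handed over first, and `trade-c` stays queued for the next batch.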

It is important to note that since this process is currently done off-chain, it is not completely trustless. That is to say, users trust that Switcheo will eventually broadcast these transactions so that their tokens are actually swapped.

As such, we also have safeguards in place to ensure that all trades have their corresponding transactions broadcast and confirmed on the blockchain in a timely manner. This is also known as clearing of trades, since the actual funds are controlled by the smart contract, and not by Switcheo.

This extra check guards our users against extensive rollbacks of trades in the event of an accidental fault or bug in Switcheo Exchange’s off-chain order matching. When an unexpectedly large number of transactions cannot be confirmed on the blockchain, our backend automatically pauses trading.
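
The safeguard described above could look roughly like the following Python sketch. It is only an illustration under stated assumptions: the class name `ClearingGuard`, its methods, and the choice to keep trading paused until someone resumes it manually are hypothetical, not a description of Switcheo's backend.

```python
class ClearingGuard:
    """Hypothetical safeguard: pause trading when too many trades are unconfirmed."""

    def __init__(self, max_pending):
        self.max_pending = max_pending   # assumed threshold of unconfirmed transactions
        self.pending = set()
        self.trading_paused = False

    def on_broadcast(self, tx_id):
        """Record a broadcast transaction and re-check the threshold."""
        self.pending.add(tx_id)
        if len(self.pending) > self.max_pending:
            self.trading_paused = True

    def on_confirmed(self, tx_id):
        """Record an on-chain confirmation; this alone does not resume trading."""
        self.pending.discard(tx_id)

    def resume(self):
        """Manual resume, assumed to happen only after the system has been examined."""
        self.trading_paused = False


guard = ClearingGuard(max_pending=2)
for tx in ("t1", "t2", "t3"):
    guard.on_broadcast(tx)
guard.on_confirmed("t1")
```

After the third broadcast the guard pauses trading, and it stays paused even once `t1` confirms, so that the backlog can be inspected before `resume()` is called.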

Mempool Issues

This pause in trading has occurred intermittently over the past weeks, causing many frustrating downtimes. However, it was not for the reasons we anticipated.

Seventeen thousand voices cried out in unison on Telegram

I’m sure many of you have seen the above error message. This message has appeared multiple times, but never for either of the reasons we expected, which are:

  1. a bug or error in our off-chain order matching or transaction scheduler, which is causing multiple unconfirmed trades, or
  2. a sudden increase in trading load (which is what the default, and so far incorrect, error message reports), resulting in a number of pending transactions that is higher than our thresholds allow (these thresholds are in turn derived from historical trading loads and NEO blockchain TPS).

In either case, a pause in trading would allow us to verify that there are no issues with the system. At the start, this threshold was conservative.

However, we found that in each case, the true reason the transactions were not readily cleared was that the NEO blockchain was not confirming our transactions at the rate we expected.

Even though trade transactions were broadcast properly, they ended up pending confirmation for a long time as they were queued in the blockchain’s mempool (for our purposes, the mempool can simply be taken as the pool of transactions awaiting processing). This can also cause a cascading effect to some extent, as dependent balance state transitions cannot occur.

While many people have commented on the cause of the large number of transactions, the reason for the large mempool is largely irrelevant to us, as the number of transactions we have seen so far is not infeasible for the NEO blockchain to process, even at the current stage.

We discovered after the first incident that a NEO block can contain only 20 free (0 network fee) transactions, even though the maximum number of transactions per block is 500. We therefore quickly upgraded our broadcasting engine to attach fees to all our trade transactions. After that, we thought that this issue was resolved once and for all.
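
To make the effect of these limits concrete, here is a small Python sketch that fills one block under the two figures mentioned above (20 free transactions, 500 transactions in total). Only those two numbers come from this article; the selection logic, the function names `pack_block` and `blocks_to_clear_free`, and the integer fee units are assumptions for illustration and do not reflect the actual NEO node implementation.

```python
import math

MAX_TX_PER_BLOCK = 500       # figure quoted in this article
MAX_FREE_TX_PER_BLOCK = 20   # figure quoted in this article


def pack_block(mempool):
    """Pick transactions for one block from (tx_id, fee) pairs in arrival order."""
    block, free_used = [], 0
    for tx_id, fee in mempool:
        if len(block) == MAX_TX_PER_BLOCK:
            break
        if fee == 0:
            if free_used == MAX_FREE_TX_PER_BLOCK:
                continue                 # free quota exhausted, skip this one
            free_used += 1
        block.append(tx_id)
    return block


def blocks_to_clear_free(free_count):
    """Minimum number of blocks needed to confirm a backlog of free transactions."""
    return math.ceil(free_count / MAX_FREE_TX_PER_BLOCK)


# Hypothetical backlog: 100 free transactions ahead of 5 paid ones.
mempool = [(f"free-{i}", 0) for i in range(100)] + [(f"paid-{i}", 1) for i in range(5)]
block = pack_block(mempool)
free_backlog_blocks = blocks_to_clear_free(100)
```

Under this simplified model, the 5 paid transactions all fit into the first block alongside 20 free ones, while the 100 free transactions would need at least 5 blocks to clear. That is the behaviour we expected once fees were attached.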

However, when the attacker struck again, we saw that while some of our paid transactions went through, there were also many transactions that did not get into blocks even though there was space in them!

The screenshot below shows an example of such a transaction during a phase where the mempool was large.

Paid transactions stuck in mempool with free transactions!

After consulting NEO Global Development (NGD), which is in charge of leading the development (and development community) of the NEO blockchain, we came to an understanding that there was an issue with the way transactions are relayed.

Because the NEO node architecture does not prioritize transactions with high fees when relaying, paid transactions may end up remaining in the mempool for a long time, as free transactions that cannot be put into the block (due to the limit of 20) are bounced around instead (a.k.a. thrashing).
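
The difference between relaying in arrival order and relaying by fee can be shown with a toy model in Python. This is a deliberately simplified sketch and not how NEO nodes work internally: the `relay_order` function, the `capacity` parameter (how many transactions a node passes on in one round) and the integer fee units are all hypothetical.

```python
def relay_order(pending, capacity, prioritize_fees):
    """Return the tx ids a node would relay this round.

    `pending` holds (tx_id, fee) pairs in arrival order. With prioritize_fees
    set, higher fees go first; the sort is stable, so arrival order is kept
    among transactions paying the same fee.
    """
    if prioritize_fees:
        ordered = sorted(pending, key=lambda tx: -tx[1])
    else:
        ordered = list(pending)
    return [tx_id for tx_id, _fee in ordered[:capacity]]


# Hypothetical situation: 30 free transactions arrived before one paid trade.
pending = [(f"free-{i}", 0) for i in range(30)] + [("paid-0", 1)]

fifo = relay_order(pending, capacity=20, prioritize_fees=False)
by_fee = relay_order(pending, capacity=20, prioritize_fees=True)
```

In the arrival-order case the paid transaction is not relayed at all in this round, whereas with fee prioritization it is relayed first.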

NGD and other entities are working on reproducing and solving this problem. Unfortunately, we have no ETA for a resolution at this time.

However, our engineers have been working around the issue by steadily raising the aforementioned safeguard thresholds as we gain confidence in the correctness and robustness of our backend engines. On 15 Aug 2018, we made the largest increase in thresholds, and correctly broadcast hundreds of queued trades asynchronously, with a large delay, during a period of high transaction volume, without pausing trading.

As I prepared to announce the good news to the team on the morning of 16 August 2018, a second, unexpected incident occurred. The NEO blockchain experienced a “fork” on block #2,623,809.

Blockchain Fork

To be clear, a fork here simply means that two valid blocks were produced for the same block height. In a Proof-of-Work blockchain like Bitcoin or Ethereum, this is actually a somewhat common situation, and is therefore not typically described as a fork. This is because the rule for choosing the correct chain on PoW blockchains is simply to follow the longest current chain. On Ethereum, the orphaned blocks are known as “uncle” blocks instead (as they are a sibling of a parent block, or “uncle”).
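
The longest-chain rule mentioned above can be sketched in a few lines of Python. This follows the simplification used in this article (pick the longest chain) and ignores details of real PoW clients; the function names and the block labels are made up for the example.

```python
def choose_chain(chains):
    """Return the longest chain; on a tie, the chain listed first wins."""
    return max(chains, key=len)


def stale_blocks(chains):
    """Blocks that appear in some competing chain but not in the chosen one."""
    best = choose_chain(chains)
    seen = {block for chain in chains for block in chain}
    return sorted(seen - set(best))


# Two blocks ("a1" and "b1") were produced at the same height after "genesis".
chain_a = ["genesis", "a1"]
chain_b = ["genesis", "b1", "b2"]   # a further block was built on top of "b1"

best = choose_chain([chain_a, chain_b])
orphaned = stale_blocks([chain_a, chain_b])
```

Here `chain_b` wins once `b2` is added, and `a1` becomes the orphaned block. A chain with instant finality is not supposed to need this kind of tie-breaking at all, which is why the situation described next was unexpected.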

However, NEO is meant to have instant finality. This means that users do not need to wait for confirmations or extra mined blocks to ensure that they are indeed watching the longest chain. The uncle block that was produced on the NEO blockchain was therefore unintended. The reason was an issue in the implementation of the chosen BFT algorithm (dBFT), where a crucial 3rd step was not implemented. This is explained in technical terms here: https://github.com/neo-project/neo/pull/320

The NEO team fixed the consensus node by abandoning the uncle block and continuing the chain from the second block produced (which included more transactions). This was done within a few hours.

However, this also forced other node maintainers to re-sync their nodes from a previous state, as most listener nodes had acquired the earlier, now abandoned, block.

This is because the NEO node software was not designed to handle a situation where there are two valid blocks (valid meaning signed by a validator / consensus node) for a single block height. Since each block is meant to be final once it is valid, a NEO node cannot easily abandon a valid block and follow an alternate chain.

Upon realising that a fork had occurred, we immediately flipped the trading halt switch manually, and quickly began re-syncing a number of our nodes from our last blockchain data backup, which had been taken a week or so earlier. However, we experienced difficulty acquiring the “correct” block that would lead to the now live chain, as most public nodes were stuck on the abandoned uncle block and kept broadcasting that block instead.

This issue, combined with the low connectivity with public nodes, significantly reduced the speed at which we could resume trading. Overall, we took about 6 hours to sync the blockchain the first time from our backup before realizing that our nodes had acquired the same uncle block, and another 6 hours to do a second sync, during which the team took a second backup just before the fork so that the sync could be repeated quickly if needed. An additional 1–2 hours were spent making sure that there were no unexpected consequences of the fork.

We had previously encountered similar issues where we could not re-sync our blockchain data quickly, and hope to make a recurrence less likely by taking more regular snapshots of the various chains we interact with.

NGD has also assisted the community by providing additional public nodes, and offline chain data for download.

Summary

We have learnt a great deal from each and every incident, and we understand our users’ frustrations whenever they happen. I cannot guarantee that yesterday’s downtime will be the last, but our team will do our utmost to keep such incidents few.

I write this article on behalf of the Switcheo team in the hope that our supporters gain more insight into the development of decentralized systems. As the entire blockchain ecosystem progresses, we are confident that such issues will become less frequent, and that decentralized systems will become as reliable and robust as their centralized counterparts.

Ivan Poon
Switcheo Team
