OT4 — High Load Status Report
Since yesterday, the Concordium Open Testnet v4 has been seeing sustained transaction loads of more than 2000 TPS. Most of those transactions are invalid.
Impact
There has been no hard outage at the time of writing.
Consensus and Finalization continue to work as expected, though there is more variance in block and finalization times due to the higher network load.
The chain has not stopped or failed, and continues to run, as can be observed on the Testnet Dashboard.
A higher rate of transaction failures may be observed by users depending on which individual network node users are submitting to.
Causes
The testnet has effectively free transaction pricing, due to the unrestricted GTU drop function. This makes the current high transaction scenario unrealistically easy.
We believe that one cause of the high transaction load is our OT4-TX challenges, which ask participants to continuously submit transactions of various types in various schedules.
However, the currently observed transaction rate is far beyond the transaction rates we modelled for the challenges. Furthermore, while we have seen a high transaction load since the launch of the transaction challenges a week ago, the current extreme transaction rates emerged suddenly last night and even seemed to increase throughout the day.
It’s currently unclear if users are accidentally or intentionally spamming the network. We’re tentatively investigating the possibility of a few nodes being bad actors and broadcasting or re-broadcasting a much higher than necessary rate of transactions.
Observed issues
We’ve observed the following happening:
Under-provisioned user nodes crashing
Based on log reports, a high number of participants are running their nodes on hosts that do not meet the minimum required CPU/RAM resources.
These nodes are experiencing hard system crashes directly related to RAM exhaustion, or exhausted disk space.
Under normal circumstances, nodes may get away with running on fewer resources than recommended, but the high load situation has shown that low-resource nodes have a hard time keeping up, or may crash outright once their resources are exhausted.
We will be making adjustments to our resource requirements advice to make sure users better understand the range of resource requirements.
Minimally provisioned user nodes degraded performance
Some users running the bare minimum requirements or close to them have also observed performance issues.
We will be making adjustments to our resource requirements advice to make sure users better understand the range of resource requirements.
Concordium Nodes degraded performance
No Concordium nodes have crashed. Our nodes are provisioned with 16 GB of RAM.
Nodes have an internal buffer where transactions are temporarily queued when they are received over the network, and before they can be processed by consensus. The nodes running on the testnet have a buffer capacity of 65536 elements. If this buffer is full then transactions will be dropped immediately when they are received, and a warning is logged that the “inbound consensus queue is full”. Once the node manages to process existing transactions it will again accept new transactions.
Given that the current transaction rate exceeds the rate at which the network can process transactions, excess transactions are being dropped until there is room in individual node queues.
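The drop-when-full behaviour described above can be illustrated with a minimal Python sketch. This is not the node's actual implementation: the class name `InboundQueue` and its methods are hypothetical and chosen only for illustration. The capacity of 65536 and the warning text are the only details taken from this report.

```python
import logging
from collections import deque

logger = logging.getLogger("inbound_queue_sketch")

# Capacity quoted in this report for testnet nodes.
QUEUE_CAPACITY = 65536


class InboundQueue:
    """Hypothetical model of a bounded queue that drops items when full."""

    def __init__(self, capacity=QUEUE_CAPACITY):
        self.capacity = capacity
        self.items = deque()
        self.dropped = 0

    def receive(self, tx):
        """Queue a transaction, or drop it immediately if the queue is full."""
        if len(self.items) >= self.capacity:
            self.dropped += 1
            logger.warning("inbound consensus queue is full")
            return False
        self.items.append(tx)
        return True

    def process(self, n):
        """Remove up to n queued transactions, freeing room for new ones."""
        count = min(n, len(self.items))
        for _ in range(count):
            self.items.popleft()
        return count
```

In this model, a queue that receives transactions faster than `process` drains them fills up, rejects everything beyond its capacity, and starts accepting again as soon as processing frees a slot.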
Increased log size
Due to the large number of transactions being transmitted on the network, a node might receive transactions faster than it can process them. The transactions that it cannot process are dropped. The node is working as expected under such a high load.
This correct behaviour is logged as warning messages.
Nodes are consequently rejecting a lot of transactions and writing warnings into the log files. If the load remains at such a consistently high rate, the log size will increase and might cause disk space issues.
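To get a rough feel for how quickly warning logs could consume disk space, node operators could use a back-of-the-envelope estimate like the sketch below. The function name and all of its parameters are hypothetical. It assumes every logged warning takes a fixed number of bytes and that nothing else writes to the disk, which will not hold exactly in practice.

```python
def hours_until_disk_full(free_bytes, warnings_per_second, bytes_per_warning):
    """Estimate hours until free space is used up by warning log lines alone.

    All inputs are assumptions supplied by the operator; none of them are
    measurements from the testnet.
    """
    growth_per_hour = warnings_per_second * bytes_per_warning * 3600
    if growth_per_hour <= 0:
        # No log growth, so the disk never fills under this model.
        return float("inf")
    return free_bytes / growth_per_hour
```

An operator would plug in their own free disk space, an observed warning rate from their node, and the average length of a log line. Log rotation or a lower log level would change the picture entirely.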
Mainnet impact
Currently we see no major mainnet impact.
The load on the testnet during the transaction challenges is much higher than what is expected on mainnet. We will be analysing the data from these tests to identify and rectify bottlenecks within the node.
Outcomes & next steps
The Testnet is exactly that: a test network!
We are thankful to the community participating in the OT4 challenges for making such a large-scale load test possible. This is something that is very difficult to simulate authentically in the staging load testing we perform ourselves!
While the current situation may seem concerning, it’s actually a great opportunity for our team to assess the performance of the network at maximum load, and the team will be using the feedback and metrics to improve the Concordium network and node software further.
Join Team Concordium in our different channels:
- Twitter: https://twitter.com/ConcordiumNet
- Discord: https://discord.gg/MZyHgfw
- Telegram: https://t.me/concordium_official
- Reddit: https://www.reddit.com/r/Concordium_Official
- Learn more: https://concordium.com
- Concordium Blockchain Research Center Aarhus (COBRA): https://cs.au.dk/research/centers/concordium/