Developer: Xin Zhang
Throughout the testnet phase of the aelf Enterprise Blockchain, countless days were spent troubleshooting and resolving problems, as expected when writing code from scratch. One such problem resulted in nodes crashing soon after becoming active. This article looks at how aelf developers identified and resolved the core of the issue.
During the AElf single-node testing phase, the developers found that the node suddenly went offline. The logs showed that all Workers had dropped during transaction execution, that execution of transactions had ceased, and that the node had crashed as a result.
This problem was unusual: the node and all its Workers run on the same server, so network communication should not have been a factor. Additional diagnostics revealed that the main node, all the Workers, and the Lighthouses dropped offline at almost the same time. Continuing the troubleshooting, we found a clue in Zabbix: the server's RAM usage had come close to capacity at one point, and the timestamp coincided with the time at which the node malfunctioned.
Identifying the Underlying Problem
We focused on testing the node's memory usage. Preliminary results indicated that the longer the node ran, the more of the server's memory the process occupied. Memory usage grew significantly after large numbers of transactions had been sent and, most importantly, the memory was not released even after transactions stopped.
Reproducing the problem:
The test environment was Ubuntu 16.04.5 LTS with dotnet core version 2.1.402.
The node has a base memory usage of approximately 90MB.
By continuously sending a large number of transactions to the node, we could watch the node's transaction pool accumulate and execute them (shown below).
After some time, the memory usage reached 1GB.
At this point, we stopped sending transactions. As seen below, all transactions already in the transaction pool had been executed.
We continued to observe the memory footprint, and after some time we found that the memory usage had not decreased and was holding at 1GB.
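The observation above can be repeated without a monitoring system by sampling the resident memory of the node process at fixed intervals. The following is a minimal sketch, assuming a Linux host where `/proc/<pid>/status` is available; the function names (`parse_vm_rss_kb`, `read_rss_mb`, `watch`) are hypothetical helpers and not part of aelf or Zabbix.

```python
import re
import time
from pathlib import Path
from typing import List


def parse_vm_rss_kb(status_text: str) -> int:
    """Extract the resident set size (VmRSS, in kB) from /proc/<pid>/status text."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    if match is None:
        raise ValueError("VmRSS line not found in status text")
    return int(match.group(1))


def read_rss_mb(pid: int) -> float:
    """Read the current resident memory of a process in MB (Linux only)."""
    status_text = Path(f"/proc/{pid}/status").read_text()
    return parse_vm_rss_kb(status_text) / 1024


def watch(pid: int, interval_s: float = 5.0, samples: int = 12) -> List[float]:
    """Sample resident memory `samples` times, printing each reading."""
    readings = []
    for _ in range(samples):
        rss = read_rss_mb(pid)
        readings.append(rss)
        print(f"{time.strftime('%H:%M:%S')}  RSS = {rss:.1f} MB")
        time.sleep(interval_s)
    return readings
```

Calling `watch(pid)` with the process ID of the node while transactions are being sent, and again after they stop, shows whether the footprint falls back toward its baseline or stays flat, which is the symptom described here.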
Analysis & Solution
We used lldb to analyze the node process.
First, install lldb on the server:
sudo apt-get install lldb
Find the location of the local libsosplugin.so:
find /usr -name libsosplugin.so
Start lldb, attach it to the process, and load the plugin:
sudo lldb -p 13067
plugin load /usr/share/dotnet/shared/Microsoft.NETCore.App/2.1.4/libsosplugin.so
Next, analyze the objects in memory.
We observed a large number of the following objects:
Since we were trying to identify objects that were unusually large, we focused on objects larger than 1024 bytes.
We can see that there are 4 objects of the same type that are relatively large:
System.Collections.Concurrent.ConcurrentDictionary`2+Node[[AElf.Common.Hash, AElf.Common],[AElf.Kernel.TransactionHolder, AElf.Kernel.TxHub]]
Looking further at the objects corresponding to that MethodTable, we identified that 4 of the 8 objects were unusually large. Picking one of them to view the object information, we found that 573,437 values were stored in it.
Based on the above analysis, we checked the corresponding source code and located the class: AElf.Kernel.TxHub. The main role of this class is to hold the transaction data for the transaction pool. It contains 8 ConcurrentDictionary<Hash, TransactionHolder> instances for storing that data. The TransactionHolder class stores Hash, Transaction, etc., which is consistent with the results of the above memory analysis. Looking again at the internal logic, we found that every transaction is stored in TxHub when it enters the transaction pool and is never released afterwards. From this we were able to locate the core of the problem.
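The retention pattern described above can be illustrated with a minimal sketch. This is written in Python for brevity and is not the actual AElf.Kernel.TxHub code (which is C#), nor necessarily the fix the developers applied; the class and method names (`LeakyTxHub`, `EvictingTxHub`, `mark_executed`, `run`) are hypothetical. It only shows why a map that never removes executed entries keeps growing, and how removing them on execution returns it to empty.

```python
from typing import Dict


class LeakyTxHub:
    """Mimics the reported behaviour: holders are added but never removed."""

    def __init__(self) -> None:
        # Hypothetical stand-in for a hash -> transaction holder dictionary.
        self._holders: Dict[str, bytes] = {}

    def add(self, tx_hash: str, tx: bytes) -> None:
        self._holders[tx_hash] = tx

    def mark_executed(self, tx_hash: str) -> None:
        # The transaction has run, but its holder stays in the dictionary,
        # so the map (and the process memory behind it) only ever grows.
        pass

    def __len__(self) -> int:
        return len(self._holders)


class EvictingTxHub(LeakyTxHub):
    """Same interface, but drops a holder once its transaction has executed."""

    def mark_executed(self, tx_hash: str) -> None:
        # pop with a default so a repeated or unknown hash is not an error.
        self._holders.pop(tx_hash, None)


def run(hub: LeakyTxHub, count: int = 1000) -> int:
    """Add `count` transactions, execute them all, and return what is left."""
    for i in range(count):
        hub.add(f"tx-{i}", b"payload")
    for i in range(count):
        hub.mark_executed(f"tx-{i}")
    return len(hub)
```

With the leaky variant, `run` leaves every entry in place after execution; with the evicting variant, nothing remains. A real transaction pool would also have to decide when eviction is safe (for example, only once a transaction can no longer be needed), which this sketch does not address.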
After resolving this issue, we repeated the above steps for verification. The effect was obvious: once the transactions in the transaction pool had executed, the memory usage dropped significantly. The final memory usage was as follows:
In addition, by looking at the contents of the objects in memory, we saw that the total number of objects has also dropped significantly.
Although the core problem was identified and resolved, we also noticed that memory usage still increases slightly over time, and that in some cases large objects remain resident in memory. We will spend more time analyzing these residual issues.