Utilizing internal and external tracing of the Mina network to identify areas of code in need of optimization

Juraj Selep
15 min read · Mar 2, 2023


One of the primary barriers standing in the way of blockchains becoming widely adopted is their performance. This usually refers to the number of transactions they are able to process per minute, also known as throughput.

However, we want to take a more comprehensive approach to improving performance. In addition to increasing throughput and reducing block production latencies, we also want to speed up SNARK work, improve the efficiency of the SNARK pool, and increase the rate at which the Mina network is developed.

First, we need to understand how the Mina network operates. We do this by measuring all relevant processes and detecting those that are particularly slow, which gives us a clear picture of which areas of code are in most need of optimization.

We achieve this by tracing Mina nodes. Tracing is a way to track the execution of a program and collect data about it, such as the values of variables at different points in time, the flow of control, and the order of function calls. The purpose of tracing is to help developers understand how a program is executing, to identify and debug issues, or to collect performance data.

Tracing is performed both externally (by observing the node behavior from the outside) and internally (by instrumenting the code).

For internal tracing, the code is modified by adding statements that record checkpoints in all the places deemed relevant. Such checkpoints include a tag describing a location, a timestamp of when the code execution reached that checkpoint, and some optional metadata with any useful information about the context at that time.
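To make this concrete, here is a minimal sketch, in Rust, of what recording such a checkpoint could look like. The names and types are illustrative, not the actual Open Mina code.

```rust
use std::time::{SystemTime, UNIX_EPOCH};

/// A single recorded checkpoint (illustrative shape, not the actual Open Mina types).
struct Checkpoint {
    /// Tag describing the location in the code, e.g. "check_completed_works".
    tag: &'static str,
    /// Unix timestamp (in nanoseconds) at which execution reached this checkpoint.
    timestamp_ns: u128,
    /// Optional free-form metadata describing the context at that time.
    metadata: Option<String>,
}

/// Append a checkpoint to the current trace with the current time.
fn record_checkpoint(trace: &mut Vec<Checkpoint>, tag: &'static str, metadata: Option<String>) {
    let timestamp_ns = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .expect("system clock before Unix epoch")
        .as_nanos();
    trace.push(Checkpoint { tag, timestamp_ns, metadata });
}
```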

For each block, all the checkpoints captured for it constitute a trace. These traces can then be analyzed to learn more about the traced processes. Our main goal is to measure speed, and with the recorded timestamps we can measure how long it takes to go from one checkpoint to another. The extra metadata also helps us understand the decisions made by the traced processes and the causes of failures (if any).
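Continuing the sketch above, the duration of each step can be derived by subtracting the timestamps of consecutive checkpoints in a block's trace:

```rust
/// Given one block's trace (checkpoints ordered by timestamp), return how long
/// each step took, i.e. the time elapsed between consecutive checkpoints.
fn step_durations_ns(trace: &[Checkpoint]) -> Vec<(&'static str, u128)> {
    trace
        .windows(2)
        .map(|pair| (pair[1].tag, pair[1].timestamp_ns - pair[0].timestamp_ns))
        .collect()
}
```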

We also perform external tracing via the Network Debugger. With external tracing we don’t trace the Mina application, but instead the Linux kernel. Among other things, the kernel is responsible for the input/output (IO) of the application. By analyzing the IO data, we can infer what the application is doing and how fast it is being done.
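The real Network Debugger intercepts the node's traffic at the kernel level, which is more involved than we can show here. As a rough illustration of the general idea of observing a process's IO from the outside rather than instrumenting it, the following sketch samples the per-process IO counters that the Linux kernel exposes in /proc/&lt;pid&gt;/io:

```rust
use std::fs;

/// Rough illustration only: read the kernel's per-process IO counters for `pid`,
/// returning (bytes read from storage, bytes written to storage) so far.
fn read_io_bytes(pid: u32) -> std::io::Result<(u64, u64)> {
    let contents = fs::read_to_string(format!("/proc/{pid}/io"))?;
    let mut read_bytes = 0;
    let mut write_bytes = 0;
    for line in contents.lines() {
        if let Some(v) = line.strip_prefix("read_bytes: ") {
            read_bytes = v.trim().parse().unwrap_or(0);
        } else if let Some(v) = line.strip_prefix("write_bytes: ") {
            write_bytes = v.trim().parse().unwrap_or(0);
        }
    }
    Ok((read_bytes, write_bytes))
}
```

Sampling these counters at regular intervals gives a bandwidth estimate for the process without touching its code, which is the same external-observation principle the debugger applies to network traffic.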

Tracing helps us optimize Mina nodes and the Mina network as a whole, and it also allows other developers to detect problems faster and more easily. Tracing is not limited to improving performance: it also points to problems in correctness and allows us to easily check whether everything is working as intended.

The information gained from tracing is then fed into our own Open Mina user interface.

Important: Please note that this is a shortened version we have created for our Medium readers. For a full guide with added explanations, see our GitHub.

Let’s take a closer look at the Metrics and Tracing interface on the Open Mina website.

Dashboard

We want to first take a look at all of the nodes on the Mina network to gain a high-level overview of their latencies and block application times so that we can easily detect if any nodes are particularly slow, and then examine the reasons behind their low performance.

Please note that this shows nodes on the Mina testnet.

Nodes

This first opens up a list of synced nodes currently available (in the testnet).

On this screen, we can see that there are 8 Nodes altogether. Among them are 5 Block Producers, 8 Snarkers (SNARK workers), 1 Seeder and 0 Transaction Generators. Note that some nodes fulfill more than one role; for instance, in addition to producing SNARKs, a Snarker performs the same role as a regular node.

Click on a node to open up a window with a list of checkpoints, as seen above. A checkpoint is a term we use to describe a place in the code (or the process which we are tracing) where we mark that we have reached that location.

Checkpoints are sorted chronologically from top to bottom, meaning that each action is performed after the action above it.

From left to right, each checkpoint describes:

1. The name of the process.

2. The time at which the process started. This time is local to the user (the person viewing the tracing UI).

3. How long it took to complete the process. This value is color-coded to reflect whether the process took longer than expected:

  • yellow is for durations over 100 ms
  • orange is for durations over 300 ms
  • red is for durations over 1 second.
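In other words, the UI applies a simple threshold mapping to each duration, which could be expressed roughly as follows (the function name is ours, for illustration only):

```rust
/// Map a checkpoint duration to the color used in the tracing UI,
/// following the thresholds listed above.
fn duration_color(duration_ms: u64) -> &'static str {
    if duration_ms > 1000 {
        "red"       // over 1 second
    } else if duration_ms > 300 {
        "orange"    // over 300 ms
    } else if duration_ms > 100 {
        "yellow"    // over 100 ms
    } else {
        "default"   // within the expected range
    }
}
```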

We are primarily interested in the checkpoints that are slower than usual, with the time under the Process column being in either orange or red numbers. To explain how this list of checkpoints helps us in code optimization, let’s take a look at how we improved block application time.

When applying a block that includes completed works, the biggest contributor to the time it takes to process the block is the check completed works step.

In the check completed works step, the node verifies the completed works included in the block (proofs for transactions from previous blocks that are waiting for their proofs to be included). When a block includes many proofs, this step can get expensive.

However, most of the time the completed work that needs to be verified has already been verified in the SNARK pool.

This optimization takes advantage of that by filtering out all the completed works included in a block that are already present in the SNARK pool. As a result, the amount of work that needs to be verified is reduced (quite often to zero).
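The Mina node itself is written in OCaml, but the idea behind the optimization can be sketched in a few lines of Rust. The types and names below are illustrative, not the actual node code:

```rust
use std::collections::HashSet;

type WorkId = u64;

struct CompletedWork {
    id: WorkId,
    // ... proof data omitted
}

/// Keep only the completed works from a block whose proofs were not already
/// verified when they entered the local SNARK pool; only these need verification.
fn works_needing_verification<'a>(
    block_works: &'a [CompletedWork],
    verified_pool_ids: &HashSet<WorkId>,
) -> Vec<&'a CompletedWork> {
    block_works
        .iter()
        .filter(|work| !verified_pool_ids.contains(&work.id))
        .collect()
}
```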

We then tested the optimization by running two servers extended with internal tracing: one unpatched, and another patched with this optimization.

For block 7881 on berkeleynet, which includes 76 completed works, here are the times for the unpatched and patched nodes:

  • Unoptimized node: 8.86 s (Block Total: 12.03 s)
  • Optimized node: 0.015 s (Block Total: 1.5 s)

By tracing these checkpoints, not only are we informed of particularly slow processes, but we can also confirm whether our implemented changes have resulted in better performance.

Now let’s move on to the Explorer section, which is located immediately below the Dashboard.

Explorer

We need to have a view of the blockchain’s past and present, not only for debugging purposes, but also to help us understand what happened in its history, which also gives us a clearer picture of what to optimize.

The Explorer page enables you to access blockchain data recorded in the form of blocks, transactions, the SNARK pool, the scan state and SNARK traces. It provides a view into the current and past state of the blockchain, and its focus is the blockchain itself.

Blocks

We want to be able to view the history of the blockchain, which is achieved through the Mina block explorer. The first tab displays a list of blocks sorted by the time of their publication.

Next, click on the Transactions tab.

Transactions

The blockchain’s present state is constantly changing as new transactions are added. We want a live view of this process, so that we can see pending transactions as they are validated and added to the state.

If you have created transactions via the Benchmarks page of the Dashboard, they will show up here.

Next, click on the Snark Pool tab.

Snark pool

On this tab, we can inspect the contents of the internal state of the node, specifically the node’s SNARK pool.

The SNARK pool shown here is the current node’s pool; it contains work completed by SNARK workers. SNARK workers compress transactions via SNARKs, receiving compensation in MINA for their effort.

  • Snark jobs — units of work performed by SNARK workers
  • Prover — the identity of the node acting as a snark prover
  • Fee — the compensation received by the snark worker.
  • Work Ids — unique numbers identifying the snark job. Most snark jobs are bundled in pairs, which is why they have two work ids.
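Put together, a single SNARK pool entry as displayed in this table could be modeled roughly like this (field names and types are ours, not the node’s actual representation):

```rust
/// Illustrative shape of one SNARK pool entry as shown in the table above.
struct SnarkPoolEntry {
    /// Public key of the prover node that completed the work.
    prover: String,
    /// Fee requested by the SNARK worker, in nanomina (1 MINA = 10^9 nanomina).
    fee_nanomina: u64,
    /// One or two work ids; most snark jobs are bundled in pairs.
    work_ids: Vec<u64>,
}
```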

Next, we look at the Scan State tab.

Scan State

Transactions in Mina require SNARK proofs in order to be validated. We want to take a closer look at this process because SNARKs are a key part of Mina’s block production process, are resource-intensive and may present a performance bottleneck.

We’ve created a visual representation of the scan state, a data structure that queues transactions requiring transaction SNARK proofs.

At each block height, there are operations requiring proofs such as transactions, snarks, user commands, fee transfers and so on. Scroll through the various block heights with the use of the buttons at the top of the screen, then scroll down to view the individual trees of operations at that height.

Each operation is represented by one of the following possible values:

  • Todo — A request to generate this operation’s SNARK proof has been made, but it has yet to be added.
  • Done — The operation already has a SNARK proof; the completed work was waiting in the SNARK pool and was selected by a block producer.
  • Empty — A request to generate this operation’s SNARK proof hasn’t been made.
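These three states could be captured by a simple enum, sketched below with illustrative names:

```rust
/// Possible states of an operation (job) in the scan state, as shown in the UI.
enum ScanStateJobStatus {
    /// A proof has been requested for this operation, but none has been added yet.
    Todo,
    /// The operation already has a SNARK proof, taken from the SNARK pool by a block producer.
    Done,
    /// No proof has been requested for this operation yet.
    Empty,
}
```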

Snark Traces

SNARKs have a certain life cycle, and we want to have a close look at their various stages to detect possible problems at any stage of this cycle. This helps us understand how we can optimize this process. For this purpose, we began tracing SNARKs.

Here is a list of Snark traces that were produced within a certain time range.

Resources

The node utilizes a variety of resources through processes such as reading, writing and communicating with peers. We want a graphical overview of how much of these resources is being used over time, which helps us quickly detect possible problems or inefficiencies.

System

The Resources tab shows a graph representing the use of resources by the various nodes that are currently connected. Note that graphs for individual nodes will look very different due to varying lengths of time and varying uses of resources.

The number and labels of the subprocesses can change as the node launches various subprocesses for tasks like proof production, proof verification, evaluation of verifiable random functions (VRFs), and so on.

CPU

From top to bottom, first, we see a graph of the node’s CPU usage. It describes the percentage of the CPU that was utilized by the processes over a period of time.

Memory

Describes how much physical memory (RAM) each process is using at given times.

Storage IO

Displays the disk read and write bandwidth over a period of time. In other words, how much data was read from and written to the disk (HDD/SSD) per second over a period of time.

Network IO

Displays the network send and receive bandwidth over a period of time. In other words, how much data was sent and received per second over a period of time.

Network

The P2P network is the key component of the Mina blockchain. It is used for communication between nodes, which, among other things, also includes block propagation and the updating of the blockchain state. We want to have a close look at the messages sent by nodes to see if there are inefficiencies in communication so that we know where to optimize.

This is an overview of the messages sent by the node, other peer nodes connecting to it, as well as the blocks being propagated across the network.

The Network page has the following tabs:

Messages

We want to have a view of all messages sent across the Mina P2P network to see if there are any outliers, either in particular messages or when we filter through various layers. This shows us which types of messages are in need of optimization.

The Messages tab shows a list of all messages sent across the P2P network.

Below the filters is a list of network Messages. The most recent messages continuously appear at the bottom of the list.

Connections

Connections made across the P2P network have to be encrypted and decrypted. We want to see whether these processes have been completed, and if not, to see which connections failed to do so.

For this purpose, we’ve created a list of connections to other peers in the Mina P2P network.

  • Datetime — when the connection was made. Click on the datetime to open up a window with additional Connection details.
  • Remote Address — the address of the peer. Clicking on the address will take you back to the Messages tab and filter out all messages from that peer.
  • PID — the process id given to applications by the operating system. It will most likely remain the same for all messages while the node is running, but it will change if the node crashes and is rebooted.
  • FD — the connection’s TCP socket ID (a file descriptor, which is used for items other than files). The fd is only valid inside its process; another process may have the same fd number, but it refers to a different socket. Similarly to the pid, it is subject to change when the connection is closed or fails.
  • Incoming — informs us of who initiated the communication. If it was the node selected in the top right corner, the connection is marked as outgoing; if it was a different node, it is marked as incoming.
  • Decrypted In — the percentage of messages coming into the node that the debugger was able to decrypt
  • Decrypted Out — the percentage of messages coming from the node that the debugger was able to decrypt

Blocks

We want to view inter-node communication so that we can detect any inefficiencies that we can then optimize. We created a page that provides an overview of blocks propagated across the Mina P2P network. Note that everything is from the perspective of the node selected in the top right corner of the screen.

Block candidates are known as such because, for each global slot, multiple nodes may be vying for the opportunity to produce a block. These nodes do not know about each other’s production rights until a new block appears. Multiple nodes may assume that they can produce a valid block and end up doing so, creating these block candidates, but ultimately there is a clear rule to select only one block as the canonical one.

Blocks IPC

A Mina node communicates over the network with other peers, and it also exchanges inter-process commands with the Mina daemon on the same device. We want to track a block as the local node creates it, or as the local node first sees it, so that we can detect any problems during this communication.

For that purpose, we’ve created the Block IPC tab, which displays inter-process communication (IPC).

Tracing

We want to know which processes in Mina are particularly slow so that we can then focus our optimization efforts on those areas. For that purpose, we’ve created the Tracing page, an overview of calls made from various checkpoints within the Mina code that shows us which processes have high latencies.

Overview

The first screen is the Overview tab in which you can see a visualization of the metrics for various checkpoints, represented by graphs.

By default, the graphs are sorted Slowest first since we are most interested in particularly slow processes.

Blocks

A list of blocks will appear, sorted by their block height. This is a number that designates their level on the blockchain: the genesis block has a block height of 0, and each block built on top of it increments this height by 1. The higher the number, the more recently the block was published, so the blocks are also listed in reverse chronological order (newest at the top).

Note that multiple blocks may have the same block height:

In Mina, multiple block producers can get the right to publish a block at the same time, so there can be multiple valid blocks at the same height. This, of course, causes a problem: if there are multiple valid blocks at the same level, the node must figure out which one is the canonical block, i.e. the block that will have the next block published on top of it and thus become a continuation of the Mina blockchain.

Part of the Mina consensus algorithm is a series of mechanisms that determine which block will be the canonical block.

First, the node attempts to select the longest chain, which works if the forks are short-range. If blocks are based on the same parent, VRF hashes are compared, with the larger one being preferred. If VRF hashes are equal, then we compare state hashes as a tie-breaker.
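A simplified sketch of this selection order is shown below. The real consensus rule has more structure (for example, in how long-range forks are handled), so treat this only as an illustration of the comparison chain described above:

```rust
use std::cmp::Ordering;

/// Minimal summary of a candidate tip, with only the fields the comparison needs.
struct BlockSummary {
    chain_length: u64,
    vrf_hash: [u8; 32],
    state_hash: [u8; 32],
}

/// Short-range preference between two candidate tips: longer chain first,
/// then larger VRF hash, then state hash as the final tie-breaker.
/// `Ordering::Greater` means `a` is preferred over `b`.
fn prefer(a: &BlockSummary, b: &BlockSummary) -> Ordering {
    a.chain_length
        .cmp(&b.chain_length)
        .then(a.vrf_hash.cmp(&b.vrf_hash))
        .then(a.state_hash.cmp(&b.state_hash))
}
```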

See the diagram above for a visual explanation. While there are two blocks at level 2, we already know that the one on the right is the canonical block (in blue), because an additional block or blocks are published on top of it. However, we haven’t determined which of the level 3 blocks are canonical, because no blocks have been published at level 4.

Source and Status

In the same window, on the right side of the screen, the source of each block and its status are displayed.

A block may have various sources:

  • External, which means it was produced by other nodes within the network and received via the gossip network.
  • Internal is for internally produced blocks, i.e. published by the user’s own node.
  • Reconstruct means your node has already received the data for these blocks, but because the node was reset, it needs to reconstruct the blocks using the data in its cache.
  • Catchup designates blocks that your node has requested from other nodes in the network via RPC in order to ‘catch up’ with the rest of the network.
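As a quick reference, these four sources could be represented as an enum (illustrative only, not the UI’s internal code):

```rust
/// Possible sources of a block, as displayed in the Source column.
enum BlockSource {
    /// Produced by another node and received via the gossip network.
    External,
    /// Produced and published by this node itself.
    Internal,
    /// Rebuilt after a restart from block data already present in the node's cache.
    Reconstruct,
    /// Requested from other peers via RPC to catch up with the rest of the network.
    Catchup,
}
```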

Benchmarks

We want to be able to benchmark the Mina network in order to measure its capabilities. For that purpose, we’ve developed a Benchmark frontend through which users can mass-send transactions.

Wallets

The benchmarks page shows a list of testnet wallets from which we send transactions to the node. You can send any number of transactions, though the maximum number of transactions that can be sent at a time is based on how many wallets are available. For instance, if you want to send 300 transactions, and 200 wallets are available, you must press the Send button twice.
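The number of required Send presses is just a ceiling division of the requested transaction count by the number of available wallets, as in this small sketch:

```rust
/// How many times the Send button must be pressed to submit `total_txs`
/// transactions when at most `available_wallets` transactions fit in one batch.
fn sends_needed(total_txs: u64, available_wallets: u64) -> u64 {
    assert!(available_wallets > 0, "at least one wallet is required");
    (total_txs + available_wallets - 1) / available_wallets // ceiling division
}

// Example from the text: sends_needed(300, 200) == 2.
```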

Tracing has given us a wealth of information about Mina nodes and how they interact with each other. The interface helps us quickly and easily detect if there are inefficiencies or problems in any part of the Mina network.

In the future, we plan on utilizing the Tracing and Metrics Interface in our optimization efforts, and together with our use of continuous integration (CI) tests, we will be able to accurately and effectively improve the performance of the Mina blockchain.

We thank you for taking the time to read this article. For a more detailed look at the Tracing and Metrics interface, check out our guide on GitHub. If you have any comments, suggestions or questions, feel free to contact me directly by email. To read more about OpenMina and the Mina Web Node, subscribe to our Medium or visit our GitHub.
