
There is a lot of money on the table, and it’s up for grabs.

WeRLman: To Tackle Whale (Transactions), Go Deep (RL)

By Roi Bar-Zur, Ameer Abu-Hanna, Ittay Eyal and Aviv Tamar. To be presented at IEEE Symposium on Security & Privacy 2023.

--

TL;DR

Selfish mining is a known security vulnerability in blockchain protocols. In our latest paper, we show that reward variability leads to increased vulnerability to selfish mining in Bitcoin and similar cryptocurrencies. To do so, we use deep reinforcement learning to analyze the effect of whale transactions (occasional large rewards) on selfish mining. Since reward variability in Bitcoin will increase in the future, its security will degrade.

In this post, we first explain why analyzing selfish mining is important (Motivation). After some background on blockchain (Blockchain Primer), we describe what selfish mining is and explain why it’s a big security concern (Selfish Mining). We then mention previous approaches (Analyzing Selfish Mining), and provide background about Markov decision processes (MDP Primer).

Afterwards, we explain the main contribution of our paper, analyzing non-constant rewards in the form of whale transactions (Beyond Constant Rewards), and our deep-RL-based approach (WeRLman). Finally, we discuss the implications of our analysis (Implications).

Motivation

Why should you care about selfish mining?

coinmarketcap.com, July 2022.

Hundreds of billions of dollars are secured in blockchain protocols.

Blockchain security critically relies on incentives. Blockchains are maintained by participants, who receive rewards for doing so. Without proper incentives, rational participants will maximize their rewards by deviating from the protocol.

Such profitable deviations are called selfish mining. Selfish mining is undesirable: when it occurs, the protocol cannot guarantee the level of security it usually does.

Thus, the following question arises: do rewards in blockchain protocols actually succeed in incentivizing participants properly? The short answer is: not always. For example, a strategic deviation in the Ethereum network was recently spotted. It is therefore important to understand when incentives are misaligned.

The answer to this question can be quantified. The vulnerability of a protocol is characterized by the minimal amount of computational resources (relative to the entire network) necessary to perform selfish mining. This quantity is called the security threshold. Below the threshold, participants are properly incentivized. However, participants larger than the threshold can deviate and profit. A good protocol should have a large security threshold.

To understand how selfish mining can happen and its security ramifications, we will need to dive a little deeper into how blockchains like Bitcoin work. If you are already familiar with blockchain, feel free to jump ahead to an explanation of selfish mining, or skip it as well to read about how to analyze selfish mining.

Blockchain Primer

Blockchain protocols maintain a ledger organized as a chain of blocks. To update the ledger, participants mine blocks. Miners group transactions into a block and try to solve a cryptographic puzzle related to the newly formed block.

Blockchain structure.
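To make the puzzle concrete, here is a minimal proof-of-work sketch in Python. It assumes a toy difficulty rule (a hash with a given number of leading zero hex digits); real Bitcoin hashes a structured block header against a numeric target, but the search is the same brute-force loop.

```python
import hashlib

def mine(block_data: bytes, difficulty: int) -> int:
    """Find a nonce so that the block's SHA-256 hash starts with
    `difficulty` zero hex digits (a toy stand-in for Bitcoin's target)."""
    nonce = 0
    while True:
        digest = hashlib.sha256(block_data + nonce.to_bytes(8, "big")).hexdigest()
        if digest.startswith("0" * difficulty):
            return nonce  # puzzle solved: the block is now valid
        nonce += 1

# A low difficulty so the toy search finishes in well under a second.
print("nonce:", mine(b"prev_hash|transactions", difficulty=4))
```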

Miners are rewarded for each block they create with newly minted tokens, called the subsidy. In addition, every included transaction nets the miner a fee, specified by the issuer of the transaction.

Since blockchains are decentralized, there is a chance that two miners will mine a block at roughly the same time. When this happens, the network hears of two blocks that point to the same parent block. This is called a fork.

Blockchain fork.

We focus on blockchains that follow the longest chain rule. In case of a fork, each miner should mine on top of the longest chain known to her. In the event of a tie in chain length, miners should follow the first block they have heard of.

When a fork occurs, the network splits into two (or more) groups, each mining on top of one of the newest blocks. When a new block is found, one of the chains becomes strictly longer. Miners are then all expected to follow the longest chain, resolving the fork.
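The rule itself fits in a few lines. A minimal sketch, assuming each chain is a list of blocks and that chains are kept in the order their tips were first heard of:

```python
def choose_tip(known_chains: list[list]) -> list:
    """Longest-chain rule: extend the longest known chain. On a tie in
    length, max() keeps the earliest entry, i.e., the block heard of first."""
    return max(known_chains, key=len)
```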

Selfish Mining

Here’s a simple example of selfish mining.

Selfish mining example. Orange blocks are secret.

If a selfish miner manages to create block C, the miner can withhold it from others. Then, most of the network will keep mining and try to create block B. In the meantime, the miner tries to create block D. If the miner succeeds, the miner can publish blocks C and D together, making them the longest chain. This leads to block B being discarded.

In this example, the miner nets 2 out of 3 blocks. Had the miner not deviated, she would have gotten 2 out of the total 4 blocks instead.

However, if the miner doesn’t have enough computation power, there’s a good chance block B will be created and extended before the creation of block D, rendering the private chain useless. Thus, this strategy might not be profitable in expectation.
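To see why, here is a back-of-the-envelope Monte Carlo sketch of this one simple strategy (our own illustration, not code from the paper). A miner with a fraction alpha of the network's power always withholds a freshly mined block and races; we ignore tie-breaking subtleties and count shares of longest-chain blocks.

```python
import random

def selfish_share(alpha: float, cycles: int = 200_000) -> float:
    """Estimate the miner's share of longest-chain blocks under the
    simple one-block-withholding strategy from the example above."""
    mine, total = 0, 0
    for _ in range(cycles):
        if random.random() >= alpha:        # an honest block; nothing to withhold
            total += 1
            continue
        # The miner found block C and keeps it secret.
        if random.random() < alpha:         # she finds D before the network finds B
            mine, total = mine + 2, total + 2    # C and D make the chain
        elif random.random() < alpha:       # B came first, but D arrives before B's child;
            mine, total = mine + 2, total + 2    # publishing C, D discards B
        else:                               # B gets extended first; secret C is wasted
            total += 2
    return mine / total

for a in (0.2, 0.35, 0.45):
    print(a, round(selfish_share(a), 3))    # stays below a: this naive strategy loses
```

Honest mining yields a share of exactly alpha, and in this toy model the race only breaks even for very large miners.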

But in general, strategies can be more elaborate. A miner can withhold more blocks and decide to reveal them at strategic times, depending on the length of the competing chain. We call any such deviation selfish mining if it is more profitable than mining honestly.

A blockchain protocol cannot guarantee the same level of security when selfish mining is happening. The selfish miner can revert the transactions in the discarded blocks when overriding the public chain.

Selfish mining also disrupts the cooperation in the network. It causes many blocks to end up out of the longest chain, leaving it shorter than what it would have been, had everyone cooperated. Thus, an attacker will be able to more easily create an alternative chain to override the longest chain and revert transactions.

In addition, a mining pool which mines selfishly can get more than its fair share. This will incentivize more miners to join the pool and lead to centralization, the exact opposite of what a blockchain tries to achieve.

If all miners are smaller than the security threshold, then honest mining is a Nash equilibrium. No party has incentive to deviate. A large security threshold prevents the first domino from falling. Once it falls and someone deviates, this changes the incentives of others who might become incentivized to deviate as well.

Once the equilibrium is disturbed, there is no knowing how everybody will react. But we can say for certain that it will hurt cooperation and leave the protocol more vulnerable.

Analyzing Selfish Mining

Previous research has shown that selfish mining is possible in Bitcoin, Ethereum, and several other protocols, and has calculated their security thresholds.

Eyal and Sirer were the first to suggest a selfish mining strategy in Bitcoin and show it can be profitable in some cases. Sapirshtein et al. were the first to find the optimal selfish mining strategy in Bitcoin.

The security threshold can then be calculated by comparing the optimal strategy to the prescribed strategy for different miner sizes.
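Concretely, the threshold is the smallest miner size at which deviating beats honest mining. A sketch, assuming a hypothetical optimal_revenue(alpha) that returns the best achievable relative revenue for a miner of size alpha (honest mining yields exactly alpha) and that profitability is monotone in alpha:

```python
def security_threshold(optimal_revenue, eps: float = 1e-4) -> float:
    """Binary-search the smallest alpha whose optimal strategy beats honesty."""
    lo, hi = 0.0, 0.5
    while hi - lo > eps:
        mid = (lo + hi) / 2
        if optimal_revenue(mid) > mid:   # deviating profits: threshold is at or below mid
            hi = mid
        else:                            # honest mining is optimal here
            lo = mid
    return hi
```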

Sapirshtein et al. model all possible actions of a miner in a Markov Decision Process (MDP), a stochastic sequential decision-making process. Then they find the optimal strategy by solving the MDP.

The next section gives a quick background on Markov decision processes and how to solve them. If you already know this material, feel free to skip the primer.

MDP Primer

An MDP is an iterative process of an agent operating in an environment. At each step, the environment is captured by some state. The agent performs an action, causing the environment to transition stochastically to another state. The agent then receives a reward, observes the next state, and so on.

The agent’s goal is to maximize cumulative reward.

Markov Decision Process (MDP).

To solve the MDP, one can use dynamic programming to calculate the optimal Q-values. Given a state and an action, the optimal Q-value is the maximum sum of rewards obtainable after the environment is in the given state and the agent performs the given action.

To calculate the optimal Q-values, bootstrapping is used. You start with an initial guess of the Q-values and iteratively calculate a better estimation based on the previous values.

A simple bootstrapping technique is the one-step lookahead (also called the Bellman equation). The new Q-value of each state-action pair is calculated by taking the expectation of the immediate reward plus the Q-value of the next state.

After calculating the Q-values, the optimal strategy can be easily obtained by choosing the action which maximizes the Q-value for each state.
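Putting the primer together, here is a minimal value-iteration sketch for a generic finite MDP (illustrative only; the mining MDPs in the papers optimize relative revenue rather than a discounted sum, but the bootstrapping loop is the same):

```python
import numpy as np

def value_iteration(P, R, gamma: float = 0.99, iters: int = 1000):
    """P[a] is the state-transition matrix and R[a] the expected immediate
    reward vector for action a; returns optimal Q-values and a greedy policy."""
    n_states, n_actions = P[0].shape[0], len(P)
    Q = np.zeros((n_states, n_actions))            # initial guess of the Q-values
    for _ in range(iters):
        V = Q.max(axis=1)                          # value of acting optimally afterwards
        for a in range(n_actions):
            Q[:, a] = R[a] + gamma * P[a] @ V      # one-step lookahead (Bellman) update
    return Q, Q.argmax(axis=1)                     # choose the maximizing action per state

# Toy usage: two states, two actions.
P = [np.array([[0.9, 0.1], [0.2, 0.8]]), np.array([[0.5, 0.5], [0.5, 0.5]])]
R = [np.array([1.0, 0.0]), np.array([0.0, 2.0])]
Q, policy = value_iteration(P, R)
print(policy)  # the reward-maximizing action in each state
```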

Beyond Constant Rewards

The rewards of blockchain participants come from two sources: constant subsidy and varying transaction fees. We use WeRLman to analyze the impact of varying rewards on selfish mining. In contrast, the aforementioned work considered constant block rewards only.

To model transaction fees, we consider the presence of whale transactions, rare transactions with a high fee. We assume each block can contain at most one transaction and that all whale fees are worth F times the subsidy.
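The resulting per-block reward is simple; a sketch, with the subsidy normalized to 1:

```python
def block_reward(has_whale: bool, F: float) -> float:
    """Reward for one block in the model: a subsidy of 1, plus a whale
    fee worth F subsidies if the block includes a whale transaction."""
    return 1.0 + F if has_whale else 1.0
```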

When considering only constant rewards in an MDP, it is sufficient to count blocks in the miner’s secret chain and the public chain. However, when transaction fees are present, the model must track which blocks contain transactions.

Left: state in the constant rewards model. Right: state in the varying rewards model.

This leads to the state space becoming exponentially large with respect to the length of the chains. The constant rewards model contains about 1,000 states. In contrast, the varying rewards model contains more than 100 million states.
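To illustrate the blowup, compare sketches of the two state representations (the field names are our own, not the paper's exact encoding):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ConstantRewardState:
    secret_len: int    # length of the miner's secret chain
    public_len: int    # length of the competing public chain
    fork_status: int   # tie-breaking information after a fork

@dataclass(frozen=True)
class VaryingRewardState:
    secret_whales: tuple  # per secret block: does it carry a whale fee?
    public_whales: tuple  # per public block: does it carry a whale fee?
    whale_pending: bool   # is a whale transaction waiting to be included?

# With chains up to length k, the two boolean tuples alone contribute about
# 2**(2k) combinations, so the state space grows exponentially in k.
```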

To overcome the large state space, we use deep RL. In addition, WeRLman uses several other techniques for higher accuracy.

In the next section, we explain deep reinforcement learning and the techniques used in WeRLman. If you are not interested in the technical details, you can safely skip the next section and go directly to the security implications of our analysis.

WeRLman

We use WeRLman to find optimal selfish mining strategies in the presence of whale transactions. Like Sapirshtein et al., we model mining as a Markov decision process.

However, the MDP cannot be solved with dynamic programming. The state space is simply too large to calculate all Q-values.

This is where deep reinforcement learning comes in. The core idea is to use a neural network to learn an approximation of the Q-values instead of calculating them all.

This is useful for two reasons. First, a neural network is a more compact representation than a tabular representation. Thus, less memory is required.

And second, the neural network can learn to generalize Q-values across states. To learn a tabular representation, the dynamics of the MDP must be known for all states. In contrast, in deep RL, the dynamics of only a subset of the state space are sampled. Hence, deep RL requires less running time than dynamic programming.

The most prominent deep RL algorithm is the Deep Q Network (DQN) algorithm. The algorithm samples the MDP dynamics in some states and then uses stochastic gradient descent to improve the neural network with bootstrapping.
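Schematically, a single DQN-style update looks as follows in PyTorch (a sketch, assuming net and target_net map a batch of states to per-action Q-values):

```python
import torch

def dqn_update(net, target_net, optimizer, batch, gamma: float = 0.99):
    """Regress Q(s, a) toward the bootstrapped target r + gamma * max Q(s', .)."""
    s, a, r, s_next = batch                        # sampled transitions; a is int64
    with torch.no_grad():                          # targets are held fixed
        target = r + gamma * target_net(s_next).max(dim=1).values
    q = net(s).gather(1, a.unsqueeze(1)).squeeze(1)    # Q of the actions taken
    loss = torch.nn.functional.mse_loss(q, target)
    optimizer.zero_grad()
    loss.backward()                                # SGD step on the bootstrap error
    optimizer.step()
    return loss.item()
```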

In addition, WeRLman uses several techniques for higher accuracy. First, to get a better estimate of the MDP dynamics, WeRLman uses Monte Carlo Tree Search (MCTS) instead of a simple one-step lookahead. In MCTS, multiple deeper trajectories are sampled and used to bootstrap the Q-values.

Left: one-step lookahead. Right: Monte Carlo Tree Search (MCTS).

Second, unlike most deep RL algorithms, WeRLman uses the knowledge of the transition probabilities for bootstrapping the Q-values. This leads to more accurate estimates and reduces the effect of sampling noise.
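In other words, instead of bootstrapping from a single sampled successor, WeRLman can average over all successors because their probabilities are known from the protocol model. A sketch, with hypothetical tensor names enumerating the successors of one state-action pair:

```python
import torch

def expected_backup(probs, rewards, next_states, net, gamma: float = 0.99):
    """Bootstrap one Q(s, a) from the full, known transition distribution:
    sum_i p_i * (r_i + gamma * max_a' Q(s'_i, a')), with no sampling noise."""
    with torch.no_grad():
        next_values = net(next_states).max(dim=1).values  # value of each successor
    return (probs * (rewards + gamma * next_values)).sum()
```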

Furthermore, in most applications there are usually stark differences between Q-values, for example, the difference between winning and losing a game.

However, in selfish mining MDPs, the differences between Q-values are subtle. Thus, in WeRLman, we train the neural network to learn only the difference of each Q-value from a baseline, instead of needlessly training it on the absolute Q-values, which are all of roughly the same magnitude.

This method allows WeRLman to be more sensitive to small differences and results in better policies.
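One way to realize this (a sketch; the baseline could be, for example, the value of honest mining in each state, and the names here are hypothetical):

```python
import torch

def baseline_loss(net, states, actions, q_targets, baseline):
    """Regress the network on Q-value differences from a per-state baseline
    rather than on the (nearly identical) absolute Q-values."""
    deltas = q_targets - baseline                        # small but informative residuals
    pred = net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    return torch.nn.functional.mse_loss(pred, deltas)    # absolute Q = pred + baseline
```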

Revenue WeRLman obtains by learning differences from a baseline compared to using plain MCTS.

To compare to previous approaches, we evaluate WeRLman in the constant rewards model. Using the techniques described, WeRLman consistently finds near-optimal policies and surpasses SquirRL, the state of the art in selfish mining analysis using deep RL.

Learning curves of WeRLman vs SquirRL.

WeRLman is available on GitHub.

Implications

To understand the impact of whale transactions on selfish mining, we use WeRLman to analyze our new model. We instantiate the model for different miner sizes and see how whale transactions impact miner revenue.

Revenue as a function of miner size when whale transactions are present.

We see that WeRLman finds policies that take advantage of whale transactions to obtain more revenue than is possible in the constant rewards model.

We consider different values of the model parameter F and calculate the security threshold.

The security threshold as a function of the magnitude of transaction fees in the model.

We find that increasing the reward from whale fees, F, decreases the security threshold.

When translating the new model to the real world, blocks without a whale transaction correspond to blocks with low transaction fees. Blocks with a whale transaction correspond to blocks with high transaction fees.

The parameter F is a knob that controls the expected ratio between high block rewards and low block rewards. Thus, F captures reward variability rather than transaction fee magnitude.

Consequently, considering the role F plays and its impact on the security threshold, we conclude that reward variability hurts blockchain security.

In Bitcoin, the constant block reward is scheduled to halve every 4 years by design. The existing variability of transaction fees will thus become significantly more pronounced in the future. We use historical fee data in Bitcoin and simulate future halving events to estimate suitable parameters for our model. We then use these to predict the security threshold in the future.

Projected security threshold in Bitcoin.

As opposed to the previously known threshold of 0.25, we find that the threshold will significantly decrease in the future. Bitcoin will be more vulnerable to selfish mining, and consequently, its security will degrade.

Conclusion

Preventing selfish mining is critical for blockchain security. It is crucial to understand the incentives of blockchain participants, considering all available sources of revenue.

Using our deep RL framework, WeRLman, we show that whale transactions, which give rise to reward variability, hurt blockchain security. This implies that Bitcoin will soon become more vulnerable to selfish mining.

The money will be there for the taking.

--


Written by Roi Bar-Zur

PhD student researching blockchain security at Technion. Website: https://roibarzur.github.io/
