Optimizing Bitcoin’s Transaction Generation Code
Lately, I have needed to generate many transactions from a few machines to simulate very-large-block Bitcoin networks. I had created a few simple scripts that use bitcoind’s “sendtoaddress” API, so I fired these up in an infinite loop to see what would happen.
The results were less than satisfactory. To understand why, you must understand the problem of “unconfirmed inputs”. Basically, if you spend a fraction of a prior payment, you produce two new outputs: one to your destination and one back to yourself (the change). These outputs are considered “unconfirmed” until a block is mined. You can spend these unconfirmed outputs again, but the Bitcoin client only lets you build a relatively shallow chain of unconfirmed outputs before it returns a “too many unconfirmed outputs” error.
Since I would be generating 50000+ transactions from a single wallet per block, I quickly realised that I needed a similar magnitude of confirmed outputs to reliably produce that many payments.
And that’s when bitcoind lost its mind.
Transaction generation slowed to about one every ten seconds, using 100% CPU. This is a problem. Even though I created the situation artificially, is that really an unlikely quantity for a large enterprise? I’m guessing that large businesses process millions of payments per year… and re-using addresses is no solution; the problem is the number of transaction outputs in a wallet, not the number of addresses.
So I looked at optimization. Ultimately I was able to increase the efficiency of transaction generation by about 1000 times and, as an important bonus, help minimize UTXO growth. Read on for details!
The first thing I noticed was significant time spent in locks, in particular in HaveKey() and GetTimeOffset(). This behavior typically occurs when locks are taken and released inside an “inner” loop rather than once at the beginning and end of the loop. And sure enough, I found a tight loop in CWallet::AvailableCoins(…). So I added a lockless version of IsMine, moved LOCK(cs_KeyStore) outside the loop, and saw maybe a 5–10% improvement.
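The general pattern is easy to illustrate. Here is a minimal sketch (hypothetical names, std::mutex standing in for bitcoind’s LOCK macro and cs_KeyStore, a parity test standing in for IsMine) showing per-iteration locking versus hoisting the lock outside the loop:

```cpp
#include <mutex>
#include <vector>

std::mutex cs_keystore;                 // stands in for cs_KeyStore
std::vector<int> keys = {1, 2, 3, 4};  // stands in for the wallet's key set

// Slow form: the lock is acquired and released on every iteration.
int CountMineSlow() {
    int n = 0;
    for (int k : keys) {
        std::lock_guard<std::mutex> lock(cs_keystore); // per-item lock
        n += (k % 2);                                  // "IsMine" stand-in
    }
    return n;
}

// Fast form: hoist the lock outside the loop and call a lockless helper.
int CountMineFast() {
    std::lock_guard<std::mutex> lock(cs_keystore); // one lock for the scan
    int n = 0;
    for (int k : keys) n += (k % 2);               // lockless "IsMine" stand-in
    return n;
}
```

Both forms compute the same result; the fast form simply pays the lock acquisition cost once per scan instead of once per key.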
As previously mentioned, I also noticed that GetTimeOffset() was consuming lots of CPU… hmm… Again it was taking a lock. This time the purpose was to ensure an atomic read of a 64-bit variable. There are better ways to do this on all CPU architectures, so I replaced the lock with an atomic 64-bit operation on supported compilers:
#ifdef __GNUC__ // BU optimize: do a barrier rather than a lock
    t1 = __sync_fetch_and_add(&nTimeOffset, 0);
#else
    t1 = nTimeOffset;
#endif
This issue could also have been addressed by factoring the GetTimeOffset() call out of the inner loop, but the structure of the code made that awkward so I chose this approach for another 5–10% improvement.
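For what it’s worth, on C++11 and later the same tear-free read can be expressed portably with std::atomic rather than a compiler-specific builtin. A minimal sketch (my own alternative, not the patch described above):

```cpp
#include <atomic>
#include <cstdint>

// Portable atomic 64-bit read without a mutex. The variable name mirrors
// the one in the text but this is an illustrative stand-in, not bitcoind code.
std::atomic<int64_t> nTimeOffsetAtomic{0};

int64_t GetTimeOffsetLockless() {
    // Relaxed ordering suffices here: we only need the read to be
    // tear-free, not ordered against other memory operations.
    return nTimeOffsetAtomic.load(std::memory_order_relaxed);
}
```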
Although these kind of fixes served as a good introduction to the subsystem, it wasn’t going to get me where I wanted to go. To really address this problem I needed to review large implementation decisions.
The first glaring issue is that the entire wallet was being extracted into a vector every time a payment was made (see SelectCoins() calling CWallet::AvailableCoins() here). And that vector was being passed by value (that is, copied) several times here. Since this bitcoind is typically the only entity paying from this wallet, and since we do a final transaction validity check at the end, it’s safe to keep an “available” TXOs data structure around and just remove used TXOs as payments are made. Currently the code regenerates this structure when its size falls below a constant. This is simpler than trying to guarantee that the structure is always perfectly coherent with the wallet by adding new TXOs as they come in, handling block re-orgs, etc., and it does not significantly affect small wallets, since a small wallet can regenerate the data structure quickly.
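The cached-TXO idea can be sketched in a few lines. This is an illustration under assumed names (TxoCache, kRefillThreshold), not the actual data structure in the patch:

```cpp
#include <cstddef>
#include <cstdint>
#include <set>

// Sketch of a cached "available TXOs" structure. TXO values (in satoshis)
// are kept in a multiset ordered by value; spent outputs are erased, and
// the whole set is rebuilt from the wallet once it runs low rather than
// being kept perfectly coherent with the wallet at all times.
struct TxoCache {
    std::multiset<int64_t> available;            // output values, sorted
    static const size_t kRefillThreshold = 200;  // rebuild below this size

    bool NeedsRefill() const { return available.size() < kRefillThreshold; }

    // Remove one TXO of the given value once it has been used in a payment.
    void Spend(int64_t value) {
        auto it = available.find(value);
        if (it != available.end()) available.erase(it);
    }
};
```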
The second efficiency bottleneck is that the code validates that the wallet contains enough money to meet the payment separately from constructing the transaction. The problem is that to do this, the code must sum up all the outputs in the wallet — so yet another iteration through the entire wallet’s TXO set. Generally, if an exceptional condition is expensive to test and can be caught early or late, it’s better to catch it late in performance-sensitive code. Another approach would be to cache the wallet’s current balance and update it based on funds flowing in and out rather than recalculating it. But again, this approach requires a lot of fragile code to ensure consistency.
Finally, the code in CWallet::CreateTransaction has significant inefficiencies. First, it would always run the coin selection assuming no transaction fees, which is completely absurd given the blocksize limit and the other fee estimation code recently added to Bitcoin Core (more on this later). Second, if the code guessed the wrong transaction fee, the code would rerun coin selection with a larger fee even if the selected inputs actually had enough value to cover the actual transaction fee (see this, vs this)!
The final nuclear option in an optimizer’s toolbox is to ask “is this algorithm actually achieving my goal?” And the more I considered it, the more I realised that it was not. The algorithm implemented was attempting to solve the “knapsack problem” — basically, the problem of selecting the maximum value of “stuff” that does not exceed a specific weight. In the case of the Bitcoin wallet, the problem is analogous but different: select a set of inputs whose total is larger than the transaction value, while matching the value of the inputs to the outputs as closely as possible. The “knapsack problem” is incredibly difficult — it’s provably NP-hard. However, there is an added twist in Bitcoin that I realized makes the optimization moot: the size of the “backpack” (the transaction value) changes based on the transaction fee, and the fee is not known until the full transaction is serialized.
The end result is that if the optimizer did a “good” job by finding a set of inputs that were near the output value, it would fail to include the fee, and so the transaction would be rejected. In this case, the code added the calculated fee to the transaction value and tried again. However, this makes no sense. There is no actual relationship between the size of the prior solution and the next one. Instead, this was only working due to the general similarity of transactions — if you know the number of inputs and outputs in a transaction, you can approximate its size.
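That approximation is simple enough to write down. A common rule of thumb for a legacy pay-to-pubkey-hash transaction (my own illustration, not a figure taken from the bitcoind source) is roughly 148 bytes per input, 34 bytes per output, plus about 10 bytes of fixed overhead:

```cpp
#include <cstddef>

// Back-of-the-envelope size estimate, in bytes, for a legacy P2PKH
// transaction given its input and output counts. The constants are a
// widely used rule of thumb, not exact serialized sizes.
size_t EstimateTxSize(size_t nInputs, size_t nOutputs) {
    return 148 * nInputs + 34 * nOutputs + 10;
}
```

Multiplying this estimate by a fee rate gives a fee guess before the transaction is ever serialized, which is why guessing from input/output counts mostly works.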
In fact, in my tests, the knapsack optimizer was being called several times per payment almost every time.
And what is the purpose of this “knapsack” optimization anyway? Who cares if 0.1 or 0.0001 is received as change? The penalty for failing to find an exact solution is one additional “change” output, and it is exactly the same no matter how much the algorithm misses by. Yes, an exact match saves an output, but the likelihood of that given 6- to 10-digit numbers (1 BTC is 100,000,000 satoshi and the code counts in satoshi) is small. Especially since the transaction fee adds a small non-round-number value to every payment, we really are dealing with 6- to 10-digit numbers, not 3-digit numbers with lots of zeros at the end. And the algorithm does not have much flexibility — there’s a real fee penalty for using lots of inputs. And finally, a near miss is actually a tiny bit WORSE for the user than a large miss, because in the near-miss case the code rolls this “dust” into the transaction fee rather than creating an output, while a larger miss pays the full leftover amount back to the change address…
So it became clear that the coin selection was optimizing the wrong thing.
So what should coin selection actually optimize? And when you ask that in the context of Bitcoin today, the answer becomes obvious. The coin selection should minimize the UTXO set, without significantly impacting transaction fees. In other words, it should attempt to consume more TXO inputs than it produces as outputs, and ideally spend some of the wallet’s near-dust outputs in transactions with larger inputs so that the transaction is not penalized. [Privacy advocates might be alarmed — shouldn’t wallets try to keep coins separate to confuse blockchain analysis? To which I’d respond that it’s very dangerous to rely on the algorithm to do so because, well, it must combine these outputs to build up a transaction. A better approach, taken by many wallets, is to allow you to separate your wallet into “accounts” or subsections that will not be mixed.]
A New Algorithm
The new algorithm is very simple because the purpose is no longer to solve an NP-hard problem, but to minimize the UTXO set and maximize efficiency.
It begins “offline” by storing all the wallet outputs in a container sorted by output value. This container is not reloaded from the wallet every time a payment request comes in. Instead, it is only reloaded when it contains less than a few hundred TXOs.
When a payment request comes in, it begins by estimating the transaction fee given a reasonable number of inputs and outputs and adds this to the request’s value. Let’s call this the “target value”. It then simply picks the smallest TXO larger than the target value and stores this as a solution (if there isn’t one, it combines the largest TXOs until a solution is found, or none is possible). It then iterates backwards, picking TXOs smaller than the target value. For each of these, it creates a new target value (the old target minus the TXO’s value) and recurses to build solution sets of up to a configurable number of TXOs. At the same time, it picks random TXOs and tries the same algorithm. This ensures that the algorithm considers lots of different-value TXOs, especially in the case where the backwards iterator may be working through a bunch of tiny value variations.
After a bunch of iterations, or if the algorithm gets “close” (the more iterations it takes, the more we relax what “close” means) to the target value, the best solution is returned.
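A heavily simplified sketch of the core idea follows. This is my own compressed illustration of the first two steps (smallest single covering TXO, else combine the largest), not the ~200-line implementation with its recursion, randomized probing, and “close enough” relaxation:

```cpp
#include <cstdint>
#include <set>
#include <vector>

// Given wallet TXO values sorted ascending in a multiset, first try the
// smallest single TXO that covers the target; otherwise greedily combine
// the largest TXOs until the target is met. Returns the chosen values,
// or an empty vector if the wallet cannot cover the payment.
std::vector<int64_t> SelectCoinsSketch(const std::multiset<int64_t>& txos,
                                       int64_t target) {
    std::vector<int64_t> picked;

    // Smallest single TXO >= target, if one exists.
    auto it = txos.lower_bound(target);
    if (it != txos.end()) {
        picked.push_back(*it);
        return picked;
    }

    // Otherwise combine the largest TXOs until the target is reached.
    int64_t total = 0;
    for (auto rit = txos.rbegin(); rit != txos.rend() && total < target; ++rit) {
        picked.push_back(*rit);
        total += *rit;
    }
    if (total < target) picked.clear(); // wallet cannot cover the payment
    return picked;
}
```

The real algorithm refines this starting point by recursing on smaller TXOs and random picks to find solution sets that land nearer the target.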
The entire algorithm is implemented in about 200 lines of code.
One thing that users seem to complain about is when a fee is included that is a large percentage of the sent value. This is obvious — nobody wants to pay 5–20%. Might as well use a credit card. So why isn’t some check included in the code? It must be complex, right? Nope.
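A check of that kind really is a one-liner. Here is a hypothetical version (the threshold of 5% and the function name are my own choices for illustration, not anything from the patch):

```cpp
#include <cstdint>

// Refuse to build a transaction whose fee exceeds some fraction of the
// sent value; 5% here is an arbitrary illustrative threshold. Integer
// math avoids floating point: fee/value <= 5/100 iff fee*100 <= value*5.
bool FeeIsReasonable(int64_t nFee, int64_t nSendValue) {
    if (nSendValue <= 0) return false;
    return nFee * 100 <= nSendValue * 5;
}
```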
The algorithm sustains 20 payments a second on “normal” 2 GHz laptop hardware and inside VMs, and uses only a few percent of the CPU (I will update this document when I discover what is inhibiting greater throughput — a likely candidate is writing the wallet changes to disk). Throughput is therefore about 200 times that of the original algorithm for the tested wallet sizes and, most importantly, it increases as the log of the wallet size rather than exponentially. And the original algorithm used 100% of the CPU, whereas this one maxes out at 10% — so efficiency is on the order of 1000 times that of the original algorithm.
When provided with a wallet with 10000+ random TXOs ranging from 1 mBTC to 10 BTC, it mostly picks 2 to 4 inputs and pays to 2 outputs (output + change). Therefore, UTXO size either stays the same or is reduced.
There is a lot more analysis that could be done, including explicit comparison of its TXO selection results against the original algorithm. However, this code is easily meeting my goal of testing very large blocks, so I felt that it was time to write this up.
There is also clearly a lot that can be done around transaction creation to improve the user experience, and to create a fully-validating bitcoin wallet implementation that is enterprise friendly. As I think I showed, these sorts of changes can also encourage better wallet behavior (for example UTXO bloat control) in a manner simpler than other solutions. And algorithms that already attempt to minimize UTXO bloat will likely play nicely with future miner algorithms that also prioritize UTXO-reducing transactions.
Hopefully, with time and effort, these kinds of changes can occur. With enough of them — changes that focus on the UX and on making the Bitcoin client usable for enterprises — we may actually find node counts increase!