Why Bitcoin Node Statistics Aren’t Trustworthy
I see a lot of statistics claiming that X% of Bitcoin nodes are running the Y client and that this means users support Y. Note that this is not about what miners are signaling in blocks, but about what software nodes on the Bitcoin network report running. I want to put to rest the argument that these statistics actually mean something.
How the Bitcoin Network Works
To understand how this statistic is typically gathered, you have to understand a little about Bitcoin’s network. Each node communicates with other nodes, called peers, to form the actual network. Bitcoin’s network is what you would call a gossip network. That is, the network spreads information in a decentralized way. The information in this case is new mempool transactions, newly found blocks and the locations of other nodes on the network. The specifications of this network communication can be found here.
Each node, when connecting to another node, sends out version information about itself. The user_agent value of the version information, formatted as defined in BIP0014, is what identifies the software the node is running. If you are running the reference client, you can see the software your peers are running by running the command bitcoin-cli getpeerinfo and looking at the subver value for each peer. For the reference client, this value looks like /Satoshi:0.13.2/. The Satoshi part is the software being run, 0.13.2 is the release of the reference client, and / and : are used as separators in the BIP0014 specification. For Bitcoin Unlimited, the software identifier string is BitcoinUnlimited, for btcd it is btcwire, and so on.
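To make the subver format concrete, here is a minimal sketch of splitting a BIP0014-style string into name and version pairs and tallying peers by software name. The function names, the sample peer list and the version numbers for the non-reference clients are illustrative assumptions. In practice the list would come from the JSON printed by bitcoin-cli getpeerinfo.

```python
import re
from collections import Counter


def parse_user_agent(subver):
    """Split a BIP0014-style string such as "/Satoshi:0.13.2/" into (name, version) pairs."""
    # BIP0014 allows optional comments in parentheses; drop them first.
    without_comments = re.sub(r"\([^)]*\)", "", subver)
    pairs = []
    for part in without_comments.strip("/").split("/"):
        if part:
            name, _, version = part.partition(":")
            pairs.append((name, version))
    return pairs


def count_clients(peers):
    """Tally peers by the first software name in their self-reported subver."""
    counts = Counter()
    for peer in peers:
        pairs = parse_user_agent(peer.get("subver", ""))
        # Simplification: only the first name in the string is counted.
        name = pairs[0][0] if pairs else "unknown"
        counts[name] += 1
    return counts


# Hypothetical sample shaped like entries from getpeerinfo (only subver shown).
sample_peers = [
    {"subver": "/Satoshi:0.13.2/"},
    {"subver": "/Satoshi:0.13.2/"},
    {"subver": "/BitcoinUnlimited:1.0.0/"},  # version number made up
    {"subver": "/btcwire:0.5.0/"},           # version number made up
]

if __name__ == "__main__":
    print(parse_user_agent("/Satoshi:0.13.2/"))
    print(count_clients(sample_peers))
```

Note that everything tallied here is a string the peer chose to send. Nothing in the parsing step verifies it.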
How Node Counters Count
To count how many nodes run each kind of node software, a counter typically first gets on the network as a node and then derives what software each peer is running from the version information it receives. Each peer’s IP address and software version are counted.
Then the node counter asks each peer what peers it has and connects to those nodes. By connecting to the peers of peers, the node counter can once again figure out what software each node is running. Each of these nodes’ IP addresses and software versions is counted if it hasn’t been counted already.
The node counter then connects to the peers of peers of peers and gets their version information. This continues until there are no new nodes to be found and voila, we have a count of how many nodes run what software.
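The crawl described above is essentially a breadth-first search over the peer graph. The following is a sketch over a simulated network rather than real sockets. The dictionary layout, the crawl function and the sample addresses (taken from a documentation-only IP range) are all assumptions made for illustration.

```python
from collections import Counter, deque

# Hypothetical network: address -> self-reported user agent and known peers.
SAMPLE_NETWORK = {
    "192.0.2.1": {"user_agent": "/Satoshi:0.13.2/", "peers": ["192.0.2.2", "192.0.2.3"]},
    "192.0.2.2": {"user_agent": "/BitcoinUnlimited:1.0.0/", "peers": ["192.0.2.1", "192.0.2.4"]},
    "192.0.2.3": {"user_agent": "/Satoshi:0.13.2/", "peers": ["192.0.2.4"]},
    "192.0.2.4": {"user_agent": "/btcwire:0.5.0/", "peers": ["192.0.2.5"]},
    # 192.0.2.5 is advertised by a peer but is not reachable.
}


def crawl(network, seed):
    """Visit every reachable node once and tally the user agent each one reports."""
    seen = {seed}
    queue = deque([seed])
    counts = Counter()
    while queue:
        addr = queue.popleft()
        node = network.get(addr)
        if node is None:
            continue  # advertised address that cannot be reached
        # The counter has no choice but to record whatever the node claims.
        counts[node["user_agent"]] += 1
        for peer in node["peers"]:
            if peer not in seen:  # count each address only once
                seen.add(peer)
                queue.append(peer)
    return counts


if __name__ == "__main__":
    print(crawl(SAMPLE_NETWORK, "192.0.2.1"))
```

A real counter would replace the dictionary lookup with a network connection and a version handshake, but the counting logic would have the same shape.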
This presents a problem, as nodes are self-reporting. A node can report running software other than what it is actually running.
As a matter of fact, a node doesn’t have to be running any full-node software at all! When other nodes ask for blocks or transactions, a fake node can simply ask a real node or a block explorer for the information and relay that. From the other nodes’ perspectives, there would be no way to tell whether the fake node is real or not, as the behavior is essentially the same.
An Obvious Sybil Attack
Node counting statistics, then, are pretty easy to Sybil attack. An adversary can spin up a ton of nodes that report running any arbitrary software and inflate the statistics toward whatever end they like. The only resources needed are IP addresses and a server to run a bunch of instances. I suspect that this has happened already.
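A little arithmetic shows how cheaply the percentages move. The sketch below uses made-up numbers (90 and 10 honest nodes, 80 fake ones) and hypothetical helper names. It is not a measurement of the real network.

```python
from collections import Counter


def shares(counts):
    """Convert raw node counts into each user agent's fraction of the total."""
    total = sum(counts.values())
    return {user_agent: n / total for user_agent, n in counts.items()}


def add_sybils(counts, claimed_user_agent, n):
    """Return new counts after n fake nodes all claim the same user agent."""
    inflated = Counter(counts)
    inflated[claimed_user_agent] += n
    return inflated


# Made-up honest network: 90 nodes report one client, 10 report another.
honest = Counter({"/Satoshi:0.13.2/": 90, "/BitcoinUnlimited:1.0.0/": 10})

# An adversary adds 80 fake nodes that all claim the minority client.
attacked = add_sybils(honest, "/BitcoinUnlimited:1.0.0/", 80)

if __name__ == "__main__":
    print(shares(honest))
    print(shares(attacked))
```

In this toy example the minority client goes from 10% to 50% of counted nodes without a single additional user behind it.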
The more importance we give to node counting statistics, the more incentive there is to manipulate these statistics. Therefore, it’s in the collective interest of the Bitcoin community to stop paying attention to these statistics and instead focus on other important matters.