There was a post recently comparing online nodes for Ethereum and Bitcoin, where I mentioned in the comments that the numbers for Ethereum are not representative enough.
I’ve been tracking online Ethereum nodes for some time now, so I have the data for further analysis. I publish some of the results for the Ethereum Classic network on the Gastracker page, though I have more detailed data, with deeper insights into all Ethereum-based networks.
So what is the problem with crawling the Ethereum network? The first and most obvious obstacle is the coexistence of Ethereum and Ethereum Classic: both share the same protocol, the same identifiers, and the same initial history, including the genesis block. Nodes from both networks sporadically connect to each other, exchanging new transactions and data from the shared history. When you connect to an ETH node, you can’t be sure that all of its peers are ETH nodes as well. The initial handshake doesn’t provide enough information, so a node keeps connections even with a peer from another network, tells other ETH nodes about that peer, and so on.
Network ID 1 is not Ethereum
That’s the main issue with data provided by services like Ethernodes. It just shows all nodes, regardless of the actual chain. In fact, it doesn’t say “ETH nodes” anywhere, only “nodes with network id 1”, and that id is shared by several different blockchains, not only ETH and ETC.
The only way to distinguish them is to connect to each node and download part of its blockchain history, because different forks have different blocks. In my experience, fewer than 70% of such nodes are actual Ethereum nodes.
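The check described above can be sketched roughly as follows. This is a minimal illustration, not the bot's actual code: `classify_peer`, `fetch_block_hash`, and `known_hashes` are hypothetical names, and the sketch assumes the crawler can already ask a peer for the hash of a block at a given height. The only fixed value is the DAO fork height (block 1,920,000), where ETH and ETC histories diverge; the hashes to compare against must be supplied by the caller.

```python
from typing import Callable, Dict, Optional

# Height of the DAO fork, the first block where ETH and ETC differ.
DAO_FORK_BLOCK = 1_920_000


def classify_peer(
    fetch_block_hash: Callable[[int], Optional[str]],
    known_hashes: Dict[str, str],
    fork_block: int = DAO_FORK_BLOCK,
) -> str:
    """Name the chain a peer follows, or return "unknown".

    fetch_block_hash is a hypothetical callback that asks the peer for
    the hash of the block at the given height (None if it has no answer).
    known_hashes maps lowercase block hashes at fork_block to chain names.
    """
    block_hash = fetch_block_hash(fork_block)
    if block_hash is None:
        # The peer is still syncing or refused to answer.
        return "unknown"
    return known_hashes.get(block_hash.lower(), "unknown")
```

Chains that forked later than the DAO fork would need the same comparison at their own fork heights, so a real crawler would likely check several heights per peer.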
Here is the distribution of nodes per chain with network id = 1, i.e., all of them are in the same bucket on the Ethernodes page.
In total there are ~10,000 nodes, of which only ~6,000 are ETH nodes, 1,000 are ETC nodes, and 2,000+ belong to other networks.
Compare this to the distribution of all Ethereum-based networks (i.e., not only network_id = 1):
One thing that may be misleading here is that not all of these 6,000 nodes are always available. The data above is based on the past 7 days, and it includes every node that showed up at least once. During a typical day, though, fewer than half of them are online:
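The gap between the 7-day total and the daily numbers comes purely from how nodes are counted. A small sketch of the two counts, assuming the crawl log is a list of `(node_id, day)` sightings (a hypothetical format chosen for illustration):

```python
from collections import defaultdict
from datetime import date
from typing import Dict, Iterable, Set, Tuple


def summarize_sightings(
    sightings: Iterable[Tuple[str, date]],
) -> Tuple[int, Dict[date, int]]:
    """Return (unique nodes over the whole window, unique nodes per day)."""
    per_day: Dict[date, Set[str]] = defaultdict(set)
    for node_id, day in sightings:
        # A node seen several times on one day is counted once for that day.
        per_day[day].add(node_id)
    # A node counts toward the window total if it showed up on any day.
    all_nodes: Set[str] = set().union(*per_day.values())
    daily_counts = {day: len(nodes) for day, nodes in per_day.items()}
    return len(all_nodes), daily_counts
```

The window total can only grow as more days are added, while each daily count reflects just the nodes reachable on that day, which is why the first number is always the larger one.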
Geth vs Parity
The other thing is the difference between Geth and Parity behavior. These two are the leading node implementations, and they share most of the market. However, it’s tough to measure their exact shares, and any mistake in the bot logic distorts the resulting numbers.
Why is that? It seems that Parity and Geth have different usage scenarios: the former prevails on servers, while the latter is more common on desktops. Mist, the official Ethereum wallet, shipped with Geth, and the same was true for Classic Geth, which was used by Emerald Wallet for desktop on ETC.
This difference in environment (desktop vs. server) makes Geth harder to catch. One problem is that Geth nodes stay online only for a short time. Another problem, which probably has a substantial impact, is that a home internet connection is usually not configured for port forwarding, sits behind a firewall, and is less stable than a connection in a datacenter. Because of that, you cannot connect to such a Geth instance and have to wait for an incoming connection from it instead. For this case, the bot needs to make sure that all possible Geth instances somehow learn about the bot and connect to it to introduce themselves.
To illustrate that, take a look at the distribution of nodes that allowed incoming connections, by client software.
In my experience, about 70% of Geth nodes don’t allow incoming connections, compared to 52% for Parity. On the other hand, Parity nodes are not as active in discovering other nodes, which may be explained by such nodes running on servers and actively searching for peers only in the first few minutes after launch.
The distribution by country could be interesting as well:
Please note the number for China. I’m sure the real number of nodes in China is at least twice as large, but the bot can’t connect to them because of the Great Firewall. The only way to measure everything correctly would be to run a bot instance in mainland China. I’m not sure how to do that yet, but I certainly want to at some point.
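Percentages like these can be derived from per-node observations. The sketch below assumes each record is a `(client_name, accepts_incoming)` pair, where `accepts_incoming` is true if the bot ever managed to dial the node itself; both the record shape and the function name are illustrative assumptions.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def inbound_only_share(nodes: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Per client, the fraction of nodes that never accepted a connection."""
    total: Counter = Counter()
    closed: Counter = Counter()
    for client, accepts_incoming in nodes:
        total[client] += 1
        if not accepts_incoming:
            # Such a node was only ever seen when it dialed the bot.
            closed[client] += 1
    return {client: closed[client] / total[client] for client in total}
```

Note that the result depends on the bot seeing inbound-only nodes at all: a node that neither accepts connections nor ever dials the bot is missing from both the numerator and the denominator.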
One of the main reasons I created my scraper bot is to watch hard forks and track the progress of network upgrades, i.e., which versions of the software are online. Sometimes it shows fascinating artifacts, like the one I posted before the Constantinople upgrade:
Notice how the share of compatible Parity nodes is growing organically over time, but Geth makes a huge spike in one day. It probably means that there is a way to control the distribution of the majority of Geth nodes, and therefore the network itself.
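Tracking upgrade progress like this comes down to parsing the client identification string each node reports and comparing its version against the first fork-compatible release. A rough sketch, assuming identifiers shaped like `Geth/v1.8.20-stable/linux-amd64/go1.11` (name first, then a `vMAJOR.MINOR.PATCH` version); the function names and the minimum versions passed in are hypothetical, not actual fork requirements:

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple

Version = Tuple[int, int, int]

# First "vX.Y.Z" occurrence in the identifier is taken as the version.
_VERSION_RE = re.compile(r"v(\d+)\.(\d+)\.(\d+)")


def parse_client(client_id: str) -> Optional[Tuple[str, Version]]:
    """Split a client identifier into (name, version), or None if malformed."""
    name = client_id.split("/", 1)[0]
    match = _VERSION_RE.search(client_id)
    if not name or match is None:
        return None
    major, minor, patch = (int(part) for part in match.groups())
    return name, (major, minor, patch)


def upgraded_share(
    client_ids: Iterable[str],
    min_versions: Dict[str, Version],
) -> Dict[str, float]:
    """Per client, the fraction of nodes at or above its minimum version."""
    seen: Dict[str, int] = defaultdict(int)
    ready: Dict[str, int] = defaultdict(int)
    for client_id in client_ids:
        parsed = parse_client(client_id)
        if parsed is None or parsed[0] not in min_versions:
            continue  # Unknown client or unparsable identifier.
        name, version = parsed
        seen[name] += 1
        if version >= min_versions[name]:
            ready[name] += 1
    return {name: ready[name] / seen[name] for name in seen}
```

Running `upgraded_share` on each day's snapshot of online nodes yields one point per client per day, which is the kind of series a chart of upgrade progress is built from.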
Another example is a recent ETC upgrade, which went ahead even though the majority of nodes had not upgraded yet. Risky, but fortunately, it didn’t damage the network. It took about a week after the fork for 50% of nodes to upgrade, and the upgrade is not finalized yet: nodes are still in the process of upgrading. Even worse, some of the old nodes are still actively mining the old unforked chain.
It’s hard to measure the Ethereum network, though there is undoubtedly a lot of interesting data and insight in it. I’m still working on improving the bot, trying to gather more details with better precision, and I hope to crawl other blockchains as well.
If you use data from the article, please put a link back to this page. Thank you. For further questions, you can reach me at firstname.lastname@example.org