Going to Mainnet is the dream of every crypto startup: being able to say that it’s all public and real, and that we have left the safe environment to face transaction fees, network congestion and high security risks. But with all that comes trust, and the motivation to contribute to the ecosystem and help solve these problems. Not just for ourselves, but for everyone; it’s open source after all.
While it may seem that moving to Mainnet is as simple as changing the Web3 provider to a Mainnet node and using the right chain ID within transactions, it’s actually much more than that. Mainnet involves dealing with real ETH, which means that with every move you’ve got one eye on transaction fees. If the fee is too low, the transaction will never be executed; if it’s too high, the business won’t be able to survive. One of the decisions we made at Zinc was to pay our users’ transaction fees, opening the ecosystem to the masses instead of just crypto enthusiasts and taking away the complications through design abstractions. I personally believe that it’s not a matter of finding the right answers but of asking the right questions, and that was the goal from the beginning.
Building a system which is fast, secure, trustless and decentralised remains an unsolved technical challenge. It may take a long time before we get there, so it’s important that we share every step of progress we make towards achieving it. To understand the complexity of the issues at hand, we had to get our hands dirty by diving into Mainnet, and we are using this opportunity to learn, grow and give back to the community, without which we wouldn’t be here today.
Let's look at some of the problems which left us miserable zombies for days, if not weeks.
Last week, we switched to Mainnet using Infura, the same provider we had been using for Ropsten over the last couple of months. However, things didn’t go as planned.
“Whatever can go wrong, will go wrong.” (Murphy’s law)
As soon as we switched to Mainnet, despite all the precautions we could possibly have taken, our transactions started failing with the error “nonce too low”. But how could that be? Everything worked fairly well on Ropsten, and the only reason we had tested on Ropsten for months, despite it being slow, was that it’s the only test network which simulates Mainnet with a proper PoW algorithm. So why was the nonce too low?
To find the answer, let’s look at how we get the nonce. The usual way is to get the transaction count for the account sending the transaction and use that as the nonce, and that’s exactly what we were doing. The only problem is that if two users call the function at the same time, they both get the same nonce, meaning only one of the two transactions can succeed and the other will fail. The situation is the same even if there are a thousand simultaneous calls to the function: at most one transaction will succeed. To counteract this, we looked at the documentation and added the keyword “pending” to the request, which at the time seemed to count all transactions including the ones still pending (we ensured that each transaction was broadcast before we asked for the next transaction count). This seemed to work fine on the Ropsten testnet, so why did it fail on Mainnet?
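The race described above can be reproduced without a real network. The following is a minimal sketch using a hypothetical in-memory `ToyNode` (the class, its methods and the rejection messages are illustrative assumptions, not a real node API): every caller reads the same transaction count before anyone has sent, so all of them pick the same nonce.

```python
from dataclasses import dataclass, field


@dataclass
class ToyNode:
    """Hypothetical stand-in for a node tracking a single sending account."""
    confirmed_count: int = 0                      # transactions already mined
    accepted_nonces: set = field(default_factory=set)

    def get_transaction_count(self) -> int:
        # Simplified: only confirmed transactions are counted.
        return self.confirmed_count

    def send_transaction(self, nonce: int) -> str:
        # Simplified rejection rules; a real node has more cases
        # (for example, replacing a pending transaction with a higher fee).
        if nonce < self.confirmed_count:
            return "nonce too low"
        if nonce in self.accepted_nonces:
            return "rejected: duplicate nonce"
        self.accepted_nonces.add(nonce)
        return "accepted"


# Two requests arrive together: both read the count before either one sends.
node = ToyNode(confirmed_count=7)
nonce_a = node.get_transaction_count()
nonce_b = node.get_transaction_count()
results = [node.send_transaction(nonce_a), node.send_transaction(nonce_b)]
print(nonce_a, nonce_b, results)

# The same happens with a thousand simultaneous callers: one winner at most.
busy_node = ToyNode(confirmed_count=7)
nonces = [busy_node.get_transaction_count() for _ in range(1000)]
outcomes = [busy_node.send_transaction(n) for n in nonces]
print(outcomes.count("accepted"))
```

In this toy model both callers receive nonce 7, and only the first transaction is accepted, however many callers there are.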
The first theory we had was that Infura load-balances requests across its nodes and every node has its own memory pool. Because the nodes are only eventually consistent, two of them may return different answers to the same question for reasons such as network delays, different hardware specifications and many more. This made sense, as Infura may have been running a single node on Ropsten, which would explain the consistent results we had seen there. To test this, we used QuikNode to spin up our own Mainnet Ethereum node and, after a 30-minute wait, we were ready to test. We wrote a simple script that repeatedly asked the node for our account’s transaction count through web3, once with the “latest” keyword and once with “pending”.
The key point to note here is that using the “latest” keyword returned consistent results, whereas using the “pending” keyword returned inconsistent results. This was very odd. We were missing part of the puzzle and, after researching for several hours and trying to make sense of things, I came to the following understanding of how “pending” is implemented. (Note that this is my best understanding and it could be wrong, so please do let me know if I have missed something.)
When we ask for the transaction count with the “pending” keyword, the node looks at all transactions for that account, including the ones in its own memory pool. It then predicts whether or not each pending transaction will get into the block currently being mined. Wow, that’s super smart, right? Not quite. The prediction is based on the gas fees of the transaction. If your transaction has the highest fees, there is a high probability that a rational miner will put it in the block currently being mined. Unfortunately, not all miners are rational and some just pick at random, but let’s assume for now that they are rational and that the probability of your transaction being included in the next block is quite high. What happens now?
The transaction count will be all confirmed (already mined) transactions plus the pending transactions for which the prediction is true. Then new transactions come in which may have higher gas fees than yours, the prediction for your transaction flips, and the transaction count returned is now lower than the one you got earlier. As you can imagine, using that as a nonce results in the error “transaction nonce too low” and the system becomes really unstable.
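To make that fluctuation concrete, here is a toy model of the behaviour as I have described it. It is a sketch of my understanding only, not the actual node implementation, and the function name, the fee values and the `block_capacity` parameter are all made up for illustration.

```python
from typing import List


def pending_count(confirmed: int, my_fees: List[int],
                  other_fees: List[int], block_capacity: int) -> int:
    """Confirmed count plus our pending transactions predicted to be mined next.

    Toy prediction rule (an assumption): the next block takes the
    `block_capacity` highest-fee transactions from the memory pool.
    """
    # Tag each pending transaction with whether it is ours, then rank by fee.
    pool = [(fee, True) for fee in my_fees] + [(fee, False) for fee in other_fees]
    pool.sort(key=lambda item: item[0], reverse=True)
    predicted = pool[:block_capacity]
    return confirmed + sum(1 for _, is_mine in predicted if is_mine)


# Our single pending transaction pays 20 and currently makes the cut.
before = pending_count(confirmed=10, my_fees=[20], other_fees=[5, 8], block_capacity=2)

# Two better-paying transactions arrive and push ours out of the predicted block.
after = pending_count(confirmed=10, my_fees=[20], other_fees=[5, 8, 30, 40], block_capacity=2)

print(before, after)
```

In this model the count drops from 11 to 10 even though nothing was mined or cancelled. A caller who now uses 10 as the nonce collides with the transaction that is still pending under that nonce.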
You may think that’s all, but there is another problem. The memory pool of a full Ethereum node has a maximum number of transactions it can keep and, again, it keeps the ones which are more likely to be mined (those with high transaction fees). This means that your transaction can even be kicked out of the memory pool of pending transactions. (Note that this doesn’t mean it will never be mined, because it may still be in other nodes’ memory pools.) The question at this point is: how do you know? Well, you don’t, unless you run your own node with a specific configuration. And remember, even in that case, if the transaction isn’t sent from one of the accounts that the node controls, it will be treated just like any other transaction and will be kicked out when the memory pool gets full.
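The eviction behaviour can be pictured with a small fee-ordered pool. This is a simplified sketch under the assumption that a full pool always drops its cheapest transaction; the `ToyMempool` class and the fee numbers are hypothetical and do not reflect how any particular client sizes or prioritises its pool.

```python
import heapq


class ToyMempool:
    """Bounded pool that drops the lowest-fee transaction when it overflows."""

    def __init__(self, max_size: int):
        self.max_size = max_size
        self._heap = []  # min-heap of (fee, tx_id); the cheapest entry is first

    def add(self, tx_id: str, fee: int):
        """Add a transaction and return the id of the evicted one, if any."""
        heapq.heappush(self._heap, (fee, tx_id))
        if len(self._heap) > self.max_size:
            _, evicted = heapq.heappop(self._heap)
            return evicted
        return None

    def __contains__(self, tx_id: str) -> bool:
        return any(existing == tx_id for _, existing in self._heap)


pool = ToyMempool(max_size=3)
pool.add("ours", fee=10)
pool.add("a", fee=50)
pool.add("b", fee=60)

# The pool is full, so one more higher-fee transaction pushes ours out.
evicted = pool.add("c", fee=70)
print(evicted, "ours" in pool)
```

Here the fourth transaction evicts `ours`, and from that point this node no longer knows our transaction exists, even though other nodes may still hold it.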
Luckily, Infura has really big memory pools. But it doesn’t matter how big a pool is, simply because someone could send a million transactions to the same Infura node with higher transaction fees than yours, and yours would still be kicked out. Coming back to how Infura does load balancing and the first theory we had: it turns out that it’s true. Even just calling the transaction count function repeatedly gives fluctuating results, because of the load balancing and the desynchronisation of the nodes they control. But hey, it’s a free service, what do you expect?
It turns out that “pending” was only implemented to make the UX/UI better, by showing users a progress bar of how long their transaction will take before it’s mined. That makes sense, but the documentation isn’t clear about it. I would suggest only ever using “pending” for this scenario and nowhere else, as it’s inconsistent. In particular, never use it to get a nonce.
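The effect of load balancing over desynchronised nodes can be sketched as follows. Everything here is a hypothetical model: the `ToyBackend` and `RoundRobinProvider` classes, the round-robin strategy and the numbers are assumptions for illustration, not a description of how Infura actually routes requests.

```python
from itertools import cycle


class ToyBackend:
    """One node behind the load balancer, with its own view of the memory pool."""

    def __init__(self, confirmed: int, pending_seen: int):
        self.confirmed = confirmed        # mined transactions for our account
        self.pending_seen = pending_seen  # our pending transactions this node knows of

    def get_transaction_count(self, block: str = "latest") -> int:
        if block == "pending":
            return self.confirmed + self.pending_seen
        return self.confirmed


class RoundRobinProvider:
    """Sends each request to the next backend in turn."""

    def __init__(self, backends):
        self._backends = cycle(backends)

    def get_transaction_count(self, block: str = "latest") -> int:
        return next(self._backends).get_transaction_count(block)


# Both nodes agree on the chain, but only the first has seen our two pending transactions.
provider = RoundRobinProvider([ToyBackend(10, 2), ToyBackend(10, 0)])

latest = [provider.get_transaction_count("latest") for _ in range(4)]
pending = [provider.get_transaction_count("pending") for _ in range(4)]
print(latest)
print(pending)
```

In this model “latest” is stable across calls while “pending” alternates between 12 and 10, depending on which backend happens to answer.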
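One possible alternative to asking the node for a “pending” count is to seed a counter once from the confirmed count and then hand out nonces locally under a lock. The sketch below is a common pattern and not a description of what we run at Zinc; `LocalNonceManager` and `fetch_latest_count` are hypothetical names, and the approach assumes that a single process sends every transaction for the account and that nothing is pending when it starts.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class LocalNonceManager:
    """Hands out strictly increasing nonces without re-querying the node."""

    def __init__(self, fetch_latest_count):
        # `fetch_latest_count` is any callable returning the confirmed
        # ("latest") transaction count for the sending account.
        self._lock = threading.Lock()
        self._next_nonce = fetch_latest_count()

    def reserve(self) -> int:
        # The lock guarantees that no two callers receive the same nonce.
        with self._lock:
            nonce = self._next_nonce
            self._next_nonce += 1
            return nonce


# Stand-in for a node call: pretend 7 transactions are already confirmed.
manager = LocalNonceManager(lambda: 7)

# A hundred concurrent requests each reserve a nonce.
with ThreadPoolExecutor(max_workers=8) as executor:
    reserved = list(executor.map(lambda _: manager.reserve(), range(100)))

print(sorted(reserved) == list(range(7, 107)))
```

A real system would also need to handle a transaction that fails to broadcast or gets dropped, since that leaves a gap in the sequence and blocks every later nonce until it is filled.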
Here is a summary of what we have learned from this experience:
- Trust no one. Run your own node
- Keep your schedule free for at least a week when going to Mainnet
- Get plenty of sleep before going live (who knows when you will get a chance again)
- Test out everything and be very skeptical
- Have microservices in place which notify you when something goes wrong (trust me, things will go wrong)
- Enjoy the experience and share it with others
It’s been an incredible week with lots of learning and sleepless nights, but it’s all been worth it. If you have any questions or comments, please leave them below or message me on Twitter @aliazam2251
Related issue: “why the return eth.getTransactionCount decrease”, Issue #15994, ethereum/go-ethereum