# The BatchOverflow Bug and How to Catch All Bugs

There has been a lot of press lately about the batchoverflow bug and how it allowed someone to send themselves a bazillion trillion tokens. I know there is an exact number, but does it really matter? Earlier, a group of researchers from Singapore and the UK analyzed almost one million ethereum contracts and concluded that over 34,000 ethereum smart contracts representing \$4.4 million ETH may be in danger of locking, leaking, or being killed.

All of this points to one important concept — making software provably bug-free. In case you came here lured by the title, hoping to find the magic technique of writing provably bug-free software, let me give you the real, absolute bad news in all this and get it out of the way: it is impossible to prove (in a rigorous, scientific, mathematical way) that any software is 100% bug free. Catching the bug is the real fly in the ointment.

The batchoverflow bug is just one example and an exemplar of a core issue. But first, what is this batchoverflow thing anyway? It is important to look at this in some detail because that will help us understand the complexities of such problems, how to make our software as bug-free as possible. It will also help us understand why it is impossible to guarantee bug-free software in any general way.

Let me offer an intuitive analogy to the general overflow problem: assume you pull out an old calculator that has only a 10-digit display and no scientific notation. The largest number you can display on it is 9,999,999,999 (without the commas). If you add 1 to it, what happens? The answer is 10 billion, which has ten zeros. The most significant digit (1) “drops off” the left end, leaving nothing but zeros. This is the “overflow” error, which can cause considerable anguish to a nine-billionaire who is going after his tenth billion. A well-designed calculator would show an error, leading you to suspect that something was wrong. But, a poorly designed calculator would just show a bunch of zeroes or just one zero. Another example is the odometer in a car, which would reset to zero on the millionth mile (assuming it has six digits).

This is pretty much what happened with the batchoverflow bug, and it all happened in the context of performing multiple transactions in one batch (that’s where the “batch” comes from). The numbers involved, of course, are much, much larger, but every type of number representation has a limit. To prevent overflows in 8-bit computers, you can create and use 32-bit computers, but after about 4.2 billion, the overflow error strikes. You can go on to 64-bit or 1024-bit or even more advanced computers, but you’d be out of luck eventually because the number of integers is infinite. No matter how big a house you build, you would just be pushing the significant overflow bit out of the house at some point. (Of course, if you are working on 1024-bit computers, which is highly unlikely any time soon, and keep encountering the overflow error, you have an entirely different type of problem on your hands.)

Programmers use different types of number representations, such as int and uint. ‘int’ stands for integer and allows the computer to store whole numbers in the range of -2,147,483,648 to 2,147,483,647. ‘uint’ stands for unsigned integer (a fancy way of saying ‘no negative numbers’) and can represent numbers in the range of 0 (zero) to 4,294,967,295 (these ranges are for 32-bit computers).

You may ask, why not just use the biggest available number type all the time? In the batchoverflow case, that is no defense, but generally speaking, the bigger the number type, the more memory it takes up, slowing everything else down. It’s like putting only one or two people in every bus; pretty soon, you’d have a large number of under-used buses on the road and a massive traffic problem.

Overflow errors (and there are many flavors of these) can’t be easily detected because the may not cause the system to fail (as would be the case if some errant piece of code hogged all of the computer memory and caused the entire program to crash, or even caused the computer to crash); they may merely give rise to unpredictable behavior in subsequent code. Consider, for example, what happens when a car lot is full. One extra car that arrives at the lot is turned away. The car lot doesn’t “crash,” but the extra car is left to its own devices. If its owner parks it outside the car lot’s boundaries, it could be infringing on private property or be placed in a tow-away zone.

As a quick example in the special case of the overflow bug, it is not possible to detect that this type of error has occurred, except by detecting other undesirable effects (such as transferring a bazillion dollars) or writing data to memory areas that are outside the computer’s internal buffers. However, it is possible to safeguard against integer overflow errors by using what are known as “integer-safe libraries.” It is best to use well-tested libraries rather than creating your own.

Ironically, the BEC (Beauty Chain) contract is based on the ERC20 standard, which itself includes the SafeMath library. Below is the excerpt of the BEC contract code, where only multiplication is shown, but go read the rest of the code if you are interested in seeing how it works for addition, subtraction, and division (the lines in green between “/*” and “*/” or preceded by “//” are comments in the original code):

Here is how the derived BasicToken itself uses the SafeMath addition and subtraction:

However, the BEC token, implemented from this basic infrastructure, doesn’t use the SafeMath operation. Others have written extensively about the batchoverflow bug, describing the bug and analyzing the underlying code, so I won’t repeat all that here. But just in case you are curious, here is the errant piece of code, specifically line 257 from the BEC token contract:

The overflow error is only one of numerous errors caused by a combination of quirks in programming environments, differences in syntax, and sometimes just sheer unintentional carelessness.

In modern software systems, even a simple GUI-based “Hello, world!” type of program can run to several thousand lines of code! A quick note to nitpickers: Yes, just about all of those lines of code deal with managing the GUI and the marginal increase due to non-trivial application code is small. Modern cars have tens of millions of lines of code, with the Ford F150 topping at 150 million lines of code, causing us to speculate if the “150” should now stand for lines of code! By way of comparison, the 787 Dreamliner has a mere 7 million lines of code. While the issues with ethereum smart contracts bring attention to such errors (money does that), such errors have always been around.

Now, back to the bad news. Creating bug-free code is possible but unprovable in the mathematical sense (which, epistemologically, is the gold standard for any type of proof). If your computer program passes the hundreds of millions of tests of every possible scenario, you can be reasonably certain that it is robust and mostly error-free, but it’s impossible to prove that it is completely error-free.

There are two reasons for this mathematically frustrating yet juicy and entertaining situation. One is called the halting problem of Turing machines (and all general-purpose programming languages are Turing-complete).

It is proven that the halting problem (or question, if you prefer) is undecidable. Happily, the proof is easy for a non-mathematician to follow (but requires at least a 2-minute attention span, which I know is a lot to ask for).

The topic of Turing machines — and of the halting problem in particular — belongs to computational theory. However, the undecidability of halting is a ramification of Russell’s paradox in set theory, a topic in pure mathematics, which Gödel formalized. Rice’s theorem delivers the fatal blow (“Et tu, Rice?”) by generalizing the halting problem to properties of formal languages. The most entertaining explanation of Russell’s paradox comes from Groucho Marx in his quip, “I don’t want to belong to any club that will accept me as a member,” if you assume he’d rather belong to a club that wouldn’t accept him as a member!

The other reason is that it is impossible to prove a negative, or in this case, the absence of a bug. The root of this epistemological dilemma is metaphysical: that which does not exist leaves no trace (or has no effect) on the universe. It is for this most important reason that we can’t ever prove that someone is innocent (which is the absence of wrongful activity) or that something doesn’t exist (for example, a six-legged blue unicorn dragon). The burden of proof in all these cases lies with the person who states the positive (i.e., that something exists). For example, the person who claims that your code has a bug has to prove it by showing the bug; the person who claims that you committed a crime must prove it while you have no obligation to prove that you are innocent (which, as we saw, is impossible). As you might imagine, this is a deep topic in itself.

Having gotten the bad news out of the way, let us see what we can do to make software as robust as possible. The clue that there is a silver lining of sorts on this cloud of undecidability is that these mathematical and philosophical arguments only state that the absence of bugs cannot be proven in a general way. But that shouldn’t prevent us from using a number of tools and techniques to deal with particular cases.

To begin with, we have the usual headaches associated with improper specification of requirements, improper translation or loss of intent when converting requirements into code, and improper checking for boundary conditions (of which the overflow is an example), and many more transgressions. Then, more complications are introduced by decentralized and distributed systems, open-source software, types of programming languages, types of development environments, multiple technologies, run-time infrastructure, and so on.

In the context of blockchain, we need strategies to alleviate the problem (we already know we can’t provably eliminate all problems). One serious issue with blockchain is its immutability and the implicit enforcement of the ‘code is law’ principle. The advantage of immutability at the data level turns into a problem at the code level because most modern code is produced in an Agile way. This means rapid and frequently mildly buggy (or at least not full-featured) code that is deployed into production, then iteratively and continually improved (or more features added). This conflicts with the need to have contracts (which, after all, are legal in nature) as perfect and error-free as possible from the very first implementation. There are solutions for this, but completely distributed production of smart contracts in an open and permissionless blockchain only adds to the challenge.

When serious money is involved (i.e., not just affordable and speculative play money), one strategy is to constrain the environment to implement only a permissioned-blockchain. Only certain trusted parties are allowed to participate in the blockchain, either to enforce the security protocol or execute smart contracts. However, having permissions is not the be-all and end-all of either security or executing correct contracts. The standard security moats around the castle (prevention, detection, recovery, and ‘retribution’) do not address the quality of the smart contract code itself.

This brings us to failures at the application level. There are a number of techniques to improve the code quality. However, exhorting programmers to be careful is not one of them. We have to assume that even the most experienced programmer will write buggy code. Automated tools for examining code, checking for direct usage of operators (in fact, integer-safe libraries should be imported or required, not just cut-and-paste), and finally, a governance process to review and release code into production (similar to App store reviews, but perhaps more rigorous), and continual monitoring.

While not exhaustive and detailed, here are some principles that are relevant for all applications but especially for blockchain applications:

1. Prevent as many bugs as possible (follow best practices, reuse extremely well-tested libraries or modules, conduct peer reviews, conduct exhaustive automated tests, test for boundary conditions, identify code that depends on operating system, representation (number systems, character encoding, collations, type casting, etc.), and so on.

2. Fail as gracefully as possible — try not to crash and burn, but parachute down if engine fails.

3. Notify — tell somebody when failures occur.

4. Identify any potential failures — these are anomalous, abnormal, or unusual situations (data, transactions, usage patterns, trends); some of these may be legitimate transactions but should be identified because of potential errors.

5. Correct or recover from failures as quickly as possible.

6. Feedback into next iteration — conduct retrospective to plough back lessons into a continuous improvement cycle.

7. In case of failures or errors caused by deliberate or malicious intent, do everything reasonable to deter future attempts; this includes providing data to and cooperating with legal enforcing entities.

As the seriousness of potential errors increases, the level and scope of governance must also increase. Generally, strong governance is associated with centralization and bureaucracy, but that need not be the case. It will be useful to think of governance in several layers. It starts with self-governance at the individual level, then validations at the team level, and then moving on to the level of the collective (association, community, or consortium). Financial systems require governance at the industry and government level. Many financial transactions are global in nature, requiring some form of governance at the international level.

There are two other important aspects of complexity that are sources of functional errors (if not bugs in the classical sense). Code that passes inspection at one time and in one domain, operating system, or virtual machine can behave differently in another context. This situation is a consequence of working in distributed systems. By way of comparison, note for example the myriad inconsistencies in browser implementations and the difficulties of rendering html on various devices. The final source of complexity is change. As conditions in the environment change, this gives rise to outdated and inconsistent behavior.

Freedom to do what one likes in a completely decentralized system sounds great. Until someone else has the freedom to do unto one what one doesn’t want or the consequences of loss are beyond an acceptable threshold.

Flexibility comes at a price, paid in the ‘coin’ of complexity, automation, cost, governance, and regulation. The more flexibility is desired, the less immediate automation is possible (as in smart contracts). For example, bitcoin’s security and immutability of data (different but related concepts) comes at the expense of a having only a simple set of transactions. When we desire more types of transactions or more flexibility in their implementation, then we require smart contracts. These demand, as we have seen, much stronger governance of code quality. A well-engineered approach to systems design and to governance is necessary if we move into even more flexible and sophisticated systems. Complexity in future blockchains would arise from the use of artificial intelligence, massive amounts of data, transactions involving multiple-parties and jurisdictions, privacy mandates, multiple types of transactional instruments such as contracts and securities, and complex regulation.

Ultimately, it is this systemic approach to governance that builds trust in the markets. When we despair of regulation and intermediaries, what we are really bemoaning is the lack of flexibility, low cost, and slow transactions.

Cobbling a patchwork of features using a fingers-in-the-dyke approach to building decentralized distributed ledger technologies is more than just unsustainable. It is unconscionable. The cowboy approach in the name of unfettered innovation is fine for laboratory proofs-of-concept where some bugs may be tolerable. Maturing the solutions to the point of handling other people’s hard-earned money requires a mindset of engineering with responsible and efficient governance. While none of this can provably eliminate errors, this should at least prevent and mitigate the adverse effects of most of the fatal flies in the ointment.

Kiran Garimella, Ph.D., is a Chief Scientist and Chief Technology Officer at KoreConX. He is the author of two books on Business Process Management and a soon-to-be-published book on the intersection of AI & blockchain. He is the also the co-founder of CognitiveWorld.com and the chief curator of iKnowCentral.com.

Written by