Solidity Bytecode and Opcode Basics
As we go deeper into writing smart contracts, we will come across terms like “PUSH1”, “SSTORE”, “CALLVALUE”, etc. What are they, and should we even care about them?
To understand these instructions, we have to go deeper into the Ethereum Virtual Machine (EVM). I was surprised there were very few resources on this subject when I googled around. Perhaps they were too technical? In this article, I’ll try to explain some EVM basics as simply as I can.
Like many other popular programming languages, Solidity is a high-level programming language. We understand it but the machine doesn’t. When we install an Ethereum client such as geth, it also comes with the Ethereum Virtual Machine, a lightweight, sandboxed runtime that is specially created to run smart contracts.
When we compile Solidity code with the solc compiler, it translates our code into bytecode, the only form the EVM can execute.
Let us take a very simple contract for example:
pragma solidity ^0.4.11;

contract MyContract {
    uint i = (10 + 2) * 2;
}
If we run this code in the Remix browser IDE and click on the contract details, we see lots of information.
In this case, the compiled code is:
60606040525b600080fd00a165627a7a7230582012c9bd00152fa1c480f6827f81515bb19c3e63bf7ed9ffbb5fda0265983ac7980029
This long string is the hexadecimal representation of the final contract, also known as bytecode. Under the “Web3 Deploy” section of Remix, we see:
...
{
from: web3.eth.accounts[0],
data: '0x606060405260186000553415601357600080fd5b5b60368060216000396000f30060606040525b600080fd00a165627a7a7230582012c9bd00152fa1c480f6827f81515bb19c3e63bf7ed9ffbb5fda0265983ac7980029',
gas: '4300000'
}, function (e, contract){
console.log(e, contract);
if (typeof contract.address !== 'undefined') {
console.log('Contract mined! address: ' + contract.address + ' transactionHash: ' + contract.transactionHash);
}
})
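In simple terms, when we deploy the contract, we send the hexadecimal string in the data field, with the gas limit set to 4300000. Note that this data is longer than the compiled code above: it starts with extra setup code that runs once at deployment and then returns the compiled code to be stored on the chain.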
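One way to see how the two hex strings relate is to compare them directly. The sketch below uses Python purely for illustration (the variable names are my own, not part of Remix or web3) and checks that the data payload ends with the compiled code shown earlier, preceded by a short setup prefix.

```python
# Hex strings copied from the Remix output above.
runtime_hex = "60606040525b600080fd00a165627a7a7230582012c9bd00152fa1c480f6827f81515bb19c3e63bf7ed9ffbb5fda0265983ac7980029"
deploy_hex = "0x606060405260186000553415601357600080fd5b5b60368060216000396000f30060606040525b600080fd00a165627a7a7230582012c9bd00152fa1c480f6827f81515bb19c3e63bf7ed9ffbb5fda0265983ac7980029"

runtime = bytes.fromhex(runtime_hex)
deploy = bytes.fromhex(deploy_hex[2:])  # drop the "0x" prefix

# Everything in front of the compiled code is the setup prefix.
setup_len = len(deploy) - len(runtime)
ends_with_runtime = deploy.endswith(runtime)
setup_prefix = deploy[:setup_len].hex()

print(ends_with_runtime)          # True
print(setup_len, len(runtime))    # 33 54
print(setup_prefix)
```

The two lengths, 33 (0x21) and 54 (0x36), are the same numbers that appear as PUSH1 operands just before CODECOPY in the opcode listing.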
We have to start thinking in hexadecimal if we want to talk to the EVM. Ever wonder why there is a “0x” in front of your wallet address or transaction hash? That’s right, anything beginning with “0x” simply means the value is in hexadecimal format. The “0x” prefix is only a notation for human readers: the EVM itself works on raw bytes, so the bytecode means the same thing with or without it.
We also see the operation codes (aka opcodes):
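As a small illustration of the prefix being optional, here is a Python sketch. The helper name to_bytes is hypothetical; it only shows that both spellings decode to the same raw bytes.

```python
def to_bytes(hex_string: str) -> bytes:
    """Convert a hex string, with or without a leading 0x, to raw bytes."""
    if hex_string.startswith(("0x", "0X")):
        hex_string = hex_string[2:]
    return bytes.fromhex(hex_string)

with_prefix = to_bytes("0x6060604052")
without_prefix = to_bytes("6060604052")

print(with_prefix == without_prefix)  # True
print(list(with_prefix))              # [96, 96, 96, 64, 82]
```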
PUSH1 0x60 PUSH1 0x40 MSTORE PUSH1 0x18 PUSH1 0x0 SSTORE CALLVALUE ISZERO PUSH1 0x13 JUMPI PUSH1 0x0 DUP1 REVERT JUMPDEST JUMPDEST PUSH1 0x36 DUP1 PUSH1 0x21 PUSH1 0x0 CODECOPY PUSH1 0x0 RETURN STOP PUSH1 0x60 PUSH1 0x40 MSTORE JUMPDEST PUSH1 0x0 DUP1 REVERT STOP LOG1 PUSH6 0x627A7A723058 KECCAK256 SLT 0xc9 0xbd STOP ISZERO 0x2f LOG1 0xc4 DUP1 0xf6 DUP3 PUSH32 0x81515BB19C3E63BF7ED9FFBB5FDA0265983AC798002900000000000000000000
Opcodes are the low-level, human-readable instructions of the program. Every opcode has a hexadecimal counterpart, e.g. “MSTORE” is “0x52”, “SSTORE” is “0x55”, etc. The pyethereum GitHub repo and the older Ethereum yellow paper are good references for all the EVM opcodes and their hexadecimal values.
The EVM is also a stack machine. To explain simply, imagine stacking up slices of bread in a microwave: the LAST slice you put in is the FIRST one you take out. In computer science jargon, we call this LIFO (last in, first out).
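To make the mapping between bytes and opcode names concrete, here is a minimal disassembler sketch in Python. It is not how Remix does it; the table is deliberately partial (only the opcodes found in the first 33 bytes of the deployment data), and the function name disassemble is my own.

```python
# Partial opcode table: only what the first 33 bytes of the example need.
OPCODES = {
    0x00: "STOP",
    0x15: "ISZERO",
    0x34: "CALLVALUE",
    0x39: "CODECOPY",
    0x52: "MSTORE",
    0x55: "SSTORE",
    0x57: "JUMPI",
    0x5B: "JUMPDEST",
    0x60: "PUSH1",
    0x80: "DUP1",
    0xF3: "RETURN",
    0xFD: "REVERT",
}

def disassemble(code_hex: str) -> str:
    """Turn a hex string of bytecode into a space-separated opcode listing."""
    code = bytes.fromhex(code_hex)
    out = []
    pc = 0
    while pc < len(code):
        byte = code[pc]
        name = OPCODES.get(byte, "UNKNOWN_0x%02x" % byte)
        out.append(name)
        pc += 1
        # PUSH1 is followed by one byte of data, not by another opcode.
        if name == "PUSH1" and pc < len(code):
            out.append(hex(code[pc]))
            pc += 1
    return " ".join(out)

listing = disassemble(
    "606060405260186000553415601357600080fd5b5b60368060216000396000f300"
)
print(listing)
```

The printed listing should match the opcode output above, from the first PUSH1 0x60 up to the first STOP.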
In normal arithmetic, we write our equation this way:
// Answer is 14. We do multiplication before addition.
10 + 2 * 2
A stack machine works on the LIFO principle, so the same calculation is written as:
2 2 * 10 +
It means: put “2” on the stack first, followed by another “2”, then apply the multiplication. The result is “4” sitting on top of the stack. Now push the number “10” on top of “4” and finally add the two numbers together. The final value on the stack is 14. This type of arithmetic is called postfix notation or Reverse Polish Notation.
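The same walk-through can be sketched as a tiny postfix evaluator in Python. This is only an illustration of the stack idea (it handles just “+” and “*”, and the function name eval_postfix is my own), not a model of the real EVM.

```python
def eval_postfix(expression: str) -> int:
    """Evaluate a space-separated postfix expression using a stack."""
    stack = []
    for token in expression.split():
        if token in ("+", "*"):
            # An operator pops its two operands and pushes the result.
            right = stack.pop()
            left = stack.pop()
            stack.append(left + right if token == "+" else left * right)
        else:
            # A number is simply pushed on top of the stack.
            stack.append(int(token))
    return stack.pop()

print(eval_postfix("2 2 * 10 +"))  # 14
print(eval_postfix("10 2 + 2 *"))  # 24, i.e. (10 + 2) * 2 from the contract
```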
The act of putting data on the stack is called the “PUSH” instruction and the act of removing data from the stack is called the “POP” instruction. It’s obvious that the most common opcode in our example above is “PUSH1”, which means putting 1 byte of data onto the stack.
So, this instruction:
PUSH1 0x60
means putting the 1-byte value “0x60” on the stack. Coincidentally, the hexadecimal value of the “PUSH1” opcode happens to be “0x60” as well. Dropping the optional “0x”, we can write this instruction in bytecode as “6060”.
Let us go a bit further.
PUSH1 0x60 PUSH1 0x40 MSTORE
Looking at our favourite pyethereum opcode chart again, we see that MSTORE (0x52) takes in 2 inputs and produces no output. The opcodes above mean:
- PUSH1 0x60: put the value 0x60 on the stack.
- PUSH1 0x40: put the value 0x40 on the stack.
- MSTORE (opcode 0x52): take the top two items off the stack and write the value 0x60 into memory at position 0x40.
The resulting bytecode is:
6060604052
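Going in the other direction, from opcodes to bytecode, can be sketched in a few lines of Python. The assemble function and its three-entry table are hypothetical helpers for this article only; a real assembler covers the full opcode set.

```python
# Opcode values taken from the text above.
OPCODE_BYTES = {"PUSH1": 0x60, "MSTORE": 0x52, "SSTORE": 0x55}

def assemble(listing: str) -> str:
    """Translate a space-separated opcode listing into a hex bytecode string."""
    tokens = listing.split()
    out = bytearray()
    i = 0
    while i < len(tokens):
        name = tokens[i]
        out.append(OPCODE_BYTES[name])
        i += 1
        if name == "PUSH1":
            # The token after PUSH1 is its 1-byte operand, written in hex.
            out.append(int(tokens[i], 16))
            i += 1
    return out.hex()

print(assemble("PUSH1 0x60 PUSH1 0x40 MSTORE"))  # 6060604052
print(assemble("PUSH1 0x18 PUSH1 0x0 SSTORE"))   # 6018600055
```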
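In fact, we see this magic number “6060604052” at the beginning of the bytecode produced by this generation of the Solidity compiler (newer compiler versions emit “6080604052” instead), because it is how the smart contract bootstraps its memory.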
To further complicate the matter, 0x40 and 0x60 must not be read as the decimal numbers 40 and 60. Since they are hexadecimal, 0x40 equals 64 (4 x 16¹ + 0 x 16⁰) and 0x60 equals 96 (6 x 16¹ + 0 x 16⁰) in decimal.
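The conversion can be spelled out digit by digit. The Python sketch below (with a hypothetical helper name, hex_to_decimal) multiplies the running value by 16 for every new digit, which is all that hexadecimal place value amounts to.

```python
def hex_to_decimal(digits: str) -> int:
    """Convert hex digits (without the 0x prefix) to a decimal integer."""
    value = 0
    for digit in digits.lower():
        # Shift the previous digits one place to the left, then add the new one.
        value = value * 16 + "0123456789abcdef".index(digit)
    return value

print(hex_to_decimal("40"))  # 64
print(hex_to_decimal("60"))  # 96
print(hex_to_decimal("18"))  # 24, the value of (10 + 2) * 2 in the contract
```

In practice Python's built-in int("40", 16) does the same job; the loop is only there to show the arithmetic.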
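In short, what “PUSH1 0x60 PUSH1 0x40 MSTORE” does is store the number 96 (0x60) in memory at position 64 (0x40). Solidity keeps its “free memory pointer” at that position: the first 64 bytes of memory are scratch space, the next 32 bytes (positions 64 to 95) hold the pointer itself, and the value 96 says that free memory begins right after it.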
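The effect of MSTORE on memory can be simulated with a byte array. This Python sketch assumes the standard MSTORE behaviour of writing a 32-byte big-endian word at the given position; the function name mstore is my own and gas accounting is left out.

```python
def mstore(memory: bytearray, offset: int, value: int) -> None:
    """Write value as a 32-byte big-endian word at offset, growing memory if needed."""
    end = offset + 32
    if len(memory) < end:
        memory.extend(bytes(end - len(memory)))  # new memory is zero-filled
    memory[offset:end] = value.to_bytes(32, "big")

memory = bytearray()
mstore(memory, 0x40, 0x60)  # PUSH1 0x60 PUSH1 0x40 MSTORE

print(len(memory))                               # 96
print(memory[:0x40] == bytes(64))                # True: first 64 bytes still zero
print(int.from_bytes(memory[0x40:0x60], "big"))  # 96
```

After the write, memory is exactly 96 bytes long and the word stored at position 64 contains the number 96.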
In the EVM, there are 3 places to store data. Firstly, the stack: we’ve just used the “PUSH” opcode to store data there in the example above. Secondly, memory (like RAM), which we write to with the “MSTORE” opcode. Lastly, storage (like a disk), which we write to with “SSTORE” and which persists between transactions. Writing to storage costs the most gas and writing to the stack costs the least.
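To see the stack and storage working together, here is a toy Python interpreter for the “PUSH1 0x18 PUSH1 0x0 SSTORE” part of our example (bytecode “6018600055”). It is a sketch that understands only these two opcodes, with storage modelled as a plain dictionary; the function name run is hypothetical.

```python
def run(code: bytes):
    """Execute PUSH1 and SSTORE only; return the final stack and storage."""
    stack = []
    storage = {}
    pc = 0
    while pc < len(code):
        op = code[pc]
        pc += 1
        if op == 0x60:    # PUSH1: push the next byte on the stack
            stack.append(code[pc])
            pc += 1
        elif op == 0x55:  # SSTORE: pop the slot number, then the value
            key = stack.pop()
            value = stack.pop()
            storage[key] = value
        else:
            raise ValueError("opcode not supported in this sketch: " + hex(op))
    return stack, storage

stack, storage = run(bytes.fromhex("6018600055"))
print(stack)    # []
print(storage)  # {0: 24}
```

Slot 0 now holds 24, the value of (10 + 2) * 2, and the stack is empty again: the stack was only a temporary workspace, while storage is what persists.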
Assembly Language
It is also possible to write a whole smart contract using opcodes. That’s where Solidity’s assembly language comes in. It might be a lot harder to understand, but it can be useful if you want to save gas or do things that cannot be done in plain Solidity.
Summary
We have only covered the basics of bytecode and a few opcodes. There are many opcodes not yet discussed, but you get the idea. Back to the original question of whether we should even bother learning EVM opcodes: possibly yes and no.
We don’t need to know opcodes to start writing smart contracts, and learning them adds to the learning curve. On the other hand, EVM error handling is still very primitive at the time of writing and it’s handy to look at opcodes when things go wrong. At the end of the day, there is no harm in learning more.