Creating Your Own Smart Contract Languages Using LLVM

Alan Li
Apr 28, 2020

Creating your own programming language used to be very daunting: you had to know the compiler pipeline from end to end. Thanks to LLVM, a multi-source, multi-target compiler framework, developers can now create their own languages without going into the machine details.

Now, with the EVM-LLVM project, LLVM supports EVM byte-code generation, which means we can start to create our own smart contract languages using LLVM.

The goal of this article

This article shows how we can use EVM-LLVM to make the Kaleidoscope toy language generate blockchain-deployable smart contracts. In this version of the article, we are not going to write a complete compiler that covers every area; instead, we cover the essential parts of porting a simple LLVM-based language to a smart contract platform. After going through the article, you should be able to figure out how to create your own smart contract language using the LLVM framework.

So you may ask: isn’t the glorious LLVM supposed to support all the languages built upon it by default, once we implement a code generation backend for the corresponding architecture? Well, yes and no. EVM is a very peculiar software architecture that requires the compiler to do some housekeeping before a program can be deployed onto the blockchain. In the following sections you will see why we need to do some extra work to make our program executable on a blockchain.

Preparations

Download EVM-LLVM

You need a working EVM-LLVM development package to build your smart contract language.

git clone git@github.com:etclabscore/evm_llvm.git
cd evm_llvm
git checkout EVM
mkdir build && cd build
cmake -DLLVM_EXPERIMENTAL_TARGETS_TO_BUILD=EVM ..
make -j8

In case you want to know, the EVM backend is based on LLVM 10, the latest release at the time of writing (March 2020). Since EVM-LLVM is just another backend in the LLVM project, everything follows the LLVM conventions. Whenever you get stuck, check LLVM’s documentation for a clue.

We need the compiled libraries and the EVM-LLVM headers to generate our smart contracts.

Baseline language implementation

The Kaleidoscope language is used in the LLVM project as an example to showcase building a new language on the LLVM infrastructure. The tutorial implements the complete Kaleidoscope language over a series of 9 chapters.

Albeit very toy-ish, it can still be quite daunting at first sight. The tutorial covers a lot of topics, which we cannot cover fully in this article. For the sake of time, we are not going to port the complete Kaleidoscope language; instead, we start with the example code at the end of Chapter 3. You can find the starting code at:

https://github.com/etclabscore/evm_llvm/blob/master/examples/Kaleidoscope/Chapter3/toy.cpp

The original Chapter 3 implementation already has a simple lexer, parser and an AST-based codegen. You can write a function declaration and it will emit the lowered LLVM IR on the screen. That is what we need to show the essentials of building a smart contract language using LLVM. Of course we can add a lot more features to our small language to make it much better, but please bear with me for now as we will only focus on the essentials.

In short, we want to achieve the following by the end of this article:

  • compile a complete Kaleidoscope function into LLVM IR, and subsequently use the EVM backend to generate EVM byte-codes.
  • generate EVM-specific supporting functions to make sure our little smart contract can interact with input and output resources.

The short specification of Kaleidoscope can be found in the tutorials on LLVM’s website. The complete Kaleidoscope includes a JIT compiler, but in this tutorial we only cover a subset of it. In short, we want users to write a Kaleidoscope function definition, and our tiny compiler to generate EVM-LLVM IR for it that can be compiled and executed.

Changing the compiler implementation

EVM-specific headers

To create a fully functional EVM smart contract, we need access to some EVM-specific opcodes such as CALLDATALOAD and RETURN. EVM-specific instructions are exposed to compiler developers as intrinsic functions. To use them, we need to include the EVM intrinsic function declarations:

#include "llvm/IR/IntrinsicsEVM.h"

Supporting 256bit integers

Floating point data types are not allowed on blockchains, because different machines might produce slightly different floating point results, which would cause forks on the chain. So we need to convert Kaleidoscope’s numbers (doubles in the original tutorial) into integral types. In this case, I want to showcase the support for 256-bit integers in EVM-LLVM.
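To make the determinism argument concrete, here is a small Python sketch (an illustration only, not part of the toy compiler). It shows one well-known source of divergence: IEEE-754 addition is not associative, so two compilers or machines that group the same sum differently can disagree in the last bit. Integer arithmetic modulo 2^256, which is how EVM words behave, has no such problem. The helper name add256 is hypothetical.

```python
# Illustrative sketch: why floats are risky for consensus, and why
# fixed-width integers are not. `add256` is a hypothetical helper name.

WORD_MODULUS = 2 ** 256  # EVM words are 256 bits wide


def add256(a: int, b: int) -> int:
    """Add two values the way a 256-bit machine word would: wrap on overflow."""
    return (a + b) % WORD_MODULUS


# Floating point: the grouping of operations changes the result.
left_grouped = (0.1 + 0.2) + 0.3
right_grouped = 0.1 + (0.2 + 0.3)
floats_agree = left_grouped == right_grouped  # False on IEEE-754 doubles

# 256-bit integers: every grouping gives the same answer, even on overflow.
big = WORD_MODULUS - 1
ints_agree = add256(add256(big, 2), 3) == add256(big, add256(2, 3))  # True
```

The float case differs only in the last digit, but on a blockchain a one-bit difference in state is enough for nodes to disagree.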

In LLVM, an arbitrary-length integer is stored using the llvm::APInt class. To create a 256-bit integer, one can simply call:

Value* int256 = ConstantInt::get(TheContext, APInt(256, "1234567890123456789", 10));

We can wrap it inside a function to make materializing a 256-bit integer easier:

static Value *Get256ConstantInt(int64_t val) {
  return ConstantInt::get(TheContext, APInt(256, val));
}

In our small language, we changed NumberExprAST to take a std::string instead of an int64_t, so we can process 256-bit integers, which are 4x the size of an int64_t:

class NumberExprAST : public ExprAST {
  std::string Val;

public:
  NumberExprAST(std::string Val) : Val(Val) {}
  Value *codegen() override;
};

Value *NumberExprAST::codegen() {
  return ConstantInt::get(TheContext, APInt(256, Val, 10));
}

Notice that here we use the string-flavoured overload of the APInt constructor to correctly parse a 256-bit integer.
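Because the literal now travels through the frontend as a string, it may be worth validating it before it reaches APInt. The Python sketch below models such a range check; parse_uint256 is a hypothetical name used only for illustration and is not an LLVM or EVM-LLVM API.

```python
# Sketch of the range check a frontend might perform on a numeric literal.
# `parse_uint256` is a hypothetical name, not an EVM-LLVM or LLVM API.

MAX_UINT256 = 2 ** 256 - 1


def parse_uint256(text: str) -> int:
    """Parse a decimal literal and reject anything that does not fit in 256 bits."""
    if not (text.isascii() and text.isdigit()):
        raise ValueError(f"not a decimal literal: {text!r}")
    value = int(text)
    if value > MAX_UINT256:
        raise ValueError(f"literal does not fit in 256 bits: {text}")
    return value
```

A lexer could run an equivalent check while scanning a number token and report a source-level error instead of failing later during code generation.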

Adding a smart contract function dispatcher

The function dispatcher

Every EVM contract starts executing from the beginning of its byte-code section. At that point the memory and stack are empty: bare metal. It is therefore up to the compiler to generate the initialization code that bootstraps the system. For this, a smart contract needs a meta function, which we call the “function dispatcher” here. EVM-LLVM requires a specific function layout to ensure proper smart contract generation; for details, see this wiki page and the Function Selector section in Deconstructing a Solidity Contract.

The entry function is a special function and the LLVM backend handles it differently:

  • it must be named “main”
  • its return type should be void
  • it should have no linkage, i.e., no other function should call the entry function
  • it should be the first function in the LLVM IR Module
  • it is invisible to developers and users

The special name “main” identifies the definition of the entry function, and the compiler will not generate return subroutine code for it (because the entry function is always noreturn: it always terminates with a RETURN or REVERT instruction). In our smart contract compiler, we write a function that generates the entry function.

Function *GenerateWrapperFunction(Function *calleeF) {
  // The entry function takes no parameters and returns void.
  FunctionType *FT = FunctionType::get(Type::getVoidTy(TheContext), false);
  Function *F = Function::Create(FT, Function::ExternalLinkage, "main",
                                 TheModule.get());
  BasicBlock *BB = BasicBlock::Create(TheContext, "entry", F);
  Builder.SetInsertPoint(BB);
  ...
}

Notice that GenerateWrapperFunction takes the callee function (the “wrapped” function) as an argument. The callee function is the function we actually want to execute. In this example, we take in the callee function solely because we need to generate code that extracts the smart contract arguments.

Initializing frame pointer

There are many ways we can execute a smart contract. Modern architectures use hardware-aided context switching mechanisms to call subroutines. Instructions like JUMP AND LINK store the current context in a special register before starting to execute a subroutine.

LLVM-compiled EVM contracts maintain a frame pointer in memory to record the start address of the call frame (which is also stored in memory). See this wiki page for more details. Ideally, such machine-dependent initialization code should not appear in LLVM IR.

In the case of EVM, however, the compiler backend cannot do this on behalf of contract developers, because there are many ways to implement a function dispatcher, and a language frontend might have its own ideas about it. So we think exposing the function dispatcher details to the language frontend gives it freedom in implementing a new smart contract language.

In this implementation, our tiny compiler takes full control and is responsible for the initialization. Luckily, we only need a single IR instruction to initialize the context.

The frame pointer is stored at memory address 0x40. You should initialize it to a value greater than or equal to 0x60 (so that the memory stack section does not overlap the frame pointer itself). The value should also be aligned to 32 bytes. In our example we initialize it to 128.
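The two constraints on the initial value can be written down as a tiny check. This is a Python sketch for illustration only; the constant and function names are hypothetical, while the numbers are the ones given in the text.

```python
# Sketch of the two constraints described above for the initial frame
# pointer value. Names are hypothetical; the numbers come from the text.

FRAME_POINTER_SLOT = 0x40   # memory address that holds the frame pointer
MIN_FRAME_BASE = 0x60       # first address past the 32-byte frame pointer slot
WORD_SIZE = 32              # required alignment, in bytes


def is_valid_frame_base(value: int) -> bool:
    """Return True if `value` can be used to initialise the frame pointer."""
    return value >= MIN_FRAME_BASE and value % WORD_SIZE == 0
```

With this check, 128 (the value used in this article) and 0x60 are accepted, while 0x40 (overlaps the slot itself) and 100 (not 32-byte aligned) are rejected.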

EVM-LLVM exposes the EVM opcode MSTORE as an intrinsic function for explicitly writing to an EVM memory address, in particular for initializing the frame pointer. Once the EVM intrinsic header is included, we can call the intrinsic Intrinsic::evm_mstore to do the job:

Value* addr = Get256ConstantInt(64);
Value* val = Get256ConstantInt(128);
Function *mstore = Intrinsic::getDeclaration(
TheModule.get(), llvm::Intrinsic::evm_mstore, llvm::None);
Builder.CreateCall(mstore, {addr, val});

Extract EVM input values

Unlike other platforms, EVM uses the CALLDATALOAD opcode to explicitly extract the input arguments passed to the contract. The following code generates a series of CALLDATALOAD instructions, one per argument of the callee:

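// Read one 32-byte word of call data per parameter of the callee.
FunctionType *calleeTy = calleeF->getFunctionType();
std::vector<Value *> extractedParams;
for (size_t i = 0; i < calleeTy->getNumParams(); ++i) {
  // Parameter i lives at byte offset i * 32 in the input.
  Value *val_int = Get256ConstantInt(i * 32);
  Function *calldataloadF = Intrinsic::getDeclaration(
      TheModule.get(), llvm::Intrinsic::evm_calldataload, llvm::None);
  CallInst *calldataload = Builder.CreateCall(calldataloadF, {val_int});
  extractedParams.push_back(calldataload);
}
// Call the wrapped function with the extracted arguments.
CallInst *call_calleeF = Builder.CreateCall(calleeF, extractedParams);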
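To see what the generated CALLDATALOAD sequence does at run time, here is a small Python model of the opcode and of the extraction loop. It is an illustrative sketch with hypothetical function names, not Geth or EVM-LLVM code. CALLDATALOAD reads a 32-byte big-endian word at the given offset, and bytes past the end of the input read as zero.

```python
# Illustrative model of CALLDATALOAD and of the argument-extraction loop.
# Function names are hypothetical; this is not Geth or EVM-LLVM code.

WORD_SIZE = 32


def calldataload(calldata: bytes, offset: int) -> int:
    """Read one 32-byte big-endian word; bytes past the end read as zero."""
    chunk = calldata[offset:offset + WORD_SIZE]
    chunk = chunk.ljust(WORD_SIZE, b"\x00")
    return int.from_bytes(chunk, "big")


def extract_params(calldata: bytes, num_params: int) -> list:
    """Mirror the loop above: parameter i is read from offset i * 32."""
    return [calldataload(calldata, i * WORD_SIZE) for i in range(num_params)]
```

For example, an input made of the two words 7 and 9 yields the parameter list [7, 9], while a truncated input silently produces zero-padded values, which is why the caller must supply full 32-byte words.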

Return a value from EVM

The entry function should be able to return a value from the smart contract, in case the defined function needs to return a value to the user. EVM uses a special method to return a value from a smart contract. First, the return value is stored at a memory address. Then, the RETURN opcode is called with the start offset and the length of the memory region to return.

Value* addr_int = Get256ConstantInt(0);
Function *mstore = Intrinsic::getDeclaration(
TheModule.get(),
llvm::Intrinsic::evm_mstore,
llvm::None);
Builder.CreateCall(mstore, {addr_int, call_calleeF});
Function *evm_return = Intrinsic::getDeclaration(
TheModule.get(),
llvm::Intrinsic::evm_return,
llvm::None);
Builder.CreateCall(evm_return,
{Get256ConstantInt(0), Get256ConstantInt(32)});
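The store-then-return sequence can be modelled in a few lines of Python. This is a simplified sketch with hypothetical names: memory is a growable byte array, MSTORE writes one 32-byte big-endian word, and RETURN hands back a slice given its start offset and length.

```python
# Illustrative model of the MSTORE + RETURN sequence. Names are hypothetical.

WORD_SIZE = 32


def mstore(memory: bytearray, addr: int, value: int) -> None:
    """Write `value` as a 32-byte big-endian word at `addr`, growing memory if needed."""
    end = addr + WORD_SIZE
    if len(memory) < end:
        memory.extend(b"\x00" * (end - len(memory)))
    memory[addr:end] = (value % 2 ** 256).to_bytes(WORD_SIZE, "big")


def evm_return(memory: bytearray, offset: int, length: int) -> bytes:
    """Return `length` bytes of memory starting at `offset`."""
    data = bytes(memory[offset:offset + length])
    return data.ljust(length, b"\x00")  # untouched memory reads as zero
```

Storing 19 at address 0 and returning 32 bytes from offset 0 gives back the 32-byte encoding of 19, which is the shape of the result the entry function produces.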

Move the entry function to the beginning of the function list

EVM-LLVM generates functions according to the order of the defined functions. So we should make sure our entry function main appears as the first one in the list:

// You should include "llvm/IR/SymbolTableListTraits.h" here
auto *wrapper = GenerateWrapperFunction(FnIR);
using FunctionListType = SymbolTableList<Function>;
FunctionListType &FuncList = TheModule->getFunctionList();
FuncList.remove(wrapper);
FuncList.insert(FuncList.begin(), wrapper);

Compiling and linking our language frontend

llvm-config can be used to obtain the compiler flags and libraries needed when building against LLVM. In our case, we have to use EVM-LLVM’s own llvm-config to retrieve the correct paths. Below is an example that calls llvm-config to fill in the compilation options.

clang++ -g toy.cpp `~/workspace/evm_llvm/build/bin/llvm-config --cxxflags --ldflags --system-libs --libs core` -o evmtoy

Generating the LLVM IR

Our little compiler can generate specific LLVM IR, which will be used to generate EVM byte-codes. Here is an example of emitted output:

ready> def foo(a b) a*b + 2 * a + 3 * b;
ready> Emitting Smart contract IR:
; ModuleID = 'My cool EVM function'
source_filename = "My cool EVM function"
define void @main() {
entry:
call void @llvm.evm.mstore(i256 64, i256 128)
%0 = call i256 @llvm.evm.calldataload(i256 0)
%1 = call i256 @llvm.evm.calldataload(i256 32)
%2 = call i256 @foo(i256 %0, i256 %1)
call void @llvm.evm.mstore(i256 0, i256 %2)
call void @llvm.evm.return(i256 0, i256 32)
unreachable
}
define i256 @foo(i256 %a, i256 %b) {
entry:
%multmp = mul i256 %a, %b
%multmp1 = mul i256 2, %a
%addtmp = add i256 %multmp, %multmp1
%multmp2 = mul i256 3, %b
%addtmp3 = add i256 %addtmp, %multmp2
ret i256 %addtmp3
}
; Function Attrs: nounwind writeonly
declare void @llvm.evm.mstore(i256, i256) #0
; Function Attrs: nounwind readnone
declare i256 @llvm.evm.calldataload(i256) #1
; Function Attrs: noreturn nounwind
declare void @llvm.evm.return(i256, i256) #2
attributes #0 = { nounwind writeonly }
attributes #1 = { nounwind readnone }
attributes #2 = { noreturn nounwind }

Run our small compiler

Generating EVM Bytecodes

Let’s copy and paste the generated LLVM IR to a file, and fire our EVM-LLVM’s llc to get the EVM assembly:

build/bin/llc -print-after-all -debug -mtriple=evm -filetype=asm toy.ll

It will produce a toy.s file that contains the generated EVM assembly for your examination. To generate the EVM binary, pass -filetype=obj instead:

build/bin/llc -print-after-all -debug -mtriple=evm -filetype=obj toy.ll

Now we have toy.s and toy.o, but to execute the contract we only need the object file. One last step: in order to run it from the command line, we have to convert the binary to a hex string. I use a simple Python script to do the job:

#!/usr/bin/env python3
import binascii
import sys


def get_contract(inputfile: str) -> str:
    """Read a binary file and return its contents as a hex string."""
    with open(inputfile, 'rb') as file:
        data = file.read()
    return binascii.hexlify(data).decode("utf-8")


if __name__ == "__main__":
    print(get_contract(sys.argv[1]))

Just pass toy.o as the argument to the script, and we get the EVM binary as a plain hex string:

5b6080604052602080356000808035610040909192939091604051806108200152604051610840016040526004580192565b60405160209003516040529052f35b8082029190910290019056

Run it!

Now it is time to execute our newly generated EVM smart contract! Let’s fire up Geth’s evm tool on the command line; it runs the contract locally.

The command line looks like the following:

evm --input 0000000000000000000000000000000000000000000123456789001234567890000000000000000000000000000000000000000000098765432109876543210 --code 5b6080604052602080356000808035610040909192939091604051806108200152604051610840016040526004580192565b60405160209003516040529052f35b8082029190910290019056 run

So here is how we parse it. The --input specifies the input value to the smart contract, and --code specifies the smart contract byte-codes. The --input option is a string of hex code which contains the input to the smart contract. Usually it contains the function signature and the arguments to the function.
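Writing the --input string by hand is error-prone, because every argument must occupy exactly 64 hex digits (32 bytes). The Python sketch below builds the string from integers; encode_args is a hypothetical helper written for this article's two-integer calling convention and is not part of Geth's evm tool.

```python
# Sketch: build the --input hex string from integer arguments.
# `encode_args` is a hypothetical helper, not part of Geth's evm tool.

def encode_args(*values: int) -> str:
    """Encode each argument as a 32-byte big-endian word and concatenate as hex."""
    words = []
    for value in values:
        if not 0 <= value < 2 ** 256:
            raise ValueError(f"argument does not fit in 256 bits: {value}")
        words.append(format(value, "064x"))  # 64 hex digits = 32 bytes
    return "".join(words)
```

Calling encode_args(a, b) always yields 128 hex digits, so the second argument is guaranteed to start at byte offset 32, where the entry function expects it.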

In our small entry function implemented above, we did not ask users to specify information other than two 256-bit integers. We can definitely do a lot more, like adding a function signature to the input field and a function selector mechanism in entry function, or parsing different types of arguments, et cetera.
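As a rough idea of what such a selector mechanism does, here is a Python model of a dispatcher that reads a 4-byte tag from the front of the input and then decodes 32-byte arguments. Everything in it is an assumption made for illustration: the selector values are made-up placeholders (real ABIs derive them from a hash of the function signature), and bar is a hypothetical second function.

```python
# Sketch of a selector-based dispatcher. The selector values below are
# made-up placeholders; real ABIs derive them from a hash of the signature.

SELECTOR_SIZE = 4
WORD_SIZE = 32


def foo(a: int, b: int) -> int:
    """The example function from this article, with 256-bit wraparound."""
    return (a * b + 2 * a + 3 * b) % 2 ** 256


def bar(a: int) -> int:
    """A hypothetical second function, only here to make dispatch meaningful."""
    return (a + 1) % 2 ** 256


# Hypothetical selector table: 4-byte tag -> (function, number of arguments)
DISPATCH_TABLE = {
    bytes.fromhex("00000001"): (foo, 2),
    bytes.fromhex("00000002"): (bar, 1),
}


def dispatch(calldata: bytes) -> int:
    """Pick a function from the first 4 bytes, then decode its 32-byte arguments."""
    selector = calldata[:SELECTOR_SIZE]
    if selector not in DISPATCH_TABLE:
        raise ValueError("unknown selector: a real contract would revert here")
    function, arity = DISPATCH_TABLE[selector]
    body = calldata[SELECTOR_SIZE:]
    args = []
    for i in range(arity):
        word = body[i * WORD_SIZE:(i + 1) * WORD_SIZE].ljust(WORD_SIZE, b"\x00")
        args.append(int.from_bytes(word, "big"))
    return function(*args)
```

In a compiler, the same structure would be emitted as IR inside the entry function: load the selector, compare it against each known tag, and branch to the code that extracts arguments for the matching function.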

The input field is not directly visible to the contract; it can only be read through call-data instructions such as CALLDATALOAD, which is specifically used to extract values from the input field.

Parsing results

After EVM finishes executing, evm emits the result returned from the smart contract:

0x00000674f561a2226dd39084c7dccd2395a994d8e52577362432026874293089

This result is the 32 bytes we copied using the RETURN instruction at the end of our entry function.
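One way to sanity-check a run is to decode the returned word and compare it with a reference model of the function computed off-chain. The Python sketch below does that for foo; both helper names are hypothetical, and the model assumes plain 256-bit wraparound arithmetic.

```python
# Sketch: decode the word printed by `evm` and compare it with a reference
# model of foo. Helper names are hypothetical.

def foo_model(a: int, b: int) -> int:
    """Reference model of `def foo(a b) a*b + 2 * a + 3 * b` with 256-bit wraparound."""
    return (a * b + 2 * a + 3 * b) % 2 ** 256


def decode_result(hex_text: str) -> int:
    """Turn the returned hex string (with or without a 0x prefix) into an integer."""
    digits = hex_text[2:] if hex_text.lower().startswith("0x") else hex_text
    return int(digits, 16)
```

For small inputs the check is easy to do by hand: foo_model(2, 3) is 2*3 + 2*2 + 3*3 = 19, so a contract run with the arguments 2 and 3 should return a word that decodes to 19.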

Other LLVM IR codegen topics

  • The alloca instruction allocates a 32-byte frame object (a function-local object) in the memory frame of the current function.
  • memory spaces are indexed.
  • Usually, it is up to the language frontends to emit EVM smart contract ABI information. But of course we can write an LLVM IR pass to emit the contract ABI.

Restrictions

EVM is designed for deterministic execution. Because of this, some facilities we use every day are not available. Here is a list:

  • No heap space. You cannot call malloc or something similar to get dynamically allocated memory. If you really need a heap, you can write your own malloc function to manage a region of memory.
  • No floating-point support. EVM does not have any floating-point instructions, because they would very likely break consensus. Workaround: use a software floating-point emulation library if needed.
  • No arbitrary jump targets. Unrestricted jumps make static analysis and verification difficult, so EVM has a no-op placeholder instruction, JUMPDEST, which serves as the label of a jump destination. Jumping to a non-JUMPDEST address terminates EVM execution.
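The JUMPDEST rule can be illustrated with a small Python sketch of the analysis an EVM implementation performs before running code: scan the byte-code once and record every offset that holds a JUMPDEST opcode, skipping over the immediate data of PUSH instructions so that a data byte equal to 0x5b is not mistaken for a label. The function name is hypothetical.

```python
# Sketch of JUMPDEST analysis: find the offsets a jump may legally target.
# `valid_jump_destinations` is a hypothetical name.

JUMPDEST = 0x5B
PUSH1 = 0x60
PUSH32 = 0x7F


def valid_jump_destinations(code: bytes) -> set:
    """Return offsets holding a JUMPDEST opcode, skipping PUSH immediates."""
    destinations = set()
    pc = 0
    while pc < len(code):
        opcode = code[pc]
        if opcode == JUMPDEST:
            destinations.add(pc)
        if PUSH1 <= opcode <= PUSH32:
            pc += opcode - PUSH1 + 1  # skip the immediate data bytes
        pc += 1
    return destinations
```

For the byte sequence 5b 60 5b 5b, offsets 0 and 3 are valid destinations, while offset 2 is not, because that 0x5b is the immediate operand of the PUSH1 at offset 1.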

So, EVM is a restricted, domain-specific execution environment; one cannot expect it to execute every existing program.
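Returning to the heap restriction in the list above: a hand-written malloc can be as simple as a bump allocator. The Python sketch below models one possible design over EVM-style linear memory. It is not something EVM-LLVM ships, all names are hypothetical, and the choice of where the heap starts is an assumption that a real frontend would have to reconcile with the frame stack described earlier.

```python
# Sketch of a bump allocator over EVM-style linear memory. This is one
# possible design, not something EVM-LLVM ships; all names are hypothetical.

WORD_SIZE = 32


class BumpAllocator:
    """Hands out 32-byte-aligned regions from a growing offset; never frees."""

    def __init__(self, heap_start: int) -> None:
        if heap_start % WORD_SIZE != 0:
            raise ValueError("heap start must be 32-byte aligned")
        self.next_free = heap_start

    def malloc(self, size: int) -> int:
        """Return the start address of a fresh region of at least `size` bytes."""
        if size < 0:
            raise ValueError("size must be non-negative")
        address = self.next_free
        rounded = (size + WORD_SIZE - 1) // WORD_SIZE * WORD_SIZE
        self.next_free += rounded
        return address
```

Never freeing memory is a deliberate simplification: a contract call is short-lived, so a bump pointer is often enough, at the price of paying for every byte of memory the call ever touches.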

What to do next?

So much for creating (porting) a small smart contract language! It is in no way a “complete” language implementation, but it shows the essential components we need to generate EVM smart contracts using EVM-LLVM. With the LLVM infrastructure, we can definitely create a much more sophisticated and useful language that would benefit the blockchain world.

There is definitely a lot we can do to make our tiny compiler better. Here are some ideas:

  • Create a complete pipeline to emit your smart contract source to EVM binary code.
  • Add more sophisticated mechanisms to your toy language, such as the remaining Kaleidoscope features or EVM-specific operations, or deploy a contract onto a supported blockchain, such as Ethereum Classic!
  • Create an ERC20 token implementation in your shiny new language.

About ETC Core

ETC Core is a leading Ethereum Classic core development team. We deliver infrastructure tooling, specifications, and resources to the Ethereum Classic ecosystem. We strongly believe in high-quality software, readability, and cross-chain compatibility. We maintain the Core-Geth client and actively participate in protocol research, upgrades, and events. We maintain the EVM-LLVM backend project and are committed to maximizing EVM capabilities and innovating smart contract development. Check out our projects: https://etccore.io/projects
