Where does AVM come from — process description

robbie wang
Published in NewEconoLabs
Oct 25, 2019

There are two ways to generate AVM:

1. Use an assembler helper class to build the AVM directly in code from assembly instructions.

2. Use a compiler to produce AVM from a high-level language.

Assembler

First, I need to talk about compilation.

This involves several concepts: assembly language, machine language, and the assembler.

Interestingly, you don’t need to learn assembly in depth for this purpose.

NOP

PUSH 1

PUSH 2

ADD

RET

Remember the code in the previous article?

This is the kind of thing that we can call assembly language.

The input that NEOVM takes is the machine language of the machine that NEOVM simulates, in byte[] format.

So the five instructions above are assembly language that can be turned into machine language: a byte[] that NEOVM recognizes.

The tool that does this is called the assembler.

Let’s make an assembler.

ScriptBuilder.cs in NEOVM does most of the work of the assembler, except for linking.

Linking is more complicated, and it is also a major part of an assembler’s job. We need to learn more about virtual machines such as NEOVM before we can continue that discussion. For now, let us focus on turning the five assembly instructions into a byte[].

There used to be an official assembler project in NEO, neoa, which has fallen into disrepair: https://github.com/neo-project/neo-compiler/tree/master/neoa

The assembler is also a prerequisite for studying the compiler, so I may maintain a new assembler project.

Call ScriptBuilder to generate NEO machine code (AVM)

Let’s get to the code directly. This program is located in samples/neovm01

Note that the sample references the Neo3.0 NeoVM, and this series of articles targets NeoVM3.0 only. Don’t worry: in practice, NeoVM3.0 is not so different from the previous version.

Reference Neo.VM from NuGet.

Then use ScriptBuilder to do the assembler’s work directly, and we get

machinecode=0x6151529366
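The mapping from mnemonics to bytes can be sketched in a few lines. This is a toy Python stand-in, not the Neo.VM ScriptBuilder API: the opcode values are read off the machine code above (one byte per instruction), and the `assemble` helper is a hypothetical name used only for illustration.

```python
# Opcode table inferred from machinecode=0x6151529366 (one byte per instruction).
OPCODES = {
    "NOP": 0x61,
    "PUSH 1": 0x51,
    "PUSH 2": 0x52,
    "ADD": 0x93,
    "RET": 0x66,
}


def assemble(lines):
    """Translate assembly lines into machine code bytes (hypothetical helper)."""
    return bytes(OPCODES[line.strip()] for line in lines)


program = ["NOP", "PUSH 1", "PUSH 2", "ADD", "RET"]
machinecode = assemble(program)
print("machinecode=0x" + machinecode.hex())
```

Each mnemonic becomes exactly one byte here, which is why five instructions yield five bytes.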

Then we have NEOVM execute it

and we get retvalue=3
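To see why those five bytes produce 3, here is a minimal stack-machine sketch in Python. It is not NEOVM: it only understands the five opcodes used above, with byte values taken from the machine code in this article, and the `run` function is a hypothetical name.

```python
# Byte values taken from machinecode=0x6151529366.
NOP, PUSH1, PUSH2, ADD, RET = 0x61, 0x51, 0x52, 0x93, 0x66


def run(code):
    """Execute the byte string on a tiny evaluation stack (toy model only)."""
    stack = []
    for op in code:
        if op == NOP:
            continue                      # do nothing
        elif op in (PUSH1, PUSH2):
            stack.append(op - 0x50)       # 0x51 pushes 1, 0x52 pushes 2
        elif op == ADD:
            right = stack.pop()
            left = stack.pop()
            stack.append(left + right)
        elif op == RET:
            break                         # stop; the result is on top of the stack
        else:
            raise ValueError(f"unsupported opcode 0x{op:02x}")
    return stack.pop()


retvalue = run(bytes.fromhex("6151529366"))
print(f"retvalue={retvalue}")
```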

OK, now we know that the .avm machine language is produced from assembly language by the assembler. We haven’t talked about linking; that problem is more complicated, and we will discuss it in the future.

Where does the assembly language come from?

This raises a problem: you can’t always write the assembly by hand. This is where the concept of a compiler comes in.

We need a tool to translate

“1+2”

into

NOP

PUSH 1

PUSH 2

ADD

RET

This is the job of the compiler.

OK, we’re going to focus on how to compile addition expressions into assembly language automatically.

Generate NEO machine code (AVM) with the compiler

Let’s get to the code directly. This program is located in samples/neovm02

There is an extremely simple compiler here.

It can only compile the addition of positive integers, such as

“1+2+4+5”

The source code of this simple compiler is divided into two parts. The first part parses the source code into an abstract syntax tree (AST); this is the ParseSynatxNode function.

Then we get the abstract syntax tree of the expression “1+2+4+5”

The next step is to turn the abstract syntax tree into the code we actually want to execute.

This is also very simple: call the assembler while traversing the syntax tree depth-first, and we get the machine code.

Then I execute this code with NEOVM and get the result 12.

Process analysis

Now that all the code is here, let’s analyze it.

There are several stages along the way; together they are often referred to simply as the compiler.

Word Segmentation -> Create Abstract Syntax Tree -> Convert to Assembly Code -> Convert to Machine Code

1. Word segmentation

Word segmentation is the first job of the compiler.

For “1+2323+4”, the compiler can’t analyze everything one byte at a time. First, it splits the string into words: “1”, “+”, “2323”, “+”, “4”.

Because our test compiler is very simple, string.split is enough for this.
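As a rough illustration of this step in Python (not the sample’s own code; `tokenize` is a hypothetical name), splitting on a captured “+” keeps the operator as a token of its own:

```python
import re


def tokenize(source):
    """Split an addition expression into number and "+" tokens."""
    # The capturing group makes re.split keep each "+" in the result.
    parts = re.split(r"(\+)", source)
    return [part.strip() for part in parts if part.strip()]


tokens = tokenize("1+2323+4")
print(tokens)
```

Note that “2323” stays together as one word, which is the point of segmenting before parsing.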

2. Establish an abstract syntax tree

Next comes parsing. The most common way to organize the result is an abstract syntax tree.

“1+2+4+5” is organized into a tree

The top node is an addition node, the left value is “1+2+4”, and the right value is 5

“1+2+4” is split into an addition node, the left value is “1+2”, and the right value is 4

“1+2” is split into an addition node, the left value is 1, and the right value is 2
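The left-nested tree described above can be sketched as follows. This is an illustrative Python version under my own assumptions, not the sample’s ParseSynatxNode: the `Num` and `Add` node classes and the `parse` function are hypothetical names, and only “+” between non-negative integers is handled.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Num:
    value: int


@dataclass
class Add:
    left: "Node"
    right: "Node"


Node = Union[Num, Add]


def parse(source):
    """Build a left-nested tree: ((1+2)+4)+5 for the input "1+2+4+5"."""
    operands = [Num(int(text)) for text in source.split("+")]
    tree = operands[0]
    for operand in operands[1:]:
        # Everything parsed so far becomes the left value of a new addition node.
        tree = Add(tree, operand)
    return tree


tree = parse("1+2+4+5")
print(tree)
```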

Some scripting-language compilers stop here: once the abstract syntax tree is built, it can be interpreted and executed directly.

For example, the prefix-expression technique commonly used to evaluate four-function arithmetic from a string works this way. A prefix expression is in effect an AST, and it is evaluated by a depth-first traversal of the tree nodes.
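Interpreting the tree directly can be sketched like this. It is a minimal Python illustration, assuming a tree written as nested tuples `("+", left, right)` with plain integers as leaves; that representation and the `evaluate` name are my own choices for the example.

```python
def evaluate(node):
    """Depth-first evaluation: evaluate both children, then combine them."""
    if isinstance(node, int):
        return node                       # a leaf is just a number
    op, left, right = node
    if op != "+":
        raise ValueError(f"unsupported operator {op!r}")
    return evaluate(left) + evaluate(right)


# The tree for "1+2+4+5": ((1+2)+4)+5
tree = ("+", ("+", ("+", 1, 2), 4), 5)
print(evaluate(tree))
```

No assembly or machine code is produced here; the tree itself is the program.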

3. Convert to assembly code

Let’s look at the same tree again.

We traverse the tree depth-first; the deepest nodes are 1 and 2

PUSH 1

PUSH 2

Then the addition one level up

ADD

Then 4, at the same level

PUSH 4

The addition of the upper layer

ADD

PUSH 5

The addition of the upper layer

ADD

Organize them

PUSH 1

PUSH 2

ADD

PUSH 4

ADD

PUSH 5

ADD

Compare this with the EmitCode code.

There, we use the assembler directly to turn PUSH and ADD into machine code.

But what if we saved the instructions first?

We would get

PUSH 1

PUSH 2

ADD

PUSH 4

ADD

PUSH 5

ADD
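Saving the instructions instead of emitting bytes right away can be sketched as a post-order walk that appends text to a list. This is an illustrative Python version, not the sample’s EmitCode; the nested-tuple tree and the `emit` name are assumptions made for the example.

```python
def emit(node, out):
    """Post-order walk: emit both operands first, then the ADD that joins them."""
    if isinstance(node, int):
        out.append(f"PUSH {node}")
        return
    _, left, right = node
    emit(left, out)
    emit(right, out)
    out.append("ADD")


# The tree for "1+2+4+5": ((1+2)+4)+5
tree = ("+", ("+", ("+", 1, 2), 4), 5)
instructions = []
emit(tree, instructions)
print("\n".join(instructions))
```

Because the left subtree is always visited first, the deepest pair (1 and 2) is pushed before anything else, matching the order worked out by hand above.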

4. Convert to machine code

Strictly speaking, the output of step 3 should be assembly:

PUSH 1

PUSH 2

ADD

PUSH 4

ADD

PUSH 5

ADD

The assembler then turns it into AVM.

But at the time of writing, our independent assembler project has not been completed, so we only introduced the simple assembler helper, ScriptBuilder.

Usually, the compiler’s own assembler is not called an assembler but a linker, because it is mainly responsible for address translation. Readers who are familiar with C++ will know that C++ compilation is clearly divided into two stages: compile and link.

This article mainly explains how AVM is generated and does not go into the details of address translation.
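This last step can be sketched as a small text-to-bytes assembler. It is a toy Python version under stated assumptions, not the planned assembler project or ScriptBuilder: I assume PUSH n for n from 1 to 16 is encoded as the single byte 0x50 + n (consistent with PUSH 1 = 0x51 and PUSH 2 = 0x52 in the machine code earlier in this article) and ADD as 0x93. Larger numbers need a different encoding that this sketch does not cover.

```python
def assemble(lines):
    """Turn "PUSH n" / "ADD" lines into bytes (toy assembler, assumptions above)."""
    code = bytearray()
    for line in lines:
        parts = line.split()
        if parts[0] == "PUSH":
            n = int(parts[1])
            if not 1 <= n <= 16:
                raise ValueError("this sketch only handles PUSH 1..16")
            code.append(0x50 + n)         # assumed single-byte encoding
        elif parts[0] == "ADD":
            code.append(0x93)
        else:
            raise ValueError(f"unknown instruction {line!r}")
    return bytes(code)


assembly = ["PUSH 1", "PUSH 2", "ADD", "PUSH 4", "ADD", "PUSH 5", "ADD"]
avm = assemble(assembly)
print(avm.hex())
```

Keeping the assembly as text between steps 3 and 4 is what makes it possible to inspect or rewrite the instructions before they become bytes.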
