Writing a simple Code Generator

Gabriel Sassone
11 min read · Jun 29, 2019


UI built with ImGui and SDL, using the code generated in this article.

Motivation

Following my previous article about Flatbuffers and data reflection, the quest for Data-Driven Rendering continues!

In this article I want to show how to write a very simple code generator to help you automate the writing of code in any language.

The code is here:

https://github.com/JorenJoestar/DataDrivenRendering

There is a balance that constantly needs to be found between code and data, and having a code generator, in my opinion, helps tremendously in focusing on the code that actually needs to be written.

From a data perspective, the ‘baking’ pipeline normally takes DCC formats as source and transforms them into very project-specific, optimized data.

Code-wise, depending on the engine/technology you are using, ‘baking’ of code is less common.
In a time in which iteration time has become almost more important than the tech itself, playing with this balance can be the key to any successful software. It may sound exaggerated, but I really believe that.

As always, ImGui and SDL will be our sword and shield for this adventure.

This will be the second step into data-driven rendering: code generation.

Are we writing a compiler?

Short answer: yes!

Long answer: we will be writing the simplest possible compiler, one that reads a source file and transforms it into a destination file, much like Flatbuffers does.

There are a few links on both theory and practice that can help shed some light on the subject:

The “Dragon Book” (so called because of the dragon on its cover) is still THE go-to reference on compiler writing, as far as I know.
It is an intense book that explores writing a full compiler in depth, starting from automata theory (which just reminds me how everything you study can turn out to be useful: I took two exams at university on it, wondering when I would ever use it! Hello prof. Di Battista!) all the way to full code examples:

https://www.amazon.com/Compilers-Principles-Techniques-Tools-2nd/dp/0321486811

This is, for me, the best website on the subject: very precise and readable, and it follows closely what is inside the Dragon Book:

www.craftinginterpreters.com

My interest was rekindled in 2015, when I was following the amazing Casey Muratori and his Handmade Hero.
He generates code for introspection purposes, and really shows a simple and effective way of generating code that works for you.

Wikipedia also contains a lot of good articles on the subject. The more you know about it, the more you want to know.
It is fascinating and very, very deep!

Compiler 101

A real compiler is a very complex and fascinating piece of software, so I will take the simplest possible approach here, giving my (flawed and incomplete) perspective.

A compiler is a series of transformations applied to data (a definition you could actually apply to every piece of software…).
The input data is text, and normally the output is still text, but with a very different meaning.

The sheer depth of the subject is astonishing: consider that we are defining a grammar, and thus a language, and deciding how to express concepts in it.

The main steps are the following:

1. Lexer/scanner/tokenizer
2. Parser
3. Code generation

We will define a code generator that translates a custom language called HDF (Hydra Definition Format) into C++.
In this exercise HDF will be a subset of Flatbuffers, but once the concepts are clear it can be expanded much further.

Lexer/Scanner/Tokenizer

A lexer, scanner or tokenizer is a piece of software that translates an input string into a list of Tokens based on Lexemes.
A Lexeme is one or more characters that form a Token. Think of a keyword (like ‘if’, ‘class’, ‘static’…).
A Token is identified by a unique Lexeme and abstracts the Lexeme itself.
It normally contains a type and some attributes; for example, it can store where the lexeme appears in the input text and on which line. The exact structure of a token can vary a bit.
In trying to find a simple definition for this step:

The act of Tokenizing is the act of abstracting the input text.

For example, given the following input text:

static void amazing_method() {};

It will generate the list of tokens ‘keyword, identifier, identifier, open parenthesis, close parenthesis, open brace, close brace, semicolon’.

This IS abstracting the text!

Normally a lexer/scanner is used by the parser to go through the code, retrieving one token at a time and using it in some way. Let’s see what a lexer could look like!

Code

The main piece of code is how to recognize a Token.
There can be different kinds of Tokens, and you can include keywords here or classify them as identifiers and let the parser sort them out.

It is already a fair amount of code, but this is possibly all you need from a lexer/tokenizer.
The main method is ‘nextToken’, which skips whitespace and retrieves the next token.
Everything else is used to classify a token, or to expect a specific one when checking syntax correctness (done by the parser).
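
To make this concrete, here is a minimal sketch of what the Token and Lexer could look like. The ‘nextToken’ name follows the description above; every other name and field is my simplification, and the real code in the repository is more complete:

#include <cctype>
#include <cstddef>

// Minimal lexer sketch. Numbers and keywords are omitted for brevity.
struct Token {
    enum Type {
        Identifier, OpenBrace, CloseBrace, OpenParen, CloseParen,
        Semicolon, EndOfStream, Unknown
    };

    Type        type;
    const char* text;    // points into the input source: no allocation
    size_t      length;
    int         line;
};

struct Lexer {
    const char* position;  // current position in the input text
    int         line;

    void nextToken( Token& token ) {
        skipWhitespace();

        token.text   = position;
        token.length = 1;
        token.line   = line;

        char c = *position;
        switch ( c ) {
            case 0:   token.type = Token::EndOfStream; return;
            case '{': token.type = Token::OpenBrace;   break;
            case '}': token.type = Token::CloseBrace;  break;
            case '(': token.type = Token::OpenParen;   break;
            case ')': token.type = Token::CloseParen;  break;
            case ';': token.type = Token::Semicolon;   break;
            default:
                if ( isalpha( c ) ) {
                    // identifier → alphabet (alphabet | number | "_")*
                    ++position;
                    while ( isalnum( *position ) || *position == '_' )
                        ++position;
                    token.type   = Token::Identifier;
                    token.length = position - token.text;
                    return;
                }
                token.type = Token::Unknown;
                break;
        }
        ++position;
    }

    void skipWhitespace() {
        while ( isspace( *position ) ) {
            if ( *position == '\n' )
                ++line;
            ++position;
        }
    }
};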

Once we look at the parser, it will be easier to understand how the lexer is used.

With the lexer/tokenizer ready, we can proceed to the parsing stage.

Parser

So far we have abstracted the input text into a list of Tokens; now we need to generate some more information before we can generate new code.

As far as I understand it, a parser reads the tokens and generates an Abstract Syntax Tree.
Sometimes, in simpler parsers, the act of parsing can generate new code directly, if the language we are targeting is simple enough.

I prefer to always separate Parsing from Code-Generation just for mental clarity!

Given a list of tokens and a grammar, a parser generates an Abstract Syntax Tree.
It gives meaning to the input text, and is responsible for checking syntactic correctness.

A simple definition for a grammar is the following:

A grammar is a set of production rules that transforms a series of non-terminals into terminals.

Putting everything in the perspective of data and transformations we can define:

  • Terminals are finalized data
  • Non-terminals are data that must be transformed
  • Production rules are transformations of non-terminals to terminals

Another definition of a parser could then be:

A parser is a piece of software that transforms non-terminals into terminals following production rules.

Grammar

It is time to write the formal grammar (a context-free grammar) and see how it maps to code.
It will be very simple, much simpler than many examples you can find around, but it is a starting point. We will not deal with expressions, statements and such; they are not needed in the context of this code generator. I will point out some examples of more complex constructs, but I want to study the subject more before writing about it with precision.

Each line is a production rule (a transformation), with the left side always being a non-terminal. We are using regular-expression syntax here:

alphabet → [a-zA-Z]

number → [0-9]

identifier → alphabet (alphabet | number | "_")*

variable_declaration → identifier identifier ";"

struct_declaration → "struct" identifier "{" (variable_declaration)+ "}" ";"

enum_declaration → "enum" identifier "{" (identifier)+ "}"

module → (struct_declaration | enum_declaration)+

First we define what an identifier is: a sequence of alphanumeric characters that can also contain the underscore character. Notice that, with this production rule, an identifier cannot start with an underscore.

A variable is then declared simply by two identifiers, the first for the type and the second for the name, followed by a semicolon.

A struct is simply a named list of variable declarations. Notice the “+” in the rule: it means that at least one element must be present.

Enums are literally a name for the enum and a list of identifiers in curly braces.

Finally the module is the root of our grammar. It will contain all the declarations we describe. See it as the data file we are writing to generate the code — one file is one module.

Now that we defined a simple grammar, we can move to the theory behind the parser.

Predictive Recursive Descent Parser

The grammar we defined is a context-free grammar. Depending on the type of grammar, we can write different kinds of parsers.

One of the most common types of parser (and one of the easiest to start with) is the Predictive Recursive Descent Parser, and that is what we will write for our grammar. You can dive into all the details of writing a context-free grammar, writing a Left-to-right Leftmost-derivation (LL(k)) grammar and so on, and be amazed by all the concepts behind it.
Again, I am personally just starting on this subject, so my knowledge is not deep.

Back to the parser; its main characteristics are:

  1. Descent = top-down. Start from the root and generate the Abstract Syntax Tree from there.
  2. Recursive = the parser has mutually recursive methods, one for each non-terminal.
  3. Predictive = the next token alone tells us which rule applies, so for our simple grammar no backtracking is needed.

So the parser will start from the root (the module non-terminal) and, by sequentially reading all the tokens, will generate a tree that represents our syntax.
Let’s see some code!

Central here is the definition of the class ‘Type’, which I use to save structs, enums and ‘commands’: a custom keyword, shown later, that generates code I use when writing commands.

The method ‘generateAST’ is the entry point, and the parser will use the lexer to scroll through the tokens and generate the AST.

The different methods map straight to the grammar: declarationEnum, declarationStruct, identifier, etc.
The identifier method is used to check whether an identifier token is a keyword of our language or not.
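
As a reference, here is a hypothetical sketch of the parser, following the grammar and the method names mentioned above. The actual repository code differs in its details (String Refs, error reporting, the ‘command’ keyword):

#include <cstring>
#include <deque>
#include <vector>

// Sketch of the AST node: a type is a name, a list of names and,
// optionally, a list of member types.
struct Type {
    enum Types { Primitive, Struct, Enum, Command };

    Types                    type;
    Token                    name;
    std::vector<Token>       names;  // enum values or member names
    std::vector<const Type*> types;  // member types, for structs
};

static bool tokenEquals( const Token& t, const char* s ) {
    return strlen( s ) == t.length && strncmp( t.text, s, t.length ) == 0;
}

struct Parser {
    Lexer&           lexer;
    std::deque<Type> types;  // the AST; deque keeps pointers stable

    void generateAST() {
        // module → (struct_declaration | enum_declaration)+
        Token token;
        for ( lexer.nextToken( token ); token.type != Token::EndOfStream;
              lexer.nextToken( token ) ) {
            if ( token.type == Token::Identifier )
                identifier( token );
        }
    }

    void identifier( const Token& token ) {
        // Check whether the identifier is a keyword of our language.
        if ( tokenEquals( token, "enum" ) )
            declarationEnum();
        else if ( tokenEquals( token, "struct" ) )
            declarationStruct();
    }

    void declarationEnum() {
        // enum_declaration → "enum" identifier "{" (identifier)+ "}"
        Type type;
        type.type = Type::Enum;
        lexer.nextToken( type.name );

        Token token;
        lexer.nextToken( token );           // expect "{"
        for ( lexer.nextToken( token );
              token.type != Token::CloseBrace && token.type != Token::EndOfStream;
              lexer.nextToken( token ) ) {
            if ( token.type == Token::Identifier )
                type.names.push_back( token );  // separators are skipped
        }
        types.push_back( type );
    }

    void declarationStruct() {
        // struct_declaration → "struct" identifier "{" (variable_declaration)+ "}" ";"
        Type type;
        type.type = Type::Struct;
        lexer.nextToken( type.name );

        Token token;
        lexer.nextToken( token );           // expect "{"
        // variable_declaration → identifier identifier ";"
        for ( lexer.nextToken( token ); token.type == Token::Identifier; ) {
            type.types.push_back( findType( token ) );  // member type
            lexer.nextToken( token );
            type.names.push_back( token );              // member name
            lexer.nextToken( token );                   // expect ";"
            lexer.nextToken( token );                   // next member or "}"
        }
        lexer.nextToken( token );           // expect trailing ";"
        types.push_back( type );
    }

    const Type* findType( const Token& name ) {
        // Search the member type among the registered ones.
        for ( const Type& t : types )
            if ( t.name.length == name.length &&
                 strncmp( t.name.text, name.text, name.length ) == 0 )
                return &t;
        return nullptr;  // unknown type
    }
};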

The last piece of the puzzle is the Abstract Syntax Tree.

Abstract Syntax Tree

We chose to have simple data definitions, and I decided that the nodes of the tree will be types.
A type can be a primitive type, a container of variables (like a struct in C, but without methods), an enum or a command.

Commands are just a way of showing the creation of a construct that I use and that requires some boilerplate code I don’t want to write by hand.

If we remember the definition of the class Type from the code before, it all boils down to a name, a list of names and, optionally, a list of types.
With this simple definition I can express primitive types, structs and enums all in one!

For enums, I save the name of the enum and, in the name list, all the different values. That is enough to later generate the code.

For structs, again the name is saved, and then the variables. A variable is a tuple of identifiers ‘type, name’. When parsing a variable, its type is searched among the registered ones.
A trick here is to initialize the parser with the primitive types, and then register each new type (both structs and enums) as it is parsed.
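
A sketch of that trick, assuming the Parser from the sketch above (the list of primitive names mirrors the HDF types used in the examples):

#include <cstring>

// Register the primitive types up front, so that struct members can
// resolve their type by name while parsing.
static void initPrimitiveTypes( Parser& parser ) {
    static const char* s_primitives[] = {
        "int8", "uint8", "int16", "uint16", "int32", "uint32",
        "int64", "uint64", "float", "double", "bool"
    };

    for ( const char* name : s_primitives ) {
        Type type;
        type.type = Type::Primitive;
        type.name = { Token::Identifier, name, strlen( name ), 0 };
        parser.types.push_back( type );
    }
}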

As a bonus, a command is a name and a series of structs.

This is an example:

command WindowEvents {
    Click {
        int16 x;
        int16 y;
        int16 button;
    }
    Move {
        int16 x;
        int16 y;
    }
    Wheel {
        int16 z;
    }
};

Code Generation

The last stage generates the files in the language that we want, using the information from the AST.
This part will literally write the code for us: the whole purpose of this exercise.

The most fundamental question is: “what code do I want to generate?”.

A simple but deep question.
We are trying to remove the writing of boilerplate code from our lives, so anything that you consider boilerplate and easy to automate goes here.

Even though everything up to this point has been written in C++, the final output can be in any language.
This means that you can define data once and translate it into multiple languages!

For our example, we will output C++ code and add UI using ImGui, similar to the Flatbuffers example I wrote before.

Let’s see the three different constructs we can output with our language.

Enum

We defined an enum as a name and a list of named values.
For the simplicity of this example we are not assigning manual values to the enum entries, but that is easily changeable, and I will do it in the future.

Given the enum in HDF:

enum BlendOperation : byte { Add, Subtract, RevSubtract, Min, Max }

Which code do we want to generate?

When I write enums, I almost always need the stringified version of the values. I also want to add a last value, Count, so that I can use it whenever I need to allocate anything based on the enum.
As a bonus, I can create a second enum with the bit shifts, called mask, for some use cases.
All of this will be done automatically by the code generator, starting from a simple enum!
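
To give an idea of the target, the generated C++ for BlendOperation could look like the following (a sketch: the exact layout emitted by the actual generator may differ slightly):

namespace BlendOperation {
    enum Enum {
        Add, Subtract, RevSubtract, Min, Max, Count
    };

    enum Mask {
        Add_mask = 1 << 0, Subtract_mask = 1 << 1, RevSubtract_mask = 1 << 2,
        Min_mask = 1 << 3, Max_mask = 1 << 4, Count_mask = 1 << 5
    };

    static const char* s_value_names[] = {
        "Add", "Subtract", "RevSubtract", "Min", "Max", "Count"
    };

    static const char* ToString( Enum e ) {
        return s_value_names[ (int)e ];
    }
} // namespace BlendOperation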

To generate this I use three different streams for the different parts of the enum (the enum itself, the value names and the mask) and combine them into the final generated file.
Also note that the strings here are ‘String Refs’: basically strings that point into the input source code and store their length, so that no new allocation is needed.
I use a temporary buffer to null-terminate them before writing into the output file.
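
A sketch of that generation step, using std::string in place of the repository’s String Ref and stream buffers (all the names here are my own):

#include <cstdio>
#include <string>

// Build the value list, the name array and the mask enum in three
// separate streams, then combine them into the output file.
static void generateEnum( const Type& type, FILE* output ) {
    std::string values, names, masks;

    for ( size_t i = 0; i < type.names.size(); ++i ) {
        const Token& t = type.names[ i ];
        std::string v( t.text, t.length );  // null-terminated copy

        values += "    " + v + ",\n";
        names  += "    \"" + v + "\",\n";
        masks  += "    " + v + "_mask = 1 << " + std::to_string( i ) + ",\n";
    }

    std::string name( type.name.text, type.name.length );
    fprintf( output,
             "namespace %s {\n"
             "enum Enum {\n%s    Count\n};\n\n"
             "enum Mask {\n%s    Count_mask = 1 << %zu\n};\n\n"
             "static const char* s_value_names[] = {\n%s    \"Count\"\n};\n\n"
             "static const char* ToString( Enum e ) {\n"
             "    return s_value_names[ (int)e ];\n}\n"
             "} // namespace %s\n",
             name.c_str(), values.c_str(), masks.c_str(), type.names.size(),
             names.c_str(), name.c_str() );
}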

Already here you can see how powerful this approach can be.

Struct

Structs are the bread-and-butter of data definition.
In this simple example we do not handle pointers or references, so it is pretty straightforward; but as a start in code generation, this can already be powerful in many cases.

Let’s start with a definition for our dream Data-Driven-Rendering:

// file.hdf
struct RenderTarget {
    uint16 width;
    uint16 height;
    float scale_x;
    float scale_y;
    TextureFormat format;
};

struct RenderPass {
    RenderTarget rt0;
};

We want to generate both a ready-to-use C++ header and UI code using ImGui.
The output for a struct is obtained by simply iterating through all its members and, based on the type of each member, writing some code.
For primitive types a translation to the C++ language must be done, thus we store a list of C++ primitive-type keywords in the code.

For the UI area we define two methods: reflectMembers, which simply adds the needed ImGui commands, and reflectUI, which embeds the members into a window. This is done so that, starting from a root type, I can create a window that lets me edit its values and recursively adds the UI of any member coming from another struct.
This is shown with the RenderPass struct.

Let’s see the code!
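
A sketch of the generated header for RenderTarget and RenderPass could look like this; the specific ImGui widgets are my assumption, and TextureFormat is assumed to be an enum generated as in the previous section:

#include <cstdint>
#include "imgui.h"

struct RenderTarget {
    uint16_t width;
    uint16_t height;
    float scale_x;
    float scale_y;
    TextureFormat::Enum format;

    void reflectMembers() {
        // One ImGui command per member, chosen from the member's type.
        ImGui::InputScalar( "width", ImGuiDataType_U16, &width );
        ImGui::InputScalar( "height", ImGuiDataType_U16, &height );
        ImGui::InputScalar( "scale_x", ImGuiDataType_Float, &scale_x );
        ImGui::InputScalar( "scale_y", ImGuiDataType_Float, &scale_y );
        ImGui::Combo( "format", (int*)&format,
                      TextureFormat::s_value_names, TextureFormat::Count );
    }

    void reflectUI() {
        ImGui::Begin( "RenderTarget" );
        reflectMembers();
        ImGui::End();
    }
};

struct RenderPass {
    RenderTarget rt0;

    void reflectMembers() {
        rt0.reflectMembers();  // recursively add the UI of struct members
    }

    void reflectUI() {
        ImGui::Begin( "RenderPass" );
        reflectMembers();
        ImGui::End();
    }
};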

Here we also generate the code for the UI, directly inside the generated header!

Command

I wanted to include an example of something that does not exist in any language, to show the power of removing boilerplate code.

I define commands as little structs with a type, used any time I need to do some command parsing, normally from a ring buffer.

The generated command should already come with an enum of all the types, and each struct should have its type assigned.
The type is normally used to cycle through the commands and act on each one accordingly.

The generator outputs plain structs, because they need to be allocated in the ring buffer and thus must be simple.
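
Putting it all together, the code generated for the WindowEvents example from before could look like this sketch (the naming of the Type enum and of GetType is my guess):

#include <cstdint>

namespace WindowEvents {
    // One enum with all the command types...
    enum Type {
        Type_Click, Type_Move, Type_Wheel
    };

    // ...and each plain struct knows its own type.
    struct Click {
        int16_t x;
        int16_t y;
        int16_t button;

        static Type GetType() { return Type_Click; }
    };

    struct Move {
        int16_t x;
        int16_t y;

        static Type GetType() { return Type_Move; }
    };

    struct Wheel {
        int16_t z;

        static Type GetType() { return Type_Wheel; }
    };
} // namespace WindowEvents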

Conclusions

We learnt how to write a complete code generator: an incredible tool that, used correctly, can speed up development and remove as much boilerplate code as possible.

The usage of the command keyword is an example of something I use but don’t want to write by hand; something custom enough that hopefully it will give you more ideas on how to break free from the constrictions of existing languages when you write… your own language!

In the quest for data-driven rendering, the next step will be to use this code-generation knowledge to create a shader effect language that can generate both CPU and GPU code for you.

This article is the longest and most code-heavy I have ever written. There are many concepts here that I am only beginning to be familiar with, and am still not so used to.

So please comment, give feedback, share!
Thank you for reading!
