ChatGPT Helped Me Build a Compiler

13 min readApr 24, 2023

Two things have been on my mind lately:

How compilers work
How to use ChatGPT for programming projects

I decided to tackle two bird with one stone: leverage ChatGPT to help me build a compiler.

I’d built a basic compiler a few years ago for the Nand2Tetris course. But I was rusty on the concepts. What better way to review than trying to build a compiler again. Plus, I could lead on ChatGPT to guide me and (hopefully) help when I get stuck.

This article is my attempt to learn in public. Hopefully this will inspire you to do the same.

Structure of this article: I’ll first describe the learning goals I established before starting this project. Then I’ll walk through the steps to builder the compiler. I’ve highlighted my interactions with ChatGPT. Finally, I reflect on the experience and talk about what I’d do next.

If you’d like to skip to the final result, see the GitHub repo: https://github.com/byanofsky/simple-compiler

Learning Goals

My primary goal was to learn how compilers work by building a compiler.

Some secondary goals:

Understand how ChatGPT can be utilized as a coding partner
Gain some additional experience with C programming language
Utilize some compiler related tool I hadn’t used before

Next, I’ll walk you through highlights of the process.

Getting Started

I decided it’d be best to build a compiler for a simple programming language. This way, I could concentrate on building the end-to-end compiler instead of getting stuck building out features.

The simplest language I could think of would consist of the following:

integer data type only
values can be stored in variables
addition and subtraction are the only arithmetic operators

Thus, I started with my first prompt to ChatGPT:

Help me build a compiler for a simple programming language. To start off, the language will only have integers as the data type. The only operations are addition and subtraction.

ChatGPT responded with 6 steps required to build a compiler. I’ve summarized the response:

Define a syntax. Example <variable_name> = <expression> ;
Write a lexer. The lexer breaks the source code into tokens. Tokens would include integers, variables, addition/subtraction operators.
Write a parser. The parser reads tokens produced by the lexer and builds a syntax tree. Each node in the tree represents a construct in the language, such as an assignment statement or expression.
Write a code generator. The code generator traverses the syntax tree and generates code in some target language.
Write a driver program. The driver program reads the source code, runs the lexer and parser, generates the target code, and then executes the target code.
Test your compiler. Ensure that the compiler generates the correct output for different inputs.

The next few sections address each step.

Step 1: Define a syntax

I asked ChatGPT to help me define the syntax for my language. ChatGPT produced the following grammar (which I believe is BNF notation):

<program> ::= <statement_list>
<statement_list> ::= <statement> | <statement> <statement_list>
<statement> ::= <variable> '=' <expression> ';'
<expression> ::= <term> | <term> <add_op> <expression>
<term> ::= <integer> | <variable> | '(' <expression> ')'
<add_op> ::= '+' | '-'
<integer> ::= <digit> | <integer> <digit>
<variable> ::= <letter> | <variable> <letter> | <variable> <digit>
<digit> ::= '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'
<letter> ::= 'a' | 'b' | ... | 'z' | 'A' | 'B' | ... | 'Z'

I reviewed and it seemed to align with my expectations. However, I realized such a language wouldn’t be useful without any mechansim for printing values. I asked ChatGPT to update the grammar:

I’d like to be able to print results of add operations. Can we add a print operation?

ChatGPT modified the <statement> entry as such:

<statement> ::= <variable> '=' <expression> ';' | 'print' '(' <expression> ')' ';'

This looked good and I was ready for the next step.

Step 3: Write a lexer

I asked ChatGPT to produce a lexer for this language. I specified that the lexer should be written in C. The below code block shows sections of the produced code (see full code in gist):

enum TokenType 
{
  TOKEN_INTEGER,
  TOKEN_VARIABLE,
  ...
}

struct Token get_next_token()
{
  char c;
  struct Token token;
  token.type = TOKEN_UNKNOWN;
  memset(token.lexeme, '\0', MAX_TOKEN_LENGTH);
  while ((c = getchar()) != EOF)
  {
    if (is_whitespace(c))
    {
      continue;
    }
    else if (is_add_op(c))
    {
      token.type = TOKEN_ADD_OP;
      sprintf(token.lexeme, "%c", c);
      break;
    }
    ...
  }
  return token;
}

Additionally, it provided me this explanation:

This lexer uses getchar() to read characters from standard input and returns a Token structure with the appropriate type and lexeme. It supports the following tokens:
TOKEN_INTEGER: a sequence of digits
TOKEN_VARIABLE: a sequence of letters and digits starting with a letter that is not the keyword “print”
TOKEN_ADD_OP: either “+” or “-”
TOKEN_ASSIGN_OP: “=”
TOKEN_PRINT: the keyword “print”
TOKEN_LPAREN: “(“
TOKEN_RPAREN: “)”
TOKEN_SEMICOLON: “;”
TOKEN_UNKNOWN: any other character that doesn’t match any of the above
To use the lexer, you can call get_next_token() repeatedly until it returns a TOKEN_END token,

The provided code mostly worked as is, with only slight modifications required. You can see the final code for the lexer here: https://github.com/byanofsky/simple-compiler/blob/main/lexer-v1.c

Step 3: Write a parser

With the lexer completed, I was ready for the next step. So I asked ChatGPT. Here is the response:

After building a lexer, the next step is to build a parser that can take the stream of tokens produced by the lexer and generate an abstract syntax tree (AST) that represents the structure of the program. The parser reads the stream of tokens and checks whether they conform to the syntax rules of the language. If the input program is syntactically correct, the parser generates an AST that captures the structure of the program in a more structured way than the stream of tokens.
There are many ways to implement a parser, but one common technique is to use a parsing algorithm called a recursive descent parser. A recursive descent parser is a top-down parser that works by recursively descending through the grammar rules of the language, generating nodes in the AST as it goes. Each non-terminal symbol in the grammar is associated with a parsing function that implements the corresponding rule. The parsing functions use the current token and lookahead tokens to determine which rule to apply and how to construct the AST.

Here‘s the example code it provided for a parser function:

// Parse an expression and return its AST node.
ASTNode* parse_expression() {
  ASTNode* node = parse_term();

  while (current_token.type == TOKEN_ADD_OP || current_token.type == TOKEN_SUB_OP) {
    Token op_token = current_token;
    eat(op_token.type);
    ASTNode* rhs = parse_term();
    if (op_token.type == TOKEN_ADD_OP) {
      node = create_add_node(node, rhs);
    } else {
      node = create_sub_node(node, rhs);
    }
  }
  return node;
}

At this point, I decided I wanted to do some of the coding myself. So I wrote the required parsing functions for each token type.

Now, I needed some way to verify that both the parser and lexer worked as expected. I decided I could write a small driver program that would receive input, call the parse_program function to produce the AST, and then print the resulting AST.

For example:

./parser-test < ./sample-program.txt

PROGRAM
  STATEMENT_LIST
    STATEMENT
      ASSIGNMENT_STATEMENT
        ...
    STATEMENT
      PRINT_STATEMENT

I needed a function to traverse the AST and print each node. Writing such a function is not too complex, but I decided to let ChatGPT handle it for me.

Printing Syntax Tree

Here is the code ChatGPT produced:

#include <stdio.h>
#include <stdlib.h>
#include "ASTNode.h"

void printTree(ASTNode *node, int depth) {
    if(node == NULL) {
        return;
    }
    for(int i = 0; i < depth; i++) {
        printf("\t");
    }
    printf("%d\n", node->type);
    printTree(node->left, depth+1);
    printTree(node->right, depth+1);
}

At first glance, seemed correct. But when I integrated it into the parse-test program, it printed this result:

I forgot that the NodeType enum is represented as an integer. Thus, I had to produce a function that would return a string representation for each node type.

With that done, I was able to verify that parser and lexer worked.

However, one thing bothered me about the ASTNode struct I was using.

Modifying ASTNode

At first, the ASTNode struct had this shape:

struct ASTNode {
  string type; 
  ASTNode* left_child; 
  ASTNode* right_child; 
};

However, such a shape had limitations. For example, variables and integers had values that needed to be stored with ASTNode.

I had some ideas of how I could extend the struct, I wasn’t sure if they algined with best practices. So I asked ChatGPT for some help. I provided the grammar we’d been working with and asked for a suggestion of an ASTNode struct. Here is what it provided:

struct ASTNode {
  string type; 
  string value; 
  ASTNode* left_child; 
  ASTNode* right_child; 
};

It also gave me an example:

          Program
             |
      StatementList
        /         \
  Statement      Statement
      |              |
  Variable '='  FunctionCall
      |              |
   Integer         Variable
      |              |
     1+2             x

In this example, the type field of the top-level ASTNode would be “Program”. Its left child would be a StatementList node, which would have two children: a Statement node representing the assignment x = 1 + 2; and another Statement node representing the function call print(x);. The Variable node would have a value of “x” and two children: an Integer node with a value of “1+2”, and a FunctionCall node with a value of “print” and a child of Variable with a value of “x”.

This didn’t align with the grammar we’d been working with thus far. For example, this didn’t accurately account for arithmetic expressions. Instead, it stored the expression as a string value on the Integer node. How would this work when we introduced arithmetic expressions using variables?

Thus, I continued working with it.

One suggestion arrived at that I liked was utilizing a union type on a data property:

typedef struct ASTNode
{
  enum NodeType type;
  struct ASTNode *left;
  struct ASTNode *right;
  union
  {
    int int_value;
    char *string_value;
  } data;
} ASTNode;

I wasn’t as familiar with the union type. But it seemed to work here. Perhaps there are ways to improve it, but I decided to continue with this.

Another sticking point I’d encountered with the current ASTNode was how we represented the StatementList:

StatementList
  Statement
  StatementList
    Statement
    StatementList
      ...

This seemed unnecessarily verbose. I was thinking we could use an array of list. So I asked ChatGPT:

Using an array or a linked list to represent the list of statements in the AST node would be a better way to represent the statement_list. This way, the ASTNode will have a pointer to the first element in the list, and each element in the list will have a pointer to the next element.

It provided an implementatin using a linked list. You can see this implementation in the final parser code.

Final ASTNode

Here is the resulting ASTNode:

typedef struct ASTNode
{
  enum NodeType type;
  struct ASTNode *left;
  struct ASTNode *right;
  union
  {
    int int_value;
    char *string_value;
    enum OpType op_type;
    StatementList *statement_list;
  } data;
} ASTNode;

To reiterate, perhaps this is still not optimal, but it works for now.

The final parser code can be viewed here: https://github.com/byanofsky/simple-compiler/blob/main/parser-v1.c

Step 4: Write a code generator

Next step after the parser was code generator.

I asked ChatGPT to produce a code generator using C. Here is portions of the produced code:

void generate_code(ASTNode *node) {
    if (!node) return;
    generate_code(node->left);
    generate_code(node->right);
    switch (node->type) {
        case NODE_TYPE_INTEGER:
            printf("PUSH %d\n", node->data.int_value);
            break;
        case NODE_TYPE_VARIABLE:
            printf("PUSH %s\n", node->data.string_value);
            break;
        ...
        default:
            fprintf(stderr, "Unsupported node type: %d\n", node->type);
            exit(1);
    }
}

It stated that the code it provided “generates assembly code for a hypothetical stack-based machine.”

I didn’t want to generate code for a hypothetical machine. I wanted the code produced by my compiler to run on my machine.

Luckily, ChatGPT included a hint at the end of the mssage:

You can modify the generated assembly code instructions to match the instruction set of your target machine or generate code for another output format, such as LLVM IR or actual machine code.

LLVM had been on my mind ever since I’d listened to Lex Fridman’s interview with Chris Lattner (co-founder of LLVM):

However, I’d never worked with LLVM before. I wasn’t even sure how to get started.

Luckily, I had ChatGPT to help me.

LLVM IR

I asked ChatGPT to modify the program to produce LLVM IR code. The resulting code’s structure remained the same, but the printf statements changed:

case NODE_TYPE_ASSIGNMENT:
            printf("  store i32 %%x%d, i32* @%s\n", var_counter - 1, node->left->data.string_value);
            break;

I decided to test this out. I fed it a sample program: var1 = 5 + 1;. Here is an example of the LLVM IR it emitted:

%x1 = add i32 0, 5
%x2 = add i32 0, 1
%x3 = add i32 %x1, %x2
store i32 %x3, i32* @var1

Even with my limited knowledge of LLVM IR, there seemed to be some performance optimizations available. Most glaringly: why did an add operation of two integers (ie, 5+1 ) require loading both integers into variables first. And why did the loading operations require adding the integers to 0 first?

Instead, it seemed we could just do the following when adding two integers:

%x1 = add i32 5, 1
store i32 %x1, i32* @var1

However, I decided not to dive deep into this rabbit hole at this point.

Variable Table

When I tried to compile the LLVM IR code via clang, I encountered an error:

main.ll:6:24: error: use of undefined value '@testVar'
  %x1 = load i32, i32* @testVar

I copied this error into ChatGPT:

This error occurs because LLVM IR is trying to load a value from a global variable named testVar, but the variable has not been defined. To fix this error, you need to define global variables in your generated LLVM IR code for each variable that appears in your input program.

ChatGPT showed me how to use global variables. I started with that, but then decided to pivot to using a variable lookup table so that variables are not globally defined.

ChatGPT helped me set up the scaffolding for a variable lookup table.

To initialize the variable lookup table, I would iterate through all nodes and find the variable nodes. For each variable node, I would add an entry where the key is the variable name and the value is a unique LLVM IR variable name.

Then, when generating the IR code, we can lookup the appropriate LLVM IR variable name to use.

Here is the code relevant to the variable lookup table: https://github.com/byanofsky/simple-compiler/blob/main/generate_llvm_ir.c#L9-L78

Note: While writing this article, I did uncover a bug. If a variable is assigned a value twice, it will attempt to allocate the variable twice. I assume this will result in an error. The variable lookup table code probably needs additional work

Example LLVM IR code

Here is one test program with the resulting LLVM IR code:

// Test Program
testVar1 = 2 - 3;
testVar2 = 5 + 6;
testVar3 = testVar1 + testVar2;
print(testVar3);

// LLVM IR
target triple = "arm64-apple-macosx12.0.0"
; ModuleID = 'main'
declare void @print_integer(i32)
define i32 @main() {
  %x0 = add i32 0, 2
  %x1 = add i32 0, 3
  %x2 = sub i32 %x0, %x1
  %ptr_3 = alloca i32
  store i32 %x2, i32* %ptr_3
  %x3 = add i32 0, 5
  %x4 = add i32 0, 6
  %x5 = add i32 %x3, %x4
  %ptr_4 = alloca i32
  store i32 %x5, i32* %ptr_4
  %x6 = load i32, i32* %ptr_3
  %x7 = load i32, i32* %ptr_4
  %x8 = add i32 %x6, %x7
  %ptr_5 = alloca i32
  store i32 %x8, i32* %ptr_5
  %x9 = load i32, i32* %ptr_5
  call void @print_integer(i32 %x9)
  ret i32 0
}

Step 5: Write a driver program

At this point in my progress, I had a compiled program called generate_llvm_ir. I could feed it a test program and it would output resulting LLVM IR code. I could compile this resulting code with clang. Then run the resulting executable to ensure expected output.

However, this required multiple steps.

I realized the current problem with my program was that my code generation was printing the LLVM IR code to stdout. Instead, I wondered if I could use some LLVM API instead.

In fact, in an earlier message, ChatGPT hinted at this. When it produced the generate_llvm_ir code, it included a message:

This example demonstrates generating textual LLVM IR, but you can also use the LLVM C++ API or the LLVM C API (https://llvm.org/doxygen/group__LLVMC.html) to generate IR programmatically. Using the API allows for more control over the IR generation process and better integration with the rest of the LLVM infrastructure, such as optimization passes and target-specific code generation.

However, I wasn’t sure if I’d have enough time to refactor to use this API.

Therefore, I decided to keep the textual LLVM IR and use a hacky solution.

I wrote a C program that executes the generate_llvm_ir and clang commands. It uses the former program to produce the LLVM IR code and write to a temp.ll file. Then, the temp.ll is fed as input to the latter command. The user provides the input file (to feed to generate_llvm_ir) and an output file (where clang writes the resulting executable):

./compiler <input-file> <output-file>

Not a very elegant solution, but it works (for now).

Step 6: Test your compiler

Because the language is not too overly complex and because I had limited time, I didn’t do extensive testing.

However, one annoyance while testing were the error messages. With my original code, I’d get messages like this on syntax errors:

Unexpected token "". Expected 4.

First, I modified the error message to output the string representation of a token instead of the enum integer:

Unexpected token "". Expected "semicolon".

Then, I modified the lexer to add starting position to each token. With this, I updated my error messages to:

Unexpected token "" at position 6. Expected "semicolon".

Much better. I could quickly find and fix these syntax errors in my sample programs.

Conclusions

Here are some general thoughts after completing this project:

I have a better understanding of compilers. However, I realize I’ve barely touched the surface. There is so much more to learn.
ChatGPT is an interesting coding partner. It is extremely helpful with ideating and debugging. Its code generation saves a lot of time, but you need to check the code it produces. There are subtle bugs in the code.
The C programming language still feels daunting to me. It seems there are myriad of ways to accomplish things. And it’s hard to determine which way is optimal. ChatGPT was helpful here.

What’s Next

I try to time box my projects. But if I have time in the future, here’s what I’d do:

Dive deeper into compilers. Specifically, check out the resources suggested in teachyourselfcs. This includes the “Dragon book” and Crafting Interpreters.
Add features to my language, such as functions/procedures, additional data types (strings, arrays, dictionaries, booleans), and control flow (if/else, while, for)
Leverage existing tools. Further leverage LLVM. Or try a tool like ANTLR