A Basic History of Programming Languages and How to Write Your Own

David MacDonald
Published in CodeX · Jul 14, 2021

There are many reasons to create your own programming language. I suspect if you found this, then you may already have one in mind. At worst, writing a programming language will give you a richer understanding of how programming languages work. At best, you can create a useful tool that allows jobs to be performed more effectively. My goal is to explain the conceptual basis required to create your own language.

Creating a language is a process that has been refined over decades. Before we had Turing complete computers, computations were mostly hard-coded directly onto circuits. These machines were incredibly basic and were not Turing complete: they were not re-programmable and thus could not solve every problem. Then came Turing complete computers that were programmable; a developer could feed the computer instructions and it would perform custom calculations. The first programming language, which still underlies every calculation today, is binary machine code. The processor is designed so that, given specific instructions (electrical signals), it performs certain operations such as moving values from memory into registers, performing arithmetic, and so on. With a basic set of instructions hard-coded into the CPU, all computable problems can be solved using combinations of those instructions (given unbounded computing power and memory).

These instructions are purely binary. To make programming easier, assembly languages were created. These took the binary instructions and gave them readable aliases like mov or add. You would write code in assembly and use a program called an assembler to convert the aliases into binary machine code. Languages slowly evolved and became more complex, a process made easier by Grace Hopper's invention of the compiler. It's important to take a step back and see that the reason languages evolved this way was to abstract toward the problems being solved. The goal has always been to make the programming process more abstract: to ignore the fact that we're talking to a computer and focus on solving the problem at hand in a language close to our natural languages.

More modern languages such as C are significantly more abstract than assembly languages; and yet, C is still an incredibly simple language. It allows complex mathematical expressions to be written almost identically to how they would be written naturally. C also brings in the concept of types: variables hold values of certain types, and functions accept and return typed values. Types provide great utility in understanding what code is doing, and they are intrinsically linked to C's assembly foundation. Just as assembly code is assembled into binary, C code is compiled into assembly and then assembled into binary. It isn't quite that simple, because C allows a program to be split among multiple files, so there is also a linking stage as well as other intermediate steps for debugging and the like. The fundamental thing to understand, however, is that C did not replace assembly; it masked it with a layer of abstraction. All modern computer code still runs as binary, and almost all of it becomes assembly at some point.

If we skip far ahead, we arrive at languages like Python or JavaScript. These languages are significant because they are interpreted. Interpreted languages are not compiled into assembly and assembled into binary to be run; instead, another program traverses them and determines the result of execution. If you are familiar with tree structures, I did not use the word traverse by mistake. There is a lot more history I could get into, but I think this is enough to begin discussing how the process actually works.


To begin, an assembler can be as simple as a text-replacement program. Once you write one assembler in binary, you can rewrite it in the assembly language you are using and run it with the previously generated machine code. Some assemblers provide more features than simple text replacement, but that depends on the feature set of the assembly language; most modern assembly languages have far more higher-level features than the first ones did. An assembler has to be built for a specific CPU architecture, and code assembled for one CPU may not work on another, because different architectures use different instruction sets and possibly even different word sizes (consider 32-bit versus 64-bit).
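To make the text-replacement idea concrete, here is a minimal sketch of a toy assembler in Python. The two mnemonics, the register names, and the opcode bits are all invented for illustration; real instruction sets are far larger and encode operands in more involved ways.

# A toy assembler built on pure text replacement. The instruction set
# (MOV, ADD) and the bit patterns below are made up for this example.
OPCODES = {"MOV": "0001", "ADD": "0010"}
REGISTERS = {"R0": "00", "R1": "01"}

def assemble(line):
    """Replace each mnemonic and register name with its binary encoding."""
    mnemonic, *operands = line.replace(",", " ").split()
    bits = [OPCODES[mnemonic]] + [REGISTERS[op] for op in operands]
    return "".join(bits)

print(assemble("MOV R0, R1"))  # 00010001
print(assemble("ADD R1, R0"))  # 00100100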

One of the first questions you will need to answer when creating a language is whether you want your language to be interpreted or compiled (or transpiled). I’ll go into detail about what all three of those mean so you can begin deciding:

A compiled language is converted into assembly code, which means you'll likely need to know or learn assembly. Compiling code takes some time, but once it is compiled, it can be loaded into memory and run. This makes compiled languages the fastest; C++, for example, is famous for its performance and optimization options. The catch is that if your program is intended to run on multiple CPU architectures such as x86, x64, and ARM, your compiler will need to emit different assembly for each architecture you plan to support, and the program will need to be compiled for each of them.

An alternative to compiling your code is transpiling. A transpiler is a program that takes code in one language and converts it into code in another language. You'll notice that every compiler is a transpiler, but not every transpiler is a compiler. A transpiler can take C code and produce mostly equivalent Java code, for example. If you want a compiled language but want to avoid writing assembly by hand, you can write a transpiler from your language into an existing compiled language like C. Then, you can use an existing C compiler to compile and assemble your code. The main downside of a transpiler is that compilation will be significantly slower: not only is the code being read and rewritten twice, but the intermediate output is a human-readable format, which is much less condensed and less efficient to process. In practice, this may not matter much.
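As a toy illustration, here is a sketch that transpiles one line of a made-up language into C. The let syntax and the function name are invented purely for this example; a real transpiler would parse the source properly rather than splitting on spaces.

# A minimal, hypothetical transpiler: one line of a toy language in,
# one line of C out. "let x = 5" is assumed to declare an integer.
def transpile_line(line):
    words = line.split()
    if words and words[0] == "let":    # let x = 5  ->  int x = 5;
        _, name, _, value = words
        return f"int {name} = {value};"
    raise SyntaxError(f"unrecognized line: {line!r}")

print(transpile_line("let x = 5"))  # int x = 5;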

One of the more modern options is to create an interpreter. This is a program, written in a compiled language, that reads through your code and computes the correct results. You can create an interpreter in an interpreted language, but at some point compiled code will be run, and each level of inception adds a performance cost. An interpreter may sound more complicated than a compiler, but it's actually simpler in a lot of ways; that will become clearer once I explain the actual process of constructing one. The good thing about interpreters is that they're fairly easy to write and they allow for a more unified program. Compilation on different architectures could potentially produce different results in calculations, but your interpreter can unify those results. You may need to build the interpreter for each platform, but overall this is the favored method for languages intended to work on many kinds of computers. Java boasts about how many devices run it; that's because Java is interpreted (Java is compiled to bytecode, which is then interpreted by the Java Virtual Machine, the JVM). The main downside of interpreted languages is that they are generally slow. Optimizations can make them perform almost the same as compiled languages in certain calculations (and occasionally faster if the right optimization is applied), but very often calculations are performed far more slowly and inefficiently. This is the reason the vast majority of modern games are still written in languages like C++.

Now, you should be able to better decide whether you want your language to be interpreted, compiled, or transpiled. However, don't decide too quickly. A language is not inherently interpreted, compiled, or transpiled; your language could be all three if you wanted, or any combination. Let's look into what you'd conceptually need to do to actually start creating a language.


Before you jump into coding, you should really start by designing the syntax of your language. (You should also read through the rest of this article first, so you understand what you'll be designing.) Do you want brackets and semicolons like C, do you want indentation and newlines like Python, or do you have something new in mind? What operators do you want to have? How do you want functions to work? Will types be explicitly declared or inferred? Will the language be object oriented or not? This is probably the hardest step in creating a language. You have to take the time to decide how you want your language to feel when you use it. It will seem overwhelming because you have so many options. If you really cannot decide on some things, take a look at languages you like and take inspiration. I would suggest, however, focusing on readability and simplicity. I think that languages that read like written text and mathematics are ideal. Focus on usability, debugging, and general visual appeal. Or, if you want to, create a horrible mess of symbols; your language should be yours. I would recommend creating documentation for this syntax if you expect or want other people to actually use it. It also helps you remember what decisions you've made and why.

Now you have the ability to write code that does nothing. To make it do something, we will use a three-step process. We begin by creating a lexer, or tokenizer: a program that reads through your source code and splits it into tokens. Every symbol and word is a token. Depending on your language, white space may be ignored here or may produce tokens of its own. The lexer produces a list of tokens in the order they appear in the source file. Let me write out a quick example in C:

int main() { return 0; }

This would produce a list of tokens something along the lines of this:

["int": type, "main": identifier, "(": symbol, ")": symbol, "{": symbol, "return": keyword, "0": int-literal, ";": symbol, "}": symbol]

Notice that every token has an associated type. The lexer needs to read some symbols as groups (identifiers, for instance, must be kept as whole words, as must multi-character operators like Python's exponent operator **) and others as single characters. A few token types you'll probably want are identifiers, symbols, string literals, integer literals, float literals, and keywords. How you label them is really up to you, but labeling them breaks the code down and captures information about what each token might be doing.
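To make this concrete, here is a minimal sketch of a lexer in Python, assuming a small C-like token set. The token names and regular expressions are choices made for this example, not a standard; adapt them to your own syntax.

import re

# Each token type is tried in order, so FLOAT must come before INT and
# a catch-all MISMATCH comes last to report anything unrecognized.
TOKEN_SPEC = [
    ("FLOAT",    r"\d+\.\d+"),
    ("INT",      r"\d+"),
    ("IDENT",    r"[A-Za-z_]\w*"),   # identifiers and keywords
    ("SYMBOL",   r"[(){};=+\-*/]"),
    ("SKIP",     r"\s+"),            # whitespace is dropped, not tokenized
    ("MISMATCH", r"."),
]
KEYWORDS = {"int", "return"}
MASTER_RE = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def tokenize(source):
    """Return (type, text) pairs in the order they appear in the source."""
    tokens = []
    for match in MASTER_RE.finditer(source):
        kind, text = match.lastgroup, match.group()
        if kind == "SKIP":
            continue
        if kind == "MISMATCH":
            raise SyntaxError(f"unexpected character {text!r}")
        if kind == "IDENT" and text in KEYWORDS:
            kind = "KEYWORD"         # promote reserved words
        tokens.append((kind, text))
    return tokens

print(tokenize("int main() { return 0; }"))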

Next, we will discuss the parser. The parser takes the list of tokens and produces an Abstract Syntax Tree (AST). An AST is a tree in which operations are nodes, and the structure of the tree defines the order of operations. You can perform a depth-first search (DFS) on the tree to visit the tokens in the correct order given the order of operations you want. In a depth-first search of a binary tree, you start at the root, check it, then recursively visit its left subtree and then its right subtree. If you are unfamiliar with recursion or tree traversal, I recommend doing some research into them.

This step of generating an AST requires that you have thought out the order of operations in your language (hence why I suggested reading through before designing). This order is more complicated than just arithmetic; you need to include function calls, variable references, and so on. Try not to get too caught up on the names for these things, because naming is one of the harder parts of the design. Arithmetic and boolean expressions are easy to identify, but what should other expressions be called? Once you get into it, you'll understand what I mean. This is essentially the process of creating the grammar of your language. I recommend creating a chart of some kind that displays how each type of expression relates to every other kind.

Once your grammar is defined, generating an AST from a list of tokens is fairly simple. Just as you identified symbols as specific types and used that information to create tokens, you will now identify tokens as specific types of nodes or expressions and create nodes in a tree structure. Notice that at each step, some information is lost. In the lexer, all of the ignored white space is lost, yet much of it can be inferred from the order of the tokens in the list. Similarly, the order of writing is now lost; in its place, the order of execution is imparted directly into the structure of the AST. This makes the final step much simpler.
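Here is a minimal sketch of a recursive-descent parser for arithmetic, assuming tokens shaped like the (type, text) pairs from the lexer sketch above. The grammar is a deliberately tiny example: each rule becomes one method, and operator precedence falls out of which rules call which.

from dataclasses import dataclass

# Grammar for this sketch:
#   expr   -> term (("+" | "-") term)*
#   term   -> factor (("*" | "/") factor)*
#   factor -> INT | "(" expr ")"

@dataclass
class Num:
    value: int

@dataclass
class BinOp:
    op: str
    left: object
    right: object

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else ("EOF", "")

    def take(self):
        token = self.peek()
        self.pos += 1
        return token

    def expr(self):
        node = self.term()
        while self.peek()[1] in ("+", "-"):
            node = BinOp(self.take()[1], node, self.term())
        return node

    def term(self):
        node = self.factor()
        while self.peek()[1] in ("*", "/"):
            node = BinOp(self.take()[1], node, self.factor())
        return node

    def factor(self):
        kind, text = self.take()
        if kind == "INT":
            return Num(int(text))
        if text == "(":
            node = self.expr()
            self.take()               # consume the closing ")"
            return node
        raise SyntaxError(f"unexpected token {text!r}")

tokens = [("INT", "5"), ("SYMBOL", "+"), ("INT", "6"), ("SYMBOL", "*"), ("INT", "2")]
print(Parser(tokens).expr())

Because term() binds tighter than expr(), 5 + 6 * 2 parses with the multiplication as a subtree of the addition, which is exactly the precedence you would expect.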

Lastly, you will create a visitor, which is my generic name for a compiler, transpiler, or interpreter. It visits each node of the AST (starting at the root, performing DFS), hence the name. In fact, every step before now is identical regardless of the type of visitor you choose, which is a major reason this method works so well: you can create one, two, or all three visitors if you wish. This is also why an interpreter is so simple. If you come across a node such as ( (+) lc(5) rc(6) ), which is to say an addition node with left child integer-literal 5 and right child integer-literal 6, you simply identify that the node is an addition, visit the child nodes, and add their results. It's incredibly simple. Since you are always evaluating the child nodes, the entire tree ends up being traversed. It's important to note that not all operations will be binary; some can be unary, ternary, or of higher arity if you like. Each grammar rule will have its own kind of node, and it can be designed as you wish. To create a compiler or transpiler, you similarly traverse the tree starting at the root, and each type of node has its own function for compilation or transpilation.
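To show how little a visitor needs, here is a sketch of two visitors over the Num and BinOp nodes defined in the parser sketch above: one interprets the tree into a value, and one transpiles it into C source text. Both are the same depth-first traversal; only what each node produces differs.

def interpret(node):
    """Evaluate the AST directly: visit the children, then combine results."""
    if isinstance(node, Num):
        return node.value             # leaves carry their own value
    if isinstance(node, BinOp):
        left, right = interpret(node.left), interpret(node.right)
        if node.op == "+": return left + right
        if node.op == "-": return left - right
        if node.op == "*": return left * right
        if node.op == "/": return left / right
    raise TypeError(f"unknown node: {node!r}")

def transpile(node):
    """The same traversal, but emitting C source text instead of values."""
    if isinstance(node, Num):
        return str(node.value)
    if isinstance(node, BinOp):
        return f"({transpile(node.left)} {node.op} {transpile(node.right)})"
    raise TypeError(f"unknown node: {node!r}")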

As a summary of this entire process: we begin with a lexer that goes symbol by symbol through the source code and generates tokens of varying types. We then have a parser that goes token by token, using the type and order of the tokens to generate the nodes of an Abstract Syntax Tree, where the structure of the nodes describes the order of operations and the type of each node describes the grammar rule it came from. Lastly, we have a visitor that performs a depth-first search on the Abstract Syntax Tree and either generates code or interprets the result based on the types of the nodes and the values of the tokens stored in them.
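Putting the sketches together (every name here comes from the examples above, not from any library), the whole pipeline fits in a few lines:

source = "5 + 6 * 2"
ast = Parser(tokenize(source)).expr()   # lexer, then parser
print(interpret(ast))                   # 17
print(transpile(ast))                   # (5 + (6 * 2))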


If you've gotten this far, take a breath and relax. With this information alone, you still may not be ready to create your language; maybe you've begun designing it and cannot make certain decisions yet. I unfortunately haven't found many good tutorials on creating programming languages, but I suggest you keep looking. If you really want to understand how this process works before designing your own language, consider creating a visitor for a language like C or Python. These languages have well-defined syntax and grammar you can find online, and implementing them will give you a practical understanding of lexers, parsers, and visitors. You could perhaps even end up modifying some of the code for your own language.

I hope this proves useful! I got interested in creating programming languages because I started mastering Python and found that I love the syntax but wished it had lower-level features. That began my quest to create a version of C with Python-like syntax, called Sea. I haven't worked on it in a while because I jumped into programming before designing the syntax, so my code is an absolute mess. I plan on writing out a lot of my design process so that I can actually think it through, and then I'll return to programming it. Feel free to look over the code for inspiration or a practical understanding of this process. Perhaps in the future I will write a follow-up article explaining this process from a practical viewpoint rather than a conceptual one. Good luck creating your language!
