Creating languages, simply put

Markus Rudolph
Jul 8, 2023

After reading this article you will roughly know:

  • what you will need to build your own software language
  • what you can do to make the workflow with your language’s source files user-friendly
  • what software solution can help you to create a language

But this article is not about:

  • the benefits of creating software languages (I will add an extra article just for this purpose)
  • overwhelming you with code snippets

How are we using language?

Languages are our tools for communication, and we use them every day. Language has several facets.

  • You can speak to share your thoughts with another person (in the hope that he or she understands).
  • You can even postpone and multiply the moment of understanding by writing your thoughts down on a sheet of paper and giving it to other people, even hundreds of years after your lifetime.
  • You may have seen deaf people talking with their hands: sign language.
  • Even bees have their dance language to exchange information.
  • And do not forget all the specification and programming languages which can often be seen as a list of orders to a certain machine.

What is language made of?

If you inspect these examples in detail, you will observe several facts.

To visualize these facts, let’s use a slight variation of this section’s title as our example sentence: “What is a language made of?”

[Illustration: The example sentence]

Observation #1: There is a sequence of “units”

[Illustration: A sequence of tokens]

  • For speaking, these units are patterns in the air pressure creating syllables and words.
  • For writing, these units are nouns, verbs, but also punctuation.
  • For sign language, these units are the poses and movements of the hands.
  • For bee dancing, these units are the movements of body parts.
  • And for programming languages, the units are names of variables, values, keywords.

This sequence of units is called lexis, and the units are named lexemes or tokens. Lexemes are the building blocks of a language. A program that splits text into lexemes is called a lexer or tokenizer.

Notice in the illustration that each unit is more than just text. Each unit also has a classification: noun-likes are pink, verb-likes are green, punctuation is yellow.
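
To make this more tangible, here is a minimal sketch of a tokenizer for the sentence “What is a language made of?”, written in Python. The token class names (W_WORD, VERB, ARTICLE and so on) are assumptions made for this illustration; they are a finer classification than the three colors in the illustration and are not taken from any lexer library.

```python
import re
from typing import NamedTuple


class Token(NamedTuple):
    kind: str  # the classification of the unit
    text: str  # the characters that were matched


# Hypothetical token classes for the example sentence (illustrative only).
TOKEN_SPEC = [
    ("W_WORD", r"What\b"),
    ("VERB", r"(?:is|made)\b"),
    ("ARTICLE", r"a\b"),
    ("PREPOSITION", r"of\b"),
    ("NOUN", r"language\b"),
    ("PUNCTUATION", r"\?"),
    ("SKIP", r"\s+"),  # whitespace separates units but is not a unit itself
]
MASTER = re.compile("|".join(f"(?P<{kind}>{pattern})" for kind, pattern in TOKEN_SPEC))


def tokenize(text):
    """Split text into classified tokens, or fail on unknown input."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = MASTER.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character {text[pos]!r} at position {pos}")
        if match.lastgroup != "SKIP":
            tokens.append(Token(match.lastgroup, match.group()))
        pos = match.end()
    return tokens


for token in tokenize("What is a language made of?"):
    print(f"{token.kind:12} {token.text}")
```

Each token carries both its text and its class, which is exactly the “more than just text” part of the observation.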

Observation #2: The sequence has an inner hierarchy

[Illustration: A syntax tree]

This hierarchy places some units higher and some lower in a structure that we computer scientists call a tree. Trees are shaped by a grammar, a system of rules that describes how to build those trees.

Check out the illustration again! A question is a W-word, followed by some sentence “guts”, followed by a question mark. The guts have a predicate (is made of) and an object (a language). Finally, an object is a concatenation of an article (a) and a noun (language).

The corresponding rule system, a grammar, would look like this:

  • QUESTION ::= W-WORD SENTENCE-GUTS "?"
  • W-WORD ::= "What"
  • SENTENCE-GUTS ::= "is" OBJECT VERB
  • OBJECT ::= "a" "language"
  • VERB ::= "made" "of"

Grammars create a hierarchy on the sequence of lexemes. This kind of hierarchy can be found in all examples from above:

  • For speaking and writing, this hierarchy is represented by a grammar whose rules were found and later standardized (German, French, …).
  • Even sign language has a grammar.
  • For bee dances, a bee first tries to get attention as a dance opener and then describes details like distance and direction by varying the waggling speed and the angle relative to the sun’s position.
    So, BEE-DANCE ::= ATTENTION LOCATION
  • For programming languages, we can say the same. Having a grammar is key to mimicking a language! Grammars for programming languages are very strict and precise compared to those of natural languages, where you need some tolerance so that you can also handle ambiguities.

Having a grammar is also associated with the term syntax, and the tree shaped by the grammar is called an abstract syntax tree, or AST for short. A program that converts a sequence of lexemes into an AST is called a parser or syntax analyzer.
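
As a rough sketch, the five grammar rules from above can be turned into a hand-written parser with one method per rule (a technique known as recursive descent). The Python code below is only an illustration: the class name Parser, the helper lex and the tuple-based tree shape are assumptions of this sketch, and a real language workbench would generate such code for you.

```python
import re


def lex(text):
    # Simplified lexer for this sketch: words and the question mark.
    # Any other character is silently ignored here.
    return re.findall(r"[A-Za-z]+|\?", text)


class Parser:
    """One method per grammar rule; each returns a (rule name, children...) tuple."""

    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def expect(self, word):
        # Consume exactly the given terminal or report a syntax error.
        found = self.tokens[self.pos] if self.pos < len(self.tokens) else "end of input"
        if found != word:
            raise SyntaxError(f"expected {word!r}, found {found!r}")
        self.pos += 1
        return word

    def question(self):  # QUESTION ::= W-WORD SENTENCE-GUTS "?"
        node = ("QUESTION", self.w_word(), self.sentence_guts(), self.expect("?"))
        if self.pos != len(self.tokens):
            raise SyntaxError("unexpected input after the question mark")
        return node

    def w_word(self):  # W-WORD ::= "What"
        return ("W-WORD", self.expect("What"))

    def sentence_guts(self):  # SENTENCE-GUTS ::= "is" OBJECT VERB
        return ("SENTENCE-GUTS", self.expect("is"), self.object(), self.verb())

    def object(self):  # OBJECT ::= "a" "language"
        return ("OBJECT", self.expect("a"), self.expect("language"))

    def verb(self):  # VERB ::= "made" "of"
        return ("VERB", self.expect("made"), self.expect("of"))


tree = Parser(lex("What is a language made of?")).question()
print(tree)
```

A sentence that does not follow the rules, such as “What is language made of?” without the article, is rejected with a SyntaxError because the OBJECT rule insists on the terminal "a".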

Observation #3: Not every hierarchy is valid

[Illustration: Validation rules filter out invalid trees]

There are a lot of combinations for hierarchies of units that are grammatically correct, but not semantically correct. In contrast to lexis and syntax, which add more opportunities for new sentences, this phase takes care of semantics by removing entire trees from the language when a certain subtree does not meet the conditions of a so-called validation rule.

  • For speaking, think of a spell that only gets cast when you emphasize it in the right places (Wingardium Leviosaaa!). Here, the validation rule is: when the last syllable is held on a certain tone, then pass and move on to the next validation rule; otherwise reject the whole tree.
  • For writing, think of a mathematical equation as a true statement, which makes sure that 1+2=3 is part of your language but 1*2=3 is not. Here, the validator has to test the truth by computing the left and the right side and comparing both results for equality.
  • For sign language, think of poses that make no sense, are impolite or are impossible to form. One validator could check whether only the middle finger of one hand is shown. Depending on the receiver’s culture, this pose can be an insulting gesture.
  • For bees, one simple rule is: first get the attention, then do the dance. The other way around makes no sense.
  • For programming languages, think of a variable in C or JavaScript that is used before it is defined.

This third observation describes what is called semantics. It requires knowledge about the domain. You have to forbid everything that is not allowed. If you are searching for a program that does this for you from a specification, I have to disappoint you, because there is no general approach yet.
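
To illustrate the programming language example, here is a minimal sketch of a single validation rule in Python: a name must be defined before it is used. The node types Define and Use and the function validate are hypothetical and only stand in for the much richer tree of a real language.

```python
from dataclasses import dataclass


@dataclass
class Define:
    name: str  # a statement that introduces a variable


@dataclass
class Use:
    name: str  # a statement that reads a variable


def validate(statements):
    """Return one error message per variable that is used before its definition."""
    defined = set()
    errors = []
    for index, statement in enumerate(statements):
        if isinstance(statement, Define):
            defined.add(statement.name)
        elif isinstance(statement, Use) and statement.name not in defined:
            errors.append(f"statement {index}: '{statement.name}' is used before it is defined")
    return errors


# Both programs are syntactically fine, but only the first one is valid.
print(validate([Define("x"), Use("x")]))
print(validate([Use("y"), Define("y")]))
```

The rule is written by hand because it encodes domain knowledge: nothing in the grammar says in which order definitions and usages have to appear.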

Language Frontend and Backend

As a software language engineer, I am interested in creating languages for writers and developers, not for speakers or dancers. The three phases (lexis, syntax and semantics) are just the frontend of a language. I will now tell you what a language backend does.

A backend takes an abstract syntax tree (AST) that has passed all three phases and derives an artifact from the information the AST provides. But let’s not worry too much about the backend because, simply put, it is “only” about generating a string from the AST information produced by the frontend.

You can optimize this process, sacrificing simplicity and gaining performance. However, a language backend can also surprise you with challenging tasks.
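
A minimal sketch of such a backend, assuming a tiny hypothetical AST for additions: the generator walks the tree and emits text, here instructions for an imaginary stack machine. The node types Number and Add and the instruction names PUSH and ADD are invented for this illustration.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Number:
    value: int


@dataclass
class Add:
    left: "Expr"
    right: "Expr"


Expr = Union[Number, Add]


def generate(node):
    """Walk the tree and return a list of instructions for an imaginary stack machine."""
    if isinstance(node, Number):
        return [f"PUSH {node.value}"]
    if isinstance(node, Add):
        # Operands first, operator last.
        return generate(node.left) + generate(node.right) + ["ADD"]
    raise TypeError(f"unknown node type: {type(node).__name__}")


ast = Add(Number(1), Add(Number(2), Number(3)))
print("\n".join(generate(ast)))
# prints:
# PUSH 1
# PUSH 2
# PUSH 3
# ADD
# ADD
```

The whole backend is one recursive function that turns tree nodes into lines of text, which is what “generating a string” boils down to in the simplest case.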

Source text editing

It is not enough to read a valid AST and generate something afterwards in one run. We can call this one-run process a translation. Imagine your language has such a translator: how would you develop your input files? What is your flow?

You would change the file, save it, go to the terminal, translate your input file and then either deal with any errors or be happy about the result. The shortest cycle would be a compile script that runs whenever the input file changes.
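
Such a compile script could be sketched in Python as a simple polling loop. This is only an assumption about how one might do it: the function watch and its parameters are hypothetical, and translate stands for whatever translator your language has.

```python
import os
import time


def watch(path, translate, poll_seconds=0.5, max_polls=None):
    """Call translate(path) once at the start and again whenever the file's modification time changes.

    max_polls limits the number of checks; None means watch forever.
    """
    last_modified = None
    polls = 0
    while max_polls is None or polls < max_polls:
        modified = os.stat(path).st_mtime_ns
        if modified != last_modified:
            last_modified = modified
            translate(path)
        time.sleep(poll_seconds)
        polls += 1
```

Calling watch("input.txt", my_translator) would then retranslate a (hypothetical) input.txt after every save. Even so, you still have to switch to the terminal to read the errors.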

To make a language really sexy, you also need to think about developing an editor. An editor that

  • suggests what can be written next in the context of your text cursor. This is sexy, because you do not want to guess which symbols are visible and have the translation fail when you guessed wrong.
  • shows semantic and syntactic errors by underlining them. This is sexy, because you do not need to trigger the translation to get feedback; it appears immediately, as soon as something makes no sense.
  • finds usages of certain symbols. This is sexy, because files can become big, so some parts of your source get out of sight after a while of coding.
  • highlights the units of your language for better readability. This is sexy, because you can memorize patterns more easily and recognize familiar code faster.
  • suggests rewrites of the code when there is a recommended way. This is sexy, because we as developers are not always up to date with all best practices. Having support for refactoring code is a really nice feature.

“Uh wow! I have to develop a frontend, a backend and an editor? That is a lot of work!” you might say… “Let’s give up.”

And here comes the good news…

There are frameworks that can handle a large share of these tasks for you. They are called language workbenches. Here are some for creating textual languages (be aware that graphical solutions also exist).

Language workbenches

Do you remember the abstract syntax tree? It is involved — everywhere!

  • the leaves of the tree are the lexemes of the lexical phase
  • the tree itself was made in the syntactical phase
  • the semantic phase scans the tree to validate it
  • the generator phase gets the tree as input and outputs the desired artifact (like an executable of your program)

As the following illustration shows, the syntax tree connects the two worlds of pure text (the sky) and the domain of your language (the ground).

[Illustration: The syntax tree is everywhere]

In Langium, for example, you start your adventure by designing a grammar. It describes what your lexemes look like, what kind of information they contain and how they are connected to each other. The grammar is then used to generate a bunch of helper modules, like the AST-holding data structures or the syntax highlighter.

Generating something from grammar rules is the typical approach of parser generators like ANTLR. But Langium goes far beyond just providing a parser. It has default implementations for all kinds of editor aspects:

  • Autocompletion
  • Formatting
  • Syntax and validation errors in real-time
  • Tracking of symbol references across multiple documents

You get everything for free. All that was needed was the grammar. And even if some aspect does not match your desires, you are empowered to override the default behavior with your own.

Want to read more?

If you want to read more from me, just follow my profile links. I help people learn to write their own languages. If you want me to help you understand software language engineering, contact me.

If you need someone to create a language for you, I would recommend talking to the team at TypeFox (my current employer and creator of the Langium project), or to the community around Strumenta (here I am just a member).

Feedback is welcome

Anything that was hard to understand? Any topic you want to read next? You can write me an email!

Markus Rudolph

Passionate developer, language engineer and sometimes also an artist and songwriter :-)