Creating Ginger: Part 2 — Lexer

In my previous post on Ginger, I wrote about how I defined the first draft of the lexicon and grammar for Ginger (if you don’t know what a lexicon or grammar is, I really encourage you to read my post on lexicons and grammars before continuing with this post). In this post I'm going to talk about how I built Ginger’s lexer (a.k.a. scanner or tokeniser) based on Ginger’s lexicon.

First, what is a lexer? The lexer takes, as input, a program’s source code, and provides, as output, a series of tokens that represent that same source code. As a simple example, consider the following program code written in Ginger:
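    int x
    x = 1 + 2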

Based on Ginger’s lexicon and the tokens defined in Ginger’s scanner, this program would be represented by the following ordered series of tokens: Int (int), Identifier (x), Identifier (x), Assignment (=), Number (1), Addition (+), Number (2), EndOfFile.

So how did I create Ginger’s lexer? The rest of this post will be dedicated to answering that question. Note: this post describes how I hand-rolled Ginger’s lexer — for working with a lexer generator, see here.

First, I defined the set of tokens available in Ginger (these should be based on the lexemes in the lexicon). As of writing, Ginger’s lexicon is still fairly simple (the following shows Ginger’s current lexicon, so the content of this post may be outdated by the time you read it ☺):

Ginger’s lexicon

Based on the above, the following tokens are defined in Ginger’s lexer:

Tokens in Ginger’s lexer

All lexemes defined in the lexicon have a matching token in Ginger. However, as you can see, there are more tokens defined in Ginger than there are lexemes in the lexicon. This is because extra tokens, e.g., EndOfFile, were added to make the process of parsing simpler. It is also worth noting that not all items in the lexicon are lexemes, e.g., upper-alphabetic-character and white-space; these exist only to simplify defining the lexicon and implementing the lexer.
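To make this concrete, tokens can be represented as a simple enum. The following Python sketch is my own illustration rather than Ginger’s actual implementation, and it includes only the tokens mentioned in this post:

    from enum import Enum, auto

    class Token(Enum):
        # Tokens with matching lexemes in the lexicon
        INT = auto()          # the keyword 'int'
        IDENTIFIER = auto()   # e.g. 'x'
        NUMBER = auto()       # e.g. '42'
        ASSIGNMENT = auto()   # '='
        ADDITION = auto()     # '+'
        SUBTRACTION = auto()  # '-'
        # Extra tokens, not lexemes, added to make parsing simpler
        END_OF_FILE = auto()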

Having defined the set of tokens, I then defined the algorithm that would read, from top to bottom and left to right, each character of the program code, and extract the appropriate tokens as it did so. This algorithm is the most important and fundamental part of the lexer. The algorithm for Ginger is similar to the following:

1. Let’s define some useful variables and store the program code
The first step in the lexer’s algorithm
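As an illustrative sketch (in Python, with names of my own choosing rather than Ginger’s actual code), this step boils down to something like:

    class Lexer:
        def __init__(self, source):
            self.source = source   # the program code, as one string
            self.position = 0      # index of the next char to read
            self.token_value = ''  # the chars underlying the last token returned
            self.row = 1           # row of that token's first char
            self.column = 1        # column of that token's first char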

2. Let’s define a method to get the next token in the sequence

The second step in the lexer’s algorithm
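Continuing the sketch, next() skips white-space, records where the token starts, and then dispatches on the first character. Again, this is my illustration of the idea, not Ginger’s actual code:

    # A method on the Lexer class sketched in step 1
    def next(self):
        # Skip white-space; it separates tokens but is not itself a token
        while self.position < len(self.source) and self.source[self.position].isspace():
            self.position += 1

        # Record the row/column of the token's first char (both 1-based)
        start = self.position
        self.row = self.source.count('\n', 0, start) + 1
        self.column = start - self.source.rfind('\n', 0, start)

        if start >= len(self.source):
            self.token_value = ''
            return Token.END_OF_FILE

        char = self.source[start]

        if char.isdigit():
            # Consume the whole digit run; whether the number is valid
            # (e.g. '04646') is handled later in the process
            while self.position < len(self.source) and self.source[self.position].isdigit():
                self.position += 1
            self.token_value = self.source[start:self.position]
            return Token.NUMBER

        if char.isalpha():
            while self.position < len(self.source) and self.source[self.position].isalnum():
                self.position += 1
            self.token_value = self.source[start:self.position]
            # A keyword is just an identifier with a reserved spelling
            return Token.INT if self.token_value == 'int' else Token.IDENTIFIER

        # Single-character tokens
        self.position += 1
        self.token_value = char
        if char == '=':
            return Token.ASSIGNMENT
        if char == '+':
            return Token.ADDITION
        if char == '-':
            # Always subtraction at this point; signedness is resolved while parsing
            return Token.SUBTRACTION
        raise SyntaxError(f'unexpected character {char!r} at row {self.row}, column {self.column}')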

3. Let’s keep getting tokens until we’ve iterated over all the program code

The third step in the lexer’s algorithm
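The driver then just keeps pulling tokens until it sees the end-of-file token. Running the sketch over the example program from the start of the post:

    # Illustrative usage with the example program from earlier
    lexer = Lexer('int x\nx = 1 + 2')
    token = lexer.next()
    while token != Token.END_OF_FILE:
        print(token.name, repr(lexer.token_value), lexer.row, lexer.column)
        token = lexer.next()
    # Prints INT, IDENTIFIER, IDENTIFIER, ASSIGNMENT, NUMBER, ADDITION,
    # NUMBER -- matching the token series listed at the start of the post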

A few notes on what I learned when creating Ginger’s scanner:

  • Don’t worry about signed, i.e., positive and negative, numbers at this point. If you come across a ‘-’ char, treat it as a subtraction and work it out later on, e.g., while parsing.
  • In general, don’t worry about the semantics, e.g., whether a number is valid. We’ll handle all that kind of stuff later on in the process. So if you come across a number like ‘04646’, be cool about it.
  • Don’t worry about the context, e.g., the order of tokens. We’ll handle that while parsing.
  • Don’t turn the tokens into objects; let them be something simple like an enum, and have your next() method return this simple type. If the caller needs more information, let them query the properties of the scanner, which should reflect the current state, i.e., the token that was just returned. Some useful properties are: token-value (the chars underlying the token, e.g., if the scanner came across the chars ‘42’ and so returned a Number token, then the token-value would be ‘42’), row (the row number of the first char underlying the token), and column (the column number of the first char underlying the token). A sketch of these properties, together with peek(), follows this list.
  • Every now and then you may want to peek one char ahead.
  • While each call to next() should return the next token in the sequence, you will want to provide users the ability to peek(), i.e., get the next token without it affecting what is returned by subsequent calls to next(). So if you were to call peek() and then next(), you would receive the same token from both calls.
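Sticking with the earlier Python sketch, the scanner-state properties are the ones defined in step 1 (token_value, row, column), and a peek() satisfying this contract can be implemented by saving and restoring that state around a call to next():

    # A method on the Lexer class sketched above
    def peek(self):
        # Read one token ahead, then rewind, so that the following
        # call to next() returns the same token again
        saved = (self.position, self.token_value, self.row, self.column)
        token = self.next()
        self.position, self.token_value, self.row, self.column = saved
        return token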

The current implementation of Ginger’s lexer can be found here.

Great, so now that we have defined our lexer, what’s the point? The tokens generated by the lexer are a major part of the input to a parser, which uses them to generate the program’s abstract syntax tree (AST). The purpose of the parser and the AST will be the topics of my next couple of posts.