Implementing a Programming Language in Swift — Part 3: The Lexer

Þorvaldur Rúnarsson
Published in Swift2Go
Feb 4, 2019 · 3 min read

This is the third part of a tutorial series on “Writing a Programming Language in Swift.” Be sure to check out Part 2. TL;DR: Full Source Code

The core job of the Lexer is to analyze our interpreter’s text input and reduce it to a collection of Tokens, which are nothing more than simple values (an enum, in our case) representing specific substrings from our input.

Each token represents a pattern of text within the input. Recognizing these patterns is often done using regular expressions. If you’re unfamiliar with regular expressions, I recommend checking out this introduction by Paul Hudson and playing around a little on the RegExr playground before continuing with this tutorial.
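If you just want a quick taste of regular expressions in Swift first, here is a tiny, self-contained sketch (not part of the interpreter itself) that tests the number pattern we will use below against a string, using Foundation’s NSRegularExpression:

import Foundation

// A minimal check: does the input start with a number?
let pattern = "\\-?([0-9]*\\.[0-9]+|[0-9]+)"
let expression = try! NSRegularExpression(pattern: pattern, options: [])
let input = "3.14"
let range = NSRange(location: 0, length: input.utf16.count)
let isMatch = expression.firstMatch(in: input, options: [], range: range) != nil
print(isMatch) // true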

We start by declaring an enum for our tokens:

enum Token {
    typealias Generator = (String) -> Token?

    case op(String)
    case number(Float)
    case parensOpen
    case parensClose

    static var generators: [String: Generator] {
        return [
            "\\*|\\/|\\+|\\-|times|divided\\sby|plus|minus": { .op($0) },
            "\\-?([0-9]*\\.[0-9]+|[0-9]+)": { .number(Float($0)!) },
            "\\(": { _ in .parensOpen },
            "\\)": { _ in .parensClose },
        ]
    }
}

Here we have four different tokens, matched using the following regular expressions:

  • Operator: “\*|\/|\+|\-|times|divided\sby|plus|minus”
  • Number: “\-?([0-9]*\.[0-9]+|[0-9]+)”
  • Open parenthesis: “\(”
  • Close parenthesis: “\)”

As you can see, each backslash in a regular expression needs to be escaped with another backslash when written as a Swift string literal. An annoyance for sure, but nothing we can’t handle, right?
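(As an aside, if you are on Swift 5 or later, raw string literals can spare you the double escaping. This is optional sugar, not something the code below relies on:)

// The same operator pattern, written with escaped backslashes and
// as a Swift 5 raw string literal, where \ needs no escaping.
let escaped = "\\*|\\/|\\+|\\-"
let raw = #"\*|\/|\+|\-"#
print(escaped == raw) // true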

Another thing worth noting is that we have taken the approach of treating all numbers as Floats. This is merely a compromise to keep our language simple.
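For instance, the integer literal “2” simply becomes the Float value 2.0 once tokenized:

// Both integer and decimal literals are stored as Float.
print(Float("2")!)   // 2.0
print(Float("2.5")!) // 2.5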

Now, before we continue implementing our Lexer, it will help to have some helper functions on String:

import Foundation

public extension String {

    // Returns the substring matched by `regex`, but only if the match
    // starts at the very beginning of the string; otherwise nil.
    func getPrefix(regex: String) -> String? {
        // Wrap the pattern in a non-capturing group so the "^" anchor
        // applies to the whole alternation, not just its first branch.
        let expression = try! NSRegularExpression(pattern: "^(?:\(regex))", options: [])
        let range = expression.rangeOfFirstMatch(
            in: self,
            options: [],
            range: NSRange(location: 0, length: utf16.count)
        )
        if range.location == 0 {
            return (self as NSString).substring(with: range)
        }
        return nil
    }

    // Removes all leading whitespace and newlines, mutating in place.
    // (Re-reading `first` each iteration avoids holding on to an index
    // that mutation would invalidate.)
    mutating func trimLeadingWhitespace() {
        while let first = first,
              CharacterSet.whitespacesAndNewlines.contains(first.unicodeScalars.first!) {
            removeFirst()
        }
    }
}

Here, trimLeadingWhitespace() does exactly what you have probably guessed: it removes all leading whitespace from a String. The other helper, getPrefix(regex:), returns the substring matching the given regular expression, but only if that match is a prefix of the string (self).
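A quick sanity check of both helpers (the expected outputs assume the implementations above):

var code = "   12 + 3"
code.trimLeadingWhitespace()
print(code) // "12 + 3"
print(code.getPrefix(regex: "[0-9]+") ?? "no match") // "12"
print(code.getPrefix(regex: "\\+") ?? "no match")    // "no match" ("+" is not a prefix)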

Now we are finally ready to implement the lexical analysis for our language:

class Lexer {

    let tokens: [Token]

    // Finds the first pattern in the generator table that matches
    // a prefix of `code`, returning the pattern and the matched text.
    private static func getNextPrefix(code: String) -> (regex: String, prefix: String)? {
        let keyValue = Token.generators
            .first(where: { regex, _ in
                code.getPrefix(regex: regex) != nil
            })
        guard let regex = keyValue?.key else {
            return nil
        }
        return (regex, code.getPrefix(regex: regex)!)
    }

    init(code: String) {
        var code = code
        code.trimLeadingWhitespace()
        var tokens: [Token] = []
        while let (regex, prefix) = Lexer.getNextPrefix(code: code) {
            // Consume the matched prefix and any whitespace after it.
            code = String(code.dropFirst(prefix.count))
            code.trimLeadingWhitespace()
            guard let generator = Token.generators[regex],
                  let token = generator(prefix) else {
                fatalError("Could not generate token for prefix: \(prefix)")
            }
            tokens.append(token)
        }
        self.tokens = tokens
    }
}

That’s really all there is to writing a Lexer for our language. This is all that we are doing:

First, we initialize an empty token array.

Then we try to match a known regex against a prefix of our working code string. If a match is found, we add the token representation of that prefix to our list of tokens and remove the matched prefix (plus any trailing whitespace) from the working code.

If no match is found, we are done and simply store the list of tokens as a property on our lexer instance.
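To make this concrete, here is what a quick run looks like (the exact print formatting of the enum cases may vary):

// Tokenizing a small arithmetic expression.
let lexer = Lexer(code: "1 + 2 * (3 - 4)")
print(lexer.tokens)
// [number(1.0), op("+"), number(2.0), op("*"),
//  parensOpen, number(3.0), op("-"), number(4.0), parensClose]

And because the operator pattern also accepts the word forms, an input like "1 plus 2" tokenizes just as happily, producing number(1.0), op("plus"), number(2.0).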

In the next tutorial we will move on to the next phase: parsing.

Stay tuned!

P.S. Don’t forget to clap 👏 Also feel free to follow me here on Medium for notifications and discussions on future tutorials.
