Parser Combinators with Scala flavor

S.Hosein Ayat
Published in Sanjagh, Dec 16, 2017

If you’ve ever faced a data extraction or text analysis task (especially one involving lots of regex), you know that you’re doomed to debug a maze of code. Writing nested regex data extractors is hard, and debugging them shows you what hell looks like! That is, if you don’t use the proper tools.

Suppose that you have some simple parsers, each of which knows how to extract a specific data structure from a piece of text. A ‘parser combinator’ is a higher-order function that accepts parser functions as input and returns a new, bigger parser that combines them.
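
To make the definition concrete, here is a minimal from-scratch sketch that does not use any library. The Parser type alias and the names regex, andThen, digits, dot and digitsThenDot are all hypothetical, chosen only for illustration; the library used later in this post is far richer.

```scala
object CombinatorSketch {
  // A parser takes the input and, on success, returns the extracted
  // value together with whatever input is left over.
  type Parser[A] = String => Option[(A, String)]

  // Base parser: match a regex at the start of the input.
  def regex(pattern: String): Parser[String] = input =>
    pattern.r.findPrefixOf(input).map(m => (m, input.drop(m.length)))

  // Combinator: a higher-order function that takes two parsers and
  // returns a new parser running them one after the other.
  def andThen[A, B](first: Parser[A], second: Parser[B]): Parser[(A, B)] = input =>
    first(input).flatMap { case (a, rest1) =>
      second(rest1).map { case (b, rest2) => ((a, b), rest2) }
    }

  val digits: Parser[String] = regex("""\d+""")
  val dot: Parser[String] = regex("""\.""")
  val digitsThenDot: Parser[(String, String)] = andThen(digits, dot)
}
```

Calling CombinatorSketch.digitsThenDot on an input that starts with digits followed by a dot yields both extracted pieces plus the remaining input; any other input yields None.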

Let’s go straight to an example. Here we want to implement a log processor that parses different kinds of logs (whose structure we know) and extracts the desired data structure, kind of like Grok.

(Check out the GitHub repository for the sources.)

Problem definition

Let’s assume we want to parse the following log line (a typical Nginx error log):

2017/12/12 07:38:44 [error] 1170#1170: *1578974 open() "/address/to/file" failed (2: No such file or directory), client: 1.2.3.4, server: server.domain, request: "GET /file.txt HTTP/1.1", host: "server.domain"

Here we want to extract the following data structure:

case class NginxFileNotFound(date: Long, file: String, clientIP: String, server: String, urlPath: String)

Here, writing a simple regex may not be hard, but re-using it (or parts of it) is another matter. For example, we can define an IP regex as simply as:

val ipRegex = """(\d{1,3}\.){3}\d{1,3}""".r

We can embed it in the middle of another regex, but how can we re-use its ability to extract an IP from the middle of a haystack of log text?
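
A minimal sketch of what that kind of re-use looks like with plain regexes may help show why it is awkward. The names clientRegex and clientIp are hypothetical and exist only for this illustration.

```scala
object RegexReuseSketch {
  val ipRegex = """(\d{1,3}\.){3}\d{1,3}""".r

  // Re-use by splicing the source text of ipRegex into a bigger pattern.
  val clientRegex = s"""client: (${ipRegex.regex}), server""".r

  // Group 1 is the whole IP. Group 2 is the inner group of ipRegex,
  // which leaks into the numbering of the bigger pattern.
  def clientIp(line: String): Option[String] =
    clientRegex.findFirstMatchIn(line).map(_.group(1))
}
```

The extraction only works because we know which group index holds the IP. If ipRegex ever gains or loses a capture group, the index of every later group in the bigger pattern silently shifts.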

Parser Combinators to the rescue

This is how we process our logs at Sanjagh. First we need basic parsers. A parser consists of a regular expression and a function that handles the data the regular expression extracts. This is the simplest parser you can imagine:

val ip: Parser[String] = """(\d{1,3}\.){3}\d{1,3}""".r ^^ identity

This is a simple parser that extracts a string from a text. To be more precise, it just validates the input and returns the whole matched string as the result.

This parser DSL consists of two parts separated by ‘^^’. The first part defines how to extract data from the input string. The second part (a closure that acts as a map) defines how to process the raw data extracted by the first part and create the result. (In this sample, the map part does nothing.)

Note that in order to write the parser, we need to add the ‘parser-combinators’ library (scala-parser-combinators) to our project and define our parsers inside a trait that extends ‘RegexParsers’.

In other words, the first part uses the regex we defined before to match the target text, and the second part says what to do with the raw matched data, which in this case is simply to return the extracted String. (identity is a predefined function in Scala that returns its input unchanged.)
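
Putting the setup and the ‘ip’ parser together, a minimal sketch could look like the following. The object name IpParserSketch and the helper parseIp are hypothetical, and the version number in the build comment is an assumption; pick the release that matches your Scala version.

```scala
// build.sbt (the version number here is an assumption):
// libraryDependencies += "org.scala-lang.modules" %% "scala-parser-combinators" % "2.3.0"

import scala.util.parsing.combinator.RegexParsers

object IpParserSketch extends RegexParsers {
  // Left of ^^: what to match. Right of ^^: what to do with the match.
  val ip: Parser[String] = """(\d{1,3}\.){3}\d{1,3}""".r ^^ identity

  // Hypothetical helper: run the parser over the whole input and
  // turn the library's ParseResult into an Option.
  def parseIp(input: String): Option[String] =
    parseAll(ip, input) match {
      case Success(value, _) => Some(value)
      case _                 => None // Failure or Error
    }
}
```

Here parseAll requires the parser to consume the entire input, so a string that merely starts with an IP is rejected.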

Next, we’re going to implement a parser for the ‘date time’ in our example. This is a little trickier, since here we want to process the extracted string and create a Long date from it:

import java.text.SimpleDateFormat
val nginxTimeFormat = new SimpleDateFormat("yyyy/MM/dd HH:mm:ss")
val time: Parser[Long] = """\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}""".r ^^ { s => nginxTimeFormat.parse(s).getTime }

This time parser is quite like the ‘ip’ parser we defined before. It has a regex as the data validator and a closure to create the end result. The only difference is that the closure does something! It creates a timestamp (represented by a Long, in milliseconds since the epoch) from the input string.
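
One thing to keep in mind is that SimpleDateFormat is not thread-safe and interprets the timestamp in the JVM’s default time zone unless told otherwise. As an alternative, here is a sketch of the same parser built on java.time. It assumes the log timestamps are in UTC, which is an assumption of this example and not something the log line itself states; the object name TimeParserSketch is hypothetical.

```scala
import java.time.{LocalDateTime, ZoneOffset}
import java.time.format.DateTimeFormatter
import scala.util.parsing.combinator.RegexParsers

object TimeParserSketch extends RegexParsers {
  // DateTimeFormatter is immutable, so it can be shared safely.
  private val nginxTimeFormat = DateTimeFormatter.ofPattern("yyyy/MM/dd HH:mm:ss")

  // The regex checks the shape of the text; the closure converts it
  // to epoch milliseconds, ASSUMING the timestamp is in UTC.
  val time: Parser[Long] =
    """\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}""".r ^^ { s =>
      LocalDateTime.parse(s, nginxTimeFormat).toInstant(ZoneOffset.UTC).toEpochMilli
    }
}
```

Note that the regex only validates the shape of the timestamp. A string such as a month of 13 still matches the regex and then makes the closure throw, so real code may want to handle that case.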

Now we need two more helpers (which are like the ‘ip’ parser):

val host: Parser[String] = """(\w+\.)+\w+""".r ^^ identity
val path: Parser[String] = """/?([a-zA-Z0-9.&=_%+-]+/)*[a-zA-Z0-9.&=_%+-]*""".r ^^ identity

Finally, with all these small parsers, we can implement our final parser:

val fileNotFound: Parser[NginxFileNotFound] = time ~ """\[error\] .+ open\(\) """".r ~ path ~
  """" failed \(2: No such file or directory\), client: """.r ~ ip ~
  """, server: """.r ~ host ~ ", request: \"GET " ~ path ~ "HTTP/1.1\", host: \"" ~ host ~ "\"" ^^ {
    case dateTime ~ _ ~ filePath ~ _ ~ clientIp ~ _ ~ hostName ~ _ ~ urlPath ~ _ ~ serverHostName ~ _ =>
      NginxFileNotFound(dateTime, filePath, clientIp, hostName, urlPath)
  }

Here, the tilde (~) is the sequential composition operator. It glues parsers together to form a bigger parser, and its operands can be a simple string (static text), a regex, or another parser. The closure after ^^ is our map function, which processes the pieces extracted by the first part. Its pattern has one slot for each parser to the left of the ^^ operator, and the slots we don’t need are ignored with _.
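
A much smaller case shows the same mechanics in isolation. The key/value format and the names TildeSketch, key, number and pair are hypothetical and used only for this sketch.

```scala
import scala.util.parsing.combinator.RegexParsers

object TildeSketch extends RegexParsers {
  val key: Parser[String] = """[a-z]+""".r
  val number: Parser[Int] = """\d+""".r ^^ (_.toInt)

  // Three parsers glued with ~, so the pattern has three slots.
  // The "=" literal is matched but its result is ignored with _.
  val pair: Parser[(String, Int)] = key ~ "=" ~ number ^^ {
    case k ~ _ ~ n => (k, n)
  }
}
```

Because RegexParsers skips whitespace between parsers by default, this parser accepts the key, the equals sign and the number with or without spaces around them.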

Of course, there are more fun operators (like ~>, <~, | and ~!) that can help you in various situations to build the perfect parser.
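
As a quick sketch of three of them: ~> keeps only the result on its right, <~ keeps only the result on its left, and | tries the second parser when the first one fails. The names OperatorSketch, client and address are hypothetical.

```scala
import scala.util.parsing.combinator.RegexParsers

object OperatorSketch extends RegexParsers {
  val ip: Parser[String] = """(\d{1,3}\.){3}\d{1,3}""".r
  val host: Parser[String] = """(\w+\.)+\w+""".r

  // Match "client:" and the trailing comma, but keep only the IP,
  // so no ^^ with ignored slots is needed.
  val client: Parser[String] = "client:" ~> ip <~ ","

  // Alternation: accept an IP, or fall back to a host name.
  val address: Parser[String] = ip | host
}
```

With ~> and <~, the static text in a long sequence no longer shows up in the result, which can shorten the pattern in the map function considerably.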

And that’s pretty much it! (To view the complete code, check out the GitHub repository.)
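
For readers who want to try the pieces together without the repository, here is a self-contained sketch assembled from the snippets in this post. The object name NginxLogParser and the helper parseLine are hypothetical and may differ from what the repository does.

```scala
import java.text.SimpleDateFormat
import scala.util.parsing.combinator.RegexParsers

case class NginxFileNotFound(date: Long, file: String, clientIP: String, server: String, urlPath: String)

object NginxLogParser extends RegexParsers {
  // Parses in the JVM default time zone; not thread-safe.
  private val nginxTimeFormat = new SimpleDateFormat("yyyy/MM/dd HH:mm:ss")

  // The small building blocks from the post.
  val ip: Parser[String] = """(\d{1,3}\.){3}\d{1,3}""".r ^^ identity
  val host: Parser[String] = """(\w+\.)+\w+""".r ^^ identity
  val path: Parser[String] = """/?([a-zA-Z0-9.&=_%+-]+/)*[a-zA-Z0-9.&=_%+-]*""".r ^^ identity
  val time: Parser[Long] =
    """\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}""".r ^^ { s => nginxTimeFormat.parse(s).getTime }

  // The combined parser, as defined in the post.
  val fileNotFound: Parser[NginxFileNotFound] = time ~ """\[error\] .+ open\(\) """".r ~ path ~
    """" failed \(2: No such file or directory\), client: """.r ~ ip ~
    """, server: """.r ~ host ~ ", request: \"GET " ~ path ~ "HTTP/1.1\", host: \"" ~ host ~ "\"" ^^ {
      case dateTime ~ _ ~ filePath ~ _ ~ clientIp ~ _ ~ hostName ~ _ ~ urlPath ~ _ ~ serverHostName ~ _ =>
        NginxFileNotFound(dateTime, filePath, clientIp, hostName, urlPath)
    }

  // Hypothetical helper: parse one full log line into an Option.
  def parseLine(line: String): Option[NginxFileNotFound] =
    parseAll(fileNotFound, line) match {
      case Success(result, _) => Some(result)
      case _                  => None // Failure or Error
    }
}
```

Passing the sample Nginx line from the problem definition to NginxLogParser.parseLine should give back a populated NginxFileNotFound, while a line with a different structure gives None.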

S.Hosein Ayat: software engineer, Functional Programming enthusiast, Sanjagh co-founder & CTO