Life of a GraphQL Query — Lexing/Parsing

Christian Joudrey
6 min readJul 21, 2017

Last week I wrote about how I used Ruby’s TracePoint and Graphviz to generate graphs to familiarize myself with new codebases. I originally used this approach to learn the internals of the GraphQL Ruby gem.

I figured a good follow-up would be to share what I learned about the internals of this package. Even if you don’t know Ruby, these notes are still valuable to anyone wanting to learn about the internals of executing a GraphQL query.

When you execute a GraphQL query (say via an HTTPS request) one of the very first things the server needs to do is transform the query (currently a string) into something it understands. This is also known as lexing and parsing.

Lexing (or lexical analysis) is the process of breaking up a stream of characters (in our case a GraphQL query) into tokens.

The rules the GraphQL lexer uses to split up the query are defined in the GraphQL grammar.

Example lexing of a GraphQL query
Example lexing of a GraphQL query

While some implementations of GraphQL have hand-written lexers (i.e. graphql-js’ lexer), GraphQL Ruby uses a tool called Ragel to generate its lexer.

At the core of the lexer.rl Ragel file is a set of regular expressions:

%%{
machine graphql_lexer;

INT = '-'? ('0'|[1-9][0-9]*);
IDENTIFIER = [_A-Za-z][_0-9A-Za-z]*;
ON = 'on';
FRAGMENT = 'fragment';
TRUE = 'true';
FALSE = 'false';

The regular expressions are then used in the main scanner:

main := |*
INT => { emit(:INT, ts, te, meta) };
IDENTIFIER => { emit(:IDENTIFIER, ts, te, meta) };
ON => { emit(:ON, ts, te, meta) };
FRAGMENT => { emit(:FRAGMENT, ts, te, meta) };
TRUE => { emit(:TRUE, ts, te, meta) };
FALSE => { emit(:FALSE, ts, te, meta) };

As the lexer analyzes the query string, whenever it finds a token that matches a regular expression on the left, the action on the right is executed.

The emit method populates an array of tokens. Each token keeps track of a few things:

  • The name (what type of token was captured). i.e. :IDENTIFIER
  • The value (what the regular expression matched). i.e. author
  • The line and column the token was found on. i.e. line 2 column 3
Query (left) and tokens (right)
Query (left) and tokens (right)

You might notice that none of the whitespace from the original query shows up in the token list. This is because spaces and tabs are part of the ignored tokens in the GraphQL grammar.

Now that the query string is split up into tokens, the next step is to turn sequences of tokens into a more abstract representation of the language. This is the job of the parser.

Like the lexer, the rules the parser follows are defined in the GraphQL grammar.

I’ve found it incredibly valuable to glance over the GraphQL grammar. Not only did it help me better understand the language, but it helped me know the name of each part of the query which I’ve found to be essential for good team communication.

GraphQL grammar excerpt
GraphQL grammar excerpt

The grammar reads from top to bottom and is essentially a set of rules.

Here’s what the excerpt above is saying:

  • The root of all GraphQL queries is known as a Document.
  • A Document is one or many Definition.
  • A Definition is either a OperationDefinition or FragmentDefinition.
  • An OperationDefinition is either a SelectionSet or OperationType optionally followed by a Name optionally followed by VariableDefinitions and so forth.

A lot of the rules depend on other rules, but as you continue reading, the names of tokens start appearing. For example the rule for OperationType states that it can only be one of the following tokens: query, mutation or subscription.

Here’s a visual representation of the example query above once it is parsed out:

Just like the lexer, some implementations of GraphQL have hand-written parsers (i.e. graphql-js’ parser), but GraphQL Ruby uses a tool called Racc to generate its parser.

The rules in the parser.y Racc file look a lot like the grammar in the GraphQL spec with the exception that there is an action (Ruby code) attached to each rule. We'll come back to that shortly.

Here’s a simplified version of what the rules look like to parse arguments (i.e. (first: 10, sortBy: "something")).

arguments:
LPAREN RPAREN
| LPAREN arguments_list RPAREN

GraphQL Ruby uses the convention of capitalizing tokens in the parser rules, this is why LPAREN and RPAREN are uppercase above. They are the token names for the characters ( and ).

The above rule states arguments are either () or a list of arguments (i.e. arguments_list) between parentheses.

arguments_list:
argument
| arguments_list argument

An arguments_list is either a single argument or multiple arguments. You might notice recursion happening in the rule above in order to express multiples of one thing.

argument:
name COLON input_value

An argument is a name followed by a colon and an input_value.

name:
IDENTIFIER

A name is an identifier. Remember IDENTIFIER is a token the lexer emits when it finds something that matches this regular expression: [_A-Za-z][_0-9A-Za-z]*;.

input_value:
INT
| STRING
| TRUE
| FALSE

An input_value is either an integer, a string or a boolean.

Here’s the same example now with actions attached to the rules:

arguments:
LPAREN RPAREN { return [] }
| LPAREN arguments_list RPAREN { return val[1] }

arguments_list:
argument { return [val[0]] }
| arguments_list argument { val[0] << val[1] }

argument:
name COLON input_value { return Nodes::Argument.new(name: val[0], value: val[2]) }

input_value:
INT { return val[0].to_i }
| STRING { return val[0].to_s }
| TRUE { return true }
| FALSE { return false }

Actions can return values. The variable val used within the actions contains these said values for the rules that were matched.

If we were to parse our previous example (first: 10, sortBy: "something") with the rules above, we would obtain the following:

[
#<Nodes::Argument @name="first", @value="10">,
#<Nodes::Argument @name="sortBy", @value="something">,
]

Although this is a simplified version, the actual grammar rules used by GraphQL Ruby’s parser are similar.

The data structure above is known as an abstract syntax tree (AST).

Now you might be wondering what does the AST look like for our example query.

Here’s the original query:

Here’s the AST representation the parser would return for that query:

Pretty cool, uh?

In case you were wondering, the lexel.rl Ragel file and parser.y Racc file are both used in order to generate regular Ruby files lexer.rb and parser.rb by the build_parser Rake task.

The next step in the life of a GraphQL query is validation, but I’ll keep that for another post.

If you want to learn more about lexers, parsers, and abstract syntax trees (AST), I recommend getting a copy of Thorsten Ball’s incredible book: Writing an interpreter in Go.

Edit: Part two of this series is now available here.

--

--

Christian Joudrey

Production Engineering @Shopify. Rock climber and amateur piano player. Previously working on GraphQL @GitHub.