Flex tutorial

Ilya Rudyak
Aug 1, 2016


This is a short tutorial about flex, a tool for generating lexers. It’s primarily tailored for cs143 Compilers (Stanford University).

The typical problem with assignments in such courses is that there are a lot of undocumented features, and you also have to read the GNU manuals. This is not a big problem when you take the course in a traditional environment, where you can ask your friends or TAs at any moment. But if you are on your own, it sometimes takes an enormous amount of time to figure out how to do some trivial things.

see presentation here

why flex?

Well, to build a Lexical Analyzer you have to follow these steps: Lexical Specification — Regular Expression — NFA — DFA — Table implementation of DFA.

As we know, converting an RE to an NFA and an NFA to a DFA can be automated. Flex is the tool that implements these algorithms, so we only have to specify the REs and flex does everything else for us.
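To make the last step of that pipeline concrete, here is a minimal hand-written sketch of a table implementation of a DFA. It is not what flex actually emits; the states, the character classes and the function name matches_integer are all made up for illustration, and the only RE covered is [0-9]+.

```cpp
#include <string>

// Hypothetical three-state DFA for the regular expression [0-9]+.
// START: nothing read yet, IN_INT: at least one digit read (accepting),
// DEAD: a non-digit was seen, no match is possible any more.
enum State { START = 0, IN_INT = 1, DEAD = 2 };

// Character classes index the columns of the transition table.
enum CharClass { DIGIT = 0, OTHER = 1 };

static CharClass classify(char c) {
    return (c >= '0' && c <= '9') ? DIGIT : OTHER;
}

// transition[state][character class] gives the next state.
static const State transition[3][2] = {
    /* START  */ { IN_INT, DEAD },
    /* IN_INT */ { IN_INT, DEAD },
    /* DEAD   */ { DEAD,   DEAD },
};

// Returns true if the whole input is matched by [0-9]+.
bool matches_integer(const std::string& input) {
    State s = START;
    for (char c : input) {
        s = transition[s][classify(c)];
    }
    return s == IN_INT;
}
```

With this table, matches_integer("10") is true, while the empty string and "1a" are rejected. Flex builds a much larger table of the same kind from all the REs in the specification at once.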

In terms of cs143, you have to fill in the cool.flex file (with appropriate REs) and flex will generate cool-lex.cc, the source code for the lexer. Once you have filled in cool.flex you can build the lexer:

$ make lexer

Again, (1) the only file you have to fill in is cool.flex; (2) after this you can build the lexer with make; (3) now you can test your lexer:

$ ./lexer foo.cl

tokens

Before we can talk about the flex file we have to understand tokens. As the Dragon book explains, a token is a pair of a token name (or class) and an optional attribute value (here, the lexeme that matched some RE). In other words, a token for the integer 10 is something like (Int, “10”), where “10” is a string.

It is absolutely crucial for this assignment to understand the token classes. They closely match the lexical structure of the language: see The Cool Reference Manual, section 10, and Appendix A.

But what is a token in our case? What is the return value of the lexer? It turns out that instead of token classes we have integer constants for tokens (specified in cool-parse.h, see Appendix A):

#define CLASS 258
#define ELSE 259
#define FI 260
...

It’s interesting to note that we have only 3 tokens for special notations (DARROW, ASSIGN and LE). In all other cases we return the actual char (see details in the description of rules). There are other approaches: in the Dragon book they use a relop token class for some of the special notations (The Dragon book, 2nd. ed. p. 140).

The lexer returns these integer values to the parser. Attributes are not returned; they are placed into the global union cool_yylval of type YYSTYPE. This type is defined in the same file, cool-parse.h:

typedef union YYSTYPE
{
/* Line 1676 of yacc.c */
#line 33 "cool.y"
Boolean boolean;
Symbol symbol;
Program program;
Class_ class_;
Classes classes;
Feature feature;
Features features;
Formal formal;
Formals formals;
Case case_;
Cases cases;
Expression expression;
Expressions expressions;
char *error_msg;
/* Line 1676 of yacc.c */
#line 129 "cool.tab.h"
} YYSTYPE;
see here

how does the lexer work?

This is much clearer now that we have considered tokens and our global union (The Dragon book, 2nd. ed. p. 140):

It is a C function that returns an integer, which is a code for one of the possible token names. The attribute value, whether it be another numeric code, a pointer to the symbol table, or nothing, is placed in a global variable yylval (in our case — cool_yylval), which is shared between the lexical analyzer and parser, thereby making it simple to return both the name and an attribute value of a token.
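The protocol described in this quote can be sketched in a few lines of C++. This is only a toy model under stated assumptions: ToyValue, toy_yylval, toy_yylex and TOY_INT_CONST are hypothetical names, the real cool_yylval stores a Symbol rather than a std::string, and the real function is generated by flex rather than written by hand.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Assumption: a simplified stand-in for the real YYSTYPE union.
struct ToyValue {
    std::string text;
};

// Named token codes start above 255, so they never collide with plain chars.
// Hypothetical value, not taken from cool-parse.h.
const int TOY_INT_CONST = 258;

// Global attribute slot shared by lexer and parser (the role of cool_yylval).
ToyValue toy_yylval;

// Scans one token from `input` starting at `pos` and advances `pos`.
// Returns 0 at end of input, TOY_INT_CONST for an integer (the lexeme is
// stored in toy_yylval), or the character itself for anything else.
int toy_yylex(const std::string& input, std::size_t& pos) {
    while (pos < input.size() && input[pos] == ' ') ++pos;  // skip blanks
    if (pos >= input.size()) return 0;

    if (std::isdigit(static_cast<unsigned char>(input[pos]))) {
        std::size_t start = pos;
        while (pos < input.size() &&
               std::isdigit(static_cast<unsigned char>(input[pos]))) {
            ++pos;
        }
        toy_yylval.text = input.substr(start, pos - start);
        return TOY_INT_CONST;
    }
    return static_cast<unsigned char>(input[pos++]);  // single-char token
}
```

Calling toy_yylex repeatedly on "10 + 7" yields TOY_INT_CONST (with "10" in toy_yylval), then '+', then TOY_INT_CONST again (with "7"), then 0. The parser only ever sees the integer return value and reads the attribute from the global when it needs it.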

see here

structure of a flex file

%{
Declarations
%}
Definitions
%%
Rules
%%
User subroutines

We have 4 parts (see a diagram above):

  • Declarations. Here you define variables and constants (both provided and your own). You also have to add function prototypes if you add helper functions (in the User subroutines section). See the description of some useful variables in Appendix B.
  • Definitions. You can provide names for RE here. For example:
DARROW        =>
ASSIGN        <-
LE            <=

A useful RE for keywords is a case-insensitive match:

/* (?r-s:pattern): apply option 'r' and omit option 's' while interpreting pattern.
   Options may be zero or more of the characters 'i', 's', or 'x'.
   (?i:ab7)        same as  ([aA][bB]7)
   So we can add definitions like the one below.
   *NO* space between "(?i:" and the RE, and no trailing comment on the
   definition line: everything up to the end of the line is part of the pattern. */
CLASS         (?i:class)
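The equivalence quoted from the flex manual, (?i:ab7) being the same as ([aA][bB]7), can be made explicit with a small helper. This is only an illustrative sketch: expand_case_insensitive is a hypothetical function, it is not part of flex, and it does not escape regex metacharacters, so it should only be fed plain keywords.

```cpp
#include <cctype>
#include <string>

// Expands a keyword into the explicit character-class form that (?i:...)
// stands for, e.g. "ab7" becomes "([aA][bB]7)".
// Assumption: `word` contains only letters and digits.
std::string expand_case_insensitive(const std::string& word) {
    std::string out = "(";
    for (char c : word) {
        unsigned char u = static_cast<unsigned char>(c);
        if (std::isalpha(u)) {
            out += '[';
            out += static_cast<char>(std::tolower(u));
            out += static_cast<char>(std::toupper(u));
            out += ']';
        } else {
            out += c;  // digits have no case, copy them as they are
        }
    }
    out += ')';
    return out;
}
```

For "class" this produces ([cC][lL][aA][sS][sS]), which is what you would have to write by hand for every keyword if the (?i:...) syntax were not available.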

We may use a character class to match special notations that have no token. As mentioned above, there are only 3 tokens for them; in other cases we return the actual char.

OTHER_SN      [{};:,\.()<=\+\-~@\*/]
  • Rules. This is the main part of the file. Here you tell the lexer what to do when it matches some RE.

As discussed before — we return a token (if any) and put some value into our global union:

{NUMBER}+ {
    cool_yylval.symbol = inttable.add_string(yytext);
    return (INT_CONST);
}

If we have a definition for an RE we refer to it with the { } notation. This can be useful for keywords:

 /* If we have a definition for CLASS (see above) we put it in { }: */
{CLASS}       { return CLASS; }

In case of special notations we return the token (if any) or the actual char:

 /* We may use something like this: */
"+"           { return '+'; }
 /* But it's better to use: */
{OTHER_SN}    { return *yytext; }

Rules for integers and identifiers are pretty straightforward: type identifiers start with a capital letter and object identifiers with a lowercase letter (at the lexer stage we don’t have Int etc., just type ids and object ids):

[A-Z]{ALPHANUMERIC}*
[a-z]{ALPHANUMERIC}*

Be sure to place these REs below the keywords. Otherwise they take priority and keywords are matched as identifiers:

If it finds more than one match, it takes the one matching the most text (for trailing context rules, this includes the length of the trailing part, even though it will then be returned to the input). If it finds two or more matches of the same length, the rule listed first in the flex input file is chosen. (see here).
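The rule quoted above (longest match first, then the earliest rule on a tie) can be imitated in plain C++ to see why the order of the keyword and identifier rules matters. This is a rough sketch, not how flex works internally: it tries each rule separately with std::regex instead of running one combined DFA, and Rule, pick_rule, sample_rules and the identifier pattern are assumptions made for the example.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

struct Rule {
    std::string name;    // hypothetical label for the rule
    std::regex pattern;
};

// Returns the name of the rule that wins at the start of `input` and stores
// the matched length in `length`. Returns "" when no rule matches.
std::string pick_rule(const std::vector<Rule>& rules,
                      const std::string& input, std::size_t& length) {
    std::string winner;
    length = 0;
    for (const Rule& rule : rules) {
        std::smatch m;
        // match_continuous anchors the match at the beginning of the input.
        if (std::regex_search(input, m, rule.pattern,
                              std::regex_constants::match_continuous)) {
            std::size_t len = static_cast<std::size_t>(m.length(0));
            // Only a strictly longer match replaces the current winner,
            // so on a tie the rule listed first is kept.
            if (len > length) {
                length = len;
                winner = rule.name;
            }
        }
    }
    return winner;
}

// Rules in the order they would appear in the flex file: keyword first.
std::vector<Rule> sample_rules() {
    return {
        {"CLASS",    std::regex("class", std::regex::icase)},
        {"OBJECTID", std::regex("[a-z][A-Za-z0-9_]*")},  // assumed id pattern
    };
}
```

On the input "class Main" both rules match 5 characters and CLASS wins because it is listed first. On "classes" the identifier rule matches 7 characters and wins on length. If the two rules were swapped, "class" would always come back as OBJECTID.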

You also have to add REs to match empty lines, white space and everything else (see PA1.pdf, 4.4):

\n           { curr_lineno++; } /* advance line counter */
{WHITESPACE} { } /* skip whitespace */
. { ... } /* return an error */

We don’t consider REs for strings and comments in this post — they are the most difficult part of the assignment and we’re not going to uncover too much.

  • User subroutines. This section primarily relates to strings, so we omit the description of possible functions.

how to test a lexer?

It seems that the best way to test the lexer is to use the input files generated by pa1-grading.pl (not test.cl). After you run it, a grading directory is generated that contains a lot of *.cool files to test your lexer.

$ perl pa1-grading.pl
$ ls grading | grep '\.cool$' | wc -l
63
$ ls grading | grep '\.cool$' | grep comment
bothcomments.cool
comment_in_string.cl.cool
endcomment.cool
escaped_chars_in_comment.cl.cool
longcomment.cool
multilinecomment.cool
nestedcomment.cool
opencomment.cool
stringcomment.cool
twice_512_nested_comments.cl.cool
weirdcharcomment.cool
$ cat grading/comment_in_string.cl.cool
"string (* 123 *) string"
"string \
(* 123 *)"
"string (* \
123 *)"

You can find a description of these 63 tests in the file cases:

all_else_true.cl.cool; 1; lowercase and uppercase combinations
arith.cool; 1; arith example
atoi.cool; 1; atoi example
backslash.cool; 1; various backslashes in strings
backslash2.cool; 1; more backslashes in strings
badidentifiers.cool; 1; bad identifiers
badkeywords.cool; 1; bad keywords
...

To test the files, just run your lexer on them:

$ ./lexer grading/all_else_true.cl.cool

Appendices

A. Tokens

This is the list of tokens that our lexer returns to the parser, from cool-parse.h. We use them as return values for our rules in cool.flex. Attributes for tokens are stored in the global union cool_yylval.

see here

B. Variables

see here
