DV1584 / DV1585

Compilation Technologies

Dr Andrew Moss

10:15 - 12:00 Friday 17th January, 2018

Room J1620

Lexical Analysis II

1. Nondeterminism

  • Consider lexing a simple language.
  • Four kinds of tokens.
  • Strings have overlapping prefixes.
  • e.g. inf is a valid identifier.
  • But the first two characters could be a keyword.
  • Which transition should the machine take on i?
  • Similar problem with 3 and 3.14.

2. Nondeterminism (continued)

  • Nondeterminism arises in many CS problems.
  • In a deterministic problem we know which decision to take.
  • Non-deterministic problem: one choice will turn out to be right.
  • Similarity to models of randomess.
Executing non-deterministic machines
  1. Guess ("speculate"). Record what we did, back-track if it doesn't work.
  2. Transform the problem into another form (determinisation).
  • The first technique is used for branch-prediction in processors.
  • We'll look at an example of the second technique.

3. Determinisation

  • In order to execute the NFA without speculation: convert to DFA.
  • This is called the powerset construction.
  • Where there is ambiguity on the NFA about the current state.
  • Introduce a state representing the set of possible states.
  • Details in textbook: section 3.7.1

4. Longest Valid Lexeme

  • Once the machine is determinstic we can still hit ambiguity: valid prefixes.
  • DFA will progress further, but then get stuck.
  • Fix: modify the DFA execution.
  • We must record the lexeme being scanned anyway - not much overhead.
  • Always run until it can get no further.
  • Walk backwards through the results, find longest match.

5. Implementing a regex matcher

Background reading: Classic implementation in C

Compact, elegant, minimal.

  • Input is the ascii rendering of regex (no parentheses).
  • One of many ways to implement regex.
  • Today I want to show you three other ways.
  1. Implementing a DFA as control-flow.
  2. Data-driven interpretation of DFA.
  3. Implemention using dynamic despatch in C++.
  • Please compare the different styles: core of an interpreter.

6. Example DFA for implementation

  • Use same DFA for each style: easier comparison.
  1. Start with regex: [0-9]*\.[0-9]+.
  2. Factor out the +to get [0-9]*\.[0-9][0-9]*
  3. Convert to a DFA
  4. Draw the graph as a jump table.

Table is almost code already...

7. Implementation 1: DFA as control-flow

  • DFA is a simple machine.
  • Can implement each state in C.
uint32_t position = 0; char *tape = "3.14"; bool nextHead(bool accepting) { if(tape[position+1]==0) return accepting; position++; return true; }
bool state0() { switch(head) { case '0': case '1': ... case '9': return nextHead(false) && state0(); case '.': return nextHead(false) && state1(); default: return false; } }

Implementation 1: DFA as control-flow

This is a direct implementation of the DFA: one particular DFA is being translated into C code. The logic encoded in the DFA (as the structure of the graph) is being converted to control-flow in the C code. Each state in the graph becomes a single procedure, each transition becomes the call inside a switch-case.

During execution the path through the graph will become the activation sequence of calls to procedures. The depth of the call-stack will be \\((n\\)), where \\((n\\)) is the length of the tape (input string). This creates the real possibility that the program will crash if the input is too long for the maximum depth of stack.

The return value for each procedure is a boolean: whether or not the remaining symbols on the tape matched against the machine from this point. These calls are threaded together recursively inside the implementation. This simplifies the calling sequence as we can reduce it to shortcut logic inside the return statement in each case.

8. Implementation 2: Data-driven table

  • Description of each state in a struct.
  • Flag for accepting.
  • Number of outgoing edges, array of transitions.
  • Graph description table corresponds to declaration.
  • Number of each state implied by index in machinearray.
  • Needs to match integers with target state in transitions.
struct state machine[] = { {false, 2, {0, "0123456789"}, {1, "." }}, {false, 1, {2, "0123456789"}}, {true, 1, {2, "0123456789"}}}; char *tape = "3.14"; struct state* cur=machine;

9. Implementation 2: Data-driven code

bool check() { for(int i=0; i<cur->num; i++) if( contains(cur->edge[i].labels, *tape) ) { cur = machine + cur->edge[i].target; tape++; return true; } return false; }
  • Decide the current state accepts a symbol from the tape.
  • Update curto take the transition.

Implementation 2: Data-driven code

bool check() { for(int i=0; i<cur->num; i++) if( contains(cur->edge[i].labels, *tape) ) { cur = machine + cur->edge[i].target; tape++; return true; } return false; }

The check procedure decides if the current state of the machine can accept a character. It relies on the global variable struct state *curpointing to the current state within the machine[]array. As in the flowchart description of the machine it needs to decide if there is an outgoing transition labelled with the current symbol on the tape.

We assume that containsis a simple function that checks if the single character is in the string passed as the first argument. If we find a transition with a label string containing the character then we take the transition by updating curto point to the correct index in the machine[]

bool match() { while( *tape !=0 ) { if( !check() ) return false; if( *tape==0 ) return cur->accept; return cur->accept; } }

To perform a complete match we wrap up the calls to checkwith code to check for the end of the input. We can use the accepting flag to decide if the machine finished in a valid state.

10. Implementation 3: Dynamic Despatch

  • Different version: OO representation of regex.
  • Equivalent to a form of parse tree.
  • Code is an explicit interpreter.
  • Start with an ABC to define interface: Regex class.
  • Return number of characters matched (negative on reject).
  • Constructors take any data to instantiate that particular regex: e.g. chars in [0-9]
class Regex { public: virtual int match(char const *) = 0; }; class CharClass : public Regex { public: std::string contents; CharClass( std::string c ) : contents(c) {} };

11. Implementation 3: character classes

CharClass digit( "0123456789"); int CharClass::match(char const *text) { return contents.find(*text) == std::string::npos ? -1 : 1; }
  • Call match interface with the remaining source string.
  • Character class can only match a single character in the source.
  • Decision: is next source character in the character class?
  • Result: either match one character, or reject.
  • Convention: negative indicates failure, non-negative is success

12. Implemention 3: non-atomic cases

  • A form of loop around the wrapped regex, can succeed 0 times.
int Star::match(char const *text) { int n, consumed = 0; while( (n = operand->match(text)) > 0 ) { consumed += n; text += n; } return consumed; }

Implemention 3: non-atomic cases

int Star::match(char const *text) { int n, consumed = 0; while( (n = operand->match(text)) > 0 ) { consumed += n; text += n; } return consumed; }

The star in regex is not an atomic regex, but a modifier (a unary operator) that calls another regex. If we assume that the operand member is initialised in the constructor to this other regex then we can implement the logic for a star as calling this contained regex as many times as it will succeed.

The match member in the Star class keeps track of the number of characters that the contained regex matches so that it can provide the same interface to its caller.

We need one other non-atomic case to show the example, this is a sequence of regexs implemented in a class called Seq. This corresponds to the \\((\cdot\\)) operator in the regular expressions. For simplicity (in expressing regexs in this form, rather than implementation simplicity) we store a list of contained regexs inside a single instance of the Seq class. This corresponds to a chain of concatenations in a regular expression.

int Seq::match(char const *text) { int chars,consumed = 0; for(auto c:cells) { if( (chars = c->match(text)) < 0) return -1; consumed += chars; text += chars; } return consumed; }

13. Implementation 3: Static Declaration

CharClass digit("0123456789"), dot("."); Seq number({new Star(&digit), &dot, &digit, new Star(&digit)});
  • The static data declaration forms a tree.
  • Each object is a piece of regex, implements the ABC.
  • The code that we've already seen will interpret it.
  • The constructor for the Seq class uses a feature new in C++11 (initializer_list).
  • Naturally recursive structure to interpret trees.
  • This tree would be the result of parsing the regex input.

14. Discussion

  • We've seen the same machine implemented in four styles.
  • Each solves the same problem:
  1. How do we express the semantics of the machine? As control-flow, as data..
  2. How do we implement the logic to execute a DFA? (i.e. turn the flow-chart into code)
  3. Do we target C, C++?

We looked at this in depth in order to do two things at once:

Understand how a DFA is executed.

See a preview of how an interpreter works.


We've seen four different forms (or styles) of solving the same problem: how to execute one particular DFA. The chosen DFA corresponds to a regex and must be expressed in some way. The classic version used an ASCII rendering of the regex as the input to specify what to do. This makes sense in the context of a grep tool as this is the user-interface to the tool. The three implementations that we've looked at handled this in different ways:

  1. To encode the regex structure directly into the control-flow of code. We can see this as the output of a "compiler-like" tool that is producing code to solve the problem. It is somewhat harder to read the original regex in this form as it has been mixed up with a template for executing each state.
  2. To encode the regex structure directly into a table. This captures the meaning of the particular regex as data that a program can manipulate, the code is then the implementation of all DFA in general, and is driven by the particular data that it sees in the table at run-time.
  3. To encode the regex structure directly into a tree, using an object-oriented formulation. A section of the flow-chart expressing the actions of a DFA is then extracted as the interface to an abstract base class (ABC) and implementations provided for each of the regex structures.

Both implementations 2 and 3 are interpreting the data that describes the DFA structure to execute the regex. The form in implementation 2 looks more like a low-level language (i.e. the state machine is encoded as a jump table describing the transitions between states). This is broadly equivalent to a low-level representation in assembly language. The tree in implementation 3 is an early example of what a parse-tree looks like and the style of interpreter is broadly equivalent to how we would interpret a high-level language. This is an early preview of what we will look at in lecture 6 when we see how to write such an interpreter that does execute a parse-tree.

Many different problems in CS are solved using one of these four styles because there is an inner kernel to the problem that is equivalent to interpretation. This observation has been made many times and it is most often refered to as:

Greenspun's 10th rule

Any sufficiently complicated C ... program contains an ad-hoc, informally specified, bug-ridden, slow implementation of half of Common Lisp.

15. Flex


The steps that we have seen so far can be combined into a tool. The input is a file that describes the syntax of a language - the regexes that match each token and a description of the kind of token that is output from the lexer. The output is code - a .c language file that implements the machine. If we compile this and link it into our program then it will act as a lexer for the language that we have defined.

The level of formality in introducing the regular expresions and machines may seem excessive at first - but this level of specification is what allows us to completely automate the process. This task has already been done for us and we will use the resulting tool "flex" (fast lexical analyser) to write lexers.

Flex is also the first compiler that we see on the course. It takes an input language, converts it into a intermediate representation to perform processing upon (NFA / DFA) and then outputs an equivalent program in another language (C).

16. Tools: Lex / Flex

  • Input file takes the format:
Definitions (names for regexs) %% Rules (processed by lex) %% Code (C copied directly into generated scanner)
  • To build the scanner code is a two-step process:
flex filename.lex # Outputs into a file called lex.yy.c gcc lex.yy.c -lfl # Build the scanner code

Tools: Lex / Flex

The input definition file for flex is split three parts. Definitions allow pieces of regex to be named for reuse in multiple rules. The rules themselves define the types of tokens that are matched. The final section allows helper procedures to be defined that will be copied directly into the generated output file.

The style of coding is probably unfamiliar to a student taught a modern way of coding. There is no output format (either explicitly or implicitly). Instead you specify a piece of code that is called when the regex is matched.

These fragments of code are injected directly into the implementation of the machine that Flex will generate. These pieces of code can do anything that you can access through a global variable. This is a powerful mechanism for specifying the output format, as you write the pieces of code that would generate the token stream. Contrast this to the modern approach of specifying a data-structure that would be used to output values to the caller of the lexer.

Building the lexer will normally be taken care of inside the Makefile for the project that you working on. The raw steps to use in the Makefile are calling Flex to build the .c output code, and then compiling the generated code against the rest of your project. If you place a main procedure in the final section of the lexer definition file then the whole program can be build using the commands on the slide.

17. A Flex example

As a first example we create a file hex.lex, thusly:

DIG [0-9a-f] %% 0x{DIG}+ %% int main(int argc, char **argv) { yylex(); }

A Flex example

The definition in the first section names a regex DIG. These named regex can be used in the rule section by placing them inside cursive brackets {DIG} In this simple case we could skip the definition and simply use the whole regex inside the rule 0x[0-9a-f]

We have skipped the piece of injected code for this first example, so the generated lexer will recognise the regex but not call any user supplied code when it does.

18. Executing the example

  • So now we build the scanner and test it:
$ flex hex.lex $ gcc yy.lex.c -lfl $ ./a.out # No output, but prompt changes (empty) blahBXY0xff+0x01211ffffG'), # We type this as input blahBXY+G # Printed as output ^-d # Type control-d to exit
  • So the default action looks like it deletes the recognised text.
  • Actually it is unrecognised characters echoed to stdout.
  • We close the input stream directly to exit.
  • Somewhat important to get at the tokens we want...

19. Retrieving the tokens

  • No output datatype or API.
  • But we can inject arbitrary code into state machine.
  • We are embedding code in the definition.
  • Common style for a code generator: avoids overhead of calls.
DIG [0-9a-f] %% 0x{DIG}+ { printf("Tok: %s\n", yytext); } %% int main(int argc, char **argv) { yylex(); }

20. Code insertion

  • Embedded code in definition
0x{DIG}+ { printf("Tok: %s\\n", yytext); }
  • Small fragment of generated program
case 1: YY_RULE_SETUP #line 3 "hex.lex" { printf("Tok: %s\n",yytext); } YY_BREAK
  • The char *yytextgives access to the token characters.
  • The #linemacro should affect error messages.

Code insertion

The generated code is large and ugly. It is 1749 lines of state machine that is designed to be read by a machine rather than a human. This is ok as we rarely need to read it. Once we understand how to use Flex it would be rare to look inside the generated code unless something very unusual goes wrong.

The #linemacro is for the C preprocessor. If we make a mistake inside the embedded code it helps gcc to format the error messages. Instead of being told that the error is on line 38232 of the output file it tells us where the syntax error is inside the definition file.

The cursive brackets around the embedded code are optional, but it is good practice to use them. If you declare any variables they will mask declaration inside the code that you cannot see instead of overwriting them.

After all of the conversion the bulk of the generated code is a very efficient switch-case jump table for the different states in the machine. The only part that we really need to access is the *yytextvariable to access the text that matched the regex.

21. Common constructs: whitespace

  • Insignificant whitespace breaks token that have overlapping character sets.
  • Whitespace is typically discarded.
  • In flex we can discard simply: no action.
[a-zA-z0-9]+ { proc(yytext); } [ \t\n]+ {}

Common constructs: whitespace

Consider how we can split up the input xis5plus7. Separating the constants from the text is easy as they use disjoint alphabets. But should the split sequence read x israther than xis? We can reason about this from context from a simple machine such as a DFA cannot.

Whitespace plays an important role in language: we can use it as an explicit OOB marker to split symbols. This matches exactly the way that it is used to split words in natural-language, and so people recognise it intuitively for this purpose.

When whitespace is used to change the boundaries in the token stream, but we do not care about the whitespace characters themselves we call it insignificant whitespace. Once we have used it recognise the token boundaries we do not need it any longer. It is typical for it to be included in a lexer definition so that we can recognise the other tokens correctly, but then throw away or discard the whitespace tokens themselves.

Typically this is handled by the use of separate channels between the lexer and the parser. Whitespace is placed on a discard channel where the parser will ignore it. Sometimes this discard channel is processed by tools, for example in doxygen the whitespace may be reused to quote fragment correctly in the generated documentation.

22. Common constructs: significant whitespace

  • Sometimes we care about the whitespace.
  • Languages such as Haskell or Python: indenting.
  • This is tricky to handle in a lexer.
  • Meaning of sequence changes.
  • Easiest to capture using an anchor.
  • Non-anchored case for insignificant whitespace.
  • Need an extra newline at start.

Common constructs: significant whitespace

Modern languages try to reflect the two-dimensional structure that we see when we read them, onto the meaning of what they represent. The functional language Haskell, and the OO language Python both use the size of the indent (amount of whitespace at the beginning of each line) to indicate the block structure within the language.

This is very intuitive for a human to read as it tends to be how we express block-structure even in languages that have explicit markers (such as ;and {..}in C++). But this structure is only implicit in the character stream (a one-dimensional structure) and we must recover it by noting the relationship between newlines and spaces.

An easy way to achieve this in a lexer is to use two rules to capture whitespace, a tighter rule that captures indents using an anchor at the beginning of the line and a looser rule that captures other insignificant whitespace. The different actions in the two rules can then insert a record of the indent size into the token stream or discard the whitespace as appropriate.

^[ ]+ { build_indent(yytext); } [ ]+ {}

An alternative way to solve this issue relies on putting more C++ code within the lexer action. Typically we record the line number and character offset for each token the lexer generates. This is so that we can emit decent syntax error messages to the user. To do this flex has a default action that is executed after each character.

%{ // Code to run each character xPos++; %} \n { xPos=0; yPos++; } [ ]+ { if(xPos==0) build_indent(yytext); }

The tradeoff here is partly performance and partly complexity. The second piece of code is clearly more complex and involves the interactions between more parts of the lexer. It also involves some stepping action (that may be more complex than the variable increment shown) being called on every character the lexer processes. The performance impact depends on whether or not we need to do this anyway (e.g. for error reporting) and how and the understandability issue depends on the the complexity of the syntax for the language. Both approaches are shown because each is better in different scenarios.

23. Common constructs: comments

  • Allow programmer to manually access discard channel.
  • Human readable text to be ignored by compiler.
  1. Line-based, e.g. //or #
  2. Enclosing, e.g. /* ... */
  • Both kinds have issues, complex cases in lexer.
  • Escaped newlines?
  • Should they nest?

24. Common constructs: comments II

  • No elegant solution.
  • Approach that work in some way.
  • Treat comment up to newline as single token.
  • Trick for skipping escapes
  • Ignore nested enclosing comments (C).
  • Hack something very ugly in...
  • Integrate lexer with parser..

25. Common constructs: strings with states

  • Reasoning about behaviour of matcher inside regex is hard.
  • Thinking up these tricks is harder than using them.
  • Alternative approach is to use states.
  • A state machine we can layer on-top of the one generated.
  • BEGIN is a macro, changes flags.
  • <list> is a guard.
  • Regex only matches in listed states.
<INITIAL>" { BEGIN(STRING); } <STR>" { BEGIN(INITIAL); } <INITIAL>[^#] { stuff...} <STR>[^"#] { other stuff...} <INITIAL,STR># { both cases.. }

Common constructs: strings with states

Strings tend to be a little bit more complex than comments, as most language have a larger set of escape sequences that are active within strings. We can approach them in the same way as comments and use a variation on the same trick. i.e. a regex with constant strings at the beginning and end with a larger choice in the middle of the various length escaping sequences.

This approach gets increasingly difficult as more parts of the syntax interact witho one another. There is an alternative. We can layer a state machine over the regular expressions. Flex will give us guards: filter regex so that they only match within the states that we have defined. There is a macro that we can use within the injected code actions to change the state in this user-defined state machine (BEGIN()). Flex will stitch this extra layer of states into the machine that it generates for us.

Now we can use an approach that scales up more easily to a complex language syntax by introducing two different states:

This allows us to defined two different actions for any regex case - what it should do inside the string and what it should do outside the string.

26. Common constructs: the lexer hack

  • Strings / comments are close to the boundary for lexing.
  • There are some things we cannot do.
  • Valid C code:
typedef unsigned long long Uint64; Uint64 x; int Uint64;
  • The same string Uint64 means two different things.
  • Typename, vaariable identifer.
  • Ideally these are two different token types.
  • But the lexer cannot decide without access to a table inside parser.

Common constructs: the lexer hack

It is worth remembering that the division between lexical analysis and parsing is somewhat arbitrary. It exists to simplify the problem:

  1. Identify a stream of tokens from the stream of characters.
  2. Use this to generate syntax errors if the characters do not form valid tokens.
  3. Use parsing on the stream of tokens to find more complex errors.

This division made more sense 30-40 years ago when memory contraints were much tighter and it was difficult to fit both the lexer and the parser in memory at the same time. Today we have more memory than we know what to do with, and the division may not simplify the problem as much as we hoped.

An active are of research is "scanner-less parsers" where the low-level lexing definitions are combined directly into the language grammar for parsing. This can be seen in tools such as ANTLR. We stick to two separate tools on the course for pedagogic reasons: the first time that you write a parse it makes sense to look at two smaller problems. I've designed the assignment so that you can handle strings / comments directly inside the lexer and there is no lexer hack to worry about.

27. Summary of lexical definitions

28. Summary

  • Two lectures on lexical analysis.
  • Multiple representations of same entity:
  1. Regex
  2. Regular expressions
  3. DFA / NFA machine
  • Flex implementation of translation.
  • Switching between representations to solve problems.