10:15 - 12:00 Wednesday 17th January, 2018

- We look at lexing over two lectures.

- Problem and Examples.
- Definition of regular expressions and DFA
- NFA, conversion back to DFA.
- Implementation styles for DFA.
- How to use Flex (tool that does the above).

- Lab Task 1: Write a lexer for bash.

- DV1584
  - Complete in lab, checked off by TA for rollcall.
- DV1585
  - No hand-in.

- Syntax of POP3 as an example.
- Boxes drawn around logical entities (messages).
- Structure is split in two ways.

- Newlines / words (low-level).
- Implicit message boundaries (depends on exact words).

- Structure roughly models human language: paragraphs, sentences, words.

We use the POP3 protocol from email as a first example of a language structure. The raw text inside each line is what is transmitted within the protocol. The colouring of the lines and the S: / C: prompts are simply for readability to show what the server and client send to each other. Parsing the syntax means recovering the boxes drawn on top of the sequence.

The boxes show the logical entities in the protocol, the messages being communicated. There are two kinds of structure within the language.

- A low-level structure made up of newlines, spaces and the words in between them.
- A high-level structure that associates lines with the different messages in the protocol.

The syntax of PDF files is more complex than POP3. It is also an example of a *serialised format* (i.e. a file format for data we store on the disk). The complexity of the format increases the number of different boxes, or logical entities, that can be represented in the file. The meaning (or *semantics*) of each box is also more complex as many different types of data can be embedded inside the PDF (bitmaps, fonts, graphical transformations, simple drawing programs).

As in the POP3 example there is a simpler low-level structure of the boundaries between the individual words in the document. Again, white-space characters (newlines, tabs and spaces) are used to separate different words. The words can be recognised as different kinds of token by the presence of specific characters (e.g. the variations of punctuation shown in the diagram).

- General model: boxes in boxes.

Sequencing

Linear sequences of boxes. Either a fixed number or variable.

Choice

Different types of boxes, indicated by labels / shapes / colours etc.

Nesting

Putting boxes inside one another. (**parsing, ignore for now**)

- In lexical analysis we want the structure of the innermost boxes: where are the boundaries between them in the text?

The model that we use to draw the logical boxes over the top of the text has three ways of arranging the boxes:

- Sequences of boxes: when we have identified the boundaries between individual words they form linear sequences in the text. These can include fixed-length words, or variable-length words. This depends on the particular language that we try to recognise.
- Choices of boxes: in a sequence we may recognise different kinds of boxes. For example, in a programming language we may recognise a "word" that consists of digits as a numerical constant, or a "word" that consists of letters as the identifier of a variable. In the syntax for the language we can express choices between different kinds of boxes that can appear at a particular point in a language.
- Nesting of boxes: the more complex logical entities are composed of simpler kinds. For example a function in a language may be made of individual statements. Those individual statements may be made of individual expressions, identifiers and constants.

For this lecture and the next we look at lexical analysis: recognising the boundaries between the simplest (atomic) boxes that have no internal structure. This means that we look at how to express sequences and choice in a language syntax. The problem of nesting is left until we look at parsing in lectures 4 and 5.

- There are three approaches to defining boundaries.

- Restrict the band: OOB markers.
- Disjoint alphabets.
- Explicit length markers.

- Languages may mix these approaches together.
- In Information Theory approach 3 is called "framing".
- The other two approaches are examples of "self-framing".

The underlying context for recognising sequences of symbols is called Information Theory. This is a rich and deep area that we can only touch upon briefly in the course. The set of symbols that can be transferred is called the *band*, and there are two techniques in coding that are relevant in defining and recognising syntax.

If we think of the set of symbols that we use in a language as an alphabet then common choices are ASCII or UTF-8 as the alphabet. A particular atomic box or **token** does not need to allow every possible symbol in the alphabet. For example, identifiers of variables may allow the alphanumeric characters ([0-9a-zA-Z]). When we see consecutive symbols in this restricted subset of the alphabet they are encoding **in-band** information. When we encounter a symbol outside of this set (e.g. a character such as +) we have encountered an **out-of-band (OOB)** character and we can use this to infer that the token is finished and another has begun.

When neighbouring tokens in the language use different restricted sets of the alphabet we are using an encoding technique called *disjoint alphabets* to break the stream of symbols into tokens. A similar technique is to reserve particular characters that cannot appear within tokens and use them to break tokens (e.g. spaces as markers and the characters between spaces as words). These are called *OOB markers*.
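A minimal sketch (not from the lecture) of a tokenizer that combines both self-framing techniques: spaces as OOB markers, and a letter/digit split as disjoint alphabets:

```python
def tokenize(text):
    """Split text into tokens using spaces as OOB markers and
    disjoint alphabets (digits vs letters) as token boundaries.
    For simplicity, every non-digit non-space symbol is treated as
    belonging to the "letters" alphabet."""
    tokens = []
    current = ""
    kind = None  # alphabet class of the current token
    for ch in text:
        if ch == " ":                 # OOB marker: always ends a token
            if current:
                tokens.append(current)
            current, kind = "", None
        else:
            new_kind = "digit" if ch.isdigit() else "alpha"
            if current and new_kind != kind:   # disjoint-alphabet boundary
                tokens.append(current)
                current = ""
            current += ch
            kind = new_kind
    if current:
        tokens.append(current)
    return tokens
```

Note that "42abc" splits into two tokens without any separating space, because the token boundary is inferred from the change of alphabet.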

A more complex technique is encoding a length marker in the stream and using it to count a number of symbols that follow in the stream. This technique is used heavily in communication protocols, but less so in file formats as it makes the machine that recognises the language significantly more complex (and difficult to verify).
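A small sketch of the length-marker technique, using a hypothetical one-byte length prefix per frame (purely for illustration, not a real protocol):

```python
def decode_frames(stream: bytes):
    """Decode a stream of length-prefixed frames: each frame is one
    length byte followed by that many payload bytes."""
    frames = []
    i = 0
    while i < len(stream):
        length = stream[i]                   # read the length marker
        payload = stream[i + 1 : i + 1 + length]
        if len(payload) != length:
            raise ValueError("truncated frame")
        frames.append(payload)
        i += 1 + length                      # skip to the next frame
    return frames
```

The recogniser has to carry a counter in its state, which is exactly what makes such machines harder to verify than the self-framing kind.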

- Looking (only) at the statement structure.
- We see examples of boundary definitions.
- **Keywords** are character sequences that cannot clash.
- OOB with respect to the normal sentence contents.
- Curly braces / semicolons: examples of OOB markers.
- Illegal within sentences, form explicit sentence boundaries.

- We have three different ways to represent the same thing.

- Regex
- Linear equation (regular expression)
- A machine (either a DFA or an NFA)

- There exist *bijections* (1:1 mappings) between these classes.
- Regex is a syntax to render regular expressions as plain text.
- The machine is executable - it has semantics.
- Each machine implements a particular regular expression.
- If we execute the machine it **outputs a single bit**.

| Output | Meaning |
| --- | --- |
| True | The regular expression matches the input string |
| False | The regular expression rejects the input string |

Problem: Matching

Input is a pattern, and a source text. Output is boolean: does the pattern match a prefix of the source?

Problem: Searching

Input is a pattern, and a source text. Output is a (possibly empty) list of positions where the pattern matches in the source.
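Both problems can be illustrated with Python's `re` module, where `re.match` anchors at the start of the source (prefix matching) and `re.finditer` performs searching:

```python
import re

pattern = r"[0-9]+"
source = "ab12cd34"

# Matching: does the pattern match a prefix of the source?
matches_prefix = re.match(pattern, source) is not None
# False: the source starts with 'a', not a digit.

# Searching: all positions where the pattern matches in the source.
positions = [m.start() for m in re.finditer(pattern, source)]
# [2, 6]: "12" starts at index 2 and "34" at index 6.
```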

- The pattern describes a language: family of strings.
- A string is a match if it is in the set.
- Example: alphanumeric strings (i.e. identifiers).

Infinite set of strings: "a" "b"... "ax3"... "blah17"...

- I assume you know how to use regex, either:

- Background knowledge - quite easy, useful.
- Introduction to Linux course.

- If not please read this lecture.
- I will go through the definition of regular expressions.
- Maths that underlie regex.
- Normally used over characters, but can be defined over other alphabets.

pushl %ebp;movl %esp,%ebp;.*;leave (matching instruction-strings)
body/(span/style|div/id)* (matching node-string paths in html-trees)

- A regular language is a linear algebra defined over a ring.
- The set \\((\Sigma\\)) is called the alphabet.
- The language is a set of strings made of the symbols in \\((\Sigma\\))

e.g. (using ASCII as \\((\Sigma\ \\)))

- \\((L = \{ \mathrm{hello}, \mathrm{world} \}\ \\))(language with two strings)
- \\((L = \{ \mathrm{x}, \mathrm{xx}, \mathrm{xxx} \ldots \}\ \\))(infinite language with strings made of xs)
- \\((L = \{ \mathrm{1}, \mathrm{2} \ldots \}\ \\))(language with ASCII renderings of integers)

- The algebra defines three operators over languages

| Operator | Notation |
| --- | --- |
| Concatenation | \\((L\cdot K\\)) |
| Union | \\((L\cup K\\)) |
| Kleene Star | \\((L^{*}\\)) |

- Standard concatenation of strings, e.g. \\((\mathrm{xy}\cdot\mathrm{z}=\mathrm{xyz}\\))
- Concatenation of a pair of languages is a Cartesian Product.
- Every combination of concatenating a string from each of the two languages.

Concatenation (of languages)

Let \\((X \cdot Y = \{ x\cdot y \; | \; \forall x \in X, \; \forall y \in Y \}\\))

e.g. (again using ASCII as \\((\Sigma\ \\)))

- \\((\{ \mathrm{use}, \mathrm{pain} \} \cdot \{ \mathrm{ful}, \mathrm{less} \} = \{ \mathrm{useful}, \mathrm{useless}, \mathrm{painful}, \mathrm{painless} \}\\))
- \\((\{ \mathrm{x}, \mathrm{xx} \ldots \} \cdot \{ \mathrm{y}, \mathrm{yy}, \ldots \} = \{ \mathrm{xy}, \mathrm{xyy}, \ldots \mathrm{xxy}, \mathrm{xxyy} \ldots \}\\))
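For finite languages this product is easy to compute directly; a minimal Python sketch of the definition above:

```python
def concat(X, Y):
    """Concatenation of two (finite) languages: every combination of
    a string from X followed by a string from Y (a Cartesian product
    under string concatenation)."""
    return {x + y for x in X for y in Y}

concat({"use", "pain"}, {"ful", "less"})
# {"useful", "useless", "painful", "painless"}
```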

- The other operator is union ("choice").
- If we take the union of two sets: we merge them into one.
- Union of two languages is all of the strings in both.

Union of languages is set union

Let \\((X\cup Y = \{ x | x \in X \ \lor\ x \in Y \}\\))

- For convenience, we will write strings as languages.
- These are simply singleton sets, e.g. \\((\mathrm{hej} \rightarrow \{ \mathrm{hej} \}\\))
- This means that we can write equations over languages / individual strings.

e.g. \\((L \cdot \mathrm{xx} \cup K\ \\))(strings in L followed by "xx", or strings in K)

- A ring is a structure that behaves like \\((+,\times\\)) over the integers.

Concatenation is multiplicative

The multiplicative unit (one) is \\((\epsilon\ \\))(the empty string).

- \\((L \cdot \epsilon = \epsilon \cdot L = L\\))
- Compare with integers: \\((1\times x=x\times 1=x\\))
- So we can think of exponents (a unary operator) over languages.

e.g.

- \\((L^{3} = L \cdot L \cdot L\ \\))(any three strings in L)
- \\((L^0 = \{ \epsilon \}\ \\))(zero-th power is the language containing only the empty string)

Oops

It was a long day recording and editing... this seems to have slipped through my cuts in Premiere...

- The Kleene star is a unary operator.

Kleene star is zero-or-more exponent

Let \\((L^{*} = L^0 \cup L^1 \cup L^2 \ldots\\))
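Since the true star is an infinite union, a sketch can only compute a finite approximation: the union of the powers up to some bound (my own illustration, not a complete implementation):

```python
def star(L, max_power):
    """Finite approximation of the Kleene star: the union of the
    powers L^0 .. L^max_power. The true star is the infinite union."""
    result = {""}          # L^0 = { epsilon }
    power = {""}
    for _ in range(max_power):
        # power goes from L^k to L^(k+1) by one more concatenation
        power = {p + s for p in power for s in L}
        result |= power
    return result

star({"xy"}, 2)
# {"", "xy", "xyxy"}
```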

- Think of it as a *wildcard* exponent - any power of the language.
- It evaluates to any-length sequence of strings in the language.
- Single string case: \\((\mathrm{xy}^{*} = \{xy\}^{*} = \{\epsilon\} \cup \{ \mathrm{xy} \} \cup \{ \mathrm{xyxy} \ldots \}\ \\))
- So it matches our expectation of how stars behave in regex.
- Why is it important that regular expressions have this algebraic structure...?

- We know that \\((L\cup K\\)) is defined as set-union.

- Associativity, commutativity and standard identities apply.
- Additive operator, unit (zero) is \\((\emptyset\\)), hence \\((L \cup \emptyset = \emptyset \cup L = L\\))

For regular expressions to behave as linear equations, we require:

Distributivity properties

- Left-distributive: \\((n\cdot (x\cup y) = n\cdot x \cup n\cdot y\ \\))
- Right-distributive: \\(((x\cup y)\cdot n = x\cdot n \cup y\cdot n\\))

- Informally these both ensure that, e.g., "string n followed by either string x or y is either n then x, or n then y"
- Both easy to prove formally (by rewriting in set-theory).
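The identities are also easy to sanity-check on small finite languages (a spot check over concrete sets, not a formal proof):

```python
def concat(X, Y):
    """Concatenation of two finite languages."""
    return {a + b for a in X for b in Y}

n, x, y = {"n"}, {"x1", "x2"}, {"y"}

# Left-distributive: n . (x u y) = (n . x) u (n . y)
assert concat(n, x | y) == concat(n, x) | concat(n, y)

# Right-distributive: (x u y) . n = (x . n) u (y . n)
assert concat(x | y, n) == concat(x, n) | concat(y, n)
```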

- Manipulating equations is normal and familiar.
- Regex and DFA can both be converted to regular expressions.
- We can operate on regular expressions as normal equations.
- Gives us another tool to use.
- Later, grammars are much weirder.
- Some familiarity with regular expressions now helps later.

- The alphabet is the set of characters (typically ASCII or UTF-8).
- We drop the multiplication operator, e.g. \\((x\cdot y\\)) <-> xy
- We write \\((\cup\ \\))as the vertical pipe |
- Single-character choices are so common we use square-brackets as a shortcut, e.g. x|y|z is [xyz]
- So we can map back and forth directly...

| Regex | Regular expression |
| --- | --- |
| [a-zA-Z]*[0-9]* | \\(((\mathrm{a}\cup \mathrm{b} \ldots \mathrm{Z})^{*} \cdot (\mathrm{0} \cup \ldots \mathrm{9})^{*}\\)) |
| 0x[0-9a-f]* | \\((\mathrm{0} \cdot \mathrm{x} \cdot (\mathrm{0} \cup \ldots \mathrm{f})^{*}\\)) |
| ([1-9][0-9]*,)*[1-9][0-9]* | \\((((\mathrm{1} \cup \ldots \mathrm{9})\cdot(\mathrm{0} \cup \ldots \mathrm{9})^{*}\cdot\mathrm{,})^{*} \ldots\\)) |
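The regexes in the table can be checked against sample strings with Python's `re` (the sample strings are my own, not from the lecture):

```python
import re

# Letters followed by digits, e.g. an identifier with a numeric suffix.
assert re.fullmatch(r"[a-zA-Z]*[0-9]*", "blah17")

# A hexadecimal constant: "0x" then hex digits.
assert re.fullmatch(r"0x[0-9a-f]*", "0x1a3f")
assert not re.fullmatch(r"0x[0-9a-f]*", "0xG1")   # 'G' is not a hex digit

# A comma-separated list of decimal numbers without leading zeroes.
assert re.fullmatch(r"([1-9][0-9]*,)*[1-9][0-9]*", "10,200,3")
```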

- Deterministic Finite Automata.
- Typically drawn as diagrams.
- Circles are **states**, label inside.
- State labels are relatively unimportant.
- Integers are conventional.
- Smallest number is the **starting state**.
- Arrows are **labelled transitions** between states.
- Arrow labels are symbols from alphabet \\((\Sigma\\))

- The states and transitions define the behaviour of the machine (which regular expression).
- To execute it we need state.

- Which *one* of the states is currently active.
- An input tape with a string of \\((\Sigma\\)) symbols to match.

- Tape is not random-access.
- Read *only* the single cell under the head, advance one position.
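The execution model above can be sketched as a table-driven interpreter (the transition table below is a hypothetical example DFA for the regular expression 0x(0|1)*, not one from the lecture):

```python
def run_dfa(transitions, accepting, tape, start=0):
    """Execute a DFA: one active state, read the tape one symbol at a
    time (no random access), and output a single bit (accept/reject)."""
    state = start
    for symbol in tape:
        if (state, symbol) not in transitions:
            return False        # no labelled transition from here: reject
        state = transitions[(state, symbol)]
    return state in accepting   # accept only if we end in an accepting state

# Hypothetical DFA for 0x(0|1)*: state 0 is the start, state 2 accepting.
trans = {(0, "0"): 1, (1, "x"): 2, (2, "0"): 2, (2, "1"): 2}
run_dfa(trans, {2}, "0x101")    # True
run_dfa(trans, {2}, "01")       # False
```

The transition table is a Python dict here; a dispatch table indexed by (state, symbol) is one of the standard DFA implementation styles the lectures return to.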

- We can translate regular expressions to fragments of diagram.
- Few cases to know about.
- Glue fragments together into a DFA.
- Approach seems informal - *correct by construction*.
- Concatenation turns into sequence of states.
- One transition per symbol, state between.

- Union is choice.
- Add multiple transitions with first label.
- DFA must have unique labels on edges from a state.
- Implies first symbol in each choice is unique.
- Final state in each case merges to following part of expression.
- Merging transitions have same labels.

- Stars turn into loops in the machine.
- *zero-or-more* so can accept immediately.
- When concatenated this means state can be skipped (outgoing edge as well as loop).
- When compound expression with a star, loop is over the embedded graph.

- Style of DFA described corresponds to **greedy** regex.
- Can run into problems during conversion.

i.e. the conversion described is not a surjective (complete) map

- Playing with equivalences on the algebraic form can fix this.
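The greedy behaviour itself is easy to observe with Python's `re` (a small illustration, not the Flex-based toolchain used in the course):

```python
import re

# A greedy star consumes as much of the source as it can:
m = re.match(r"[0-9]*", "123abc")
m.group()    # "123", not "" - the star takes every digit it can reach
```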