DV1584 / DV1585

Compilation Technologies

Dr Andrew Moss

10:15 - 12:00 Monday 15th January, 2018

Room J1270

Introductory Lecture

1. Overview


This course is designed to fit into the curriculum after your introductory programming courses. You are expected to understand programming in C++ at a basic level. The course will expand your understanding of programming in C++ concretely (and of other languages more abstractly). You will be introduced to Lua, where we will focus on a small subset of the language that is easy to compile.

It is helpful if you have completed the Introduction to Linux course, as we will be working within a Linux environment, and some background knowledge of regular expressions is useful. If not, the materials are online for you to review.


This semester we will look at the definition of Programming Languages: one of the fundamental subjects within Computer Science. We look at how the syntax of a language is expressed. This is the low-level textual structure of the language, and defines which source input strings describe valid programs. The syntax defines which kinds of structure and language constructs can be used by the programmer when they write programs. This syntax is checked by a piece of software called a parser.

Writing software that can process complex textual structures, and reduce them to crisp representations, is a general skill in programming. It can be used directly in programs that process other programs as input: interpreters, compilers and analysers such as linting tools or documentation generators. It can also be used directly in programs that operate on textual data, e.g. in writing loading / saving routines for different file formats. Indirectly the same skills can be used in writing any program that operates on complex data where the I/O is as complex as a textual stream.


Once we know that a source program has valid syntax the next step is to decide what it means, the semantics of the program. Typically this defines how the program should be executed, what kind of output it produces and what effects it produces in its environment. We look at the meaning of typical constructs in languages, and how to implement the code that executes them in an interpreter. Almost all non-trivial programming problems have a poorly specified interpreter embedded within them. Learning how to write a real interpreter makes it easier to extract these types of problems and solve them well.

It is typical to have a rudimentary understanding of recursion early in a career as a programmer. One of the outcomes of this course is extensive experience in formulating parsing and interpretation problems recursively. This valuable experience improves programming skills across a range of problems and languages.

We look at how different data-types and forms of control-flow are implemented in programming languages. This leads to a deeper level of understanding in fundamental programming concepts. In summary - students are expected to start the course with the basic knowledge of programming acquired during the introductory courses. We will extend this to a more refined understanding of programming in general.

Low-level programming

The final way that the course rounds off programming education is looking at the code-generation part of a compiler. We will see how the basic data-types and control-flow structures map into x86-64 assembly language. Some previous exposure to assembly language programming is recommended. This will be extended with details of the modern x86-64 programming model, and we will show how to interface it directly with C++ code using inline assembly. This is a valuable real-world skill that is useful in systems programming (e.g. Operating Systems and Drivers), IoT and other types of embedded software, and writing high-performance software.

Relationship to other areas

This subject relates to other subjects in the Computer Science curriculum:

Computer Science in general teaches the tower of abstractions that we use to tame the complexity of a modern computer system. We assume that students have already studied Computer Organisation and Operating Systems. This course closes the final holes in the lower levels of the tower - how programs are defined and executed. This shows how the mathematical essence of the subject connects to the real-world practical issues of building working software.

2. Translators

  • Program Transformation is concerned with programs that translate programs.
  • Translators are meta-programs.
  • Interpretation of a program executes it - we observe the effects.
  • Compilation of a program produces a program.
  • Static analysers give us information.


Program Transformation is the field of Computer Science that includes the study of compilers. It is generally concerned with the study of Program Translators. These are a form of meta-program: programs that operate upon other programs.

In general, when we execute a program we care about three things:

  1. The input data, that changes the behaviour of the program.
  2. The output data, that the program produced when executed.
  3. The observable parts of the program behaviour - the effects that it has on the computer that is executing it, e.g. changing parts of the file-system or launching other programs.

An interpreter is a program that executes a source program. It consumes any input data that the source program would consume. Any output that would be made by the source program is made by the interpreter on its behalf. Similarly, if the source program would cause any effects on the computer, the interpreter produces them. In some sense the interpreter is impersonating the behaviour of the source program for us. This impersonation is a process called simulation that we will discuss later.

A compiler translates a program from one language into another. Typically the source program is written in a language that is not native to the computer that we wish to execute the program upon. The target program is (typically) native to the computer, e.g. it may be a program binary written in machine code. We may think of this process as pretending to execute the program in a way that records each of the decisions that an interpreter would make based only on the source code, and then rendering these decisions into the form of an assembly-language program. This is a process that we will discuss in detail over the last six lectures of the course.

An analyser produces some form of useful information about a program that relates to its execution. For example, a linting tool may recognise dubious patterns in the code and report them. A bounds checker may attempt to make sure that all array accesses are within appropriate bounds. Other tools may produce information about the performance of the code, or how much memory it uses. We can think of this as using a similar process to a compiler - pretending to make the decisions required to execute the program and recording the results of those decisions.

3. What is a program?

What is a program?

A series of discrete actions that causes effects when executed.

  • We are familiar with this in the imperative paradigm.
  • Code: loops, conditional branches and expressions.
  • Data: 1D array of bits, treated as different types.
What is a language?

A way of expressing the actions that can be evaluated objectively.

4. C++ example

  • Simple running example.
  • Urgh - ugly layout style!
  • What does it do?
  • g++ t.cc && ./a.out
  • If we made the source nicer...
  1. Readable variable names
  2. Spacing / braces.
  • ... is it a different program?
#include <stdlib.h>
#include <math.h>
#include <iostream>
int main(int argc, char **argv) { int i,f,n=atoi(argv[1]); int u=(int)sqrt(n)+1; bool *ps = new bool[n]; for(i=0;i<n;i++) ps[i]=true; for(i=2;i<n;i++) for(f=2;f<u;f++) if(i%f == 0) ps[i]=false; for(i=2;i<n;i++) std::cout << i << ": " << ps[i] << std::endl; }

5. Lua example

  • Different language: is it the "same" program?
  • Same algorithm: Sieve of Eratosthenes
  • Same logical view of data / different concrete bits in memory.
  • Same logical view of control / different concrete constructs.
primes = {}
for i=1,arg[1] do primes[i] = true end
ubound = math.floor( math.sqrt(arg[1]) ) + 1
for i=1,arg[1] do
  for f=2,ubound do
    if i%f == 0 then primes[i] = false end
  end
end
for i=2,arg[1] do print(i,primes[i]) end

6. Both examples

Both examples

There is something common within both programs - even though they are expressed in different languages. We can visualise a "pseudo-code" structure that exists in both programs. The blue parts show some of the control within the program, compressed slightly for clarity on the slide. The middle blue box could be expanded further into smaller pieces that define the two loops.

The yellow boxes show an abstract view of some kind of state - in this case only an array of true / false values. We can think of the other state in the program (the upper bound and loop indices) as auxiliary data that is used to manipulate this representation of the output the program is creating.

Time flows loosely from top-to-bottom although the loops mean that some parts of the program will be repeated. The visible effects are not shown directly, we rely implicitly on you having some background knowledge of what "print" is likely to mean. This visualisation is not very precise:

  1. In order to be clear it must be simple - details are missing.
  2. The missing details are important for precision.

7. The definitional problem

  • It is hard to define "Program" and "Language" without circularity.
  • To do so we introduce a Machine.
  • Defined in Set Theory / Arithmetic.
  • Program is a series of changes in the Machine.
  • Language is simply the set of programs.

The definitional problem

It is difficult to define a program without making reference to the language it is written in. It is difficult to define a language without reference to its programs. There is some circularity between the two concepts. It is surprisingly difficult to define either properly without falling back on their underlying mathematical definitions. One of the reasons that compilers are studied in a CS curriculum is to give you a taste of the discrete maths and logic that forms the foundation of our subject.

Mathematics is what happens when you give an arbitrary set of rules to an intelligent agent and ask them two kinds of questions:

  1. What kind of result can occur if we follow these rules?
  2. What kind of result cannot occur if we follow these rules?

i.e. for an arbitrary set of concepts expressed as hard rules, what is the shape of the possible applications of the rules? This places a high burden on the mathematician in the process. They must infer exactly what is and what is not possible using information hidden inside the definitions. Extracting these outcomes is an interpretive process: the agent and their perception of the rules interact with the structure of the rules. This is why mathematics is generally performed by consensus: results are regarded as stronger when more minds have interpreted them.

If we keep in mind this view of mathematics as a game of achieving consensus by intelligent interpretation, communication and validation then we can see the origins of Computer Science as finding the answer to one particular question:

What if the agent is a very particular kind of stupid?

Imagine that the agent is perfect in one respect: the things that they can do, they can do perfectly. They will always reproduce the same answers to the same questions when given the same information. But in another respect they are defective: there is no intelligence behind their answers. They can only answer the simplest kind of question, but when they do so they can provide perfect answers.

This particular kind of agent is called a Machine. Machines arise by considering two properties of agents that interact with systems of rules:

  1. How much information about the system they can extract from the defining rules.
  2. How much we can trust the answers they provide.

The first property is minimised, while the second is maximised. This combination is chosen because it allows the machines to be physically realised - we can actually build and execute them.

8. What is a program ?

  • The execution of a program on a machine creates observable effects.
  • We can think of it as a mapping from input to output.
  • Some output may be explicit, e.g. writing values to a terminal.
  • Some output may be implicit, e.g. delays in time.
  • The definition of the program is relative to a machine.
  • Computers (physical hardware) are only one type of machine.

9. Machine

  • Mathematical formalism.
  • Input is a set of discrete values.
  • Output is a set of discrete observable effects.
  • Steps are discrete transitions: simple enough to perform mechanically (no "magic").
  • "Physically realizable": we could build it.
  • Constrained by Information Theory / Physics.

10. Program equivalence

When are two things equivalent?

When we can replace one with the other and not notice a change.

  • When are two programs equivalent?
  • It depends on what kind of changes we care about.
  1. Behavioural equivalence - same inputs produce same outputs.
  2. Data equivalence - data in memory goes through same states.
  3. Exact equivalence - exact mapping between execution steps.
  • These range from the weakest (1) to the strongest (3) definition.
  • Stronger definitions imply the weaker ones, e.g. repeating the same steps will perform the same changes to the data.

11. Simulation

  • For very simple Machines: a circuit can simulate it.
  • A hardware implementation.
  • Hard to do if the Machine is more complex.
  • As long as each execution step can be described exactly...
  • ... we can write a program to perform it ...
  • ... simpler Machine is programmed to simulate the more complex.


The diagram shows Machine A acting as a simulator for Machine B. This means that programs written for Machine B can be executed inside a simulator program executing on Machine A. The simulator will need code that implements the different instructions inside the simulated machine. The memory of the simulated machine must be represented within the simulator program - this is easier when the two forms of storage are similar, e.g. if both machines use an array of bits as storage then the memory of the simulated machine becomes a simple allocation of storage inside the simulator machine. The program of the simulated machine needs to be encoded in some form so that it can be stored in the memory of the simulator.

If a Machine is very simple then we can build a circuit that will simulate the behaviour of the Machine. For any program in the language of the Machine, the hardware circuit will map the possible inputs to visible outputs. Typically a Machine this simple will have a very small constant-sized state, and so we can build a very strong simulation that preserves the sequence of changes to this data that any program will perform. Processors are examples of this: the Machine is a definition of how the instructions in the instruction-set affect the data in the registers, and of when they cause transmission of signals to an external memory. This becomes much harder to get right as the complexity of the Machine increases. For all of the complexity of a modern processor (several billion transistors) the language that it simulates is relatively simple: a few kilobytes of internal storage and a set of operations to perform arithmetic and logic upon that storage.

More complex languages feature variable sized data-structures with richer sets of operations to perform upon them. It is difficult to design a hardware implementation of a more complex language without the design collapsing under the weight of its own complexity.

Instead we simulate a more complex language in software: we write a program that implements the different execution steps in the more complex language. This implementation is precise enough that we can simulate the exact changes to the data in the complex machine, and observe changes to this data step-by-step in the more complex language.

12. Difference between compilers and interpreters

  • Trivially, speed...
  • The number of simulations that we run.
  • Interpretation is a simulator running a simulator.
  • Compilation is the red dashed relation.
  • Convert from a program of one machine (language) into an equivalent program of another.
  • Translation is a one-time cost: do not pay overhead of extra simulation.

13. Difference between compilers and interpreters II

  • Compilation is global.
  • Interpretation is local.
  • The meaning (semantics) is fixed at "binding times".
  • "compile-time" and "run-time".
  • To change compiled code we need to recompile (the whole program).
  • To change interpreted code we update the simulation between steps.
  • "line by line" -> local steps in the machine simulation.

14. Binding time

  • Physically realisable implies no infinite regress in the steps.
  • We know the steps terminate - we can do things between them...
  • Run another program (embedding).
  • Change the code (modding).
  • Incremental changes, without restarting the host application.
  • Even after shipping - end-user modifications / scripting.

15. Front End

16. Language Power

  • A language is a set of strings.
  • All interesting languages are infinite.
  • Size is not a useful measure.

How precisely can we define the set?

  • More precise
  • We need to define which strings are in / not in the language.

17. Language Classes

  • We evaluate rules to recognise the language.
  • Complexity of rules determines the language power.
  • Simplest case has no nesting: lexer is collection of regex.
  • Context-free languages include nesting, machine is simpler than full computer (parsing).
  • Hardest bits are written directly in C++ - code to check / alter parse-tree.

18. Why we use a parse-tree

  • A proof that the source parsed.
  • Simplest data-structure that represents nesting.
  • Containment of boxes: which boxes are within other boxes.
  • Nesting in a tree: which nodes are beneath others.
  • Two representations of the same information.
  • Hence: the tree records the containment relation.
  • Shows structure in the source.

19. Why we use a lexer

  • Innermost boxes are atomic.
  • No internal structure.
  • Can be constant strings.

e.g. if { 'z'.

  • Can be formed from simple patterns.

e.g. 0x[0-9a-f]* [a-zA-Z][0-9a-zA-Z]*

  • Finding these strings first makes parsing simpler.

20. What is executability

  • Program and state are represented by some kinds of structure.
  • Local decision: execute each piece without the rest.
  • Every step is guaranteed to finish ("physical realisability").
  • Programs with recursion have a weaker guarantee.
  • Execution of a step will terminate if the corresponding step in the program does.

21. Course Structure

  • Course in two parts.
  1. Front-end: lexing, parsing, interpretation.
  2. Back-end: code-generation.
  • Examination is by viva.
  • Roll-call for DV1584 is the lexer lab.
  • DV1585 has no roll-call: use the lexer in the parser submission.
  • This is to balance 6cp vs 7.5cp, and to allow CivIng students to manage their time against parallel courses.

22. Lab 1: Lexing

Learn how to use flex.

Tool to lex source into a token stream.

  • Lectures cover two parts of the problem:
  1. Theory: how DFA / NFA are defined and work.
  2. Practice: how to use the flex tool, syntax.
  • Lab is a practical tutorial: lexer for bash.
  • Tutorial does not explain every step.
  • Closing the gaps is how you learn to use the tool.
  • Problem is similar to the assessment: generalise what you learn.
  • There are differences: test your understanding.

23. Lab 2: Parsing

Learn how to use bison.

Tool to convert a grammar specification into a pushdown automaton.

  • Again, lectures cover two parts of what you need:
  1. Theory: what are grammars and how are they defined, how do they relate to pushdown automata, how are they executed in a parser.
  2. Practice: common idioms in grammars, how they are used in parsers, techniques for working with grammars.
  • Tutorial in the lab is writing a parser for some of the bash shell.

24. Lab 3: Interpreter

Implement an interpreter in C++.

Show a style of implementation for interpreting parse-trees.

  • Lectures introduce:
  1. Theory: how interpreters work.
  2. Practice: particular style of implementing one in C++.
  • Lab tutorial shows how to convert your parser to a working shell.
Interpreter Assignment

Generalise what you have learned, same tasks on a part of the Lua language: lexer + parser + interpreter.

25. Programming (very abstract view)

  • As programmers we tackle many different problems.
  • But, really we only do two things.
  1. Solve a problem by choosing from pieces of functionality.
  2. Change the representation of the problem.
  • Typically we do the second to choose from a different set of functionality.

26. Programming

  • Activity 1 is our normal day-to-day (default) operation.
  • Reading APIs, plugging together functions, relatively easy.
  • Activity 2 requires experience of how different abstractions work.
  • This is typically harder to acquire: we need to sweat a little.
  • By understanding more abstractions, we write better code.
Motivation for the course

Code is really just another form of data. Learning how to represent and manipulate it changes the way that we see all programming problems.

  • It is one of the most important abstractions that we rely upon.