Avoiding The Needless Multiplication Of Forms

Automated Testing In Docker

Dr Andrew Moss

2015-09-12

The first batch of student assignments has been graded, and so this is the penultimate post in this series. Today I want to write a little about the experience of testing via Docker. The final post will come some time later, as I want to build a queuing system for tests - the long-term idea is that students will have access to the testing facility before their submission, rather than only a teacher using it before grading.

Battery consumption on the mac.

Using Docker on the Mac rapidly makes the Terminal application the most energy-intensive process on the machine. Opening Activity Monitor shows that the VBoxHeadless process keeps running, and consuming power, in-between test runs. The simple solution is to suspend/resume the virtual machine:
$ VBoxManage controlvm default pause
...
$ VBoxManage controlvm default resume

Underspecification in the assignment.

Students seem to have a natural genius for spotting parts of the assignment that are underspecified or ambiguous. Then, as a weakly coordinated group, they spread out among the possibilities that are just within the specification, and their work samples the boundary of what is allowed. It's like watching a particle filter explore environmental constraints. Anyway, the compiler assignment they were given this year forgot to mention some things.

False positives on plagiarism.

One of the onerous duties of a teacher is that we need to be aware of plagiarism and spot when work has been duplicated. We must also avoid false positives, as the process for handling student plagiarism does not abort cleanly (people are human and tend to get pissy when accused of something they have not done). In this batch it looked like there was a clear case of copying. Very clear cut. Almost too easy to spot.
Experience indicates that the more obvious something is, the harder we have to work to verify that it is true. So in this case the individual tests had to be rerun manually and the code examined against them to check what was happening. Luckily I have a high background level of disbelief, so after checking minutely within the testing environment it was time to dismantle it, then manually unpack and stage the submissions, to verify that the testing environment had not magically made their code look the same. This level of technical paranoia was justified: the testing environment had indeed altered their code to make it look the same. Bad, evil Docker.
So what happened? Recall that the file-system inside a container is a union-fs that stacks parts of the FS tree together from more basic pieces. Docker tries to be smart about this, so it keeps a cache of those previous pieces. Rerunning the same build instructions on the same source files (which have been reloaded with a fresh submission) does not propagate dependencies - Docker is not make. It does not check whether the source has changed; it has a different set of rules for when to rebuild dependencies and when to reuse the cached FS layer. Bad Docker.
When tools misbehave we just turn them off, and at least Docker is flexible:
$ docker build --no-cache -t=current_test .
This disables all caching and ensures that the container is built only from the submitted work and the pieces of the base image - never caching state from another submission. Phew, disaster narrowly avoided.
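For concreteness, here is a minimal sketch of the kind of Dockerfile such a build might use - the base image, package names and paths are my assumptions, not the actual assignment setup:

```dockerfile
# Hypothetical build for one submission; paths and packages are assumed.
FROM ubuntu:14.04
RUN apt-get update && apt-get install -y build-essential
# Everything from here down depends on the submitted files, so with
# caching disabled these layers are rebuilt for every submission.
COPY submission/ /submission/
WORKDIR /submission
RUN make
```

With --no-cache the COPY step and every layer after it are rebuilt from scratch each time, so no state can leak from one student's build into another's.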

Return codes.

As far as I can tell, Docker may be flaky about processing the return-codes of RUN commands within build scripts. Or not. The caching issue above may have caused the wrong executable to run. It is hard to pin down an exact failure, and the various bug-reports against Docker for this type of issue have been closed; this could have been a ghost issue caused by the caching problem above. Either way, it seems safer to write out the return-codes as a test result rather than using them to decide whether the test continues:
RUN ./comp /testcases/1.lua >redirect.dot; echo $? >/testresults/1.retcode
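The same pattern can be wrapped in a small helper so that every case is recorded the same way. A sketch - the helper name and the parameterised output directory are mine, not part of the actual test scripts:

```shell
#!/bin/sh
# Hypothetical helper mirroring the RUN line above: execute one test
# case, saving its stdout and its exit status to files instead of
# letting a non-zero status abort the build.
run_case() {
    outdir="$1"; num="$2"; shift 2
    "$@" > "${outdir}/${num}.out" 2>&1
    echo $? > "${outdir}/${num}.retcode"
}
```

Inside the Dockerfile this would be called as `RUN run_case /testresults 1 ./comp /testcases/1.lua`, and the build carries on regardless of whether the case passed.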

Performance

Well, it ain't quick. But on the other hand it is much faster than typing in the tests manually. Each run takes about 15 seconds at the moment (scraping the various filenames together and handling every variation in the underspecified assignment took many, many minutes by hand). It is still too slow to stick behind a CGI gateway and give students direct access to. One speed-up will come from combining the various RUN commands into a single shell command separated by semicolons, but it will still be slow enough to need some kind of batch system. Having up to 50 students hitting the testing server at the same time could introduce a latency of 10-15 minutes.
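The combined form might look like this (test-case names are assumptions, following the single RUN line shown earlier):

```dockerfile
# One RUN step produces one FS layer, instead of a layer per test case.
RUN ./comp /testcases/1.lua >1.dot; echo $? >/testresults/1.retcode; \
    ./comp /testcases/2.lua >2.dot; echo $? >/testresults/2.retcode; \
    ./comp /testcases/3.lua >3.dot; echo $? >/testresults/3.retcode
```

Each RUN instruction commits a new layer, so collapsing them trades layer granularity for build speed.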

Code that breaks

There are generally two ways that student submissions fail to terminate cleanly: they either crash or they hang. Programs that seg-fault show up in the build output; at the moment recording this is a manual step, and I need to find a way to do it automatically. Something very nice, and normally hard to do, is recovering a stack-trace for the failure.
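One way to automate the recording: POSIX shells report death-by-signal as 128 plus the signal number, so a crash can be classified from the saved return-code alone. A sketch - the labels and function name are my own:

```shell
#!/bin/sh
# Hypothetical classifier for a saved return-code. A process killed by
# a signal exits with 128+signo, so SIGSEGV (11) appears as 139.
classify_exit() {
    rc="$1"
    if [ "$rc" -ge 128 ]; then
        echo "crashed (signal $((rc - 128)))"
    elif [ "$rc" -ne 0 ]; then
        echo "failed (status $rc)"
    else
        echo "ok"
    fi
}
```

Run over the `*.retcode` files written during the build, this would flag seg-faulting submissions without reading the build output by eye.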
Running an interactive container on the current_test image allows the installation of gdb; I manually flip on debug info in the student makefile and then replicate the crash inside a debugger. This allowed me to email stack traces to students who submitted crashing code. I am suitably happy with this :) I also need to think of a way to automate this process.
Hanging code is a bit harder. I suspect that I will find a cpu quota option in Docker somewhere and put a cpu-timelimit in the assignment specification. This will interact with the performance issues above and affect the maximum latency on a submission queue.
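Until a Docker-level quota turns up, a plain wall-clock limit via the coreutils `timeout` command would catch hangs. A sketch - the wrapper is mine, but `timeout`'s exit status of 124 on expiry is documented behaviour:

```shell
#!/bin/sh
# Hypothetical wrapper: run one test case under a wall-clock limit.
# coreutils 'timeout' kills the command and exits 124 when the limit
# is hit, so hangs become an ordinary recordable return-code.
run_with_limit() {
    limit="$1"; shift
    timeout "$limit" "$@"
    rc=$?
    if [ "$rc" -eq 124 ]; then
        echo "TIMEOUT after ${limit}s: $*"
    fi
    return "$rc"
}
```

The limit chosen here is exactly the knob that trades off against queue latency: 50 students times a generous per-case limit is the worst-case wait.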

Conclusions

Grading student work is quite a subjective process that depends on an objective test of the artefact submitted. The testing process is quite error-prone and fragile, much more so than many teachers would like to admit. For testing to really be objective it must be made robust; reproducibility of the testing used to inform a grading decision is vital to minimising the subjectivity of that decision. My initial experience from trialling this approach suggests that it has great potential, and I expect that giving students access to the testing mechanism will benefit them greatly.