They do it with mirrors, you know - that sort of thing
While comfort-watching the indomitable Joan Hickson as Agatha Christie’s Miss Marple in The Body in the Library, it occurred to me that Miss Marple would have been a formidable debugger. Since
returning from holiday one, two, three weeks ago, I’ve been mostly
straightening out and finalising the final Relocatable OCaml PR. A frustrating
task, because I know these things will take weeks and have little to show for it at
the end, so one spends the entire time feeling it should be finished by now.
It was just about there when this little testsuite failure popped up:
List of failed tests:
tests/lib-unix/common/cloexec.ml
tests/warnings/mnemonics.mll
In both cases there was a similar, very strange-looking error:
the file '/home/runner/work/ocaml/ocaml/testsuite/tests/lib-unix/_ocamltest/tests/lib-unix/common/cloexec/ocamlc.byte/cloexec_leap.exe' is not a bytecode executable file
and
the file '/home/runner/work/ocaml/ocaml/testsuite/tests/warnings/_ocamltest/tests/warnings/mnemonics/ocamlc.byte/mnemonics.byte' is not a bytecode executable file
Fatal error: exception File "mnemonics.mll", line 55, characters 2-8: Assertion failed
Now, as it happens, the diagnosis of what was happening was relatively quick for me. I’ve dusted off and thrown around so many obscure bits of the runtime system on so many diverse configurations and platforms with Relocatable OCaml that it’s resulted in a lot of other bugs being fixed before the main PRs, some bugs fixed with the main PRs and then a pile of follow-up work with the additional parts. There’s one particularly long-standing bug on Windows:
C:\Users\DRA>where ocamlc.byte
C:\Users\DRA\AppData\Local\opam\default\bin\ocamlc.byte.exe
C:\Users\DRA>where ocamlc.byte.exe
C:\Users\DRA\AppData\Local\opam\default\bin\ocamlc.byte.exe
C:\Users\DRA>ocamlc.byte.exe --version
5.2.0
C:\Users\DRA>ocamlc.byte --version
unknown option --version
Strange, huh: ocamlc.byte.exe does one thing and ocamlc.byte does another!
The precise diagnosis of what’s going on there is nearly a novel in itself. The
fix is quite involved, and is at the “might get put into PR 3; might be left for
the future” stage. The failures across CI were just the Unix builds which use
the stub launcher for bytecode (it’s an obscure corner of startup which lives in
stdlib/header.c and which has received a pre-Relocatable overhaul in ocaml/ocaml#13988).
There are so many bits to Relocatable OCaml that I have a master script that
puts them all together and then backports them: the CI failure was only on the
“trunk” version of this, the 5.4, 5.3 and 5.2 versions passing as normal. The
backports don’t include the “future” work, so that quickly pointed me at the
work sitting in dra27/ocaml#190.
Both those failures are from tests which themselves spawn executables as part of the test. What was particularly strange was mnemonics, because that doesn’t call itself; rather, it calls the compiler:
let mnemonics =
  let stdout = "warn-help.out" in
  let n =
    Sys.command
      Filename.(quote_command ~stdout
        ocamlrun [concat ocamlsrcdir "ocamlc"; "-warn-help"])
  in
  assert (n = 0);
That’s invoking the ocamlc bytecode binary from the root of the build tree, passing it as an argument directly to runtime/ocamlrun in the root of the build tree. The fact that ocamlrun is then displaying a message referring to mnemonics.byte is very strange, but was down to a bug in my fix for this other issue. The core of the bug-fix is that the stub launcher, having opened the bytecode image to find its RNTM section so it can search for the runtime to call, now leaves the file descriptor open and hands its number over to ocamlrun as part of the exec call (works on Windows as well). The problem was the cleanup from this in ocamlrun itself, where that environment variable is reset having been consumed:
#if defined(_WIN32)
  _wputenv(L"__OCAML_EXEC_FD=");
#elif defined(HAS_SETENV_UNSETENV)
  unsetenv("__OCAML_EXEC_FD=");
#endif
There’s a stray = at the end of the Unix branch there 🫣 Right, problem solved and, were I Inspector Slack, I should have zipped straight round to Basil Blake’s gaudy cottage, handcuffs at the ready.
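For anyone wondering why one stray = matters so much, here is a minimal C sketch of the hand-over and its cleanup. It is not the runtime's actual code: the helper names and the decimal encoding of the descriptor number are assumptions made purely for illustration. The one firm fact it leans on is that POSIX specifies unsetenv as failing with EINVAL when the name contains an =, so the buggy cleanup quietly leaves the variable in place.

```c
#define _POSIX_C_SOURCE 200112L /* for the setenv/unsetenv declarations */
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

#define EXEC_FD_VAR "__OCAML_EXEC_FD"

/* Launcher side (hypothetical helper): publish an open descriptor number in
   the environment before calling exec. Returns 0 on success. */
static int publish_exec_fd(int fd)
{
  char buf[32];
  snprintf(buf, sizeof buf, "%d", fd);
  return setenv(EXEC_FD_VAR, buf, 1);
}

/* Runtime side (hypothetical helper): read the descriptor number, then clear
   the variable using cleanup_name, so that a correct and a buggy cleanup can
   be compared. Returns the descriptor, or -1 if the variable is absent or
   empty. */
static int consume_exec_fd(const char *cleanup_name)
{
  const char *value = getenv(EXEC_FD_VAR);
  int fd = -1;
  if (value != NULL && *value != '\0')
    fd = (int) strtol(value, NULL, 10);
  /* On a POSIX libc this fails with EINVAL if cleanup_name contains '='. */
  (void) unsetenv(cleanup_name);
  return fd;
}

/* Returns 1 if the variable is still set, and so would be inherited by any
   child process; 0 otherwise. */
static int exec_fd_var_leaks(void)
{
  return getenv(EXEC_FD_VAR) != NULL;
}

/* Walk through both cleanups; call this from main. */
static void exec_fd_demo(void)
{
  assert(publish_exec_fd(3) == 0);
  assert(consume_exec_fd(EXEC_FD_VAR "=") == 3); /* buggy cleanup */
  assert(exec_fd_var_leaks());                   /* variable survives */
  assert(consume_exec_fd(EXEC_FD_VAR) == 3);     /* correct cleanup */
  assert(!exec_fd_var_leaks());                  /* variable is gone */
}
```

If that reading is right, a test which itself spawns another ocamlrun would hand the child a stale reference to its own bytecode image, which is at least consistent with the compiler complaining about mnemonics.byte.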
But what about the second murder? Which, in this case, is why the heck hadn’t
this been seen before? That’s the kind of thing that terrifies me with a fix
like this: the bug is obvious, but was something else being masked and, more to
the point, have I just changed something which introduced a different bug
which happened to cause this one to be visible? At this point, I made a note,
closed my laptop, and returned to my knitting (no, wait, that was Miss Marple).
Then the penny dropped: the compiler’s being configured here with --with-target-sh=exe (on Unix, that means that bytecode executables intentionally avoid shebang-style scripts and use the stub), which should mean that those two tests are compiled using the stub. Except that, because we test the compiler in the build tree, the compiler previously picked up stdlib/runtime-launch-info, which was the build version of that header, not the target version. However, one of the refactorings I’ve done in c60e4aaf stops using runtime-launch-info this way (I introduced that header in ocaml/ocaml#12751 as part of OCaml 5.2.0). A side-effect of that change is that stdlib/runtime-launch-info is actually the target version of the header, and the root bytecode compiler is now behaving as we’d always been expecting it to in that test, using the target configuration defined in utils/config.ml … and so only now revealing this latent bug in my fix.
“They do it with mirrors, you know - that sort of thing - if you understand me.” Inspector Curry did not understand. He stared and wondered if Miss Marple was quite right in the head.