Elementary, my dear Watson
Overall, I enjoy bug-hunting. Getting there can be really quite tortuous, via
the fairly typical “oh my goodness, why do I do this?”, but there’s something
inordinately satisfying about arriving at a complete and convincing explanation
not only for what is going wrong but also for why it was only sometimes wrong.
I find the satisfaction is in inverse proportion to the size of the fix then
required: the best of these fixes have pages of explanation followed by a
miniscule diff. A couple of old personal scalps of mine include the path from a
random failure in a debugger test (see ocaml/ocaml#9043)
only seen with specific-code-alignment-on-one-CI-worker to a missing GC root
registration fixed in ocaml/ocaml#9051
(and which I’d allowed to become masked in the past by automatically allowing
re-running ‘unstable’ tests in ocaml/ocaml#401;
a technique I’d strongly recommend never using again!) and the relationship
between Unix.fork
only failing on every other invocation in the OCaml REPL
on Cygwin, and an unstable topological sort (see cygwin/cygwin#226f694). My personal favourite is not mine, though, which is the wonderful
alignment of compiler planets required in ocaml/ocaml#10071.
Obligatory xkcd.
Anyway, something I find to be a common pitfall when debugging these things is the temptation to start fixing the area of code one thinks is wrong, even if there isn’t a logical explanation for why the code being “fixed” is in any way related to the actual failure being seen. Which brings us back to the mystery of the failing cloexec test from last week.
This test was creating a file tmp.txt
, opening it various ways and marking the
handles as inheritable. My previous changes had switched the test to use execv
and the resulting exec
’d process was expected to delete this file, except it
was occasionally getting:
Fatal error: exception Sys_error("tmp.txt: Permission denied")
from these lines:
if Sys.argv.(1) = "execv" then
Sys.remove "tmp.txt"
On Windows, the file can’t be deleted until all open handles to it are closed
(that’s just one of those things). Pretty sure OCaml’s Unix library was
functioning correctly (avoid temptation to debug that; polish halo). So what
scenario could cause this file still be open. The odd thing here is how execv
is implemented on Windows, as it’s a library function rather than a kernel
function:
PROCESS_INFORMATION process_info;
BOOL const create_process_status = traits::create_process(
const_cast<Character*>(file_name),
command_line.get(),
nullptr,
nullptr,
TRUE,
creation_flags,
environment_block.get(),
nullptr,
&startup_info,
&process_info);
__crt_unique_handle process_handle(process_info.hProcess);
__crt_unique_handle thread_handle(process_info.hThread);
if (!create_process_status)
{
__acrt_errno_map_os_error(GetLastError());
return -1;
}
if (mode == _P_OVERLAY)
{
// Destroy ourselves:
_exit(0);
}
(the code is excerpted from ucrt/exec/spawnv.cpp
in UCRT 10.0.22621.0,
available in Microsoft Visual Studio). There is a small window between that
create_process
call and the subsequent _exit(0)
where tmp.txt
would be
open in both the “ancestor” process and the exec
’d process. Our CI systems
run on small VMs, and the testsuite is run in parallel, so the machine is very
much overloaded. Maybe it was just about possible for the overloaded Windows
scheduler to run the exec
’d process all the way to completion before actually
giving the original process enough chance to get to _exit
for its handles to
be closed?
That led to commit 55e9cd44: but it wasn’t enough. The test was still failing. At least that wasn’t wasted work - the race condition is real, and the locking added ensured that scenario couldn’t arise, but it certainly seems that it wasn’t the issue being seen.
Still managing to resist the temptation to debug the Unix library further (halo looking very shiny by this time), the next candidate is to blame Windows Defender. On Windows, virus scanners are given the opportunity to scan files after they’ve been closed - unfortunately, virus scanner are programs, just like anything else, which means that they do then have to open the files. That’s a known problem then when trying to delete those files soon after. While it seems odd that we could be seeing such a delay for a file with no content, recall that the testsuite runs in parallel. It could be that the file is open and waiting to be scanned, and that’s taking time because other tests (with executables, for example), are swamping the system.
The sledgehammer approach for this is to add a retry loop around the delete
call. In most cases, this is a perfectly adequate approach - it usually only
takes a few milliseconds. But that got me thinking as to what a suitable number
of seconds to wait would be - the machine’s overloaded, after all. So I tried
something different in commit 78fc5774
by using the slightly obscure FILE_FLAG_DELETE_ON_CLOSE
flag for CreateFileW
.
Using this flag requires all instances of the file to be opened with
FILE_SHARE_DELETE
, but fortunately that flag is already exposed in OCaml’s
Unix library (and that doesn’t affect the test) and the test already needs some
custom C code, so a little bit more won’t hurt. The idea is simple: on Windows,
when the program begins, we open the file with this special flag and then just
discard the handle. When the test ends, Windows will automatically close the
file handle and when it does so, it’ll be Windows responsibility to ensure the
file is cleaned up.
That got it! And it turned out that it was definitely worth worrying about the time that it would be necessary to “spin” - our precheck system reports tests which are slow-running, and that file deletion can take 12-20 seconds on our poor overloaded machines!
Well, that was most of the failing logs dealt with, but there was still this failure left:
#19: open
-#20: closed
+#20: open
#21: closed
As it happened, I’d seen this in a couple of places, and the only pattern was that a descriptor which was expected to be closed was open. I added some annotations in the test to relate the descriptor number to the function being tested. My previously gleaming halo was getting a bit tarnished at this point: could it actually be a bug in the Unix library?! It still didn’t feel it could be though - why was the failure sporadic? Why was it seemingly only ever just the one closed handle? Why, for that matter, was it only closed handles?
However, staring in frustration at some self-evidently correct C code, the penny
suddenly dropped. The way both channels and file descriptors are implemented in
Windows OCaml is tremendously complicated, but for the most part in the Unix
library, the CRT is bypassed, and we call Windows API functions directly
(e.g. we use CreateFileW
rather than _wopen
)
for Unix.openfile
). All the close-on-exec stuff was obviously setting the
Windows HANDLE
to be correctly inheritable (or nothing would be working), but
what was happening now with the file descriptors?
It turns out that while _open_osfhandle
does check for some properties of the HANDLE
it’s given automatically, it does
not call GetHandleInformation
to see if it’s been marked as inheritable.
That led to the third commit aa67aa6,
which checks the HANDLE
itself and if it is not meant to be inheritable,
also passes the O_NOINHERIT
flag to _open_osfhandle
.
Bug found, problem solved, right? Does this fix explain what was being seen on
the test failures? It turns out it does. We have a whole series of Windows
HANDLE
values, some of which are marked inheritable and some of which are not.
Previously, we were then acquiring a CRT file descriptor for all of those
HANDLE
values. However, because OCaml was never passing O_NOINHERIT
to
_open_osfhandle
, all of those file descriptors would be marked as inheritable.
What then happens comes back to those lovely internal details of how exec
is
actually implemented. All of those file descriptors are not marked
O_NOINHERIT
which means that the CRT will pass their Windows HANDLE
values
to the exec
’d process. However, because those HANDLE
values are not marked
as inheritable, the new process may reuse them. In other words, we
accidentally ended up with exactly the problem that switching to CRT file
descriptors was meant to have eliminated - Windows has re-used the HANDLE
value, because it wasn’t in use.
Bingo! Bugs fixed - and the explanation (hopefully) explains all the evidence. Elementary, my dear Watson…