Elementary, my dear Watson

Overall, I enjoy bug-hunting. Getting there can be really quite tortuous, via the fairly typical “oh my goodness, why do I do this?”, but there’s something inordinately satisfying about arriving at a complete and convincing explanation not only for what is going wrong but also for why it was only sometimes wrong. I find the satisfaction is in inverse proportion to the size of the fix then required: the best of these fixes have pages of explanation followed by a miniscule diff. A couple of old personal scalps of mine include the path from a random failure in a debugger test (see ocaml/ocaml#9043) only seen with specific-code-alignment-on-one-CI-worker to a missing GC root registration fixed in ocaml/ocaml#9051 (and which I’d allowed to become masked in the past by automatically allowing re-running ‘unstable’ tests in ocaml/ocaml#401; a technique I’d strongly recommend never using again!) and the relationship between Unix.fork only failing on every other invocation in the OCaml REPL on Cygwin, and an unstable topological sort (see cygwin/cygwin#226f694). My personal favourite is not mine, though, which is the wonderful alignment of compiler planets required in ocaml/ocaml#10071. Obligatory xkcd.

Anyway, something I find to be a common pitfall when debugging these things is the temptation to start fixing the area of code one thinks is wrong, even if there isn’t a logical explanation for why the code being “fixed” is in any way related to the actual failure being seen. Which brings us back to the mystery of the failing cloexec test from last week.

This test was creating a file tmp.txt, opening it various ways and marking the handles as inheritable. My previous changes had switched the test to use execv and the resulting exec’d process was expected to delete this file, except it was occasionally getting:

Fatal error: exception Sys_error("tmp.txt: Permission denied")

from these lines:

  if Sys.argv.(1) = "execv" then
    Sys.remove "tmp.txt"

On Windows, the file can’t be deleted until all open handles to it are closed (that’s just one of those things). Pretty sure OCaml’s Unix library was functioning correctly (avoid temptation to debug that; polish halo). So what scenario could cause this file still be open. The odd thing here is how execv is implemented on Windows, as it’s a library function rather than a kernel function:

    PROCESS_INFORMATION process_info;
    BOOL const create_process_status = traits::create_process(
        const_cast<Character*>(file_name),
        command_line.get(),
        nullptr,
        nullptr,
        TRUE,
        creation_flags,
        environment_block.get(),
        nullptr,
        &startup_info,
        &process_info);

    __crt_unique_handle process_handle(process_info.hProcess);
    __crt_unique_handle thread_handle(process_info.hThread);

    if (!create_process_status)
    {
        __acrt_errno_map_os_error(GetLastError());
        return -1;
    }

    if (mode == _P_OVERLAY)
    {
        // Destroy ourselves:
        _exit(0);
    }

(the code is excerpted from ucrt/exec/spawnv.cpp in UCRT 10.0.22621.0, available in Microsoft Visual Studio). There is a small window between that create_process call and the subsequent _exit(0) where tmp.txt would be open in both the “ancestor” process and the exec’d process. Our CI systems run on small VMs, and the testsuite is run in parallel, so the machine is very much overloaded. Maybe it was just about possible for the overloaded Windows scheduler to run the exec’d process all the way to completion before actually giving the original process enough chance to get to _exit for its handles to be closed?

That led to commit 55e9cd44: but it wasn’t enough. The test was still failing. At least that wasn’t wasted work - the race condition is real, and the locking added ensured that scenario couldn’t arise, but it certainly seems that it wasn’t the issue being seen.

Still managing to resist the temptation to debug the Unix library further (halo looking very shiny by this time), the next candidate is to blame Windows Defender. On Windows, virus scanners are given the opportunity to scan files after they’ve been closed - unfortunately, virus scanner are programs, just like anything else, which means that they do then have to open the files. That’s a known problem then when trying to delete those files soon after. While it seems odd that we could be seeing such a delay for a file with no content, recall that the testsuite runs in parallel. It could be that the file is open and waiting to be scanned, and that’s taking time because other tests (with executables, for example), are swamping the system.

The sledgehammer approach for this is to add a retry loop around the delete call. In most cases, this is a perfectly adequate approach - it usually only takes a few milliseconds. But that got me thinking as to what a suitable number of seconds to wait would be - the machine’s overloaded, after all. So I tried something different in commit 78fc5774 by using the slightly obscure FILE_FLAG_DELETE_ON_CLOSE flag for CreateFileW. Using this flag requires all instances of the file to be opened with FILE_SHARE_DELETE, but fortunately that flag is already exposed in OCaml’s Unix library (and that doesn’t affect the test) and the test already needs some custom C code, so a little bit more won’t hurt. The idea is simple: on Windows, when the program begins, we open the file with this special flag and then just discard the handle. When the test ends, Windows will automatically close the file handle and when it does so, it’ll be Windows responsibility to ensure the file is cleaned up.

That got it! And it turned out that it was definitely worth worrying about the time that it would be necessary to “spin” - our precheck system reports tests which are slow-running, and that file deletion can take 12-20 seconds on our poor overloaded machines!

Well, that was most of the failing logs dealt with, but there was still this failure left:

 #19: open
-#20: closed
+#20: open
 #21: closed

As it happened, I’d seen this in a couple of places, and the only pattern was that a descriptor which was expected to be closed was open. I added some annotations in the test to relate the descriptor number to the function being tested. My previously gleaming halo was getting a bit tarnished at this point: could it actually be a bug in the Unix library?! It still didn’t feel it could be though - why was the failure sporadic? Why was it seemingly only ever just the one closed handle? Why, for that matter, was it only closed handles?

However, staring in frustration at some self-evidently correct C code, the penny suddenly dropped. The way both channels and file descriptors are implemented in Windows OCaml is tremendously complicated, but for the most part in the Unix library, the CRT is bypassed, and we call Windows API functions directly (e.g. we use CreateFileW rather than _wopen) for Unix.openfile). All the close-on-exec stuff was obviously setting the Windows HANDLE to be correctly inheritable (or nothing would be working), but what was happening now with the file descriptors?

It turns out that while _open_osfhandle does check for some properties of the HANDLE it’s given automatically, it does not call GetHandleInformation to see if it’s been marked as inheritable.

That led to the third commit aa67aa6, which checks the HANDLE itself and if it is not meant to be inheritable, also passes the O_NOINHERIT flag to _open_osfhandle.

Bug found, problem solved, right? Does this fix explain what was being seen on the test failures? It turns out it does. We have a whole series of Windows HANDLE values, some of which are marked inheritable and some of which are not. Previously, we were then acquiring a CRT file descriptor for all of those HANDLE values. However, because OCaml was never passing O_NOINHERIT to _open_osfhandle, all of those file descriptors would be marked as inheritable.

What then happens comes back to those lovely internal details of how exec is actually implemented. All of those file descriptors are not marked O_NOINHERIT which means that the CRT will pass their Windows HANDLE values to the exec’d process. However, because those HANDLE values are not marked as inheritable, the new process may reuse them. In other words, we accidentally ended up with exactly the problem that switching to CRT file descriptors was meant to have eliminated - Windows has re-used the HANDLE value, because it wasn’t in use.

Bingo! Bugs fixed - and the explanation (hopefully) explains all the evidence. Elementary, my dear Watson…