Everything is a file; except when it’s not

Some titles make more sense than others. One of my oldest contributions to OCaml was a complete overhaul of Unix.stat et al. in ocaml/ocaml#462, which formed part of OCaml 4.03. As part of the work on msvs-detect in late 2015, I’d ended up with a Windows 7 VM which had every single version of Visual Studio back to Visual Studio 6.0. Visual Studio (and Visual C++ before that) has always included the source code for the C Runtime Library (CRT), and as a side-effect of having all these installed Visual Studios, I was able to construct a Git repository showing the evolution of the CRT code over each release (sadly, the licence doesn’t allow this to be pushed publicly). This was particularly useful for studying how the behaviour of the stat implementation had changed over time, especially with reference to Windows Vista’s symlinks. Anyway, that particular bit of work left me with a habit of reaching for the CRT whenever something weird’s happening, and that’s led naturally to a fairly detailed bug-fix - and an outline for more bug-fixes - in OCaml.

As part of my ongoing work on Relocatable OCaml, I wanted to have a test to check that file descriptors set with keep-on-exec were being correctly passed to Windows processes. While looking through the tests already present for the Unix library, I happened upon testsuite/tests/lib-unix/common/cloexec.ml and in particular this interesting comment in its preamble:

This test is temporarily disabled on the MinGW and MSVC ports,
because since fdstatus has been wrapped in an OCaml program,
it does not work as well as before.
Presumably this is because the OCaml runtime opens files, so that handles
that have actually been closed at execution look open and make the
test fail.

The test actually formed part of ocaml/ocaml#650 in OCaml 4.05, which added ?cloexec parameters to various Unix functions. The test looked perfect for my needs, but the comment above had been added a year or so later when the test was upgraded to use ocamltest instead. The bug I was actually trying to fix in Relocatable OCaml - and which had caused all the spelunking into the Microsoft CRT code - was to do with how file descriptors are physically inherited by processes. On a Unix system, the CRT and the kernel are quite closely related as a consequence of the relationship between the Single Unix Specification and the C Standard, but on Windows the heritage is a bit more complicated. Various functions - the exec and spawn functions included - are very much user-level functions implemented over different kernel primitives, rather than being either direct syscalls, or at most very thin wrappers around direct syscalls.
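For readers who haven’t met those parameters, here is a minimal sketch of how the close-on-exec flag can be controlled from the Unix library. It is not taken from the testsuite: the function name demo_cloexec is made up for illustration, and the snippet assumes it is linked against the unix library.

```ocaml
(* Illustrative sketch only; [demo_cloexec] is a hypothetical name. *)
let demo_cloexec () =
  (* Close-on-exec: these two ends should not be visible to a program
     started with one of the Unix.exec* functions. *)
  let r_closed, w_closed = Unix.pipe ~cloexec:true () in
  (* Keep-on-exec: these two ends stay open across an exec. *)
  let r_kept, w_kept = Unix.pipe ~cloexec:false () in
  (* The flag can also be changed after the descriptor has been created. *)
  Unix.set_close_on_exec r_kept;
  Unix.clear_close_on_exec r_kept;
  (* Unix.openfile offers the same choice through the O_CLOEXEC and
     O_KEEPEXEC flags. *)
  [r_closed; w_closed; r_kept; w_kept]
```

The list is returned only so that a caller can close the descriptors again, or hand them to a child process.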

Windows doesn’t have file descriptors (“FDs”); rather, it has HANDLEs. Although Windows doesn’t follow the Unix “Everything is a file” philosophy, the values for HANDLE sort of do (mainly because they’re pointers/indexes into kernel information structures for the process). The original version of this test, knowing that Windows doesn’t really have FDs, had passed the HANDLE values instead. The crux of this test is to pass a series of FD values on the command line to a small auxiliary program and have it test which ones are still open; the close-on-exec ones should obviously not be in use. This works on Unix because, while the OCaml runtime may open some files during startup, they are all closed by the time the program itself is running, so the state of the FDs should be unaltered.
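To give a flavour of the “is this descriptor still open?” check, here is a rough Unix-only sketch. It is an assumption on my part rather than the testsuite’s actual checker: it probes a descriptor with Unix.fstat and treats EBADF as “closed”. The names is_open and report are hypothetical, the number passed to report is just a label chosen by the caller, and the output format simply mimics the “#20: open” style of the test’s output.

```ocaml
(* [is_open fd] is true when [fd] still refers to an open descriptor.
   Assumption: fstat fails with EBADF on a closed descriptor (Unix only). *)
let is_open (fd : Unix.file_descr) : bool =
  match Unix.fstat fd with
  | _ -> true
  | exception Unix.Unix_error (Unix.EBADF, _, _) -> false

(* Format one line of the report; [n] is only a label for the output. *)
let report (n : int) (fd : Unix.file_descr) : string =
  Printf.sprintf "#%d: %s" n (if is_open fd then "open" else "closed")
```

A check like this only says that something is open at that descriptor number, not what it is, which is exactly why stray non-file HANDLEs can confuse the Windows variant.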

On Windows, with its HANDLE version instead, this had worked fine in the original test, where the checker being invoked was a simple C program, but it had hit problems when that simple C program was changed to a simple OCaml program. I realised that the instability here was that, whereas any FDs which were opened by the runtime would be closed by the time the program ran, the same was not true for HANDLEs which were not files. This was a slight variation on the comment in the test: the point is that the HANDLEs which occasionally appeared open were not in fact files at all, and so were perfectly allowed to be still open.

But… Windows does in fact have support for inheriting FD values across exec calls. There’s a lovely survey of the Windows mechanism for this, which is an undocumented part of CreateProcess, and the code which does it can be seen in the Universal CRT sources in exec/spawnv.cpp and lowio/ioinit.cpp. Our Unix.create_process is implemented directly in terms of Windows API calls, which completely breaks this mechanism (that’s filed away for the future: we should reimplement our create_process function in terms of the CRT’s own spawn function in order not to break this). However, the Unix.exec functions call the CRT equivalents directly. These functions are normally pretty useless on Windows, because they work by spawning a new process and then immediately terminating the current one, which means the caller can’t block on the new process or retrieve its actual exit status. Luckily ocamltest already has some magic, added in ocaml/ocaml#1739, which means that it doesn’t continue until every process created by the item being tested has itself terminated. The success or failure of this test is determined by the output it produces, rather than the exit status, so for the first time ever, I was actually able to use Unix.execv on Windows!

The switch allowed most of the special-case code for Windows to be removed in the C portion of the test - we’re just dealing with FDs in the same way as on Unix. However, given that Unix.create_process is presently “broken” on Windows (inasmuch as it doesn’t actually pass the FD values to the new process), I made the test work for both mechanisms, to record the “TODO” item for fixing Unix.create_process on Windows at some point.

Finally, I was able to adapt the test for what I needed in Relocatable OCaml, but the changes made up to this point were good to go upstream, and formed ocaml/ocaml#13879. It’s a testsuite fix only, and it got merged quite quickly (thank you Gabriel!).

Everything was rosy. Except that when I was preparing Relocatable OCaml for last week’s Developer’s meeting in Paris, I spotted that several of my test runs on our “precheck infrastructure” were failing that test. Searching the logs further, I found that the test had been failing sporadically ever since my PR was merged. Mostly, the failure was:

Fatal error: exception Sys_error("tmp.txt: Permission denied")

which looked suspiciously like Windows Defender or some such was getting in the way. Irritating, but a known issue to have to fix. What was however not so good was an instance of:

 #19: open
-#20: closed
+#20: open
 #21: closed

A descriptor which was meant to be closed was open?! Something more complex was clearly still going on. But that’s for next time…