Everything is a file; except when it's not
Some titles make more sense than others. One of my oldest contributions to OCaml
was a complete overhaul of Unix.stat et al in ocaml/ocaml#462, which formed part
of OCaml 4.03. As part of the work on msvs-detect
in late 2015, I’d ended up with a Windows 7 VM which had every single version of
Visual Studio back to Visual Studio 6.0.
Visual Studio (and Visual C++ before that) has always included the source code
for the C Runtime Library (CRT), and as a side-effect of having all these
installed Visual Studios, I was able to construct a Git repository showing the
evolution of the CRT code over each release (sadly, the licence doesn’t allow
this to be pushed publicly). This was particularly useful for studying how the
behaviour of the stat implementation had changed over time, particularly with
reference to Windows Vista’s symlinks. Anyway, that particular bit of work left
me with a habit of often reaching for the CRT whenever something weird’s
happening, and that’s led naturally to a fairly detailed bug-fix - and outline
for more bug-fixes - in OCaml.
As part of my ongoing work on Relocatable OCaml, I wanted to have a test to
check that file descriptors set with keep-on-exec were being correctly passed
to Windows processes. While looking through the tests already present for the
Unix library, I happened upon testsuite/tests/lib-unix/common/cloexec.ml
and in particular this interesting comment in its preamble:
This test is temporarily disabled on the MinGW and MSVC ports,
because since fdstatus has been wrapped in an OCaml program,
it does not work as well as before.
Presumably this is because the OCaml runtime opens files, so that handles
that have actually been closed at execution look open and make the
test fail.
The test actually formed part of ocaml/ocaml#650 in OCaml 4.05, which added
?cloexec parameters to various Unix functions. The test looked perfect for my
needs, but the comment above had been added a year or so later when the test was
upgraded to use ocamltest instead. The bug I was
actually trying to fix in Relocatable OCaml - and which had caused all the
spelunking into the Microsoft CRT code - was to do with how file descriptors are
physically inherited by processes. On a Unix system, the CRT and the kernel are
quite closely related as a consequence of the relationship between the Single
Unix Specification and the C Standard, but on Windows the heritage is a bit more
complicated. Various functions - the exec and spawn functions included -
are very much user-level functions implemented over different kernel primitives,
rather than being either direct syscalls, or at most very thin wrappers around
direct syscalls.
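For reference, on the Unix side the property under test is the per-descriptor close-on-exec flag. A minimal C sketch of setting and querying it with fcntl, for POSIX systems only; the helper names set_cloexec and is_cloexec are made up for illustration and are not taken from OCaml's Unix library or its testsuite:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>

/* Set (on == 1) or clear (on == 0) the close-on-exec flag of fd.
   Returns 0 on success and -1 on error, with errno set by fcntl. */
int set_cloexec(int fd, int on)
{
  assert(on == 0 || on == 1);
  int flags = fcntl(fd, F_GETFD);
  if (flags == -1)
    return -1;
  flags = on ? (flags | FD_CLOEXEC) : (flags & ~FD_CLOEXEC);
  return fcntl(fd, F_SETFD, flags);
}

/* Returns 1 if fd will be closed across exec, 0 if it will be kept open
   and -1 if fd is not a valid descriptor. */
int is_cloexec(int fd)
{
  int flags = fcntl(fd, F_GETFD);
  return flags == -1 ? -1 : (flags & FD_CLOEXEC) != 0;
}
```

A descriptor opened without O_CLOEXEC starts with the flag clear, so is_cloexec returns 0 for it until set_cloexec(fd, 1) is called.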
Windows doesn’t have file descriptors (“FDs”); rather, it has HANDLEs.
Although Windows doesn’t follow the Unix “Everything is a file”
philosophy, the values for HANDLE sort of do (mainly because they’re
pointers/indexes into kernel information structures for the process). The
original version of this test, knowing that Windows doesn’t really have FDs, had
passed the HANDLE values instead. The crux of this test is to pass a series of
FD values on the command line to a small auxiliary program and have it test
which ones are still open - the close-on-exec ones should obviously not be in
use. This works on Unix, because while the OCaml runtime may open some files
during startup, they are all closed by the time the program itself is running,
so the state of the FDs should be unaltered.
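The checker side of this can be sketched in a few lines of C for the Unix case. This is an illustration rather than the actual test helper: fd_is_open and report_fds are hypothetical names, and the "#N: open" / "#N: closed" line format is assumed from the test output quoted at the end of this post.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>

/* Returns 1 if fd is an open descriptor in this process and 0 otherwise.
   fcntl(F_GETFD) fails with EBADF when the descriptor is not open. */
int fd_is_open(int fd)
{
  return fcntl(fd, F_GETFD) != -1 || errno != EBADF;
}

/* Each string in fds is an FD number, as it would arrive on the command
   line; one status line per descriptor is written to out. */
void report_fds(FILE *out, int count, char **fds)
{
  assert(out != NULL);
  for (int i = 0; i < count; i++) {
    int fd = atoi(fds[i]);
    fprintf(out, "#%d: %s\n", fd, fd_is_open(fd) ? "open" : "closed");
  }
}
```

A main function for such a checker would simply call report_fds(stdout, argc - 1, argv + 1), leaving the parent to compare the output against the expected list.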
On Windows, with its HANDLE version instead, this had worked fine in the
original test where the checker being invoked was a simple C program, but it had
hit problems when that simple C program was changed to a simple OCaml program.
I realised that the instability here was that whereas any FDs which were opened
by the runtime would be closed by the time the program ran, the same was not
true for HANDLEs which were not files. This was a slight variation on the
comment in the test - the point is that the HANDLEs which occasionally
appeared open were not in fact files at all, and so perfectly allowed to be
still open.
But… Windows does in fact have support for inheriting FD values across
exec calls. There’s a lovely survey of the mechanism in Windows which is present
for this, as an undocumented part of CreateProcess, and the code which does it
can be seen in the Universal CRT sources in exec/spawnv.cpp and
lowio/ioinit.cpp. Our implementation of Unix.create_process is written
directly in terms of Windows API calls, which completely breaks this mechanism
(that’s filed away for the future: we should reimplement our create_process
function in terms of the CRT’s own spawn function in order not to break this).
However, the Unix.exec functions call the CRT equivalents directly. These
functions are normally pretty useless
on Windows, because they work by spawning a new process and then immediately
terminating the current one, which means you can’t block or retrieve the actual
exit status. Luckily, ocamltest already has some magic added in ocaml/ocaml#1739
which means that it doesn’t continue until every process created by the item
being tested has itself terminated. The success or failure of this test is
determined by the output it produces, rather than the exit status, so for the
first time ever for me, Unix.execv was actually able to be used on Windows!
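To make the inheritance mechanism mentioned above a little more concrete, here is a rough, portable C sketch of the block the CRT is generally described as passing to a child process. Everything specific in it is an assumption rather than something stated in this post: that the block travels in the reserved lpReserved2/cbReserved2 fields of STARTUPINFO, that its layout is a count followed by one flag byte per descriptor and then one HANDLE per descriptor, and that the "open" flag is 0x01. The names handle_t, pack_fd_block and unpack_handle are hypothetical, and handle_t stands in for HANDLE so that the sketch compiles anywhere.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for the Windows HANDLE type (assumption, see above). */
typedef intptr_t handle_t;

/* Assumed value of the CRT's "descriptor is open" flag byte. */
#define FOPEN 0x01

/* Assumed layout, with no padding between the three parts:
     int count; unsigned char flags[count]; handle_t handles[count];
   Returns a malloc'd buffer and stores its length in *size. */
unsigned char *pack_fd_block(const unsigned char *flags,
                             const handle_t *handles,
                             int count, size_t *size)
{
  assert(count > 0);
  *size = sizeof(int) + (size_t)count * (1 + sizeof(handle_t));
  unsigned char *buf = malloc(*size);
  if (buf == NULL)
    return NULL;
  memcpy(buf, &count, sizeof(int));
  memcpy(buf + sizeof(int), flags, (size_t)count);
  memcpy(buf + sizeof(int) + count, handles,
         (size_t)count * sizeof(handle_t));
  return buf;
}

/* Child side: the handle recorded for descriptor fd, or -1 if fd is out of
   range, the block is too short, or the descriptor is not marked open. */
handle_t unpack_handle(const unsigned char *buf, size_t size, int fd)
{
  int count;
  handle_t h;
  if (size < sizeof(int))
    return -1;
  memcpy(&count, buf, sizeof(int));
  if (fd < 0 || fd >= count)
    return -1;
  if (size < sizeof(int) + (size_t)count * (1 + sizeof(handle_t)))
    return -1;
  if (!(buf[sizeof(int) + fd] & FOPEN))
    return -1;
  memcpy(&h, buf + sizeof(int) + count + (size_t)fd * sizeof(handle_t),
         sizeof h);
  return h;
}
```

The point of the sketch is only that the FD number is the index into the two arrays, which is why a process created without this block sees none of its parent's FD numbers even if the underlying handles were inherited.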
The switch allowed most of the special-case code for Windows to be removed in
the C portion of the test - we’re just dealing with FDs in the same way as on
Unix. However, given that Unix.create_process is presently “broken” on
Windows (inasmuch as it doesn’t actually pass the FD values to the new process),
I made the test work for both mechanisms, to record the “TODO” item for fixing
Unix.create_process on Windows at some point.
Finally, I was able to adapt the test for what I needed in Relocatable OCaml, but the changes made up to this point were good to go upstream, and formed ocaml/ocaml#13879. It’s a testsuite fix only, and it got merged quite quickly (thank you Gabriel!).
Everything was rosy. Except that when I was preparing Relocatable OCaml for last week’s Developer’s meeting in Paris, I spotted that several of my test runs on our “precheck infrastructure” were failing that test. Searching logs further, I found that since my PR had been merged, the test was sporadically failing. Mostly, the failure was:
Fatal error: exception Sys_error("tmp.txt: Permission denied")
which looked suspiciously like Windows Defender or some such was getting in the way. Irritating, but a known issue to have to fix. What was not so good, however, was an instance of:
#19: open
-#20: closed
+#20: open
#21: closed
A descriptor which was meant to be closed was open?! Something more complex was clearly still going on. But that’s for next time…