Beware: java.nio.file.WatchService is subtly broken on Linux

This post describes a bug that I reported to Oracle a month or so ago, but which still doesn't seem to have made its way onto the official tracker.

The problem is that on Linux, file system events that should be delivered by the WatchService can be silently discarded, or delivered against the wrong WatchKey. For example, it's possible to register two directories, A and B, with a WatchService waiting for ENTRY_CREATE events, then create a file A/C but receive an event whose WatchKey is the one for B and whose WatchEvent.context is C.
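To make the symptom concrete, here is a minimal Java program (the class and method names are hypothetical) that registers A and B and checks which directory the ENTRY_CREATE event for A/C is attributed to. On its own it is very unlikely to hit the race, because nothing is registered while events are flowing; it only shows where a misattributed event would surface.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.WatchEvent;
import java.nio.file.WatchKey;
import java.nio.file.WatchService;
import java.util.concurrent.TimeUnit;

import static java.nio.file.StandardWatchEventKinds.ENTRY_CREATE;

class TwoDirectoryWatch {
    /**
     * Registers temp directories A and B for ENTRY_CREATE, creates A/C, and
     * returns the directory the WatchService attributes the event to
     * (null if nothing arrives within the timeout).
     */
    static Path watchableForFileCreatedInA() {
        try (WatchService ws = FileSystems.getDefault().newWatchService()) {
            Path root = Files.createTempDirectory("watch-demo");
            Path a = Files.createDirectory(root.resolve("A"));
            Path b = Files.createDirectory(root.resolve("B"));
            a.register(ws, ENTRY_CREATE);
            b.register(ws, ENTRY_CREATE);

            Files.createFile(a.resolve("C"));

            WatchKey key = ws.poll(10, TimeUnit.SECONDS);
            if (key == null) {
                return null;
            }
            for (WatchEvent<?> event : key.pollEvents()) {
                if (event.kind() == ENTRY_CREATE) {
                    // Correct behaviour: A. The bug can make this B (or drop the event).
                    return (Path) key.watchable();
                }
            }
            return null;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(watchableForFileCreatedInA());
    }
}
```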

The reason for this is a bug in the JDK's LinuxWatchService. This class wraps an inotify instance, together with a background thread that loops calling poll to wait for either:

  • A file system event to be delivered on the inotify FD, or
  • A byte to arrive on an FD belonging to a pipe (in the code, a socket pair) owned by the LinuxWatchService

Whenever a registration request is made by the user of the LinuxWatchService, the request is enqueued and then a single byte is written to the other end of this pipe to wake up the background thread, which will then make the actual registration with the kernel.
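The enqueue-then-wake design can be sketched in plain Java. This is an illustration under assumptions, not the JDK code: a Selector and a java.nio.channels.Pipe stand in for poll and the socket pair, a String stands in for a registration request, and the class and method names are hypothetical. Note that it drains wakeup bytes into a dedicated buffer.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.Pipe;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

class WakeupPipeLoop {
    private final Pipe pipe;
    private final Selector selector;
    private final Queue<String> requests = new ConcurrentLinkedQueue<>();
    private final List<String> processed = new ArrayList<>(); // background thread only

    WakeupPipeLoop() throws IOException {
        pipe = Pipe.open();
        selector = Selector.open();
        pipe.source().configureBlocking(false);
        pipe.source().register(selector, SelectionKey.OP_READ);
    }

    /** User thread: enqueue the request, then write one byte to wake the background thread. */
    void submit(String request) throws IOException {
        requests.add(request);
        pipe.sink().write(ByteBuffer.wrap(new byte[] {1}));
    }

    /** Background thread: wait for a wakeup, drain the pipe, then process pending requests. */
    void runUntil(int expected) throws IOException {
        // Dedicated buffer for wakeup bytes, so draining the pipe cannot overwrite other data.
        ByteBuffer drain = ByteBuffer.allocate(64);
        while (processed.size() < expected) {
            selector.select();
            selector.selectedKeys().clear();
            // Drain before polling the queue: a request added before its byte is always seen.
            do {
                drain.clear();
            } while (pipe.source().read(drain) > 0);
            String r;
            while ((r = requests.poll()) != null) {
                processed.add(r); // a real watch service would talk to the kernel here
            }
        }
    }

    void close() throws IOException {
        selector.close();
        pipe.source().close();
        pipe.sink().close();
    }

    /** Submits requests from the calling thread and returns what the background thread handled. */
    static List<String> demo(List<String> toSubmit) {
        try {
            WakeupPipeLoop loop = new WakeupPipeLoop();
            Thread background = new Thread(() -> {
                try {
                    loop.runUntil(toSubmit.size());
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
            background.start();
            for (String r : toSubmit) {
                loop.submit(r);
            }
            background.join(); // join makes `processed` safely visible here
            loop.close();
            return loop.processed;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }
}
```

For example, `WakeupPipeLoop.demo(List.of("register A", "register B"))` returns the two requests in submission order, since a single producer feeds a FIFO queue.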

The core loop of this background thread is where the bug lies. The loop body looks like this:

// wait for close or inotify event
nReady = poll(ifd, socketpair[0]);
 
// read from inotify
try {
    bytesRead = read(ifd, address, BUFFER_SIZE);
} catch (UnixException x) {
    if (x.errno() != EAGAIN)
        throw x;
    bytesRead = 0;
}
 
// process any pending requests
if ((nReady > 1) || (nReady == 1 && bytesRead == 0)) {
    try {
        read(socketpair[0], address, BUFFER_SIZE);
        boolean shutdown = processRequests();
        if (shutdown)
            break;
    } catch (UnixException x) {
        if (x.errno() != UnixConstants.EAGAIN)
            throw x;
    }
}
 
// iterate over buffer to decode events
int offset = 0;
while (offset < bytesRead) {
    long event = address + offset;
    int wd = unsafe.getInt(event + OFFSETOF_WD);
    int mask = unsafe.getInt(event + OFFSETOF_MASK);
    int len = unsafe.getInt(event + OFFSETOF_LEN);
 
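    // Omitted: the code that actually does something with the inotify event

    // advance to the next event: fixed-size header plus the variable-length name
    offset += (SIZEOF_INOTIFY_EVENT + len);
}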

The issue is that this body makes two read calls, one on the inotify FD ifd and one on the pipe FD socketpair[0], and both read into the same buffer, address. If data happens to be available on both the pipe and inotify, the read from the pipe overwrites the first few bytes of the inotify event data that was just read! As it happens, the first field of an inotify event is the watch descriptor the event belongs to, so the issue usually manifests as an event being delivered against the wrong directory (or, if the corrupted watch descriptor does not correspond to any registered watch, the event being ignored entirely).
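A small simulation of the shared-buffer problem may help. This is not the JDK code: a ByteBuffer stands in for the native buffer at address, the little-endian layout assumes x86, and the wakeup byte value of 1 is an assumption (the class name is hypothetical). It writes an inotify_event header with watch descriptor 2, then "reads" one wakeup byte into either the same buffer or a separate one.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class SharedBufferCorruption {
    // struct inotify_event { int wd; uint32_t mask; uint32_t cookie; uint32_t len; char name[]; }
    static final int OFFSETOF_WD = 0;
    static final int OFFSETOF_MASK = 4;
    static final int IN_CREATE = 0x100;

    /** Returns the watch descriptor decoded after both simulated reads. */
    static int decodedWd(int realWd, byte wakeupByte, boolean sharedBuffer) {
        ByteBuffer inotifyBuf = ByteBuffer.allocate(16).order(ByteOrder.LITTLE_ENDIAN);
        // Simulated read(ifd, ...): the kernel fills in the event header.
        inotifyBuf.putInt(OFFSETOF_WD, realWd);
        inotifyBuf.putInt(OFFSETOF_MASK, IN_CREATE);

        // Simulated read(socketpair[0], ...): the wakeup byte lands at offset 0 of the buffer passed in.
        ByteBuffer pipeBuf = sharedBuffer ? inotifyBuf : ByteBuffer.allocate(1);
        pipeBuf.put(0, wakeupByte);

        return inotifyBuf.getInt(OFFSETOF_WD);
    }

    public static void main(String[] args) {
        System.out.println(decodedWd(2, (byte) 1, true));  // prints 1: event attributed to the wrong watch
        System.out.println(decodedWd(2, (byte) 1, false)); // prints 2: separate buffer, wd intact
    }
}
```

Giving the pipe read its own buffer is enough to keep the inotify data intact in this model.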

Note that this issue can only occur if you register watches while simultaneously receiving events. If your program just sets up some watches at startup and never registers or cancels watches again, you probably won't be affected. This, plus the fact that the bug is only triggered when a registration request and an event arrive very close together, is probably why it has gone undetected since the very first release of the WatchService code.
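One common way to end up registering watches while events are arriving is recursive watching, where each newly created subdirectory is registered from inside the event loop. The sketch below (hypothetical names; the file creation inside the loop stands in for another process writing to the new directory) shows that pattern. It will not reliably trigger the race, but it is the kind of code that is exposed to it.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.WatchEvent;
import java.nio.file.WatchKey;
import java.nio.file.WatchService;
import java.util.concurrent.TimeUnit;

import static java.nio.file.StandardWatchEventKinds.ENTRY_CREATE;

class RecursiveRegistration {
    /**
     * Watches a temp root, creates root/sub, registers sub from inside the event
     * loop, then creates root/sub/file. Returns the watchable reported for "file".
     */
    static Path watchableForNestedFile() {
        try (WatchService ws = FileSystems.getDefault().newWatchService()) {
            Path root = Files.createTempDirectory("recursive-watch");
            root.register(ws, ENTRY_CREATE);
            Files.createDirectory(root.resolve("sub"));

            for (int round = 0; round < 2; round++) {
                WatchKey key = ws.poll(10, TimeUnit.SECONDS);
                if (key == null) {
                    return null;
                }
                Path dir = (Path) key.watchable();
                for (WatchEvent<?> event : key.pollEvents()) {
                    if (event.kind() != ENTRY_CREATE) {
                        continue; // e.g. OVERFLOW
                    }
                    Path child = dir.resolve((Path) event.context());
                    if (!Files.isDirectory(child)) {
                        return dir; // the event for "file"; should be root/sub
                    }
                    // Registering while the service may be delivering events: the bug's window.
                    child.register(ws, ENTRY_CREATE);
                    // Stand-in for another process writing into the new directory.
                    Files.createFile(child.resolve("file"));
                }
                key.reset();
            }
            return null;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        }
    }
}
```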

I've worked around this myself by using the inotify API directly via JNA. This reimplementation also let me fix an unrelated WatchService "feature": WatchKey.watchable can point to the wrong path if a directory is renamed. So if you create a directory A, start watching it for ENTRY_CREATE events, rename the directory to B, and then create a file B/C, the WatchKey.watchable you get from the WatchService will be A rather than B, so naive code will derive the incorrect full path A/C for the new file.
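The rename scenario can be reproduced with the standard API. This is a minimal sketch (hypothetical names) that assumes the Linux WatchService behaviour described above: the key stays valid after the rename and keeps reporting the originally registered path.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.WatchEvent;
import java.nio.file.WatchKey;
import java.nio.file.WatchService;
import java.util.concurrent.TimeUnit;

import static java.nio.file.StandardWatchEventKinds.ENTRY_CREATE;

class RenamedWatchable {
    /** Watches A, renames it to B, creates B/C, and returns watchable().resolve(context()). */
    static Path derivedPathAfterRename() {
        try (WatchService ws = FileSystems.getDefault().newWatchService()) {
            Path root = Files.createTempDirectory("rename-watch");
            Path a = Files.createDirectory(root.resolve("A"));
            a.register(ws, ENTRY_CREATE);

            Path b = Files.move(a, root.resolve("B")); // the inotify watch follows the directory
            Files.createFile(b.resolve("C"));

            WatchKey key = ws.poll(10, TimeUnit.SECONDS);
            if (key == null) {
                return null;
            }
            for (WatchEvent<?> event : key.pollEvents()) {
                if (event.kind() == ENTRY_CREATE) {
                    // Naive derivation: watchable() is still the registered path, A.
                    return ((Path) key.watchable()).resolve((Path) event.context());
                }
            }
            return null;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        }
    }
}
```

On Linux this typically yields a path ending in A/C, a file that does not exist.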

In my implementation, a WatchKey is invalidated if the directory it watches is renamed, so a user of the class has the opportunity to re-register the new path, and get the correct WatchKey.watchable, if they so desire. I think this is much saner behaviour!
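The invalidate-on-rename policy might look roughly like the following. This is a hypothetical sketch, not my JNA implementation: raw (wd, mask, name) events would come from reading the inotify FD, but here they are passed in directly, and only the mask values are taken from <sys/inotify.h>.

```java
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

class RenameAwareKeys {
    // Values from <sys/inotify.h>
    static final int IN_CREATE = 0x100;
    static final int IN_MOVE_SELF = 0x800;

    static final class Key {
        final Path watchable;
        boolean valid = true;

        Key(Path watchable) {
            this.watchable = watchable;
        }
    }

    private final Map<Integer, Key> byWd = new HashMap<>();

    /** Records a watch; a real wrapper would get wd from inotify_add_watch (with IN_MOVE_SELF in the mask). */
    Key register(int wd, Path dir) {
        Key key = new Key(dir);
        byWd.put(wd, key);
        return key;
    }

    /** Returns the full path of a created entry, or null if the event is dropped. */
    Path onEvent(int wd, int mask, String name) {
        Key key = byWd.get(wd);
        if (key == null) {
            return null; // unknown or already-invalidated watch
        }
        if ((mask & IN_MOVE_SELF) != 0) {
            // The watched directory was renamed: invalidate instead of reporting a stale path.
            key.valid = false;
            byWd.remove(wd);
            return null;
        }
        if ((mask & IN_CREATE) != 0 && name != null) {
            return key.watchable.resolve(name);
        }
        return null;
    }
}
```

The caller sees `valid == false` and can re-register the directory under its new name, which gives it a fresh key with the correct watchable.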