Beware: java.nio.file.WatchService is subtly broken on Linux

This post describes a bug that I reported to Oracle a month or so ago but that still doesn't seem to have made its way through to the official tracker.

The problem is that on Linux, file system events that should be delivered by a WatchService can be silently discarded or delivered against the wrong WatchKey. So for example, it's possible to register two directories, A and B, with a WatchService waiting for ENTRY_CREATE events, then create a file A/C but get an event with the WatchKey for B and WatchEvent.context C.

The reason for this is a bug in the JDK's LinuxWatchService. This class wraps an inotify instance, along with a thread that spins using poll to wait for either:

• A file system event to be delivered on the inotify FD, or
• A byte to arrive on a FD corresponding to a pipe which is owned by the LinuxWatchService

Whenever a registration request is made by the user of the LinuxWatchService, the request is enqueued and then a single byte is written to the other end of this pipe to wake up the background thread, which will then make the actual registration with the kernel.

The core loop of this background thread is where the bug lies. The loop body looks like this:

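for (;;) {
    int nReady, bytesRead;

    // wait for close or inotify event
    nReady = poll(ifd, socketpair[0]);

    // read from inotify
    try {
        bytesRead = read(ifd, address, BUFFER_SIZE);
    } catch (UnixException x) {
        if (x.errno() != EAGAIN)
            throw x;
        bytesRead = 0;
    }

    // process any pending requests
    if ((nReady > 1) || (nReady == 1 && bytesRead == 0)) {
        try {
            read(socketpair[0], address, BUFFER_SIZE);
            boolean shutdown = processRequests();
            if (shutdown)
                break;
        } catch (UnixException x) {
            if (x.errno() != UnixConstants.EAGAIN)
                throw x;
        }
    }

    // iterate over buffer to decode events
    int offset = 0;
    while (offset < bytesRead) {
        long event = address + offset;
        int wd = unsafe.getInt(event + OFFSETOF_WD);
        int len = unsafe.getInt(event + OFFSETOF_LEN);

        // Omitted: the code that actually does something with the inotify event
    }
}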

The issue is that two read calls are made by this body: one on the inotify FD ifd, and one on the pipe FD socketpair[0], and both of them read into the same buffer at address. If data happens to be available both via the pipe and via inotify, then the read from the pipe will overwrite the first few bytes of the inotify event stream! As it happens, the first few bytes of an event denote which watch descriptor the event is for, and so the issue usually manifests as an event being delivered against the wrong directory (or, if the resulting watch descriptor is not actually valid, the event being ignored entirely).
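To make the failure mode concrete, here is a small simulation of two reads sharing one buffer. It is only a sketch and not the JDK code: the class and method names are made up, the buffer is forced to little-endian (as on x86), the watch descriptor is assumed to be a 4-byte int at offset 0, and the value of the wake-up byte is an arbitrary choice.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Simplified model of the bug: a single buffer is the target of two "reads".
class SharedBufferDemo {
    // Assumption: the watch descriptor is the first 4-byte field of an event.
    static final int OFFSETOF_WD = 0;

    // Stand-in for the read from the inotify FD: stores an event for watch `wd`.
    static void readInotifyEvent(ByteBuffer buffer, int wd) {
        buffer.putInt(OFFSETOF_WD, wd);
    }

    // Stand-in for the read from the pipe: the wake-up byte lands at the
    // start of the very same buffer.
    static void readWakeupByte(ByteBuffer buffer, byte wakeup) {
        buffer.put(0, wakeup);
    }

    // Performs both reads against one buffer and returns the watch
    // descriptor that the decoding loop would then see.
    static int decodeAfterBothReads(int wd, byte wakeup) {
        ByteBuffer buffer = ByteBuffer.allocate(16).order(ByteOrder.LITTLE_ENDIAN);
        readInotifyEvent(buffer, wd);
        readWakeupByte(buffer, wakeup);
        return buffer.getInt(OFFSETOF_WD);
    }

    public static void main(String[] args) {
        // An event for watch descriptor 2 is decoded as watch descriptor 1.
        System.out.println(decodeAfterBothReads(2, (byte) 1)); // prints 1
    }
}
```

In this model the low byte of the descriptor is simply replaced, so an event for one registered directory can turn into an event for another one, or into a descriptor that matches no registration at all.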

Note that this issue can only occur if you are registering watches while simultaneously receiving events. If your program just sets up some watches at startup and then never registers/cancels watches again you probably won't be affected. This, plus the fact that it is only triggered by registration requests and events arriving very close together, is probably why this bug has gone undetected since the very first release of the WatchService code.

I've worked around this myself by using the inotify API directly via JNA. This reimplementation also let me solve an unrelated WatchService "feature", which is that WatchKey.watchable can point to the wrong path in the event that a directory is renamed. So if you create a directory A, start watching it for ENTRY_CREATE events, rename the directory to B, and then create a file B/C, the WatchKey.watchable you get from the WatchService will be A rather than B, so naive code will derive the incorrect full path A/C for the new file.

In my implementation, a WatchKey is invalidated if the directory it watches is renamed, so a user of the class has the opportunity to reregister the new path with the correct WatchKey.watchable if they so desire. I think this is much saner behaviour!
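For reference, the "naive code" I have in mind looks roughly like the sketch below, which uses only the standard java.nio.file API. The class name, the helper names and the use of a temporary directory are my own choices for illustration, and the 30 second timeout is arbitrary. It does not reproduce the rename problem; it just shows the path derivation that goes wrong once the watched directory has been renamed.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.*;
import java.util.concurrent.TimeUnit;

class WatchablePathDemo {
    // The naive derivation: resolve the event's context (a relative path)
    // against whatever directory the key claims to be watching.
    static Path fullPath(WatchKey key, WatchEvent<?> event) {
        Path dir = (Path) key.watchable();
        return dir.resolve((Path) event.context());
    }

    // Watches a fresh temporary directory, creates a file called `name` in
    // it, and returns the path derived from the first ENTRY_CREATE event
    // (or null if nothing arrives before the timeout).
    static Path watchAndCreate(String name) {
        try (WatchService watcher = FileSystems.getDefault().newWatchService()) {
            Path dir = Files.createTempDirectory("watch-demo");
            dir.register(watcher, StandardWatchEventKinds.ENTRY_CREATE);
            Files.createFile(dir.resolve(name));

            WatchKey key = watcher.poll(30, TimeUnit.SECONDS);
            if (key == null) {
                return null;
            }
            for (WatchEvent<?> event : key.pollEvents()) {
                if (event.kind() == StandardWatchEventKinds.ENTRY_CREATE) {
                    return fullPath(key, event);
                }
            }
            return null;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(watchAndCreate("C"));
    }
}
```

Because fullPath trusts key.watchable(), it keeps producing paths under the old directory name after a rename, which is exactly why I prefer invalidating the key.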

Asynchronous and non-blocking IO

This post aims to explain the difference between asynchronous and non-blocking IO, with particular reference to their implementation in Java. These two styles of IO API are closely related but have a number of important differences, especially when it comes to OS support.

Asynchronous IO

Asynchronous IO refers to an interface where you supply a callback to an IO operation, which is invoked when the operation completes. This invocation often happens on an entirely different thread to the one that originally made the request, but this is not necessarily the case. Asynchronous IO is a manifestation of the "proactor" pattern.

One common way to implement asynchronous IO is to have a thread pool whose threads are used to make the normal blocking IO requests, and execute the appropriate callbacks when these return. The less common implementation approach is to avoid a thread pool, and just push the actual asynchronous operations down into the kernel. This alternative solution obviously has the disadvantage that it depends on operating system specific support for making async operations, but has the following advantages:

• The maximum number of in-flight requests is not bounded by the size of your thread pool
• The overhead of creating thread pool threads is avoided (e.g. you need not reserve any memory for the thread stacks, and you don't pay the extra context switching cost associated with having more schedulable entities)
• You expose more information to the kernel, which it can potentially use to make good choices about how to do the IO operations — e.g. by minimizing the distance that the disk head needs to travel to satisfy your requests, or by using native command queueing.
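The thread pool approach is easy to sketch in plain Java. The following is a minimal illustration rather than how any particular library does it: ThreadPoolAsyncReader, readAll and roundTrip are hypothetical names, and the callback shape (bytes or an exception) is just one reasonable choice.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.BiConsumer;

// Asynchronous file reads built from blocking reads plus a thread pool.
class ThreadPoolAsyncReader implements AutoCloseable {
    private final ExecutorService pool;

    ThreadPoolAsyncReader(int threads) {
        pool = Executors.newFixedThreadPool(threads);
    }

    // Returns immediately. The callback later runs on a pool thread and
    // receives either the file contents or the failure.
    void readAll(Path file, BiConsumer<byte[], IOException> callback) {
        pool.execute(() -> {
            byte[] data;
            try {
                data = Files.readAllBytes(file); // ordinary blocking read
            } catch (IOException e) {
                callback.accept(null, e);
                return;
            }
            callback.accept(data, null);
        });
    }

    @Override
    public void close() {
        pool.shutdown();
    }

    // Usage example: writes `text` to a temporary file, reads it back
    // through the callback API, and waits for the callback to fire.
    static String roundTrip(String text) {
        try (ThreadPoolAsyncReader reader = new ThreadPoolAsyncReader(2)) {
            Path file = Files.createTempFile("pool-demo", ".txt");
            Files.write(file, text.getBytes(StandardCharsets.UTF_8));

            CountDownLatch done = new CountDownLatch(1);
            AtomicReference<String> result = new AtomicReference<>();
            reader.readAll(file, (bytes, error) -> {
                if (error == null) {
                    result.set(new String(bytes, StandardCharsets.UTF_8));
                }
                done.countDown();
            });
            done.await();
            return result.get();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }
}
```

Note how the first disadvantage in the list shows up directly: with a pool of two threads, at most two reads are ever in flight, no matter how many calls to readAll have been made.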

Operating system support for asynchronous IO is mixed:

• Linux has at least two implementations of async IO:
  • POSIX AIO (aio_read et al). This is implemented on Linux by glibc, but other POSIX systems (Solaris, OS X etc) have their own implementations. The glibc implementation is simply a thread pool based one; I'm not sure about the other systems.
  • Linux kernel AIO (io_submit et al). No thread pool is used here, but it has quite a few limitations (e.g. it only works for files, not sockets, and has alignment restrictions on file reads) and does not seem to be used much in practice.

There is a good discussion of the *nix AIO situation on the libtorrent blog, summarised by the same writer on Stack Overflow here. The experience of this author was that the limitations and poor implementation quality of the various *nix AIO implementations are such that you are much better off just using your own thread pool to issue blocking operations.

• Windows provides a mechanism called completion ports for performing asynchronous IO. With this system:
1. You start up a thread pool and arrange for each thread to spin calling GetQueuedCompletionStatus
2. You make IO requests using the normal Windows APIs (e.g. ReadFile and WSARecv), with the small added twist that you supply a special LPOVERLAPPED parameter indicating that the calls should be non-blocking and the result should be reported to the thread pool
3. As IO completes, thread pool threads blocked on GetQueuedCompletionStatus are woken up as necessary to process completion events

Windows intelligently schedules how it delivers GetQueuedCompletionStatus wakeups, such that it tries to roughly keep the same number of threads active at any time. This avoids excessive context switching and scheduler transitions — things are arranged so that a thread which has just processed a completion event will likely be able to immediately grab a new work item. With this arrangement, your pool can be much smaller than the number of IO operations you want to have in-flight: you only need to have as many threads as are required to process completion events.

In Java, support for asynchronous IO was added as part of the NIO2 work in JDK7, and the appropriate APIs are exposed through the AsynchronousChannel interface and its implementations. On *nix, AsynchronousFileChannel and AsynchronousSocketChannel are implemented using the standard thread pool approach (the pools are owned by an AsynchronousChannelGroup). On Windows, completion ports are used; in this case, the threads of the AsynchronousChannelGroup thread pool act as the GetQueuedCompletionStatus listeners.
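Here is roughly what using the JDK7 API looks like for a file read. AsynchronousFileChannel and CompletionHandler are the real JDK types; the class AsyncFileReadDemo, its method names, the 1024-byte buffer and the use of CompletableFuture (which needs Java 8) to hand the result back are assumptions made for this sketch.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.AsynchronousFileChannel;
import java.nio.channels.CompletionHandler;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.CompletableFuture;

class AsyncFileReadDemo {
    // Starts an asynchronous read of up to 1024 bytes from the start of the
    // file. The handler is invoked by the JDK when the read completes.
    static CompletableFuture<String> readStart(Path file) throws IOException {
        AsynchronousFileChannel channel =
                AsynchronousFileChannel.open(file, StandardOpenOption.READ);
        ByteBuffer buffer = ByteBuffer.allocate(1024);
        CompletableFuture<String> result = new CompletableFuture<>();

        channel.read(buffer, 0, channel,
                new CompletionHandler<Integer, AsynchronousFileChannel>() {
                    @Override
                    public void completed(Integer bytesRead, AsynchronousFileChannel ch) {
                        buffer.flip();
                        result.complete(StandardCharsets.UTF_8.decode(buffer).toString());
                        closeQuietly(ch);
                    }

                    @Override
                    public void failed(Throwable error, AsynchronousFileChannel ch) {
                        result.completeExceptionally(error);
                        closeQuietly(ch);
                    }
                });
        return result;
    }

    private static void closeQuietly(AsynchronousFileChannel ch) {
        try {
            ch.close();
        } catch (IOException ignored) {
            // nothing useful to do in a demo
        }
    }

    // Usage example: writes `text` to a temporary file and reads it back.
    static String roundTrip(String text) {
        try {
            Path file = Files.createTempFile("async-demo", ".txt");
            Files.write(file, text.getBytes(StandardCharsets.UTF_8));
            return readStart(file).join();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The calling code is identical on every platform; only the machinery that ends up invoking completed differs between the thread pool and completion port implementations.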

Non-blocking IO

Non-blocking IO refers to an interface where IO operations will return immediately with a special error code if called when they are in a state that would otherwise cause them to block. So for example, a non-blocking recv will return immediately with an EAGAIN or EWOULDBLOCK error code if no data is available on the socket, and likewise send will return immediately with an error if the OS send buffers are full. Generally APIs providing non-blocking IO will also provide some sort of interface where you can efficiently wait for certain operations to enter a state where invoking the non-blocking IO operation will actually make some progress rather than immediately returning. APIs in this style are implementations of the reactor pattern.

No OS that I know of implements non-blocking IO for file IO, but support for socket IO is generally reasonable:

• Non-blocking reads and writes are available via the POSIX O_NONBLOCK operating mode, which can be set on file descriptors (FDs) representing sockets and FIFOs.

• POSIX provides select and poll which let you wait for reads and writes to be ready on several FDs. (The difference between these two is pretty much just that select lets you wait for a number of FDs up to FD_SETSIZE, while poll can wait for as many FDs as you are allowed to create.)

Select and poll have the major disadvantage that when the kernel returns from one of these calls, you only know the number of FDs that got triggered — not which specific FDs have become unblocked. This means you later have to do a linear time scan across each of the FDs you supplied to figure out which one you actually need to use.

• This limitation motivated the development of several successor interfaces. BSD & OS X got kqueue, Solaris got /dev/poll, and Linux got epoll. Roughly speaking, these interfaces let you build up a set of FDs you are interested in watching, and then make a call that returns to you a list of those FDs in the set that were actually triggered.

There's lots of good info about these mechanisms at the classic C10K page. If you like hearing someone who clearly knows what he is talking about rant for 10 minutes about syscalls, this Bryan Cantrill bit about epoll is quite amusing.

• Unfortunately, Windows never got one of these successor mechanisms: only select is supported. It is possible to do an epoll-like thing by kicking off an operation that would normally block (e.g. WSARecv) with a specially prepared LPOVERLAPPED parameter, such that you can wait for it to complete using WSAWaitForMultipleEvents. Like epoll, when this wait returns it gives you a notion of which of the sockets of interest caused the wakeup. Unfortunately, this API won't let you wait for more than 64 events; if you want to wait for more you need to create child threads that recursively call WSAWaitForMultipleEvents, and then wait on those threads!

• The reason that Windows support is a bit lacking here is that they seem to expect you to use an asynchronous IO mechanism instead: either completion ports, or completion handlers. (Completion handlers are implemented using the Windows APC mechanism and are a form of callback that doesn't require a thread pool. Instead, they are executed in the spare CPU time when the thread that issued the IO operation is otherwise suspended, e.g. in a call to WaitForMultipleObjectsEx).

In Java, non-blocking IO has been exposed via SelectableChannel since JDK4. As I mentioned above, OS support for non-blocking IO on files is nonexistent; correspondingly, Java's SocketChannel extends SelectableChannel, but FileChannel does not.

The JDK implements SelectableChannel using whatever the platform-appropriate API is (i.e. epoll, kqueue, /dev/poll, poll or select). The Windows implementation is based on select — to ameliorate the fact that select requires a linear scan, the JDK creates a new thread for every 1024 sockets being waited on.
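A minimal sketch of the reactor style in Java follows. To keep it self-contained it uses a java.nio.channels.Pipe rather than a socket; the class name NonBlockingPipeDemo, the method roundTrip and the 256-byte buffer are assumptions for the example. Note that Java reports "would block" as a return value of 0 from read rather than as an EAGAIN-style error.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.Pipe;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.charset.StandardCharsets;

class NonBlockingPipeDemo {
    // Sends a short message through a pipe and reads it back using a
    // non-blocking source channel and a Selector.
    static String roundTrip(String message) {
        try (Selector selector = Selector.open()) {
            Pipe pipe = Pipe.open();
            try (Pipe.SourceChannel source = pipe.source();
                 Pipe.SinkChannel sink = pipe.sink()) {
                source.configureBlocking(false);
                ByteBuffer buffer = ByteBuffer.allocate(256);

                // Nothing has been written yet, so a blocking read would
                // hang here. The non-blocking read returns 0 immediately.
                if (source.read(buffer) != 0) {
                    throw new IllegalStateException("expected no data yet");
                }

                source.register(selector, SelectionKey.OP_READ);
                sink.write(ByteBuffer.wrap(message.getBytes(StandardCharsets.UTF_8)));

                // Wait until at least one registered channel is ready, then
                // read only from the channels the selector reports.
                selector.select();
                for (SelectionKey key : selector.selectedKeys()) {
                    if (key.isReadable()) {
                        ((Pipe.SourceChannel) key.channel()).read(buffer);
                    }
                }
                selector.selectedKeys().clear();

                buffer.flip();
                return StandardCharsets.UTF_8.decode(buffer).toString();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("ping"));
    }
}
```

A real server would loop around select and handle partial reads; this version assumes the whole message is short enough to arrive in one read and to fit in the buffer.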

Conclusions

Let's say that you want to do Java IO in a non-synchronous way. The bottom line is:

• If you want to do IO against files, your only option is asynchronous IO. You'll need to roll it yourself with JDK6 and below (and the resulting implementation won't be as concurrent as you might expect on Windows). On the other hand, with Java 7 and up you can just use the built-in mechanisms, and what you'll get is basically as good as the state-of-the-art.

• If you want to do IO against sockets, an ideal solution would use non-blocking IO on *nix and asynchronous IO on Windows. This is obviously a bit awkward to do, since it involves working with two rather different APIs. There might be some project akin to libuv that wraps these two mechanisms up into a single API you can write against, but I don't know of it if so.

The Netty project is an interesting data point. This high performance Java server is based principally on non-blocking IO, but they did make an abortive attempt to use async IO instead at one point — it was backed out because there was no performance advantage to using async IO instead of non-blocking IO on Linux. Some users report that the now-removed asynchronous IO code drastically reduces CPU usage on Windows, but others report that Java's dubious select-based implementation of Windows non-blocking IO is good enough.

The Haskell community has built up a great resource: the Hackage Haskell package database, where we recently hit the 500-package mark!

One of those 500 packages was mine, I added another to their number just an hour ago, and I've got two more in the oven. Given, then, that I'm starting to maintain a few packages, I went to the trouble of automating the Hackage release process, and in this post I'm going to briefly walk through setting up this automated environment.

1. Install cabal-upload from Hackage. I'm afraid that at the time of writing this is not perfectly simple because it won't build with GHC 6.8 or above: this can be fixed with a new .cabal file, however, which I've made available here. (Edit: I've just noticed that this functionality seems to have been added to Cabal itself! You may just be able to use cabal upload. However, I'm not sure what the right config file location is for the next step).
3. Copy the following shell script into a file called release in the root of your project (the same directory as the Setup.lhs file):
#!/bin/bash
#

echo "Have you updated the version number? Type 'yes' if you have!"
read version_response

if [ "$version_response" != "yes" ]; then
    echo "Go and update the version number"
    exit 1
fi

sdist_output=$(runghc Setup.lhs sdist)

if [ "$?" != "0" ]; then
    echo "Cabal sdist failed, aborting"
    exit 1
fi

# Want to find a line like:
# Source tarball created: dist/ansi-terminal-0.1.tar.gz

# Test this with:
# runghc Setup.lhs sdist | grep ...
filename=$(echo $sdist_output | sed 's/.*Source tarball created: \([^ ]*\).*/\1/')
echo "Filename: $filename"

if [ "$filename" = "$sdist_output" ]; then
    echo "Could not find filename, aborting"
    exit 1
fi

# Test this with:
# echo dist/ansi-terminal-0.1.tar.gz | sed ...
version=$(echo $filename | sed 's/^[^0-9]*\([0-9.]*\)\.tar\.gz$/\1/')
echo "Version: $version"

if [ "$version" = "$filename" ]; then
    echo "Could not find version, aborting"
    exit 1
fi

echo "This is your last chance to abort! I'm going to upload in 10 seconds"
sleep 10

git tag "v$version"

if [ "$?" != "0" ]; then
    echo "Git tag failed, aborting"
    exit 1
fi

# You need to have stored your Hackage username and password as directed by cabal-upload
# I use -v5 because otherwise the error messages can be cryptic 🙂
cabal-upload -v5 $filename

if [ "$?" != "0" ]; then
    echo "Hackage upload failed, aborting"
    exit 1
fi

# Success!
exit 0

4. When you're ready to release something, simply run the shell script! Not only will this package up your project and upload it to Hackage, it will also add a version tag to your Git repository (obviously you should change this bit if you are using another VCS!).

If you would like to follow my continuing adventures in Haskell open source, please check out my GitHub profile! Patches gratefully accepted 🙂

Leaving Windows Behind

So, I finally switched to Mac a month ago. I've had a Mac laptop since last summer and have been very pleased with the experience, so since Windows Vista has been giving me huge amounts of trouble (e.g. see my last post on getting Cygwin to work, though I won't go into the full gamut of issues I had here) I decided to go for an Apple desktop machine too. Happily, Steve Jobs has heard my cries of Windows-inflicted pain and ordered his minions to release a new revision of this baby:

Beautiful, isn't it? With 8 cores of Xeon love, it's no slouch in the performance department either. Salivation-inducing hardware aside, it comes with OS X, which is so much better than Vista that it's simply not even funny. Overall it's fair to say that I've been very pleased with my purchase 🙂

There have been some problems switching, of course. I have Parallels Desktop installed so that I can still develop using C# and I will probably end up installing Office 2007 on there at some point as well, but for pretty much everything else I've been able to find an acceptable or beyond-acceptable alternative for OS X. Here are some of my favourites:

• LaunchBar is a very neat application that you can use to quickly access many things on your Mac. For instance, if I want to play the album "Twin Cinema" in iTunes, I just press Option-Space, type "twin" into the box that comes up and press enter: fast and convenient.
Similarly, if I wanted to open TextMate I simply press Option-Space and type "mate". Of course, there are loads more things you can do with it, such as running any AppleScript you like or making a Google search... the list goes on. LaunchBar learns over time what abbreviations you want to associate with an action, and hence it becomes so natural that you soon find it hard to live without it! Unfortunately it is payware, but it's certainly well worth the price tag.

• Plot is a really nice graphing application. On Windows I was using Gnuplot, which is doubtless powerful but insanely hard to use. Plot just works and supports pretty much every feature I need. The graphs it outputs look very professional: see for yourself.

• LyX is what I'm using instead of Office (NeoOffice, the OS X OpenOffice port, is too sluggish for words so I'm trying to avoid it). It's a nice friendly interface onto an OS X LaTeX distribution that makes the common case fast while still letting you access the full power of LaTeX when you need it. The application is actually nominally cross platform but I had numerous problems with crashes and weird behavior in the Windows version that have yet to occur on Mac.

• I bought 1Password (it actually came as part of the MacHeist deal) to replace my long-time Windows password manager Password Manager XP. I have no complaints: on the contrary, 1Password's integration with Firefox and the OS is much more reliable and complete than Password Manager XP ever managed. What's more, they are about to release a service called my1Password that will let me get web-enabled access to my passwords from any location and platform! I'm happy as a clam about this as it's proven impossible to find a decent cross platform desktop password manager application.
I should give a shout out to Clipperz here as they have had a decent implementation of this for a while, but the lack of integration with my main password manager (so I have to maintain two lists) and minuscule password limit have put me off using it regularly. UPDATE: Marco Barulli from Clipperz has responded to what I said here: please read this post to get the full story.

• Time Machine, oh Time Machine, how did I ever get backups done before I had you? The answer is: with great difficulty. On Windows I set up a scheduled task to use SyncBack to clone my hard disk to another server. Unfortunately, this was pretty unreliable (partly because I was backing up onto a Linux file system that had an imperfect emulation of Windows security and didn't seem to support Unicode properly) and also meant that I only had a backup of the most recent version of my filesystem. With Time Machine everything is seamless and I can go back weeks or months in time to see my files at any point, all from within the Finder! Awesome!

• And finally, maybe you don't find the OS X Terminal very exciting, but for someone who has wasted many hours struggling with Cygwin and its numerous problems (e.g. the awkward attempt to reconcile the Windows and Unix permission models) it is a godsend to finally have a real Unix shell available 🙂

I haven't even mentioned some perennial favourites like Transmission, Perian or AppFresh, but my time is limited! If you really feel the need to peek into all the applications I have installed, take a look at my iusethis profile.

Overall my switching experience has been almost entirely painless and has certainly made me more productive and satisfied with my machine. Here's to many more happy years with Apple computers!

Using Cygwin on Windows Vista as an Administrator

I've just managed to fix a particularly annoying issue I was having with Cygwin under Vista with UAC enabled.
Essentially, when I ran the Cygwin bash shell as an Administrator I was seeing the following:

• Bash started in /bin/
• The bash shell had no knowledge of Unix paths (i.e. the current working directory was /c/Cygwin/bin)
• The root file system was just my C: drive
• My .bashrc and .bash_profile were not being executed

All of this made it fairly hard to actually do anything. However, I was given a clue when I started up Poderosa (a nice GUI terminal emulator) and tried to open a Cygwin console: there was an error message about not being able to access "HKEY_CURRENT_USER\Software\Cygnus Solutions\Cygwin\mounts v2\/". This key contains the mount information for the root directory of the Cygwin instance, so it looked like the root of the problem.

I then opened up Process Monitor to see what happened when, running under the Administrator user, Poderosa tried to access that key. It got "NAME NOT FOUND" as expected. I then tried the same thing as a normal user, and found that the response code was "REPARSE" instead! The log looked like this:

What, do you ask, is all that VirtualStore nonsense after the REPARSE has occurred? Well, that REPARSE is actually redirecting Poderosa's read of the LOCAL_MACHINE registry key to some keys in the CURRENT_USER hive: this is a new "feature" of Windows Vista designed to ensure that old applications that assume they can write to LOCAL_MACHINE still work. Any such writes by a process without the appropriate permissions are redirected to the VirtualStore instead, and then read back later transparently by this REPARSE mechanism. However, in their wisdom Microsoft have decided that this redirection is disabled when applications are run as an Administrator from a UAC protected account.
What must have happened is that I accidentally installed or modified Cygwin using the Setup program as a normal user, and its writes to HKLM were hence redirected to VirtualStore, which is then not consulted when you try and run Cygwin proper as an Administrator. My fix was simply to:

1. Export the VirtualStore branch as a ".reg" file from the registry editor
2. Open the resulting file in Notepad and replace the text "HKEY_CURRENT_USER\Software\Classes\VirtualStore\MACHINE" with "HKEY_LOCAL_MACHINE"
3. Import the modified .reg file using the registry editor again

After that process the user-private data written by Cygwin setup was accessible machine-wide, solving my problem. I hope that this post will be able to help out someone else with similar problems!

Building YAWS For Windows

Inspired by Yariv's Blog, where he talks about a framework for building web applications in Erlang, and my so far abortive attempts to get into Erlang, I decided to give it another go with Erlyweb. Erlyweb depends on YAWS (Yet Another Web Server), however, and this proved to be a bit of a pain to install since I'm being difficult and using Windows on my development machine. So, in order to help any other lost souls who try to duplicate this feat in the future, I'm recording the process (tested against YAWS 1.68):

1. Install Erlang (obviously), and make sure it is in your PATH
2. Install Cygwin with the Perl and GNU Make packages at minimum
3. Unpack the latest YAWS release into your home directory
4. Now, the first trickiness: there is a small error in the YAWS makefile, so open up the yaws-X.XX\src\Makefile and for the mime_types.erl target change the first command to be not $(ERL) but "$(ERL)". The quotes mean that for those of us with Erlang installed in a path with spaces in the name (such as Windows users who put it in Program Files) the erl executable will actually be found.
If you don't follow this step you'll end up with some error like:

/cygdrive/c/Program Files/Erlang/5.5.4/bin/erl -noshell -pa ../ebin -s mime_type_c compile
make[1]: /cygdrive/c/Program: Command not found

5. Follow the same process to add quotes around $(ERLC) in www\shopingcart\Makefile and www\code\Makefile (somewhat weirdly, every other use of $(ERL) and $(ERLC) has been quoted for us, suggesting this is just something they overlooked, rather than that running on Windows is a blasphemy)
6. Whack open a Bash Cygwin shell and cd into the yaws-X.XX directory
7. Do the usual "./configure; make" dance
8. Open up the newly created yaws file in the bin subdirectory and change the last line so that $erl is in quotes, i.e. from this:

${RUN_ERL} "exec $erl $XEC"

To this:

${RUN_ERL} 'exec "$erl" \$XEC'

9. From this point on I'm going to assume you need to do a local install: if you want to do your own thing, you can follow the instructions here, but you may need to adapt them based on what I'm going to talk about below. Anyway, run "make local_install" to do the install if you are following along at home
10. Now, this is where it can get a bit confusing: although we just built YAWS under Cygwin, since we have a Windows binary of Erlang the paths in yaws.conf (which should have appeared in your home directory) must be Windows paths, but the makefile used Unix ones. Go in and fix all of those (for me, this meant putting "C:/Cygwin" in front of all of them)