This post aims to explain the difference between asynchronous and non-blocking IO, with particular reference to their implementation in Java. These two styles of IO API are closely related but have a number of important differences, especially when it comes to OS support.

Asynchronous IO

Asynchronous IO refers to an interface where you supply a callback to an IO operation, which is invoked when the operation completes. This invocation often happens to an entirely different thread to the one that originally made the request, but this is not necessarily the case. Asynchronous IO is a manifestation of the “proactor” pattern.

One common way to implement asynchronous IO is to have a thread pool whose threads are used to make the normal blocking IO requests, and execute the appropriate callbacks when these return. The less common implementation approach is to avoid a thread pool, and just push the actual asynchronous operations down into the kernel. This alternative solution obviously has the disadvantage that it depends on operating system specific support for making async operations, but has the following advantages:

The maximum number of in-flight requests is not bounded by the size of your thread pool
The overhead of creating thread pool threads is avoided (e.g. you need not reserve any memory for the thread stacks, and you don’t pay the extra context switching cost associated with having more schedulable entities)
You expose more information to the kernel, which it can potentially use to make good choices about how to do the IO operations — e.g. by minimizing the distance that the disk head needs to travel to satisfy your requests, or by using native command queueing.

Operating system support for asynchronous IO is mixed:

Linux has at least two implementations of async IO:
- POSIX AIO (aio_read et al). This is implemented on Linux by glibc, but other POSIX systems (Solaris, OS X etc) have their own implementations. The glibc implementation is simply a thread pool based one — I’m not sure about the other systems.
- Linux kernel AIO (io_submit et al). No thread pool is used here, but it has quite a few limitations (e.g. it only works for files, not sockets, and has alignment restrictions on file reads) and does not seem to be used much in practice.
There is a good discussion of the *nix AIO situation on the libtorrent blog, summarised by the same writer on Stack Overflow here. The experience of this author was that the limitations and poor implementation quality of the various *nix AIO implementations are such that you are much better off just using your own thread pool to issue blocking operations.
Windows provides a mechanism called completion ports for performing asynchronous IO. With this system:
1. You start up a thread pool and arrange for each thread to spin calling GetQueuedCompletionStatus
2. You make IO requests using the normal Windows APIs (e.g. ReadFile and WSARecv), with the small added twist that you supply a special LPOVERLAPPED parameter indicating that the calls should be non-blocking and the result should be reported to the thread pool
3. As IO completes, thread pool threads blocked on GetQueuedCompletionStatus are woken up as necessary to process completion events
Windows intelligently schedules how it delivers GetQueuedCompletionStatus wakeups, such that it tries to roughly keep the same number of threads active at any time. This avoids excessive context switching and scheduler transitions — things are arranged so that a thread which has just processed a completion event will likely be able to immediately grab a new work item. With this arrangement, your pool can be much smaller than the number of IO operations you want to have in-flight: you only need to have as many threads as are required to process completion events.

In Java, support for asynchronous IO was added as part of the NIO2 work in JDK7, and the appropriate APIs are exposed by the AsynchronousChannel class. On *nix, AsynchronousFileChannel and AsynchronousSocketChannel are implemented using the standard thread pool approach (the pools are owned by an AsynchronousChannelGroup). On Windows, completion ports are used — in this case, the AsynchronousChannelGroup thread poll is used as the GetQueuedCompletionStatus listeners.

If you happen to be stuck on JDK6, your only option is to ignore completion ports and roll your own thread pool to dispatch operations on e.g. standard synchronous FileChannels. However, if you do this you may find that you don’t actually get much concurrency on Windows. This happens because FileChannel.read(ByteBuffer, long) is needlessly crippled by taking a lock on the whole FileChannel. This lock is needless because FileChannel is otherwise a thread-safe class, and in order to make sure your positioned read isn’t interfering with the other readers you don’t need to lock — you simply need to issue a ReadFile call with a custom position by using one of the fields of the LPOVERLAPPED struct parameter. Note that the *nix implementation of FileChannel.read does the right thing and simply issues a pread call without locking.

Non-blocking IO

Non-blocking IO refers to an interface where IO operations will return immediately with a special error code if called when they are in a state that would otherwise cause them to block. So for example, a non-blocking recv will return immediately with a EAGAIN or EWOULDBLOCK error code if no data is available on the socket, and likewise send will return immediately with an error if the OS send buffers are full. Generally APIs providing non-blocking IO will also provide some sort of interface where you can efficiently wait for certain operations to enter a state where invoking the non-blocking IO operation will actually make some progress rather than immediately returning. APIs in this style are implementations of the reactor pattern.

No OS that I know of implements non-blocking IO for file IO, but support for socket IO is generally reasonable:

Non-blocking read and writes are available via the POSIX O_NONBLOCK operating mode, which can be set on file descriptors (FDs) representing sockets and FIFOs.
POSIX provides select and poll which let you wait for reads and writes to be ready on several FDs. (The difference between these two is pretty much just that select lets you wait for a number of FDs up to FD_SETSIZE, while poll can wait for as many FDs as you are allowed to create.)

Select and poll have the major disadvantage that when the kernel returns from one of these calls, you only know the number of FDs that got triggered — not which specific FDs have become unblocked. This means you later have to do a linear time scan across each of the FDs you supplied to figure out which one you actually need to use.
This limitation motivated the development of several successor interfaces. BSD & OS X got kqueue, Solaris got /dev/poll, and Linux got epoll. Roughly speaking, these interfaces lets you build up a set of FDs you are interested in watching, and then make a call that returns to you a list those of FDs in the set that were actually triggered.

There’s lots of good info about these mechanisms at the classic C10K page. If you like hearing someone who clearly knows what he is talking about rant for 10 minutes about syscalls, this Bryan Cantrill bit about epoll is quite amusing.
Unfortunately, Windows never got one of these successor mechanisms: only select is supported. It is possible to do an epoll-like thing by kicking off an operation that would normally block (e.g. WSARecv) with a specially prepared LPOVERLAPPED parameter, such that you can wait it to complete using WSAWaitForMultipleEvents. Like epoll, when this wait returns it gives you a notion of which of the sockets of interest caused the wakeup. Unfortunately, this API won’t let you wait for more than 64 events — if you want to wait for more you need to create child threads that recursively call WSAWaitForMultipleEvents, and then wait on those threads!

In Java, non-blocking IO has been exposed via SelectableChannel since JDK4. As I mentioned above, OS support for non-blocking IO on files is nonexistant — correspondingly, Java’s SocketChannel extends SelectableChannel, but FileChannel does not.

The JDK implements SelectableChannel using whatever the platform-appropriate API is (i.e. epoll, kqueue, /dev/poll, poll or select). The Windows implementation is based on select — to ameliorate the fact that select requires a linear scan, the JDK creates a new thread for every 1024 sockets being waited on.

Conclusions

Let’s say that you want to do Java IO in a non-synchronous way. The bottom line is:

If you want to do IO against files, your only option is asynchronous IO. You’ll need to roll it yourself with JDK6 and below (and the resulting implementation won’t be as concurrent as you expect Windows). On the other hand, with Java 7 and up you can just use the built-in mechanisms, and what you’ll get is basically as good as the state-of-the-art.
If you want to do IO against sockets, an ideal solution would use non-blocking IO on *nix and asynchronous IO on Windows. This is obviously a bit awkward to do, since it involves working with two rather different APIs. There might be some project akin to libuv that wraps these two mechanisms up into a single API you can write against, but I don’t know of it if so.

The Netty project is an interesting data point. This high performance Java server is based principally on non-blocking IO, but they did make an abortive attempt to use async IO instead at one point — it was backed out because there was no performance advantage to using async IO instead of non-blocking IO on Linux. Some users report that the now-removed asynchronous IO code drastically reduces CPU usage on Windows, but others report that Java’s dubious select-based implementation of Windows non-blocking IO is good enough.