Category Archives: Windows

Asynchronous and non-blocking IO

This post aims to explain the difference between asynchronous and non-blocking IO, with particular reference to their implementation in Java. These two styles of IO API are closely related but have a number of important differences, especially when it comes to OS support.

Asynchronous IO

Asynchronous IO refers to an interface where you supply a callback to an IO operation, which is invoked when the operation completes. This invocation often happens on an entirely different thread to the one that originally made the request, but this is not necessarily the case. Asynchronous IO is a manifestation of the "proactor" pattern.

One common way to implement asynchronous IO is to have a thread pool whose threads are used to make the normal blocking IO requests, and execute the appropriate callbacks when these return. The less common implementation approach is to avoid a thread pool, and just push the actual asynchronous operations down into the kernel. This alternative solution obviously has the disadvantage that it depends on operating system specific support for making async operations, but has the following advantages:

  • The maximum number of in-flight requests is not bounded by the size of your thread pool
  • The overhead of creating thread pool threads is avoided (e.g. you need not reserve any memory for the thread stacks, and you don't pay the extra context switching cost associated with having more schedulable entities)
  • You expose more information to the kernel, which it can potentially use to make good choices about how to do the IO operations — e.g. by minimizing the distance that the disk head needs to travel to satisfy your requests, or by using native command queueing.
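The thread-based implementation approach can be sketched in C with POSIX threads. This is only an illustration under simplifying assumptions: it starts one worker thread per request instead of maintaining a real pool, and the names async_read, read_callback and store_result are hypothetical, not part of any library.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>

/* Callback invoked with the result of the read: bytes read, or -1 on error. */
typedef void (*read_callback)(ssize_t result, void *ctx);

struct async_read_req {
    int fd;
    void *buf;
    size_t len;
    read_callback cb;
    void *ctx;
};

/* Worker body: perform the ordinary blocking read, then run the callback
 * on this worker thread (not on the thread that made the request). */
static void *async_read_worker(void *arg) {
    struct async_read_req *req = arg;
    ssize_t n = read(req->fd, req->buf, req->len);
    req->cb(n, req->ctx);
    free(req);
    return NULL;
}

/* Start an asynchronous read. Returns 0 and stores the worker's handle in
 * *worker on success, or -1 if the request could not be started. */
int async_read(int fd, void *buf, size_t len, read_callback cb, void *ctx,
               pthread_t *worker) {
    assert(buf != NULL && cb != NULL && worker != NULL);
    struct async_read_req *req = malloc(sizeof *req);
    if (req == NULL) return -1;
    *req = (struct async_read_req){fd, buf, len, cb, ctx};
    if (pthread_create(worker, NULL, async_read_worker, req) != 0) {
        free(req);
        return -1;
    }
    return 0;
}

/* Example callback: stores the result in the ssize_t that ctx points to. */
void store_result(ssize_t result, void *ctx) {
    *(ssize_t *)ctx = result;
}
```

The caller gets control back as soon as the worker is started, and can later pthread_join the returned handle to be sure the callback has run. Every in-flight request ties up a thread here, which is exactly the cost that the kernel-level approach avoids.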

Operating system support for asynchronous IO is mixed:

  • Linux has at least two implementations of async IO:
    • POSIX AIO (aio_read et al). This is implemented on Linux by glibc, but other POSIX systems (Solaris, OS X etc) have their own implementations. The glibc implementation is simply a thread pool based one — I'm not sure about the other systems.
    • Linux kernel AIO (io_submit et al). No thread pool is used here, but it has quite a few limitations (e.g. it only works for files, not sockets, and has alignment restrictions on file reads) and does not seem to be used much in practice.

    There is a good discussion of the *nix AIO situation on the libtorrent blog, summarised by the same writer on Stack Overflow here. The experience of this author was that the limitations and poor implementation quality of the various *nix AIO implementations are such that you are much better off just using your own thread pool to issue blocking operations.

  • Windows provides a mechanism called completion ports for performing asynchronous IO. With this system:
    1. You start up a thread pool and arrange for each thread to spin calling GetQueuedCompletionStatus
    2. You make IO requests using the normal Windows APIs (e.g. ReadFile and WSARecv), with the small added twist that you supply a special LPOVERLAPPED parameter indicating that the calls should be non-blocking and the result should be reported to the thread pool
    3. As IO completes, thread pool threads blocked on GetQueuedCompletionStatus are woken up as necessary to process completion events

    Windows intelligently schedules how it delivers GetQueuedCompletionStatus wakeups, such that it tries to roughly keep the same number of threads active at any time. This avoids excessive context switching and scheduler transitions — things are arranged so that a thread which has just processed a completion event will likely be able to immediately grab a new work item. With this arrangement, your pool can be much smaller than the number of IO operations you want to have in-flight: you only need to have as many threads as are required to process completion events.

In Java, support for asynchronous IO was added as part of the NIO2 work in JDK7, and the appropriate APIs are exposed by the AsynchronousChannel interface. On *nix, AsynchronousFileChannel and AsynchronousSocketChannel are implemented using the standard thread pool approach (the pools are owned by an AsynchronousChannelGroup). On Windows, completion ports are used: in this case, the AsynchronousChannelGroup thread pool is used as the GetQueuedCompletionStatus listeners.

If you happen to be stuck on JDK6, your only option is to ignore completion ports and roll your own thread pool to dispatch operations on e.g. standard synchronous FileChannels. However, if you do this you may find that you don't actually get much concurrency on Windows. This happens because FileChannel.read(ByteBuffer, long) is needlessly crippled by taking a lock on the whole FileChannel. This lock is needless because FileChannel is otherwise a thread-safe class, and in order to make sure your positioned read isn't interfering with the other readers you don't need to lock — you simply need to issue a ReadFile call with a custom position by using one of the fields of the LPOVERLAPPED struct parameter. Note that the *nix implementation of FileChannel.read does the right thing and simply issues a pread call without locking.
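To make the pread point concrete, here is a minimal C sketch of a positioned read on a POSIX system. The names read_at and open_scratch_file are hypothetical helpers introduced for illustration; only pread, mkstemp and unlink are real calls.

```c
#define _XOPEN_SOURCE 700   /* ask for POSIX.1-2008 declarations such as pread */
#include <assert.h>
#include <stdlib.h>
#include <unistd.h>

/* Read up to len bytes starting at an explicit file offset.
 * pread neither uses nor moves the descriptor's shared file position, so
 * concurrent callers on the same fd need no lock between them. */
ssize_t read_at(int fd, void *buf, size_t len, off_t offset) {
    assert(buf != NULL && offset >= 0);
    return pread(fd, buf, len, offset);
}

/* Hypothetical helper for trying this out: creates an anonymous scratch
 * file. Returns the descriptor, or -1 on error. */
int open_scratch_file(void) {
    char path[] = "/tmp/pread_sketch_XXXXXX";
    int fd = mkstemp(path);
    if (fd != -1) unlink(path);   /* the file lives until fd is closed */
    return fd;
}
```

Because the offset travels with each call, two threads can call read_at on the same descriptor at different offsets without coordinating, which is the property the locked Windows implementation gives up.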

Non-blocking IO

Non-blocking IO refers to an interface where IO operations will return immediately with a special error code if called when they are in a state that would otherwise cause them to block. So for example, a non-blocking recv will return immediately with an EAGAIN or EWOULDBLOCK error code if no data is available on the socket, and likewise send will return immediately with an error if the OS send buffers are full. Generally APIs providing non-blocking IO will also provide some sort of interface where you can efficiently wait for certain operations to enter a state where invoking the non-blocking IO operation will actually make some progress rather than immediately returning. APIs in this style are implementations of the reactor pattern.
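As a small illustration of the "return immediately with an error code" behaviour, the following C sketch puts a descriptor into non-blocking mode and reports the would-block case separately from real errors. The names set_nonblocking, try_read and READ_WOULD_BLOCK are hypothetical; it assumes a POSIX system and works on pipes as well as sockets.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical sentinel returned by try_read when the read would block. */
enum { READ_WOULD_BLOCK = -2 };

/* Put a file descriptor into non-blocking mode.
 * Returns 0 on success, -1 on error. */
int set_nonblocking(int fd) {
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags == -1) return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1 ? -1 : 0;
}

/* Try to read without blocking. Returns the number of bytes read,
 * 0 at end of stream, -1 on a real error, or READ_WOULD_BLOCK when
 * no data is available right now. */
ssize_t try_read(int fd, void *buf, size_t len) {
    assert(buf != NULL);
    ssize_t n = read(fd, buf, len);
    if (n == -1 && (errno == EAGAIN || errno == EWOULDBLOCK))
        return READ_WOULD_BLOCK;
    return n;
}
```

On an empty non-blocking pipe, try_read returns READ_WOULD_BLOCK straight away; once the other end has written something, the same call returns the data.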

No OS that I know of implements non-blocking IO for file IO, but support for socket IO is generally reasonable:

  • Non-blocking reads and writes are available via the POSIX O_NONBLOCK operating mode, which can be set on file descriptors (FDs) representing sockets and FIFOs.

  • POSIX provides select and poll which let you wait for reads and writes to be ready on several FDs. (The difference between these two is pretty much just that select lets you wait for a number of FDs up to FD_SETSIZE, while poll can wait for as many FDs as you are allowed to create.)

    Select and poll have the major disadvantage that when the kernel returns from one of these calls, you only know the number of FDs that got triggered — not which specific FDs have become unblocked. This means you later have to do a linear time scan across each of the FDs you supplied to figure out which one you actually need to use.

  • This limitation motivated the development of several successor interfaces. BSD & OS X got kqueue, Solaris got /dev/poll, and Linux got epoll. Roughly speaking, these interfaces let you build up a set of FDs you are interested in watching, and then make a call that returns to you a list of those FDs in the set that were actually triggered.

    There's lots of good info about these mechanisms at the classic C10K page. If you like hearing someone who clearly knows what he is talking about rant for 10 minutes about syscalls, this Bryan Cantrill bit about epoll is quite amusing.

  • Unfortunately, Windows never got one of these successor mechanisms: only select is supported. It is possible to do an epoll-like thing by kicking off an operation that would normally block (e.g. WSARecv) with a specially prepared LPOVERLAPPED parameter, such that you can wait for it to complete using WSAWaitForMultipleEvents. Like epoll, when this wait returns it gives you a notion of which of the sockets of interest caused the wakeup. Unfortunately, this API won't let you wait for more than 64 events. If you want to wait for more, you need to create child threads that recursively call WSAWaitForMultipleEvents, and then wait on those threads!

  • The reason that Windows support is a bit lacking here is that they seem to expect you to use an asynchronous IO mechanism instead: either completion ports, or completion handlers. (Completion handlers are implemented using the Windows APC mechanism and are a form of callback that doesn't require a thread pool. Instead, they are executed in the spare CPU time when the thread that issued the IO operation is otherwise suspended, e.g. in a call to WaitForMultipleObjectsEx.)
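The linear scan that select and poll force on you, mentioned in the list above, looks roughly like this in C. This is a sketch assuming a POSIX system; wait_readable and MAX_FDS are hypothetical names, and the fixed-size array is just a simplification.

```c
#include <assert.h>
#include <poll.h>
#include <stddef.h>
#include <unistd.h>

#define MAX_FDS 64   /* arbitrary cap chosen for this sketch */

/* Wait until at least one of the nfds descriptors is readable (or has been
 * closed by its peer), then copy those descriptors into ready_out, which
 * must have room for nfds entries. Returns how many were copied, 0 on
 * timeout, or -1 on error. */
int wait_readable(const int *fds, size_t nfds, int timeout_ms, int *ready_out) {
    assert(fds != NULL && ready_out != NULL);
    if (nfds > MAX_FDS) return -1;

    struct pollfd pfds[MAX_FDS];
    for (size_t i = 0; i < nfds; i++) {
        pfds[i].fd = fds[i];
        pfds[i].events = POLLIN;
        pfds[i].revents = 0;
    }

    int n = poll(pfds, (nfds_t)nfds, timeout_ms);
    if (n <= 0) return n;

    /* poll's return value is only a count, so every entry has to be
     * examined to find the descriptors that were actually triggered. */
    int found = 0;
    for (size_t i = 0; i < nfds; i++) {
        if (pfds[i].revents & (POLLIN | POLLHUP))
            ready_out[found++] = fds[i];
    }
    return found;
}
```

With two pipes where only the second has data, the call returns 1 and reports the second pipe's read end, but only after looking at both entries. The successor interfaces such as epoll and kqueue hand back the triggered descriptors directly, so that scan disappears.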

In Java, non-blocking IO has been exposed via SelectableChannel since JDK4. As I mentioned above, OS support for non-blocking IO on files is nonexistent. Correspondingly, Java's SocketChannel extends SelectableChannel, but FileChannel does not.

The JDK implements SelectableChannel using whatever the platform-appropriate API is (i.e. epoll, kqueue, /dev/poll, poll or select). The Windows implementation is based on select — to ameliorate the fact that select requires a linear scan, the JDK creates a new thread for every 1024 sockets being waited on.

Conclusions

Let's say that you want to do Java IO in a non-synchronous way. The bottom line is:

  • If you want to do IO against files, your only option is asynchronous IO. You'll need to roll it yourself with JDK6 and below (and the resulting implementation won't be as concurrent as you expect on Windows). On the other hand, with Java 7 and up you can just use the built-in mechanisms, and what you'll get is basically as good as the state-of-the-art.

  • If you want to do IO against sockets, an ideal solution would use non-blocking IO on *nix and asynchronous IO on Windows. This is obviously a bit awkward to do, since it involves working with two rather different APIs. There might be some project akin to libuv that wraps these two mechanisms up into a single API you can write against, but I don't know of it if so.

    The Netty project is an interesting data point. This high performance Java server is based principally on non-blocking IO, but they did make an abortive attempt to use async IO instead at one point — it was backed out because there was no performance advantage to using async IO instead of non-blocking IO on Linux. Some users report that the now-removed asynchronous IO code drastically reduces CPU usage on Windows, but others report that Java's dubious select-based implementation of Windows non-blocking IO is good enough.

Rpath emulation: absolute DLL references on Windows

When creating an executable or shared library on Linux, it’s possible to include an ELF RPATH header which tells the dynamic linker where to search for any shared libraries that you reference. This is a pretty handy feature because it can be used to nail down exactly which shared library you will link against, without leaving anything up to chance at runtime.

Unfortunately, Windows does not have an equivalent feature. However, it does have an undocumented feature which may be enough to replace your use of rpath if you are porting software from Linux.

Executables or DLLs on Windows always reference any DLLs that they import by name only. So, the import table for an executable will refer to kernel32.dll rather than C:\Windows\System32\kernel32.dll. The Windows dynamic loader will look for a file with the appropriate name in the DLL search path as usual. (For full details on DLL import tables and more, you can check out my previous in-depth post.)

However, the Windows dynamic loader will, as a completely undocumented (and presumably unsupported) feature, also accept absolute paths in the import table. This is game-changing because it means that you can hard-code exactly which DLL you want to refer to, just like you would be able to with rpath on Linux.

Demonstration

To demonstrate this technique, we’re going to need code for a DLL and a referring EXE:

$ cat library.c
#include <stdio.h>

__declspec(dllexport) int librarycall(void) {
        printf("Made library call!\n");
        return 0;
}

$ cat rpath.c
__declspec(dllimport) int librarycall(void);

int main(int argc, char **argv) {
        return librarycall();
}

If we were building a DLL and EXE normally, we would do this:

<code>gcc -c library.c
gcc -shared -o library.dll library.o
gcc -o rpath rpath.c -L./ -llibrary</code>

This all works fine:

<code>$ ./rpath
Made library call!</code>

However, as you would expect, if you move library.dll elsewhere, the EXE will fail to start:

<code>$ mv library.dll C:/library.dll
$ ./rpath
/home/Max/rpath/rpath.exe: error while loading shared libraries: library.dll: cannot open shared object file: No such file or directory</code>

Now let’s work some magic! If we open up rpath.exe in a hex editor, we see something like this:

Let’s just tweak that a bit to change the relative path to library.dll to an absolute path. Luckily there is enough padding to make it fit:

The EXE will now work perfectly!

<code>$ ./rpath
Made library call!</code>

In practice

Knowing that this feature exists is one thing. Actually making use of it in a reliable way is another. The problem is that to my knowledge no linker is capable of creating a DLL or EXE that includes an absolute path in its import table. Sometimes we will be lucky enough that the linker creates an EXE or DLL with enough padding in it for us to manually edit in an absolute path, but with the method above there is no guarantee that this will be possible.

In order to exploit this technique robustly, we’re going to use a little trick with import libraries. Instead of using GCC’s ability to link directly to a DLL, we will generate an import library for the DLL, which we will call library.lib:

<code>$ dlltool --output-lib library.lib --dllname veryverylongdllname.dll library.o</code>

When you use dlltool you either need to write a .def file for the DLL you are creating an import library for, or you need to supply all the object files that were used to create the DLL. I’ve taken the second route here and just told dlltool that our DLL was built from library.o.

Now we have an import library, we can do our hex-editing trick again, but this time on the library. Before:

And after (note that I have null-terminated the new absolute path):

The beauty of editing the import library rather than the output of the linker is that using the --dllname option we can ensure that the import library contains as much space as we need to fit the entire absolute path of the DLL, no matter how long it may be. This is the key to making robust use of absolute paths in DLL loading, even if linkers don’t support them!
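The hex-editing step could in principle be automated. The following C sketch patches a placeholder DLL name inside an in-memory copy of the import library, under the assumption that the placeholder string occurs in the file exactly where the DLL name is stored. The function name patch_dll_name is hypothetical, and reading and writing the file is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Overwrite the first occurrence of the NUL-terminated string `placeholder`
 * inside buf with `replacement`, filling the leftover bytes with NULs so
 * that the new name is terminated and the file size is unchanged.
 * Returns the offset that was patched, or -1 if the placeholder is absent
 * or the replacement is longer than the placeholder. */
long patch_dll_name(unsigned char *buf, size_t len,
                    const char *placeholder, const char *replacement) {
    assert(buf != NULL && placeholder != NULL && replacement != NULL);
    size_t plen = strlen(placeholder);
    size_t rlen = strlen(replacement);
    if (plen == 0 || rlen > plen || len < plen + 1) return -1;

    for (size_t i = 0; i + plen + 1 <= len; i++) {
        /* Compare plen + 1 bytes so the terminating NUL must match too. */
        if (memcmp(buf + i, placeholder, plen + 1) == 0) {
            memset(buf + i, 0, plen + 1);
            memcpy(buf + i, replacement, rlen);
            return (long)i;
        }
    }
    return -1;
}
```

Refusing a replacement that is longer than the placeholder is the whole reason for choosing a generously long --dllname in the first place: the patch never has to grow the file or move anything else around.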

Now we have the import library, we can link rpath.exe again, but this time using the import library rather than library.dll:

<code>$ gcc -o rpath rpath.c library.lib
$ ./rpath
Made library call!</code>

Yes, it really is using the DLL on the C: drive:

<code>$ mv C:/library.dll C:/foo.dll
$ ./rpath
/home/Max/rpath/rpath.exe: error while loading shared libraries: C:\library.dll: cannot open shared object file: No such file or directory</code>

Conclusion

I haven’t seen this technique for using absolute paths for DLL references anywhere on the web, so it doesn’t seem to be widely known. However, it works beautifully on Windows 7 and probably on all other versions of Windows as well.

I may apply these techniques to the Glasgow Haskell Compiler in order to improve the support for Haskell shared objects on Windows: more information on this topic can be found on the GHC wiki.

Everything You Never Wanted To Know About DLLs

I've recently had cause to investigate how dynamic linking is implemented on Windows. This post is basically a brain dump of everything I've learnt on the issue. This is mostly for my future reference, but I hope it will be useful to others too as I'm going to bring together lots of information you would otherwise have to hunt around for.

Without further ado, here we go:

Export and import directories

The Windows executable loader is responsible for doing all dynamic loading and symbol resolution before running the code. The loader works out what functions are exported or imported by each image (an image is a DLL or EXE file) by inspecting the .edata and .idata sections of those images, respectively.

The contents of these sections are covered in detail by the PE/COFF specification.

The .edata section

This section records the exports of the image (yes, EXEs can export things). This takes the form of:

  • The export address table: an array of length N holding the addresses of the exported functions/data (the addresses are stored relative to the image base). Indexes into this table are called ordinals.
  • The export name pointer table: an array of length M holding pointers to strings that represent the name of an export. This array is lexically ordered by name, to allow binary searches for a given export.
  • The export ordinal table: a parallel array of length M holding the ordinal of the corresponding name in the export name pointer table.

(As an alternative to importing an image's export by its name, it is possible to import by specifying an ordinal. Importing by ordinal is slightly faster at runtime because the dynamic linker doesn't have to do a lookup. Furthermore, if the export is not given a name by the exporting DLL, importing by ordinal is the only way to do the import.)
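The relationship between the three tables can be sketched in C as follows. This is a simplified in-memory model, not the on-disk PE layout: real images store RVAs rather than C pointers and offset ordinals by an ordinal base, both of which are ignored here. The struct and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Simplified model of the three .edata tables. */
struct export_table {
    const uint32_t *addresses;   /* export address table */
    size_t n_addresses;          /* N */
    const char *const *names;    /* export name pointer table, sorted */
    const uint16_t *ordinals;    /* export ordinal table, parallel to names */
    size_t n_names;              /* M */
};

/* Import by ordinal: a direct index into the export address table.
 * Returns 0 and stores the address in *rva_out, or -1 if out of range. */
int lookup_by_ordinal(const struct export_table *t, uint16_t ordinal,
                      uint32_t *rva_out) {
    assert(t != NULL && rva_out != NULL);
    if (ordinal >= t->n_addresses) return -1;
    *rva_out = t->addresses[ordinal];
    return 0;
}

/* Import by name: binary search the sorted name table, then use the
 * parallel ordinal table to index the export address table.
 * Returns 0 on success, or -1 if the name is not exported. */
int lookup_by_name(const struct export_table *t, const char *name,
                   uint32_t *rva_out) {
    assert(t != NULL && name != NULL && rva_out != NULL);
    size_t lo = 0, hi = t->n_names;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        int cmp = strcmp(name, t->names[mid]);
        if (cmp == 0)
            return lookup_by_ordinal(t, t->ordinals[mid], rva_out);
        if (cmp < 0) hi = mid;
        else lo = mid + 1;
    }
    return -1;
}
```

The by-name path does strictly more work than the by-ordinal path, which is where the small runtime advantage of ordinal imports comes from.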

How does the .edata section get created in the first place? There are two main methods:

  1. Most commonly, it starts life in the object files created by compiling some source code that defines a function/some data that was declared with the __declspec(dllexport) modifier. The compiler just emits an appropriate .edata section naming these exports.

  2. Less commonly, the programmer might write a .def file specifying which functions they would like to export. By supplying this to dlltool --output-exp, an export file can be generated. An export file is just an object file which only contains a .edata section, exporting (via some unresolved references that will be filled in by the linker in the usual way) the symbols named in the .def file. This export file must be named by the programmer when they come to link together their object files into a DLL.

In both these cases, the linker collects the .edata sections from all objects named on the link line to build the .edata for the overall image file. One last possible way that the .edata can be created is by the linker itself, without having to put .edata into any object files:

  1. The linker could choose to export all symbols defined by object files named on the link line. For example, this is the default behaviour of GNU ld (the behaviour can also be explicitly asked for using --export-all-symbols). In this case, the linker generates the .edata section itself. (GNU ld also supports specifying a .def file on the command line, in which case the generated section will export just those things named by the .def).

The .idata section

The .idata section records those things that the image imports. It consists of:

  • For every image from which symbols are imported:

    • The filename of the image. Used by the dynamic linker to locate it on disk.

    • The import lookup table: an array of length N, in which each entry is either an ordinal or a pointer to a string representing the name to import.

    • The import address table: an array of N pointers. The dynamic linker is responsible for filling out this array with the address of the function/data named by the corresponding symbol in the import lookup table.
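How the dynamic linker uses the lookup table to fill in the address table can be sketched like this. It is a toy model in C with hypothetical types: the target image's exports are a flat array searched linearly, and ordinals and names are plain struct fields instead of the packed encoding used in real images.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* One import lookup table entry: either an ordinal or a name. */
struct import_lookup_entry {
    int by_ordinal;        /* non-zero: use `ordinal`; zero: use `name` */
    uint16_t ordinal;
    const char *name;
};

/* One export of the target image (simplified). */
struct exported_symbol {
    const char *name;
    uint16_t ordinal;
    void *address;
};

/* Fill iat[i] with the address of the symbol requested by lookup[i].
 * Returns 0 on success, or -1 if some import cannot be resolved. */
int bind_imports(const struct import_lookup_entry *lookup, void **iat, size_t n,
                 const struct exported_symbol *exports, size_t n_exports) {
    assert(lookup != NULL && iat != NULL && exports != NULL);
    for (size_t i = 0; i < n; i++) {
        int found = 0;
        for (size_t j = 0; j < n_exports && !found; j++) {
            int match = lookup[i].by_ordinal
                ? exports[j].ordinal == lookup[i].ordinal
                : strcmp(exports[j].name, lookup[i].name) == 0;
            if (match) {
                iat[i] = exports[j].address;   /* slot i now holds the address */
                found = 1;
            }
        }
        if (!found) return -1;
    }
    return 0;
}
```

The two arrays are parallel on purpose: slot i of the address table is always the answer to entry i of the lookup table, so code compiled against the image only ever needs to know the slot's location.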

The ways in which .idata entries are created are as follows:

  1. Most commonly, they originate in a library of object files called an 'import library'. This import library can be created by using dlltool on the DLL you wish to import from, or on a .def file of the type we discussed earlier. Just like the export file, the import library must be named by the user on the link line.

  2. Alternatively, some linkers (like GNU ld) let you specify a DLL directly on the link line. The linker will automatically generate .idata entries for any symbols that you must import from the DLL.

Notice that unlike the case when we were exporting symbols, __declspec(dllimport) does not cause .idata sections to be generated.

Import libraries are a bit more complicated than they first appear. The Windows dynamic loader fills the import address table with the addresses of the imported symbols (say, the address of a function Func). However, when the assembly code in other object files says call Func, it expects Func to name the address of that code. But we don't know that address until runtime: the only thing we know statically is the address where that address will be placed by the dynamic linker. We will call this address __imp__Func.

To deal with this extra level of indirection, the import library exports a function Func that just dereferences __imp__Func (to get the actual function pointer) and then jmps to it. All of the other object files in the project can now say call Func just as they would if Func had been defined in some other object file, rather than a DLL. For this reason, saying __declspec(dllimport) in the declaration of a dynamically linked function is optional (though in fact you will get slightly more efficient code if you add it, as we will see later).

Unfortunately, there is no equivalent trick if you want to import data from another DLL. If we have some imported data myData, there is no way the import library can be defined so that a mov %eax, myData in an object file linked against it writes to the storage for myData in that DLL. Instead, the import library defines a symbol __imp__myData that resolves to the address at which the linked-in address of the storage can be found. The compiler then ensures that when you read or write from a variable defined with __declspec(dllimport) those reads and writes go through the __imp__myData indirection. Because different code needs to be generated at the use site, __declspec(dllimport) declarations on data imports are not optional.
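Both indirections can be modelled in plain C. The sketch below is a simulation only: the slots are ordinary pointers named imp_Func and imp_myData (standing in for __imp__Func and __imp__myData, since identifiers beginning with a double underscore are reserved in C), and simulate_load is a hypothetical stand-in for the dynamic linker.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for the real definitions living inside the DLL. */
static int dll_myData = 42;
static int dll_Func(int x) { return x + dll_myData; }

/* Import address table slots. Their locations are known at link time,
 * but their contents are only filled in at load time. */
static int (*imp_Func)(int) = NULL;
static int *imp_myData = NULL;

/* Hypothetical stand-in for the dynamic linker filling the slots. */
void simulate_load(void) {
    imp_Func = dll_Func;
    imp_myData = &dll_myData;
}

/* What the import library's Func stub does: fetch the pointer from the
 * slot and transfer control to it. Callers can simply call Func. */
int Func(int x) {
    assert(imp_Func != NULL);
    return imp_Func(x);
}

/* What the compiler has to generate for a write to a dllimport variable:
 * the store goes through the pointer held in the slot. No stub can do
 * this on the caller's behalf, because a data access is not a call. */
void write_myData(int value) {
    assert(imp_myData != NULL);
    *imp_myData = value;
}
```

After simulate_load, Func(1) yields 43, and after write_myData(100) it yields 101, showing that both paths reach the storage inside the "DLL". The asymmetry is visible in the code: Func keeps its ordinary signature, whereas the data access had to be rewritten at the use site.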

Practical example

Theory is all very well but it can be helpful to see all the pieces in play.

Building a DLL

First, let's build a simple DLL exporting both functions and data. For maximum clarity, we'll use an explicit export file rather than decorating our functions with __declspec(dllexport) or supplying a .def file to the linker.

First let's write the .def file, library.def:

<code>LIBRARY library
EXPORTS
   function_export
   data_export      DATA
</code>

(The DATA keyword and LIBRARY line only affect how the import library is generated, as explained later on. Ignore them for now.)

Build an export file from that:

<code>$ dlltool --output-exp library_exports.o -d library.def
</code>

The resulting object basically just contains an .edata section that exports the symbols _data_export and _function_export under the names data_export and function_export respectively:

<code>$ objdump -xs library_exports.o

...

There is an export table in .edata at 0x0

The Export Tables (interpreted .edata section contents)

Export Flags                    0
Time/Date stamp                 4e10e5c1
Major/Minor                     0/0
Name                            00000028 library_exports.o.dll
Ordinal Base                    1
Number in:
        Export Address Table            00000002
        [Name Pointer/Ordinal] Table    00000002
Table Addresses
        Export Address Table            00000040
        Name Pointer Table              00000048
        Ordinal Table                   00000050

Export Address Table -- Ordinal Base 1

[Ordinal/Name Pointer] Table
        [   0] data_export
        [   1] function_export

Sections:
Idx Name          Size      VMA       LMA       File off  Algn
  0 .text         00000000  00000000  00000000  00000000  2**2
                  ALLOC, LOAD, READONLY, CODE
  1 .data         00000000  00000000  00000000  00000000  2**2
                  ALLOC, LOAD, DATA
  2 .bss          00000000  00000000  00000000  00000000  2**2
                  ALLOC
  3 .edata        00000070  00000000  00000000  000000b4  2**2
                  CONTENTS, ALLOC, LOAD, RELOC, READONLY, DATA
SYMBOL TABLE:
[  0](sec -2)(fl 0x00)(ty   0)(scl 103) (nx 1) 0x00000000 fake
File
[  2](sec  4)(fl 0x00)(ty   0)(scl   3) (nx 0) 0x00000028 name
[  3](sec  4)(fl 0x00)(ty   0)(scl   3) (nx 0) 0x00000040 afuncs
[  4](sec  4)(fl 0x00)(ty   0)(scl   3) (nx 0) 0x00000048 anames
[  5](sec  4)(fl 0x00)(ty   0)(scl   3) (nx 0) 0x00000050 anords
[  6](sec  4)(fl 0x00)(ty   0)(scl   3) (nx 0) 0x00000054 n1
[  7](sec  4)(fl 0x00)(ty   0)(scl   3) (nx 0) 0x00000060 n2
[  8](sec  1)(fl 0x00)(ty   0)(scl   3) (nx 1) 0x00000000 .text
AUX scnlen 0x0 nreloc 0 nlnno 0
[ 10](sec  2)(fl 0x00)(ty   0)(scl   3) (nx 1) 0x00000000 .data
AUX scnlen 0x0 nreloc 0 nlnno 0
[ 12](sec  3)(fl 0x00)(ty   0)(scl   3) (nx 1) 0x00000000 .bss
AUX scnlen 0x0 nreloc 0 nlnno 0
[ 14](sec  4)(fl 0x00)(ty   0)(scl   3) (nx 1) 0x00000000 .edata
AUX scnlen 0x70 nreloc 8 nlnno 0
[ 16](sec  0)(fl 0x00)(ty   0)(scl   2) (nx 0) 0x00000000 _data_export
[ 17](sec  0)(fl 0x00)(ty   0)(scl   2) (nx 0) 0x00000000 _function_export


RELOCATION RECORDS FOR [.edata]:
OFFSET   TYPE              VALUE
0000000c rva32             .edata
0000001c rva32             .edata
00000020 rva32             .edata
00000024 rva32             .edata
00000040 rva32             _data_export
00000044 rva32             _function_export
00000048 rva32             .edata
0000004c rva32             .edata


Contents of section .edata:
 0000 00000000 c1e5104e 00000000 28000000  .......N....(...
 0010 01000000 02000000 02000000 40000000  ............@...
 0020 48000000 50000000 6c696272 6172795f  H...P...library_
 0030 6578706f 7274732e 6f2e646c 6c000000  exports.o.dll...
 0040 00000000 00000000 54000000 60000000  ........T...`...
 0050 00000100 64617461 5f657870 6f727400  ....data_export.
 0060 66756e63 74696f6e 5f657870 6f727400  function_export.
</code>

We'll fulfil these symbols with a trivial implementation of the DLL, library.c:

int data_export = 42;

int function_export() {
    return 1337 + data_export;
}

We can put it together into a DLL:

<code>$ gcc -shared -o library.dll library.c library_exports.o
</code>

The export table for the DLL is as follows, showing that we have exported what we wanted:

<code>The Export Tables (interpreted .edata section contents)

Export Flags                    0
Time/Date stamp                 4e10e5c1
Major/Minor                     0/0
Name                            00005028 library_exports.o.dll
Ordinal Base                    1
Number in:
        Export Address Table            00000002
        [Name Pointer/Ordinal] Table    00000002
Table Addresses
        Export Address Table            00005040
        Name Pointer Table              00005048
        Ordinal Table                   00005050

Export Address Table -- Ordinal Base 1
        [   0] +base[   1] 200c Export RVA
        [   1] +base[   2] 10f0 Export RVA

[Ordinal/Name Pointer] Table
        [   0] data_export
        [   1] function_export
</code>

Using the DLL

When we come to look at using the DLL, things become a lot more interesting. First, we need an import library:

<code>$ dlltool --output-lib library.dll.a -d library.def
</code>

(The reason that we have an import library but an export object is that using a library for the imports allows the linker to discard .idata for any imports that are not used. Contrariwise, the linker can never discard any .edata entry because any export may potentially be used by a user of the DLL.)

This import library is rather complex. It contains one object for each export (disds00000.o and disds00001.o) but also two other object files (disdt.o and disdh.o) that set up the header and footer of the import list. (The header of the import list contains, among other things, the name of the DLL to link in at runtime, as derived from the LIBRARY line of the .def file.)

<code>$ objdump -xs library.dll.a
In archive library.dll.a:

disdt.o:     file format pe-i386

...

Sections:
Idx Name          Size      VMA       LMA       File off  Algn
  0 .text         00000000  00000000  00000000  00000000  2**2
                  ALLOC, LOAD, READONLY, CODE
  1 .data         00000000  00000000  00000000  00000000  2**2
                  ALLOC, LOAD, DATA
  2 .bss          00000000  00000000  00000000  00000000  2**2
                  ALLOC
  3 .idata$4      00000004  00000000  00000000  00000104  2**2
                  CONTENTS, ALLOC, LOAD, DATA
  4 .idata$5      00000004  00000000  00000000  00000108  2**2
                  CONTENTS, ALLOC, LOAD, DATA
  5 .idata$7      0000000c  00000000  00000000  0000010c  2**2
                  CONTENTS, ALLOC, LOAD, DATA
SYMBOL TABLE:
[  0](sec -2)(fl 0x00)(ty   0)(scl 103) (nx 1) 0x00000000 fake
File
[  2](sec  1)(fl 0x00)(ty   0)(scl   3) (nx 1) 0x00000000 .text
AUX scnlen 0x0 nreloc 0 nlnno 0
[  4](sec  2)(fl 0x00)(ty   0)(scl   3) (nx 1) 0x00000000 .data
AUX scnlen 0x0 nreloc 0 nlnno 0
[  6](sec  3)(fl 0x00)(ty   0)(scl   3) (nx 1) 0x00000000 .bss
AUX scnlen 0x0 nreloc 0 nlnno 0
[  8](sec  4)(fl 0x00)(ty   0)(scl   3) (nx 1) 0x00000000 .idata$4
AUX scnlen 0x4 nreloc 0 nlnno 0
[ 10](sec  5)(fl 0x00)(ty   0)(scl   3) (nx 1) 0x00000000 .idata$5
AUX scnlen 0x4 nreloc 0 nlnno 0
[ 12](sec  6)(fl 0x00)(ty   0)(scl   3) (nx 1) 0x00000000 .idata$7
AUX scnlen 0x7 nreloc 0 nlnno 0
[ 14](sec  6)(fl 0x00)(ty   0)(scl   2) (nx 0) 0x00000000 __library_dll_a_iname


Contents of section .idata$4:
 0000 00000000                             ....
Contents of section .idata$5:
 0000 00000000                             ....
Contents of section .idata$7:
 0000 6c696272 6172792e 646c6c00           library.dll.

disdh.o:     file format pe-i386

...

Sections:
Idx Name          Size      VMA       LMA       File off  Algn
  0 .text         00000000  00000000  00000000  00000000  2**2
                  ALLOC, LOAD, READONLY, CODE
  1 .data         00000000  00000000  00000000  00000000  2**2
                  ALLOC, LOAD, DATA
  2 .bss          00000000  00000000  00000000  00000000  2**2
                  ALLOC
  3 .idata$2      00000014  00000000  00000000  00000104  2**2
                  CONTENTS, ALLOC, LOAD, RELOC, DATA
  4 .idata$5      00000000  00000000  00000000  00000000  2**2
                  ALLOC, LOAD, DATA
  5 .idata$4      00000000  00000000  00000000  00000000  2**2
                  ALLOC, LOAD, DATA
SYMBOL TABLE:
[  0](sec -2)(fl 0x00)(ty   0)(scl 103) (nx 1) 0x00000000 fake
File
[  2](sec  6)(fl 0x00)(ty   0)(scl   3) (nx 0) 0x00000000 hname
[  3](sec  5)(fl 0x00)(ty   0)(scl   3) (nx 0) 0x00000000 fthunk
[  4](sec  1)(fl 0x00)(ty   0)(scl   3) (nx 1) 0x00000000 .text
AUX scnlen 0x0 nreloc 0 nlnno 0
[  6](sec  2)(fl 0x00)(ty   0)(scl   3) (nx 1) 0x00000000 .data
AUX scnlen 0x0 nreloc 0 nlnno 0
[  8](sec  3)(fl 0x00)(ty   0)(scl   3) (nx 1) 0x00000000 .bss
AUX scnlen 0x0 nreloc 0 nlnno 0
[ 10](sec  4)(fl 0x00)(ty   0)(scl   3) (nx 1) 0x00000000 .idata$2
AUX scnlen 0x14 nreloc 3 nlnno 0
[ 12](sec  6)(fl 0x00)(ty   0)(scl   3) (nx 0) 0x00000000 .idata$4
[ 13](sec  5)(fl 0x00)(ty   0)(scl   3) (nx 0) 0x00000000 .idata$5
[ 14](sec  4)(fl 0x00)(ty   0)(scl   2) (nx 0) 0x00000000 __head_library_dll_a
[ 15](sec  0)(fl 0x00)(ty   0)(scl   2) (nx 0) 0x00000000 __library_dll_a_iname


RELOCATION RECORDS FOR [.idata$2]:
OFFSET   TYPE              VALUE
00000000 rva32             .idata$4
0000000c rva32             __library_dll_a_iname
00000010 rva32             .idata$5


Contents of section .idata$2:
 0000 00000000 00000000 00000000 00000000  ................
 0010 00000000                             ....

disds00001.o:     file format pe-i386

...

Sections:
Idx Name          Size      VMA       LMA       File off  Algn
  0 .text         00000008  00000000  00000000  0000012c  2**2
                  CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE
  1 .data         00000000  00000000  00000000  00000000  2**2
                  ALLOC, LOAD, DATA
  2 .bss          00000000  00000000  00000000  00000000  2**2
                  ALLOC
  3 .idata$7      00000004  00000000  00000000  00000134  2**2
                  CONTENTS, RELOC
  4 .idata$5      00000004  00000000  00000000  00000138  2**2
                  CONTENTS, RELOC
  5 .idata$4      00000004  00000000  00000000  0000013c  2**2
                  CONTENTS, RELOC
  6 .idata$6      00000012  00000000  00000000  00000140  2**1
                  CONTENTS
SYMBOL TABLE:
[  0](sec  1)(fl 0x00)(ty   0)(scl   3) (nx 0) 0x00000000 .text
[  1](sec  2)(fl 0x00)(ty   0)(scl   3) (nx 0) 0x00000000 .data
[  2](sec  3)(fl 0x00)(ty   0)(scl   3) (nx 0) 0x00000000 .bss
[  3](sec  4)(fl 0x00)(ty   0)(scl   3) (nx 0) 0x00000000 .idata$7
[  4](sec  5)(fl 0x00)(ty   0)(scl   3) (nx 0) 0x00000000 .idata$5
[  5](sec  6)(fl 0x00)(ty   0)(scl   3) (nx 0) 0x00000000 .idata$4
[  6](sec  7)(fl 0x00)(ty   0)(scl   3) (nx 0) 0x00000000 .idata$6
[  7](sec  1)(fl 0x00)(ty   0)(scl   2) (nx 0) 0x00000000 _function_export
[  8](sec  5)(fl 0x00)(ty   0)(scl   2) (nx 0) 0x00000000 __imp__function_export
[  9](sec  0)(fl 0x00)(ty   0)(scl   2) (nx 0) 0x00000000 __head_library_dll_a


RELOCATION RECORDS FOR [.text]:
OFFSET   TYPE              VALUE
00000002 dir32             .idata$5


RELOCATION RECORDS FOR [.idata$7]:
OFFSET   TYPE              VALUE
00000000 rva32             __head_library_dll_a


RELOCATION RECORDS FOR [.idata$5]:
OFFSET   TYPE              VALUE
00000000 rva32             .idata$6


RELOCATION RECORDS FOR [.idata$4]:
OFFSET   TYPE              VALUE
00000000 rva32             .idata$6


Contents of section .text:
 0000 ff250000 00009090                    .%......
Contents of section .idata$7:
 0000 00000000                             ....
Contents of section .idata$5:
 0000 00000000                             ....
Contents of section .idata$4:
 0000 00000000                             ....
Contents of section .idata$6:
 0000 01006675 6e637469 6f6e5f65 78706f72  ..function_expor
 0010 7400                                 t.

disds00000.o:     file format pe-i386

...

Sections:
Idx Name          Size      VMA       LMA       File off  Algn
  0 .text         00000000  00000000  00000000  00000000  2**2
                  ALLOC, LOAD, READONLY, CODE
  1 .data         00000000  00000000  00000000  00000000  2**2
                  ALLOC, LOAD, DATA
  2 .bss          00000000  00000000  00000000  00000000  2**2
                  ALLOC
  3 .idata$7      00000004  00000000  00000000  0000012c  2**2
                  CONTENTS, RELOC
  4 .idata$5      00000004  00000000  00000000  00000130  2**2
                  CONTENTS, RELOC
  5 .idata$4      00000004  00000000  00000000  00000134  2**2
                  CONTENTS, RELOC
  6 .idata$6      0000000e  00000000  00000000  00000138  2**1
                  CONTENTS
SYMBOL TABLE:
[  0](sec  1)(fl 0x00)(ty   0)(scl   3) (nx 0) 0x00000000 .text
[  1](sec  2)(fl 0x00)(ty   0)(scl   3) (nx 0) 0x00000000 .data
[  2](sec  3)(fl 0x00)(ty   0)(scl   3) (nx 0) 0x00000000 .bss
[  3](sec  4)(fl 0x00)(ty   0)(scl   3) (nx 0) 0x00000000 .idata$7
[  4](sec  5)(fl 0x00)(ty   0)(scl   3) (nx 0) 0x00000000 .idata$5
[  5](sec  6)(fl 0x00)(ty   0)(scl   3) (nx 0) 0x00000000 .idata$4
[  6](sec  7)(fl 0x00)(ty   0)(scl   3) (nx 0) 0x00000000 .idata$6
[  7](sec  5)(fl 0x00)(ty   0)(scl   2) (nx 0) 0x00000000 __imp__data_export
[  8](sec  0)(fl 0x00)(ty   0)(scl   2) (nx 0) 0x00000000 __head_library_dll_a


RELOCATION RECORDS FOR [.idata$7]:
OFFSET   TYPE              VALUE
00000000 rva32             __head_library_dll_a


RELOCATION RECORDS FOR [.idata$5]:
OFFSET   TYPE              VALUE
00000000 rva32             .idata$6


RELOCATION RECORDS FOR [.idata$4]:
OFFSET   TYPE              VALUE
00000000 rva32             .idata$6


Contents of section .idata$7:
 0000 00000000                             ....
Contents of section .idata$5:
 0000 00000000                             ....
Contents of section .idata$4:
 0000 00000000                             ....
Contents of section .idata$6:
 0000 00006461 74615f65 78706f72 7400      ..data_export.
</code>

Note that the object corresponding to data_export has an empty .text section, whereas function_export does define some code. If we disassemble it we get this:

<code>00000000 <_function_export>:
   0:   ff 25 00 00 00 00       jmp    *0x0
                        2: dir32        .idata$5
   6:   90                      nop
   7:   90                      nop
</code>

The relocation of type dir32 tells the linker how to fill in the address being dereferenced by the jmp. We can see that _function_export, when entered, will jump directly to the function at the address loaded from the memory named .idata$5. Inspection of the complete .idata section satisfies us that .idata$5 corresponds to the address of the fragment of the import address table corresponding to the function_export import name, and hence the address where the absolute address of the loaded function_export import can be found.

Although only function_export gets a corresponding _function_export function, both of the exports have led to a symbol with the __imp__ prefix (__imp__data_export and __imp__function_export) being defined in the import library. As discussed before, this symbol stands for the address at which the pointer to the data/function will be inserted by the dynamic linker. As such, the __imp__ symbols always point directly into the import address table.

With an import library in hand, we are capable of writing some client code that uses our exports, main1.c:

#include <stdio.h>

__declspec(dllimport) extern int function_export(void);
__declspec(dllimport) extern int data_export;

int main(int argc, char **argv) {
    printf("%d\n", function_export());
    printf("%d\n", data_export);

    data_export++;

    printf("%d\n", function_export());
    printf("%d\n", data_export);

    return 0;
}

Build and link it against the import library and we will get the results we expect:

<code>$ gcc main1.c library.dll.a -o main1 && ./main1
1379
42
1380
43
</code>

This works even though there is no data_export symbol defined by library.dll.a because the __declspec(dllimport) qualifier on our data_export declaration in main1.c has caused the compiler to generate code that uses the __imp__data_export symbol directly, as we can see if we disassemble the generated code:

<code>$ gcc -c main1.c -o main1.o && objdump --disassemble -r main1.o

main1.o:     file format pe-i386


Disassembly of section .text:

00000000 <_main>:
   0:   8d 4c 24 04             lea    0x4(%esp),%ecx
   4:   83 e4 f0                and    $0xfffffff0,%esp
   7:   ff 71 fc                pushl  -0x4(%ecx)
   a:   55                      push   %ebp
   b:   89 e5                   mov    %esp,%ebp
   d:   51                      push   %ecx
   e:   83 ec 14                sub    $0x14,%esp
  11:   e8 00 00 00 00          call   16 <_main+0x16>
                        12: DISP32      ___main
  16:   a1 00 00 00 00          mov    0x0,%eax
                        17: dir32       __imp__function_export
  1b:   ff d0                   call   *%eax
  1d:   89 44 24 04             mov    %eax,0x4(%esp)
  21:   c7 04 24 00 00 00 00    movl   $0x0,(%esp)
                        24: dir32       .rdata
  28:   e8 00 00 00 00          call   2d <_main+0x2d>
                        29: DISP32      _printf
  2d:   a1 00 00 00 00          mov    0x0,%eax
                        2e: dir32       __imp__data_export
  32:   8b 00                   mov    (%eax),%eax
  34:   89 44 24 04             mov    %eax,0x4(%esp)
  38:   c7 04 24 00 00 00 00    movl   $0x0,(%esp)
                        3b: dir32       .rdata
  3f:   e8 00 00 00 00          call   44 <_main+0x44>
                        40: DISP32      _printf
  44:   a1 00 00 00 00          mov    0x0,%eax
                        45: dir32       __imp__data_export
  49:   8b 00                   mov    (%eax),%eax
  4b:   8d 50 01                lea    0x1(%eax),%edx
  4e:   a1 00 00 00 00          mov    0x0,%eax
                        4f: dir32       __imp__data_export
  53:   89 10                   mov    %edx,(%eax)
  55:   a1 00 00 00 00          mov    0x0,%eax
                        56: dir32       __imp__function_export
  5a:   ff d0                   call   *%eax
  5c:   89 44 24 04             mov    %eax,0x4(%esp)
  60:   c7 04 24 00 00 00 00    movl   $0x0,(%esp)
                        63: dir32       .rdata
  67:   e8 00 00 00 00          call   6c <_main+0x6c>
                        68: DISP32      _printf
  6c:   a1 00 00 00 00          mov    0x0,%eax
                        6d: dir32       __imp__data_export
  71:   8b 00                   mov    (%eax),%eax
  73:   89 44 24 04             mov    %eax,0x4(%esp)
  77:   c7 04 24 00 00 00 00    movl   $0x0,(%esp)
                        7a: dir32       .rdata
  7e:   e8 00 00 00 00          call   83 <_main+0x83>
                        7f: DISP32      _printf
  83:   b8 00 00 00 00          mov    $0x0,%eax
  88:   83 c4 14                add    $0x14,%esp
  8b:   59                      pop    %ecx
  8c:   5d                      pop    %ebp
  8d:   8d 61 fc                lea    -0x4(%ecx),%esp
  90:   c3                      ret
  91:   90                      nop
  92:   90                      nop
  93:   90                      nop
</code>

In fact, we can see that the generated code doesn't even use the _function_export symbol, preferring __imp__function_export. Essentially, the code of the _function_export symbol in the import library has been inlined at every use site. This is why using __declspec(dllimport) can improve performance of cross-DLL calls, even though it is entirely optional on function declarations.

We might wonder what happens if we drop the __declspec(dllimport) qualifier on our declarations. Given our earlier discussion about the difference between data and function imports, you might expect linking to fail. Our test file, main2.c, is:

#include <stdio.h>

extern int function_export(void);
extern int data_export;

int main(int argc, char **argv) {
    printf("%d\n", function_export());
    printf("%d\n", data_export);

    data_export++;

    printf("%d\n", function_export());
    printf("%d\n", data_export);

    return 0;
}

Let's try it out:

<code>$ gcc main2.c library.dll.a -o main2 && ./main2
1379
42
1380
43
</code>

What the hell -- it worked? This is a bit surprising. It works, despite the import library library.dll.a not defining the _data_export symbol, because of a nifty feature of GNU ld called auto-import. Without auto-import the link fails as we would expect:

<code>$ gcc main2.c library.dll.a -o main2 -Wl,--disable-auto-import && ./main2
/tmp/ccGd8Urx.o:main2.c:(.text+0x2c): undefined reference to `_data_export'
/tmp/ccGd8Urx.o:main2.c:(.text+0x41): undefined reference to `_data_export'
/tmp/ccGd8Urx.o:main2.c:(.text+0x49): undefined reference to `_data_export'
/tmp/ccGd8Urx.o:main2.c:(.text+0x63): undefined reference to `_data_export'
collect2: ld returned 1 exit status
</code>

The Microsoft linker does not implement auto-import, so this is the error you would get if you were using the Microsoft toolchain.

However, there is a way to write client code that does not depend on auto-import or use the __declspec(dllimport) keyword. Our new client, main3.c is as follows:

#include <stdio.h>

extern int (*_imp__function_export)(void);
extern int *_imp__data_export;

#define function_export (*_imp__function_export)
#define data_export (*_imp__data_export)

int main(int argc, char **argv) {
    printf("%d\n", function_export());
    printf("%d\n", data_export);

    data_export++;

    printf("%d\n", function_export());
    printf("%d\n", data_export);

    return 0;
}

In this code, we directly use the __imp__-prefixed symbols from the import library. These name an address at which the real address of the import can be found, which is reflected by our C-preprocessor definitions of data_export and function_export.

This code builds perfectly even without auto-import:

<code>$ gcc main3.c library.dll.a -o main3 -Wl,--disable-auto-import && ./main3
1379
42
1380
43
</code>

If you have followed along until this point you should have a solid understanding of how DLL import and export are implemented on Windows.

How auto-import works

As a bonus, I'm going to explain how auto-import is implemented by the GNU linker. It is a rather cute hack you may get a kick out of.

As a reminder, auto-import is a feature of the linker that allows the programmer to declare an item of DLL-imported data with a simple extern keyword, without having to explicitly use __declspec(dllimport). This is extremely convenient because this is exactly how most *nix source code declares symbols it expects to import from a shared library, so by supporting this use case that *nix code becomes more portable to Windows.

Auto-import kicks in whenever the linker finds an object file making use of a symbol foo which is not defined by any other object in the link, but where a symbol __imp_foo is defined by some object. In this case, it assumes that the use of foo is an attempt to access some DLL-imported data item called foo.

Now, the problem is that the linker needs to replace the use of foo with the address of foo itself. However, all we seem to know statically is an address where that address will be placed at runtime (__imp_foo). To square the circle, the linker plays a clever trick.

The trick is to extend the .idata of the image being created with an entry for a "new" DLL. The new entry is set up as follows:

  • The filename of the image being imported is set to the same filename as the .idata entry covering __imp_foo. So if __imp_foo was being filled out by an address in Bar.dll, our new .idata entry will use Bar.dll here.

  • The import lookup table is of length 1, whose sole entry is a pointer to the name of the imported symbol corresponding to __imp_foo. So if __imp_foo is filled out by the address of the foo export from Bar.dll, the name of the symbol we put in here will be foo.

  • The import address table is of length 1 and -- here is the clever bit -- is located precisely at the location in the object file that was referring to the (undefined) symbol foo.

This solution neatly defers the task of filling out the address that the object file wants to the dynamic linker. The reason that the linker can play this trick is that it can see all of the object code that goes into the final image, and can thus fix all of the sites that need to refer to the imported data.

Note that in general the final image's .idata will contain several entries for the same DLL: one from the import library, and one for every place in any object file in the link which referred to some data exported by the DLL. Although this is somewhat unusual behaviour, the Windows dynamic linker has no problem with there being several imports of the same DLL.

A wrinkle

Unfortunately, the scheme described above only works if the object code has an undefined reference to foo itself. What if instead it has a reference to foo+N, an address N bytes after the address of foo itself? There is no way to set up the .idata so that the dynamic linker adds a constant to the address it fills in, so we seem to be stuck.

Alas, such relocations are reasonably common, and originate from code that accesses a field of a DLL-imported structure type. Cygwin actually contains another hack to make auto-import work in such cases, known as "pseudo-relocations". If you want to know the details of how these work, there is more information in the original thread on the topic.

Conclusion

Dynamic linking on Windows is hairier than it at first appears. I hope this article has gone some way to clearing up the meaning of the mysterious dllimport and dllexport keywords, and to clarifying the role of the import and export libraries.

Linux and friends implement dynamic linking in a totally different manner to Windows. The scheme they use is more flexible and allows more in-memory sharing of code, but incurs a significant runtime penalty (especially on i386). For more details see here and the Dynamic Linking section of the ELF spec.

Upgrading An Unactivated Windows Install To Parallels 4.0

This is a pretty obscure problem, but I'm going to put a post up about it on the off chance I can help someone else out. My regular reader (hi dad!) will probably find this of no interest and should give it a miss 🙂

The situation I found myself in was upgrading a Boot Camp install of Windows Vista for the new release of Parallels Desktop 4.0 - no big deal, you may think. Unfortunately, I had forgotten that that particular install of Windows Vista wasn't activated, which caused the automatic upgrade process to bork, dropping me back to manual mode.

To complete the upgrade I needed to run the Parallels Tools setup executable. However, since I hadn't activated, I could only log in as far as getting the "please activate Windows now" screen. As it happened, I knew that I could get rid of this screen by feeding it the details of a Windows Vista license I own, but in order to do that I needed an Internet connection (I don't think my PAYG phone had enough credit on it for an extended Microsoft call centre experience). However, to get an Internet connection I had to install the Parallels Ethernet Connection drivers, and hence the Tools. Catch 22!

The workaround is convoluted, to say the least. First, we need a command prompt in the restricted Vista activation session. You do this by clicking any of the links in the activation window: they should cause a browser to open. From here, you can ask the browser to "Open a file" and direct it to C:\Windows\System32\cmd.exe - this should initiate "download" of the executable. Click the option to run the file and voila!

Now you have a command prompt the fun really begins. You might think you could just type D:\setup.exe and the Tools would begin installing, but life just isn't that simple - in Their infinite wisdom, Microsoft have imposed quotas on the resource consumption of the session they set up for the purposes of activation. This is probably the Right Thing to do from their POV, but it's just a pain in the arse for us.

The workaround is to get the internet connection working, so you can do the activation and hence lift the resource limits. To do this, create a floppy disk image containing the Windows 2000 drivers for a Realtek 8029AS adapter (you should be able to get those from here, until Realtek break their incoming links again). Personally I did this by using another virtual machine to download the files and extract them onto a new floppy disk image (you can create a blank image on the Floppy Drive tab of the VM settings). I would make the fruits of this labour available to you as a simple download if it were not for (unfounded?) fear of Realtek's highly trained attack lawyers.

Once you have the requisite image in your sweaty virtual paws you can proceed to mount it into the Vista VM. To finish up, type compmgmt.msc into that command prompt and update the drivers for the detected network adapter by searching for new ones on the A:\ drive.

You should now be free to run the online activation and break the Catch 22, allowing installation of the Tools - at this point feel free to help yourself to a cup of coffee and a ginger-snap biscuit to celebrate a difficult job done well (I know I did...).

I'm really quite surprised that I had to jump through this many hoops - the Realtek drivers allegedly come with Vista, for one thing. But - c'est la vie! It's also quite pleasing that the humble, long-outmoded floppy drive still has a place in solving modern IT problems 🙂