Sep 3 2015

Asynchronous and non-blocking IO

This post aims to explain the difference between asynchronous and non-blocking IO, with particular reference to their
implementation in Java. These two styles of IO API are closely related but have a number of important differences,
especially when it comes to OS support.

Asynchronous IO

Asynchronous IO refers to an interface where you supply a callback to an IO operation, which is invoked when the operation completes.
This invocation often happens on an entirely different thread from the one that originally made the request, but this is not necessarily the case.
Asynchronous IO is a manifestation of the "proactor" pattern.

One common way to implement asynchronous IO is to have a thread pool whose threads are used to make the normal blocking IO requests,
and execute the appropriate callbacks when these return. The less common implementation approach is to avoid a thread pool, and just push
the actual asynchronous operations down into the kernel. This alternative solution obviously has the disadvantage that it depends on operating
system specific support for making async operations, but has the following advantages:

  • The maximum number of in-flight requests is not bounded by the size of your thread pool
  • The overhead of creating thread pool threads is avoided (e.g. you need not reserve any memory for the thread stacks,
    and you don't pay the extra context switching cost associated with having more schedulable entities)
  • You expose more information to the kernel, which it can potentially use to make good choices about how to do the IO operations
    — e.g. by minimizing the distance that the disk head needs to travel to satisfy your requests, or by using
    native command queueing.
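
The thread pool approach can be sketched in a few lines of Java. This is only an illustration under assumptions: the class name PooledAsyncReader, the ReadCallback interface and the pool size are made up for this example and are not part of any library.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical helper: asynchronous reads emulated with blocking IO on a thread pool. */
final class PooledAsyncReader implements AutoCloseable {
    /** Hypothetical callback type, invoked on a pool thread when the read finishes. */
    interface ReadCallback {
        void completed(ByteBuffer data);
        void failed(IOException error);
    }

    // The pool size bounds how many reads can be in flight at once.
    private final ExecutorService pool;

    PooledAsyncReader(int threads) {
        this.pool = Executors.newFixedThreadPool(threads);
    }

    /** Returns immediately; the blocking read itself runs on a pool thread. */
    void read(Path file, long position, int length, ReadCallback callback) {
        pool.execute(() -> {
            try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
                ByteBuffer buf = ByteBuffer.allocate(length);
                // Keep reading until the buffer is full or we hit end of file.
                while (buf.hasRemaining()) {
                    int n = ch.read(buf, position + buf.position());
                    if (n < 0) {
                        break;
                    }
                }
                buf.flip();
                callback.completed(buf);
            } catch (IOException e) {
                callback.failed(e);
            }
        });
    }

    @Override
    public void close() {
        pool.shutdown();
    }
}
```

Note how the first disadvantage listed above shows up directly: with a pool of N threads, at most N reads can be outstanding at any moment.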

Operating system support for asynchronous IO is mixed:

  • Linux has at least two implementations of async IO:

    • POSIX AIO (aio_read et al). This is implemented on Linux by glibc,
      but other POSIX systems (Solaris, OS X etc) have their own implementations. The glibc implementation is simply a thread pool based one — I'm not sure about the other systems.
    • Linux kernel AIO (io_submit et al). No thread pool is used here, but it has quite a few
      limitations (e.g. it only works for files, not sockets, and has alignment restrictions on file reads) and does not seem to be used much in practice.

    There is a good discussion of the *nix AIO situation on the libtorrent blog,
    which the same writer also summarised on Stack Overflow.
    That writer's experience was that the limitations and poor implementation quality of the various *nix AIO implementations are such that you are much better off
    just using your own thread pool to issue blocking operations.

  • Windows provides a mechanism called completion ports for performing asynchronous IO. With this system:

    1. You start up a thread pool and arrange for each thread to spin calling GetQueuedCompletionStatus
    2. You make IO requests using the normal Windows APIs (e.g. ReadFile
      and WSARecv), with the small added twist
      that you supply a special LPOVERLAPPED parameter indicating that the calls should be non-blocking and the result should be reported to the thread pool
    3. As IO completes, thread pool threads blocked on GetQueuedCompletionStatus are woken up as necessary to process completion events

    Windows intelligently schedules how it delivers GetQueuedCompletionStatus wakeups, such that it tries to roughly keep the same number
    of threads active at any time. This avoids excessive context switching and scheduler transitions — things are
    arranged so that a thread which has just processed a completion event will likely be able to immediately grab a new work item.
    With this arrangement, your pool can be much smaller than the number of IO operations you want
    to have in-flight: you only need to have as many threads as are required to process completion events.

In Java, support for asynchronous IO was added as part of the NIO2 work in JDK7, and the appropriate APIs are exposed by the
AsynchronousChannel interface.
On *nix, AsynchronousFileChannel and AsynchronousSocketChannel are implemented using the standard thread pool approach
(the pools are owned by an AsynchronousChannelGroup).
On Windows, completion ports are used: in this case, the AsynchronousChannelGroup thread pool threads act as the GetQueuedCompletionStatus listeners.
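
For reference, a callback-style read with the NIO2 API looks roughly like this. It is a sketch, not a recommendation: the class and method names are invented for the example, CompletableFuture (Java 8) is used only to hand the result back to the caller, and a single read may return fewer bytes than requested.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.AsynchronousFileChannel;
import java.nio.channels.CompletionHandler;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.CompletableFuture;

/** Hypothetical example class wrapping a single AsynchronousFileChannel read. */
final class Nio2ReadExample {
    /** Reads up to `length` bytes at `position`; the future is completed from the callback. */
    static CompletableFuture<String> readAsync(Path file, long position, int length)
            throws IOException {
        AsynchronousFileChannel ch =
                AsynchronousFileChannel.open(file, StandardOpenOption.READ);
        ByteBuffer buf = ByteBuffer.allocate(length);
        CompletableFuture<String> result = new CompletableFuture<>();

        // The handler is invoked by a thread owned by the channel group,
        // not (in general) by the thread that called read().
        ch.read(buf, position, null, new CompletionHandler<Integer, Void>() {
            @Override
            public void completed(Integer bytesRead, Void attachment) {
                buf.flip();
                result.complete(StandardCharsets.UTF_8.decode(buf).toString());
                closeQuietly(ch);
            }

            @Override
            public void failed(Throwable error, Void attachment) {
                result.completeExceptionally(error);
                closeQuietly(ch);
            }
        });
        return result;
    }

    private static void closeQuietly(AsynchronousFileChannel ch) {
        try {
            ch.close();
        } catch (IOException ignored) {
            // Nothing useful to do in this sketch.
        }
    }
}
```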

If you happen to be stuck on JDK6, your only option is to ignore completion ports and roll your own thread pool to dispatch operations
on e.g. standard synchronous FileChannels. However,
if you do this you may find that you don't actually get much concurrency on Windows. This happens because FileChannel.read(ByteBuffer, long)
is needlessly crippled by taking a lock on the whole FileChannel.
This lock is needless because FileChannel is otherwise a thread-safe class, and in order to make sure your positioned read
isn't interfering with the other readers you don't need to lock: you simply need to issue a ReadFile call with a custom position by using one of the fields
of the LPOVERLAPPED struct parameter. Note that the *nix implementation of this method does the right thing and simply issues a pread
call without locking.

Non-blocking IO

Non-blocking IO refers to an interface where IO operations will return immediately with a special error code if called when they are in a state
that would otherwise cause them to block. So for example, a non-blocking recv will return immediately
with an EAGAIN or EWOULDBLOCK error code if no data is available on the socket, and likewise send will return
immediately with an error if the OS send buffers are full.
Generally APIs providing non-blocking IO will also provide some sort of interface where you can efficiently wait for certain operations to enter a state
where invoking the non-blocking IO operation will actually make some progress rather than immediately returning. APIs in this style are
implementations of the reactor pattern.

No OS that I know of implements non-blocking IO for file IO, but support for socket IO is generally reasonable:

  • Non-blocking reads and writes are available via the POSIX O_NONBLOCK operating mode,
    which can be set on file descriptors (FDs) representing sockets and FIFOs.

  • POSIX provides select and poll which let you
    wait for reads and writes to be ready on several FDs. (The difference between these two is pretty much just that select lets you wait for a number of FDs up to FD_SETSIZE,
    while poll can wait for as many FDs as you are allowed to create.)

    Select and poll have the major disadvantage that when the kernel returns from one of these calls, you only know the number of FDs that got triggered —
    not which specific FDs have become unblocked. This means you later have to do a linear time scan across each of the FDs you supplied to figure out which one you actually
    need to use.

  • This limitation motivated the development of several successor interfaces. BSD & OS X got kqueue,
    Solaris got /dev/poll, and Linux got epoll. Roughly speaking,
    these interfaces let you build up a set of FDs you are interested in watching, and
    then make a call that returns to you a list of those FDs in the set that were actually triggered.

    There's lots of good info about these mechanisms at the classic C10K page. If you like hearing someone who
    clearly knows what he is talking about rant for 10 minutes about syscalls, this Bryan Cantrill bit about epoll is quite amusing.

  • Unfortunately, Windows never got one of these successor mechanisms: only select is supported.
    It is possible to do an epoll-like thing by kicking off an operation that would normally block (e.g. WSARecv)
    with a specially prepared LPOVERLAPPED parameter, such that you can wait for it to complete using WSAWaitForMultipleEvents.
    Like epoll, when this wait returns it gives you a notion of which of the sockets of interest caused the wakeup. Unfortunately, this API won't let you wait for more than 64 events —
    if you want to wait for more you need to create child threads that recursively call WSAWaitForMultipleEvents, and then wait on those threads!

  • The reason that Windows support is a bit lacking here is that they seem to expect you to use an asynchronous IO mechanism instead: either completion
    ports, or completion handlers. (Completion handlers are implemented using the Windows APC mechanism and are
    a form of callback that doesn't require a thread pool. Instead, they are executed in the spare
    CPU time when the thread that issued the IO operation is otherwise suspended, e.g. in a call to WaitForMultipleObjectsEx.)

In Java, non-blocking IO has been exposed via SelectableChannel
since JDK4. As I mentioned above, OS support for non-blocking IO on files is nonexistent; correspondingly, Java's SocketChannel
extends SelectableChannel, but FileChannel does not.

The JDK implements SelectableChannel using whatever the platform-appropriate API is (i.e. epoll, kqueue, /dev/poll, poll or select).
The Windows implementation is based on select — to ameliorate the fact that select requires a linear scan, the JDK creates a new thread for every 1024 sockets being waited on.
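
The reactor style can be seen in miniature with a selectable channel. The sketch below is an assumption-laden illustration: it uses java.nio.channels.Pipe instead of a socket purely to stay self-contained (a SocketChannel is registered with a Selector in the same way), and the class and method names are invented.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.Pipe;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.charset.StandardCharsets;

/** Hypothetical example: non-blocking reads plus a Selector to wait for readiness. */
final class NonBlockingPipeExample {
    /** Returns {bytes read before any data was written, bytes read after select reported readiness}. */
    static int[] demo() throws IOException {
        Pipe pipe = Pipe.open();
        try (Selector selector = Selector.open();
             Pipe.SourceChannel source = pipe.source();
             Pipe.SinkChannel sink = pipe.sink()) {
            source.configureBlocking(false);
            ByteBuffer buf = ByteBuffer.allocate(64);

            // Nothing has been written yet: the read returns 0 immediately
            // (Java's rendering of EAGAIN/EWOULDBLOCK) instead of blocking.
            int before = source.read(buf);

            // Ask the selector to tell us when a read would make progress.
            source.register(selector, SelectionKey.OP_READ);
            sink.write(ByteBuffer.wrap("ping".getBytes(StandardCharsets.US_ASCII)));

            int after = 0;
            if (selector.select(5000) > 0) {
                for (SelectionKey key : selector.selectedKeys()) {
                    if (key.isReadable()) {
                        after = source.read(buf);
                    }
                }
                // Selected keys must be cleared by hand before the next select call.
                selector.selectedKeys().clear();
            }
            return new int[] { before, after };
        }
    }
}
```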


Let's say that you want to do Java IO in a non-synchronous way. The bottom line is:

  • If you want to do IO against files, your only option is asynchronous IO. You'll need to roll it yourself with JDK6 and below (and the resulting implementation
    won't be as concurrent as you expect on Windows). On the other hand, with Java 7 and up you can just use the built-in mechanisms, and what you'll get is basically
    as good as the state-of-the-art.

  • If you want to do IO against sockets, an ideal solution would use non-blocking IO on *nix and asynchronous IO on Windows. This is
    obviously a bit awkward to do, since it involves working with two rather different APIs. There might be some project akin
    to libuv that wraps these two mechanisms up into a single API you can write against, but I don't know of it if so.

    The Netty project is an interesting data point. This high performance Java server is based principally on
    non-blocking IO, but they did make an abortive attempt to use async IO instead at one point — it was backed out because there was no performance
    advantage to using async IO instead of non-blocking IO on Linux. Some users report that
    the now-removed asynchronous IO code drastically reduces CPU usage on Windows, but others report that
    Java's dubious select-based implementation of Windows non-blocking IO is good enough.

Jul 15 2014

Quirks of the Matlab file format

The Matlab file format has become something of a standard for data exchange in quant finance circles. It is not only handy for those who are using the Matlab interactive environment itself, but also for users working in a diverse spectrum of languages, thanks to widespread availability of libraries for reading and writing the files. The format itself also has the handy property of supporting compression, which is essential for keeping disk usage reasonable when working with the highly compressible data that is typical of financial timeseries.

At work we have implemented our own high-performance Java library for reading and writing these files. The Mathworks have helpfully published a complete description of the format online, which makes this task for the most part straightforward. Unfortunately, the format also has some dark and undocumented corners that I spent quite some time investigating. This post is intended to record a couple of these oddities for posterity.


The Matlab environment supports Unicode strings, and so consequently Matlab files can contain arbitrary Unicode strings. Unfortunately this is one area where the capabilities of Matlab itself and those intended by the Mathworks spec diverge somewhat. Specifically:

  1. While the spec documents a miUTF8 storage type, Matlab itself only seems to understand a very limited subset of UTF-8. For example, it can't even decode an example file which simply contains the UTF-8 encoded character sequence ←↑→↓↔. It turns out that Matlab cannot read codepoints that are encoded as three or more bytes! This means it can only understand U+0000 to U+07FF, leaving us in a sad situation when Matlab can't even understand the BMP.
  2. The miUTF32 storage type isn't supported at all. For example,
    this file is correctly formed according to the spec but unreadable in Matlab.
  3. UTF-16 mostly works. As it stands, this is really your only option if you want the ability to roundtrip Unicode via Matlab. One issue is that Matlab chars aren't really Unicode codepoints - they are sequences of UTF-16 code units. However, this is an issue shared by Python 2 and Java, so even though it is broken at least it is broken in the "normal" way.
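
To make the byte-length boundary concrete, here is a small Java sketch. The class and method names are hypothetical; the only facts it relies on are the UTF-8 and UTF-16 encoding rules themselves.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper for checking text against the limits described above. */
final class MatlabTextLimits {
    /** Number of bytes UTF-8 needs for a single (non-surrogate) code point. */
    static int utf8Length(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    /** True if every code point is at most U+07FF, i.e. encodes to one or two UTF-8 bytes. */
    static boolean fitsInTwoByteUtf8(String s) {
        return s.codePoints().allMatch(cp -> cp <= 0x7FF);
    }

    /** Number of UTF-16 code units: String.length() counts these, not code points. */
    static int utf16Units(String s) {
        return s.length();
    }
}
```

With these helpers, U+07FF needs two bytes and U+0800 needs three, so the arrow characters from the example file (U+2190 to U+2194) all fall on the unreadable side of the boundary. A non-BMP character such as U+1F600 occupies two UTF-16 code units.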

Interestingly, most 3rd party libraries seem to implement these parts of the spec better than Matlab itself does — for example, scipy's loadmat and savemat functions have full support for all of these text storage data types. (Scipy does still have trouble with non-BMP characters however.)


As mentioned, .mat files have support for storing compressed matrices. These are simply implemented as nested zlib-compressed streams. Alas, it appears that the way that Matlab is invoking zlib is slightly broken, with the following consequences:

  • Matlab does not attempt to validate that the trailing ZLib checksum is present, and doesn't check it even if it is there.
  • If you attempt to open a file containing a ZLib stream that has experienced corruption such that the decompressed data is longer than Matlab was expecting, the error is silently ignored.
  • When writing out a .mat file, Matlab will sometimes not write the ZLib checksum. This happens very infrequently though — most files it creates do have a checksum as you would expect.
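
For comparison, a reader that wants the checks Matlab skips can get them from java.util.zip. This is a sketch with invented names (StrictZlib, inflateStrict), not the code of any of the libraries mentioned in this post. It relies on Inflater throwing DataFormatException when the Adler-32 trailer does not match, and on finished() only becoming true once the trailer has been consumed.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

/** Hypothetical helper that refuses zlib streams Matlab would silently accept. */
final class StrictZlib {
    /** Compresses with the zlib wrapper (header plus trailing Adler-32 checksum). */
    static byte[] deflate(byte[] raw) {
        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        while (!deflater.finished()) {
            out.write(chunk, 0, deflater.deflate(chunk));
        }
        deflater.end();
        return out.toByteArray();
    }

    /** Inflates and insists on a present, valid checksum and the expected length. */
    static byte[] inflateStrict(byte[] compressed, int expectedLength)
            throws DataFormatException {
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(compressed);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            while (!inflater.finished()) {
                // Throws DataFormatException if the trailing checksum is wrong.
                int n = inflater.inflate(chunk);
                if (n == 0 && !inflater.finished()) {
                    // Input ran out before the end of the stream, e.g. the trailer is missing.
                    throw new DataFormatException("zlib stream ended early");
                }
                out.write(chunk, 0, n);
            }
            if (out.size() != expectedLength) {
                throw new DataFormatException(
                        "expected " + expectedLength + " bytes, got " + out.size());
            }
            return out.toByteArray();
        } finally {
            inflater.end();
        }
    }
}
```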

Until recently scipy's Matlab reader would not verify the checksum either, but I added support for this after we saw corrupted .mat files in the wild at work.

I've reported these compression and Unicode problems to the Mathworks and they have acknowledged that they are bugs, but at this time there is no ETA for a fix.

Dec 6 2012

Rpath emulation: absolute DLL references on Windows

When creating an executable or shared library on Linux, it’s possible to include an ELF RPATH header which tells the dynamic linker where to search for any shared libraries that you reference. This is a pretty handy feature because it can be used to nail down exactly which shared library you will link against, without leaving anything up to chance at runtime.

Unfortunately, Windows does not have an equivalent feature. However, it does have an undocumented feature which may be enough to replace your use of rpath if you are porting software from Linux.

Executables or DLLs on Windows always reference any DLLs that they import by name only. So, the import table for an executable will refer to kernel32.dll rather than C:\Windows\kernel32.dll. Windows’ dynamic loader will look for a file with the appropriate name in the DLL search path as usual. (For full details on DLL import tables and more, you can check out my previous in-depth post.)

However, Windows’ dynamic loader will, as a completely undocumented (and presumably unsupported) feature, also accept absolute paths in the import table. This is game-changing because it means that you can hard-code exactly which DLL you want to refer to, just like you would be able to with rpath on Linux.


To demonstrate this technique, we’re going to need code for a DLL and a referring EXE:

$ cat library.c
#include <stdio.h>

__declspec(dllexport) int librarycall(void) {
        printf("Made library call!\n");
        return 0;
}

$ cat rpath.c
__declspec(dllimport) int librarycall(void);

int main(int argc, char **argv) {
        return librarycall();
}

If we were building a DLL and EXE normally, we would do this:

gcc -c library.c
gcc -shared -o library.dll library.o
gcc -o rpath rpath.c -L./ -llibrary

This all works fine:

$ ./rpath
Made library call!

However, as you would expect, if you move library.dll elsewhere, the EXE will fail to start:

$ mv library.dll C:/library.dll
$ ./rpath
/home/Max/rpath/rpath.exe: error while loading shared libraries: library.dll: cannot open shared object file: No such file or directory

Now let’s work some magic! If we open up rpath.exe in a hex editor, we see something like this:

Let’s just tweak that a bit to change the relative path to library.dll to an absolute path. Luckily there is enough padding to make it fit:

The EXE will now work perfectly!

$ ./rpath
Made library call!

In practice

Knowing that this feature exists is one thing. Actually making use of it in a reliable way is another. The problem is that to my knowledge no linkers are capable of creating a DLL or EXE which includes an absolute path in its import table. Sometimes we will be lucky enough that the linker creates an EXE or DLL with enough padding in it for us to manually edit in an absolute path, but with the method above there is no guarantee that this will be possible.

In order to exploit this technique robustly, we’re going to use a little trick with import libraries. Instead of using GCC’s ability to link directly to a DLL, we will generate an import library for the DLL, which we will call library.lib:

$ dlltool --output-lib library.lib --dllname veryverylongdllname.dll library.o

When you use dlltool you either need to write a .def file for the DLL you are creating an import library for, or you need to supply all the object files that were used to create the DLL. I’ve taken the second route here and just told dlltool that our DLL was built from library.o.

Now we have an import library, we can do our hex-editing trick again, but this time on the library. Before:

And after (note that I have null-terminated the new absolute path):

The beauty of editing the import library rather than the output of the linker is that using the --dllname option we can ensure that the import library contains as much space as we need to fit the entire absolute path of the DLL, no matter how long it may be. This is the key to making robust use of absolute paths in DLL loading, even if linkers don’t support them!
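
The hex-editing step could be automated along these lines. This Java sketch is an illustration only: the class DllNamePatcher is hypothetical, it patches just the first occurrence of the placeholder it finds, and it makes no claim about where in the import library dlltool stores the name or how many copies exist. It only shows the idea of overwriting a long placeholder with a shorter, NUL-padded absolute path.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical helper: replace a placeholder DLL name with a NUL-padded path. */
final class DllNamePatcher {
    /** Returns a patched copy of `image`; the input array is left untouched. */
    static byte[] patch(byte[] image, String placeholder, String newName) {
        byte[] from = placeholder.getBytes(StandardCharsets.US_ASCII);
        byte[] to = newName.getBytes(StandardCharsets.US_ASCII);
        if (to.length > from.length) {
            throw new IllegalArgumentException("placeholder is too short for the new name");
        }
        int at = indexOf(image, from);
        if (at < 0) {
            throw new IllegalArgumentException("placeholder not found");
        }
        byte[] patched = image.clone();
        // Zero the whole placeholder first so the new name ends up null-terminated.
        Arrays.fill(patched, at, at + from.length, (byte) 0);
        System.arraycopy(to, 0, patched, at, to.length);
        return patched;
    }

    /** Naive byte-string search; fine for the small files involved here. */
    private static int indexOf(byte[] haystack, byte[] needle) {
        outer:
        for (int i = 0; i + needle.length <= haystack.length; i++) {
            for (int j = 0; j < needle.length; j++) {
                if (haystack[i + j] != needle[j]) {
                    continue outer;
                }
            }
            return i;
        }
        return -1;
    }
}
```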

Now we have the import library, we can link rpath.exe again, but this time using the import library rather than library.dll:

$ gcc -o rpath rpath.c library.lib
$ ./rpath
Made library call!

Yes, it really is using the DLL on the C: drive:

$ mv C:/library.dll C:/foo.dll
$ ./rpath
/home/Max/rpath/rpath.exe: error while loading shared libraries: C:\library.dll: cannot open shared object file: No such file or directory


I haven’t seen this technique for using absolute paths for DLL references anywhere on the web, so it doesn’t seem to be widely known. However, it works beautifully on Windows 7 and probably on all other versions of Windows as well.

I may apply these techniques to the Glasgow Haskell Compiler in order to improve the support for Haskell shared objects on Windows: more information on this topic can be found on the GHC wiki.

Sep 30 2011

GHC-specific Alias Analysis for LLVM

The setup

A few years ago, David Terei did some great work adding an LLVM backend to the Glasgow Haskell Compiler. The idea with this is that instead of writing our own optimiser and assembly-code generators for our custom three-address-code, we can just translate into LLVM IR and have LLVM do the heavy lifting. In theory, this means that GHC will be able to compile for many different CPUs, and will benefit from the smart optimisations the LLVM team have implemented.

The portability part has definitely worked out for us: for example, a couple of people have successfully got GHC to compile for ARM by using the LLVM backend. However, the promise of LLVM being able to speed up our generated code has never really been fully borne out. LLVM-generated code does tend to be better than that produced by GHC’s own backends, but this is mostly because LLVM is doing much better register allocation (it is much smarter about reusing the “pinned registers” that form part of the interface between GHC’s generated code and the garbage collector).

The reason that LLVM does not optimise as much as we would like is often to do with aliasing. In particular, LLVM conservatively assumes that GHC’s stack (which is explicitly represented in the generated code as an array of words) and the heap may alias.

What’s the problem?

A concrete example of this is the following Haskell program:

module Main(main) where

import Data.Array.Base
import Data.Array.IO
import Data.Array.MArray

main :: IO ()
main = do
    arr <- newArray_ (0, 200)
    go arr 2 0 100

go :: IOUArray Int Int -> Int -> Int -> Int -> IO ()
go arr stride x y | x < y     = do unsafeWrite arr (x * stride) 1337
                                   go arr stride (x + 1) y
                  | otherwise = return ()

This loop compiles to fairly good Core:

Main.main_$s$wa =
  \ (@ sg0_sKA::Data.Array.Base.STUArray
                  GHC.Prim.RealWorld GHC.Types.Int GHC.Types.Int
                Data.Array.IO.Internals.IOUArray GHC.Types.Int GHC.Types.Int)
    (sc_sKs :: GHC.Prim.State# GHC.Prim.RealWorld)
    (sc1_sKt :: GHC.Prim.Int#)
    (sc2_sKu :: GHC.Prim.Int#)
    (sc3_sKv :: GHC.Prim.Int#)
    (sc4_sKw :: GHC.Types.Int)
    (sc5_sKx :: GHC.Types.Int)
    (sc6_sKy :: GHC.Types.Int)
    (sc7_sKz :: GHC.Prim.MutableByteArray# GHC.Prim.RealWorld) ->
    case GHC.Prim.<# sc2_sKu sc1_sKt of _ {
      GHC.Bool.False -> (# sc_sKs, GHC.Unit.() #);
      GHC.Bool.True ->
        case GHC.Prim.writeIntArray#
               @ GHC.Prim.RealWorld
               (GHC.Prim.*# sc2_sKu sc3_sKv)
        of s2#_aHo { __DEFAULT ->
          @ (sym
               Data.Array.IO.Internals.NTCo:IOUArray GHC.Types.Int GHC.Types.Int)
          (GHC.Prim.+# sc2_sKu 1)

One weird thing about this Core is that it passes around a number of dead arguments (sc4_sKw, sc5_sKx and sc6_sKy). This is a known bug in GHC, and is caused by a phase ordering problem. However, this particular infelicity should not prevent LLVM from being able to do the classic loop optimisation of strength reduction on our code.

The particular strength reduction we would like to perform is to replace the multiplication GHC.Prim.*# sc2_sKu sc3_sKv in the main_$s$wa loop with an addition. This is possible because the left operand sc2_sKu is a loop induction variable, increasing by 1 every iteration. Thus, on every iteration the value of the multiplication GHC.Prim.*# sc2_sKu sc3_sKv is just the value of the multiplication on the previous iteration, plus sc3_sKv. Thus, by adding a loop variable that records the value of the multiplication on the previous iteration, we can replace the multiplication by an addition.
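
The transformation is easiest to see outside of Core. Here is the same loop written twice in Java, once with the multiplication and once strength-reduced; the class and method names are invented for the illustration and the array stands in for the IOUArray.

```java
/** Hypothetical illustration of strength reduction on the loop from this post. */
final class StrengthReduction {
    /** Direct translation: one multiplication per iteration. */
    static long[] withMultiply(int stride, int from, int to, int size) {
        long[] arr = new long[size];
        for (int x = from; x < to; x++) {
            arr[x * stride] = 1337;
        }
        return arr;
    }

    /** Strength-reduced: an extra loop variable carries x * stride forward. */
    static long[] withAddition(int stride, int from, int to, int size) {
        long[] arr = new long[size];
        int index = from * stride; // value of the multiplication on the first iteration
        for (int x = from; x < to; x++) {
            arr[index] = 1337;
            index += stride;       // previous product plus stride replaces x * stride
        }
        return arr;
    }
}
```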

Unfortunately, LLVM currently can’t strength-reduce this loop in the suggested way. As we will soon see, this is due to aliasing.

Why does the problem happen?

We can immediately see the problem if we look at the optimised LLVM code for this loop:
  %ln1TL1 = load i64* %Sp_Arg, align 8

c1TW:                                             ; preds =, %c1TW
  %ln1TL4 = phi i64 [ %ln1TL1, ], [ %ln1UF, %c1TW ]
  %ln1Uy = mul i64 %ln1Uu, %ln1TL4
  %ln1Uz = add i64 %ln1Uw, %ln1Uy
  %ln1UA = inttoptr i64 %ln1Uz to i64*
  store i64 1337, i64* %ln1UA, align 8
  %ln1UE = load i64* %Sp_Arg, align 8
  %ln1UF = add i64 %ln1UE, 1
  store i64 %ln1UF, i64* %Sp_Arg, align 8
  %ln1TP = load i64* %ln1TN, align 8
  %ln1TQ = icmp slt i64 %ln1UF, %ln1TP
  br i1 %ln1TQ, label %c1TW, label %n1TX.loopexit

The strength reduction optimisation depends on one of the operands to the multiplication being a loop induction variable. In our case, we expect that sc2_sKu will be such a variable. However, looking at the LLVM code we can see that the equivalent LLVM variable, %ln1TL4, has its induction-ness hidden because it is reloaded from the stack by load i64* %Sp_Arg on every iteration.

You might wonder why the store to the same stack location by store i64 %ln1UF, i64* %Sp_Arg is not forwarded to this load by LLVM. If this were to happen, we could get code like this:
  %ln1TL1 = load i64* %Sp_Arg, align 8 = load i64* %Sp_Arg, align 8

c1TW:                                             ; preds =, %c1TW
  %ln1TL4 = phi i64 [ %ln1TL1, ], [ %ln1UF, %c1TW ]
  %ln1UE = phi i64 [, ], [ %ln1UF, %c1TW ]
  %ln1Uy = mul i64 %ln1Uu, %ln1TL4
  %ln1Uz = add i64 %ln1Uw, %ln1Uy
  %ln1UA = inttoptr i64 %ln1Uz to i64*
  store i64 1337, i64* %ln1UA, align 8
  %ln1UF = add i64 %ln1UE, 1
  store i64 %ln1UF, i64* %Sp_Arg, align 8
  %ln1TP = load i64* %ln1TN, align 8
  %ln1TQ = icmp slt i64 %ln1UF, %ln1TP
  br i1 %ln1TQ, label %c1TW, label %n1TX.loopexit

In this code the fact that %ln1UE is an induction variable is obvious, and not obscured by an intermediate load from memory. And indeed, LLVM is able to strength-reduce this loop!

LLVM does not forward this store to the load because doing so is unsafe in general: the intervening store to %ln1UA would overwrite the stack slot if %ln1UA were equal to %Sp_Arg. The ridiculous thing about this is that we know that in the code generated by GHC, the stack pointer will never be stored away anywhere, so it can’t possibly alias with the unknown pointer %ln1UA, and LLVM is being unnecessarily conservative.
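
The hazard LLVM is guarding against can be reproduced in any language with aliasing references. In the Java sketch below (names invented; arrays stand in for the stack and heap pointers), the two loops agree as long as sp and heap are different arrays, but diverge as soon as they are the same one. That is exactly why the forwarding is only valid given a no-alias guarantee.

```java
/** Hypothetical illustration of why store-to-load forwarding needs alias information. */
final class AliasingExample {
    /** Like the unoptimised loop: the counter lives in sp[0] and is reloaded every time. */
    static int reloading(long[] sp, long[] heap, int slot, long limit) {
        int iterations = 0;
        while (sp[0] < limit) {
            heap[slot] = 1337;     // clobbers the counter if heap == sp and slot == 0
            sp[0] = sp[0] + 1;     // reload, increment, store back
            iterations++;
        }
        return iterations;
    }

    /** Like the forwarded loop: the counter is kept in a local across iterations. */
    static int forwarded(long[] sp, long[] heap, int slot, long limit) {
        int iterations = 0;
        long counter = sp[0];      // loaded once, before the loop
        while (counter < limit) {
            heap[slot] = 1337;
            counter = counter + 1; // uses the value we stored, not what memory now holds
            sp[0] = counter;
            iterations++;
        }
        return iterations;
    }
}
```

With distinct arrays and a limit of 10 both versions run ten iterations. With heap == sp and slot 0, the reloading loop stops after one iteration (the counter becomes 1338) while the forwarded loop still runs ten, so the rewrite would change the program's behaviour.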

The solution

LLVM is a beautiful bit of software, and it provides exactly the extensibility point we require to resolve this problem: we can write our own alias analysis pass that knows that GHC’s stack never aliases with any other non-stack pointer, and dynamically load it into the LLVM optimisation tool chain.

This is exactly what I’ve done. The code is available as a Gist, and interested parties (who use OS X!) can build it like so:

g++ -D__STDC_LIMIT_MACROS -D__STDC_CONSTANT_MACROS -fno-exceptions -fno-rtti -fno-common -Wall \
-Wl,-flat_namespace -dynamiclib GHCAliasAnalysis.cpp -o GHCAliasAnalysis.dylib -lLLVM-`llvm-config --version`

Once built, we can dynamically load the resulting dylib into LLVM’s opt tool using the -load option, and then use the new -ghc-aa flag to tell LLVM to use our alias analyser as a complement to the default one. Unfortunately, due to an infelicity in LLVM, we have to specify -ghc-aa in between every single optimisation pass if we want to be sure that it is used. So the final command line to opt, including all passes done by the standard -O2 optimisation level, and the -loop-reduce strength-reduction pass, needs to look something like this:

opt -load GHCAliasAnalysis.dylib -S -no-aa -tbaa -basicaa -ghc-aa \
-globalopt -ghc-aa -ghc-aa -ipsccp -ghc-aa -deadargelim -ghc-aa -instcombine -ghc-aa -simplifycfg \
-ghc-aa -basiccg -ghc-aa -prune-eh -ghc-aa -inline -ghc-aa -functionattrs -ghc-aa -scalarrepl-ssa \
-ghc-aa -domtree -ghc-aa -early-cse -ghc-aa -simplify-libcalls -ghc-aa -lazy-value-info -ghc-aa \
-jump-threading -ghc-aa -correlated-propagation -ghc-aa -simplifycfg -ghc-aa -instcombine -ghc-aa \
-tailcallelim -ghc-aa -simplifycfg -ghc-aa -reassociate -ghc-aa -domtree -ghc-aa -loops -ghc-aa \
-loop-simplify -ghc-aa -lcssa -ghc-aa -loop-rotate -ghc-aa -licm -ghc-aa -lcssa -ghc-aa -loop-unswitch \
-ghc-aa -instcombine -ghc-aa -scalar-evolution -ghc-aa -loop-simplify -ghc-aa -lcssa -ghc-aa -indvars \
-ghc-aa -loop-idiom -ghc-aa -loop-deletion -ghc-aa -loop-unroll -ghc-aa -memdep -ghc-aa -gvn -ghc-aa \
-memdep -ghc-aa -memcpyopt -ghc-aa -sccp -ghc-aa -instcombine -ghc-aa -lazy-value-info -ghc-aa \
-jump-threading -ghc-aa -correlated-propagation -ghc-aa -domtree -ghc-aa -memdep -ghc-aa -dse \
-ghc-aa -adce -ghc-aa -simplifycfg -ghc-aa -instcombine -ghc-aa -strip-dead-prototypes -ghc-aa \
-constmerge -loop-reduce

(Yes, I know this is ridiculous! I hope the LLVM developers fix this soon.)

With my new alias analysis pass, LLVM is able to produce the following beautiful code for the loop:

c1TW:                                             ; preds = %c1TW,
  %lsr.iv = phi i64 [, %c1TW ], [ %5, ]
  %ln1UF1 = phi i64 [ %ln1TL1, ], [ %ln1UF, %c1TW ]
  %ln1UA = inttoptr i64 %lsr.iv to i64*
  store i64 1337, i64* %ln1UA, align 8
  %ln1UF = add i64 %ln1UF1, 1 = add i64 %lsr.iv, %6
  %ln1TQ = icmp slt i64 %ln1UF, %ln1TP2
  br i1 %ln1TQ, label %c1TW, label %n1TX.loopexit

Note that the original loop contained a store and two loads, but the optimised loop contains only a single store: our new alias analysis has allowed the loads to be floated out of the loop. This has in turn allowed LLVM to discover the loop induction variable and apply strength reduction - note that the final loop never uses the multiplication instruction!

The final program runs 8.8% faster than the version that is compiled without the custom alias analysis.


My custom alias analyser for GHC-generated code gives LLVM much more room for applying its existing powerful optimisations. There is plenty of scope for improvement, though:

  1. I’d really like people to report their experiences using this alias analyser with the LLVM backend. Do you see a big speed boost on your data-parallel Haskell programs, for example?

  2. Of course, I would like this alias analyser to be included with GHC so you can all seamlessly benefit from it. I’ll be working with GHC HQ to make this happen.

  3. I think there is still scope for getting even more useful information about GHC-generated code into LLVM. For example, currently LLVM is unable to eliminate stores to stack locations that we can see will never be accessed because we make a tail call to another function with a stack pointer that points above these locations. I can think of at least two ways to express this to LLVM, and this would produce another nice gain.

    It would also be great if we could teach LLVM something about the garbage collector, as currently if your loop does any allocation at all the presence of calls to the GC pessimises the output code a lot.

I was partly inspired to do this by Ben Lippmeier, whose paper at the Haskell Symposium this year had to do strength-reduction manually at the Haskell level because LLVM wasn’t working for him. I hope I’ve fixed that issue.

Performance problems were also a focus of the discussions about the future of Haskell at ICFP. I’ve been to these discussions three years in a row, and several topics keep cropping back up: performance, and the fact that Hackage 2.0 still isn’t released. I’ve grown tired of hearing so much talk about the issues with little-to-no action to resolve them, so I spent this post-ICFP week doing my best to fix them. I first wrote a documentation build bot for the Hackage 2.0 effort, and then moved on to the LLVM performance issues - if everyone helps to move these issues along then hopefully we can finally talk about some different problems next year!

Sep 10 2011

Constraint Kinds for GHC

I recently implemented a powerful new extension to GHC HEAD called ConstraintKinds. This (Literate Haskell) post will explain what this means, and how we can exploit it to do some cool stuff.

(For long-time readers, this stuff is a generalisation of my earlier post about constraint families which was later also expounded on by Dominic Orchard and Tom Schrijvers in Type Constraints Unleashed. The proposal in its current form is due to Conor McBride.)

First of all, we’re going to turn on a whacking great load of extensions:

{-# LANGUAGE ConstraintKinds, TypeFamilies, KindSignatures, Rank2Types,
             FlexibleInstances, UndecidableInstances,
             OverlappingInstances #-}

(Yes, some of the cooler examples will require UndecidableInstances. Never mind!)

Let’s have some imports as well:

import qualified Data.Set as S

When we talk about constraints in Haskell, we usually mean one of the following things:

  • Class contexts such as Show a
  • Implicit parameters, such as ?x::Int
  • Equality assertions, such as a ~ Int
  • Tuples of any of the above, such as (Show a, Read a)

In standard Haskell, these constraints can only occur to the left of the => arrow, and they are the only thing that can appear there. With the ConstraintKinds extension, we instead allow any type of a brand-new kind Constraint to appear to the left of =>. Naturally, all of the constraints we already mentioned are parsed as types, and are all given an appropriate kind:

  • Show :: * -> Constraint
  • (?x::Int) :: Constraint
  • (a ~ Int) :: Constraint
  • (Show a, Read a) :: Constraint

Constraint synonyms

At the simplest level, this unification of constraints and types means that code like the following is valid:

type Func cxt a = cxt a => a -> a
incc :: Func Num a
incc = (+1)

Or we can even use type synonyms as constraint synonyms:

type Stringy a = (Show a, Read a)
viaString :: Stringy a => a -> a
viaString = read . show

Simulating this without the extension is a little more cumbersome:

class (Show a, Read a) => Stringy a where
instance (Show a, Read a) => Stringy a where

Indexed constraints

But it doesn’t stop there. Since constraints are just types, we can type-index them using type functions! We can use this to solve the well-known problem where lists can be an instance of the Monad type class, but sets cannot. This problem arises because the elements of a set must be orderable, but e.g. the return method of the Monad class allows an element of any type to be made into an “element” of the monad — not only the orderable ones.

A restricted monad is a monad where we need to impose some constraints on the elements it can contain. Existing Hackage packages such as Ganesh Sittampalam’s rmonad package provide a way to define these monads in unextended Haskell. However, with our new extension we get a much smoother user experience by reusing the type function mechanism to encode a class of restricted monads:

class RMonad m where
  type RMonadCtxt m a :: Constraint
  return :: RMonadCtxt m a => a -> m a
  (>>=) :: (RMonadCtxt m a, RMonadCtxt m b) => m a -> (a -> m b) -> m b

Lists can of course be an instance of this class:

instance RMonad [] where
  type RMonadCtxt [] a = ()
  return x = [x]
  (>>=) = flip concatMap

But now so can sets:

instance RMonad S.Set where
  type RMonadCtxt S.Set a = Ord a
  return = S.singleton
  mx >>= fxmy = S.fromList [y | x <- S.toList mx, y <- S.toList (fxmy x)]

Another feature I added to GHC recently is associated type defaults. With this, we can change the RMonad class definition so that normal Monads which do not make any special demands of their element types can be defined without giving an explicit instance for RMonadCtxt:

class RMonad m where
  type RMonadCtxt m a :: Constraint
  type RMonadCtxt m a = ()
  return :: ...
  (>>=) :: ...

(Associated type defaults were always described in the published papers about associated types, but were never implemented until now).

Reified dictionaries

A common trick is to reify a constraint as an explicit dictionary using a GADT:

data ShowDict a where
  ShowDict :: Show a => ShowDict a
showish :: ShowDict a -> a -> String
showish ShowDict x = show x
use_showish :: String
use_showish = showish ShowDict 10

With our extension we can generalise this so you can define one reified dictionary to rule them all:

data Dict ctxt where
  Dict :: ctxt => Dict ctxt
showish' :: Dict (Show a) -> a -> String
showish' Dict x = show x
use_showish' :: String
use_showish' = showish' Dict 10

Generic programming

In “Scrap Your Boilerplate With Class”, Simon Peyton Jones and Ralf Laemmel proposed an encoding for generic functions in terms of type classes. However, their presentation was impeded by the fact that they could not abstract over type classes, and they had to have a heavy encoding mechanism to make it work. With our new extension we can write generic functions in their style in a much cleaner fashion.

First, we define the class of Data which has a generic mapping operation that applies a type-indexed function one level down in the data structure, returning all the results as a list:

class (cxt a) => Data cxt a where
  gmapQ :: Proxy cxt -> (forall b. Data cxt b => b -> r) -> a -> [r]

The cxt type variable will later be instantiated to a type class corresponding to the generic function we wish to apply. The Proxy cxt argument to gmapQ is an unfortunate artifact of the fact that Haskell still has no explicit type applications, so we have to use dummy value arguments to disambiguate which cxt we actually mean when we call gmapQ. The definition is trivial:

data Proxy (ctxt :: * -> Constraint) = Proxy

We can define Data instances for some built in types:

instance (cxt Int) => Data cxt Int where
  gmapQ _ f n = []
instance (cxt [a], Data cxt a) => Data cxt [a] where
  gmapQ _ f [] = []
  gmapQ _ f (x:xs) = [f x, f xs]

Now we can define a generic function gsize:

class Size a where
  gsize :: a -> Int

We can say how gsize works on particular types by giving an instance:

instance Size Int where
  gsize x = x

If no other instance is available, an overlapping instance based on gmapQ will be used:

instance Data Size t => Size t where
  gsize t = 1 + sum (gmapQ (Proxy :: Proxy Size) gsize t)

We can now evaluate gsize at both types Int and [Int] even though we never said explicitly what it means to take the size of a list:

use_gsize :: Int
use_gsize = gsize (1 :: Int) + gsize [1 :: Int, 2]

Wrapping up

The ConstraintKinds extension makes these three idioms much neater, but I’m sure there are plenty of other places where this new power will come in useful. Try it out for yourself in GHC 7.4 and find out!

Thanks are due to Simon Marlow for organising CamHac, where I started working on the implementation, and Dominic Orchard and Nicolas Wu who collaborated with me during the early stages of coding. Thanks also to Simon Peyton Jones for invaluable advice that finally let me merge it into GHC.

Jul 6 2011

The Sad State of Symbol Aliases

This post continues my quest to condense and write down some of the folklore surrounding assemblers and linkers. In this case, I recently came across a situation where it would be useful to be able to generate an object file that contained an alias for a symbol defined elsewhere. For example, I want an object file to export a symbol foo that aliases bar, such that when any use site of foo is linked against the object file, that use site then behaves exactly as if it had referenced bar instead.

This could be done straightforwardly (just export foo with the same value as bar) except for the wrinkle that in general bar is not defined in the object file exporting foo, so we don't know its value yet.

This article picks apart support for this feature on a platform-by-platform basis. Long story short: this is supported by the object file format on OS X and Windows, but you can't get to it from the assembly code level. Linux has no support at all.

OS X

Buried deep within the Mach-O specification is a mention of the symbol table entry type N_INDR. Quoth the standard: "The symbol is defined to be the same as another symbol. The n_value field is an index into the string table specifying the name of the other symbol. When that symbol is linked, both this and the other symbol have the same defined type and value".

This is great stuff, and exactly what we want! The fly in the ointment is that the latest version of Apple's assembler has no support for actually generating such indirections. The source tree does contain a tool called indr which is capable of generating these indirections in a limited capacity, but it is not distributed with OS X and anyway not general enough for our needs. Happily, Apple's linker does seem to include support for N_INDR, so everything should work OK if you manage to generate an object file making use of that type.

Windows

Interestingly, Windows DLLs support something called "forwarders" which give us the behaviour we want for dynamically exported symbols. You can create such DLLs with special syntax in your .def file EXPORTS section. This is not relevant to our problem though, because there is no equivalent at the object file level.

Page 44 of the PE/COFF specification talks about symbol tables. Reading carefully, we find a mention of "Weak Externals" on page 51:

“Weak externals” are a mechanism for object files that allows flexibility at link time. A module can contain an unresolved external symbol (sym1), but it can also include an auxiliary record that indicates that if sym1 is not present at link time, another external symbol (sym2) is used to resolve references instead. If a definition of sym1 is linked, then an external reference to the symbol is resolved normally. If a definition of sym1 is not linked, then all references to the weak external for sym1 refer to sym2 instead. The external symbol, sym2, must always be linked; typically, it is defined in the module that contains the weak reference to sym1.

This is not exactly what we had in mind, but it can be abused for the same effect. Nothing will go wrong unless someone else defines a symbol with the same name as our alias in another object file.
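
To make the fallback rule concrete, here is a minimal C sketch of how a linker might resolve a weak external. The definition and weak_external structs and both function names are hypothetical stand-ins for illustration, not the PE/COFF record layout.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* A symbol definition contributed by some object on the link line. */
struct definition { const char *name; unsigned value; };

/* Weak external record: use sym2 when no definition of sym1 is linked. */
struct weak_external { const char *sym1; const char *sym2; };

/* Linear search of the linked definitions; NULL if the name is undefined. */
const struct definition *find_definition(const struct definition *defs,
                                         size_t n, const char *name) {
    for (size_t i = 0; i < n; i++)
        if (strcmp(defs[i].name, name) == 0)
            return &defs[i];
    return NULL;
}

/* Resolve a reference to w->sym1. Returns 1 and stores the value in *out on
   success, or 0 if neither sym1 nor the fallback sym2 is defined. */
int resolve_weak_external(const struct definition *defs, size_t n,
                          const struct weak_external *w, unsigned *out) {
    assert(w != NULL && out != NULL);
    const struct definition *d = find_definition(defs, n, w->sym1);
    if (d == NULL)
        d = find_definition(defs, n, w->sym2); /* fall back to the alias target */
    if (d == NULL)
        return 0;
    *out = d->value;
    return 1;
}
```

With the record {"foo", "bar"} and only bar defined, a reference to foo resolves to bar's value, which is the aliasing behaviour we are after. If another object defines foo, that definition wins instead.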

As far as I can see, the GNU assembler can't be persuaded to generate this. The assembler does have rudimentary support for generating weak externals, but only uses it in the limited capacity of supporting the .weak directive (with ELF-style semantics) on Windows. And as we shall shortly see, ELF semantics are not what we want at all...

Linux

Turning to page 1-16 of the ELF specification we find the definition of the ELF symbol table. As far as I can tell, there is no support whatsoever for this use case. Bah.

We might be tempted to search for some equivalent to the weak externals feature on Windows. Unfortunately, ELF weak symbols have a rather different semantics:

  1. An undefined weak symbol will not cause the linker to error out if a definition is not found. Instead, the symbol will be filled in with a default value of 0.
  2. A defined weak symbol has a lower link precedence than a strong symbol of the same name, and will not cause the linker to generate an error about duplicate symbol definitions in the case of such a conflict.

The difference between this and the Windows situation is that Windows basically lets us change the default value filled in by the linker in the case of no definition being found to an arbitrary symbol.
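
Both ELF behaviours can be observed from C using the weak attribute. This is a sketch that assumes GCC or Clang on an ELF target such as Linux; optional_hook and default_answer are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Weak reference: nothing in the program defines optional_hook, so the
   linker does not complain and the symbol's address is 0. */
extern int optional_hook(int x) __attribute__((weak));

/* Weak definition: a strong default_answer in another object file would
   take precedence without a duplicate-symbol error. */
__attribute__((weak)) int default_answer(void) { return 42; }

int call_hook_or_default(int x) {
    if (optional_hook != NULL)      /* address is 0 when undefined */
        return optional_hook(x);
    return default_answer();
}

/* With no strong definitions linked in, the fallback path is taken. */
void check_weak_fallback(void) {
    assert(call_hook_or_default(7) == 42);
}
```

Note that the fallback here has to be written out by hand at the use site. The linker only ever substitutes 0 for the missing symbol, never another symbol.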

GCC

GCC supports an alias attribute that does exactly what I want. Unfortunately, despite a few people trying to do exactly this, the GCC developers have elected to reject the construct when the aliased symbol is not defined:

This is because it's meaningless to define an alias to an undefined symbol. On Solaris, the native assembler would have caught this error, but GNU as does not.

This comment refers to the fact that assembly like this:

.globl reexport
.globl export
.equiv export, reexport

does not fail to assemble with the GNU assembler, but generates an object file that does not define any symbols despite referencing the reexport symbol.
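
For contrast, the form of alias that GCC does accept has its target defined in the same translation unit. A minimal sketch, assuming GCC or Clang on an ELF target (the attribute is not available on every platform); foo and bar follow the naming used at the start of this post, and call_via_alias is a hypothetical helper:

```c
#include <assert.h>

/* The target of the alias; it must be defined in this translation unit. */
int bar(void) { return 7; }

/* foo is emitted as a second symbol with the same value as bar. */
int foo(void) __attribute__((alias("bar")));

/* Calls through either name run the same code. */
int call_via_alias(void) {
    int result = foo();
    assert(result == bar());
    return result;
}
```

Moving the definition of bar into a different object file is exactly the case that gets rejected.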

Conclusion

A sufficiently motivated hacker could support a (weak) aliasing feature along the lines described above in the GNU assembler on Windows and OS X without problems. However, there seems to be no way to support it on Linux within the bounds of the ELF specification.

Unusually, Linux is the platform that lags behind the others in linker features! I usually find that quite the opposite is true.

Jul 4 2011

Everything You Never Wanted To Know About DLLs

I've recently had cause to investigate how dynamic linking is implemented on Windows. This post is basically a brain dump of everything I've learnt on the issue. This is mostly for my future reference, but I hope it will be useful to others too as I'm going to bring together lots of information you would otherwise have to hunt around for.

Without further ado, here we go:

Export and import directories

The Windows executable loader is responsible for doing all dynamic loading and symbol resolution before running the code. The loader works out what functions are exported or imported by each image (an image is a DLL or EXE file) by inspecting the .edata and .idata sections of those images, respectively.

The contents of these sections are covered in detail by the PE/COFF specification.

The .edata section

This section records the exports of the image (yes, EXEs can export things). This takes the form of:

  • The export address table: an array of length N holding the addresses of the exported functions/data (the addresses are stored relative to the image base). Indexes into this table are called ordinals.
  • The export name pointer table: an array of length M holding pointers to strings that represent the name of an export. This array is lexically ordered by name, to allow binary searches for a given export.
  • The export ordinal table: a parallel array of length M holding the ordinal of the corresponding name in the export name pointer table.

(As an alternative to importing an image's export by its name, it is possible to import by specifying an ordinal. Importing by ordinal is slightly faster at runtime because the dynamic linker doesn't have to do a lookup. Furthermore, if the export is not given a name by the exporting DLL, importing by ordinal is the only way to do the import.)
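
The relationship between the three tables can be sketched in C as follows. The export_dir struct and the function names are hypothetical; this is a toy model rather than the on-disk layout, and it ignores details such as the ordinal base.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy model of the three export tables described above. */
struct export_dir {
    const uint32_t *addresses;     /* export address table (image-relative) */
    size_t          n_addresses;   /* N */
    const char    **names;         /* export name pointer table, sorted */
    const uint16_t *name_ordinals; /* export ordinal table, parallel to names */
    size_t          n_names;       /* M */
};

/* Import by ordinal: a direct index. Returns 0 if the ordinal is out of range. */
uint32_t lookup_by_ordinal(const struct export_dir *d, size_t ordinal) {
    assert(d != NULL);
    return ordinal < d->n_addresses ? d->addresses[ordinal] : 0;
}

/* Import by name: binary search the sorted names, then map the hit through
   the ordinal table into the address table. Returns 0 if not found. */
uint32_t lookup_by_name(const struct export_dir *d, const char *name) {
    assert(d != NULL && name != NULL);
    size_t lo = 0, hi = d->n_names;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        int cmp = strcmp(name, d->names[mid]);
        if (cmp == 0)
            return lookup_by_ordinal(d, d->name_ordinals[mid]);
        if (cmp < 0)
            hi = mid;
        else
            lo = mid + 1;
    }
    return 0;
}
```

The by-name path does strictly more work than the by-ordinal path, which is where the small runtime difference mentioned above comes from.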

How does the .edata section get created in the first place? There are two main methods:

  1. Most commonly, they start life in the object files created by compiling some source code that defines a function/some data that was declared with the __declspec(dllexport) modifier. The compiler just emits an appropriate .edata section naming these exports.

  2. Less commonly, the programmer might write a .def file specifying which functions they would like to export. By supplying this to dlltool --output-exp, an export file can be generated. An export file is just an object file which only contains a .edata section, exporting (via some unresolved references that will be filled in by the linker in the usual way) the symbols named in the .def file. This export file must be named by the programmer when they come to link their object files together into a DLL.

In both these cases, the linker collects the .edata sections from all objects named on the link line to build the .edata for the overall image file. One last possible way that the .edata can be created is by the linker itself, without having to put .edata into any object files:

  1. The linker could choose to export all symbols defined by object files named on the link line. For example, this is the default behaviour of GNU ld (the behaviour can also be explicitly asked for using --export-all-symbols). In this case, the linker generates the .edata section itself. (GNU ld also supports specifying a .def file on the command line, in which case the generated section will export just those things named by the .def).

The .idata section

The .idata section records those things that the image imports. It consists of:

  • For every image from which symbols are imported:

    • The filename of the image. Used by the dynamic linker to locate it on disk.

    • The import lookup table: an array of length N, in which each entry is either an ordinal or a pointer to a string representing the name to import.

    • The import address table: an array of N pointers. The dynamic linker is responsible for filling out this array with the address of the function/data named by the corresponding symbol in the import lookup table.
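
As a rough illustration of what the dynamic linker does with these two tables, here is a C sketch that fills an import address table from an import lookup table. The struct layouts and the name bind_imports are hypothetical, and the exporting image is modelled as a plain array searched linearly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* One export of the target image; its index in the array is its ordinal. */
struct export_entry { const char *name; void *address; };

/* One import lookup table entry: an ordinal or a name (hypothetical layout). */
struct import_entry {
    int         by_ordinal; /* non-zero: use ordinal, zero: use name */
    uint16_t    ordinal;
    const char *name;
};

/* Fill iat[i] with the address requested by ilt[i], or NULL if it cannot be
   found. Returns the number of unresolved imports. */
size_t bind_imports(const struct import_entry *ilt, void **iat, size_t n,
                    const struct export_entry *exports, size_t n_exports) {
    size_t unresolved = 0;
    assert(ilt != NULL && iat != NULL);
    for (size_t i = 0; i < n; i++) {
        iat[i] = NULL;
        if (ilt[i].by_ordinal) {
            if (ilt[i].ordinal < n_exports)
                iat[i] = exports[ilt[i].ordinal].address;
        } else {
            for (size_t j = 0; j < n_exports; j++)
                if (strcmp(exports[j].name, ilt[i].name) == 0)
                    iat[i] = exports[j].address;
        }
        if (iat[i] == NULL)
            unresolved++;
    }
    return unresolved;
}
```

After binding, code in the importing image never consults the lookup table again. It only reads pointers out of the address table.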

The ways in which .idata entries are created are as follows:

  1. Most commonly, they originate in a library of object files called an import library. This import library can be created by using dlltool on the DLL you wish to import from, or on a .def file of the type we discussed earlier. Just like the export file, the import library must be named by the user on the link line.

  2. Alternatively, some linkers (like GNU ld) let you specify a DLL directly on the link line. The linker will automatically generate .idata entries for any symbols that you must import from the DLL.

Notice that unlike the case when we were exporting symbols, __declspec(dllimport) does not cause .idata sections to be generated.

Import libraries are a bit more complicated than they first appear. The Windows dynamic loader fills the import address table with the addresses of the imported symbols (say, the address of a function Func). However, when the assembly code in other object files says call Func, it expects Func to name the address of that code. But we don't know that address until runtime: the only thing we know statically is the address where that address will be placed by the dynamic linker. We will call this address __imp__Func.

To deal with this extra level of indirection, the import library exports a function Func that just dereferences __imp__Func (to get the actual function pointer) and then jmps to it. All of the other object files in the project can now say call Func just as they would if Func had been defined in some other object file, rather than a DLL. For this reason, saying __declspec(dllimport) in the declaration of a dynamically linked function is optional (though in fact you will get slightly more efficient code if you add it, as we will see later).
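
In C terms, the thunk behaves roughly like the sketch below. Here imp_Func stands in for the __imp__Func slot, bind_Func stands in for the work the loader does, and real_Func stands in for the implementation inside the DLL; all of these names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*func_ptr)(void);

/* Stand-in for the function as implemented inside the DLL. */
int real_Func(void) { return 1337; }

/* Stand-in for the import address table slot called __imp__Func above. */
static func_ptr imp_Func = NULL;

/* Stand-in for the loader: store the function's run-time address in the slot. */
void bind_Func(func_ptr address) { imp_Func = address; }

/* The thunk exported by the import library under the plain name Func:
   load the pointer from the slot and transfer control to it. */
int Func(void) {
    assert(imp_Func != NULL);
    return imp_Func();
}

/* Roughly what a call compiles to when the declaration carries
   __declspec(dllimport): an indirect call through the slot, with no thunk. */
int call_Func_directly(void) {
    assert(imp_Func != NULL);
    return imp_Func();
}
```

The second form saves one jump per call, which is the efficiency gain from adding the declaration.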

Unfortunately, there is no equivalent trick if you want to import data from another DLL. If we have some imported data myData, there is no way the import library can be defined so that a mov %eax, myData in an object file linked against it writes to the storage for myData in that DLL. Instead, the import library defines a symbol __imp__myData that resolves to the address at which the linked-in address of the storage can be found. The compiler then ensures that when you read or write from a variable defined with __declspec(dllimport) those reads and writes go through the __imp__myData indirection. Because different code needs to be generated at the use site, __declspec declarations on data imports are not optional.
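
The data case can be sketched the same way. In this hypothetical model dll_myData is the storage inside the DLL, imp_myData plays the role of __imp__myData, and bind_myData stands in for the loader.

```c
#include <assert.h>
#include <stddef.h>

/* Storage that, in the real setting, lives inside the DLL. */
int dll_myData = 42;

/* Stand-in for __imp__myData: a slot holding the address of the storage. */
int *imp_myData = NULL;

/* Stand-in for the loader filling in the slot at load time. */
void bind_myData(void) { imp_myData = &dll_myData; }

/* What the compiler generates for a read of a dllimport variable:
   load the address from the slot, then load through it. */
int read_myData(void) {
    assert(imp_myData != NULL);
    return *imp_myData;
}

/* Likewise for a write: one extra dereference at every use site. */
void write_myData(int value) {
    assert(imp_myData != NULL);
    *imp_myData = value;
}
```

Unlike the function case, there is no single place to hide this dereference, so every use site has to be compiled with it.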

Practical example

Theory is all very well but it can be helpful to see all the pieces in play.

Building a DLL

First, let's build a simple DLL exporting both functions and data. For maximum clarity, we'll use an explicit export file rather than decorating our functions with __declspec(dllexport) or supplying a .def file to the linker.

First let's write the .def file, library.def:

LIBRARY library
EXPORTS
   function_export
   data_export      DATA

(The DATA keyword and LIBRARY line only affect how the import library is generated, as explained later on. Ignore them for now.)

Build an export file from that:

$ dlltool --output-exp library_exports.o -d library.def

The resulting object basically just contains an .edata section that exports the symbols _data_export and _function_export under the names data_export and function_export respectively:

$ objdump -xs library_exports.o


There is an export table in .edata at 0x0

The Export Tables (interpreted .edata section contents)

Export Flags                    0
Time/Date stamp                 4e10e5c1
Major/Minor                     0/0
Name                            00000028 library_exports.o.dll
Ordinal Base                    1
Number in:
        Export Address Table            00000002
        [Name Pointer/Ordinal] Table    00000002
Table Addresses
        Export Address Table            00000040
        Name Pointer Table              00000048
        Ordinal Table                   00000050

Export Address Table -- Ordinal Base 1

[Ordinal/Name Pointer] Table
        [   0] data_export
        [   1] function_export

Idx Name          Size      VMA       LMA       File off  Algn
  0 .text         00000000  00000000  00000000  00000000  2**2
                  ALLOC, LOAD, READONLY, CODE
  1 .data         00000000  00000000  00000000  00000000  2**2
                  ALLOC, LOAD, DATA
  2 .bss          00000000  00000000  00000000  00000000  2**2
  3 .edata        00000070  00000000  00000000  000000b4  2**2
[  0](sec -2)(fl 0x00)(ty   0)(scl 103) (nx 1) 0x00000000 fake
[  2](sec  4)(fl 0x00)(ty   0)(scl   3) (nx 0) 0x00000028 name
[  3](sec  4)(fl 0x00)(ty   0)(scl   3) (nx 0) 0x00000040 afuncs
[  4](sec  4)(fl 0x00)(ty   0)(scl   3) (nx 0) 0x00000048 anames
[  5](sec  4)(fl 0x00)(ty   0)(scl   3) (nx 0) 0x00000050 anords
[  6](sec  4)(fl 0x00)(ty   0)(scl   3) (nx 0) 0x00000054 n1
[  7](sec  4)(fl 0x00)(ty   0)(scl   3) (nx 0) 0x00000060 n2
[  8](sec  1)(fl 0x00)(ty   0)(scl   3) (nx 1) 0x00000000 .text
AUX scnlen 0x0 nreloc 0 nlnno 0
[ 10](sec  2)(fl 0x00)(ty   0)(scl   3) (nx 1) 0x00000000 .data
AUX scnlen 0x0 nreloc 0 nlnno 0
[ 12](sec  3)(fl 0x00)(ty   0)(scl   3) (nx 1) 0x00000000 .bss
AUX scnlen 0x0 nreloc 0 nlnno 0
[ 14](sec  4)(fl 0x00)(ty   0)(scl   3) (nx 1) 0x00000000 .edata
AUX scnlen 0x70 nreloc 8 nlnno 0
[ 16](sec  0)(fl 0x00)(ty   0)(scl   2) (nx 0) 0x00000000 _data_export
[ 17](sec  0)(fl 0x00)(ty   0)(scl   2) (nx 0) 0x00000000 _function_export

OFFSET   TYPE              VALUE
0000000c rva32             .edata
0000001c rva32             .edata
00000020 rva32             .edata
00000024 rva32             .edata
00000040 rva32             _data_export
00000044 rva32             _function_export
00000048 rva32             .edata
0000004c rva32             .edata

Contents of section .edata:
 0000 00000000 c1e5104e 00000000 28000000  .......N....(...
 0010 01000000 02000000 02000000 40000000  ............@...
 0020 48000000 50000000 6c696272 6172795f  H...P...library_
 0030 6578706f 7274732e 6f2e646c 6c000000  exports.o.dll...
 0040 00000000 00000000 54000000 60000000  ........T...`...
 0050 00000100 64617461 5f657870 6f727400  ....data_export.
 0060 66756e63 74696f6e 5f657870 6f727400  function_export.

We'll fulfil these symbols with a trivial implementation of the DLL, library.c:

int data_export = 42;

int function_export() {
    return 1337 + data_export;
}

We can put it together into a DLL:

$ gcc -shared -o library.dll library.c library_exports.o

The export table for the DLL is as follows, showing that we have exported what we wanted:

The Export Tables (interpreted .edata section contents)

Export Flags                    0
Time/Date stamp                 4e10e5c1
Major/Minor                     0/0
Name                            00005028 library_exports.o.dll
Ordinal Base                    1
Number in:
        Export Address Table            00000002
        [Name Pointer/Ordinal] Table    00000002
Table Addresses
        Export Address Table            00005040
        Name Pointer Table              00005048
        Ordinal Table                   00005050

Export Address Table -- Ordinal Base 1
        [   0] +base[   1] 200c Export RVA
        [   1] +base[   2] 10f0 Export RVA

[Ordinal/Name Pointer] Table
        [   0] data_export
        [   1] function_export

Using the DLL

When we come to look at using the DLL, things become a lot more interesting. First, we need an import library:

$ dlltool --output-lib library.dll.a -d library.def

(The reason that we have an import library but an export object is that using a library for the imports allows the linker to discard .idata for any imports that are not used. Contrariwise, the linker can never discard any .edata entry because any export may potentially be used by a user of the DLL.)

This import library is rather complex. It contains one object for each export (disds00000.o and disds00001.o) but also two other object files (disdt.o and disdh.o) that set up the header and footer of the import list. (The header of the import list contains, among other things, the name of the DLL to link in at runtime, as derived from the LIBRARY line of the .def file.)

$ objdump -xs library.dll.a
In archive library.dll.a:

disdt.o:     file format pe-i386
...
Sections:
Idx Name          Size      VMA       LMA       File off  Algn
  0 .text         00000000  00000000  00000000  00000000  2**2
                  ALLOC, LOAD, READONLY, CODE
  1 .data         00000000  00000000  00000000  00000000  2**2
                  ALLOC, LOAD, DATA
  2 .bss          00000000  00000000  00000000  00000000  2**2
                  ALLOC
  3 .idata$4      00000004  00000000  00000000  00000104  2**2
                  CONTENTS, ALLOC, LOAD, DATA
  4 .idata$5      00000004  00000000  00000000  00000108  2**2
                  CONTENTS, ALLOC, LOAD, DATA
  5 .idata$7      0000000c  00000000  00000000  0000010c  2**2
                  CONTENTS, ALLOC, LOAD, DATA
SYMBOL TABLE:
[  0](sec -2)(fl 0x00)(ty   0)(scl 103) (nx 1) 0x00000000 fake
File
[  2](sec  1)(fl 0x00)(ty   0)(scl   3) (nx 1) 0x00000000 .text
AUX scnlen 0x0 nreloc 0 nlnno 0
[  4](sec  2)(fl 0x00)(ty   0)(scl   3) (nx 1) 0x00000000 .data
AUX scnlen 0x0 nreloc 0 nlnno 0
[  6](sec  3)(fl 0x00)(ty   0)(scl   3) (nx 1) 0x00000000 .bss
AUX scnlen 0x0 nreloc 0 nlnno 0
[  8](sec  4)(fl 0x00)(ty   0)(scl   3) (nx 1) 0x00000000 .idata$4
AUX scnlen 0x4 nreloc 0 nlnno 0
[ 10](sec  5)(fl 0x00)(ty   0)(scl   3) (nx 1) 0x00000000 .idata$5
AUX scnlen 0x4 nreloc 0 nlnno 0
[ 12](sec  6)(fl 0x00)(ty   0)(scl   3) (nx 1) 0x00000000 .idata$7
AUX scnlen 0x7 nreloc 0 nlnno 0
[ 14](sec  6)(fl 0x00)(ty   0)(scl   2) (nx 0) 0x00000000 __library_dll_a_iname

Contents of section .idata$4:
 0000 00000000                             ....
Contents of section .idata$5:
 0000 00000000                             ....
Contents of section .idata$7:
 0000 6c696272 6172792e 646c6c00           library.dll.

disdh.o:     file format pe-i386
...
Sections:
Idx Name          Size      VMA       LMA       File off  Algn
  0 .text         00000000  00000000  00000000  00000000  2**2
                  ALLOC, LOAD, READONLY, CODE
  1 .data         00000000  00000000  00000000  00000000  2**2
                  ALLOC, LOAD, DATA
  2 .bss          00000000  00000000  00000000  00000000  2**2
                  ALLOC
  3 .idata$2      00000014  00000000  00000000  00000104  2**2
                  CONTENTS, ALLOC, LOAD, RELOC, DATA
  4 .idata$5      00000000  00000000  00000000  00000000  2**2
                  ALLOC, LOAD, DATA
  5 .idata$4      00000000  00000000  00000000  00000000  2**2
                  ALLOC, LOAD, DATA
SYMBOL TABLE:
[  0](sec -2)(fl 0x00)(ty   0)(scl 103) (nx 1) 0x00000000 fake
File
[  2](sec  6)(fl 0x00)(ty   0)(scl   3) (nx 0) 0x00000000 hname
[  3](sec  5)(fl 0x00)(ty   0)(scl   3) (nx 0) 0x00000000 fthunk
[  4](sec  1)(fl 0x00)(ty   0)(scl   3) (nx 1) 0x00000000 .text
AUX scnlen 0x0 nreloc 0 nlnno 0
[  6](sec  2)(fl 0x00)(ty   0)(scl   3) (nx 1) 0x00000000 .data
AUX scnlen 0x0 nreloc 0 nlnno 0
[  8](sec  3)(fl 0x00)(ty   0)(scl   3) (nx 1) 0x00000000 .bss
AUX scnlen 0x0 nreloc 0 nlnno 0
[ 10](sec  4)(fl 0x00)(ty   0)(scl   3) (nx 1) 0x00000000 .idata$2
AUX scnlen 0x14 nreloc 3 nlnno 0
[ 12](sec  6)(fl 0x00)(ty   0)(scl   3) (nx 0) 0x00000000 .idata$4
[ 13](sec  5)(fl 0x00)(ty   0)(scl   3) (nx 0) 0x00000000 .idata$5
[ 14](sec  4)(fl 0x00)(ty   0)(scl   2) (nx 0) 0x00000000 __head_library_dll_a
[ 15](sec  0)(fl 0x00)(ty   0)(scl   2) (nx 0) 0x00000000 __library_dll_a_iname

RELOCATION RECORDS FOR [.idata$2]:
OFFSET   TYPE              VALUE
00000000 rva32             .idata$4
0000000c rva32             __library_dll_a_iname
00000010 rva32             .idata$5

Contents of section .idata$2:
 0000 00000000 00000000 00000000 00000000  ................
 0010 00000000                             ....

disds00001.o:     file format pe-i386
...
Sections:
Idx Name          Size      VMA       LMA       File off  Algn
  0 .text         00000008  00000000  00000000  0000012c  2**2
                  CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE
  1 .data         00000000  00000000  00000000  00000000  2**2
                  ALLOC, LOAD, DATA
  2 .bss          00000000  00000000  00000000  00000000  2**2
                  ALLOC
  3 .idata$7      00000004  00000000  00000000  00000134  2**2
                  CONTENTS, RELOC
  4 .idata$5      00000004  00000000  00000000  00000138  2**2
                  CONTENTS, RELOC
  5 .idata$4      00000004  00000000  00000000  0000013c  2**2
                  CONTENTS, RELOC
  6 .idata$6      00000012  00000000  00000000  00000140  2**1
                  CONTENTS
SYMBOL TABLE:
[  0](sec  1)(fl 0x00)(ty   0)(scl   3) (nx 0) 0x00000000 .text
[  1](sec  2)(fl 0x00)(ty   0)(scl   3) (nx 0) 0x00000000 .data
[  2](sec  3)(fl 0x00)(ty   0)(scl   3) (nx 0) 0x00000000 .bss
[  3](sec  4)(fl 0x00)(ty   0)(scl   3) (nx 0) 0x00000000 .idata$7
[  4](sec  5)(fl 0x00)(ty   0)(scl   3) (nx 0) 0x00000000 .idata$5
[  5](sec  6)(fl 0x00)(ty   0)(scl   3) (nx 0) 0x00000000 .idata$4
[  6](sec  7)(fl 0x00)(ty   0)(scl   3) (nx 0) 0x00000000 .idata$6
[  7](sec  1)(fl 0x00)(ty   0)(scl   2) (nx 0) 0x00000000 _function_export
[  8](sec  5)(fl 0x00)(ty   0)(scl   2) (nx 0) 0x00000000 __imp__function_export
[  9](sec  0)(fl 0x00)(ty   0)(scl   2) (nx 0) 0x00000000 __head_library_dll_a

RELOCATION RECORDS FOR [.text]:
OFFSET   TYPE              VALUE
00000002 dir32             .idata$5

RELOCATION RECORDS FOR [.idata$7]:
OFFSET   TYPE              VALUE
00000000 rva32             __head_library_dll_a

RELOCATION RECORDS FOR [.idata$5]:
OFFSET   TYPE              VALUE
00000000 rva32             .idata$6

RELOCATION RECORDS FOR [.idata$4]:
OFFSET   TYPE              VALUE
00000000 rva32             .idata$6

Contents of section .text:
 0000 ff250000 00009090                    .%......
Contents of section .idata$7:
 0000 00000000                             ....
Contents of section .idata$5:
 0000 00000000                             ....
Contents of section .idata$4:
 0000 00000000                             ....
Contents of section .idata$6:
 0000 01006675 6e637469 6f6e5f65 78706f72  ..function_expor
 0010 7400                                 t.

disds00000.o:     file format pe-i386
...
Sections:
Idx Name          Size      VMA       LMA       File off  Algn
  0 .text         00000000  00000000  00000000  00000000  2**2
                  ALLOC, LOAD, READONLY, CODE
  1 .data         00000000  00000000  00000000  00000000  2**2
                  ALLOC, LOAD, DATA
  2 .bss          00000000  00000000  00000000  00000000  2**2
                  ALLOC
  3 .idata$7      00000004  00000000  00000000  0000012c  2**2
                  CONTENTS, RELOC
  4 .idata$5      00000004  00000000  00000000  00000130  2**2
                  CONTENTS, RELOC
  5 .idata$4      00000004  00000000  00000000  00000134  2**2
                  CONTENTS, RELOC
  6 .idata$6      0000000e  00000000  00000000  00000138  2**1
                  CONTENTS
SYMBOL TABLE:
[  0](sec  1)(fl 0x00)(ty   0)(scl   3) (nx 0) 0x00000000 .text
[  1](sec  2)(fl 0x00)(ty   0)(scl   3) (nx 0) 0x00000000 .data
[  2](sec  3)(fl 0x00)(ty   0)(scl   3) (nx 0) 0x00000000 .bss
[  3](sec  4)(fl 0x00)(ty   0)(scl   3) (nx 0) 0x00000000 .idata$7
[  4](sec  5)(fl 0x00)(ty   0)(scl   3) (nx 0) 0x00000000 .idata$5
[  5](sec  6)(fl 0x00)(ty   0)(scl   3) (nx 0) 0x00000000 .idata$4
[  6](sec  7)(fl 0x00)(ty   0)(scl   3) (nx 0) 0x00000000 .idata$6
[  7](sec  5)(fl 0x00)(ty   0)(scl   2) (nx 0) 0x00000000 __imp__data_export
[  8](sec  0)(fl 0x00)(ty   0)(scl   2) (nx 0) 0x00000000 __head_library_dll_a

RELOCATION RECORDS FOR [.idata$7]:
OFFSET   TYPE              VALUE
00000000 rva32             __head_library_dll_a

RELOCATION RECORDS FOR [.idata$5]:
OFFSET   TYPE              VALUE
00000000 rva32             .idata$6

RELOCATION RECORDS FOR [.idata$4]:
OFFSET   TYPE              VALUE
00000000 rva32             .idata$6

Contents of section .idata$7:
 0000 00000000                             ....
Contents of section .idata$5:
 0000 00000000                             ....
Contents of section .idata$4:
 0000 00000000                             ....
Contents of section .idata$6:
 0000 00006461 74615f65 78706f72 7400      ..data_export.

Note that the object corresponding to data_export has an empty .text section, whereas function_export does define some code. If we disassemble it we get this:

00000000 <_function_export>:
   0:   ff 25 00 00 00 00       jmp    *0x0
                        2: dir32        .idata$5
   6:   90                      nop
   7:   90                      nop

The relocation of type dir32 tells the linker how to fill in the address being dereferenced by the jmp. We can see that _function_export, when entered, will jump directly to the function at the address loaded from the memory named .idata$5. Inspection of the complete .idata section satisfies us that .idata$5 corresponds to the address of the fragment of the import address table corresponding to the function_export import name, and hence the address where the absolute address of the loaded function_export import can be found.

Although only function_export gets a corresponding _function_export function, both of the exports have led to a symbol with the __imp__ prefix (__imp__data_export and __imp__function_export) being defined in the import library. As discussed before, this symbol stands for the address at which the pointer to the data/function will be inserted by the dynamic linker. As such, the __imp__ symbols always point directly into the import address table.

With an import library in hand, we are capable of writing some client code that uses our exports, main1.c:

#include <stdio.h>

__declspec(dllimport) extern int function_export(void);
__declspec(dllimport) extern int data_export;

int main(int argc, char **argv) {
    printf("%d\n", function_export());
    printf("%d\n", data_export);

    data_export++;

    printf("%d\n", function_export());
    printf("%d\n", data_export);

    return 0;
}

Build and link it against the import library and we will get the results we expect:

$ gcc main1.c library.dll.a -o main1 && ./main1

The reason that this works even though there is no data_export symbol defined by library.dll.a is that the __declspec(dllimport) qualifier on our data_export declaration in main1.c has caused the compiler to generate code that uses the __imp__data_export symbol directly, as we can see if we disassemble the generated code:

$ gcc -c main1.c -o main1.o && objdump --disassemble -r main1.o

main1.o:     file format pe-i386

Disassembly of section .text:

00000000 <_main>:
   0:   8d 4c 24 04             lea    0x4(%esp),%ecx
   4:   83 e4 f0                and    $0xfffffff0,%esp
   7:   ff 71 fc                pushl  -0x4(%ecx)
   a:   55                      push   %ebp
   b:   89 e5                   mov    %esp,%ebp
   d:   51                      push   %ecx
   e:   83 ec 14                sub    $0x14,%esp
  11:   e8 00 00 00 00          call   16 <_main+0x16>
                        12: DISP32      ___main
  16:   a1 00 00 00 00          mov    0x0,%eax
                        17: dir32       __imp__function_export
  1b:   ff d0                   call   *%eax
  1d:   89 44 24 04             mov    %eax,0x4(%esp)
  21:   c7 04 24 00 00 00 00    movl   $0x0,(%esp)
                        24: dir32       .rdata
  28:   e8 00 00 00 00          call   2d <_main+0x2d>
                        29: DISP32      _printf
  2d:   a1 00 00 00 00          mov    0x0,%eax
                        2e: dir32       __imp__data_export
  32:   8b 00                   mov    (%eax),%eax
  34:   89 44 24 04             mov    %eax,0x4(%esp)
  38:   c7 04 24 00 00 00 00    movl   $0x0,(%esp)
                        3b: dir32       .rdata
  3f:   e8 00 00 00 00          call   44 <_main+0x44>
                        40: DISP32      _printf
  44:   a1 00 00 00 00          mov    0x0,%eax
                        45: dir32       __imp__data_export
  49:   8b 00                   mov    (%eax),%eax
  4b:   8d 50 01                lea    0x1(%eax),%edx
  4e:   a1 00 00 00 00          mov    0x0,%eax
                        4f: dir32       __imp__data_export
  53:   89 10                   mov    %edx,(%eax)
  55:   a1 00 00 00 00          mov    0x0,%eax
                        56: dir32       __imp__function_export
  5a:   ff d0                   call   *%eax
  5c:   89 44 24 04             mov    %eax,0x4(%esp)
  60:   c7 04 24 00 00 00 00    movl   $0x0,(%esp)
                        63: dir32       .rdata
  67:   e8 00 00 00 00          call   6c <_main+0x6c>
                        68: DISP32      _printf
  6c:   a1 00 00 00 00          mov    0x0,%eax
                        6d: dir32       __imp__data_export
  71:   8b 00                   mov    (%eax),%eax
  73:   89 44 24 04             mov    %eax,0x4(%esp)
  77:   c7 04 24 00 00 00 00    movl   $0x0,(%esp)
                        7a: dir32       .rdata
  7e:   e8 00 00 00 00          call   83 <_main+0x83>
                        7f: DISP32      _printf
  83:   b8 00 00 00 00          mov    $0x0,%eax
  88:   83 c4 14                add    $0x14,%esp
  8b:   59                      pop    %ecx
  8c:   5d                      pop    %ebp
  8d:   8d 61 fc                lea    -0x4(%ecx),%esp
  90:   c3                      ret
  91:   90                      nop
  92:   90                      nop
  93:   90                      nop

In fact, we can see that the generated code doesn't even use the _function_export symbol, preferring __imp__function_export. Essentially, the code of the _function_export symbol in the import library has been inlined at every use site. This is why using __declspec(dllimport) can improve performance of cross-DLL calls, even though it is entirely optional on function declarations.

We might wonder what happens if we drop the __declspec(dllimport) qualifier on our declarations. Because of our discussion about the difference between data and function imports earlier, you might expect linking to fail. Our test file, main2.c is:

#include <stdio.h>

extern int function_export(void);
extern int data_export;

int main(int argc, char **argv) {
    printf("%d\n", function_export());
    printf("%d\n", data_export);

    data_export++;

    printf("%d\n", function_export());
    printf("%d\n", data_export);

    return 0;
}

Let's try it out:

$ gcc main2.c library.dll.a -o main2 && ./main2

What the hell -- it worked? This is a bit surprising. It works, despite the import library library.dll.a not defining the _data_export symbol, because of a nifty feature of GNU ld called auto-import. Without auto-import the link fails as we would expect:

$ gcc main2.c library.dll.a -o main2 -Wl,--disable-auto-import && ./main2
/tmp/ccGd8Urx.o:main2.c:(.text+0x2c): undefined reference to `_data_export'
/tmp/ccGd8Urx.o:main2.c:(.text+0x41): undefined reference to `_data_export'
/tmp/ccGd8Urx.o:main2.c:(.text+0x49): undefined reference to `_data_export'
/tmp/ccGd8Urx.o:main2.c:(.text+0x63): undefined reference to `_data_export'
collect2: ld returned 1 exit status

The Microsoft linker does not implement auto-import, so this is the error you would get if you were using the Microsoft toolchain.

However, there is a way to write client code that does not depend on auto-import or use the __declspec(dllimport) keyword. Our new client, main3.c is as follows:

#include <stdio.h>

extern int (*_imp__function_export)(void);
extern int *_imp__data_export;

#define function_export (*_imp__function_export)
#define data_export (*_imp__data_export)

int main(int argc, char **argv) {
    printf("%d\n", function_export());
    printf("%d\n", data_export);

    data_export++;

    printf("%d\n", function_export());
    printf("%d\n", data_export);

    return 0;
}

In this code, we directly use the __imp__-prefixed symbols from the import library. These name an address at which the real address of the import can be found, which is reflected by our C-preprocessor definitions of data_export and function_export.

This code compiles perfectly even without auto-import:

$ gcc main3.c library.dll.a -o main3 -Wl,--disable-auto-import && ./main3

If you have followed along until this point you should have a solid understanding of how DLL import and export are implemented on Windows.

How auto-import works

As a bonus, I'm going to explain how auto-import is implemented by the GNU linker. It is a rather cute hack you may get a kick out of.

As a reminder, auto-import is a feature of the linker that allows the programmer to declare an item of DLL-imported data with a simple extern keyword, without having to explicitly use __declspec(dllimport). This is extremely convenient because this is exactly how most *nix source code declares symbols it expects to import from a shared library, so by supporting this use case that *nix code becomes more portable to Windows.

Auto-import kicks in whenever the linker finds an object file making use of a symbol foo which is not defined by any other object in the link, but where a symbol __imp_foo is defined by some object. In this case, it assumes that the use of foo is an attempt to access some DLL-imported data item called foo.

Now, the problem is that the linker needs to replace the use of foo with the address of foo itself. However, all we seem to know statically is an address where that address will be placed at runtime (__imp_foo). To square the circle, the linker plays a clever trick.

The trick is to extend the .idata of the image being created with an entry for a "new" DLL. The new entry is set up as follows:

  • The filename of the image being imported is set to the same filename as the .idata entry covering __imp_foo. So if __imp_foo was being filled out by an address in Bar.dll, our new .idata entry will use Bar.dll here.

  • The import lookup table is of length 1, whose sole entry is a pointer to the name of the imported symbol corresponding to __imp_foo. So if __imp_foo is filled out by the address of the foo export from Bar.dll, the name of the symbol we put in here will be foo.

  • The import address table is of length 1 and -- here is the clever bit -- is located precisely at the location in the object file that was referring to the (undefined) symbol foo.

This solution neatly defers the task of filling out the address that the object file wants to the dynamic linker. The reason that the linker can play this trick is that it can see all of the object code that goes into the final image, and can thus fix all of the sites that need to refer to the imported data.
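To make the trick concrete, here is a toy model of it in Python. It is only a sketch: the "image" is a flat byte buffer, addresses are 32-bit little-endian values, and the DLL name, export address, offsets and function names are all invented for illustration rather than taken from a real PE file or from ld.

```python
import struct

# Hypothetical runtime addresses of the exports of each loaded DLL.
EXPORTS = {"Bar.dll": {"foo": 0x10002000}}


def make_image():
    """Return a blank image and the offset of the operand that referred to foo."""
    image = bytearray(16)
    reference_site = 8  # 4-byte operand that should end up holding the address of foo
    return image, reference_site


def auto_import(reference_site):
    """Model the linker's extra import entry for one use of foo.

    The one-slot import address table is placed exactly on the reference site.
    """
    return {"dll": "Bar.dll", "names": ["foo"], "iat_offset": reference_site}


def load(image, import_entries):
    """Model the dynamic linker: resolve each name and store it in its IAT slot."""
    for entry in import_entries:
        for i, name in enumerate(entry["names"]):
            address = EXPORTS[entry["dll"]][name]
            struct.pack_into("<I", image, entry["iat_offset"] + 4 * i, address)


image, site = make_image()
load(image, [auto_import(site)])
patched = struct.unpack_from("<I", image, site)[0]
print(hex(patched))  # the operand now holds the address of foo
```

The point to notice is that load knows nothing about auto-import: it just fills in import address tables as usual, and the object code gets patched as a side effect.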

Note that in general the final image's .idata will contain several entries for the same DLL: one from the import library, and one for every place in any object file in the link which referred to some data exported by the DLL. Although this is somewhat unusual behaviour, the Windows dynamic linker has no problem with there being several imports of the same DLL.

A wrinkle

Unfortunately, the scheme described above only works if the object code has an undefined reference to foo itself. What if instead it has a reference to foo+N, an address N bytes after the address of foo itself? There is no way to set up the .idata so that the dynamic linker adds a constant to the address it fills in, so we seem to be stuck.

Alas, such relocations are reasonably common, and originate from code that accesses a field of a DLL-imported structure type. Cygwin actually contains another hack to make auto-import work in such cases, known as "pseudo-relocations". If you want to know the details of how these work, there is more information in the original thread on the topic.
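The general shape of a fix-up for the foo+N case can be sketched in the same toy style: remember the constant at link time, let the dynamic linker store the plain address, and add the constant once at program startup. This illustrates the idea only. The record layout, addresses and names below are invented, and it should not be read as a description of Cygwin's actual runtime.

```python
import struct

FOO_ADDRESS = 0x10002000  # hypothetical runtime address of the imported structure foo
FIELD_OFFSET = 8          # the object code wanted foo+8

image = bytearray(16)
site = 4                  # 4-byte operand that should end up holding foo+8

# Link time: the site becomes a one-slot import address table for foo, and the
# constant is recorded separately as an (addend, target offset) pair.
pseudo_relocs = [(FIELD_OFFSET, site)]

# Load time: the dynamic linker can only store the plain address of foo.
struct.pack_into("<I", image, site, FOO_ADDRESS)

# Program startup: a fix-up pass adds each recorded addend in place.
for addend, target in pseudo_relocs:
    (value,) = struct.unpack_from("<I", image, target)
    struct.pack_into("<I", image, target, (value + addend) & 0xFFFFFFFF)

final = struct.unpack_from("<I", image, site)[0]
print(hex(final))  # address of foo plus the field offset
```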


Dynamic linking on Windows is hairier than it at first appears. I hope this article has gone some way towards clearing up the meaning of the mysterious dllimport and dllexport keywords, and towards clarifying the role of the import and export libraries.

Linux and friends implement dynamic linking in a totally different manner to Windows. The scheme they use is more flexible and allows more in-memory sharing of code, but incurs a significant runtime penalty (especially on i386). For more details see here and the Dynamic Linking section of the ELF spec.

Apr 4 2011

Fixing "files could not be moved" error in Boot Camp Assistant

Recently I've been trying to install Windows on an OS X laptop by using Boot Camp. However, every time I tried, the Boot Camp Assistant would tell me that "some files could not be moved" during the creation of the Windows partition. The most commonly suggested solution is a total reinstall of OS X, which I was absolutely not willing to perform.

I read online that this problem could sometimes be solved by using iDefrag in "Compact" mode to manually move all your files to the front of the disk before using the Assistant. However, buying and running this £24 software had absolutely no effect on the problem.

Looking at iDefrag's summary view, it seemed that there was a single unmovable "alternate volume label" block right at the end of the drive which might account for the problem. Anyway, I never got to the bottom of this, since I found an alternate solution that worked: hold down Apple+S during startup to enter single user mode and then run this command:

/sbin/fsck -fy

This just repairs any filesystem errors on your disk. It looks like filesystem errors were the true culprits, not unmovable files, since running the Assistant after this let me create the partition with no problems - Windows is installing right now.

Mar 29 2011

Security implications of PEP 383

I've been looking into improving GHC's support for non-ASCII text, and my investigations have led me to PEP 383.

One motivation behind this PEP is as follows: on Unix, the names of files, command line arguments, and environment variables should probably be treated as sequences of bytes. However, for good reasons it is quite natural for programs to act on them as if they were strings. This means that we have to choose some text encoding to use to interpret those byte sequences.

Unfortunately, whatever encoding you choose to use, it is quite likely that some byte sequences you encounter in practice will not in fact decode nicely using that encoding. An example would be a Big5 filename supplied as a command line argument to a program run in the UTF-8 locale.

In this case, what should happen? One sensible thing to do would be to fail, but this might be surprising. Python 3, with PEP 383, chooses to encode the non-decodable bytes as part of the string using surrogates. So if we try to parse a Big5 filename as a string we get a string full of surrogates representing the raw bytes we had to begin with.

This is a good thing to do because if that string is then immediately fed back into a function that just encodes the filename for use on the file system, the original byte sequence can be exactly reconstituted by encoding the surrogates back into bytes and using the locale encoding for the rest. If the user attempts to do something else with a string containing surrogates (such as displaying it to the terminal), then an exception will be raised.
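A small sketch of that round trip, using the surrogateescape error handler that PEP 383 introduced. I am assuming UTF-8 as the locale encoding here and decoding by hand rather than going through argv, so that the example is self-contained:

```python
# The Big5 bytes for the filename, as they might arrive on the command line.
raw = "你好".encode("big5")

# Decode as UTF-8, escaping each undecodable byte as a lone surrogate.
decoded = raw.decode("utf-8", errors="surrogateescape")
has_surrogates = any(0xDC80 <= ord(ch) <= 0xDCFF for ch in decoded)

# Encoding with the same handler reconstitutes the original bytes exactly.
round_tripped = decoded.encode("utf-8", errors="surrogateescape")

# A strict encode (what happens when you try to display the string) fails.
try:
    decoded.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

print(has_surrogates, round_tripped == raw, strict_ok)
```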

This is a reasonably neat solution to a hard problem. However, it has weird implications. For example, consider this script that uses a black list to control access to some files:

#!/usr/bin/env python3

import sys

file = sys.argv[1]

blacklist = open("blacklist.big5", encoding='big5').read().split()
print("Blacklist is:\n" + repr(blacklist))

if file in blacklist:
    print("Blacklisted file, not allowed!")
else:
    print("OK, I'll let you in!")
    # Not blacklisted: show the contents of the file
    print(open(file).read())

Let's say that the blacklist contains a single entry, for the file 你好 (encoded in Big5, naturally).

Seems simple enough, right? Although I store file names as Big5, I compare Python's Unicode strings. And indeed this program works perfectly when run from a terminal in the Big5 locale, with Big5 file names.

However, consider what happens when the terminal is set to UTF-8 and we invoke the script with the command line argument 你好 (encoded in Big5 of course, because the file name on disk is still Big5 even though we changed the terminal locale). In this case, Python 3 will attempt to decode the file name as UTF-8. Naturally, it will fail, so the Big5 filename will be represented in memory with surrogates.

Now for the punchline: when we come to compare that string (containing surrogates) with the entry from the blacklist (without surrogates) they will not be equal. Yet, when we go on to open the file, the filename (with surrogates) is encoded perfectly back into the original Big5 bytes and hence we get the contents of the blacklisted file.
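The mismatch can be reproduced without touching the terminal locale at all. In this sketch I simulate the two decodings by hand (Big5 for the blacklist file, UTF-8 with surrogateescape for the command line argument); the variable names are just for illustration:

```python
raw_name = "你好".encode("big5")  # the bytes of the file name on disk

# What the script reads from blacklist.big5.
blacklist = [raw_name.decode("big5")]

# What Python would hand us for the argument in a UTF-8 locale.
argv_name = raw_name.decode("utf-8", errors="surrogateescape")

# The membership test misses the entry...
blocked = argv_name in blacklist

# ...but encoding the name for the file system gives back the blacklisted bytes.
bytes_passed_to_os = argv_name.encode("utf-8", errors="surrogateescape")

print(blocked)
print(bytes_passed_to_os == raw_name)
```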

In my opinion, the fact that the current encoding affects the results of string comparisons is deeply weird behaviour and could probably be the cause of subtle security bugs. This is just one reason that I'm wary about adopting PEP 383-like behaviour for GHC.

P.S. For those who believe that my code is broken because you should only compare normalised unicode strings, I will add that even after using unicodedata.normalize to normalise to NFC I get the same problem.

P.P.S. I will further note that you get the same issue even if the blacklist and filename had been in UTF-8, but this time it gets broken from a terminal in the Big5 locale. I didn't show it this way around because I understand that Python 3 may only have just recently started using the locale to decode argv, rather than being hardcoded to UTF-8.

Mar 8 2011

How to build 32/64 bit fat (universal) binaries

The OS X version of the Glasgow Haskell Compiler compiles Haskell into 32-bit code. Unfortunately, this means that if you are on a system where it is the default for libraries to be built in 64-bit mode, you tend to get errors when linking Haskell code telling you that you are trying to link 32-bit code against 64-bit code.

The best solution to this problem is to build all libraries you intend to link to from Haskell code as universal binaries that include both 32-bit and 64-bit versions of the code. These libraries will then work seamlessly with both Haskell code and also when pulled in as part of the build process for non-Haskell 64-bit executables.

If you can install the library using MacPorts, this is easy to do. Instead of doing:

sudo port install mylibrary

Just do:

sudo port install mylibrary +universal

However, if the library you want is not available through MacPorts or the MacPorts version is not up to date you will need to know how to build these universal libraries for yourself. This is the process that I aim to explain in this post. I'm going to use igraph as my example library because it's what I needed to install (I needed to install the unreleased v0.6).

The easy method

If you are lucky, building a universal library is as simple as changing how you invoke make. Run the library's configure scripts etc as usual, and then invoke make as follows:

make CXXFLAGS="-arch i386 -arch x86_64" CFLAGS="-arch i386 -arch x86_64" LDFLAGS="-arch i386 -arch x86_64"

The -arch flags tell GCC and the linker to build and link both versions of the library. If this works, you are done. In the case of igraph, this wasn't quite enough - the above command failed with this error:

gcc-4.2: -E, -S, -save-temps and -M options are not allowed with multiple -arch flags

This occurs because igraph invokes GCC with the -M series of flags that generate makefile dependency rules from the C code - but GCC doesn't like generating those rules for two architectures simultaneously. Luckily, there was an easy workaround in my case - I just needed to reconfigure igraph as follows:

./configure --disable-dependency-tracking

The --disable-dependency-tracking flag just stops Automake from determining the dependencies of each C file as it compiles it. It is totally harmless to disable this because that dependency information is only used in order to rebuild less stuff upon subsequent invocations of make - the worst that happens when you disable it is that if you make more than once you will have to wait a bit longer. For more information on this feature see also the relevant section of the Automake manual.

After reconfiguring in this manner, the original make invocation worked correctly for igraph.

The hard method

The above method may perhaps fail for some libraries, in which case you can use this more arduous manual method. The idea is to run the library's build process from scratch twice: once to get the 32-bit library and once for the 64-bit library. We can then use lipo to glue together the build artifacts from the two runs.

We start by building the 32-bit version:

make clean
make CXXFLAGS=-m32 CFLAGS=-m32 LDFLAGS=-m32 -j12

We now need to store the 32-bit build artifacts somewhere. Exactly which files you have to save will vary according to the library you are building, but for igraph this was sufficient:

mkdir -p ~/Junk/32 ~/Junk/64
cp src/.libs/libigraph.{a,0.dylib} ~/Junk/32

Now do the 64-bit build and once again save the artifacts somewhere:

make clean
make CXXFLAGS=-m64 CFLAGS=-m64 LDFLAGS=-m64 -j12
cp src/.libs/libigraph.{a,0.dylib} ~/Junk/64

Finally we can use lipo to finish up:

lipo -create ~/Junk/{32,64}/libigraph.a -output src/.libs/libigraph.a
lipo -create ~/Junk/{32,64}/libigraph.0.dylib -output src/.libs/libigraph.0.dylib

At this point, you can do sudo make install and get a universal version of the library installed.

If you want to check that your libraries are indeed universal, you can use lipo -info:

$ lipo -info src/.libs/libigraph.a
Architectures in the fat file: src/.libs/libigraph.a are: i386 x86_64
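If you are curious what lipo -info is looking at, a fat file starts with a small big-endian header listing one entry per architecture. Here is a minimal Python sketch that reads that header. It only knows the two CPU types relevant to this post, ignores the 64-bit fat header variant, and the function name is my own invention; the example input is a synthetic header rather than a real library.

```python
import struct

FAT_MAGIC = 0xCAFEBABE
# CPU type constants from the Mach-O headers; only the two used in this post.
CPU_NAMES = {0x00000007: "i386", 0x01000007: "x86_64"}


def fat_architectures(data):
    """Return the architecture names listed in a fat file's header."""
    magic, count = struct.unpack_from(">II", data, 0)
    if magic != FAT_MAGIC:
        raise ValueError("not a fat file")
    archs = []
    for i in range(count):
        # Each entry is: cputype, cpusubtype, offset, size, align (5 x 32 bits).
        cputype, _subtype, _offset, _size, _align = struct.unpack_from(
            ">IIIII", data, 8 + 20 * i
        )
        archs.append(CPU_NAMES.get(cputype, hex(cputype)))
    return archs


# A synthetic header describing two slices, standing in for a real library.
header = struct.pack(">II", FAT_MAGIC, 2)
header += struct.pack(">IIIII", 0x00000007, 3, 4096, 100, 12)
header += struct.pack(">IIIII", 0x01000007, 3, 8192, 100, 12)

archs = fat_architectures(header)
print(archs)
```

For a real file you would pass in the bytes read from it, e.g. the contents of src/.libs/libigraph.a opened in binary mode.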


Building universal 32-bit/64-bit binaries is apparently fairly straightforward but it was tricky to find documentation for the process. I hope this article helps others who need to get this done.