Sep 30 2011

GHC-specific Alias Analysis for LLVM

The setup

A few years ago, David Terei did some great work adding a LLVM backend to the Glasgow Haskell Compiler. The idea with this is that instead of writing our own optimiser and assembly-code generators for our custom three-address-code, we can just translate into LLVM IR and have LLVM do the heavy lifting. In theory, this means that GHC will be able to compile for many different CPUs, and will benefit from the smart optimisations the LLVM team have implemented.

The portability part has definitely worked out for us: for example, a couple of people have successfully got GHC to compile for the ARM by using the LLVM backend. However, the promise of LLVM being able to speed up our generated code has never really been fully borne out. LLVM-generated code does tend to be better than that produced by GHCs own backends, but this is mostly because LLVM is doing much better register allocation (it is much smarter about reusing the “pinned registers” required that form part of the interface between GHC’s generated code and the garbage collector).

The reason that LLVM does not optimise as much as we would like is often to do with aliasing. In particular, LLVM conservatively assumes that GHC’s stack (which is explicitly represented in the generated code as an array of words) and the heap may alias.

What’s the problem?

A concrete example of this is the following Haskell program:

module Main(main) where

import Data.Array.Base
import Data.Array.IO
import Data.Array.MArray

main :: IO ()
main = do
    arr <- newArray_ (0, 200)
    go arr 2 0 100

go :: IOUArray Int Int -> Int -> Int -> Int -> IO ()
go arr stride x y | x < y     = do unsafeWrite arr (x * stride) 1337
                                   go arr stride (x + 1) y
                  | otherwise = return ()

This loop compiles to fairly good Core:

Main.main_$s$wa =
  \ (@ sg0_sKA::Data.Array.Base.STUArray
                  GHC.Prim.RealWorld GHC.Types.Int GHC.Types.Int
                  ~
                Data.Array.IO.Internals.IOUArray GHC.Types.Int GHC.Types.Int)
    (sc_sKs :: GHC.Prim.State# GHC.Prim.RealWorld)
    (sc1_sKt :: GHC.Prim.Int#)
    (sc2_sKu :: GHC.Prim.Int#)
    (sc3_sKv :: GHC.Prim.Int#)
    (sc4_sKw :: GHC.Types.Int)
    (sc5_sKx :: GHC.Types.Int)
    (sc6_sKy :: GHC.Types.Int)
    (sc7_sKz :: GHC.Prim.MutableByteArray# GHC.Prim.RealWorld) ->
    case GHC.Prim.<# sc2_sKu sc1_sKt of _ {
      GHC.Bool.False -> (# sc_sKs, GHC.Unit.() #);
      GHC.Bool.True ->
        case GHC.Prim.writeIntArray#
               @ GHC.Prim.RealWorld
               sc7_sKz
               (GHC.Prim.*# sc2_sKu sc3_sKv)
               1337
               sc_sKs
        of s2#_aHo { __DEFAULT ->
        Main.main_$s$wa
          @ (sym
               Data.Array.IO.Internals.NTCo:IOUArray GHC.Types.Int GHC.Types.Int)
          s2#_aHo
          sc1_sKt
          (GHC.Prim.+# sc2_sKu 1)
          sc3_sKv
          sc4_sKw
          sc5_sKx
          sc6_sKy
          sc7_sKz
        }
    }

One weird thing about this Core is that it passes around a number of dead arguments (sc4_sKw, sc5_sKx and sc6_sKy). This is a known bug in GHC, and is caused by a phase ordering problem. However, this particular infelicity should not prevent LLVM from being able to do the classic loop optimisation of strength reduction on our code.

The particular strength reduction we would like to perform si to replace the multiplication GHC.Prim.*# sc2_sKu sc3_sKv in the main_$s$wa loop with an addition. This is possible because the left operand sc2_sKu is a loop induction variable, increasing by 1 every iteration. Thus, on every iteration the value of the multiplication GHC.Prim.*# sc2_sKu sc3_sKv is just the value of the multiplication on the previous loop, plus sc3_sKv. Thus, by adding a loop variable that records the value of the multiplication on the previous iteration, we can replace the multiplication by an addition.

Unfortunately, LLVM currently can’t strength-reduce this loop in the suggested way. As we will soon see, this is due to aliasing.

Why does the problem happen?

We can immediately see the problem if we look at the optimised LLVM code for this loop:

c1TW.lr.ph:
  ...
  %ln1TL1 = load i64* %Sp_Arg, align 8
  ...

c1TW:                                             ; preds = %c1TW.lr.ph, %c1TW
  %ln1TL4 = phi i64 [ %ln1TL1, %c1TW.lr.ph ], [ %ln1UF, %c1TW ]
  %ln1Uy = mul i64 %ln1Uu, %ln1TL4
  %ln1Uz = add i64 %ln1Uw, %ln1Uy
  %ln1UA = inttoptr i64 %ln1Uz to i64*
  store i64 1337, i64* %ln1UA, align 8
  %ln1UE = load i64* %Sp_Arg, align 8
  %ln1UF = add i64 %ln1UE, 1
  store i64 %ln1UF, i64* %Sp_Arg, align 8
  %ln1TP = load i64* %ln1TN, align 8
  %ln1TQ = icmp slt i64 %ln1UF, %ln1TP
  br i1 %ln1TQ, label %c1TW, label %n1TX.loopexit

The strength reduction optimisation depends on one of the operands to the multiplication being a loop induction variable. In our case, we expect that sc2_sKu will be such a variable. However, looking at the LLVM code we can see that the equivalent LLVM variable, %ln1TL4, has its induction-ness hidden because it is reloaded from the stack by load i64* %Sp_Arg on every iteration.

You might wonder why the store to the same stack location by store i64 %ln1UF, i64* %Sp_Arg is not forwarded to this load by LLVM. If this were to happen, we could get code like this:

c1TW.lr.ph:
  ...
  %ln1TL1 = load i64* %Sp_Arg, align 8
  %ln1UE.ph = load i64* %Sp_Arg, align 8
  ...

c1TW:                                             ; preds = %c1TW.lr.ph, %c1TW
  %ln1TL4 = phi i64 [ %ln1TL1, %c1TW.lr.ph ], [ %ln1UF, %c1TW ]
  %ln1UE = phi i64 [ %ln1UE.ph, %c1TW.lr.ph ], [ %ln1UF, %c1TW ]
  %ln1Uy = mul i64 %ln1Uu, %ln1TL4
  %ln1Uz = add i64 %ln1Uw, %ln1Uy
  %ln1UA = inttoptr i64 %ln1Uz to i64*
  store i64 1337, i64* %ln1UA, align 8
  %ln1UF = add i64 %ln1UE, 1
  store i64 %ln1UF, i64* %Sp_Arg, align 8
  %ln1TP = load i64* %ln1TN, align 8
  %ln1TQ = icmp slt i64 %ln1UF, %ln1TP
  br i1 %ln1TQ, label %c1TW, label %n1TX.loopexit

In this code the fact that %ln1UE is an induction variable is obvious, and not obscured by an intermediate load from memory. And indeed, LLVM is able to strength-reduce this loop!

The reason that LLVM does not forward this load is because in general it is unsafe, since the store to %ln1UA might alias it if %ln1UA were equal to %Sp_Arg. The ridiculous thing about this is that we know that in the code generated by GHC, the stack pointer will never be stored away anywhere, so it can’t possible alias with the unknown pointer %ln1UA and LLVM is being unnecessarily conservative.

The solution

LLVM is a beautiful bit of software, and it provides exactly the extensibility point we require to resolve this problem: we can write our own alias analysis pass that knows that GHC’s stack never alias with any another non-stack pointer and dynamically load it into the LLVM optimisation tool chain.

This is exactly what I’ve done. The code is available as a Gist, and interested parties (who use OS X!) can build it like so:

g++ -D__STDC_LIMIT_MACROS -D__STDC_CONSTANT_MACROS -fno-exceptions -fno-rtti -fno-common -Wall \
-Wl,-flat_namespace -dynamiclib GHCAliasAnalysis.cpp -o GHCAliasAnalysis.dylib -lLLVM-`llvm-config --version`

Once built, we can dynamically load the resulting dylib into LLVMs opt tool using the -load option, and then use the new -ghc-aa flag to tell LLVM to use our alias analyser as a complement to the default one. Unfortunately, due to an infelicity in LLVM, we have to specify -ghc-aa in between every single optimisation pass if we want to be sure that it is used. So the final command line to opt, including all passes done by the standard -O2 optimisation level, and the -loop-reduce strength-reduction pass, needs to look something like this:

opt -load GHCAliasAnalysis.dylib -S -no-aa -tbaa -basicaa -ghc-aa \
-globalopt -ghc-aa -ghc-aa -ipsccp -ghc-aa -deadargelim -ghc-aa -instcombine -ghc-aa -simplifycfg \
-ghc-aa -basiccg -ghc-aa -prune-eh -ghc-aa -inline -ghc-aa -functionattrs -ghc-aa -scalarrepl-ssa \
-ghc-aa -domtree -ghc-aa -early-cse -ghc-aa -simplify-libcalls -ghc-aa -lazy-value-info -ghc-aa \
-jump-threading -ghc-aa -correlated-propagation -ghc-aa -simplifycfg -ghc-aa -instcombine -ghc-aa \
-tailcallelim -ghc-aa -simplifycfg -ghc-aa -reassociate -ghc-aa -domtree -ghc-aa -loops -ghc-aa \
-loop-simplify -ghc-aa -lcssa -ghc-aa -loop-rotate -ghc-aa -licm -ghc-aa -lcssa -ghc-aa -loop-unswitch \
-ghc-aa -instcombine -ghc-aa -scalar-evolution -ghc-aa -loop-simplify -ghc-aa -lcssa -ghc-aa -indvars \
-ghc-aa -loop-idiom -ghc-aa -loop-deletion -ghc-aa -loop-unroll -ghc-aa -memdep -ghc-aa -gvn -ghc-aa \
-memdep -ghc-aa -memcpyopt -ghc-aa -sccp -ghc-aa -instcombine -ghc-aa -lazy-value-info -ghc-aa \
-jump-threading -ghc-aa -correlated-propagation -ghc-aa -domtree -ghc-aa -memdep -ghc-aa -dse \
-ghc-aa -adce -ghc-aa -simplifycfg -ghc-aa -instcombine -ghc-aa -strip-dead-prototypes -ghc-aa \
-constmerge -loop-reduce

(Yes, I know this is ridiculous! I hope the LLVM developers fix this soon.)

With my new alias analysis pass, LLVM is able to produce the following beautiful code for the loop:

c1TW:                                             ; preds = %c1TW, %c1TW.lr.ph
  %lsr.iv = phi i64 [ %lsr.iv.next, %c1TW ], [ %5, %c1TW.lr.ph ]
  %ln1UF1 = phi i64 [ %ln1TL1, %c1TW.lr.ph ], [ %ln1UF, %c1TW ]
  %ln1UA = inttoptr i64 %lsr.iv to i64*
  store i64 1337, i64* %ln1UA, align 8
  %ln1UF = add i64 %ln1UF1, 1
  %lsr.iv.next = add i64 %lsr.iv, %6
  %ln1TQ = icmp slt i64 %ln1UF, %ln1TP2
  br i1 %ln1TQ, label %c1TW, label %n1TX.loopexit

Note that the original loop contained a store and two loads, but the optimised loop contains only a single store: our new alias analysis has allowed the loads to be floated out of the loop. This has in turn allowed LLVM to discover the loop induction variable and apply strength reduction - note that the final loop never uses the multiplication instruction!

The final program runs 8.8% faster than the version that is compiled without the custom alias analysis.

Conclusions

My custom alias analyser for GHC-generated code gives LLVM much more room for applying its existing powerful optimisation. There is plenty of scope for improvement, though:

  1. I’d really like people to report their experiences using with this alias analyser and the LLVM backend. Do you see a big speed boost on your data-parallel Haskell programs, for example?

  2. Of course, I would like this alias analyser to included with GHC so you can all seamlessly benefit from it. I’ll be working with GHC HQ to make this happen.

  3. I think there is still scope for getting even more useful information about GHC-generated code into LLVM. For example, currently LLVM is unable to eliminate stores to stack locations that we can see will never be accessed because we make a tail call to another function with a stack pointer that points above these locations. I can think of at least two ways to express this to LLVM, and this would produce another nice gain.

    If would also be great if we could teach LLVM something about the garbage collector, as currently if your loop does any allocation at all the presence of calls to the GC pessimises the output code a lot.

I was partly inspired to do this by Ben Lippmeier, whose paper at the Haskell Symposium this year had to do strength-reduction manually at the Haskell level because LLVM wasn’t working for him. I hope I’ve fixed that issue.

Performance problems were also a focus of the discussions about the future of Haskell at ICFP. I’ve been to these discussions three years in a row, and several topics keep cropping back up: performance, and the fact that Hackage 2.0 still isn’t released. I’ve grown tired of hearing so much talk about the issues with little-to-no action to resolve them, so I spent this post-ICFP week doing my best to fix them. I first wrote a documentation build bot for the Hackage 2.0 effort, and then moved on to the LLVM performance issues - if everyone helps to move these issues along then hopefully we can finally talk about some different problems next year!


Jul 4 2011

Everything You Never Wanted To Know About DLLs

I've recently had cause to investigate how dynamic linking is implemented on Windows. This post is basically a brain dump of everything I've learnt on the issue. This is mostly for my future reference, but I hope it will be useful to others too as I'm going to bring together lots of information you would otherwise have to hunt around for.

Without further ado, here we go:

Export and import directories

The Windows executable loader is responsible for doing all dynamic loading and symbol resolution before running the code. The linker works out what functions are exported or imported by each image (an image is a DLL or EXE file) by inspecting the .edata and .idata sections of those images, respectively.

The contents of these sections is covered in detail by the PE/COFF specification.

The .edata section

This section records the exports of the image (yes, EXEs can export things). This takes the form of:

  • The export address table: an array of length N holding the addresses of the exported functions/data (the addresses are stored relative to the image base). Indexes into this table are called ordinals.
  • The export name pointer table: an array of length M holding pointers to strings that represent the name of an export. This array is lexically ordered by name, to allow binary searches for a given export.
  • The export ordinal table: a parallel array of length M holding the ordinal of the corresponding name in the export name pointer table.

(As an alternative to importing an image's export by its name, it is possible to import by specifying an ordinal. Importing by ordinal is slightly faster at runtime because the dynamic linker doesn't have to do a lookup. Furthermore, if the import is not given a name by the exporting DLL, importing by ordinal is the only way to do the import.)

How does the .edata section get created in the first place? There are two main methods:

  1. Most commonly, they start life in the object files created by compiling some source code that defines a function/some data that was declared with the __declspec(dllimport) modifier. The compiler just emits an appropriate .edata section naming these exports.

  2. Less commonly, the programmer might write a .def file specifying which functions they would like to export. By supplying this to dlltool --output-exp, an export file can be generated. An export file is just an object file which only contains a .edata section, exporting (via some unresolved references that will be filled in by the linker in the usual way) the symbols named in the .def file. This export library must be named by the programmer when he comes to link together his object files into a DLL.

In both these cases, the linker collects the .edata sections from all objects named on the link line to build the .edata for the overall image file. One last possible way that the .edata can be created is by the linker itself, without having to put .edata into any object files:

  1. The linker could choose to export all symbols defined by object files named on the link line. For example, this is the default behaviour of GNU ld (the behaviour can also be explicitly asked for using –-export-all-symbols). In this case, the linker generates the .edata section itself. (GNU ld also supports specifying a .def file on the command line, in which case the generated section will export just those things named by the .def).

The .idata section

The .idata section records those things that the image imports. It consists of:

  • For every image from which symbols are imported:

    • The filename of the image. Used by the dynamic linker to locate it on disk.

    • The import lookup table: an array of length N, which each entry is either an ordinal or a pointer to a string representing the name to import.

    • The import address table: an array of N pointers. The dynamic linker is responsible for filling out this array with the address of the function/data named by the corresponding symbol in the import lookup table.

The ways in which .idata entries are created are as follows:

  1. Most commonly, they originate in a library of object files called an import library'. This import library can be created by usingdlltool` on the DLL you wish to export or a .def file of the type we discussed earlier. Just like the export library, the import library must be named by the user on the link line.

  2. Alternatively, some linkers (like GNU ld) let you specify a DLL directly on the link line. The linker will automatically generate .idata entries for any symbols that you must import from the DLL.

Notice that unlike the case when we were exporting symbols, __declspec(dllimport) does not cause .idata sections to be generated.

Import libraries are a bit more complicated than they first appear. The Windows dynamic loader fills the import address table with the addresses of the imported symbols (say, the address of a function Func). However, when the assembly code in other object files says call Func they expect that Func to name the address of that code. But we don't know that address until runtime: the only thing we know statically is the address where that address will be placed by the dynamic linker. We will call this address __imp__Func.

To deal with this extra level of indirection, the import library exports a function Func that just dereferences __imp__Func (to get the actual function pointer) and then jmps to it. All of the other object files in the project can now say call Func just as they would if Func had been defined in some other object file, rather than a DLL. For this reason, saying __declspec(dllimport) in the declaration of a dynamically linked function is optional (though in fact you will get slightly more efficient code if you add them, as we will see later).

Unfortunately, there is no equivalent trick if you want to import data from another DLL. If we have some imported data myData, there is no way the import library can be defined so that a mov $eax, myData in an object file linked against it writes to the storage for myData in that DLL. Instead, the import library defines a symbol __imp__myData that resolves to the address at which the linked-in address of the storage can be found. The compiler then ensures that when you read or write from a variable defined with __declspec(dllimport) those reads and writes go through the __imp_myData indirection. Because different code needs to be generated at the use site, __declspec declarations on data imports are not optional.

Practical example

Theory is all very well but it can be helpful to see all the pieces in play.

Building a DLL

First, lets build a simple DLL exporting both functions and data. For maximum clarity, we'll use an explicit export library rather instead of decorating our functions with declspec(dllexport) or supply a .def file to the linker.

First lets write the .def file, library.def:

LIBRARY library
EXPORTS
   function_export
   data_export      DATA

(The DATA keyword and LIBRARY line only affects how the import library is generated, as explained later on. Ignore them for now.)

Build an export file from that:

$ dlltool --output-exp library_exports.o -d library.def

The resulting object basically just contains an .edata section that exports the symbols _data_export and _function_export under the names data_export and function_export respectively:

$ objdump -xs library_exports.o

...

There is an export table in .edata at 0x0

The Export Tables (interpreted .edata section contents)

Export Flags                    0
Time/Date stamp                 4e10e5c1
Major/Minor                     0/0
Name                            00000028 library_exports.o.dll
Ordinal Base                    1
Number in:
        Export Address Table            00000002
        [Name Pointer/Ordinal] Table    00000002
Table Addresses
        Export Address Table            00000040
        Name Pointer Table              00000048
        Ordinal Table                   00000050

Export Address Table -- Ordinal Base 1

[Ordinal/Name Pointer] Table
        [   0] data_export
        [   1] function_export

Sections:
Idx Name          Size      VMA       LMA       File off  Algn
  0 .text         00000000  00000000  00000000  00000000  2**2
                  ALLOC, LOAD, READONLY, CODE
  1 .data         00000000  00000000  00000000  00000000  2**2
                  ALLOC, LOAD, DATA
  2 .bss          00000000  00000000  00000000  00000000  2**2
                  ALLOC
  3 .edata        00000070  00000000  00000000  000000b4  2**2
                  CONTENTS, ALLOC, LOAD, RELOC, READONLY, DATA
SYMBOL TABLE:
[  0](sec -2)(fl 0x00)(ty   0)(scl 103) (nx 1) 0x00000000 fake
File
[  2](sec  4)(fl 0x00)(ty   0)(scl   3) (nx 0) 0x00000028 name
[  3](sec  4)(fl 0x00)(ty   0)(scl   3) (nx 0) 0x00000040 afuncs
[  4](sec  4)(fl 0x00)(ty   0)(scl   3) (nx 0) 0x00000048 anames
[  5](sec  4)(fl 0x00)(ty   0)(scl   3) (nx 0) 0x00000050 anords
[  6](sec  4)(fl 0x00)(ty   0)(scl   3) (nx 0) 0x00000054 n1
[  7](sec  4)(fl 0x00)(ty   0)(scl   3) (nx 0) 0x00000060 n2
[  8](sec  1)(fl 0x00)(ty   0)(scl   3) (nx 1) 0x00000000 .text
AUX scnlen 0x0 nreloc 0 nlnno 0
[ 10](sec  2)(fl 0x00)(ty   0)(scl   3) (nx 1) 0x00000000 .data
AUX scnlen 0x0 nreloc 0 nlnno 0
[ 12](sec  3)(fl 0x00)(ty   0)(scl   3) (nx 1) 0x00000000 .bss
AUX scnlen 0x0 nreloc 0 nlnno 0
[ 14](sec  4)(fl 0x00)(ty   0)(scl   3) (nx 1) 0x00000000 .edata
AUX scnlen 0x70 nreloc 8 nlnno 0
[ 16](sec  0)(fl 0x00)(ty   0)(scl   2) (nx 0) 0x00000000 _data_export
[ 17](sec  0)(fl 0x00)(ty   0)(scl   2) (nx 0) 0x00000000 _function_export


RELOCATION RECORDS FOR [.edata]:
OFFSET   TYPE              VALUE
0000000c rva32             .edata
0000001c rva32             .edata
00000020 rva32             .edata
00000024 rva32             .edata
00000040 rva32             _data_export
00000044 rva32             _function_export
00000048 rva32             .edata
0000004c rva32             .edata


Contents of section .edata:
 0000 00000000 c1e5104e 00000000 28000000  .......N....(...
 0010 01000000 02000000 02000000 40000000  ............@...
 0020 48000000 50000000 6c696272 6172795f  H...P...library_
 0030 6578706f 7274732e 6f2e646c 6c000000  exports.o.dll...
 0040 00000000 00000000 54000000 60000000  ........T...`...
 0050 00000100 64617461 5f657870 6f727400  ....data_export.
 0060 66756e63 74696f6e 5f657870 6f727400  function_export.

We'll fulfil these symbol with a trivial implementation of the DLL, library.c:

int data_export = 42;

int function_export() {
    return 1337 + data_export;
}

We can put it together into a DLL:

$ gcc -shared -o library.dll library.c library_exports.o

The export table for the DLL is as follows, showing that we have exported what we wanted:

The Export Tables (interpreted .edata section contents)

Export Flags                    0
Time/Date stamp                 4e10e5c1
Major/Minor                     0/0
Name                            00005028 library_exports.o.dll
Ordinal Base                    1
Number in:
        Export Address Table            00000002
        [Name Pointer/Ordinal] Table    00000002
Table Addresses
        Export Address Table            00005040
        Name Pointer Table              00005048
        Ordinal Table                   00005050

Export Address Table -- Ordinal Base 1
        [   0] +base[   1] 200c Export RVA
        [   1] +base[   2] 10f0 Export RVA

[Ordinal/Name Pointer] Table
        [   0] data_export
        [   1] function_export

Using the DLL

When we come to look at using the DLL, things become a lot more interesting. First, we need an import library:

$ dlltool --output-lib library.dll.a -d library.def

(The reason that we have an import library but an export object is because using a library for the imports allows the linker to discard .idata for any imports that are not used. Contrariwise ,he linker can never discard any .edata entry because any export may potentially be used by a user of the DLL).

This import library is rather complex. It contains one object for each export (disds00000.o and disds00001.o) but also two other object files (distdt.o and disdh.o) that set up the header and footer of the import list. (The header of the import list contains, among other things, the name of the DLL to link in at runtime, as derived from the LIBRARY line of the .def file.)


$ objdump -xs library.dll.a In archive library.dll.a: disdt.o: file format pe-i386 ... Sections: Idx Name Size VMA LMA File off Algn 0 .text 00000000 00000000 00000000 00000000 2**2 ALLOC, LOAD, READONLY, CODE 1 .data 00000000 00000000 00000000 00000000 2**2 ALLOC, LOAD, DATA 2 .bss 00000000 00000000 00000000 00000000 2**2 ALLOC 3 .idata$4 00000004 00000000 00000000 00000104 2**2 CONTENTS, ALLOC, LOAD, DATA 4 .idata$5 00000004 00000000 00000000 00000108 2**2 CONTENTS, ALLOC, LOAD, DATA 5 .idata$7 0000000c 00000000 00000000 0000010c 2**2 CONTENTS, ALLOC, LOAD, DATA SYMBOL TABLE: [ 0](sec -2)(fl 0x00)(ty 0)(scl 103) (nx 1) 0x00000000 fake File [ 2](sec 1)(fl 0x00)(ty 0)(scl 3) (nx 1) 0x00000000 .text AUX scnlen 0x0 nreloc 0 nlnno 0 [ 4](sec 2)(fl 0x00)(ty 0)(scl 3) (nx 1) 0x00000000 .data AUX scnlen 0x0 nreloc 0 nlnno 0 [ 6](sec 3)(fl 0x00)(ty 0)(scl 3) (nx 1) 0x00000000 .bss AUX scnlen 0x0 nreloc 0 nlnno 0 [ 8](sec 4)(fl 0x00)(ty 0)(scl 3) (nx 1) 0x00000000 .idata$4 AUX scnlen 0x4 nreloc 0 nlnno 0 [ 10](sec 5)(fl 0x00)(ty 0)(scl 3) (nx 1) 0x00000000 .idata$5 AUX scnlen 0x4 nreloc 0 nlnno 0 [ 12](sec 6)(fl 0x00)(ty 0)(scl 3) (nx 1) 0x00000000 .idata$7 AUX scnlen 0x7 nreloc 0 nlnno 0 [ 14](sec 6)(fl 0x00)(ty 0)(scl 2) (nx 0) 0x00000000 __library_dll_a_iname Contents of section .idata$4: 0000 00000000 .... Contents of section .idata$5: 0000 00000000 .... Contents of section .idata$7: 0000 6c696272 6172792e 646c6c00 library.dll. disdh.o: file format pe-i386 ... Sections: Idx Name Size VMA LMA File off Algn 0 .text 00000000 00000000 00000000 00000000 2**2 ALLOC, LOAD, READONLY, CODE 1 .data 00000000 00000000 00000000 00000000 2**2 ALLOC, LOAD, DATA 2 .bss 00000000 00000000 00000000 00000000 2**2 ALLOC 3 .idata$2 00000014 00000000 00000000 00000104 2**2 CONTENTS, ALLOC, LOAD, RELOC, DATA 4 .idata$5 00000000 00000000 00000000 00000000 2**2 ALLOC, LOAD, DATA 5 .idata$4 00000000 00000000 00000000 00000000 2**2 ALLOC, LOAD, DATA SYMBOL TABLE: [ 0](sec -2)(fl 0x00)(ty 0)(scl 103) (nx 1) 0x00000000 fake File [ 2](sec 6)(fl 0x00)(ty 0)(scl 3) (nx 0) 0x00000000 hname [ 3](sec 5)(fl 0x00)(ty 0)(scl 3) (nx 0) 0x00000000 fthunk [ 4](sec 1)(fl 0x00)(ty 0)(scl 3) (nx 1) 0x00000000 .text AUX scnlen 0x0 nreloc 0 nlnno 0 [ 6](sec 2)(fl 0x00)(ty 0)(scl 3) (nx 1) 0x00000000 .data AUX scnlen 0x0 nreloc 0 nlnno 0 [ 8](sec 3)(fl 0x00)(ty 0)(scl 3) (nx 1) 0x00000000 .bss AUX scnlen 0x0 nreloc 0 nlnno 0 [ 10](sec 4)(fl 0x00)(ty 0)(scl 3) (nx 1) 0x00000000 .idata$2 AUX scnlen 0x14 nreloc 3 nlnno 0 [ 12](sec 6)(fl 0x00)(ty 0)(scl 3) (nx 0) 0x00000000 .idata$4 [ 13](sec 5)(fl 0x00)(ty 0)(scl 3) (nx 0) 0x00000000 .idata$5 [ 14](sec 4)(fl 0x00)(ty 0)(scl 2) (nx 0) 0x00000000 __head_library_dll_a [ 15](sec 0)(fl 0x00)(ty 0)(scl 2) (nx 0) 0x00000000 __library_dll_a_iname RELOCATION RECORDS FOR [.idata$2]: OFFSET TYPE VALUE 00000000 rva32 .idata$4 0000000c rva32 __library_dll_a_iname 00000010 rva32 .idata$5 Contents of section .idata$2: 0000 00000000 00000000 00000000 00000000 ................ 0010 00000000 .... disds00001.o: file format pe-i386 ... Sections: Idx Name Size VMA LMA File off Algn 0 .text 00000008 00000000 00000000 0000012c 2**2 CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE 1 .data 00000000 00000000 00000000 00000000 2**2 ALLOC, LOAD, DATA 2 .bss 00000000 00000000 00000000 00000000 2**2 ALLOC 3 .idata$7 00000004 00000000 00000000 00000134 2**2 CONTENTS, RELOC 4 .idata$5 00000004 00000000 00000000 00000138 2**2 CONTENTS, RELOC 5 .idata$4 00000004 00000000 00000000 0000013c 2**2 CONTENTS, RELOC 6 .idata$6 00000012 00000000 00000000 00000140 2**1 CONTENTS SYMBOL TABLE: [ 0](sec 1)(fl 0x00)(ty 0)(scl 3) (nx 0) 0x00000000 .text [ 1](sec 2)(fl 0x00)(ty 0)(scl 3) (nx 0) 0x00000000 .data [ 2](sec 3)(fl 0x00)(ty 0)(scl 3) (nx 0) 0x00000000 .bss [ 3](sec 4)(fl 0x00)(ty 0)(scl 3) (nx 0) 0x00000000 .idata$7 [ 4](sec 5)(fl 0x00)(ty 0)(scl 3) (nx 0) 0x00000000 .idata$5 [ 5](sec 6)(fl 0x00)(ty 0)(scl 3) (nx 0) 0x00000000 .idata$4 [ 6](sec 7)(fl 0x00)(ty 0)(scl 3) (nx 0) 0x00000000 .idata$6 [ 7](sec 1)(fl 0x00)(ty 0)(scl 2) (nx 0) 0x00000000 _function_export [ 8](sec 5)(fl 0x00)(ty 0)(scl 2) (nx 0) 0x00000000 __imp__function_export [ 9](sec 0)(fl 0x00)(ty 0)(scl 2) (nx 0) 0x00000000 __head_library_dll_a RELOCATION RECORDS FOR [.text]: OFFSET TYPE VALUE 00000002 dir32 .idata$5 RELOCATION RECORDS FOR [.idata$7]: OFFSET TYPE VALUE 00000000 rva32 __head_library_dll_a RELOCATION RECORDS FOR [.idata$5]: OFFSET TYPE VALUE 00000000 rva32 .idata$6 RELOCATION RECORDS FOR [.idata$4]: OFFSET TYPE VALUE 00000000 rva32 .idata$6 Contents of section .text: 0000 ff250000 00009090 .%...... Contents of section .idata$7: 0000 00000000 .... Contents of section .idata$5: 0000 00000000 .... Contents of section .idata$4: 0000 00000000 .... Contents of section .idata$6: 0000 01006675 6e637469 6f6e5f65 78706f72 ..function_expor 0010 7400 t. disds00000.o: file format pe-i386 ... Sections: Idx Name Size VMA LMA File off Algn 0 .text 00000000 00000000 00000000 00000000 2**2 ALLOC, LOAD, READONLY, CODE 1 .data 00000000 00000000 00000000 00000000 2**2 ALLOC, LOAD, DATA 2 .bss 00000000 00000000 00000000 00000000 2**2 ALLOC 3 .idata$7 00000004 00000000 00000000 0000012c 2**2 CONTENTS, RELOC 4 .idata$5 00000004 00000000 00000000 00000130 2**2 CONTENTS, RELOC 5 .idata$4 00000004 00000000 00000000 00000134 2**2 CONTENTS, RELOC 6 .idata$6 0000000e 00000000 00000000 00000138 2**1 CONTENTS SYMBOL TABLE: [ 0](sec 1)(fl 0x00)(ty 0)(scl 3) (nx 0) 0x00000000 .text [ 1](sec 2)(fl 0x00)(ty 0)(scl 3) (nx 0) 0x00000000 .data [ 2](sec 3)(fl 0x00)(ty 0)(scl 3) (nx 0) 0x00000000 .bss [ 3](sec 4)(fl 0x00)(ty 0)(scl 3) (nx 0) 0x00000000 .idata$7 [ 4](sec 5)(fl 0x00)(ty 0)(scl 3) (nx 0) 0x00000000 .idata$5 [ 5](sec 6)(fl 0x00)(ty 0)(scl 3) (nx 0) 0x00000000 .idata$4 [ 6](sec 7)(fl 0x00)(ty 0)(scl 3) (nx 0) 0x00000000 .idata$6 [ 7](sec 5)(fl 0x00)(ty 0)(scl 2) (nx 0) 0x00000000 __imp__data_export [ 8](sec 0)(fl 0x00)(ty 0)(scl 2) (nx 0) 0x00000000 __head_library_dll_a RELOCATION RECORDS FOR [.idata$7]: OFFSET TYPE VALUE 00000000 rva32 __head_library_dll_a RELOCATION RECORDS FOR [.idata$5]: OFFSET TYPE VALUE 00000000 rva32 .idata$6 RELOCATION RECORDS FOR [.idata$4]: OFFSET TYPE VALUE 00000000 rva32 .idata$6 Contents of section .idata$7: 0000 00000000 .... Contents of section .idata$5: 0000 00000000 .... Contents of section .idata$4: 0000 00000000 .... Contents of section .idata$6: 0000 00006461 74615f65 78706f72 7400 ..data_export.

Note that the object corresponding to data_export has an empty .text section, whereas function_export does define some code. If we disassemble it we get this:

00000000 <_function_export>:
   0:   ff 25 00 00 00 00       jmp    *0x0
                        2: dir32        .idata$5
   6:   90                      nop
   7:   90                      nop

The relocation of type dir32 tells the linker how to fill in the address being dereferenced by the jmp. We can see that _function_export, when entered, will jump directly to the function at the address loaded from the memory named .idata$5. Inspection of the complete .idata section satisfies us that .idata$5 corresponds to the address of the fragment of the import address table corresponding to the function_export import name, and hence the address where the absolute address of the loaded function_export import can be found.

Although only function_export gets a corresponding _function_export function, both of the exports have lead to a symbol with the __imp__ prefix (__imp__data_export and __imp__function_export) being defined in the import library. As discussed before, this symbol stands for the address at which the pointer to the data/function will be inserted by the dynamic linker. As such, the __imp__ symbols always point directly into the import address table.

With an import library in hand, we are capable of writing some client code that uses our exports, main1.c:

#include <stdio.h>

__declspec(dllimport) extern int function_export(void);
__declspec(dllimport) extern int data_export;

int main(int argc, char **argv) {
    printf("%d\n", function_export());
    printf("%d\n", data_export);

    data_export++;

    printf("%d\n", function_export());
    printf("%d\n", data_export);

    return 0;
}

Build and link it against the import library and we will get the results we expect:

$ gcc main1.c library.dll.a -o main1 && ./main1
1379
42
1380
43

The reason that this works even though there is no data_export symbol defined by library.dll.a is because the __declspec(dllimport) qualifier on our data_export declaration in main.c has caused the compiled to generate code that uses the __imp_data_export symbol directly, as we can see if we disassemble the generated code:

$ gcc -c main1.c -o main1.o && objdump --disassemble -r main1.o

main1.o:     file format pe-i386


Disassembly of section .text:

00000000 <_main>:
   0:   8d 4c 24 04             lea    0x4(%esp),%ecx
   4:   83 e4 f0                and    $0xfffffff0,%esp
   7:   ff 71 fc                pushl  -0x4(%ecx)
   a:   55                      push   %ebp
   b:   89 e5                   mov    %esp,%ebp
   d:   51                      push   %ecx
   e:   83 ec 14                sub    $0x14,%esp
  11:   e8 00 00 00 00          call   16 <_main+0x16>
                        12: DISP32      ___main
  16:   a1 00 00 00 00          mov    0x0,%eax
                        17: dir32       __imp__function_export
  1b:   ff d0                   call   *%eax
  1d:   89 44 24 04             mov    %eax,0x4(%esp)
  21:   c7 04 24 00 00 00 00    movl   $0x0,(%esp)
                        24: dir32       .rdata
  28:   e8 00 00 00 00          call   2d <_main+0x2d>
                        29: DISP32      _printf
  2d:   a1 00 00 00 00          mov    0x0,%eax
                        2e: dir32       __imp__data_export
  32:   8b 00                   mov    (%eax),%eax
  34:   89 44 24 04             mov    %eax,0x4(%esp)
  38:   c7 04 24 00 00 00 00    movl   $0x0,(%esp)
                        3b: dir32       .rdata
  3f:   e8 00 00 00 00          call   44 <_main+0x44>
                        40: DISP32      _printf
  44:   a1 00 00 00 00          mov    0x0,%eax
                        45: dir32       __imp__data_export
  49:   8b 00                   mov    (%eax),%eax
  4b:   8d 50 01                lea    0x1(%eax),%edx
  4e:   a1 00 00 00 00          mov    0x0,%eax
                        4f: dir32       __imp__data_export
  53:   89 10                   mov    %edx,(%eax)
  55:   a1 00 00 00 00          mov    0x0,%eax
                        56: dir32       __imp__function_export
  5a:   ff d0                   call   *%eax
  5c:   89 44 24 04             mov    %eax,0x4(%esp)
  60:   c7 04 24 00 00 00 00    movl   $0x0,(%esp)
                        63: dir32       .rdata
  67:   e8 00 00 00 00          call   6c <_main+0x6c>
                        68: DISP32      _printf
  6c:   a1 00 00 00 00          mov    0x0,%eax
                        6d: dir32       __imp__data_export
  71:   8b 00                   mov    (%eax),%eax
  73:   89 44 24 04             mov    %eax,0x4(%esp)
  77:   c7 04 24 00 00 00 00    movl   $0x0,(%esp)
                        7a: dir32       .rdata
  7e:   e8 00 00 00 00          call   83 <_main+0x83>
                        7f: DISP32      _printf
  83:   b8 00 00 00 00          mov    $0x0,%eax
  88:   83 c4 14                add    $0x14,%esp
  8b:   59                      pop    %ecx
  8c:   5d                      pop    %ebp
  8d:   8d 61 fc                lea    -0x4(%ecx),%esp
  90:   c3                      ret
  91:   90                      nop
  92:   90                      nop
  93:   90                      nop

In fact, we can see that the generated code doesn't even use the _function_export symbol, preferring __imp__function_export. Essentially, the code of the _function_export symbol in the import library has been inlined at every use site. This is why using __declspec(dllimport) can improve performance of cross-DLL calls, even though it is entirely optional on function declarations.

We might wonder what happens if we drop the __declspec(dllimport) qualifier on our declarations. Because of our discussion about the difference between data and function imports earlier, you might expect linking to fail. Our test file, main2.c is:

#include <stdio.h>

extern int function_export(void);
extern int data_export;

int main(int argc, char **argv) {
    printf("%d\n", function_export());
    printf("%d\n", data_export);

    data_export++;

    printf("%d\n", function_export());
    printf("%d\n", data_export);

    return 0;
}

Let's try it out:

$ gcc main2.c library.dll.a -o main2 && ./main2
1379
42
1380
43

What the hell -- it worked? This is a bit uprising. The reason that it works despite the fact that the import library library.dll.a not defining the _data_export symbol is because of a nifty feature of GNU ld called auto-import. Without auto-import the link fails as we would expect:

$ gcc main2.c library.dll.a -o main2 -Wl,--disable-auto-import && ./main2
/tmp/ccGd8Urx.o:main2.c:(.text+0x2c): undefined reference to `_data_export'
/tmp/ccGd8Urx.o:main2.c:(.text+0x41): undefined reference to `_data_export'
/tmp/ccGd8Urx.o:main2.c:(.text+0x49): undefined reference to `_data_export'
/tmp/ccGd8Urx.o:main2.c:(.text+0x63): undefined reference to `_data_export'
collect2: ld returned 1 exit status

The Microsoft linker does not implement auto-import, so this is the error you would get if you were using the Microsoft toolchain.

However, there is a way to write client code that does not depend on auto-import or use the __declspec(dllimport) keyword. Our new client, main3.c is as follows:

#include <stdio.h>

extern int (*_imp__function_export)(void);
extern int *_imp__data_export;

#define function_export (*_imp__function_export)
#define data_export (*_imp__data_export)

int main(int argc, char **argv) {
    printf("%d\n", function_export());
    printf("%d\n", data_export);

    data_export++;

    printf("%d\n", function_export());
    printf("%d\n", data_export);

    return 0;
}

In this code, we directly use the __imp__-prefixed symbols from the import library. These name an address at which the real address of the import can be found, which is reflected by our C-preprocessor definitions of data_export and function_export.

This code compiles perfectly even without auto-import:

$ gcc main3.c library.dll.a -o main3 -Wl,--disable-auto-import && ./main3
1379
42
1380
43

If you have followed along until this point you should have a solid understanding of how DLL import and export are implemented on Windows.

How auto-import works

As a bonus, I'm going to explain how auto-import is implemented by the GNU linker. It is a rather cute hack you may get a kick out of.

As a reminder, auto-import is a feature of the linker that allows the programmer to declare an item of DLL-imported data with a simple extern keyword, without having to explicitly use __declspec(dllimport). This is extremely convenient because this is exactly how most nix source code declares symbols it expects to import from a shared library, so by supporting this use case thatnix code becomes more portable to Windows.

Auto-import kicks in whenever the linker finds an object file making use of a symbol foo which is not defined by any other object in the link, but where a symbol __imp_foo is defined by some object. In this case, it assumes that the use of foo is an attempt to access some DLL-imported data item called foo.

Now, the problem is that the linker needs to replace the use of foo with the address of foo itself. However, all we seem to know statically is an address where that address will be placed at runtime (__imp_foo). To square the circle, the linker plays a clever trick.

The trick is to extend the .idata of the image being created with an entry for a "new" DLL. The new entry is set up as follows:

  • The filename of the image being imported is set to the same filename as the .idata entry covering __imp_foo. So if __imp_foo was being filled out by an address in Bar.dll, our new .idata entry will use Bar.dll here.

  • The import lookup table is of length 1, whose sole entry is a pointer to the name of the imported symbol corresponding to __imp_foo. So if __imp_foo is filled out by the address of the foo export from Bar.dll, the name of the symbol we put in here will be foo.

  • The import address table is of length 1 -- and here is the clever bit -- is located precisely at the location in the object file that was referring to the (undefined) symbol foo.

This solution neatly defers the task of filling out the address that the object file wants to the dynamic linker. The reason that the linker can play this trick is that it can see all of the object code that goes into the final image, and can thus fix all of the sites that need to refer to the imported data.

Note that in general the final image's .idata will contain several entries for the same DLL: one from the import library, and one for every place in any object file in the link which referred to some data exported by the DLL. Although this is somewhat unusual behaviour, the Windows linker has no problem with there being several imports of the same DLL.

A wrinkle

Unfortunately, the scheme described above only works if the object code has an undefined reference to foo itself. What if instead it has a reference to foo+N, an address N bytes after the address of foo itself? There is no way to set up the .idata so that the dynamic linker adds a constant to the address it fills in, so we seem to be stuck.

Alas, such relocations are reasonably common, and originate from code that accesses a field of a DLL-imported structure type. Cygwin actually contains another hack to make auto-import work in such cases, known as "pseudo-relocations". If you want to know the details of how these works, there is more information in the original thread on the topic.

Conclusion

Dynamic linking on Windows is hairier than it at first appears. I hope this article has gone some way to clearing up the meaning of the mysterious dllimport and dllexport keywords, and at clarifying the role of the import and export libraries.

Linux and friends implement dynamic linking in a totally different manner to Windows. The scheme they use is more flexible and allows more in-memory sharing of code, but incurs a significant runtime penalty (especially on i386). For more details see here and the Dynamic Linking section of the the ELF spec.


May 14 2007

Uninformed Journal Volume 7 Released

For those of you who are unaware, the lastest edition of the very entertaining Uninformed Journal came out recently. It's a pretty interesting edition, though not as bulky as some of their previous releases: I hope the project isn't losing momentum. In particular I found that the article by skape (one half of the duo who originally broke Windows Patchguard [pdf]) about dynamic analysis of memory access in arbitrary software was nice: he outlines a variety of tricks for doing this, using page tables, segment registers and runtime instrumentation, and some research areas where this would be useful, like taint propagation. There's probably a product or two waiting in there for some enterprising company to create :-) .

For those of you who enjoy Uninformed, you can also read skywings (the other half of the areformentioned duo) blog Nynaeve for regular fixes of in-depth technical information.


Apr 27 2007

So, Does Anyone Even Use All These Darn CPU Instructions?

That's the question I found myself asking earlier this month when I was writing a simple compiler for an OCaml dialect called MinCaml. I don't know if you've ever taken a look at the Intel IA32 instruction references, but there are two volumes of detailed descriptions of all the functions one of these CPU provides: about 600 in total!

As this compiler backend was my first real go at writing out assembly (though I'd read quite a lot of it before) I found the task of potentially learning all these instructions extremely intimidating, and wished I had an assembly language "cheat sheet" where I could go to find out the most common (and hence probably the most useful) instructions. So, today I got around to writing a little program to use UDis86 to scan all the binaries in C:\Windows\System32 and my Cygwin /bin directory and spit out some instruction counts. Now, before we see the results let me first say that my sample has a bias in that it's all precompiled stuff that will be designed to take advantage of only the lowest common denominator of the target CPU instruction sets. This means that the distribution below will probably be quite different than if you scan, say, your machine-specific Gentoo installation. Right, onto the big monstrous Excel pie chart:

IA32 Instruction Incidence

I've also uploaded my raw data to my personal site, along with the program I used to get the results for anyone who is interested.

In the finest traditions of top 10 lists everywhere, let's give a quick rundown of the most popular instructions:

  1. add: is anyone really suprised by this?
  2. mov: again, it's fairly hard to get work done without moving data around
  3. push/pop: these instructions are commonly used for setting up/destroying function stack frames, hence their popularity despite IA32s register orientation
  4. invalid: do'h! These are from areas of code where udis86 just barfs :-) . Their high ranking could mean a lot of instructions are in use that it can't deal with, or (more likely) it is trying to disassemble executable regions of the binary that just don't have anything useful in them
  5. cmp: the absolute bedrock of branching on the IA32, but what is suprising is that it's so much higher than jz. Are there really that many pure numeric comparisions in programs?
  6. dec/inc: I'm fairly suprised that dec beats inc here, especially since sub is so much lower than add in my data: does anyone have a theory?
  7. xor/or: xor no doubt clocks in so high because of the "xor reg, reg" idiom used to zero out registers, but or is quite popular too for something which is really only good for bit flags and the odd if expression...

Well that was fun, in a geeky sort of way! I'd be quite interested in knowing the reasons behind some of the oddities above: maybe another blog entry is in order...?

Edit: an experienced reverse engineer, t_w, commented on Reddit that my count for add was quite high, and that it might be caused by counting junk code. It turns out that the byte sequence 0x0000 disassembles to an add instruction, which skews my results since the null byte is common padding for executable .text sections. However, after correcting for this (the charts above have taken this into account) it still leaves add in first place (it had 83355055 counted occurances before excluding this sequence, but still 45303805 after exclusion).