All posts by Max

Rpath emulation: absolute DLL references on Windows

When creating an executable or shared library on Linux, it’s possible to include an ELF RPATH header which tells the dynamic linker where to search for the any shared libraries that you reference. This is a pretty handy feature because it can be used to nail down exactly which shared library you will link against, without leaving anything up to chance at runtime.

Unfortunately, Windows does not have an equivalent feature. However, it does have an undocumented feature which may be enough to replace your use of rpath if you are porting software from Linux.

Executables or DLLs or Windows always reference any DLLs that they import by name only. So, the import table for an executable will refer to kernel32.dll rather than C:\Windows\kernel32.dll. Window’s dynamic loader will look for a file with the appropriate name in the DLL search path as usual. (For full details on DLL import tables and more, you can check out my previous in depth post.)

However, Window’s dynamic loader will, as a completely undocumented (and presumably unsupported) feature, also accept absolute paths in the import table. This is game-changing because it means that you can hard-code exactly which DLL you want to refer to, just like you would be able to with rpath on Linux.

Demonstration

To demonstrate this technique, we’re going to need code for a DLL and a referring EXE:

$ cat library.c
#include <stdio.h>

__declspec(dllexport) int librarycall(void) {
        printf("Made library call!\n");
        return 0;
}

$ cat rpath.c
__declspec(dllimport) int librarycall(void);

int main(int argc, char **argv) {
        return librarycall();
}

If we were building a DLL and EXE normally, we would do this:

<code>gcc -c library.c
gcc -shared -o library.dll library.o
gcc -o rpath rpath.c -L./ -llibrary</code>

This all works fine:

<code>$ ./rpath
Made library call!</code>

However, as you would expect, if you move library.dll elsewhere, the EXE will fail to start:

<code>$ mv library.dll C:/library.dll
$ ./rpath
/home/Max/rpath/rpath.exe: error while loading shared libraries: library.dll: cannot open shared object file: No such file or directory</code>

Now let’s work some magic! If we open up rpath.exe in a hex editor, we see something like this:

Let’s just tweak that a bit to change the relative path to library.dll to an absolute path. Luckily there is enough padding to make it fit:

The EXE will now work perfectly!

<code>$ ./rpath
Made library call!</code>

In practice

Knowing that this feature exists is one thing. Actually making use of it in a reliable way is another. The problem is that to my knowledge no linkers are capable of creating a DLL or EXE which include an absolute path in their import tables. Sometimes we will be lucky enough that the linker creates an EXE or DLL with enough padding in it for us to manually edit in an absolute path, but with the method above there is no guarantee that this will be possible.

In order to exploit this technique robustly, we’re going to use a little trick with import libraries. Instead of using GCC’s ability to link directly to a DLL, we will generate an import library for the DLL, which we will call library.lib:

<code>$ dlltool --output-lib library.lib --dllname veryverylongdllname.dll library.o</code>

When you use dlltool you either need to write a .def file for the DLL you are creating an import library for, or you need to supply all the object files that were used to create the DLL. I’ve taken the second route here and just told dlltool that the our DLL was built from library.o.

Now we have an import library, we can do our hex-editing trick again, but this time on the library. Before:

And after (note that I have null-terminated the new absolute path):

The beauty of editing the import library rather than the output of the linker is that using the --dllname option we can ensure that the import library contains as much space as we need to fit the entire absolute path of the DLL, no matter how long it may be. This is the key to making robust use of absolute paths in DLL loading, even if linkers don’t support them!

Now we have the import library, we can link rpath.exe again, but this time using the import library rather than library.dll:

<code>$ gcc -o rpath rpath.c library.lib
$ ./rpath
Made library call!</code>

Yes, it really is using the DLL on the C: drive:

<code>$ mv C:/library.dll C:/foo.dll
$ ./rpath
/home/Max/rpath/rpath.exe: error while loading shared libraries: C:\library.dll: cannot open shared object file: No such file or directory</code>

Conclusion

I haven’t seen this technique for using absolute paths for DLL references anywhere on the web, so it doesn’t seem to be widely known. However, it works beautifully on Windows 7 and probably on all other versions of Windows as well.

I may apply these techniques to the Glasgow Haskell Compiler in order to improve the support for Haskell shared objects on Windows: more information on this topic can be found on the GHC wiki.

GHC-specific Alias Analysis for LLVM

The setup

A few years ago, David Terei did some great work adding a LLVM backend to the Glasgow Haskell Compiler. The idea with this is that instead of writing our own optimiser and assembly-code generators for our custom three-address-code, we can just translate into LLVM IR and have LLVM do the heavy lifting. In theory, this means that GHC will be able to compile for many different CPUs, and will benefit from the smart optimisations the LLVM team have implemented.

The portability part has definitely worked out for us: for example, a couple of people have successfully got GHC to compile for the ARM by using the LLVM backend. However, the promise of LLVM being able to speed up our generated code has never really been fully borne out. LLVM-generated code does tend to be better than that produced by GHCs own backends, but this is mostly because LLVM is doing much better register allocation (it is much smarter about reusing the “pinned registers” required that form part of the interface between GHC’s generated code and the garbage collector).

The reason that LLVM does not optimise as much as we would like is often to do with aliasing. In particular, LLVM conservatively assumes that GHC’s stack (which is explicitly represented in the generated code as an array of words) and the heap may alias.

What’s the problem?

A concrete example of this is the following Haskell program:

module Main(main) where

import Data.Array.Base
import Data.Array.IO
import Data.Array.MArray

main :: IO ()
main = do
    arr <- newArray_ (0, 200)
    go arr 2 0 100

go :: IOUArray Int Int -> Int -> Int -> Int -> IO ()
go arr stride x y | x < y     = do unsafeWrite arr (x * stride) 1337
                                   go arr stride (x + 1) y
                  | otherwise = return ()

This loop compiles to fairly good Core:

Main.main_$s$wa =
  \ (@ sg0_sKA::Data.Array.Base.STUArray
                  GHC.Prim.RealWorld GHC.Types.Int GHC.Types.Int
                  ~
                Data.Array.IO.Internals.IOUArray GHC.Types.Int GHC.Types.Int)
    (sc_sKs :: GHC.Prim.State# GHC.Prim.RealWorld)
    (sc1_sKt :: GHC.Prim.Int#)
    (sc2_sKu :: GHC.Prim.Int#)
    (sc3_sKv :: GHC.Prim.Int#)
    (sc4_sKw :: GHC.Types.Int)
    (sc5_sKx :: GHC.Types.Int)
    (sc6_sKy :: GHC.Types.Int)
    (sc7_sKz :: GHC.Prim.MutableByteArray# GHC.Prim.RealWorld) ->
    case GHC.Prim.<# sc2_sKu sc1_sKt of _ {
      GHC.Bool.False -> (# sc_sKs, GHC.Unit.() #);
      GHC.Bool.True ->
        case GHC.Prim.writeIntArray#
               @ GHC.Prim.RealWorld
               sc7_sKz
               (GHC.Prim.*# sc2_sKu sc3_sKv)
               1337
               sc_sKs
        of s2#_aHo { __DEFAULT ->
        Main.main_$s$wa
          @ (sym
               Data.Array.IO.Internals.NTCo:IOUArray GHC.Types.Int GHC.Types.Int)
          s2#_aHo
          sc1_sKt
          (GHC.Prim.+# sc2_sKu 1)
          sc3_sKv
          sc4_sKw
          sc5_sKx
          sc6_sKy
          sc7_sKz
        }
    }

One weird thing about this Core is that it passes around a number of dead arguments (sc4_sKw, sc5_sKx and sc6_sKy). This is a known bug in GHC, and is caused by a phase ordering problem. However, this particular infelicity should not prevent LLVM from being able to do the classic loop optimisation of strength reduction on our code.

The particular strength reduction we would like to perform si to replace the multiplication GHC.Prim.*# sc2_sKu sc3_sKv in the main_$s$wa loop with an addition. This is possible because the left operand sc2_sKu is a loop induction variable, increasing by 1 every iteration. Thus, on every iteration the value of the multiplication GHC.Prim.*# sc2_sKu sc3_sKv is just the value of the multiplication on the previous loop, plus sc3_sKv. Thus, by adding a loop variable that records the value of the multiplication on the previous iteration, we can replace the multiplication by an addition.

Unfortunately, LLVM currently can’t strength-reduce this loop in the suggested way. As we will soon see, this is due to aliasing.

Why does the problem happen?

We can immediately see the problem if we look at the optimised LLVM code for this loop:

c1TW.lr.ph:
  ...
  %ln1TL1 = load i64* %Sp_Arg, align 8
  ...

c1TW:                                             ; preds = %c1TW.lr.ph, %c1TW
  %ln1TL4 = phi i64 [ %ln1TL1, %c1TW.lr.ph ], [ %ln1UF, %c1TW ]
  %ln1Uy = mul i64 %ln1Uu, %ln1TL4
  %ln1Uz = add i64 %ln1Uw, %ln1Uy
  %ln1UA = inttoptr i64 %ln1Uz to i64*
  store i64 1337, i64* %ln1UA, align 8
  %ln1UE = load i64* %Sp_Arg, align 8
  %ln1UF = add i64 %ln1UE, 1
  store i64 %ln1UF, i64* %Sp_Arg, align 8
  %ln1TP = load i64* %ln1TN, align 8
  %ln1TQ = icmp slt i64 %ln1UF, %ln1TP
  br i1 %ln1TQ, label %c1TW, label %n1TX.loopexit

The strength reduction optimisation depends on one of the operands to the multiplication being a loop induction variable. In our case, we expect that sc2_sKu will be such a variable. However, looking at the LLVM code we can see that the equivalent LLVM variable, %ln1TL4, has its induction-ness hidden because it is reloaded from the stack by load i64* %Sp_Arg on every iteration.

You might wonder why the store to the same stack location by store i64 %ln1UF, i64* %Sp_Arg is not forwarded to this load by LLVM. If this were to happen, we could get code like this:

c1TW.lr.ph:
  ...
  %ln1TL1 = load i64* %Sp_Arg, align 8
  %ln1UE.ph = load i64* %Sp_Arg, align 8
  ...

c1TW:                                             ; preds = %c1TW.lr.ph, %c1TW
  %ln1TL4 = phi i64 [ %ln1TL1, %c1TW.lr.ph ], [ %ln1UF, %c1TW ]
  %ln1UE = phi i64 [ %ln1UE.ph, %c1TW.lr.ph ], [ %ln1UF, %c1TW ]
  %ln1Uy = mul i64 %ln1Uu, %ln1TL4
  %ln1Uz = add i64 %ln1Uw, %ln1Uy
  %ln1UA = inttoptr i64 %ln1Uz to i64*
  store i64 1337, i64* %ln1UA, align 8
  %ln1UF = add i64 %ln1UE, 1
  store i64 %ln1UF, i64* %Sp_Arg, align 8
  %ln1TP = load i64* %ln1TN, align 8
  %ln1TQ = icmp slt i64 %ln1UF, %ln1TP
  br i1 %ln1TQ, label %c1TW, label %n1TX.loopexit

In this code the fact that %ln1UE is an induction variable is obvious, and not obscured by an intermediate load from memory. And indeed, LLVM is able to strength-reduce this loop!

The reason that LLVM does not forward this load is because in general it is unsafe, since the store to %ln1UA might alias it if %ln1UA were equal to %Sp_Arg. The ridiculous thing about this is that we know that in the code generated by GHC, the stack pointer will never be stored away anywhere, so it can’t possible alias with the unknown pointer %ln1UA and LLVM is being unnecessarily conservative.

The solution

LLVM is a beautiful bit of software, and it provides exactly the extensibility point we require to resolve this problem: we can write our own alias analysis pass that knows that GHC’s stack never alias with any another non-stack pointer and dynamically load it into the LLVM optimisation tool chain.

This is exactly what I’ve done. The code is available as a Gist, and interested parties (who use OS X!) can build it like so:

<code>g++ -D__STDC_LIMIT_MACROS -D__STDC_CONSTANT_MACROS -fno-exceptions -fno-rtti -fno-common -Wall \
-Wl,-flat_namespace -dynamiclib GHCAliasAnalysis.cpp -o GHCAliasAnalysis.dylib -lLLVM-`llvm-config --version`
</code>

Once built, we can dynamically load the resulting dylib into LLVMs opt tool using the -load option, and then use the new -ghc-aa flag to tell LLVM to use our alias analyser as a complement to the default one. Unfortunately, due to an infelicity in LLVM, we have to specify -ghc-aa in between every single optimisation pass if we want to be sure that it is used. So the final command line to opt, including all passes done by the standard -O2 optimisation level, and the -loop-reduce strength-reduction pass, needs to look something like this:

<code>opt -load GHCAliasAnalysis.dylib -S -no-aa -tbaa -basicaa -ghc-aa \
-globalopt -ghc-aa -ghc-aa -ipsccp -ghc-aa -deadargelim -ghc-aa -instcombine -ghc-aa -simplifycfg \
-ghc-aa -basiccg -ghc-aa -prune-eh -ghc-aa -inline -ghc-aa -functionattrs -ghc-aa -scalarrepl-ssa \
-ghc-aa -domtree -ghc-aa -early-cse -ghc-aa -simplify-libcalls -ghc-aa -lazy-value-info -ghc-aa \
-jump-threading -ghc-aa -correlated-propagation -ghc-aa -simplifycfg -ghc-aa -instcombine -ghc-aa \
-tailcallelim -ghc-aa -simplifycfg -ghc-aa -reassociate -ghc-aa -domtree -ghc-aa -loops -ghc-aa \
-loop-simplify -ghc-aa -lcssa -ghc-aa -loop-rotate -ghc-aa -licm -ghc-aa -lcssa -ghc-aa -loop-unswitch \
-ghc-aa -instcombine -ghc-aa -scalar-evolution -ghc-aa -loop-simplify -ghc-aa -lcssa -ghc-aa -indvars \
-ghc-aa -loop-idiom -ghc-aa -loop-deletion -ghc-aa -loop-unroll -ghc-aa -memdep -ghc-aa -gvn -ghc-aa \
-memdep -ghc-aa -memcpyopt -ghc-aa -sccp -ghc-aa -instcombine -ghc-aa -lazy-value-info -ghc-aa \
-jump-threading -ghc-aa -correlated-propagation -ghc-aa -domtree -ghc-aa -memdep -ghc-aa -dse \
-ghc-aa -adce -ghc-aa -simplifycfg -ghc-aa -instcombine -ghc-aa -strip-dead-prototypes -ghc-aa \
-constmerge -loop-reduce
</code>

(Yes, I know this is ridiculous! I hope the LLVM developers fix this soon.)

With my new alias analysis pass, LLVM is able to produce the following beautiful code for the loop:

<code>c1TW:                                             ; preds = %c1TW, %c1TW.lr.ph
  %lsr.iv = phi i64 [ %lsr.iv.next, %c1TW ], [ %5, %c1TW.lr.ph ]
  %ln1UF1 = phi i64 [ %ln1TL1, %c1TW.lr.ph ], [ %ln1UF, %c1TW ]
  %ln1UA = inttoptr i64 %lsr.iv to i64*
  store i64 1337, i64* %ln1UA, align 8
  %ln1UF = add i64 %ln1UF1, 1
  %lsr.iv.next = add i64 %lsr.iv, %6
  %ln1TQ = icmp slt i64 %ln1UF, %ln1TP2
  br i1 %ln1TQ, label %c1TW, label %n1TX.loopexit
</code>

Note that the original loop contained a store and two loads, but the optimised loop contains only a single store: our new alias analysis has allowed the loads to be floated out of the loop. This has in turn allowed LLVM to discover the loop induction variable and apply strength reduction - note that the final loop never uses the multiplication instruction!

The final program runs 8.8% faster than the version that is compiled without the custom alias analysis.

Conclusions

My custom alias analyser for GHC-generated code gives LLVM much more room for applying its existing powerful optimisation. There is plenty of scope for improvement, though:

  1. I’d really like people to report their experiences using with this alias analyser and the LLVM backend. Do you see a big speed boost on your data-parallel Haskell programs, for example?

  2. Of course, I would like this alias analyser to included with GHC so you can all seamlessly benefit from it. I’ll be working with GHC HQ to make this happen.

  3. I think there is still scope for getting even more useful information about GHC-generated code into LLVM. For example, currently LLVM is unable to eliminate stores to stack locations that we can see will never be accessed because we make a tail call to another function with a stack pointer that points above these locations. I can think of at least two ways to express this to LLVM, and this would produce another nice gain.

    If would also be great if we could teach LLVM something about the garbage collector, as currently if your loop does any allocation at all the presence of calls to the GC pessimises the output code a lot.

I was partly inspired to do this by Ben Lippmeier, whose paper at the Haskell Symposium this year had to do strength-reduction manually at the Haskell level because LLVM wasn’t working for him. I hope I’ve fixed that issue.

Performance problems were also a focus of the discussions about the future of Haskell at ICFP. I’ve been to these discussions three years in a row, and several topics keep cropping back up: performance, and the fact that Hackage 2.0 still isn’t released. I’ve grown tired of hearing so much talk about the issues with little-to-no action to resolve them, so I spent this post-ICFP week doing my best to fix them. I first wrote a documentation build bot for the Hackage 2.0 effort, and then moved on to the LLVM performance issues - if everyone helps to move these issues along then hopefully we can finally talk about some different problems next year!