The setup

A few years ago, David Terei did some great work adding a LLVM backend to the Glasgow Haskell Compiler. The idea with this is that instead of writing our own optimiser and assembly-code generators for our custom three-address-code, we can just translate into LLVM IR and have LLVM do the heavy lifting. In theory, this means that GHC will be able to compile for many different CPUs, and will benefit from the smart optimisations the LLVM team have implemented.

The portability part has definitely worked out for us: for example, a couple of people have successfully got GHC to compile for the ARM by using the LLVM backend. However, the promise of LLVM being able to speed up our generated code has never really been fully borne out. LLVM-generated code does tend to be better than that produced by GHCs own backends, but this is mostly because LLVM is doing much better register allocation (it is much smarter about reusing the “pinned registers” required that form part of the interface between GHC’s generated code and the garbage collector).

The reason that LLVM does not optimise as much as we would like is often to do with aliasing. In particular, LLVM conservatively assumes that GHC’s stack (which is explicitly represented in the generated code as an array of words) and the heap may alias.

What’s the problem?

A concrete example of this is the following Haskell program:

module Main(main) where
    
import Data.Array.Base
import Data.Array.IO
import Data.Array.MArray

main :: IO ()
main = do
    arr <- newArray_ (0, 200)
    go arr 2 0 100

go :: IOUArray Int Int -> Int -> Int -> Int -> IO ()
go arr stride x y | x < y     = do unsafeWrite arr (x * stride) 1337
                                   go arr stride (x + 1) y
                  | otherwise = return ()

This loop compiles to fairly good Core:

Main.main_$s$wa =
  \ (@ sg0_sKA::Data.Array.Base.STUArray
                  GHC.Prim.RealWorld GHC.Types.Int GHC.Types.Int
                  ~
                Data.Array.IO.Internals.IOUArray GHC.Types.Int GHC.Types.Int)
    (sc_sKs :: GHC.Prim.State# GHC.Prim.RealWorld)
    (sc1_sKt :: GHC.Prim.Int#)
    (sc2_sKu :: GHC.Prim.Int#)
    (sc3_sKv :: GHC.Prim.Int#)
    (sc4_sKw :: GHC.Types.Int)
    (sc5_sKx :: GHC.Types.Int)
    (sc6_sKy :: GHC.Types.Int)
    (sc7_sKz :: GHC.Prim.MutableByteArray# GHC.Prim.RealWorld) ->
    case GHC.Prim.<# sc2_sKu sc1_sKt of _ {
      GHC.Bool.False -> (# sc_sKs, GHC.Unit.() #);
      GHC.Bool.True ->
        case GHC.Prim.writeIntArray#
               @ GHC.Prim.RealWorld
               sc7_sKz
               (GHC.Prim.*# sc2_sKu sc3_sKv)
               1337
               sc_sKs
        of s2#_aHo { __DEFAULT ->
        Main.main_$s$wa
          @ (sym
               Data.Array.IO.Internals.NTCo:IOUArray GHC.Types.Int GHC.Types.Int)
          s2#_aHo
          sc1_sKt
          (GHC.Prim.+# sc2_sKu 1)
          sc3_sKv
          sc4_sKw
          sc5_sKx
          sc6_sKy
          sc7_sKz
        }
    }

One weird thing about this Core is that it passes around a number of dead arguments (sc4_sKw, sc5_sKx and sc6_sKy). This is a known bug in GHC, and is caused by a phase ordering problem. However, this particular infelicity should not prevent LLVM from being able to do the classic loop optimisation of strength reduction on our code.

The particular strength reduction we would like to perform si to replace the multiplication GHC.Prim.*# sc2_sKu sc3_sKv in the main_$s$wa loop with an addition. This is possible because the left operand sc2_sKu is a loop induction variable, increasing by 1 every iteration. Thus, on every iteration the value of the multiplication GHC.Prim.*# sc2_sKu sc3_sKv is just the value of the multiplication on the previous loop, plus sc3_sKv. Thus, by adding a loop variable that records the value of the multiplication on the previous iteration, we can replace the multiplication by an addition.

Unfortunately, LLVM currently can’t strength-reduce this loop in the suggested way. As we will soon see, this is due to aliasing.

Why does the problem happen?

We can immediately see the problem if we look at the optimised LLVM code for this loop:

c1TW.lr.ph:
  ...
  %ln1TL1 = load i64* %Sp_Arg, align 8
  ...

c1TW:                                             ; preds = %c1TW.lr.ph, %c1TW
  %ln1TL4 = phi i64 [ %ln1TL1, %c1TW.lr.ph ], [ %ln1UF, %c1TW ]
  %ln1Uy = mul i64 %ln1Uu, %ln1TL4
  %ln1Uz = add i64 %ln1Uw, %ln1Uy
  %ln1UA = inttoptr i64 %ln1Uz to i64*
  store i64 1337, i64* %ln1UA, align 8
  %ln1UE = load i64* %Sp_Arg, align 8
  %ln1UF = add i64 %ln1UE, 1
  store i64 %ln1UF, i64* %Sp_Arg, align 8
  %ln1TP = load i64* %ln1TN, align 8
  %ln1TQ = icmp slt i64 %ln1UF, %ln1TP
  br i1 %ln1TQ, label %c1TW, label %n1TX.loopexit

The strength reduction optimisation depends on one of the operands to the multiplication being a loop induction variable. In our case, we expect that sc2_sKu will be such a variable. However, looking at the LLVM code we can see that the equivalent LLVM variable, %ln1TL4, has its induction-ness hidden because it is reloaded from the stack by load i64* %Sp_Arg on every iteration.

You might wonder why the store to the same stack location by store i64 %ln1UF, i64* %Sp_Arg is not forwarded to this load by LLVM. If this were to happen, we could get code like this:

c1TW.lr.ph:
  ...
  %ln1TL1 = load i64* %Sp_Arg, align 8
  %ln1UE.ph = load i64* %Sp_Arg, align 8
  ...

c1TW:                                             ; preds = %c1TW.lr.ph, %c1TW
  %ln1TL4 = phi i64 [ %ln1TL1, %c1TW.lr.ph ], [ %ln1UF, %c1TW ]
  %ln1UE = phi i64 [ %ln1UE.ph, %c1TW.lr.ph ], [ %ln1UF, %c1TW ]
  %ln1Uy = mul i64 %ln1Uu, %ln1TL4
  %ln1Uz = add i64 %ln1Uw, %ln1Uy
  %ln1UA = inttoptr i64 %ln1Uz to i64*
  store i64 1337, i64* %ln1UA, align 8
  %ln1UF = add i64 %ln1UE, 1
  store i64 %ln1UF, i64* %Sp_Arg, align 8
  %ln1TP = load i64* %ln1TN, align 8
  %ln1TQ = icmp slt i64 %ln1UF, %ln1TP
  br i1 %ln1TQ, label %c1TW, label %n1TX.loopexit

In this code the fact that %ln1UE is an induction variable is obvious, and not obscured by an intermediate load from memory. And indeed, LLVM is able to strength-reduce this loop!

The reason that LLVM does not forward this load is because in general it is unsafe, since the store to %ln1UA might alias it if %ln1UA were equal to %Sp_Arg. The ridiculous thing about this is that we know that in the code generated by GHC, the stack pointer will never be stored away anywhere, so it can’t possible alias with the unknown pointer %ln1UA and LLVM is being unnecessarily conservative.

The solution

LLVM is a beautiful bit of software, and it provides exactly the extensibility point we require to resolve this problem: we can write our own alias analysis pass that knows that GHC’s stack never alias with any another non-stack pointer and dynamically load it into the LLVM optimisation tool chain.

This is exactly what I’ve done. The code is available as a Gist, and interested parties (who use OS X!) can build it like so:

g++ -D__STDC_LIMIT_MACROS -D__STDC_CONSTANT_MACROS -fno-exceptions -fno-rtti -fno-common -Wall \
-Wl,-flat_namespace -dynamiclib GHCAliasAnalysis.cpp -o GHCAliasAnalysis.dylib -lLLVM-`llvm-config --version`

Once built, we can dynamically load the resulting dylib into LLVMs opt tool using the -load option, and then use the new -ghc-aa flag to tell LLVM to use our alias analyser as a complement to the default one. Unfortunately, due to an infelicity in LLVM, we have to specify -ghc-aa in between every single optimisation pass if we want to be sure that it is used. So the final command line to opt, including all passes done by the standard -O2 optimisation level, and the -loop-reduce strength-reduction pass, needs to look something like this:

opt -load GHCAliasAnalysis.dylib -S -no-aa -tbaa -basicaa -ghc-aa \
-globalopt -ghc-aa -ghc-aa -ipsccp -ghc-aa -deadargelim -ghc-aa -instcombine -ghc-aa -simplifycfg \
-ghc-aa -basiccg -ghc-aa -prune-eh -ghc-aa -inline -ghc-aa -functionattrs -ghc-aa -scalarrepl-ssa \
-ghc-aa -domtree -ghc-aa -early-cse -ghc-aa -simplify-libcalls -ghc-aa -lazy-value-info -ghc-aa \
-jump-threading -ghc-aa -correlated-propagation -ghc-aa -simplifycfg -ghc-aa -instcombine -ghc-aa \
-tailcallelim -ghc-aa -simplifycfg -ghc-aa -reassociate -ghc-aa -domtree -ghc-aa -loops -ghc-aa \
-loop-simplify -ghc-aa -lcssa -ghc-aa -loop-rotate -ghc-aa -licm -ghc-aa -lcssa -ghc-aa -loop-unswitch \
-ghc-aa -instcombine -ghc-aa -scalar-evolution -ghc-aa -loop-simplify -ghc-aa -lcssa -ghc-aa -indvars \
-ghc-aa -loop-idiom -ghc-aa -loop-deletion -ghc-aa -loop-unroll -ghc-aa -memdep -ghc-aa -gvn -ghc-aa \
-memdep -ghc-aa -memcpyopt -ghc-aa -sccp -ghc-aa -instcombine -ghc-aa -lazy-value-info -ghc-aa \
-jump-threading -ghc-aa -correlated-propagation -ghc-aa -domtree -ghc-aa -memdep -ghc-aa -dse \
-ghc-aa -adce -ghc-aa -simplifycfg -ghc-aa -instcombine -ghc-aa -strip-dead-prototypes -ghc-aa \
-constmerge -loop-reduce

(Yes, I know this is ridiculous! I hope the LLVM developers fix this soon.)

With my new alias analysis pass, LLVM is able to produce the following beautiful code for the loop:

c1TW:                                             ; preds = %c1TW, %c1TW.lr.ph
  %lsr.iv = phi i64 [ %lsr.iv.next, %c1TW ], [ %5, %c1TW.lr.ph ]
  %ln1UF1 = phi i64 [ %ln1TL1, %c1TW.lr.ph ], [ %ln1UF, %c1TW ]
  %ln1UA = inttoptr i64 %lsr.iv to i64*
  store i64 1337, i64* %ln1UA, align 8
  %ln1UF = add i64 %ln1UF1, 1
  %lsr.iv.next = add i64 %lsr.iv, %6
  %ln1TQ = icmp slt i64 %ln1UF, %ln1TP2
  br i1 %ln1TQ, label %c1TW, label %n1TX.loopexit

Note that the original loop contained a store and two loads, but the optimised loop contains only a single store: our new alias analysis has allowed the loads to be floated out of the loop. This has in turn allowed LLVM to discover the loop induction variable and apply strength reduction - note that the final loop never uses the multiplication instruction!

The final program runs 8.8% faster than the version that is compiled without the custom alias analysis.

Conclusions

My custom alias analyser for GHC-generated code gives LLVM much more room for applying its existing powerful optimisation. There is plenty of scope for improvement, though:

  1. I’d really like people to report their experiences using with this alias analyser and the LLVM backend. Do you see a big speed boost on your data-parallel Haskell programs, for example?

  2. Of course, I would like this alias analyser to included with GHC so you can all seamlessly benefit from it. I’ll be working with GHC HQ to make this happen.

  3. I think there is still scope for getting even more useful information about GHC-generated code into LLVM. For example, currently LLVM is unable to eliminate stores to stack locations that we can see will never be accessed because we make a tail call to another function with a stack pointer that points above these locations. I can think of at least two ways to express this to LLVM, and this would produce another nice gain.

If would also be great if we could teach LLVM something about the garbage collector, as currently if your loop does any allocation at all the presence of calls to the GC pessimises the output code a lot.

I was partly inspired to do this by Ben Lippmeier, whose paper at the Haskell Symposium this year had to do strength-reduction manually at the Haskell level because LLVM wasn’t working for him. I hope I’ve fixed that issue.

Performance problems were also a focus of the discussions about the future of Haskell at ICFP. I’ve been to these discussions three years in a row, and several topics keep cropping back up: performance, and the fact that Hackage 2.0 still isn’t released. I’ve grown tired of hearing so much talk about the issues with little-to-no action to resolve them, so I spent this post-ICFP week doing my best to fix them. I first wrote a documentation build bot for the Hackage 2.0 effort, and then moved on to the LLVM performance issues - if everyone helps to move these issues along then hopefully we can finally talk about some different problems next year!