Security implications of PEP 383

I've been looking into improving GHC's support for non-ASCII text, and my investigations have led me to PEP 383.

One motivation behind this PEP is as follows: on Unix, the names of files, command line arguments, and environment variables should probably be treated as sequences of bytes. However, for good reasons it is quite natural for programs to act on them as if they were strings. This means that we have to choose some text encoding to use to interpret those byte sequences.

Unfortunately, whatever encoding you choose to use, it is quite likely that some byte sequences you encounter in practice will not in fact decode nicely using that encoding. An example would be a Big5 filename supplied as a command line argument to a program run in the UTF-8 locale.

In this case, what should happen? One sensible thing to do would be to fail, but this might be surprising. Python 3, with PEP 383, chooses to encode the non-decodable bytes as part of the string using surrogates. So if we try to parse a Big5 filename as a string we get a string full of surrogates representing the raw bytes we had to begin with.

This is a good thing to do because if that string is then immediately fed back into a function that encodes it for use on the file system, the original byte sequence can be exactly reconstituted: the surrogates turn back into the bytes they stand for, and the rest of the string is encoded using the locale encoding. If the user attempts to do something else with a string containing surrogates (such as displaying it on the terminal), then an exception will be raised.

This is a reasonably neat solution to a hard problem. However, it has weird implications. For example, consider this script that uses a blacklist to control access to some files:

#!/usr/bin/env python3

import sys

file = sys.argv[1]

blacklist = open("blacklist.big5", encoding='big5').read().split()
print("Blacklist is:\n" + repr(blacklist))

if file in blacklist:
    print("Blacklisted file, not allowed!")
else:
    print("OK, I'll let you in!")
    print(open(file).read())

Let's say that the blacklist contains a single entry, for the file 你好 (encoded in Big5, naturally).

Seems simple enough, right? Although the file names on disk are Big5, the comparison is between Python's Unicode strings. And indeed this program works perfectly when run from a terminal in the Big5 locale, with Big5 file names.

However, consider what happens when the terminal is set to UTF-8 and we invoke the script with the command line argument 你好 (encoded in Big5 of course, because the file name on disk is still Big5 even though we changed the terminal locale). In this case, Python 3 will attempt to decode the file name as UTF-8. Naturally, it will fail, so the Big5 filename will be represented in memory with surrogates.

Now for the punchline: when we come to compare that string (containing surrogates) with the entry from the blacklist (without surrogates), they will not be equal. Yet when we go on to open the file, the filename (with surrogates) is encoded perfectly back into the original Big5 bytes, and hence we get the contents of the blacklisted file.

In my opinion, the fact that the current encoding affects the results of string comparisons is deeply weird behaviour and could probably be the cause of subtle security bugs. This is just one reason that I'm wary about adopting PEP 383-like behaviour for GHC.

P.S. For those who believe that my code is broken because you should only compare normalised unicode strings, I will add that even after using unicodedata.normalize to normalise to NFC I get the same problem.

P.P.S. I will further note that you get the same issue even if the blacklist and the filename are in UTF-8 - but this time it breaks when run from a terminal in the Big5 locale. I didn't show it this way around because I understand that Python 3 may only recently have started using the locale to decode argv, rather than being hardcoded to UTF-8.

How to build 32/64 bit fat (universal) binaries

The OS X version of the Glasgow Haskell Compiler compiles Haskell into 32-bit code. Unfortunately, this means that if you are on a system where libraries are built in 64-bit mode by default, you tend to get errors when linking Haskell code telling you that you are trying to link 32-bit code against 64-bit code.

The best solution to this problem is to build all libraries you intend to link to from Haskell code as universal binaries that include both 32-bit and 64-bit versions of the code. These libraries will then work seamlessly with both Haskell code and also when pulled in as part of the build process for non-Haskell 64-bit executables.

If you can install the library using MacPorts, this is easy to do. Instead of doing:

sudo port install mylibrary

Just do:

sudo port install mylibrary +universal

However, if the library you want is not available through MacPorts, or the MacPorts version is not up to date, you will need to know how to build these universal libraries for yourself. This is the process that I aim to explain in this post. I'm going to use igraph as my example library because it's what I needed to install (specifically, the unreleased v0.6).

The easy method

If you are lucky, building a universal library is as simple as changing how you invoke make. Run the library's configure scripts etc as usual, and then invoke make as follows:

make CXXFLAGS="-arch i386 -arch x86_64" CFLAGS="-arch i386 -arch x86_64" LDFLAGS="-arch i386 -arch x86_64"

The -arch flags tell GCC and the linker to build and link both versions of the library. If this works, you are done. In the case of igraph, this wasn't quite enough - the above command failed with this error:

gcc-4.2: -E, -S, -save-temps and -M options are not allowed with multiple -arch flags

The reason this occurs is that igraph invokes GCC with the -M family of flags, which generate makefile dependency rules from the C code - but GCC doesn't like generating those rules for two architectures simultaneously. Luckily, there was an easy workaround in my case - I just needed to reconfigure igraph as follows:

./configure --disable-dependency-tracking

The --disable-dependency-tracking flag just stops Automake from determining the dependencies of each C file as it compiles it. It is totally harmless to disable this, because that dependency information is only used to rebuild less stuff on subsequent invocations of make - the worst that can happen is that if you run make more than once you will have to wait a bit longer. For more information on this feature, see the relevant section of the Automake manual.

After reconfiguring in this manner, the original make invocation worked correctly for igraph.

The hard method

The above method may fail for some libraries, in which case you can use this more arduous manual method. The idea is to run the library's build process from scratch twice: once to get the 32-bit library and once for the 64-bit library. We can then use the lipo tool to glue together the build artifacts from the two runs.

We start by building the 32-bit version:

make clean
make CXXFLAGS=-m32 CFLAGS=-m32 LDFLAGS=-m32 -j12

We now need to store the 32-bit build artifacts somewhere. Exactly which files you have to save will vary according to the library you are building, but for igraph this was sufficient:

mkdir -p ~/Junk/32 ~/Junk/64
cp src/.libs/libigraph.{a,0.dylib} ~/Junk/32

Now do the 64-bit build and once again save the artifacts somewhere:

make clean
make CXXFLAGS=-m64 CFLAGS=-m64 LDFLAGS=-m64 -j12
cp src/.libs/libigraph.{a,0.dylib} ~/Junk/64

Finally we can use lipo to finish up:

lipo -create ~/Junk/{32,64}/libigraph.a -output src/.libs/libigraph.a
lipo -create ~/Junk/{32,64}/libigraph.0.dylib -output src/.libs/libigraph.0.dylib

At this point, you can do sudo make install and get a universal version of the library installed.

If you want to check that your libraries are indeed universal, you can use lipo -info:

$ lipo -info src/.libs/libigraph.a
Architectures in the fat file: src/.libs/libigraph.a are: i386 x86_64

Conclusions

Building universal 32-bit/64-bit binaries is apparently fairly straightforward, but it was tricky to find documentation for the process. I hope this article helps others who need to get this done.

Solving GHC iconv problems on OS X 10.6

A problem that has plagued my GHC installation for a while is that whenever I tried to install any non-trivial package I would get a horrible link error like this:

Undefined symbols:
  "_iconv_close", referenced from:
      _hs_iconv_close in libHSbase-4.3.1.0.a(iconv.o)
     (maybe you meant: _hs_iconv_close)
  "_iconv_open", referenced from:
      _hs_iconv_open in libHSbase-4.3.1.0.a(iconv.o)
     (maybe you meant: _hs_iconv_open)
  "_iconv", referenced from:
      _hs_iconv in libHSbase-4.3.1.0.a(iconv.o)
     (maybe you meant: _hs_iconv_open, _hs_iconv_close , _hs_iconv )
  "_locale_charset", referenced from:
      _localeEncoding in libHSbase-4.3.1.0.a(PrelIOUtils.o)
ld: symbol(s) not found
collect2: ld returned 1 exit status

The reason for this is a combination of several factors:

  • The base library that comes with the GHC binary distribution wants to link against the standard Mac iconv
  • I have installed MacPorts libiconv, which renames the function that is named iconv_open in the standard iconv to libiconv_open
  • The Haskell library being installed by cabal depends transitively on some library that was built with something like extra-lib-dirs: /opt/local/lib, which causes -L/opt/local/lib to be passed to the linker
  • The linker's -L/opt/local/lib option occurs before -L/usr/lib, so the linker prefers to link against the MacPorts libiconv instead of the system one

In my case, it was the Haskell readline wrapper that was causing /opt/local/lib to be pulled in. I had to link the Haskell readline against MacPorts readline because the standard Mac libreadline is actually libeditline, which is almost-but-not-quite compatible and misses some crucial features.

There are several ways to fix the problem:

  • Perhaps you don't really need the MacPorts libiconv. In this case, you can stop it from being used by just doing port deactivate libiconv. This is the route I took.
  • Perhaps it's OK to link this particular library against the system libraries in preference to the MacPorts one. In this case, you can configure the package with cabal configure --extra-lib-dir=/usr/lib, so /usr/lib is searched before the MacPorts directory. This may fail if the package that needed -L/opt/local/lib requires a MacPorts version of some library that is also present in /usr/lib, though.
  • You could build GHC yourself and link it against the MacPorts library versions. This is not for the faint-hearted, but if the version of GHC you need is in MacPorts I imagine that you can just do port install ghc.

I'm glad I've finally got this sorted out. If you are still having trouble, you might find some helpful information in the threads that finally helped me to overcome the issue and prompted this writeup.

The case of the mysterious "thread blocked indefinitely in an MVar operation" exception

I recently tracked down the cause of the persistent "thread blocked indefinitely in an MVar operation" exceptions I was getting from the GHC runtime when I used my parallel-io library. There doesn't seem to be much information about this exception on the web, so I'm documenting my experiences here to help those unfortunate souls that come across it.

The essence of the parallel-io library is a combinator like this:

parallel :: [IO a] -> IO [a]

The IO actions in the input list are run (possibly in parallel) and their results are returned as a list. The library ensures that we never go beyond a certain (user-specified) degree of parallelism, but that's by-the-by for this post.

There are a number of interesting questions about the design of such a combinator, but the one that tripped me up is what to do with exceptions raised by the IO actions. What I used to do was have parallel run all the IO actions wrapped in a try, and wait for all of the actions to either throw an exception or complete normally. The combinator would then either return a list of the results or, if at least one action threw an exception, rethrow that exception on the thread that had invoked parallel.
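
To make this concrete, here is a minimal sketch of a combinator with those old semantics. This is just my own illustration (not the actual parallel-io code), and it ignores the parallelism-limiting machinery entirely:

import Control.Concurrent
import Control.Exception
import Control.Monad

-- Run every action wrapped in 'try', wait for all of them to finish, and
-- only then rethrow the first exception (if any).
parallelNaive :: [IO a] -> IO [a]
parallelNaive acts = do
    vars <- forM acts $ \act -> do
        var <- newEmptyMVar
        _ <- forkIO $ try act >>= putMVar var
        return var
    results <- mapM takeMVar vars
    case [ex | Left ex <- results] of
        (ex:_) -> throwIO (ex :: SomeException)
        []     -> return [r | Right r <- results]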

At first glance, this seems very reasonable. However, this turns out to be a disastrous choice that all but guarantees your program will go down in flames with "thread blocked indefinitely in an MVar operation".

Consider this simple program:

import Control.Concurrent
import Control.Exception
import Control.Monad

-- The main action executes on thread 1:
main = do
    res1 <- newEmptyMVar
    res2 <- newEmptyMVar
    
    wait <- newEmptyMVar
    
    -- Spawn thread 2:
    forkIO $ saveExceptionsTo res1 $ do
        () <- takeMVar wait
        putMVar res1 (Right "Hello")
    
    -- Spawn thread 3:
    forkIO $ saveExceptionsTo res2 $ do
        throwIO $ ErrorCall "Oops!"
        
        putMVar wait () -- Unblocks thread 2
        putMVar res2 (Right "World")
    
    -- Wait for the results of both threads:
    liftM2 (,) (takeMVar res1) (takeMVar res2) >>= print

saveExceptionsTo :: MVar (Either SomeException a) -> IO () -> IO ()
saveExceptionsTo mvar act = act `catch` \e -> putMVar mvar (Left e)

This code is similar to what the parallel combinator does in that the main thread (thread 1) waits for thread 2 and thread 3 to both return a result or exception to res1 and res2 respectively. After it has both, it shows the exception or result.

The fly in the ointment is that thread 2 is waiting on thread 3, so the thread dependency graph looks something like this:
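
    thread 3 --(wait)--> thread 2
    thread 3 --(res2)--> thread 1
    thread 2 --(res1)--> thread 1

(Each arrow is labelled with the MVar that connects the two threads.)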

(You should read an arrow that points from thread A to thread B as thread B is waiting on thread A to unblock it).

Unfortunately, our programmer has written buggy code, and the code running in thread 3 throws an exception before it can wake up thread 2. Thread 3 writes to the res2 MVar just fine, but thread 1 is still blocked on the res1 MVar, which will never be written to. At this point, everything comes crashing down with "thread blocked indefinitely in an MVar operation".

The most annoying part is that the exception that originally caused the error is squirrelled away in an MVar somewhere, and we never get to see it!

This goes to show that the exception semantics proposed above are dangerous. In the presence of dependencies between the actions in the parallel call, they tend to hide exceptions that we really want to see.

The latest version of the parallel-io library solves this by the simple method of using the asynchronous exceptions feature of GHC to rethrow any exceptions coming from an action supplied to parallel as soon as they occur. This unblocks thread 1 and lets us deal with the exception in whatever way is appropriate for our situation.
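
In sketch form (again my own illustration rather than the real parallel-io code, and reusing the imports from the sketch above), the change amounts to capturing the caller's ThreadId and forwarding any exception to it with throwTo the moment it happens:

-- Forward exceptions to the caller asynchronously as soon as they occur,
-- instead of parking them in an MVar until all the other actions are done.
parallelRethrowing :: [IO a] -> IO [a]
parallelRethrowing acts = do
    caller <- myThreadId
    vars <- forM acts $ \act -> do
        var <- newEmptyMVar
        _ <- forkIO $ (act >>= putMVar var)
                        `catch` \e -> throwTo caller (e :: SomeException)
        return var
    mapM takeMVar vars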

Although this bug is simple to describe and obvious in retrospect, it took a hell of a time to diagnose. It never ceases to amaze me exactly how difficult it is to write reliable concurrent programs!

Interpreting GHC cost centre stacks

A little-known feature of GHC is the ability to run a program (that has been compiled with profiling) with the +RTS -xc option. Now, every time an exception is raised, a stack trace will be dumped to stderr. This is currently the only way to get stack traces for your Haskell code. Unfortunately:

  • This stack trace is constructed from the cost-centre stack, so you'll only see names from modules you compiled in profiling mode with -auto-all or similar
  • The stack trace is shown regardless of whether the exception in question is actually handled or not - so you are likely to see lots of spurious stack traces where exceptions have been caught and handled

It's still better than the alternative (i.e. wailing about the lack of proper Haskell stack traces) though.
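
To see it in action, a test case doesn't need to be any bigger than this. (The flag spellings vary between GHC versions: -auto-all is spelled -fprof-auto in newer GHCs, and newer GHCs also need -rtsopts at compile time before they will accept +RTS flags.)

-- Crash.hs: compile with something like
--     ghc -prof -auto-all Crash.hs
-- and run with
--     ./Crash +RTS -xc
-- The cost-centre stack at the point of the 'error' call should be dumped
-- to stderr just before the program dies.
module Main where

main :: IO ()
main = do
    putStrLn "about to fail"
    error "boom"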

Now, the question is how to interpret the mess you get out. Here is an example stack trace from my current debugging session (I've added linebreaks to aid comprehension):


<Control.Concurrent.ParallelIO.Local.parallelE,
Control.Concurrent.ParallelIO.Local.parallel,
Control.Concurrent.ParallelIO.Local.parallel_,
Control.Concurrent.ParallelIO.Local.withPool,
Development.Shake.shake,Development.Shake.CAF>

(I'm trying to figure out why OpenShake reliably dies with the dreaded "blocked indefinitely on MVar" exception.)

The answer lies in the GHC source code (see fprintCCS in Profiling.c). The items at the front of the stack trace are at the top of the stack, and hence correspond approximately to the location from which the exception was thrown.

Looking forward, it would be awesome if e.g. the Google Summer of Code could sponsor a student to take my work on GHC ticket 3693 further. Personally I would find stack trace support for Haskell totally invaluable, and I'm sure I'm not alone.

TEDxCam Hackathon

I spent Saturday at the TEDxCam Hackathon. This was an event which brought together more than 40 coders to work in teams on data visualisation projects, competing for gadgets and tickets to the main conference.

Here is our team, hard at work (thanks to Dawson King for the photo and Paul Querelle for the brightness fix):

Our team decided to combine Timetric population data with general election outcomes from the Guardian to produce a movie showing the political evolution of the UK.

The idea was to colour each electoral constituency on the map according to the winner of the election there. There were several problems with completing this apparently simple task:

  • There is no good data source for where constituencies actually are, or what their boundaries are
  • There is also no good data source for population over time at a fine enough scale
  • The Guardian election data went back only 4 years

Of necessity, we had to find a smart way to use the limited data available. Because we didn't know constituency boundaries, but might have been able to find their centrepoints, we decided to use a Voronoi diagram to approximate the boundaries. The idea of such a diagram is that you divide up the 2D plane into cells around each input point so that every point in that cell is nearer to that input point than any other. Graphically, that means drawing the lines in this diagram:

In fact, we decided to go one better than that, and used a weighted Voronoi diagram. In this scheme, cells corresponding to constituencies could grow depending on the population of voters in that region.
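
For reference, the cell definitions involved are the standard ones (writing p_i for the centre point of constituency i and w_i for a weight that grows with its population):

V_i = \{ x : \|x - p_i\| \le \|x - p_j\| \ \forall j \}   (ordinary Voronoi)
V_i = \{ x : \|x - p_i\| / w_i \le \|x - p_j\| / w_j \ \forall j \}   (multiplicatively weighted)
V_i = \{ x : \|x - p_i\|^2 - w_i \le \|x - p_j\|^2 - w_j \ \forall j \}   (power diagram, the variant used for the final video)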

The idea was that we could then make a movie showing how these cells grew, shrank, and changed colour over time. Unfortunately it didn't quite come together on the day, but we did produce some lovely still images for the judges:

I spent some time on Sunday hacking the code up with some other data sources to produce a fully-fledged movie, which you can watch on Youtube. This version uses a power diagram, mainly because CGAL has some code for computing such diagrams much faster than we were able to compute our weighted Voronoi diagrams (10 minute rendering times became 10 seconds with this change!). Unfortunately, the code is buggier than a bucket full of cockroaches, as you will see if you watch the video 🙂

The Hackathon was tremendous fun, even if our project didn't turn out exactly as hoped! Thanks to the organisers (Cong Cong Bo and Dawson King) and to the excellent team who hacked through a long day to produce the stuff I've shown you above:

  • Max Bolingbroke
  • Nick Pilkington
  • Paul Querelle
  • Diva Tommei
  • George Weller

Ditaa support for gitit

I hacked together a quick plugin for the most excellent gitit wiki today. It's written in Haskell, so it's an absolute pleasure to write code for it.

What I added support for is a neat little tool called ditaa (DIagrams Through Ascii Art). Basically, in the markdown source of your Gitit wiki you can now write something like the following:

~~~ {.ditaa}
+--------+   +-------+    +-------+
|        | --+ ditaa +--> |       |
|  Text  |   +-------+    |diagram|
|Document|   |!magic!|    |       |
|     {d}|   |       |    |       |
+---+----+   +-------+    +-------+
    :                         ^
    |       Lots of work      |
    +-------------------------+
~~~

The plugin will then call out to the ditaa command line tool (written in Java, boo!) to render that to a beautiful image:

A ditaa image on a Gitit page

To get this set up for yourself, try the following from the root of your Gitit wiki:

git clone git://github.com/batterseapower/gitit-plugins.git batterseapower-plugins
wget http://downloads.sourceforge.net/project/ditaa/ditaa/0.9/ditaa0_9.zip?use_mirror=kent -O ditaa0_9.zip
unzip ditaa0_9.zip

Now edit your Gitit configuration file so the plugins list includes my plugin:

plugins: batterseapower-plugins/Ditaa.hs

That's it - restart Gitit and you should be ready to go!

Remembering The Hanzi

I've been learning Mandarin Chinese this year, and I've just finished a truly excellent book on the subject: Remembering The Hanzi, by Heisig and Richardson. The basic premise of this book is to introduce a system of mnemonic "stories" around each of the characters of the Chinese writing system which help you remember how to draw the character.

To give a simple example of how this works: first you learn some "primitive" collections of strokes (which may not necessarily be characters in their own right) that are easy to commit to visual memory. So pretty near the start of the book you learn the characters 口 (meaning mouth) and 卜 (meaning divination, but which is given the mental image of a magic wand). Then you start building up stories around those characters, so the story around 占, which Heisig has you remember as meaning "fortune telling", could be something like: "To tell fortunes you need two things: a mouth, and a magic wand with which to make the predictions".

Remembering The Simplified Hanzi

Remembering these "stories" is usually vastly easier than remembering abstract collections of strokes, so I've found it a great help in learning the language. If this sounds interesting to you, you can actually get a free preview of the first 100 characters from their website.

Thanks to this excellent system, and help from sites like Reviewing The Kanji (for use when I get stuck for stories for some character), I've been able to learn 1500 hanzi in my spare time over the last 5 months. I am of course using Anki, a spaced repetition system, to continually reinforce the stories and check my knowledge of the characters, and I've exported my progress from it, which looks like this:

Hanzi Progress

The flat area at the start is where my progress stalled at 100 characters while I waited for the book to arrive from Hawaii, since I only had access to the sample pages. Since then my progress has been... erratic, but more recently I've settled down into a regular rhythm where I try to learn 20 every day, and it looks like I've managed to average 17 a day during that period - not too bad!

For you Anki users, I've just uploaded the final version of my Remembering The Hanzi flashcard deck into the shared decks area, so you can base your own decks on it if you wish.

What's next? Well, I want to get the second book in the series when it comes out (allegedly sometime this summer) to take the number of characters I know to 3000. However, more importantly I'm going to start working on my knowledge of the pronunciation of some of these characters (the book links the characters to English keywords, not their Mandarin pronunciations), and how you use them in compound words, with an eye to taking the first level of the 汉语水平考试 test before the end of the year. I feel confident that this process can go quite quickly, with such a strong base of knowledge of the character system to build upon.

Constraint families

Various people, notably John Meacham, have proposed adding "context aliases" to Haskell. The basic idea is that you could write declarations like the following in Haskell:

context Num a = (Monoid a, Group a, Multiplicative a, FromInteger a)

Now, what this means is that when you write Num a in a constraint, you really mean all of Monoid a, Group a and so on. So the following program is valid, and presumably computes the number 7:

foo :: Num a => a -> a
foo = fromInteger 2 `mappend` fromInteger 5
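
In other words, the alias just expands at its use site, so the signature above is shorthand for something like:

foo :: (Monoid a, Group a, Multiplicative a, FromInteger a) => a -> a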

This lets you write shorter type signatures in programs which make ubiquitous use of type classes. However, in the brave new world of type families an obvious generalisation is to allow class-associated constraints. In particular, this lets us solve the classic problem where you can't make Set an instance of Monad:

class RMonad m where
    context RMonadElem a
    
    return :: RMonadElem a => a -> m a
    (>>=) :: (RMonadElem a, RMonadElem b) => m a -> (a -> m b) -> m b

instance RMonad [] where
    context RMonadElem a = ()
    
    return x = [x]
    xs >>= f = concatMap f xs

instance RMonad Set where
    context RMonadElem a = Ord a
    
    return x = singleton x
    s >>= f = fold (\a s' -> union (f a) s') empty s

A few interesting points:

  1. What is the kind signature of the context synonym? We probably need another "kind" - that of class constraints - which is preserved by n-ary tupling.
  2. Can you provide a default implementation for the context synonym? This would let us change the definition of the Monad type class in a backward-compatible way, by defaulting RMonadElem a to ().
  3. I mentioned this idea to Ganesh at Fun In The Afternoon, and he told me about his rmonad package, which essentially does exactly this, but by reifying the dictionaries explicitly as data (I've sketched the general idea just after this list). This is a nice demonstration that the approach is workable, but I think we could really do without the boilerplate dictionary management.
  4. Amusingly, GHC currently represents type classes internally as a normal data type, with some extra invariants. This means that most of the existing machinery for dealing with associated type synonyms could probably be reused unchanged to implement this extension!
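
To show the general shape of that "dictionaries as data" workaround, here is a minimal sketch of my own (an illustrative encoding, not the actual rmonad API; the real package is more general and, for instance, also threads evidence for the argument element type):

{-# LANGUAGE TypeFamilies, GADTs, KindSignatures #-}

import qualified Data.Set as Set

-- Explicit evidence that element type a is acceptable for the constrained
-- monad m; for Set that evidence carries an Ord dictionary.
data family ElemDict (m :: * -> *) a

data instance ElemDict [] a where
    ListElem :: ElemDict [] a

data instance ElemDict Set.Set a where
    SetElem :: Ord a => ElemDict Set.Set a

class RMonad m where
    rreturn :: ElemDict m a -> a -> m a
    rbind   :: ElemDict m b -> m a -> (a -> m b) -> m b

instance RMonad [] where
    rreturn _ x  = [x]
    rbind _ xs f = concatMap f xs

instance RMonad Set.Set where
    rreturn SetElem x = Set.singleton x
    rbind SetElem s f = Set.fold (\a acc -> Set.union (f a) acc) Set.empty s

-- Example use: the Ord evidence is supplied explicitly at each call site,
-- which is exactly the boilerplate that constraint families would remove.
example :: Set.Set Int
example = rbind SetElem (rreturn SetElem (2 :: Int)) (\x -> rreturn SetElem (x * x))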

I don't think that this introduces any horrendous type checking problems, and I can see how the desugarer has to treat dictionaries arising from such contexts. Nonetheless, there are probably some weirdnesses that I'm forgetting, so I should probably try to come up with a real specification (and implementation!) when I get some time...

(P.S.: It looks like some guys at Hac5 were working on adding simple constraint families to GHC - does anyone know how far they got with that?)