Mar 29 2011

Security implications of PEP 383

I've been looking into improving GHC's support for non-ASCII text, and my investigations have led me to PEP 383.

One motivation behind this PEP is as follows: on Unix, the names of files, command line arguments, and environment variables should probably be treated as sequences of bytes. However, for good reasons it is quite natural for programs to act on them as if they were strings. This means that we have to choose some text encoding to use to interpret those byte sequences.

Unfortunately, whatever encoding you choose to use, it is quite likely that some byte sequences you encounter in practice will not in fact decode nicely using that encoding. An example would be a Big5 filename supplied as a command line argument to a program run in the UTF-8 locale.

In this case, what should happen? One sensible thing to do would be to fail, but this might be surprising. Python 3, with PEP 383, chooses to encode the non-decodable bytes as part of the string using surrogates. So if we try to parse a Big5 filename as a string we get a string full of surrogates representing the raw bytes we had to begin with.

This is a good thing to do because if that string is then immediately fed back into a function that just encodes the filename for use on the file system, the original byte sequence can be exactly reconstituted by turning the surrogates back into bytes and using the locale encoding for the rest. If the user attempts to do something else with a string containing surrogates (such as displaying it on the terminal), then an exception will be raised.
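The round trip described above can be sketched with Python 3's surrogateescape error handler, which is the mechanism PEP 383 introduces. This is only an illustration under stated assumptions: the UTF-8 codec is hardcoded here so the result does not depend on the machine's locale, whereas a real program would pick up the locale encoding, and the variable names are my own.

```python
# Big5 bytes for the file name 你好, as they might arrive in argv.
raw = "你好".encode("big5")

# Decode as if we were in a UTF-8 locale. Bytes that are not valid
# UTF-8 are kept as lone surrogates (U+DC80..U+DCFF) rather than
# causing a failure.
name = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler turns the surrogates back into the
# original bytes, so the file name survives the round trip.
round_tripped = name.encode("utf-8", errors="surrogateescape")

# A strict encode, which is what displaying the string amounts to,
# refuses the surrogates.
try:
    name.encode("utf-8")
    strict_encode_ok = True
except UnicodeEncodeError:
    strict_encode_ok = False
```

Here round_tripped is equal to raw, while strict_encode_ok ends up False.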

This is a reasonably neat solution to a hard problem. However, it has weird implications. For example, consider this script that uses a blacklist to control access to some files:

#!/usr/bin/env python3

import sys

file = sys.argv[1]

blacklist = open("blacklist.big5", encoding='big5').read().split()
print("Blacklist is:\n" + repr(blacklist))

if file in blacklist:
    print("Blacklisted file, not allowed!")
else:
    print("OK, I'll let you in!")
    print(open(file).read())

Let's say that the blacklist contains a single entry, for the file 你好 (encoded in Big5, naturally).

Seems simple enough, right? Although I store file names as Big5, I compare Python's Unicode strings. And indeed this program works perfectly when run from a terminal in the Big5 locale, with Big5 file names.

However, consider what happens when the terminal is set to UTF-8 and we invoke the script with the command line argument 你好 (encoded in Big5 of course, because the file name on disk is still Big5 even though we changed the terminal locale). In this case, Python 3 will attempt to decode the file name as UTF-8. Naturally, it will fail, so the Big5 filename will be represented in memory with surrogates.

Now for the punchline: when we come to compare that string (containing surrogates) with the entry from the blacklist (without surrogates) they will not be equal. Yet, when we go on to open the file, the filename (with surrogates) is encoded perfectly back into valid Big5 and hence we get the contents of the blacklisted file.
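The mismatch can be reproduced without touching the file system. The following is a minimal sketch, assuming a UTF-8 locale (simulated by hardcoding the codec) and using made-up variable names; it stands in for what the script does with sys.argv and open, and is not the script itself.

```python
# What reading blacklist.big5 with encoding='big5' produces.
blacklist = ["你好"]

# The same name as raw Big5 bytes on the command line, decoded the
# way Python 3 would decode argv in a UTF-8 locale (assumed here).
raw_argv = "你好".encode("big5")
arg = raw_argv.decode("utf-8", errors="surrogateescape")

# The blacklist check compares a string full of surrogates against
# a properly decoded one, so it does not match.
is_blocked = arg in blacklist

# Opening the file encodes the name back to bytes with the same
# handler, which yields exactly the blacklisted Big5 file name.
bytes_for_open = arg.encode("utf-8", errors="surrogateescape")
reaches_blacklisted_file = bytes_for_open == "你好".encode("big5")
```

With these inputs is_blocked is False while reaches_blacklisted_file is True, which is exactly the combination that lets the blacklisted file through.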

In my opinion, the fact that the current encoding affects the results of string comparisons is deeply weird behaviour and could probably be the cause of subtle security bugs. This is just one reason that I'm wary about adopting PEP 383-like behaviour for GHC.

P.S. For those who believe that my code is broken because you should only compare normalised Unicode strings, I will add that even after using unicodedata.normalize to normalise to NFC I get the same problem.

P.P.S. I will further note that you get the same issue even if the blacklist and filename had been in UTF-8, but this time it gets broken from a terminal in the Big5 locale. I didn't show it this way around because I understand that Python 3 may only have just recently started using the locale to decode argv, rather than being hardcoded to UTF-8.


Mar 8 2011

How to build 32/64 bit fat (universal) binaries

The OS X version of the Glasgow Haskell Compiler compiles Haskell into 32-bit code. Unfortunately, this means that if you are on a system where it is the default for libraries to be built in 64-bit mode, you tend to get errors when linking Haskell code telling you that you are trying to link 32-bit code against 64-bit code.

The best solution to this problem is to build all libraries you intend to link to from Haskell code as universal binaries that include both 32-bit and 64-bit versions of the code. These libraries will then work seamlessly with both Haskell code and also when pulled in as part of the build process for non-Haskell 64-bit executables.

If you can install the library using MacPorts, this is easy to do. Instead of doing:

sudo port install mylibrary

Just do:

sudo port install mylibrary +universal

However, if the library you want is not available through MacPorts, or the MacPorts version is not up to date, you will need to know how to build these universal libraries yourself. This is the process that I aim to explain in this post. I'm going to use igraph as my example library because it's what I needed to install (specifically, the unreleased v0.6).

The easy method

If you are lucky, building a universal library is as simple as changing how you invoke make. Run the library's configure script etc. as usual, and then invoke make as follows:

make CXXFLAGS="-arch i386 -arch x86_64" CFLAGS="-arch i386 -arch x86_64" LDFLAGS="-arch i386 -arch x86_64"

The -arch flags tell GCC and the linker to build and link both versions of the library. If this works, you are done. In the case of igraph, this wasn't quite enough - the above command failed with this error:

gcc-4.2: -E, -S, -save-temps and -M options are not allowed with multiple -arch flags

This occurs because igraph invokes GCC with the -M series of flags that generate makefile dependency rules from the C code - but GCC doesn't like generating those rules for two architectures simultaneously. Luckily, there was an easy workaround in my case - I just needed to reconfigure igraph as follows:

./configure --disable-dependency-tracking

The --disable-dependency-tracking flag just stops Automake from determining the dependencies of each C file as it compiles it. It is totally harmless to disable this because that dependency information is only used in order to rebuild less stuff upon subsequent invocations of make - the worst that happens when you disable it is that if you make more than once you will have to wait a bit longer. For more information on this feature see also the relevant section of the Automake manual.

After reconfiguring in this manner, the original make invocation worked correctly for igraph.

The hard method

The above method may fail for some libraries, in which case you can use this more arduous manual method. The idea is to run the library's build process from scratch twice: once to get the 32-bit library and once for the 64-bit library. We can then use lipo to glue together the build artifacts from the two runs.

We start by building the 32-bit version:

make clean
make CXXFLAGS=-m32 CFLAGS=-m32 LDFLAGS=-m32 -j12

We now need to store the 32-bit build artifacts somewhere. Exactly which files you have to save will vary according to the library you are building, but for igraph this was sufficient:

mkdir -p ~/Junk/32 ~/Junk/64
cp src/.libs/libigraph.{a,0.dylib} ~/Junk/32

Now do the 64-bit build and once again save the artifacts somewhere:

make clean
make CXXFLAGS=-m64 CFLAGS=-m64 LDFLAGS=-m64 -j12
cp src/.libs/libigraph.{a,0.dylib} ~/Junk/64

Finally we can use lipo to finish up:

lipo -create ~/Junk/{32,64}/libigraph.a -output src/.libs/libigraph.a
lipo -create ~/Junk/{32,64}/libigraph.0.dylib -output src/.libs/libigraph.0.dylib

At this point, you can do sudo make install and get a universal version of the library installed.

If you want to check that your libraries are indeed universal, you can use lipo -info:

$ lipo -info src/.libs/libigraph.a
Architectures in the fat file: src/.libs/libigraph.a are: i386 x86_64
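If you want to automate this check, for example in a build script, the line printed by lipo -info can be picked apart with a few lines of Python. This is a rough sketch only: parse_fat_architectures is a hypothetical helper of my own, and it assumes the output has exactly the shape shown above, returning None for anything else.

```python
def parse_fat_architectures(lipo_info_line):
    """Return the architectures named in a `lipo -info` fat-file line.

    Assumes the line looks like the example above:
    "Architectures in the fat file: <path> are: <arch> <arch> ..."
    Returns None if the line does not have that shape.
    """
    prefix = "Architectures in the fat file:"
    marker = " are: "
    line = lipo_info_line.strip()
    if not line.startswith(prefix) or marker not in line:
        return None
    # Split on the last " are: " so a path containing that text
    # does not confuse the parser.
    return line.rsplit(marker, 1)[1].split()


archs = parse_fat_architectures(
    "Architectures in the fat file: src/.libs/libigraph.a are: i386 x86_64"
)
is_universal = archs is not None and {"i386", "x86_64"} <= set(archs)
```

For the igraph output shown above, archs holds the two architecture names and is_universal is True.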

Conclusions

Building universal 32-bit/64-bit binaries is apparently fairly straightforward, but it was tricky to find documentation for the process. I hope this article helps others who need to get this done.