I’ve been looking into improving GHC’s support for non-ASCII text, and my investigations have lead to me to PEP 383.

One motivation behind this PEP is as follows: on Unix, the names of files, command line arguments, and environment variables should probably be treated as sequences of bytes. However, for good reasons it is quite natural for programs to act on them as if they were strings. This means that we have to choose some text encoding to use to interpret those byte sequences.

Unfortunately, whatever encoding you choose to use, it is quite likely that some byte sequences you encounter in practice will not in fact decode nicely using that encoding. An example would be a Big5 filename supplied as a command line argument to a program run in the UTF-8 locale.

In this case, what should happen? One sensible thing to do would be to fail, but this might be surprising. Python 3, with PEP 383, chooses to encode the non-decodable bytes as part of the string using surrogates. So if we try to parse a Big5 filename as a string we get a string full of surrogates representing the raw bytes we had to begin with.

This is a good thing to do because if that string is then immediately fed back into a function that just decodes the filename for use on the file system, the original byte sequence can be exactly reconstituted by decoding the surrogates back into bytes and using the locale encoding for the rest. If the user attempts to do something else with a string containing surrogates (such as e.g. display it to the terminal), then an exception will be raised.

This is a reasonably neat solution to a hard problem. However, it has weird implications. For example, consider this script that uses a black list to control access to some files:

#!/usr/bin/env python3

import sys

file = sys.argv[1]

blacklist = open("blacklist.big5", encoding='big5').read().split()
print("Blacklist is:\n" + repr(blacklist))

if file in blacklist:
	print("Blacklisted file, not allowed!")
	print("OK, I'll let you in!")

Let’s say that the blacklist contains a single entry, for the file 你好 (encoded in Big5, naturally).

Seems simple enough, right? Although I store file names as Big5, I compare Python’s Unicode strings. And indeed this program works perfectly when run from a terminal in the Big5 locale, with Big5 file names.

However, consider what happens when the terminal is set to UTF-8 and we invoke the script with the command line argument 你好 (encoded in Big5 of course, because the file name on disk is still Big5 even though we changed the terminal locale). In this case, Python 3 will attempt to decode the file name as UTF-8. Naturally, it will fail, so the Big5 filename will be represented in memory with surrogates.

Now for the punchline: when we come to compare that string (containing surrogates) with the entry from the blacklist (without surrogates) they will not be equal. Yet, when we go on to open the file, the filename (with surrogates) is decoded perfectly back into valid Big5 and hence we get the contents of the blacklisted file.

In my opinion, the fact that the current encoding affects the results of string comparisons is deeply weird behaviour and could probably be the cause of subtle security bugs. This is just one reason that I’m wary about adopting PEP 383-like behaviour for GHC.

P.S. For those who believe that my code is broken because you should only compare normalised unicode strings, I will add that even after using unicodedata.normalize to normalise to NFC I get the same problem.

P.P.S I will further note that you get the same issue even if the blacklist and filename had been in UTF-8, but this time it gets broken from a terminal in the Big5 locale. I didn’t show it this way around because I understand that Python 3 may only have just recently started using the locale to decode argv, rather than being hardcoded to UTF-8.