Security implications of PEP 383
I've been looking into improving GHC's support for non-ASCII text, and my investigations have led me to PEP 383.
One motivation behind this PEP is as follows: on Unix, the names of files, command line arguments, and environment variables should probably be treated as sequences of bytes. However, for good reasons it is quite natural for programs to act on them as if they were strings. This means that we have to choose some text encoding to use to interpret those byte sequences.
Unfortunately, whatever encoding you choose to use, it is quite likely that some byte sequences you encounter in practice will not in fact decode nicely using that encoding. An example would be a Big5 filename supplied as a command line argument to a program run in the UTF-8 locale.
In this case, what should happen? One sensible thing to do would be to fail, but this might be surprising. Python 3, with PEP 383, chooses to encode the non-decodable bytes as part of the string using surrogates. So if we try to parse a Big5 filename as a string we get a string full of surrogates representing the raw bytes we had to begin with.
This is a good thing to do, because if that string is then fed straight back into a function that encodes the filename for use on the file system, the original byte sequence can be exactly reconstituted: the surrogates are turned back into the raw bytes they stand for, and the rest of the string is encoded with the locale encoding. If the user instead attempts to do something else with a string containing surrogates (such as displaying it on the terminal), an exception will be raised.
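To make this concrete, here is a minimal sketch of the round trip (assuming a UTF-8 locale; the Big5 bytes are computed rather than hardcoded):

    # Minimal sketch of PEP 383 round-tripping, assuming a UTF-8 locale.
    big5_name = "你好".encode("big5")        # the raw bytes as they sit on disk

    # Decoding as UTF-8 chokes on the Big5 bytes, so surrogateescape smuggles
    # each undecodable byte through as a lone surrogate in the U+DCxx range:
    as_str = big5_name.decode("utf-8", "surrogateescape")
    print(repr(as_str))                      # the ASCII bytes decode normally

    # Encoding with the same error handler reconstitutes the exact bytes:
    assert as_str.encode("utf-8", "surrogateescape") == big5_name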
This is a reasonably neat solution to a hard problem. However, it has some weird implications. For example, consider this script, which uses a blacklist to control access to some files:
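(What follows is my sketch of such a script; the file name blacklist.txt and the exact details are illustrative assumptions.)

    #!/usr/bin/env python3
    # Sketch: deny access to any file whose name appears on a blacklist.
    # The blacklist is stored on disk in Big5, but we compare decoded strings.
    import sys

    with open("blacklist.txt", encoding="big5") as f:
        blacklist = [line.rstrip("\n") for line in f]

    file_name = sys.argv[1]
    if file_name in blacklist:
        sys.exit("Access denied!")

    # Not blacklisted: hand over the contents.
    with open(file_name, "rb") as f:
        sys.stdout.buffer.write(f.read())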
Let’s say that the blacklist contains a single entry, for the file 你好 (encoded in Big5, naturally).
Seems simple enough, right? Although I store file names as Big5, I compare Python’s Unicode strings. And indeed this program works perfectly when run from a terminal in the Big5 locale, with Big5 file names.
However, consider what happens when the terminal is set to UTF-8 and we invoke the script with the command line argument 你好 (encoded in Big5 of course, because the file name on disk is still Big5 even though we changed the terminal locale). In this case, Python 3 will attempt to decode the file name as UTF-8. Naturally, it will fail, so the Big5 filename will be represented in memory with surrogates.
Now for the punchline: when we come to compare that string (containing surrogates) with the entry from the blacklist (without surrogates) they will not be equal. Yet, when we go on to open the file, the filename (with surrogates) is decoded perfectly back into valid Big5 and hence we get the contents of the blacklisted file.
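Condensed into a few lines, the failure looks like this (again assuming a UTF-8 locale):

    big5_name = "你好".encode("big5")
    arg = big5_name.decode("utf-8", "surrogateescape")   # how argv is decoded
    entry = "你好"                                        # blacklist entry, decoded from Big5

    print(arg == entry)                                  # False: the check waves us through
    print(arg.encode("utf-8", "surrogateescape") == big5_name)  # True: open() reaches the real file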
In my opinion, the fact that the current locale encoding affects the results of string comparisons is deeply weird behaviour, and could well be the cause of subtle security bugs. This is just one reason I'm wary about adopting PEP 383-like behaviour for GHC.
P.S. For those who believe that my code is broken because you should only compare normalised Unicode strings: even after using unicodedata.normalize to normalise both sides to NFC, I get the same problem.
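Here is the check I mean, on the understanding that NFC normalisation leaves lone surrogates untouched:

    import unicodedata

    big5_name = "你好".encode("big5")
    arg = big5_name.decode("utf-8", "surrogateescape")

    # Lone surrogates have no composition or decomposition, so NFC changes nothing:
    print(unicodedata.normalize("NFC", arg) ==
          unicodedata.normalize("NFC", "你好"))           # still False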
P.P.S. I will further note that you get the same issue even if the blacklist and filename had been in UTF-8, but this time it gets broken from a terminal in the Big5 locale. I didn't show it this way around because I understand that Python 3 may only recently have started using the locale to decode argv, rather than being hardcoded to UTF-8.
The problem is not that your code is broken, but that your system is broken. If filenames are encoded as Big5, but your system thinks it's in a UTF-8 locale, then you're borked no matter what you do. Something's gonna get you in the end. It's as broken as if your /etc/passwd was renamed to /etc/password without letting any of the software on the system know.
Steve: I'd like to think that, but it's reasonably common for users to use an OS that mostly uses UTF-8 but then set their locale to something like Big5 for the purposes of working with legacy applications/data that use Big5.
I'd dearly like to junk support for such systems, though; it would make my life much simpler.
Nick, thanks for that info. It is certainly nice that there is a workaround, and perhaps this is indeed the best that can be done if you still want the convenience of representing filenames as strings.
Terry: thanks also for the link to the mailing list thread. It is certainly interesting, and the argument regarding latin1 is a compelling one -- this issue is indeed not specific to PEP 383. So the dangerous operation seems to be comparing strings that were originally created from byte strings in two different encodings. It's not clear to me whether it would be sensible for the language to check for this (perhaps by throwing an exception if you try it).
The nice thing about the way Python 3.2 does things is that it exposes the tools you need to fix this: os.fsencode and os.fsdecode.
So to get your blacklist to work correctly, you would read in the list of filenames from the blacklist, then use "os.fsencode()" on each of them to convert them to bytes.
You would then do the same thing with your command line argument: use os.fsencode() to get the actual bytes that will be passed to a filesystem call on a *nix system.
Do the comparison in the bytes domain, and the vagaries of whether a string was decoded properly or not shouldn't matter.
If you know the correct encoding of the command line argument, you also have the option of redecoding those bytes with the correct encoding.
I believe making such code work on Windows as well is a matter of encoding to UTF-16 rather than using fsencode.
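Concretely, a sketch of that approach might look like this (the file name blacklist.txt, and the assumption that its entries were written out by os.fsdecode-style decoding, are mine):

    import os
    import sys

    # Read the blacklist the same way the OS layer decodes names, so that
    # os.fsencode() recovers the exact on-disk bytes for every entry:
    with open("blacklist.txt", encoding=sys.getfilesystemencoding(),
              errors="surrogateescape") as f:
        blacklist = {os.fsencode(line.rstrip("\n")) for line in f}

    name = os.fsencode(sys.argv[1])   # the bytes the kernel will actually see
    if name in blacklist:
        sys.exit("Access denied!")
    with open(name, "rb") as f:
        sys.stdout.buffer.write(f.read())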
Discussed on the python-dev list starting with
http://mail.python.org/pipermail/python-dev/2011-March/110215.html
Consensus: the problem is not specific to PEP 383, but comes from using a blacklist in a world of multiple encodings (which give the same name multiple disguises).
Thanks for your reply Porges. An abstract path type would be good -- and used pervasively I think it could solve the problem in my post -- but it's not a magic bullet. We would have to rethink the standard library quite a lot: how do I get a path from a command line argument, for example? To do this safely you would need to expose the command line arguments as a list of bytes, which is rather user-unfriendly.
One thing to note is that UTF-8 can roundtrip invalid UTF-16 perfectly, but not vice versa, so UTF-8 as an internal encoding makes a lot more sense here (especially on Windows, where the incoming paths will all be UTF-16).
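For instance, a small illustration using Python's surrogatepass error handler, which performs this kind of lossless round trip in one direction:

    # A lone surrogate is invalid UTF-16, but UTF-8 plus surrogatepass can carry it:
    s = "\ud800"
    b = s.encode("utf-8", "surrogatepass")    # b'\xed\xa0\x80'
    assert b.decode("utf-8", "surrogatepass") == s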
I'd much rather have an opaque Path type which supports all the usual operations you want to perform. The user can convert this to a String for display purposes if and when they want (and it can fail then), but it should stay as a Path for most of its lifetime :) This won't fix the encoding mismatch example you've shown, however.