
Quirks of the Matlab file format

The Matlab file format has become something of a standard for data exchange in quant finance circles. It is handy not only for those using the Matlab interactive environment itself, but also for users working in a diverse spectrum of languages, thanks to the widespread availability of libraries for reading and writing the files. The format also supports compression — essential for keeping disk usage reasonable when working with the highly compressible data that is typical of financial timeseries.

At work we have implemented our own high-performance Java library for reading and writing these files. The Mathworks have helpfully published a complete description of the format online, which makes this task for the most part straightforward. Unfortunately, the format also has some dark and undocumented corners that I spent quite some time investigating. This post is intended to record a couple of these oddities for posterity.

Unicode

The Matlab environment supports Unicode strings, so Matlab files can contain arbitrary Unicode text. Unfortunately this is one area where the capabilities of Matlab itself and those intended by the Mathworks spec diverge somewhat. Specifically:

  1. While the spec documents a miUTF8 storage type, Matlab itself only seems to understand a very limited subset of UTF-8. For example, it can't even decode an example file which simply contains the UTF-8 encoded character sequence ←↑→↓↔. It turns out that Matlab cannot read codepoints whose UTF-8 encoding is three or more bytes long! This means it can only understand U+0000 to U+07FF, leaving us in the sad situation where Matlab can't even cover the whole of the BMP.
  2. The miUTF32 storage type isn't supported at all. For example, this file is correctly formed according to the spec but unreadable in Matlab.
  3. UTF-16 mostly works. As it stands, this is really your only option if you want the ability to roundtrip Unicode via Matlab. One issue is that Matlab chars aren't really Unicode codepoints: they are UTF-16 code units, so a non-BMP character occupies two chars. However, this is an issue shared by Python 2 and Java, so even though it is broken at least it is broken in the "normal" way.

Interestingly, most 3rd party libraries seem to implement these parts of the spec better than Matlab itself does — for example, scipy's loadmat and savemat functions have full support for all of these text storage data types. (Scipy does still have trouble with non-BMP characters however.)
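
As a quick check of that last point, here is roughly what a round-trip through scipy looks like. This is only a sketch: 'arrows.mat' is a throwaway file name picked for the example, and the exact shape and dtype of the array that loadmat hands back varies a little between scipy versions.

    import scipy.io

    # Write a string containing codepoints above U+07FF (the arrows from the
    # example above). scipy writes and reads these without complaint; whether
    # Matlab itself can open the resulting file is another matter.
    scipy.io.savemat('arrows.mat', {'s': u'\u2190\u2191\u2192\u2193\u2194'})  # ←↑→↓↔

    loaded = scipy.io.loadmat('arrows.mat')
    print(loaded['s'][0])  # the arrows round-trip intact via scipy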

Compression

As mentioned, .mat files have support for storing compressed matrices. These are simply implemented as nested zlib-compressed streams. Alas, it appears that the way that Matlab is invoking zlib is slightly broken, with the following consequences:

  • Matlab does not attempt to validate that the trailing zlib checksum is present, and doesn't check it even if it is there.
  • If a zlib stream has been corrupted in such a way that the decompressed data is longer than Matlab was expecting, the error is silently ignored when the file is opened.
  • When writing out a .mat file, Matlab will sometimes not write the zlib checksum. This happens very infrequently though — most files it creates do have a checksum as you would expect.

Until recently scipy's Matlab reader would not verify the checksum either, but I added support for this after we saw corrupted .mat files in the wild at work.
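
For reference, the check itself is simple enough to sketch in a few lines of Python. This is not scipy's actual code, just the idea: zlib raises an error if the Adler-32 trailer is present but wrong, while a missing trailer can only be detected by asking the decompressor whether it ever saw the end of the stream.

    import zlib

    def decompress_checked(data):
        # 'data' is the zlib stream found inside a compressed .mat element.
        d = zlib.decompressobj()
        out = d.decompress(data) + d.flush()
        # A wrong checksum raises zlib.error above; a missing or truncated
        # trailer just leaves the stream unterminated, which we detect here.
        if not d.eof:
            raise ValueError("zlib stream has no valid Adler-32 trailer")
        return out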

I've reported these compression and Unicode problems to the Mathworks and they have acknowledged that they are bugs, but at this time there is no ETA for a fix.

In Defence Of Clipperz

My last post, which detailed my switch to Mac, got a lot of attention from an unexpected audience: designers of password manager programs! Not only did I get a welcoming comment from the co-author of the wonderful 1Password, but I also received an email from Marco Barulli of Clipperz, an online password manager which I mentioned only briefly. Marco quite rightly wanted to point out that some of the things I said about his product could potentially be misleading, and so in the interests of accuracy I'm reproducing the meat of the exchange (with his permission!) here in public:

I would like to thank you for the kind mention of Clipperz.

You wrote about a "minuscule password limit", but Clipperz has no limit on password/passphrase lengths.
Also, with regard to the lack of integration with other password managers, I would like to point out the following features:

You can easily move to Clipperz all your passwords and access them even if you are offline.

(Ideally you shouldn't need any longer a software-based password manager)

I hope you will give Clipperz a chance!


As you can see, Marco was extremely courteous, despite my negative attitude towards something he has doubtless put a lot of hard work into! I had based my statement about a "minuscule password limit" purely on my experience that Clipperz tended to slow to a crawl once more than around 100 entries were present (I have at least twice that many passwords!): this is because the application has to do some pretty intensive encryption and decryption in Javascript, of all things! However it seems that Clipperz has been working to improve this and that my experience may no longer be valid. Marco says:

I myself have more than 100 cards and both Firefox and Safari are doing a decent job encrypting and decrypting on my old G4 PowerBook. I'm sure that your new MacPro (what a beautiful beast!) will crunch your cards with ease.

Finally, he mentioned some of the goodies Clipperz users can look forward to in the future:

New features under development:

  • iPhone version
  • Tags & search
  • More intuitive interface
  • Sharing

It looks like exciting times are ahead! However, I think I shall hold out for My1Password as a web-based solution to my password management needs, simply due to the convenience of integration between my desktop and web password databases. Whatever Marco may say, I still believe you can get some valuable ease-of-use gains from a piece of desktop software that can hook into all the disparate applications you need to use passwords with on a daily basis.

However, if you don't share my point of view, or all the things you need passwords for are web-based, I really can't do any better than to recommend Clipperz! It is very functional and polished, and is a remarkable technical achievement, both in all that it manages to accomplish purely with Javascript and in its novel zero-knowledge architecture, which means that only you can get to your passwords! Finally, I should have done this earlier, but I've now made a donation to the Clipperz team for how helpful they were in solving my cross-platform password problems in the past: this is just the kind of innovative software development that deserves our support!

Google 0wn5 Me

I've always used Google as a search engine: I can say that, having tried out all its competitors, its results are simply the most relevant.

Then about half a year ago I decided I needed to keep my bookmarks synchronized between my laptop and desktop, and Google Browser Sync was just the thing for me and has worked perfectly.

Two months later I got fed up with only being able to check my RSS feeds on my desktop and moved those to Google Reader instead of Desktop Sidebar.

Now, about a month ago I got fed up with all the spam that was getting through Thunderbird's mail filters and switched to Gmail, uploading all my existing email archives at the same time with this tool. I then had much less spam, a much better ability to organize my mail, and the ability to get my email anywhere.

Then a week ago the link at the top of Gmail to Calendar lured me in... now I depend on Google Calendar utterly to keep track of all the stuff I need to do that I used to keep on bits of paper I carried around with me.

Two days ago I bit the bullet and made iGoogle my home page. I can now get my email, calendar and TODO items (which I used to keep written on the back of my hands!) from the moment I start a web browser.

And it's only this afternoon that I realized how much my life had come to depend on these guys. It's actually fairly scary how much damage they could do to me if they decided to. But I'm now way too dependent on their integrated suite of tools for it to be worth my while to do anything about it. I really feel like I'm waking up to the web in a way I haven't before, despite having used it for donkey's years, and I'm sure that the masses who are as Google-agnostic as I was just over a year ago will start down the same path once they begin to own multiple internet-connected devices and require ubiquitous access to their data.

I've ridiculed it in the past, but maybe they are on to something with this online office suite idea after all...?

NOD32 IMON Component Is Just Plain Dangerous

Recently I started to receive some rather strange error messages from some .NET programs I had written that made use of the WebRequest class. The exception was something like this:

System.IO.IOException: Unable to read data from the transport connection: Attempted to read or write protected memory. This is often an indication that other memory is corrupt.. ---> System.AccessViolationException: Attempted to read or write protected memory. This is often an indication that other memory is corrupt.
   at System.Net.UnsafeNclNativeMethods.OSSOCK.recv(IntPtr socketHandle, Byte* pinnedBuffer, Int32 len, SocketFlags socketFlags)
   at System.Net.Sockets.Socket.Receive(Byte[] buffer, Int32 offset, Int32 size, SocketFlags socketFlags, SocketError& errorCode) 

This means that the native recv method was trying to muck with protected memory: a fairly bad and low-level failure! This kind of thing shouldn't really be able to happen, because the .NET Socket class is very robust and well tested, so I broke out the native code debugging features of Visual Studio to try and figure out what was going on. This involved allowing native code debugging on the project and turning off "Just My Code" in the VS debugging options. Furthermore, I turned on automatic debugging symbol loading using the MS symbol server (located here). This done, I reran the application, and this is what I saw:

        ntdll.dll!_KiFastSystemCallRet@0()
        ntdll.dll!_NtWaitForSingleObject@12()  + 0xc bytes
        kernel32.dll!_WaitForSingleObjectEx@12()  + 0x84 bytes
        ntdll.dll!ExecuteHandler2@20()  + 0x26 bytes
        ntdll.dll!ExecuteHandler@20()  + 0x24 bytes
>    ntdll.dll!_KiUserExceptionDispatcher@8()  + 0xf bytes
        imon.dll!20b2472a()
        [Frames below may be incorrect and/or missing, no symbols loaded for imon.dll]
        imon.dll!20b20bca()
        imon.dll!20b06e21()
        imon.dll!20b23afa()
        imon.dll!20b23afa()
        imon.dll!20b239f1()
        imon.dll!20b239f1()
        imon.dll!20b239de()
        imon.dll!20b24d79()
        kernel32.dll!_MultiByteToWideChar@24()  + 0x76 bytes
        imon.dll!20b19418()
        imon.dll!20b212ae()
        imon.dll!20b0602a()
        [Managed to Native Transition]
        System.dll!System.Net.Sockets.Socket.Receive(byte[] buffer = {Dimensions:[2]}, int offset = 0, int size, System.Net.Sockets.SocketFlags socketFlags = None, out System.Net.Sockets.SocketError errorCode = Success) + 0x139 bytes
        System.dll!System.Net.Sockets.Socket.Receive(byte[] buffer, int offset, int size, System.Net.Sockets.SocketFlags socketFlags) + 0x1d bytes
        System.dll!System.Net.Sockets.NetworkStream.Read(byte[] buffer, int offset, int size) + 0x78 bytes

I wasn't expecting to see this, because recv is actually located in Ws2_32.dll! It looked like the imon DLL was hooking this call and then blowing up internally due to some bug. I happened to know that imon.dll is part of the NOD32 internet protection suite I have installed, and indeed once that was disabled my programs no longer threw the exception! This is a clear case of a bug in their product, so I reported it to them. Unfortunately, their response was not particularly helpful:

In our next major release (3.0), we are doing away with IMON after
many years and replacing it with two more utilities.

In 1992, when NOD32 was introduced, very few programs operated at the
Winsock level. Today, in addition to Google and Microsoft, 100's of
other developers are creating software in this manner. That would be
fine, except for the fact that any app that operates here needs the
top spot in the stack, and only one program can have it.

As it is now, it can't be enabled at all on a server.

IMON was just the first layer of defense, a supplement. The strengths
of NOD32 are AMON, which scans every file that performs an action, as
it performs that action and the advanced heuristics which is stopping
90%+ of all new threats, before a definition is even written.

By quitting IMON now, you'll not only allow both programs to operate
together, but you'll also lose no coverage.

So what I'm taking away from this experience is that NOD32's IMON is just broken, and potentially dangerous: from now on I'll be turning it off on all my installations of NOD32, and I recommend you do the same.

XMLHTTPRequest + Authentication = Frustration

So I just spent the last 2 hours or so of my life buggering around with Ruby on Rails, trying to get it to do a RESTful login (i.e. one using HTTP Authorization headers, as opposed to the normal cookie stuff). There are some nice articles about pulling this feat off, such as here and here: the basic trick is to use XMLHTTPRequest to force the username/password from form fields into the browser's authentication cache. However, it seems that if the resource your XMLHTTPRequest is talking to never returns a 401 (Unauthorized), then XMLHTTPRequest never feels the need to send the Authorization header at all, even if you specify a username and password for it. I'm really at a loss as to why it has this bizarre behaviour, so I'm hoping I've misdiagnosed it, but it's looking unlikely.

This afternoon has been my first serious attempt to play with Rails, and the whole thing has been nothing but frustration! As well as the usual the-web-is-crap issues like the above, I've had to contend with documentation that is scattered over the Ruby and Rails websites, when it exists at all! Some of the stuff I've had to use (like the base.send :helper_method call to expose some things neatly to my views) seems vital but doesn't appear anywhere except as cursory mentions in changelogs. Furthermore, their habit of introducing breaking changes means some code examples I find don't work without some obscure patching, and when things go wrong there is so much framework magic going on that I have a hard time debugging it! Hopefully this feeling will fade with time, as lots of other people seem to praise Rails to the heavens, but I can't remember ever being this frustrated with a new technology :-).

Haskell Records Considered Grungy

Ugly field selection syntax
OK, the most trivial complaint first. If we have defined a record like this:

> data Bird = Bird { name :: String, wings :: Integer }

How do we go about accessing the 'name' and 'wings' fields of a record instance? If you are used to a language like C, you might say it would look something like this:

> (Bird { name = "Fred", wings = 2 }).name

Unfortunately, this isn't the case. Actually, declaring a record creates a named function for each field, which uses pattern matching to destructure the record it is passed and return the corresponding field value. So access actually looks like this:

> name (Bird { name = "Fred", wings = 2 })

Prefix notation may please the Lisp fans, but for me at least it can get a bit confusing, especially when you are dealing with nested fields. In this case, the code you write looks like:

> (innerRecordField . outerRecordField) record

Which (when read left to right, as is natural) is entirely the wrong order of accessors. However, it is possible to argue this is just a bug in my brain from having spent too long staring at C code... anyway, let's move on to more substantive complaints!

Namespace pollution
Imagine you're writing a Haskell program to model poultry farmers who work as programmers in their spare time, so naturally you want to add a Person record alongside the Bird record above:

> data Person = Person { name :: String, knowsHaskell :: Bool }

But I think you'll find the compiler has something to say about that....

Records.hs:4:23:
    Multiple declarations of `Main.name'
    Declared at: Records.hs:3:19
                 Records.hs:4:23

Ouch! This is because of the automatic creation of the 'name' function I alluded to earlier. Let's see what the Haskell compiler's desugaring would look like:

> data Bird = Bird String Integer

> name :: Bird -> String
> name (Bird value _) = value

> wings :: Bird -> Integer
> wings (Bird _ value) = value

> data Person = Person String Bool

> name :: Person -> String
> name (Person value _) = value

> knowsHaskell :: Person -> Bool
> knowsHaskell (Person _ value) = value

As you can see, we have two name functions in the same scope: that's no good! In particular, this means you can't have records which share field names. However, using the magic of type classes we can hack up something approaching a solution. Let's desugar the records as before, but instead of those name functions add this lot:

> class NameField a where
>   name :: a -> String

> instance NameField Bird where
>   name (Bird value _) = value

> instance NameField Person where
>   name (Person value _) = value

All we have done here is use the happy (and not entirely accidental) fact that the 'name' field is of type String in both records to create a type class, with instances that let us extract it from both record types. A use of this would look something like:

> showName :: (NameField a) => a -> IO ()
> showName hasNameField = putStrLn ("Name: " ++ name hasNameField)

> showName (Person { name = "Simon Peyton-Jones", knowsHaskell = True })
> showName (Bird { name = "Clucker", wings = 2 })

Great stuff! Actually, we could use this hack to establish something like a subtype relationship on records, since any record with at least the fields of another could implement all of its field type classes (like the NameField type class, in this example). Another way this could be extended is to make use of the multiparameter type classes and functional dependency extensions to GHC to let the field types differ.

Of course, this is all just one hack on top of another. Actually, considerable brainpower has been expended on improving the Haskell record system, such as in a 2003 paper by the aforementioned Simon Peyton-Jones here. This proposal would have let you write something like this:

> showName :: (r <: { name :: String }) -> IO ()
> showName { name = myName, .. } = putStrLn ("Name: " ++ myName)

The "r <: { name : String }" indicates any record which contains at least a field called name with type String can be consumed. The two dots ".." in the pattern match likewise indicate that fields other than name may be present. Note also the use of an anonymous record type: no data decleration was required in the code above. This is obviously a lot more concise than having to create the type classes yourselves, as we did, but actually we can make it even more concise by using another of the proposed extensions:

> showName { name, .. } = putStrLn ("Name: " ++ name)

Here, we omit the "name = myName" pattern match and make use of so-called "punning" to give us access to the name field: very nice! Unfortunately, all of this record-y goodness is speculative at least until Haskell' gets off the ground.

Record update is not first class

Haskell gives us a convenient syntax for record update. Let's say that one of our chickens strayed too close to the local nuclear reactor and sprouted an extra limb:

> exampleBird = Bird { name = "Son Of Clucker", wings = 2 }
> exampleBird { wings = 3 }

The last line above will return a Bird identical in all respects except that the wings will have been changed to 3. The naïve amongst us at this point might then think we could write something like:

> changeWings :: Integer -> Bird -> Bird
> changeWings x = { wings = x }

The intention here is to return a function that just sets a Bird record's "wings" field to x. (You can of course write "changeWings x bird = bird { wings = x }", but then the update itself is not a value you can pass around.) Unfortunately, the version above is not even remotely legal, which does make some sense: if it were, then to follow normal function application convention, record update would have to look more like this:

> { wings = 3 } exampleBird

Right, I think that's got everything that's wrong about Haskell records off my chest: do you know of any points I've missed?

Edit: Corrected my pattern match syntax (whoops :-). Thanks, Saizan!
Edit 2: Clarified some points in response to jaybee's comments on the Reddit comments page.