Quirks of the Matlab file format

The Matlab file format has become something of a standard for data exchange in quant finance circles. It is not only handy for those who are using the Matlab interactive environment itself, but also to users working in a diverse spectrum of language, thanks to widespread availability of libraries for reading and writing the files. The format itself also has the handy property of supporting compression — an essential property for keeping disk usage reasonable with working with the highly compressible data that is typical of financial timeseries.

At work we have implemented our own high-performance Java library for reading and writing these files. The Mathworks have helpfully published a complete description of the format online, which makes this task for the most part straightforward. Unfortunately, the format also has some dark and undocumented corners that I spent quite some time investigating. This post is intended to record a couple of these oddities for posterity.

Unicode

The Matlab environment supports Unicode strings, and so consequently Matlab files can contain arbitrary Unicode strings. Unfortunately this is one area where the capabilities of Matlab itself and those intended by the Mathworks spec diverge somewhat. Specifically:

  1. While the spec documents a miUTF8 storage type, Matlab itself only seems to understand a very limited subset of UTF-8. For example, it can't even decode an example file which simply contains the UTF-8 encoded character sequence ←↑→↓↔. It turns out that Matlab cannot read codepoints that are encoded as three or more bytes! This means it can only understand U+0000 to U+07FF, leaving us in a sad situation when Matlab can't even understand the BMP.
  2. The miUTF32 storage type isn't supported at all. For example,
    this file is correctly formed according to the spec but unreadable in Matlab.
  3. UTF-16 mostly works. As it stands, this is really your only option if you want the ability to roundtrip Unicode via Matlab. One issue is that Matlab chars aren't really Unicode codepoints - they are sequences of UTF-16 code units. However, this is an issue shared by Python 2 and Java, so even though it is broken at least it is broken in the "normal" way.

Interestingly, most 3rd party libraries seem to implement these parts of the spec better than Matlab itself does — for example, scipy's loadmat and savemat functions have full support for all of these text storage data types. (Scipy does still have trouble with non-BMP characters however.)

Compression

As mentioned, .mat files have support for storing compressed matrices. These are simply implemented as nested zlib-compressed streams. Alas, it appears that the way that Matlab is invoking zlib is slightly broken, with the following consequences:

  • Matlab does not attempt to validate that the trailing ZLib checksum is present, and doesn't check it even if it is there.
  • If you attempt to open a file containing a ZLib stream that has experienced corruption such that the decompressed data is longer than Matlab was expecting, the error is silently ignored.
  • When writing out a .mat file, Matlab will sometimes not write the ZLib checksum. This happens very infrequently though — most files it creates do have a checksum as you would expect.

Until recently scipy's Matlab reader would not verify the checksum either, but I added support for this after we saw corrupted .mat files in the wild at work.

I've reported these compression and Unicode problems to the Mathworks and they have acknowledged that they are bugs, but at this time there is no ETA for a fix.