Compression of floating point timeseries

I recently had cause to investigate fast methods of storing and transferring financial timeseries. Naively, timeseries can be represented in memory or on disk as simple dense arrays of floating point numbers. This is an attractive representation with many nice properties:

  • Straightforward and widely used.
  • You have random access to the nth element of the timeseries with no further indexes required.
  • Excellent locality-of-reference for applications that process timeseries in time order, which is the common case.
  • Often natively supported by CPU vector instructions.

However, it is not a particularly space-efficient representation. Financial timeseries have considerable structure (e.g. Vodafone's price on T is likely to be very close to the price on T+1), and this structure can be exploited by compression algorithms to greatly reduce storage requirements. This is important either when you need to store a large number of timeseries, or need to transfer a smaller number of timeseries over a bandwidth-constrained network link.

Timeseries compression has received quite a bit of attention from both the academic/scientific programming community (see e.g. FPC and PFOR) and also practitioner communities such as the demoscene (see this presentation by a member of Farbrausch). This post summarises my findings about the effect that a number of easy-to-implement "filters" have on the final compression ratio.

In the context of compression algorithms, filters are simple invertible transformations that are applied to the stream in the hopes of making the stream more compressible by subsequent compressors. Perhaps the canonical example of a filter is the Burrows-Wheeler transform, which has the effect of moving runs of similar letters together. Some filters will turn an uncompressed input stream (from the user) of length N into an output stream (fed to the compressor) of length N, but in general filters may actually have the effect of making the stream longer. The hope is that the gains to compressibility are enough to recover the bytes lost to any encoding overhead imposed by the filter.

In my application, I was using the compression as part of an RPC protocol that would be used interactively, so I wanted to keep decompression time very low, and for ease-of-deployment I wanted to get results in the context of Java without making use of any native code. Consequently I was interested in which choice of filter and compression algorithm would give a good tradeoff between performance and compression ratio.

I determined this experimentally. In my experiments, I used timeseries associated with 100 very liquid US stocks retrieved from Yahoo Finance, amounting to 69MB of CSVs split across 6 fields per stock (open/high/low/close and adjusted close prices, plus volume). This amounted to 12.9 million floating point numbers.

Choice of compressor

To decide which compressors were contenders, I compressed these price timeseries with a few pure-Java implementations of the algorithms:

Compressor Compression time (s) Decompression time (s) Compression ratio
None 0.0708 0.0637 1.000
Snappy (org.iq80.snappy:snappy-0.3) 0.187 0.115 0.843
Deflate BEST_SPEED (JDK 8) 4.59 4.27 0.602
Deflate DEFAULT_COMPRESSION (JDK 8) 5.46 4.29 0.582
Deflate BEST_COMPRESSION (JDK 8) 7.33 4.28 0.580
BZip2 MIN_BLOCKSIZE (org.apache.commons:commons-compress-1.10) 1.79 0.756 0.540
BZip2 MAX_BLOCKSIZE (org.apache.commons:commons-compress-1.10) 1.73 0.870 0.515
XZ PRESET_MIN (org.apache.commons:commons-compress-1.10 + org.tukaani:xz-1.5) 2.66 1.20 0.469
XZ PRESET_DEFAULT (org.apache.commons:commons-compress-1.10 + org.tukaani:xz-1.5) 9.56 1.15 0.419
XZ PRESET_MAX (org.apache.commons:commons-compress-1.10 + org.tukaani:xz-1.5) 9.83 1.13 0.419

These numbers were gathered from a custom benchmark harness which simply compresses and then decompresses the whole dataset once. However, I saw the same broad trends confirmed by a JMH benchmark of the same combined operation:

Compressor Compress/decompress time (s) JMH compress/decompress time (s)
None 0.135 0.127 ± 0.002
Snappy (org.iq80.snappy:snappy-0.3) 0.302 0.215 ± 0.003
Deflate BEST_SPEED (JDK 8) 8.86 8.55 ± 0.15
Deflate DEFAULT_COMPRESSION (JDK 8) 9.75 9.35 ± 0.09
Deflate BEST_COMPRESSION (JDK 8) 11.6 11.4 ± 0.1
BZip2 MIN_BLOCKSIZE (org.apache.commons:commons-compress-1.10) 2.55 3.10 ± 0.04
BZip2 MAX_BLOCKSIZE (org.apache.commons:commons-compress-1.10) 2.6 3.77 ± 0.31
XZ PRESET_MIN (org.apache.commons:commons-compress-1.10 + org.tukaani:xz-1.5) 3.86 4.08 ± 0.12
XZ PRESET_DEFAULT (org.apache.commons:commons-compress-1.10 + org.tukaani:xz-1.5) 10.7 11.1 ± 0.1
XZ PRESET_MAX (org.apache.commons:commons-compress-1.10 + org.tukaani:xz-1.5) 11.0 11.5 ± 0.4

What we see here is rather impressive performance from BZip2 and Snappy. I expected Snappy to do well, but BZip2's good showing surprised me. In some previous (unpublished) microbenchmarks I've not seen GZipInputStream (a thin wrapper around Deflate with DEFAULT_COMPRESSION) be quite so slow, and my results also seem to contradict other Java compression benchmarks.

One contributing factor may be that the structure of the timeseries I was working with in that unpublished benchmark was quite different: there was a lot more repetition (runs of NaNs and zeroes), and compression ratios were consequently higher.

In any event, based on these results I decided to continue my evaluation with both Snappy and BZip2 MIN_BLOCKSIZE. It's interesting to compare these two compressors because, unlike BZ2, Snappy doesn't perform any entropy encoding.
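
(For reference, here is a minimal sketch of how two of these pure-Java codecs can be driven. This is not the benchmark harness itself, just an illustration of the APIs involved.)

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.Deflater;
import java.util.zip.DeflaterOutputStream;
import org.apache.commons.compress.compressors.bzip2.BZip2CompressorOutputStream;

class CompressorSketch {
    // Deflate at the BEST_SPEED level, as in the benchmark table above.
    static byte[] deflate(byte[] input) throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        try (DeflaterOutputStream out = new DeflaterOutputStream(baos, new Deflater(Deflater.BEST_SPEED))) {
            out.write(input);
        }
        return baos.toByteArray();
    }

    // BZip2 from commons-compress at the minimum block size.
    static byte[] bzip2(byte[] input) throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        try (BZip2CompressorOutputStream out =
                 new BZip2CompressorOutputStream(baos, BZip2CompressorOutputStream.MIN_BLOCKSIZE)) {
            out.write(input);
        }
        return baos.toByteArray();
    }
}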

Filters

The two filters that I evaluated were transposition and zig-zagged delta encoding.

Transposition

The idea behind transposition (also known as "shuffling") is as follows. Let's say that we have three floating point numbers, each occupying 4 bytes:

On a big-endian system this will be represented in memory row-wise by the 4 consecutive bytes of the first float (MSB first), followed by the 4 bytes of the second float, and so on. In contrast, a transposed representation of the same data would encode all of the MSBs first, followed by all of the second-most-significant bytes, and so on, in a column-wise fashion:

The reason you might think that writing the data column-wise would improve compression is that you might expect e.g. the most significant bytes of the floats in a timeseries to be very similar to each other. By moving these similar bytes closer together you increase the chance that compression algorithms will be able to find repeating patterns in them, undisturbed by the essentially random content of the LSB.
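
As an illustration, here is a minimal sketch of byte-level transposition for an array of single-precision floats (this is just to pin down the idea, not the code used in the experiments):

import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class ByteTranspose {
    // Write the MSB of every float first, then the second byte of every float, and so on.
    static byte[] transpose(float[] xs) {
        ByteBuffer buf = ByteBuffer.allocate(xs.length * 4).order(ByteOrder.BIG_ENDIAN);
        for (float x : xs) buf.putFloat(x);
        byte[] in = buf.array(), out = new byte[in.length];
        for (int b = 0; b < 4; b++)               // byte position within a float (the "column")
            for (int i = 0; i < xs.length; i++)   // gather that byte from every float
                out[b * xs.length + i] = in[i * 4 + b];
        return out;
    }
}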

Field transposition

Analogous to the byte-level transposition described above, another thing we might try is transposition at the level of a float subcomponent. Recall that floating point numbers are divided into sign, exponent and mantissa components. For single precision floats this looks like:

Inspired by this, we might transpose the data field-wise — i.e. serialize all the signs first, followed by all the exponents, then all the mantissas:

(Note that I'm inserting padding bits to keep multibit fields byte aligned — more on this later on.)

We might expect this transposition technique to improve compression by preventing changes in unrelated fields from causing us to be unable to spot patterns in the evolution of a certain field. A good example of where this might be useful is the sign bit: for many timeseries of interest we expect the sign bit to be uniformly 1 or 0 (i.e. all negative or all positive numbers). If we encoded the float without splitting it into fields, then that one very predictable bit is mixed in with 31 much more varied bits, which makes it much harder to spot this pattern.
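
To make the field split concrete, here is a sketch of how the three fields of a single-precision float can be pulled apart. This is a simplification of the real filter: for brevity it spends a whole byte on each sign bit, whereas the code used later in this post packs signs into single bits.

import java.io.ByteArrayOutputStream;

class FieldSplit {
    // Split single-precision floats into separate sign / exponent / mantissa streams.
    static void split(float[] xs, ByteArrayOutputStream signs,
                      ByteArrayOutputStream exponents, ByteArrayOutputStream mantissas) {
        for (float x : xs) {
            int bits = Float.floatToRawIntBits(x);
            signs.write((bits >>> 31) & 0x1);       // 1-bit sign (one byte per sign here)
            exponents.write((bits >>> 23) & 0xFF);  // 8-bit exponent
            int mantissa = bits & 0x7FFFFF;         // 23-bit mantissa, padded to 3 bytes
            mantissas.write((mantissa >>> 16) & 0xFF);
            mantissas.write((mantissa >>> 8) & 0xFF);
            mantissas.write(mantissa & 0xFF);
        }
    }
}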

Delta encoding

In delta encoding, you encode consecutive elements of a sequence not by their absolute value but rather by how much larger they are than the previous element in the sequence. You might expect this to aid later compression of timeseries data because, although a timeseries might have an overall trend, you would expect the day-to-day variation to be essentially unchanging. For example, Vodafone's stock price might be generally trending up from 150p at the start of the year to 200p at the end, but you expect it won't usually change by more than 10p on any individual day within that year. Therefore, by delta-encoding the sequence you would expect to increase the probability of the sequence containing a repeated substring and hence its compressibility.

This idea can be combined with transposition, by applying the transposition to the deltas rather than the raw data to be compressed. If you do go this route, you might then apply a trick called zig-zagging (used in e.g. protocol buffers) and store your deltas such that small negative numbers are represented as small positive ints. Specifically, you might store the delta -1 as 1, 1 as 2, -2 as 3, 2 as 4 and so on. The reasoning behind this is that you expect your deltas to be both positive and negative, but certainly clustered around 0. By using zig-zagging, you tend to cause the MSB of your deltas to become 0, which then in turn leads to extremely compressible long runs of zeroes in your transposed version of those deltas.
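
A sketch of delta encoding combined with zig-zagging, for a stream of int-sized field values (exactly how the post's code stitches these steps together isn't shown, so treat this as illustrative only):

class DeltaZigZag {
    // Replace each value by its difference from the previous one, then map signed deltas
    // onto small unsigned ints: 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, 2 -> 4, ...
    static int[] encode(int[] xs) {
        int[] out = new int[xs.length];
        int prev = 0;
        for (int i = 0; i < xs.length; i++) {
            int delta = xs[i] - prev;
            prev = xs[i];
            out[i] = (delta << 1) ^ (delta >> 31);    // zig-zag: the sign moves to the low bit
        }
        return out;
    }

    static int[] decode(int[] zs) {
        int[] out = new int[zs.length];
        int prev = 0;
        for (int i = 0; i < zs.length; i++) {
            int delta = (zs[i] >>> 1) ^ -(zs[i] & 1); // undo the zig-zag
            prev += delta;
            out[i] = prev;
        }
        return out;
    }
}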

Special cases

One particular floating point number is worth discussing: NaN. It is very common for financial timeseries to contain a few NaNs scattered throughout. For example, when a stock exchange is on holiday no official close prices will be published for the day, and this tends to be represented as a NaN in a timeseries of otherwise similar prices.

Because NaNs are both common and very dissimilar to other numbers that we might encounter, we might want to encode them with a special short representation. Specifically, I implemented a variant of the field transposition above, where the sign bit is actually stored extended to a two bit "descriptor" value with the following interpretation:

Bit 1 Bit 2 Interpretation
0 0 Zero
0 1 NaN
1 0 Positive
1 1 Negative

The mantissa and exponent are not stored at all if the descriptor is 0 (Zero) or 1 (NaN).

Note that this representation of NaNs erases the distinction between different NaN values, when in reality there are e.g. 16,777,214 distinct single precision NaNs. This technically makes this a lossy compression technique, but in practice it is rarely important to be able to distinguish between different NaN values. (The only application that I'm aware of that actually depends on the distinction between NaNs is LuaJIT.)
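
A sketch of how such a descriptor might be computed (one detail here is my own assumption: this version folds -0.0 in with Zero, so like the NaN handling it is very slightly lossy):

class Descriptor {
    static final int ZERO = 0b00, NAN = 0b01, POSITIVE = 0b10, NEGATIVE = 0b11;

    // Map a float onto the 2-bit descriptor from the table above. For ZERO and NAN the
    // exponent and mantissa are not written at all, which is where the saving comes from.
    static int descriptorOf(float x) {
        if (x == 0.0f)      return ZERO;      // note: -0.0 also ends up here
        if (Float.isNaN(x)) return NAN;
        return x > 0.0f ? POSITIVE : NEGATIVE;
    }
}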

Methodology

In my experiments (available on Github) I tried all combinations of the following compression pipeline:

  1. Field transposition: on (start by splitting each number into 3 fields) or off (treat whole floating point number as a single field)?
  2. (Only if field transposition is being used.) Special cases: on or off?
  3. Delta encoding: off (store raw field contents) or on (store each field as an offset from the previous field)? When delta encoding was turned on, I additionally used zig-zagging.
  4. Byte transposition: given that I have a field, should I transpose the bytes of that field? In fact, I exhaustively investigated all possible byte-aligned transpositions of each field.
  5. Compressor: BZ2 or Snappy?

I denote a byte-level transposition as a list of numbers summing to the number of bytes in one data item. So for example, a transposition for 4-byte numbers which wrote all of the LSBs first, followed by all of the next-most-significant bytes etc would be written as [1, 1, 1, 1], while one that broke each 4-byte quantity into two 16-bit chunks would be written [2, 2], and the degenerate case of no transposition would be [4]. Note that numbers occur in the list in increasing order of the significance of the bytes in the item that they manipulate.
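
In code, a byte-aligned transposition described by such a list might look like the following sketch (the layout of items and the ordering of bytes within a chunk are my assumptions; the real code may differ):

class GroupedTranspose {
    // items[i][b] is byte b of item i, with b = 0 the least significant byte.
    // groups lists the chunk widths in increasing order of significance, e.g. {2, 1, 1}.
    static byte[] transpose(byte[][] items, int[] groups) {
        int width = items[0].length;
        byte[] out = new byte[items.length * width];
        int pos = 0, base = 0;
        for (int g : groups) {                    // each group becomes one output "column"
            for (byte[] item : items)             // emit that chunk of every item in turn
                for (int b = 0; b < g; b++)
                    out[pos++] = item[base + b];
            base += g;                            // advance to the more significant bytes
        }
        return out;
    }
}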

As discussed above, in the case where a field (exponent or mantissa) wasn't an exact multiple of 8 bits long, my filters padded the field to the nearest byte boundary before sending it to the compressor. This means that the filtering process actually makes the data substantially larger (e.g. 52-bit double mantissas are padded to 56 bits, becoming 7.7% larger in the process). This not only makes the filtering code simpler, but also turns out to be essential for good compression when using Snappy, which is only able to detect byte-aligned repetition.

Without further ado, let's look at the results.

Dense timeseries

I begin by looking at dense timeseries where NaNs do not occur. With such data, it's clear that we won't gain from the "special cases" encoding above, so results in this section are derived from a version of the compression code where we just use 1 bit to encode the sign.

Single-precision floats

The minimum compressed size (in bytes) achieved for each combination of parameters is as follows:

Exponent Method Mantissa Method BZ2 Snappy
Delta Delta 6364312 9067141
Delta Literal 6283216 8622587
Literal Delta 6372444 9071864
Literal Literal 6306624 8626114

(This table says that, for example, if we delta-encode the float exponents but literal-encode the mantissas, then the best transposition scheme achieved a compressed size of 6,283,216 bytes.)

The story here is that delta encoding is strictly better than literal encoding for the exponent, but conversely literal encoding is better for the mantissa. In fact, if we look at the performance of each possible mantissa transposition, we can see that delta encoding tends to underperform in those cases where the MSB is split off into its own column, rather than being packaged up with the second-most-significant byte. This result is consistent across both BZ2 and Snappy.

Mantissa Transposition Mantissa Method BZ2 Snappy
[1, 1, 1] Delta 7321926 9199766
[1, 1, 1] Literal 7226293 9154821
[1, 2] Delta 7394645 9317258
[1, 2] Literal 7824557 9420206
[2, 1] Delta 6514554 9099462
[2, 1] Literal 6283216 8622587
[3] Delta 6364312 9067141
[3] Literal 6753718 9475316

The other interesting feature of these results is that transposition tends to hurt BZ2 compression ratios. It always makes things worse with delta encoding, but even with literal encoding only one particular transposition ([2, 1]) actually strongly improves a BZ2 result. Things are a bit different for Snappy: although once again delta is always worse with transposition enabled, transposition always aids Snappy in the literal case — though once again the effect is strongest with [2, 1] transposition.

The strong showing for [2, 1] transposition suggests to me that the lower-order bits of the mantissa are more correlated with each other than they are with the MSB. This makes some sense: because equities trade with a fixed tick size, prices are actually quantised into a relatively small number of values, which will tend to cause the lower-order bits of the mantissa to become correlated.

Finally, we can ask what would happen if we didn't make the mantissa/exponent distinction at all and instead just packed those two fields together:

Method BZ2 Snappy
Delta 6254386 8847839
Literal 6366714 8497983

These numbers don't show any clear preference for either of the two approaches. For BZ2, delta performance is improved by not doing the splitting, at the cost of larger outputs when using the literal method, while for Snappy we have the opposite: literal performance is improved while delta performance is harmed. What is true is that, in the best case, the compressed sizes we observe here are better than the best achievable in the split case.

In some ways it is quite surprising that delta encoding ever beats literal encoding in this scenario, because it's not clear that the deltas we compute here are actually meaningful and hence likely to generally be small.

We can also analyse the best available transpositions in this case. Considering BZ2 first:

BZ2 Rank Delta Transposition Size Literal Transposition Size
1 [4] 6254386 [2, 2] 6366714
2 [3, 1] 6260446 [2, 1, 1] 6438215
3 [2, 2] 6395954 [3, 1] 7033810
4 [2, 1, 1] 6402354 [4] 7109668
5 [1, 1, 1, 1] 7136612 [1, 1, 1, 1] 7327437
6 [1, 2, 1] 7227282 [1, 1, 2] 7405128
7 [1, 1, 2] 7285350 [1, 2, 1] 8039403
8 [1, 3] 7337277 [1, 3] 8386647

Just as above, delta encoding does best when no transposition at all is used, and generally gets worse as the transposition gets more and more "fragmented". On the other hand, literal encoding does well with transpositions that tend to keep together the first two bytes (i.e. the exponent + the leading bits of the mantissa).

Now let's look at the performance of the unsplit data when compressed with Snappy:

Snappy Rank Delta Transposition Size Literal Transposition Size
1 [3, 1] 8847839 [2, 1, 1] 8497983
2 [2, 1, 1] 8883392 [1, 1, 1, 1] 9033135
3 [1, 1, 1, 1] 8979988 [1, 2, 1] 9311863
4 [1, 2, 1] 9093582 [3, 1] 9404027
5 [2, 2] 9650796 [2, 2] 10107042
6 [4] 9659888 [1, 1, 2] 10842190
7 [1, 1, 2] 9847987 [4] 10942085
8 [1, 3] 10524215 [1, 3] 11722159

The Snappy results are very different from the BZ2 case. Here, the same sort of transpositions tend to do well with both the literal and delta methods. The kinds of transpositions that are successful are those that keep together the exponent and the leading bits of the mantissa, though even fully-dispersed transpositions like [1, 1, 1, 1] put in a strong showing.

That's a lot of data, but what's the bottom line? For Snappy, splitting the floats into mantissa and exponent before processing does seem to give slightly more consistently small outputs than working with unsplit data. The BZ2 situation is less clear, but only because the exact choice doesn't seem to make a ton of difference. Therefore, my recommendation for single-precision floats is to delta-encode exponents, and to use literal encoding for mantissas with [2, 1] transposition.

Double-precision floats

While there were only 4 different transpositions for single-precision floats, there are 2 ways to transpose a double-precision exponent, and 64 ways to transpose the mantissa. This makes the parameter search for double precision considerably more computationally expensive. The results are:

Exponent Method Mantissa Method BZ2 Snappy
Delta Delta 6485895 12463583
Delta Literal 6500390 10437550
Literal Delta 6456152 12469132
Literal Literal 6475579 10439869

These results show interesting differences between BZ2 and Snappy. For BZ2 there is not much in it, but it's consistently always better to literal-encode the exponent and delta-encode the mantissa. For Snappy, things are exactly the other way around: delta-encoding the exponent and literal-encoding the mantissa is optimal.

The choice of exponent transposition scheme has the following effect:

Exponent Transposition Exponent Method BZ2 Snappy
[1, 1] Delta 6485895 10437550
[1, 1] Literal 6492837 10439869
[2] Delta 6496032 10560467
[2] Literal 6456152 10564737

It's not clear, but [1, 1] transposition might be optimal. Bear in mind that double exponents are only 11 bits long, so the lower 5 bits of the LSB being encoded here will always be 0. Using [1, 1] transposition might better help the compressor get a handle on this pattern.

When looking at the best mantissa transpositions, there are so many possible transpositions that we'll consider BZ2 and Snappy one by one, examining just the top 10 transposition choices for each. BZ2 first:

BZ2 Rank Delta Mantissa Transposition Size Literal Mantissa Transposition Size
1 [7] 6456152 [6, 1] 6475579
2 [6, 1] 6498686 [1, 5, 1] 6495555
3 [4, 3] 6962167 [2, 4, 1] 6510416
4 [4, 2, 1] 6994270 [1, 1, 4, 1] 6510416
5 [1, 6] 7040401 [3, 3, 1] 6692613
6 [3, 4] 7054835 [2, 1, 3, 1] 6692613
7 [1, 5, 1] 7092230 [1, 1, 1, 3, 1] 6692613
8 [2, 5] 7108000 [1, 2, 3, 1] 6692613
9 [2, 4, 1] 7176760 [1, 6] 6926231
10 [3, 3, 1] 7210100 [7] 6931475

We can see that literal encoding tends to beat delta encoding, though the very best size was in fact achieved via a simple untransposed delta representation. In both the literal and the delta case, the encodings that do well tend to keep the middle 5 bytes of the mantissa grouped together, which supports our idea that these bytes tend to be highly correlated, with most of the information being encoded in the MSB.

Turning to Snappy:

Snappy Rank Delta Mantissa Transposition Size Literal Mantissa Transposition Size
1 [5, 1, 1] 12463583 [3, 2, 1, 1] 10437550
2 [1, 3, 2, 1] 12678887 [1, 1, 1, 2, 1, 1] 10437550
3 [6, 1] 12737838 [1, 2, 2, 1, 1] 10437550
4 [4, 2, 1] 12749067 [2, 1, 2, 1, 1] 10437550
5 [7] 12766804 [1, 1, 1, 1, 1, 1, 1] 10598739
6 [1, 3, 1, 1, 1] 12820360 [2, 1, 1, 1, 1, 1] 10598739
7 [3, 1, 2, 1] 12824154 [3, 1, 1, 1, 1] 10598739
8 [4, 1, 1, 1] 12890981 [1, 2, 1, 1, 1, 1] 10598739
9 [3, 3, 1] 12915900 [1, 1, 1, 1, 2, 1] 10715444
10 [3, 1, 1, 1, 1] 12946767 [3, 1, 2, 1] 10715444

The Snappy results are strikingly different from the BZ2 ones. Here, just as with BZ2, literal encoding tends to beat delta encoding, but the difference is much more pronounced than in the BZ2 case. Furthermore, the kinds of transpositions that minimize the size of the literal-encoded data here are very different from the transpositions that were successful with BZ2: in that case we wanted to keep the middle bytes together, while here the scheme [1, 1, 1, 1, 1, 1, 1] where every byte has its own column is not far from optimal.

And now considering results for the case where we do not split the floating point number into mantissa/exponent components:

Method BZ2 Snappy
Delta 6401147 11835533
Literal 6326562 9985079

These results show a clear preference for literal encoding, which is definitely what we expect, given that delta encoding is not obviously meaningful for an unsplit number. We also see results that are universally better than those for the split case: it seems that splitting the number into fields is actually a fairly large pessimisation! This is probably caused by the internal fragmentation implied by our byte-alignment of the data, which is a much greater penalty for doubles than it was for singles. It would be interesting to repeat the experiment without byte-alignment.

We can examine which transposition schemes do best in the unsplit case. BZ2 first:

BZ2 Rank Delta Transposition Size Literal Transposition Size
1 [8] 6401147 [6, 2] 6326562
2 [7, 1] 6407935 [1, 5, 2] 6351132
3 [6, 2] 6419062 [2, 4, 2] 6360533
4 [6, 1, 1] 6440333 [1, 1, 4, 2] 6360533
5 [4, 3, 1] 6834375 [3, 3, 2] 6562859
6 [4, 4] 6866167 [1, 2, 3, 2] 6562859
7 [4, 2, 1, 1] 6903123 [2, 1, 3, 2] 6562859
8 [4, 2, 2] 6903322 [1, 1, 1, 3, 2] 6562859
9 [1, 7] 6979216 [6, 1, 1] 6598104
10 [1, 6, 1] 6983998 [1, 5, 1, 1] 6621003

Recall that double precision floating point numbers have 11 bits of exponent and 52 bits of mantissa. We can actually see that showing up in the literal results above: the transpositions that do best are those that either pack together the exponent and the first bits of the mantissa, or have a separate column for just the exponent information (e.g. [1, 5, 2] or [2, 4, 2]).

And Snappy:

Snappy Rank Delta Transposition Size Literal Transposition Size
1 [5, 1, 1, 1] 11835533 [1, 1, 1, 2, 1, 1, 1] 9985079
2 [1, 3, 2, 1, 1] 12137223 [2, 1, 2, 1, 1, 1] 9985079
3 [6, 1, 1] 12224432 [3, 2, 1, 1, 1] 9985079
4 [4, 2, 1, 1] 12253670 [1, 2, 2, 1, 1, 1] 9985079
5 [1, 3, 1, 1, 1, 1] 12280165 [1, 1, 1, 1, 1, 1, 1, 1] 10232996
6 [3, 1, 2, 1, 1] 12281694 [2, 1, 1, 1, 1, 1, 1] 10232996
7 [5, 1, 2] 12338551 [1, 2, 1, 1, 1, 1, 1] 10232996
8 [4, 1, 1, 1, 1] 12399017 [3, 1, 1, 1, 1, 1] 10232996
9 [3, 1, 1, 1, 1, 1] 12409691 [1, 1, 1, 1, 2, 1, 1] 10343471
10 [2, 2, 2, 1, 1] 12434147 [2, 1, 1, 2, 1, 1] 10343471

Here we see the same pattern as we did above: Snappy seems to prefer "more transposed" transpositions than BZ2 does, and we even see a strong showing for the maximal split [1, 1, 1, 1, 1, 1, 1, 1].

To summarize: for doubles, it seems that regardless of which compressor you use, you are better off not splitting into mantissa/exponent portions, and just literal encoding the whole thing. If using Snappy, [1, 1, 1, 1, 1, 1, 1, 1] transposition seems to be the way to go, but the situation is less clear with BZ2: [6, 2] did well in our tests but it wasn't a runaway winner.

If for some reason you did want to use splitting and you are also going to use BZ2, then [6, 1] literal encoding for the mantissas and literal encoding for the exponents seems like a sensible choice. If you are a Snappy user, then I would suggest that a principled choice would be to use [1, 1, 1, 1, 1, 1, 1] literal encoding for the mantissas and likewise [1, 1] literal encoding for the exponent.

Sparse timeseries

Let's now look at the sparse timeseries case, where many of the values in the timeseries are NaN. In this case, we're interested in evaluating how useful the "special case" optimization above is in improving compression ratios.

To evaluate this, I replaced a fraction of numbers in my test dataset with NaNs and looked at the best possible size result for a few such fractions. The compressed size in each case is:

NaN Fraction No Split Split w/ Special Cases Split w/o Special Cases
0.00 9985079 10445497 10437550
0.10 10819117 9909251 11501526
0.50 8848393 6238097 10255811
0.75 5732076 3628543 7218287

Note that for convenience here the only compressor I tested was Snappy — i.e. BZ2 was not tested. I also didn't implement special cases in the no-split case, because an artifact of my implementation is that the special-casing is done at the same time as the float is split into its three component fields (sign, mantissa, exponent).

As we introduce small numbers of NaNs to the data, both the no-split and non-special-cased data get larger. This is expected, because we're replacing predictable timeseries values at random with totally dissimilar values and hence adding entropy. The special-cased split shrinks because this increasing entropy is compensated for by the very short codes we have chosen for NaNs (for which we do pay a very small penalty in the NaNless case). At very high numbers of NaNs, the compressed data for all methods shrinks as NaNs become the rule rather than the exception.

A NaN fraction of 10% or more is probably realistic for real-world financial data, so it definitely does seem like implementing special cases is worthwhile. The improvement would probably be considerably less if we looked at BZ2-based results, though.

Closing thoughts

One general observation is that delta encoding is very rarely the best choice, and when it is the best, the gains are usually marginal when compared to literal encoding. This is interesting because Fabian Giesen came to exactly the same conclusion (that delta encoding is redundant when you can do transposition) in the excellent presentation that I linked to earlier.

By applying these techniques to the dataset I was dealing with at work, I was able to get a compression improvement on the order of 10%-20% over and above what I could achieve with naive use of Snappy, so I consider the work a success, but I don't intend to do any more research in the area. However, there are definitely more experiments that could be done in this vein. In particular, interesting questions are:

  • How robust are the findings of this post when applied to other datasets?
  • What if we don't byte-align everything? Does that improve the BZ2 case? (My preliminary experiments showed that it made Snappy considerably worse.)
  • Why exactly are the results for BZ2 and Snappy so different? Presumably it relates to the lack of an entropy encoder in Snappy, but it is not totally clear to me how this leads to the results above.

Easy publishing to Maven Central with Gradle

I recently released my first open source library for Java, MDBI. I learnt a lot about the Java open-source ecosystem as part of this process, and this post summarises it in the hope that it will be useful to others. Specifically, the post will explain how to set up a project using the modern Gradle build system to build code and deploy it to the standard Maven Central repository from the command line really easily.

Getting started

In the Haskell ecosystem, everyone uses Cabal and Hackage, which are developed by the same people and tightly integrated. In contrast, Java's ecosystem is a bit more fragmented: build systems and package repositories are managed by different organisations, and you need to do a bit of integration work to join everything up.

In particular, in order to get started we're going to have to sign up with two different websites: Sonatype OSS and Bintray:

  • No-one can publish directly to Maven Central: instead you need to publish your project to an "approved repository", from where it will be synced to Central. Sonatype OSS is an approved repository that Sonatype (the company that runs Maven Central) provide free of charge specifically for open-source projects. We will use this to get our artifacts into Central, so go and follow the sign-up instructions now.

    Your application will be manually reviewed by a Sonatype employee and approved within one or two working days. If you want an example of what this process looks like you can take a look at the ticket I raised for my MDBI project.

  • Sonatype OSS is a functional enough way to get your artifacts onto Central, but it has some irritating features. In particular, when you want to make a release you need to first push your artifacts to OSS, and then use an ugly and confusing web interface called Sonatype Nexus to actually "promote" this to Central. I wanted the release to Central to be totally automated, and the easiest way to do that is to have a 3rd party deal with pushing to and then promoting from OSS. For this reason, you should also sign up with Bintray (you can do this with one click if you have a GitHub account).

    Bintray is run by a company called JFrog and basically seems to be a Nexus alternative. JFrog run a Maven repository called JCenter, and it's easy to publish to that via Bintray. Once it's on JCenter we'll be able to push and promote it on Sonatype OSS fully automatically.

We also need to create a Bintray "package" within your Bintray Maven repository. Do this via the Bintray interface — it should be self-explanatory. Use the button on the package page to request it be linked to JCenter (this was approved within a couple of hours for me).

We'll also need a GPG public/private key pair. Let's set that up now:

  1. Open up a terminal and run gpg --gen-key. Accept all the defaults about the algorithm to use, and enter a name, email and
    passphrase of your choosing.
  2. If you run gpg --list-public-keys you should see something like this:

    /Users/mbolingbroke/.gnupg/pubring.gpg
    --------------------------------------
    pub   2048R/3117F02B 2015-11-18
    uid                  Max Bolingbroke <batterseapower@hotmail.com>
    sub   2048R/15245385 2015-11-18

    Whatever is in place of 3117F02B is the name of your key. I'll call this $KEYNAME from now on.

  3. Run gpg --keyserver hkp://pool.sks-keyservers.net --send-keys $KEYNAME to publish your key.
  4. Run gpg -a --export $KEYNAME and gpg -a --export-secret-keys $KEYNAME to get your public and secret keys as ASCII text. Edit your Bintray account and paste these into the "GPG Signing" part of the settings.
  5. Edit your personal Maven repository on Bintray and select the option to "GPG Sign uploaded files automatically". Don't use Bintray's public/private key pair.

Now that you have your Bintray and OSS accounts, we can move on to setting up Gradle.

Gradle setup

The key problem we're trying to solve with our Gradle build is producing a set of JARs that meet the Maven Central requirements. What this boils down to is ensuring that we provide:

  • The actual JAR file that people will run.
  • Source JARs containing the code that we built.
  • Javadoc JARs containing the compiled HTML help files.
  • GPG signatures for all of the above. (This is why we created a GPG key above.)
  • A POM file containing project metadata.

To satisfy these requirements we're going to use gradle-nexus-plugin. The resulting (unsigned, but otherwise Central-compliant) artifacts will then be uploaded to Bintray (and eventually Sonatype OSS + Central) using gradle-bintray-plugin. I also use one more plugin — Palantir's gradle-gitsemver — to avoid having to update the Gradle file whenever the version number changes. Our Gradle file begins by pulling all those plugins in:

buildscript {
    repositories {
        jcenter()
        maven { url "http://dl.bintray.com/palantir/releases" }
    }
    dependencies {
        classpath 'com.bmuschko:gradle-nexus-plugin:2.3.1'
        classpath 'com.jfrog.bintray.gradle:gradle-bintray-plugin:1.4'
        classpath 'com.palantir:gradle-gitsemver:0.7.0'
    }
}

apply plugin: 'java'
apply plugin: 'com.bmuschko.nexus'
apply plugin: 'com.jfrog.bintray'
apply plugin: 'gitsemver'

Now we have the usual Gradle configuration describing how to build the JAR. Note the use of the semverVersion() function (provided by the gradle-gitsemver plugin) which returns a version number derived from the most recent Git tag of the form vX.Y.Z. Despite the name of the plugin, there is no requirement to actually adhere to the principles of Semantic Versioning to use it: the only requirements for the version numbers are syntactic.

version semverVersion()
group 'uk.co.omega-prime'

def projectName = 'mdbi'
def projectDescription = 'Max\'s DataBase Interface: a simple but powerful JDBC wrapper inspired by JDBI'

sourceCompatibility = 1.8

jar {
    baseName = projectName
    manifest {
        attributes 'Implementation-Title': projectName,
                   'Implementation-Version': version
    }
}

repositories {
    mavenCentral()
}

dependencies {
    compile group: 'com.google.code.findbugs', name: 'jsr305', version: '3.0.1'
    testCompile group: 'org.xerial', name: 'sqlite-jdbc', version: '3.8.11.2'
    testCompile group: 'junit', name: 'junit', version: '4.12'
}

(Obviously your group, project name, description, dependencies etc will differ from this. Hopefully it's clear which parts of this example Gradle file you'll need to change for your project and which you can copy verbatim.)

Now we need to configure gradle-nexus-plugin to generate the POM. Just by the act of including the plugin we have already arranged for the appropriate JARs to be generated, but the plugin can't figure out the full contents of the POM by itself.

modifyPom {
    project {
        name projectName
        description projectDescription
        url 'http://batterseapower.github.io/mdbi/'

        scm {
            url 'https://github.com/batterseapower/mdbi'
            connection 'scm:https://batterseapower@github.com/batterseapower/mdbi.git'
            developerConnection 'scm:git://github.com/batterseapower/mdbi.git'
        }

        licenses {
            license {
                name 'The Apache Software License, Version 2.0'
                url 'http://www.apache.org/licenses/LICENSE-2.0.txt'
                distribution 'repo'
            }
        }

        developers {
            developer {
                id 'batterseapower'
                name 'Max Bolingbroke'
                email 'batterseapower@hotmail.com'
            }
        }
    }
}

nexus {
    sign = false
}

Note that I've explicitly turned off the automatic artifact signing capability of the Nexus plugin. Theoretically we should be able to keep this turned on, and sign everything locally before pushing to Bintray. This would mean that we wouldn't have to give Bintray our private key. In practice, if you sign things locally Bintray seems to mangle the signature filenames so they become unusable...

Finally, we need to configure the Bintray sync:

if (hasProperty('bintrayUsername') || System.getenv().containsKey('BINTRAY_USER')) {
    // Used by the bintray plugin
    bintray {
        user = System.getenv().getOrDefault('BINTRAY_USER', bintrayUsername)
        key  = System.getenv().getOrDefault('BINTRAY_KEY', bintrayApiKey)
        publish = true

        pkg {
            repo = 'maven'
            name = projectName
            licenses = ['Apache-2.0']
            vcsUrl = 'https://github.com/batterseapower/mdbi.git'

            version {
                name = project.version
                desc = projectDescription
                released = new Date()

                mavenCentralSync {
                    user     = System.getenv().getOrDefault('SONATYPE_USER', nexusUsername)
                    password = System.getenv().getOrDefault('SONATYPE_PASSWORD', nexusPassword)
                }
            }
        }

        configurations = ['archives']
    }
}

We do this conditionally because we still want people to be able to use the Gradle file even if they don't have a username and password set up. In order to make these credentials available to the script when run on your machine, you need to create a ~/.gradle/gradle.properties file with contents like this:

# These 3 are optional: they'll be needed if you ever use the nexus plugin with 'sign = true' (the default)
signing.keyId=<GPG $KEYNAME from earlier>
signing.password=<your GPG passphrase>
signing.secretKeyRingFile=<absolute path to your ~/.gnupg/secring.gpg file (or whatever you called it)>

nexusUsername=<username for Sonatype OSS>
nexusPassword=<password for Sonatype OSS>

bintrayUsername=<username for Bintray>
bintrayApiKey=<Bintray API key, found in the "API key" section of https://bintray.com/profile/edit>

You can see the complete, commented, Gradle file that I'm using in my project on Github.

Your first release

We should now be ready to go (assuming your Sonatype OSS and JCenter setup requests have been approved). Let's make a release! Go to the terminal and type:

git tag v1.0.0
gradle bintrayUpload

If everything works, you'll get a BUILD SUCCESSFUL message after a minute or so. Your new version should be visible on the Bintray package page (and JCenter) immediately, and will appear on Maven Central shortly afterwards.

If you want to go the whole hog and have your continuous integration (e.g. the excellent Travis) make these automatic deploys after every passing build, this guide for SBT looks useful. However, I didn't go this route so I can't say how it could be adapted for Gradle.

A nice benefit of publishing to Maven Central is that javadoc.io will host your docs for free totally automatically. Check it out!

Overall I found the process of publishing Java open source to Maven Central needlessly confusing, with many more moving parts than I was expecting. The periods of waiting for 3rd parties to approve my project were also a little frustrating, though in fairness the turnaround time was quite impressive given that they were doing the work for free. Hopefully this guide will help make the process a little less frustrating for other Gradle users in the future.

Beware: java.nio.file.WatchService is subtly broken on Linux

This post describes a bug that I reported to Oracle a month or so ago but which still doesn't seem to have made its way through to the official tracker.

The problem is that on Linux, file system events that should be delivered by the WatchService can be silently discarded or be delivered against the wrong WatchKey. So for example, it's possible to register two directories, A and B, with a WatchService waiting for ENTRY_CREATE events, then create a file A/C but get an event with the WatchKey for B and WatchEvent.context C.
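
For concreteness, here is a minimal sketch of the kind of setup that can be affected (directory names are made up, and the snippet makes no attempt to orchestrate the registration/event race that actually triggers the misdelivery):

import java.nio.file.*;

public class WatchBug {
    public static void main(String[] args) throws Exception {
        WatchService ws = FileSystems.getDefault().newWatchService();
        Path a = Paths.get("A"), b = Paths.get("B");   // assumed to already exist
        WatchKey keyA = a.register(ws, StandardWatchEventKinds.ENTRY_CREATE);
        WatchKey keyB = b.register(ws, StandardWatchEventKinds.ENTRY_CREATE);

        Files.createFile(a.resolve("C"));              // create A/C...

        WatchKey key = ws.take();                      // ...but the key may turn out to be keyB
        for (WatchEvent<?> event : key.pollEvents()) {
            System.out.println(key.watchable() + " -> " + event.context());
        }
        key.reset();
    }
}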

The reason for this is a bug in the JDK's LinuxWatchService. This class wraps an inotify instance, and also a thread that spins using poll to wait either for:

  • A file system event to be delivered on the inotify FD, or
  • A byte to arrive on a FD corresponding to a pipe which is owned by the LinuxWatchService

Whenever a registration request is made by the user of the LinuxWatchService, the request is enqueued and then a single byte is written to the other end of this pipe to wake up the background thread, which will then make the actual registration with the kernel.

The core loop of this background thread is where the bug lies. The loop body looks like this:

// wait for close or inotify event
nReady = poll(ifd, socketpair[0]);
 
// read from inotify
try {
    bytesRead = read(ifd, address, BUFFER_SIZE);
} catch (UnixException x) {
    if (x.errno() != EAGAIN)
        throw x;
    bytesRead = 0;
}
 
// process any pending requests
if ((nReady > 1) || (nReady == 1 && bytesRead == 0)) {
    try {
        read(socketpair[0], address, BUFFER_SIZE);
        boolean shutdown = processRequests();
        if (shutdown)
            break;
    } catch (UnixException x) {
        if (x.errno() != UnixConstants.EAGAIN)
            throw x;
    }
}
 
// iterate over buffer to decode events
int offset = 0;
while (offset < bytesRead) {
    long event = address + offset;
    int wd = unsafe.getInt(event + OFFSETOF_WD);
    int mask = unsafe.getInt(event + OFFSETOF_MASK);
    int len = unsafe.getInt(event + OFFSETOF_LEN);
 
    // Omitted: the code that actually does something with the inotify event
}

The issue is that two read calls are made by this body — once with the inotify FD ifd, and once with the pipe FD socketpair[0]. Crucially, both reads target the same address buffer. If data happens to be available both via the pipe and via inotify, then the read from the pipe will corrupt the first few bytes of the inotify event stream! As it happens, the first few bytes of an event denote which watch descriptor the event is for, and so the issue usually manifests as an event being delivered against the wrong directory (or, if the resulting watch descriptor is not actually valid, the event being ignored entirely).

Note that this issue can only occur if you are registering watches while simultaneously receiving events. If your program just sets up some watches at startup and then never registers/cancels watches again you probably won't be affected. This, plus the fact that it is only triggered by registration requests and events arriving very close together, is probably why this bug has gone undetected since the very first release of the WatchService code.

I've worked around this myself by using the inotify API directly via JNA. This reimplementation also let me solve an unrelated WatchService "feature", which is that WatchKey.watchable can point to the wrong path in the event that a directory is renamed. So if you create a directory A, start watching it for ENTRY_CREATE events, rename the directory to B, and then create a file B/C, the WatchKey.watchable you get from the WatchService will be A rather than B, so naive code will derive the incorrect full path A/C for the new file.

In my implementation, a WatchKey is invalidated if the directory it watches is renamed, so a user of the class has the opportunity to reregister the new path with the correct WatchKey.watchable if they so desire. I think this is much saner behaviour!

Asynchronous and non-blocking IO

This post aims to explain the difference between asynchronous and non-blocking IO, with particular reference to their implementation in Java. These two styles of IO API are closely related but have a number of important differences, especially when it comes to OS support.

Asynchronous IO

Asynchronous IO refers to an interface where you supply a callback to an IO operation, which is invoked when the operation completes. This invocation often happens on an entirely different thread from the one that originally made the request, but this is not necessarily the case. Asynchronous IO is a manifestation of the "proactor" pattern.

One common way to implement asynchronous IO is to have a thread pool whose threads are used to make the normal blocking IO requests, and execute the appropriate callbacks when these return. The less common implementation approach is to avoid a thread pool, and just push the actual asynchronous operations down into the kernel. This alternative solution obviously has the disadvantage that it depends on operating system specific support for making async operations, but has the following advantages:

  • The maximum number of in-flight requests is not bounded by the size of your thread pool
  • The overhead of creating thread pool threads is avoided (e.g. you need not reserve any memory for the thread stacks, and you don't pay the extra context switching cost associated with having more schedulable entities)
  • You expose more information to the kernel, which it can potentially use to make good choices about how to do the IO operations — e.g. by minimizing the distance that the disk head needs to travel to satisfy your requests, or by using native command queueing.
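
Returning to the thread-pool strategy mentioned just before that list: a minimal sketch of the idea might look like this (the class and method names are mine, not any standard API):

import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.BiConsumer;

class ThreadPoolAsyncIO {
    private final ExecutorService pool = Executors.newFixedThreadPool(8);

    // Emulate an asynchronous positioned read: the blocking read happens on a pool thread,
    // and the callback is invoked (on that same thread) once it completes or fails.
    void readAsync(FileChannel ch, ByteBuffer dst, long position,
                   BiConsumer<Integer, Throwable> callback) {
        pool.submit(() -> {
            try {
                int n = ch.read(dst, position);   // ordinary blocking positioned read
                callback.accept(n, null);
            } catch (Throwable t) {
                callback.accept(null, t);
            }
        });
    }
}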

Operating system support for asynchronous IO is mixed:

  • Linux has at least two implementations of async IO:
    • POSIX AIO (aio_read et al). This is implemented on Linux by glibc, but other POSIX systems (Solaris, OS X etc) have their own implementations. The glibc implementation is simply a thread pool based one — I'm not sure about the other systems.
    • Linux kernel AIO (io_submit et al). No thread pool is used here, but it has quite a few limitations (e.g. it only works for files, not sockets, and has alignment restrictions on file reads) and does not seem to be used much in practice.

    There is a good discussion of the *nix AIO situation on the libtorrent blog, summarised by the same writer on Stack Overflow here. The experience of this author was that the limitations and poor implementation quality of the various *nix AIO implementations are such that you are much better off just using your own thread pool to issue blocking operations.

  • Windows provides a mechanism called completion ports for performing asynchronous IO. With this system:
    1. You start up a thread pool and arrange for each thread to spin calling GetQueuedCompletionStatus
    2. You make IO requests using the normal Windows APIs (e.g. ReadFile and WSARecv), with the small added twist that you supply a special LPOVERLAPPED parameter indicating that the calls should be non-blocking and the result should be reported to the thread pool
    3. As IO completes, thread pool threads blocked on GetQueuedCompletionStatus are woken up as necessary to process completion events

    Windows intelligently schedules how it delivers GetQueuedCompletionStatus wakeups, such that it tries to roughly keep the same number of threads active at any time. This avoids excessive context switching and scheduler transitions — things are arranged so that a thread which has just processed a completion event will likely be able to immediately grab a new work item. With this arrangement, your pool can be much smaller than the number of IO operations you want to have in-flight: you only need to have as many threads as are required to process completion events.

In Java, support for asynchronous IO was added as part of the NIO2 work in JDK7, and the appropriate APIs are exposed by the AsynchronousChannel class. On *nix, AsynchronousFileChannel and AsynchronousSocketChannel are implemented using the standard thread pool approach (the pools are owned by an AsynchronousChannelGroup). On Windows, completion ports are used — in this case, the AsynchronousChannelGroup thread pool is used to run the GetQueuedCompletionStatus listeners.
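
As a usage sketch of the JDK7 API (the file name is made up; the handler is invoked later on one of the group's threads):

import java.nio.ByteBuffer;
import java.nio.channels.AsynchronousFileChannel;
import java.nio.channels.CompletionHandler;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class AsyncReadExample {
    public static void main(String[] args) throws Exception {
        AsynchronousFileChannel ch = AsynchronousFileChannel.open(
                Paths.get("timeseries.bin"), StandardOpenOption.READ);
        ByteBuffer buf = ByteBuffer.allocate(4096);

        // The read call returns immediately; the completion handler runs later on a
        // group thread (backed by a completion port listener on Windows).
        ch.read(buf, 0L, buf, new CompletionHandler<Integer, ByteBuffer>() {
            @Override public void completed(Integer bytesRead, ByteBuffer attachment) {
                System.out.println("Read " + bytesRead + " bytes");
            }
            @Override public void failed(Throwable exc, ByteBuffer attachment) {
                exc.printStackTrace();
            }
        });

        Thread.sleep(1000);   // crude: keep the JVM alive long enough for the callback
        ch.close();
    }
}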

If you happen to be stuck on JDK6, your only option is to ignore completion ports and roll your own thread pool to dispatch operations on e.g. standard synchronous FileChannels. However, if you do this you may find that you don't actually get much concurrency on Windows. This happens because FileChannel.read(ByteBuffer, long) is needlessly crippled by taking a lock on the whole FileChannel. This lock is needless because FileChannel is otherwise a thread-safe class, and in order to make sure your positioned read isn't interfering with the other readers you don't need to lock — you simply need to issue a ReadFile call with a custom position by using one of the fields of the LPOVERLAPPED struct parameter. Note that the *nix implementation of FileChannel.read does the right thing and simply issues a pread call without locking.

Non-blocking IO

Non-blocking IO refers to an interface where IO operations will return immediately with a special error code if called when they are in a state that would otherwise cause them to block. So for example, a non-blocking recv will return immediately with an EAGAIN or EWOULDBLOCK error code if no data is available on the socket, and likewise send will return immediately with an error if the OS send buffers are full. Generally APIs providing non-blocking IO will also provide some sort of interface where you can efficiently wait for certain operations to enter a state where invoking the non-blocking IO operation will actually make some progress rather than immediately returning. APIs in this style are implementations of the reactor pattern.

No OS that I know of implements non-blocking IO for file IO, but support for socket IO is generally reasonable:

  • Non-blocking read and writes are available via the POSIX O_NONBLOCK operating mode, which can be set on file descriptors (FDs) representing sockets and FIFOs.

  • POSIX provides select and poll which let you wait for reads and writes to be ready on several FDs. (The difference between these two is pretty much just that select lets you wait for a number of FDs up to FD_SETSIZE, while poll can wait for as many FDs as you are allowed to create.)

    Select and poll have the major disadvantage that when the kernel returns from one of these calls, you are only told how many FDs were triggered — not given a list of which specific FDs have become unblocked. This means you have to do a linear-time scan across each of the FDs you supplied to figure out which ones you actually need to use.

  • This limitation motivated the development of several successor interfaces. BSD & OS X got kqueue, Solaris got /dev/poll, and Linux got epoll. Roughly speaking, these interfaces let you build up a set of FDs you are interested in watching, and then make a call that returns to you a list of those FDs in the set that were actually triggered.

    There's lots of good info about these mechanisms at the classic C10K page. If you like hearing someone who clearly knows what he is talking about rant for 10 minutes about syscalls, this Bryan Cantrill bit about epoll is quite amusing.

  • Unfortunately, Windows never got one of these successor mechanisms: only select is supported. It is possible to do an epoll-like thing by kicking off an operation that would normally block (e.g. WSARecv) with a specially prepared LPOVERLAPPED parameter, such that you can wait for it to complete using WSAWaitForMultipleEvents. Like epoll, when this wait returns it gives you a notion of which of the sockets of interest caused the wakeup. Unfortunately, this API won't let you wait for more than 64 events — if you want to wait for more you need to create child threads that recursively call WSAWaitForMultipleEvents, and then wait on those threads!

  • The reason that Windows support is a bit lacking here is that they seem to expect you to use an asynchronous IO mechanism instead: either completion ports, or completion handlers. (Completion handlers are implemented using the windows APC mechanism and are a form of callback that don't require a thread pool — instead, they are executed in the spare CPU time when the thread that issued the IO operation is otherwise suspended, e.g. in a call to WaitForMultipleObjectsEx).

In Java, non-blocking IO has been exposed via SelectableChannel since JDK4. As I mentioned above, OS support for non-blocking IO on files is nonexistent — correspondingly, Java's SocketChannel extends SelectableChannel, but FileChannel does not.

The JDK implements SelectableChannel using whatever the platform-appropriate API is (i.e. epoll, kqueue, /dev/poll, poll or select). The Windows implementation is based on select — to ameliorate the fact that select requires a linear scan, the JDK creates a new thread for every 1024 sockets being waited on.
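
A minimal sketch of the reactor style using the JDK's selector API (host and port are made up):

import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.SocketChannel;
import java.util.Iterator;

public class NonBlockingReadExample {
    public static void main(String[] args) throws Exception {
        SocketChannel ch = SocketChannel.open(new InetSocketAddress("example.com", 80));
        ch.configureBlocking(false);                  // reads will now never block
        Selector selector = Selector.open();
        ch.register(selector, SelectionKey.OP_READ);

        ByteBuffer buf = ByteBuffer.allocate(4096);
        while (true) {
            selector.select();                        // wait until some registered channel is ready
            Iterator<SelectionKey> it = selector.selectedKeys().iterator();
            while (it.hasNext()) {
                SelectionKey key = it.next();
                it.remove();
                if (key.isReadable()) {
                    int n = ch.read(buf);             // returns immediately, possibly with 0 bytes
                    if (n < 0) return;                // connection closed
                    buf.clear();                      // discard the data: this is only a sketch
                }
            }
        }
    }
}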

Conclusions

Let's say that you want to do Java IO in a non-synchronous way. The bottom line is:

  • If you want to do IO against files, your only option is asynchronous IO. You'll need to roll it yourself with JDK6 and below (and the resulting implementation won't be as concurrent as you might expect on Windows). On the other hand, with Java 7 and up you can just use the built-in mechanisms, and what you'll get is basically as good as the state-of-the-art.

  • If you want to do IO against sockets, an ideal solution would use non-blocking IO on *nix and asynchronous IO on Windows. This is obviously a bit awkward to do, since it involves working with two rather different APIs. There might be some project akin to libuv that wraps these two mechanisms up into a single API you can write against, but I don't know of it if so.

    The Netty project is an interesting data point. This high-performance Java network framework is based principally on non-blocking IO, but the developers did make an abortive attempt to use async IO instead at one point — it was backed out because there was no performance advantage to using async IO instead of non-blocking IO on Linux. Some users report that the now-removed asynchronous IO code drastically reduced CPU usage on Windows, but others report that Java's dubious select-based implementation of Windows non-blocking IO is good enough.

Quirks of the Matlab file format

The Matlab file format has become something of a standard for data exchange in quant finance circles. It is not only handy for those who are using the Matlab interactive environment itself, but also for users working in a diverse spectrum of languages, thanks to the widespread availability of libraries for reading and writing the files. The format itself also supports compression — an essential feature for keeping disk usage reasonable when working with the highly compressible data that is typical of financial timeseries.

At work we have implemented our own high-performance Java library for reading and writing these files. The Mathworks have helpfully published a complete description of the format online, which makes this task for the most part straightforward. Unfortunately, the format also has some dark and undocumented corners that I spent quite some time investigating. This post is intended to record a couple of these oddities for posterity.

Unicode

The Matlab environment supports Unicode strings, and so consequently Matlab files can contain arbitrary Unicode strings. Unfortunately this is one area where the capabilities of Matlab itself and those intended by the Mathworks spec diverge somewhat. Specifically:

  1. While the spec documents a miUTF8 storage type, Matlab itself only seems to understand a very limited subset of UTF-8. For example, it can't even decode an example file which simply contains the UTF-8 encoded character sequence ←↑→↓↔. It turns out that Matlab cannot read codepoints that are encoded as three or more bytes! This means it can only understand U+0000 to U+07FF, leaving us in a sad situation where Matlab can't even handle the whole of the BMP.
  2. The miUTF32 storage type isn't supported at all. For example, this file is correctly formed according to the spec but unreadable in Matlab.
  3. UTF-16 mostly works. As it stands, this is really your only option if you want the ability to roundtrip Unicode via Matlab. One issue is that Matlab chars aren't really Unicode codepoints - they are sequences of UTF-16 code units. However, this is an issue shared by Python 2 and Java, so even though it is broken at least it is broken in the "normal" way.

Interestingly, most 3rd party libraries seem to implement these parts of the spec better than Matlab itself does — for example, scipy's loadmat and savemat functions have full support for all of these text storage data types. (Scipy does still have trouble with non-BMP characters however.)

Compression

As mentioned, .mat files have support for storing compressed matrices. These are simply implemented as nested zlib-compressed streams. Alas, it appears that the way that Matlab is invoking zlib is slightly broken, with the following consequences:

  • Matlab does not check whether the trailing zlib checksum is present, and does not verify it even when it is there.
  • If you attempt to open a file containing a zlib stream that has been corrupted such that the decompressed data is longer than Matlab was expecting, the error is silently ignored.
  • When writing out a .mat file, Matlab will sometimes not write the zlib checksum at all. This happens very infrequently, though — most files it creates do have a checksum as you would expect.

Until recently scipy's Matlab reader would not verify the checksum either, but I added support for this after we saw corrupted .mat files in the wild at work.
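
For what it's worth, the key to catching this kind of corruption in your own reader is to drain each zlib stream all the way to its end, so that the trailing Adler-32 checksum is actually read and verified rather than skipped. A minimal sketch using the JDK's java.util.zip machinery (the class and method names here are just illustrative):

<code>import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.InflaterInputStream;

public class MatDecompress {
    // Decompress one zlib-wrapped .mat element, reading through to end-of-stream
    // so that the Inflater sees and verifies the trailing Adler-32 checksum.
    // Corruption or a missing checksum then surfaces as an IOException instead
    // of being silently ignored.
    static byte[] inflateAndVerify(byte[] compressed) throws IOException {
        try (InflaterInputStream in =
                 new InflaterInputStream(new ByteArrayInputStream(compressed))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        }
    }
}</code>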

I've reported these compression and Unicode problems to the Mathworks and they have acknowledged that they are bugs, but at this time there is no ETA for a fix.

Rpath emulation: absolute DLL references on Windows

When creating an executable or shared library on Linux, it’s possible to include an ELF RPATH header which tells the dynamic linker where to search for any shared libraries that you reference. This is a pretty handy feature because it can be used to nail down exactly which shared library you will link against, without leaving anything up to chance at runtime.

Unfortunately, Windows does not have an equivalent feature. However, it does have an undocumented feature which may be enough to replace your use of rpath if you are porting software from Linux.

Executables and DLLs on Windows always reference the DLLs that they import by name only. So, the import table for an executable will refer to kernel32.dll rather than C:\Windows\kernel32.dll. The Windows dynamic loader will look for a file with the appropriate name in the DLL search path as usual. (For full details on DLL import tables and more, you can check out my previous in-depth post.)

However, the Windows dynamic loader will, as a completely undocumented (and presumably unsupported) feature, also accept absolute paths in the import table. This is game-changing because it means that you can hard-code exactly which DLL you want to refer to, just like you would be able to with rpath on Linux.

Demonstration

To demonstrate this technique, we’re going to need code for a DLL and a referring EXE:

$ cat library.c
#include <stdio.h>

__declspec(dllexport) int librarycall(void) {
        printf("Made library call!\n");
        return 0;
}

$ cat rpath.c
__declspec(dllimport) int librarycall(void);

int main(int argc, char **argv) {
        return librarycall();
}

If we were building a DLL and EXE normally, we would do this:

<code>gcc -c library.c
gcc -shared -o library.dll library.o
gcc -o rpath rpath.c -L./ -llibrary</code>

This all works fine:

<code>$ ./rpath
Made library call!</code>

However, as you would expect, if you move library.dll elsewhere, the EXE will fail to start:

<code>$ mv library.dll C:/library.dll
$ ./rpath
/home/Max/rpath/rpath.exe: error while loading shared libraries: library.dll: cannot open shared object file: No such file or directory</code>

Now let’s work some magic! If we open up rpath.exe in a hex editor, we see something like this:

Let’s just tweak that a bit to change the relative path to library.dll to an absolute path. Luckily there is enough padding to make it fit:

The EXE will now work perfectly!

<code>$ ./rpath
Made library call!</code>

In practice

Knowing that this feature exists is one thing; actually making use of it in a reliable way is another. The problem is that, to my knowledge, no linker is capable of creating a DLL or EXE whose import table contains an absolute path. Sometimes we will be lucky and the linker will create an EXE or DLL with enough padding in it for us to manually edit in an absolute path, but with the method above there is no guarantee that this will be possible.

In order to exploit this technique robustly, we’re going to use a little trick with import libraries. Instead of using GCC’s ability to link directly to a DLL, we will generate an import library for the DLL, which we will call library.lib:

<code>$ dlltool --output-lib library.lib --dllname veryverylongdllname.dll library.o</code>

When you use dlltool you either need to write a .def file for the DLL you are creating an import library for, or you need to supply all the object files that were used to create the DLL. I’ve taken the second route here and just told dlltool that our DLL was built from library.o.

Now that we have an import library, we can do our hex-editing trick again, but this time on the library. Before:

And after (note that I have null-terminated the new absolute path):

The beauty of editing the import library rather than the output of the linker is that using the --dllname option we can ensure that the import library contains as much space as we need to fit the entire absolute path of the DLL, no matter how long it may be. This is the key to making robust use of absolute paths in DLL loading, even if linkers don’t support them!

Now that we have the import library, we can link rpath.exe again, but this time using the import library rather than library.dll:

<code>$ gcc -o rpath rpath.c library.lib
$ ./rpath
Made library call!</code>

Yes, it really is using the DLL on the C: drive:

<code>$ mv C:/library.dll C:/foo.dll
$ ./rpath
/home/Max/rpath/rpath.exe: error while loading shared libraries: C:\library.dll: cannot open shared object file: No such file or directory</code>

Conclusion

I haven’t seen this technique for using absolute paths for DLL references anywhere on the web, so it doesn’t seem to be widely known. However, it works beautifully on Windows 7 and probably on all other versions of Windows as well.

I may apply these techniques to the Glasgow Haskell Compiler in order to improve the support for Haskell shared objects on Windows: more information on this topic can be found on the GHC wiki.

GHC-specific Alias Analysis for LLVM

The setup

A few years ago, David Terei did some great work adding a LLVM backend to the Glasgow Haskell Compiler. The idea with this is that instead of writing our own optimiser and assembly-code generators for our custom three-address-code, we can just translate into LLVM IR and have LLVM do the heavy lifting. In theory, this means that GHC will be able to compile for many different CPUs, and will benefit from the smart optimisations the LLVM team have implemented.

The portability part has definitely worked out for us: for example, a couple of people have successfully got GHC to compile for ARM by using the LLVM backend. However, the promise of LLVM being able to speed up our generated code has never really been fully borne out. LLVM-generated code does tend to be better than that produced by GHC’s own backends, but this is mostly because LLVM is doing much better register allocation (it is much smarter about reusing the “pinned registers” that form part of the interface between GHC’s generated code and the garbage collector).

The reason that LLVM does not optimise as much as we would like is often to do with aliasing. In particular, LLVM conservatively assumes that GHC’s stack (which is explicitly represented in the generated code as an array of words) and the heap may alias.

What’s the problem?

A concrete example of this is the following Haskell program:

module Main(main) where

import Data.Array.Base
import Data.Array.IO
import Data.Array.MArray

main :: IO ()
main = do
    arr <- newArray_ (0, 200)
    go arr 2 0 100

go :: IOUArray Int Int -> Int -> Int -> Int -> IO ()
go arr stride x y | x < y     = do unsafeWrite arr (x * stride) 1337
                                   go arr stride (x + 1) y
                  | otherwise = return ()

This loop compiles to fairly good Core:

Main.main_$s$wa =
  \ (@ sg0_sKA::Data.Array.Base.STUArray
                  GHC.Prim.RealWorld GHC.Types.Int GHC.Types.Int
                  ~
                Data.Array.IO.Internals.IOUArray GHC.Types.Int GHC.Types.Int)
    (sc_sKs :: GHC.Prim.State# GHC.Prim.RealWorld)
    (sc1_sKt :: GHC.Prim.Int#)
    (sc2_sKu :: GHC.Prim.Int#)
    (sc3_sKv :: GHC.Prim.Int#)
    (sc4_sKw :: GHC.Types.Int)
    (sc5_sKx :: GHC.Types.Int)
    (sc6_sKy :: GHC.Types.Int)
    (sc7_sKz :: GHC.Prim.MutableByteArray# GHC.Prim.RealWorld) ->
    case GHC.Prim.<# sc2_sKu sc1_sKt of _ {
      GHC.Bool.False -> (# sc_sKs, GHC.Unit.() #);
      GHC.Bool.True ->
        case GHC.Prim.writeIntArray#
               @ GHC.Prim.RealWorld
               sc7_sKz
               (GHC.Prim.*# sc2_sKu sc3_sKv)
               1337
               sc_sKs
        of s2#_aHo { __DEFAULT ->
        Main.main_$s$wa
          @ (sym
               Data.Array.IO.Internals.NTCo:IOUArray GHC.Types.Int GHC.Types.Int)
          s2#_aHo
          sc1_sKt
          (GHC.Prim.+# sc2_sKu 1)
          sc3_sKv
          sc4_sKw
          sc5_sKx
          sc6_sKy
          sc7_sKz
        }
    }

One weird thing about this Core is that it passes around a number of dead arguments (sc4_sKw, sc5_sKx and sc6_sKy). This is a known bug in GHC, and is caused by a phase ordering problem. However, this particular infelicity should not prevent LLVM from being able to do the classic loop optimisation of strength reduction on our code.

The particular strength reduction we would like to perform is to replace the multiplication GHC.Prim.*# sc2_sKu sc3_sKv in the main_$s$wa loop with an addition. This is possible because the left operand sc2_sKu is a loop induction variable, increasing by 1 every iteration. On every iteration, the value of the multiplication GHC.Prim.*# sc2_sKu sc3_sKv is therefore just its value on the previous iteration, plus sc3_sKv. So, by adding a loop variable that records the value of the multiplication on the previous iteration, we can replace the multiplication with an addition.
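
To see the transformation in isolation, here is the same idea written out by hand on an equivalent Java loop (purely illustrative; this is not the code GHC or LLVM produces):

<code>// Before: the array index is recomputed with a multiply on every iteration
static void before(long[] arr, int stride, int x, int y) {
    for (; x < y; x++) {
        arr[x * stride] = 1337;
    }
}

// After strength reduction: an extra induction variable tracks x * stride,
// so the multiply inside the loop body becomes an add
static void after(long[] arr, int stride, int x, int y) {
    int i = x * stride;
    for (; x < y; x++) {
        arr[i] = 1337;
        i += stride;
    }
}</code>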

Unfortunately, LLVM currently can’t strength-reduce this loop in the suggested way. As we will soon see, this is due to aliasing.

Why does the problem happen?

We can immediately see the problem if we look at the optimised LLVM code for this loop:

c1TW.lr.ph:
  ...
  %ln1TL1 = load i64* %Sp_Arg, align 8
  ...

c1TW:                                             ; preds = %c1TW.lr.ph, %c1TW
  %ln1TL4 = phi i64 [ %ln1TL1, %c1TW.lr.ph ], [ %ln1UF, %c1TW ]
  %ln1Uy = mul i64 %ln1Uu, %ln1TL4
  %ln1Uz = add i64 %ln1Uw, %ln1Uy
  %ln1UA = inttoptr i64 %ln1Uz to i64*
  store i64 1337, i64* %ln1UA, align 8
  %ln1UE = load i64* %Sp_Arg, align 8
  %ln1UF = add i64 %ln1UE, 1
  store i64 %ln1UF, i64* %Sp_Arg, align 8
  %ln1TP = load i64* %ln1TN, align 8
  %ln1TQ = icmp slt i64 %ln1UF, %ln1TP
  br i1 %ln1TQ, label %c1TW, label %n1TX.loopexit

The strength reduction optimisation depends on one of the operands to the multiplication being a loop induction variable. In our case, we expect that sc2_sKu will be such a variable. However, looking at the LLVM code we can see that the equivalent LLVM variable, %ln1TL4, has its induction-ness hidden because it is reloaded from the stack by load i64* %Sp_Arg on every iteration.

You might wonder why the store to the same stack location by store i64 %ln1UF, i64* %Sp_Arg is not forwarded to this load by LLVM. If this were to happen, we could get code like this:

c1TW.lr.ph:
  ...
  %ln1TL1 = load i64* %Sp_Arg, align 8
  %ln1UE.ph = load i64* %Sp_Arg, align 8
  ...

c1TW:                                             ; preds = %c1TW.lr.ph, %c1TW
  %ln1TL4 = phi i64 [ %ln1TL1, %c1TW.lr.ph ], [ %ln1UF, %c1TW ]
  %ln1UE = phi i64 [ %ln1UE.ph, %c1TW.lr.ph ], [ %ln1UF, %c1TW ]
  %ln1Uy = mul i64 %ln1Uu, %ln1TL4
  %ln1Uz = add i64 %ln1Uw, %ln1Uy
  %ln1UA = inttoptr i64 %ln1Uz to i64*
  store i64 1337, i64* %ln1UA, align 8
  %ln1UF = add i64 %ln1UE, 1
  store i64 %ln1UF, i64* %Sp_Arg, align 8
  %ln1TP = load i64* %ln1TN, align 8
  %ln1TQ = icmp slt i64 %ln1UF, %ln1TP
  br i1 %ln1TQ, label %c1TW, label %n1TX.loopexit

In this code the fact that %ln1UE is an induction variable is obvious, and not obscured by an intermediate load from memory. And indeed, LLVM is able to strength-reduce this loop!

The reason that LLVM does not forward this load is that doing so is unsafe in general, since the store to %ln1UA might alias it if %ln1UA were equal to %Sp_Arg. The ridiculous thing about this is that we know that in the code generated by GHC the stack pointer is never stored away anywhere, so it can’t possibly alias with the unknown pointer %ln1UA, and LLVM is being unnecessarily conservative.

The solution

LLVM is a beautiful bit of software, and it provides exactly the extensibility point we require to resolve this problem: we can write our own alias analysis pass that knows that GHC’s stack never aliases with any other non-stack pointer, and dynamically load it into the LLVM optimisation tool chain.

This is exactly what I’ve done. The code is available as a Gist, and interested parties (who use OS X!) can build it like so:

<code>g++ -D__STDC_LIMIT_MACROS -D__STDC_CONSTANT_MACROS -fno-exceptions -fno-rtti -fno-common -Wall \
-Wl,-flat_namespace -dynamiclib GHCAliasAnalysis.cpp -o GHCAliasAnalysis.dylib -lLLVM-`llvm-config --version`
</code>

Once built, we can dynamically load the resulting dylib into LLVM’s opt tool using the -load option, and then use the new -ghc-aa flag to tell LLVM to use our alias analyser as a complement to the default one. Unfortunately, due to an infelicity in LLVM, we have to specify -ghc-aa in between every single optimisation pass if we want to be sure that it is used. So the final command line to opt, including all passes done by the standard -O2 optimisation level plus the -loop-reduce strength-reduction pass, needs to look something like this:

<code>opt -load GHCAliasAnalysis.dylib -S -no-aa -tbaa -basicaa -ghc-aa \
-globalopt -ghc-aa -ghc-aa -ipsccp -ghc-aa -deadargelim -ghc-aa -instcombine -ghc-aa -simplifycfg \
-ghc-aa -basiccg -ghc-aa -prune-eh -ghc-aa -inline -ghc-aa -functionattrs -ghc-aa -scalarrepl-ssa \
-ghc-aa -domtree -ghc-aa -early-cse -ghc-aa -simplify-libcalls -ghc-aa -lazy-value-info -ghc-aa \
-jump-threading -ghc-aa -correlated-propagation -ghc-aa -simplifycfg -ghc-aa -instcombine -ghc-aa \
-tailcallelim -ghc-aa -simplifycfg -ghc-aa -reassociate -ghc-aa -domtree -ghc-aa -loops -ghc-aa \
-loop-simplify -ghc-aa -lcssa -ghc-aa -loop-rotate -ghc-aa -licm -ghc-aa -lcssa -ghc-aa -loop-unswitch \
-ghc-aa -instcombine -ghc-aa -scalar-evolution -ghc-aa -loop-simplify -ghc-aa -lcssa -ghc-aa -indvars \
-ghc-aa -loop-idiom -ghc-aa -loop-deletion -ghc-aa -loop-unroll -ghc-aa -memdep -ghc-aa -gvn -ghc-aa \
-memdep -ghc-aa -memcpyopt -ghc-aa -sccp -ghc-aa -instcombine -ghc-aa -lazy-value-info -ghc-aa \
-jump-threading -ghc-aa -correlated-propagation -ghc-aa -domtree -ghc-aa -memdep -ghc-aa -dse \
-ghc-aa -adce -ghc-aa -simplifycfg -ghc-aa -instcombine -ghc-aa -strip-dead-prototypes -ghc-aa \
-constmerge -loop-reduce
</code>

(Yes, I know this is ridiculous! I hope the LLVM developers fix this soon.)

With my new alias analysis pass, LLVM is able to produce the following beautiful code for the loop:

<code>c1TW:                                             ; preds = %c1TW, %c1TW.lr.ph
  %lsr.iv = phi i64 [ %lsr.iv.next, %c1TW ], [ %5, %c1TW.lr.ph ]
  %ln1UF1 = phi i64 [ %ln1TL1, %c1TW.lr.ph ], [ %ln1UF, %c1TW ]
  %ln1UA = inttoptr i64 %lsr.iv to i64*
  store i64 1337, i64* %ln1UA, align 8
  %ln1UF = add i64 %ln1UF1, 1
  %lsr.iv.next = add i64 %lsr.iv, %6
  %ln1TQ = icmp slt i64 %ln1UF, %ln1TP2
  br i1 %ln1TQ, label %c1TW, label %n1TX.loopexit
</code>

Note that the original loop contained a store and two loads, but the optimised loop contains only a single store: our new alias analysis has allowed the loads to be floated out of the loop. This has in turn allowed LLVM to discover the loop induction variable and apply strength reduction - note that the final loop never uses the multiplication instruction!

The final program runs 8.8% faster than the version that is compiled without the custom alias analysis.

Conclusions

My custom alias analyser for GHC-generated code gives LLVM much more room to apply its existing powerful optimisations. There is plenty of scope for improvement, though:

  1. I’d really like people to report their experiences of using this alias analyser with the LLVM backend. Do you see a big speed boost on your data-parallel Haskell programs, for example?

  2. Of course, I would like this alias analyser to be included with GHC so you can all seamlessly benefit from it. I’ll be working with GHC HQ to make this happen.

  3. I think there is still scope for getting even more useful information about GHC-generated code into LLVM. For example, currently LLVM is unable to eliminate stores to stack locations that we can see will never be accessed because we make a tail call to another function with a stack pointer that points above these locations. I can think of at least two ways to express this to LLVM, and this would produce another nice gain.

    It would also be great if we could teach LLVM something about the garbage collector, as currently if your loop does any allocation at all the presence of calls to the GC pessimises the output code a lot.

I was partly inspired to do this by Ben Lippmeier, who, in his paper at this year’s Haskell Symposium, had to perform strength reduction manually at the Haskell level because LLVM wasn’t doing it for him. I hope I’ve fixed that issue.

Performance problems were also a focus of the discussions about the future of Haskell at ICFP. I’ve been to these discussions three years in a row, and the same topics keep cropping back up: performance, and the fact that Hackage 2.0 still isn’t released. I’ve grown tired of hearing so much talk about these issues with little to no action to resolve them, so I spent this post-ICFP week doing my best to fix them: I first wrote a documentation build bot for the Hackage 2.0 effort, and then moved on to the LLVM performance issues. If everyone helps to move these issues along, then hopefully we can finally talk about some different problems next year!