
Compression of floating point timeseries

I recently had cause to investigate fast methods of storing and transferring financial timeseries. Naively, timeseries can be represented in memory or on disk as simple dense arrays of floating point numbers. This is an attractive representation with many nice properties:

  • Straightforward and widely used.
  • You have random access to the nth element of the timeseries with no further indexes required.
  • Excellent locality-of-reference for applications that process timeseries in time order, which is the common case.
  • Often natively supported by CPU vector instructions.

However, it is not a particularly space-efficient representation. Financial timeseries have considerable structure (e.g. Vodafone's price on T is likely to be very close to the price on T+1), and this structure can be exploited by compression algorithms to greatly reduce storage requirements. This is important either when you need to store a large number of timeseries, or need to transfer a smaller number of timeseries over a bandwidth-constrained network link.

Timeseries compression has received quite a bit of attention from both the academic/scientific programming community (see e.g. FPC and PFOR) and also practitioner communities such as the demoscene (see this presentation by a member of Farbrausch). This post summarises my findings about the effect that a number of easy-to-implement "filters" have on the final compression ratio.

In the context of compression algorithms, filters are simple invertible transformations that are applied to the stream in the hope of making it more compressible by the subsequent compressor. Perhaps the canonical example of a filter is the Burrows-Wheeler transform, which has the effect of moving runs of similar letters together. Some filters will turn an uncompressed input stream (from the user) of length N into an output stream (fed to the compressor) of length N, but in general filters will actually have the effect of making the stream longer. The hope is that the gains to compressibility are enough to recover the bytes lost to any encoding overhead imposed by the filter.

In my application, I was using the compression as part of an RPC protocol that would be used interactively, so I wanted to keep decompression time very low, and for ease of deployment I wanted to get results in the context of Java without making use of any native code. Consequently I was interested in which choice of filter and compression algorithm would give a good tradeoff between performance and compression ratio.

I determined this experimentally. In my experiments, I used timeseries associated with 100 very liquid US stocks retrieved from Yahoo Finance, amounting to 69MB of CSVs split across 6 fields per stock (open/high/low/close and adjusted close prices, plus volume). This amounted to 12.9 million floating point numbers.

Choice of compressor

To decide which compressors were contenders, I compressed these price timeseries with a few pure-Java implementations of the algorithms:

Compressor Compression time (s) Decompression time (s) Compression ratio
None 0.0708 0.0637 1.000
Snappy (org.iq80.snappy:snappy-0.3) 0.187 0.115 0.843
Deflate BEST_SPEED (JDK 8) 4.59 4.27 0.602
Deflate DEFAULT_COMPRESSION (JDK 8) 5.46 4.29 0.582
Deflate BEST_COMPRESSION (JDK 8) 7.33 4.28 0.580
BZip2 MIN_BLOCKSIZE (org.apache.commons:commons-compress-1.10) 1.79 0.756 0.540
BZip2 MAX_BLOCKSIZE (org.apache.commons:commons-compress-1.10) 1.73 0.870 0.515
XZ PRESET_MIN (org.apache.commons:commons-compress-1.10 + org.tukaani:xz-1.5) 2.66 1.20 0.469
XZ PRESET_DEFAULT (org.apache.commons:commons-compress-1.10 + org.tukaani:xz-1.5) 9.56 1.15 0.419
XZ PRESET_MAX (org.apache.commons:commons-compress-1.10 + org.tukaani:xz-1.5) 9.83 1.13 0.419

These numbers were gathered from a custom benchmark harness which simply compresses and then decompresses the whole dataset once. However, I saw the same broad trends confirmed by a JMH benchmark of the same combined operation:

Compressor Compress/decompress time (s) JMH compress/decompress time (s)
None 0.135 0.127 ± 0.002
Snappy (org.iq80.snappy:snappy-0.3) 0.302 0.215 ± 0.003
Deflate BEST_SPEED (JDK 8) 8.86 8.55 ± 0.15
Deflate DEFAULT_COMPRESSION (JDK 8) 9.75 9.35 ± 0.09
Deflate BEST_COMPRESSION (JDK 8) 11.6 11.4 ± 0.1
BZip2 MIN_BLOCKSIZE (org.apache.commons:commons-compress-1.10) 2.55 3.10 ± 0.04
BZip2 MAX_BLOCKSIZE (org.apache.commons:commons-compress-1.10) 2.6 3.77 ± 0.31
XZ PRESET_MIN (org.apache.commons:commons-compress-1.10 + org.tukaani:xz-1.5) 3.86 4.08 ± 0.12
XZ PRESET_DEFAULT (org.apache.commons:commons-compress-1.10 + org.tukaani:xz-1.5) 10.7 11.1 ± 0.1
XZ PRESET_MAX (org.apache.commons:commons-compress-1.10 + org.tukaani:xz-1.5) 11.0 11.5 ± 0.4

What we see here is rather impressive performance from BZip2 and Snappy. I expected Snappy to do well, but BZip2's good showing surprised me. In some previous (unpublished) microbenchmarks I've not seen GZipInputStream (a thin wrapper around Deflate with DEFAULT_COMPRESSION) be quite so slow, and my results also seem to contradict other Java compression benchmarks.

One contributing factor may be that the structure of the timeseries I was working with in that unpublished benchmark was quite different: there was a lot more repetition (runs of NaNs and zeroes), and compression ratios were consequently higher.

In any event, based on these results I decided to continue my evaluation with both Snappy and BZip2 MIN_BLOCKSIZE. It's interesting to compare these two compressors because, unlike BZ2, Snappy doesn't perform any entropy encoding.
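As an aside, here is roughly what driving these two compressors from Java looks like. The BZip2 part uses the commons-compress stream API, which I'm confident about; the Snappy calls reflect my recollection of the iq80 API and should be treated as an assumption rather than a reference — this is a sketch, not the actual benchmark harness:

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.Arrays;

import org.apache.commons.compress.compressors.bzip2.BZip2CompressorOutputStream;
import org.iq80.snappy.Snappy;

public class Compressors {
    // BZip2 at its fastest setting (MIN_BLOCKSIZE, i.e. 100k blocks)
    static byte[] bzip2Min(byte[] input) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (OutputStream out = new BZip2CompressorOutputStream(sink, BZip2CompressorOutputStream.MIN_BLOCKSIZE)) {
            out.write(input);
        }
        return sink.toByteArray();
    }

    // Snappy block compression (assumed iq80 signatures: maxCompressedLength
    // and the offset/length variant of compress returning the compressed size)
    static byte[] snappy(byte[] input) {
        byte[] buffer = new byte[Snappy.maxCompressedLength(input.length)];
        int written = Snappy.compress(input, 0, input.length, buffer, 0);
        return Arrays.copyOf(buffer, written);
    }
}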

Filters

The two filters that I evaluated were transposition and zig-zagged delta encoding.

Transposition

The idea behind transposition (also known as "shuffling") is as follows. Let's say that we have three floating point numbers, each occupying 4 bytes:

On a big-endian system this will be represented in memory row-wise by the 4 consecutive bytes of the first float (MSB first), followed by the 4 bytes of the second float, and so on. In contrast, a transposed representation of the same data would encode all of the MSBs first, followed by all of the second-most-significant bytes, and so on, in a column-wise fashion:

The reason you might think that writing the data column-wise would improve compression is that you might expect e.g. the most significant bytes of the floats in a timeseries to be very similar to each other. By moving these similar bytes closer together you increase the chance that compression algorithms will be able to find repeating patterns in them, undisturbed by the essentially random content of the LSB.
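To make this concrete, here is a minimal Java sketch of the byte transposition (the names are mine, and the real experiment code may differ): it lays the floats out row-wise in big-endian order and then reads the bytes back out column-wise. The inverse transform simply swaps the roles of the two index expressions.

import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class ByteTranspose {
    // Column-wise ("shuffled") layout: all MSBs first, then all second bytes, etc.
    static byte[] transpose(float[] xs) {
        ByteBuffer rowWise = ByteBuffer.allocate(xs.length * 4).order(ByteOrder.BIG_ENDIAN);
        for (float x : xs) rowWise.putFloat(x);

        byte[] columnWise = new byte[xs.length * 4];
        for (int column = 0; column < 4; column++) {
            for (int i = 0; i < xs.length; i++) {
                columnWise[column * xs.length + i] = rowWise.get(i * 4 + column);
            }
        }
        return columnWise;
    }
}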

Field transposition

Analogous to the byte-level transposition described above, another thing we might try is transposition at the level of a float subcomponent. Recall that floating point numbers are divided into sign, exponent and mantissa components. For single precision floats this looks like:

Inspired by this, another thing we might try is transposing the data field-wise — i.e. serializing all the signs first, followed by all the exponents, then all the mantissas:

(Note that I'm inserting padding bits to keep multibit fields byte aligned — more on this later on.)

We might expect this transposition technique to improve compression by preventing changes in unrelated fields from causing us to be unable to spot patterns in the evolution of a certain field. A good example of where this might be useful is the sign bit: for many timeseries of interest we expect the sign bit to be uniformly 1 or 0 (i.e. all negative or all positive numbers). If we encoded the float without splitting it into fields, then that one very predictable bit would be mixed in with 31 much more varied bits, making it much harder to spot this pattern.
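A minimal sketch of what this field splitting might look like in Java, with one output stream per field (the method and stream names are mine). In line with the byte-alignment note above, the mantissa is padded to 3 whole bytes; the sign is written as a whole byte here purely for brevity, whereas the experiments described below spend only a single bit on it.

import java.io.DataOutputStream;
import java.io.IOException;

public class FieldSplit {
    static void split(float[] xs, DataOutputStream signs,
                      DataOutputStream exponents,
                      DataOutputStream mantissas) throws IOException {
        for (float x : xs) {
            int bits = Float.floatToRawIntBits(x);
            signs.writeByte((bits >>> 31) & 0x1);       // 1-bit sign, padded to a byte here
            exponents.writeByte((bits >>> 23) & 0xFF);  // 8-bit exponent: exactly one byte
            int mantissa = bits & 0x7FFFFF;             // 23-bit mantissa, padded to 3 bytes
            mantissas.writeByte(mantissa >>> 16);       // most significant mantissa byte first
            mantissas.writeShort(mantissa & 0xFFFF);
        }
    }
}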

Delta encoding

In delta encoding, you encode consecutive elements of a sequence not by their absolute value but rather by how much larger they are than the previous element in the sequence. You might expect this to aid later compression of timeseries data because, although a timeseries might have an overall trend, you would expect the day-to-day variation to stay within a fairly constant range. For example, Vodafone's stock price might be generally trending up from 150p at the start of the year to 200p at the end, but you expect it won't usually change by more than 10p on any individual day within that year. Therefore, by delta-encoding the sequence you would expect to increase the probability of the sequence containing a repeated substring and hence its compressibility.

This idea can be combined with transposition, by applying the transposition to the deltas rather than the raw data to be compressed. If you do go this route, you might then apply a trick called zig-zagging (used in e.g. protocol buffers) and store your deltas such that small negative numbers are represented as small positive ints. Specifically, you might store the delta -1 as 1, 1 as 2, -2 as 3, 2 as 4 and so on. The reasoning behind this is that you expect your deltas to be both positive and negative, but certainly clustered around 0. By using zig-zagging, you tend to cause the MSB of your deltas to become 0, which then in turn leads to extremely compressible long runs of zeroes in your transposed version of those deltas.
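Here is a sketch of the zig-zag delta transform applied to the 32-bit contents of a field (the helper class is my own; it is not taken from the experiment code):

public class ZigZagDelta {
    // Encode each value as a zig-zagged difference from its predecessor:
    // 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, 2 -> 4, ...
    static int[] encode(int[] fieldValues) {
        int[] encoded = new int[fieldValues.length];
        int previous = 0;
        for (int i = 0; i < fieldValues.length; i++) {
            int delta = fieldValues[i] - previous;
            previous = fieldValues[i];
            encoded[i] = (delta << 1) ^ (delta >> 31); // zig-zag (as in protocol buffers)
        }
        return encoded;
    }

    // The inverse: undo the zig-zag, then the running sum
    static int[] decode(int[] encoded) {
        int[] decoded = new int[encoded.length];
        int previous = 0;
        for (int i = 0; i < encoded.length; i++) {
            int delta = (encoded[i] >>> 1) ^ -(encoded[i] & 1);
            previous += delta;
            decoded[i] = previous;
        }
        return decoded;
    }
}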

Special cases

One particular floating point number is worth discussing: NaN. It is very common for financial timeseries to contain a few NaNs scattered throughout. For example, when a stock exchange is on holiday no official close prices will be published for the day, and this tends to be represented as a NaN in a timeseries of otherwise similar prices.

Because NaNs are both common and very dissimilar to other numbers that we might encounter, we might want to encode them with a special short representation. Specifically, I implemented a variant of the field transposition above, where the sign bit is extended to a two-bit "descriptor" value with the following interpretation:

Bit 1 Bit 2 Interpretation
0 0 Zero
0 1 NaN
1 0 Positive
1 1 Negative

The mantissa and exponent are not stored if the descriptor is 0 or 1.

Note that this representation of NaNs erases the distinction between different NaN values, when in reality there are e.g. 16,777,214 distinct single precision NaNs. This technically makes this a lossy compression technique, but in practice it is rarely important to be able to distinguish between different NaN values. (The only application that I'm aware of that actually depends on the distinction between NaNs is LuaJIT.)
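To make the descriptor encoding concrete, here is a sketch of how the descriptor might be emitted per value. BitSink is a hypothetical bit-level writer standing in for whatever the real implementation uses; it is not part of any library.

public class SpecialCases {
    interface BitSink { void writeBits(int bits, int count); }

    static void writeDescriptor(float x, BitSink descriptors) {
        if (x == 0.0f) {
            descriptors.writeBits(0b00, 2);  // zero: no exponent/mantissa follows
        } else if (Float.isNaN(x)) {
            descriptors.writeBits(0b01, 2);  // NaN: payload collapsed, nothing else stored
        } else if (x > 0.0f) {
            descriptors.writeBits(0b10, 2);  // positive: exponent and mantissa follow
        } else {
            descriptors.writeBits(0b11, 2);  // negative: exponent and mantissa follow
        }
    }
}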

Methodology

In my experiments (available on Github) I tried all combinations of the following compression pipeline:

  1. Field transposition: on (start by splitting each number into 3 fields) or off (treat whole floating point number as a single field)?
  2. (Only if field transposition is being used.) Special cases: on or off?
  3. Delta encoding: off (store raw field contents) or on (store each field as an offset from its previous value)? When delta encoding was turned on, I additionally used zig-zagging.
  4. Byte transposition: given that I have a field, should I transpose the bytes of that field? In fact, I exhaustively investigated all possible byte-aligned transpositions of each field.
  5. Compressor: BZ2 or Snappy?

I denote a byte-level transposition as a list of numbers summing to the number of bytes in one data item. So for example, a transposition for 4-byte numbers which wrote all of the LSBs first, followed by all of the next-most-significant bytes and so on would be written as [1, 1, 1, 1], while one that broke each 4-byte quantity into two 16-bit chunks would be written [2, 2], and the degenerate case of no transposition would be [4]. Note that numbers occur in the list in increasing order of the significance of the bytes in the item that they manipulate.
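To pin this notation down, here is a sketch of applying such a transposition. It assumes the bytes of each item are stored least-significant byte first in the input array; that convention, and the names, are mine rather than necessarily those of the code on Github.

public class Transpose {
    // chunkWidths is the transposition, e.g. {2, 1} for a 3-byte mantissa:
    // a column of the two least significant bytes of every item, followed by
    // a column of every item's most significant byte.
    static byte[] apply(byte[] itemsLsbFirst, int itemSize, int[] chunkWidths) {
        int count = itemsLsbFirst.length / itemSize;
        byte[] out = new byte[itemsLsbFirst.length];
        int outPos = 0, chunkStart = 0;
        for (int width : chunkWidths) {
            for (int i = 0; i < count; i++) {
                for (int b = 0; b < width; b++) {
                    out[outPos++] = itemsLsbFirst[i * itemSize + chunkStart + b];
                }
            }
            chunkStart += width;
        }
        return out;
    }
}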

As discussed above, in the case where a field (such as the mantissa) wasn't an exact multiple of 8 bits wide, my filters padded the field to the nearest byte boundary before sending it to the compressor. This means that the filtering process actually makes the data substantially larger (e.g. 52-bit double mantissas are padded to 56 bits, becoming 7.7% larger in the process). This not only makes the filtering code simpler, but also turns out to be essential for good compression when using Snappy, which is only able to detect byte-aligned repetition.

Without further ado, let's look at the results.

Dense timeseries

I begin by looking at dense timeseries where NaNs do not occur. With such data, it's clear that we won't gain from the "special cases" encoding above, so results in this section are derived from a version of the compression code where we just use 1 bit to encode the sign.

Single-precision floats

The minimum compressed size (in bytes) achieved for each combination of parameters is as follows:

Exponent Method Mantissa Method BZ2 Snappy
Delta Delta 6364312 9067141
Literal 6283216 8622587
Literal Delta 6372444 9071864
Literal 6306624 8626114

(This table says that, for example, if we delta-encode the float exponents but literal-encode the mantissas, then the best tranposition scheme achieved a compressed size of 6,283,216 bytes.)

The story here is that delta encoding is strictly better than literal encoding for the exponent, but conversely literal encoding is better for the mantissa. In fact, if we look at the performance of each possible mantissa transposition, we can see that delta encoding tends to underperform in those cases where the MSB is split off into its own column, rather than being packaged up with the second-most-significant byte. This result is consistent across both BZ2 and Snappy.

Mantissa Transposition Mantissa Method BZ2 Snappy
[1, 1, 1] Delta 7321926 9199766
Literal 7226293 9154821
[1, 2] Delta 7394645 9317258
Literal 7824557 9420206
[2, 1] Delta 6514554 9099462
Literal 6283216 8622587
[3] Delta 6364312 9067141
Literal 6753718 9475316

The other interesting feature of these results is that transposition tends to hurt BZ2 compression ratios. It always makes things worse with delta encoding, and even with literal encoding only one particular transposition ([2, 1]) actually strongly improves the BZ2 result. Things are a bit different for Snappy: although once again delta is always worse with transposition enabled, transposition always aids Snappy in the literal case — though once again the effect is strongest with [2, 1] transposition.

The strong showing for [2, 1] transposition suggests to me that the lower-order bits of the mantissa are more correlated with each other than they are with the MSB. This makes some sense: because equities trade with a fixed tick size, prices are actually quantised into a relatively small number of values, which will tend to cause the lower-order bits of the mantissa to become correlated.

Finally, we can ask what would happen if we didn't make the mantissa/exponent distinction at all and instead just packed those two fields together:

Method BZ2 Snappy
Delta 6254386 8847839
Literal 6366714 8497983

These numbers don't show an overwhelming preference for either of the two approaches. For BZ2 the best result here comes from delta encoding, which actually improves on the best split delta result, while for Snappy the best result comes from literal encoding. What is true is that, in the best case, the compressed sizes we observe here are slightly better than the best achievable sizes in the split case.

In some ways it is quite surprising that delta encoding ever beats literal encoding in this scenario, because it's not clear that the deltas we compute here are actually meaningful and hence likely to generally be small.

We can also analyse the best available transpositions in this case. Considering BZ2 first:

BZ2 Rank Delta Transposition Size Literal Transposition Size
1 [4] 6254386 [2, 2] 6366714
2 [3, 1] 6260446 [2, 1, 1] 6438215
3 [2, 2] 6395954 [3, 1] 7033810
4 [2, 1, 1] 6402354 [4] 7109668
5 [1, 1, 1, 1] 7136612 [1, 1, 1, 1] 7327437
6 [1, 2, 1] 7227282 [1, 1, 2] 7405128
7 [1, 1, 2] 7285350 [1, 2, 1] 8039403
8 [1, 3] 7337277 [1, 3] 8386647

Just as above, delta encoding does best when no transposition at all is used, and generally gets worse as the transposition becomes more and more "fragmented". On the other hand, literal encoding does well with transpositions that tend to keep together the first two bytes (i.e. the exponent plus the leading bits of the mantissa).

Now let's look at the performance of the unsplit data when compressed with Snappy:

Snappy Rank Delta Transposition Size Literal Transposition Size
1 [3, 1] 8847839 [2, 1, 1] 8497983
2 [2, 1, 1] 8883392 [1, 1, 1, 1] 9033135
3 [1, 1, 1, 1] 8979988 [1, 2, 1] 9311863
4 [1, 2, 1] 9093582 [3, 1] 9404027
5 [2, 2] 9650796 [2, 2] 10107042
6 [4] 9659888 [1, 1, 2] 10842190
7 [1, 1, 2] 9847987 [4] 10942085
8 [1, 3] 10524215 [1, 3] 11722159

The Snappy results are very different from the BZ2 case. Here, the same sort of transpositions tend to do well with both the literal and delta methods. The kinds of transpositions that are successful are those that keep together the exponent and the leading bits of the mantissa, though even fully-dispersed transpositions like [1, 1, 1, 1] put in a strong showing.

That's a lot of data, but what's the bottom line? For Snappy, splitting the floats into mantissa and exponent before processing does seem to produce slightly more consistently small outputs than working with unsplit data. The BZ2 situation is less clear, but only because the exact choice doesn't seem to make a ton of difference. Therefore, my recommendation for single-precision floats is to delta-encode exponents, and to use literal encoding for mantissas with [2, 1] transposition.

Double-precision floats

While there were only 4 different transpositions for single-precision floats, there are 2 ways to transpose a double-precision exponent, and 64 ways to transpose the mantissa. This makes the parameter search for double precision considerably more computationally expensive. The results are:

Exponent Method Mantissa Method BZ2 Snappy
Delta Delta 6485895 12463583
Literal 6500390 10437550
Literal Delta 6456152 12469132
Literal 6475579 10439869

These results show interesting differences between BZ2 and Snappy. For BZ2 there is not much in it, but it is consistently better to literal-encode the exponent and delta-encode the mantissa. For Snappy, things are exactly the other way around: delta-encoding the exponent and literal-encoding the mantissa is optimal.

The choice of exponent transposition scheme has the following effect:

Exponent Transposition Exponent Method BZ2 Snappy
[1, 1] Delta 6485895 10437550
Literal 6492837 10439869
[2] Delta 6496032 10560467
Literal 6456152 10564737

It's not clear, but [1, 1] transposition might be optimal. Bear in mind that double exponents are only 11 bits long, so the lower 5 bits of the LSB being encoded here will always be 0. Using [1, 1] transposition might better help the compressor get a handle on this pattern.

When looking at the best mantissa transpositions, there are so many possible transpositions that we'll consider BZ2 and Snappy one by one, examining just the top 10 transposition choices for each. BZ2 first:

BZ2 Rank Delta Mantissa Transposition Size Literal Mantissa Transposition Size
1 [7] 6456152 [6, 1] 6475579
2 [6, 1] 6498686 [1, 5, 1] 6495555
3 [4, 3] 6962167 [2, 4, 1] 6510416
4 [4, 2, 1] 6994270 [1, 1, 4, 1] 6510416
5 [1, 6] 7040401 [3, 3, 1] 6692613
6 [3, 4] 7054835 [2, 1, 3, 1] 6692613
7 [1, 5, 1] 7092230 [1, 1, 1, 3, 1] 6692613
8 [2, 5] 7108000 [1, 2, 3, 1] 6692613
9 [2, 4, 1] 7176760 [1, 6] 6926231
10 [3, 3, 1] 7210100 [7] 6931475

We can see that literal encoding tends to beat delta encoding, though the very best size was in fact achieved by a simple untransposed delta representation. In both the literal and the delta case, the encodings that do well tend to keep the middle 5 bytes of the mantissa grouped together, which supports our idea that these bytes tend to be highly correlated, with most of the information being encoded in the MSB.

Turning to Snappy:

Snappy Rank Delta Mantissa Transposition Size Literal Mantissa Transposition Size
1 [5, 1, 1] 12463583 [3, 2, 1, 1] 10437550
2 [1, 3, 2, 1] 12678887 [1, 1, 1, 2, 1, 1] 10437550
3 [6, 1] 12737838 [1, 2, 2, 1, 1] 10437550
4 [4, 2, 1] 12749067 [2, 1, 2, 1, 1] 10437550
5 [7] 12766804 [1, 1, 1, 1, 1, 1, 1] 10598739
6 [1, 3, 1, 1, 1] 12820360 [2, 1, 1, 1, 1, 1] 10598739
7 [3, 1, 2, 1] 12824154 [3, 1, 1, 1, 1] 10598739
8 [4, 1, 1, 1] 12890981 [1, 2, 1, 1, 1, 1] 10598739
9 [3, 3, 1] 12915900 [1, 1, 1, 1, 2, 1] 10715444
10 [3, 1, 1, 1, 1] 12946767 [3, 1, 2, 1] 10715444

The Snappy results are strikingly different from the BZ2 ones. Here too literal encoding tends to beat delta encoding, but the difference is much more pronounced than in the BZ2 case. Furthermore, the kinds of transpositions that minimize the size of the literal-encoded data here are very different from the transpositions that were successful with BZ2: in that case we wanted to keep the middle bytes together, while here the scheme [1, 1, 1, 1, 1, 1, 1], where every byte has its own column, is not far from optimal.

And now considering results for the case where we do not split the floating point number into mantissa/exponent components:

Method BZ2 Snappy
Delta 6401147 11835533
Literal 6326562 9985079

These results show a clear preference for literal encoding, which is definitely what we expect, given that delta encoding is not obviously meaningful for an unsplit number. We also see results that are universally better than those for the split case: it seems that splitting the number into fields is actually a fairly large pessimisation! This is probably caused by the internal fragmentation implied by our byte-alignment of the data, which is a much greater penalty for doubles than it was for singles. It would be interesting to repeat the experiment without byte-alignment.

We can examine which transposition schemes do best in the unsplit case. BZ2 first:

BZ2 Rank Delta Transposition Size Literal Transposition Size
1 [8] 6401147 [6, 2] 6326562
2 [7, 1] 6407935 [1, 5, 2] 6351132
3 [6, 2] 6419062 [2, 4, 2] 6360533
4 [6, 1, 1] 6440333 [1, 1, 4, 2] 6360533
5 [4, 3, 1] 6834375 [3, 3, 2] 6562859
6 [4, 4] 6866167 [1, 2, 3, 2] 6562859
7 [4, 2, 1, 1] 6903123 [2, 1, 3, 2] 6562859
8 [4, 2, 2] 6903322 [1, 1, 1, 3, 2] 6562859
9 [1, 7] 6979216 [6, 1, 1] 6598104
10 [1, 6, 1] 6983998 [1, 5, 1, 1] 6621003

Recall that double precision floating point numbers have 11 bits of exponent and 52 bits of mantissa. We can actually see that showing up in the literal results above: the transpositions that do best are those that either pack together the exponent and the first bits of the mantissa, or have a separate column for just the exponent information (e.g. [1, 5, 2] or [2, 4, 2]).

And Snappy:

Snappy Rank Delta Transposition Size Literal Transposition Size
1 [5, 1, 1, 1] 11835533 [1, 1, 1, 2, 1, 1, 1] 9985079
2 [1, 3, 2, 1, 1] 12137223 [2, 1, 2, 1, 1, 1] 9985079
3 [6, 1, 1] 12224432 [3, 2, 1, 1, 1] 9985079
4 [4, 2, 1, 1] 12253670 [1, 2, 2, 1, 1, 1] 9985079
5 [1, 3, 1, 1, 1, 1] 12280165 [1, 1, 1, 1, 1, 1, 1, 1] 10232996
6 [3, 1, 2, 1, 1] 12281694 [2, 1, 1, 1, 1, 1, 1] 10232996
7 [5, 1, 2] 12338551 [1, 2, 1, 1, 1, 1, 1] 10232996
8 [4, 1, 1, 1, 1] 12399017 [3, 1, 1, 1, 1, 1] 10232996
9 [3, 1, 1, 1, 1, 1] 12409691 [1, 1, 1, 1, 2, 1, 1] 10343471
10 [2, 2, 2, 1, 1] 12434147 [2, 1, 1, 2, 1, 1] 10343471

Here we see the same pattern as we did above: Snappy seems to prefer "more transposed" transpositions than BZ2 does, and we even see a strong showing for the maximal split [1, 1, 1, 1, 1, 1, 1, 1].

To summarize: for doubles, it seems that regardless of which compressor you use, you are better off not splitting into mantissa/exponent portions, and just literal encoding the whole thing. If using Snappy, [1, 1, 1, 1, 1, 1, 1, 1] transposition seems to be the way to go, but the situation is less clear with BZ2: [6, 2] did well in our tests but it wasn't a runaway winner.

If for some reason you did want to use splitting and you are also going to use BZ2, then [6, 1] literal encoding for the mantissas and literal encoding for the exponents seems like a sensible choice. If you are a Snappy user, then a principled choice would be to use [1, 1, 1, 1, 1, 1, 1] literal encoding for the mantissas and likewise [1, 1] literal encoding for the exponents.

Sparse timeseries

Let's now look at the sparse timeseries case, where many of the values in the timeseries are NaN. In this case, we're interested in evaluating how useful the "special case" optimization above is in improving compression ratios.

To evaluate this, I replaced a fraction of numbers in my test dataset with NaNs and looked at the best possible size result for a few such fractions. The compressed size in each case is:

NaN Fraction No Split Split w/ Special Cases Split w/o Special Cases
0.00 9985079 10445497 10437550
0.10 10819117 9909251 11501526
0.50 8848393 6238097 10255811
0.75 5732076 3628543 7218287

Note that for convenience here the only compressor I tested was Snappy — i.e. BZ2 was not tested. I also didn't implement special cases in the no-split case, because an artifact of my implementation is that the special-casing is done at the same time as the float is split into its three component fields (sign, mantissa, exponent).

As we introduce small numbers of NaNs to the data, both the no-split and non-special-cased data get larger. This is expected, because we're replacing predictable timeseries values at random with totally dissimilar values and hence adding entropy. The special-cased split shrinks because this increasing entropy is compensated for by the very short codes we have chosen for NaNs (for which we pay a very small penalty in the NaN-less case). At very high numbers of NaNs, the compressed data for all methods shrinks as NaNs become the rule rather than the exception.

A NaN fraction of 10% or more is probably realistic for real-world financial data, so it definitely does seem like implementing special cases is worthwhile. The improvement would probably be considerably smaller if we looked at BZ2-based results, though.

Closing thoughts

One general observation is that delta encoding is very rarely the best choice, and when it is the best, the gains are usually marginal when compared to literal encoding. This is interesting because Fabian Giesen came to exactly the same conclusion (that delta encoding is redundant when you can do transposition) in the excellent presentation that I linked to earlier.

By applying these techniques to the dataset I was dealing with at work, I was able to get a compression ratio on the order of 10%-20% better than what I could achieve with naive use of Snappy, so I consider the work a success, but I don't intend to do any more research in the area. However, there are definitely more experiments that could be done in this vein. In particular, interesting questions are:

  • How robust are the findings of this post when applied to other datasets?
  • What if we don't byte-align everything? Does that improve the BZ2 case? (My preliminary experiments showed that it made Snappy considerably worse.)
  • Why exactly are the results for BZ2 and Snappy so different? Presumably it relates to the lack of an entropy encoder in Snappy, but it is not totally clear to me how this leads to the results above.

Rpath emulation: absolute DLL references on Windows

When creating an executable or shared library on Linux, it’s possible to include an ELF RPATH header which tells the dynamic linker where to search for any shared libraries that you reference. This is a pretty handy feature because it can be used to nail down exactly which shared library you will link against, without leaving anything up to chance at runtime.

Unfortunately, Windows does not have an equivalent feature. However, it does have an undocumented feature which may be enough to replace your use of rpath if you are porting software from Linux.

Executables and DLLs on Windows always reference any DLLs that they import by name only. So, the import table for an executable will refer to kernel32.dll rather than C:\Windows\kernel32.dll. Windows’ dynamic loader will look for a file with the appropriate name in the DLL search path as usual. (For full details on DLL import tables and more, you can check out my previous in-depth post.)

However, Windows’ dynamic loader will, as a completely undocumented (and presumably unsupported) feature, also accept absolute paths in the import table. This is game-changing because it means that you can hard-code exactly which DLL you want to refer to, just like you would be able to with rpath on Linux.

Demonstration

To demonstrate this technique, we’re going to need code for a DLL and a referring EXE:

$ cat library.c
#include <stdio.h>

__declspec(dllexport) int librarycall(void) {
        printf("Made library call!\n");
        return 0;
}

$ cat rpath.c
__declspec(dllimport) int librarycall(void);

int main(int argc, char **argv) {
        return librarycall();
}

If we were building a DLL and EXE normally, we would do this:

<code>gcc -c library.c
gcc -shared -o library.dll library.o
gcc -o rpath rpath.c -L./ -llibrary</code>

This all works fine:

<code>$ ./rpath
Made library call!</code>

However, as you would expect, if you move library.dll elsewhere, the EXE will fail to start:

<code>$ mv library.dll C:/library.dll
$ ./rpath
/home/Max/rpath/rpath.exe: error while loading shared libraries: library.dll: cannot open shared object file: No such file or directory</code>

Now let’s work some magic! If we open up rpath.exe in a hex editor, we see something like this:

Let’s just tweak that a bit to change the relative path to library.dll to an absolute path. Luckily there is enough padding to make it fit:

The EXE will now work perfectly!

<code>$ ./rpath
Made library call!</code>

In practice

Knowing that this feature exists is one thing. Actually making use of it in a reliable way is another. The problem is that, to my knowledge, no linker is capable of creating a DLL or EXE which includes an absolute path in its import table. Sometimes we will be lucky enough that the linker creates an EXE or DLL with enough padding in it for us to manually edit in an absolute path, but with the method above there is no guarantee that this will be possible.

In order to exploit this technique robustly, we’re going to use a little trick with import libraries. Instead of using GCC’s ability to link directly to a DLL, we will generate an import library for the DLL, which we will call library.lib:

<code>$ dlltool --output-lib library.lib --dllname veryverylongdllname.dll library.o</code>

When you use dlltool you either need to write a .def file for the DLL you are creating an import library for, or you need to supply all the object files that were used to create the DLL. I’ve taken the second route here and just told dlltool that our DLL was built from library.o.

Now we have an import library, we can do our hex-editing trick again, but this time on the library. Before:

And after (note that I have null-terminated the new absolute path):

The beauty of editing the import library rather than the output of the linker is that using the --dllname option we can ensure that the import library contains as much space as we need to fit the entire absolute path of the DLL, no matter how long it may be. This is the key to making robust use of absolute paths in DLL loading, even if linkers don’t support them!

Now we have the import library, we can link rpath.exe again, but this time using the import library rather than library.dll:

<code>$ gcc -o rpath rpath.c library.lib
$ ./rpath
Made library call!</code>

Yes, it really is using the DLL on the C: drive:

<code>$ mv C:/library.dll C:/foo.dll
$ ./rpath
/home/Max/rpath/rpath.exe: error while loading shared libraries: C:\library.dll: cannot open shared object file: No such file or directory</code>

Conclusion

I haven’t seen this technique for using absolute paths for DLL references anywhere on the web, so it doesn’t seem to be widely known. However, it works beautifully on Windows 7 and probably on all other versions of Windows as well.

I may apply these techniques to the Glasgow Haskell Compiler in order to improve the support for Haskell shared objects on Windows: more information on this topic can be found on the GHC wiki.

Fixing "files could not be moved" error in Boot Camp Assistant

Recently I've been trying to install Windows on an OS X laptop by using Boot Camp. However, every time the Boot Camp Assistant would tell me that "some files could not be moved" during the creation of the Windows partition. The most commonly suggested solution is a total reinstall of OS X, which I was absolutely not willing to perform.

I read online that this problem could sometimes be solved by using iDefrag in "Compact" mode to manually move all your files to the front of the disk before using the Assistant. However, buying and running this £24 software had absolutely no effect on the problem.

Looking at iDefrag's summary view, it seemed like there was a single unmovable "alternate volume label" block right at the end of the drive which might account for the problem. Anyway, I never got to the bottom of this, since I found an alternative solution that worked: hold down Apple+S during startup to enter single user mode and then run these commands:

/sbin/fsck -fy
exit

This just repairs any filesystem errors on your disk. It looks like filesystem errors were the true culprits, not unmovable files, since running the Assistant after this let me create the partition with no problems - Windows is installing right now.

How to build 32/64 bit fat (universal) binaries

The OS X version of the Glasgow Haskell Compiler compiles Haskell into 32-bit code. Unfortunately, this means that if you are on a system where it is the default for libraries to be built in 64-bit mode, you tend to get errors when linking Haskell code telling you that you are trying to link 32-bit code against 64-bit code.

The best solution to this problem is to build all libraries you intend to link to from Haskell code as universal binaries that include both 32-bit and 64-bit versions of the code. These libraries will then work seamlessly with both Haskell code and also when pulled in as part of the build process for non-Haskell 64-bit executables.

If you can install the library using MacPorts, this is easy to do. Instead of doing:

sudo port install mylibrary

Just do:

sudo port install mylibrary +universal

However, if the library you want is not available through MacPorts or the MacPorts version is not up to date you will need to know how to build these universal libraries for yourself. This is the process that I aim to explain in this post. I'm going to use igraph as my example library because it's what I needed to install (I needed to install the unreleased v0.6).

The easy method

If you are lucky, building a universal library is as simple as changing how you invoke make. Run the library's configure scripts etc as usual, and then invoke make as follows:

make CXXFLAGS="-arch i386 -arch x86_64" CFLAGS="-arch i386 -arch x86_64" LDFLAGS="-arch i386 -arch x86_64"

The -arch flags tell GCC and the linker to build and link both versions of the library. If this works, you are done. In the case of igraph, this wasn't quite enough - the above command failed with this error:

gcc-4.2: -E, -S, -save-temps and -M options are not allowed with multiple -arch flags

The reason this occurs is that igraph invokes GCC with the -M series of flags, which generate makefile dependency rules from the C code - but GCC doesn't like generating those rules for two architectures simultaneously. Luckily, there was an easy workaround in my case - I just needed to reconfigure igraph as follows:

./configure --disable-dependency-tracking

The --disable-dependency-tracking flag just stops Automake from determining the dependencies of each C file as it compiles it. It is totally harmless to disable this because that dependency information is only used to rebuild less stuff upon subsequent invocations of make - the worst that happens when you disable it is that if you run make more than once you will have to wait a bit longer while things rebuild unnecessarily. For more information on this feature see also the relevant section of the Automake manual.

After reconfiguring in this manner, the original make invocation worked correctly for igraph.

The hard method

The above method may fail for some libraries, in which case you can use this more arduous manual method. The idea is to run the library's build process from scratch twice: once to get the 32-bit library and once for the 64-bit library. We can then use lipo to glue together the build artifacts from the two runs.

We start by building the 32-bit version:

make clean
make CXXFLAGS=-m32 CFLAGS=-m32 LDFLAGS=-m32 -j12

We now need to store the 32-bit build artifacts somewhere. Exactly which files you have to save will vary according to the library you are building, but for igraph this was sufficient:

mkdir -p ~/Junk/32 ~/Junk/64
cp src/.libs/libigraph.{a,0.dylib} ~/Junk/32

Now do the 64-bit build and once again save the artifacts somewhere:

make clean
make CXXFLAGS=-m64 CFLAGS=-m64 LDFLAGS=-m64 -j12
cp src/.libs/libigraph.{a,0.dylib} ~/Junk/64

Finally we can use lipo to finish up:

lipo -create ~/Junk/{32,64}/libigraph.a -output src/.libs/libigraph.a
lipo -create ~/Junk/{32,64}/libigraph.0.dylib -output src/.libs/libigraph.0.dylib

At this point, you can do sudo make install and get a universal version of the library installed.

If you want to check that your libraries are indeed universal, you can use lipo -info:

$ lipo -info src/.libs/libigraph.a
Architectures in the fat file: src/.libs/libigraph.a are: i386 x86_64

Conclusions

Building universal 32-bit/64-bit binaries is apparently fairly straightforward but it was tricky to find documentation for the process. I hope this article helps others who need to get this done.

Solving GHC iconv problems on OS X 10.6

A problem that has plagued my GHC installation for a while is that whenever I tried to install any non-trivial package I would get a horrible link error like this:

Undefined symbols:
  "_iconv_close", referenced from:
      _hs_iconv_close in libHSbase-4.3.1.0.a(iconv.o)
     (maybe you meant: _hs_iconv_close)
  "_iconv_open", referenced from:
      _hs_iconv_open in libHSbase-4.3.1.0.a(iconv.o)
     (maybe you meant: _hs_iconv_open)
  "_iconv", referenced from:
      _hs_iconv in libHSbase-4.3.1.0.a(iconv.o)
     (maybe you meant: _hs_iconv_open, _hs_iconv_close , _hs_iconv )
  "_locale_charset", referenced from:
      _localeEncoding in libHSbase-4.3.1.0.a(PrelIOUtils.o)
ld: symbol(s) not found
collect2: ld returned 1 exit status

The reason for this is a combination of several factors:

  • The base library that comes with the GHC binary distribution wants to link against the standard Mac iconv
  • I have installed MacPorts libiconv, which renames the function that is named iconv_open in the standard iconv to libiconv_open
  • The Haskell library being installed by cabal depends transitively on some library that was built with something like extra-lib-dirs: /opt/local/lib, which causes -L/opt/local/lib to be passed to the linker
  • The linker's -L/opt/local/lib option occurs before -L/usr/lib, so the linker prefers to link against the MacPorts libiconv instead of the system one

In my case, it was the Haskell readline wrapper that was causing /opt/local/lib to be pulled in. I had to link the Haskell readline against MacPorts readline because the standard Mac libreadline is actually libeditline, which is almost-but-not-quite compatible and misses some crucial features.

There are several ways to fix the problem:

  • Perhaps you don't really need the MacPorts libiconv. In this case, you can stop it from being used by just doing port deactivate libiconv. This is the route I took.
  • Perhaps it's OK to link this particular library against the system libraries in preference to the MacPorts one. In this case, you can configure the package with cabal configure --extra-lib-dir=/usr/lib, so /usr/lib is searched before the MacPorts directory. This may fail if the package that needed -L/opt/local/lib requires a MacPorts version of some library that is also present in /usr/lib, though.
  • You could build GHC yourself and link it against the MacPorts library versions. This is not for the faint-hearted, but if the version of GHC you need is in MacPorts I imagine that you can just do port install ghc.

I'm glad I've finally got this sorted out. If you are still having trouble, you might find some helpful information in the threads that finally helped me to overcome the issue and prompted this writeup.

Ditaa support for gitit

I hacked together a quick plugin for the most excellent gitit wiki today. It's written in Haskell, so it's an absolute pleasure to write code for it.

What I added support for is a neat little tool called ditaa (DIagrams Through Ascii Art). Basically, in the markdown source of your Gitit wiki you can now write something like the following:

~~~ {.ditaa}
+--------+   +-------+    +-------+
|        | --+ ditaa +--> |       |
|  Text  |   +-------+    |diagram|
|Document|   |!magic!|    |       |
|     {d}|   |       |    |       |
+---+----+   +-------+    +-------+
    :                         ^
    |       Lots of work      |
    +-------------------------+
~~~

The plugin will then call out to the ditaa command line tool (written in Java, boo!) to render that to a beautiful image:

A ditaa image on a Gitit page

To get this set up for yourself, try the following from the root of your Gitit wiki:

git clone git://github.com/batterseapower/gitit-plugins.git batterseapower-plugins
wget http://downloads.sourceforge.net/project/ditaa/ditaa/0.9/ditaa0_9.zip?use_mirror=kent -O ditaa0_9.zip
unzip ditaa0_9.zip

Now edit your Gitit configuration file so the plugins list includes my plugin:

plugins: batterseapower-plugins/Ditaa.hs

That's it - restart Gitit and you should be ready to go!

OMake and Literate Haskell

I've often thought it would be a nice idea to mash together Eelco Dolstra's ideas about purely-functional software deployment with functional reactive programming to produce a purely functional build system. In this world, compilers are pure functions from source (.c files) to object (.o files) and these dependencies are encoded explicitly by argument-passing in the program rather than happening implicitly via the file system.

Having grappled with GNU Make somewhat for the last week, I thought I'd take another look at making this work nicely and made a little progress. However, that is not the topic of this post! In the course of research, Google turned up an interesting find: OMake, a build system written in OCaml. It has a number of niceish features over GNU Make but the thing that really caught my eye was built-in support for LaTeX building - just the problem I'd had over the last week. Thus inspired, I had a go at integrating this support with one of my Literate Haskell documents, and this post will preserve the result for posterity 🙂

OMake syntax is somewhat Make like, so our OMakefile opens with something you may find familiar:

########################################################################
# Phony targets are scoped, so you probably want to declare them first.
#

.PHONY: all clean
.DEFAULT: all

The notion of scoping is somewhat different from that of Make, and it includes the nice feature that you can introduce local scopes into your OMakefile with a "section" declaration - I don't use any of that good stuff here though.

Next I need to tell OMake how to build .lhs files, since it only understands .tex. I do this by declaring a function that generates rules when it is called (pretty cool!):

########################################################################
# Lhs2Tex builder.
#

LHS2TEX = lhs2TeX
LHSSTYLE = tt

Lhs2Tex(tex, lhs) =
    # Add the suffixes
    tex_name = $(addsuffix .tex, $(tex))
    lhs_name = $(addsuffix .lhs, $(lhs))

    # The build rule
    $(tex_name): $(lhs_name)
        $(LHS2TEX) --$(LHSSTYLE) -o $@ < $+

    # Return the tex file name
    value $(tex_name)

At this point I could also use the .SCANNER target to tell OMake how to determine the dependencies of a .lhs file, but I didn't need that for my situation, so I didn't bother to do it. Instead, let's plunge onwards into the declaration of the rules particular to my program:

########################################################################
# LaTeX configuration.
#

DOCUMENT = closure-elimination
GENERATED_TEX = closure-elimination

foreach(tex, $(GENERATED_TEX)):
    Lhs2Tex($(tex), $(tex))

LaTeXDocument($(DOCUMENT), closure-elimination)

This turns my file closure-elimination.lhs into closure-elimination.tex and then finally into a PDF using the built-in LaTeXDocument function. I've used a foreach loop to call my Lhs2Tex function just to show you that this can be done - the OMake system includes a whole rich programming language of its own, which even includes first-class functions!

Finally I need to define those phony targets I declared way up at the start of the post:

########################################################################
# Phony targets
#

all: $(DOCUMENT).pdf

clean:
  rm -f \
     $(addsuffix .tex, $(GENERATED_TEX)) \
     $(addsuffixes .pdf .dvi .aux .log .out, $(DOCUMENT)) \
     *.omc

My "clean" target is somewhat unsatisfactory. I rather think that the use of LaTeXDocument should mean that the LaTeX junk (.pdf, .dvi, .aux, .log, .out) should be automatically cleaned somehow, but I can't work out how to achieve that. (My functional build system would consider this stuff part of the "local side effects" of a build command - encoded through some sort of monad structure - and hence fair game for any "clean".)

Overall, OMake looks like a nice system - indeed, anything that handles the tedious business of running LaTeX over my document the right number of times is doing better than GNU Make in my book. I'll certainly consider using it in future, larger scale projects - but I'm a bit concerned the code has been abandoned, as the last release was in August of 2007...

Other alternatives for LaTeX projects include (if you want to stick with GNU Make) the Ultimate LaTeX Makefile. Another option might be a cute-looking thing I just discovered called rubber, which is a specialised LaTeX build tool.

Upgrading An Unactivated Windows Install To Parallels 4.0

This is a pretty obscure problem, but I'm going to put a post up about it on the off chance I can help someone else out. My regular reader (hi dad!) will probably find this of no interest and should give it a miss 🙂

The situation I found myself in was upgrading a Boot Camp install of Windows Vista for the new release of Parallels Desktop 4.0 - no big deal, you may think. Unfortunately, I had forgotten that that particular install of Windows Vista wasn't activated, which caused the automatic upgrade process to bork, dropping me back to manual mode.

To complete the upgrade I needed to run the Parallels Tools setup executable. However, since I hadn't activated, I could only log in as far as getting the "please activate Windows now" screen. As it happened, I knew that I could get rid of this screen by feeding it the details of a Windows Vista license I own, but in order to do that I needed an Internet connection (I don't think my PAYG phone had enough credit on it for an extended Microsoft call centre experience). However, to get an Internet connection I had to install the Parallels Ethernet Connection drivers, and hence the Tools. Catch 22!

The workaround is convoluted, to say the least. First, we need a command prompt in the restricted Vista activation session. You do this by clicking any of the links in the activation window: they should cause a browser to open. From here, you can ask the browser to "Open a file" and direct it to C:\Windows\System32\cmd.exe - this should initiate "download" of the executable. Click the option to run the file and voila!

Now that you have a command prompt, the fun really begins. You might think you could just type D:\setup.exe and the Tools would begin installing, but life just isn't that simple - in Their infinite wisdom, Microsoft have imposed quotas on the resource consumption of the session they set up for the purposes of activation. This is probably the Right Thing to do from their POV, but it's just a pain in the arse for us.

The workaround is to get the internet connection working, so you can do the activation and hence lift the resource limits. To do this, create a floppy disk image containing the Windows 2000 drivers for a Realtek 8029AS adapter (you should be able to get those from here, until Realtek break their incoming links again). Personally I did this by using another virtual machine to download the files and extract them onto a new floppy disk image (you can create a blank image on the Floppy Drive tab of the VM settings). I would make the fruits of this labour available to you as a simple download if it were not for (unfounded?) fear of Realtek's highly trained attack lawyers.

Once you have the requisite image in your sweaty virtual paws you can proceed to mount it into the Vista VM. To finish up, type compmgmt.msc into that command prompt and update the drivers for the detected network adapter by searching for new ones on the A:\ drive.

You should now be free to run the online activation and break the Catch 22, allowing installation of the Tools - at this point feel free to help yourself to a cup of coffee and a ginger-snap biscuit to celebrate a difficult job done well (I know I did...).

I'm really quite surprised that I had to jump through this many hoops - the Realtek drivers allegedly come with Vista, for one thing. But - c'est la vie! It's also quite pleasing that the humble, long outmoded floppy drive still has a place in solving modern IT problems 🙂

Hackage Releases Made Easy

The Haskell community has built up a great resource: the Hackage Haskell package database, where we recently hit the 500-package mark!

One of those 500 packages was mine, I added another to their number just an hour ago, and I've got two more in the oven. Given, then, that I'm starting to maintain a few packages, I went to the trouble of automating the Hackage release process, and in this post I'm going to briefly walk through setting up this automated environment.

  1. Install cabal-upload from Hackage. I'm afraid that at the time of writing this is not perfectly simple because it won't build with GHC 6.8 or above: this can be fixed with a new .cabal file, however, which I've made available here. (Edit: I've just noticed that this functionality seems to have been added to Cabal itself! You may just be able to use cabal upload. However, I'm not sure what the right config file location is for the next step).
  2. Add a file containing your Hackage username and password in the format ("username","password") called ~/.cabal-upload/auth.
  3. Copy the following shell script into a file called release in the root of your project (the same directory as the Setup.lhs file):
    #!/bin/bash
    #
    
    echo "Have you updated the version number? Type 'yes' if you have!"
    read version_response
    
    if [ "$version_response" != "yes" ]; then
        echo "Go and update the version number"
        exit 1
    fi
    
    sdist_output=`runghc Setup.lhs sdist`
    
    if [ "$?" != "0" ]; then
        echo "Cabal sdist failed, aborting"
        exit 1
    fi
    
    # Want to find a line like:
    # Source tarball created: dist/ansi-terminal-0.1.tar.gz
    
    # Test this with:
    # runghc Setup.lhs sdist | grep ...
    filename=`echo $sdist_output | sed 's/.*Source tarball created: \([^ ]*\).*/\1/'`
    echo "Filename: $filename"
    
    if [ "$filename" = "$sdist_output" ]; then
        echo "Could not find filename, aborting"
        exit 1
    fi
    
    # Test this with:
    # echo dist/ansi-terminal-0.1.tar.gz | sed ...
    version=`echo $filename | sed 's/^[^0-9]*\([0-9\.]*\).tar.gz$/\1/'`
    echo "Version: $version"
    
    if [ "$version" = "$filename" ]; then
        echo "Could not find version, aborting"
        exit 1
    fi
    
    echo "This is your last chance to abort! I'm going to upload in 10 seconds"
    sleep 10
    
    git tag "v$version"
    
    if [ "$?" != "0" ]; then
        echo "Git tag failed, aborting"
        exit 1
    fi
    
    # You need to have stored your Hackage username and password as directed by cabal-upload
    # I use -v5 because otherwise the error messages can be cryptic 🙂
    cabal-upload -v5 $filename
    
    if [ "$?" != "0" ]; then
        echo "Hackage upload failed, aborting"
        exit 1
    fi
    
    # Success!
    exit 0
  4. When you're ready to release something, simply run the shell script! Not only will this package up your project and upload it to Hackage, it will also add a version tag to your Git repository (obviously you should change this bit if you are using another VCS!).

If you would like to follow my continuing adventures in Haskell open source, please check out my GitHub profile! Patches gratefully accepted 🙂

Fixing File Associations Eaten By Parallels Desktop

I use OS X, and so for those rare cases where I must deign to run a program designed for Windows I make use of a copy of Windows hosted in Parallels Desktop. Now, Parallels has been generally good to me, but I recently came across the problem documented in this forum thread where its ability to launch Mac applications from Windows caused some basic file associations in Windows-land to stop working. For example, Excel files were listed as "xls_auto_file" by Explorer and double clicking on them only gave me the "Open With" dialog: not very helpful.

The thread does suggest a means of solving the problem, by creating a batch file that re-associates a white-list of some of the possibly affected extensions, like this:

REM Restores MS Office File Type Associations
assoc .doc=Word.Document.8
assoc .dochtml=wordhtmlfile
assoc .docmhtml=wordmhtmlfile
assoc .docxml=wordxmlfile
assoc .dot=Word.Template.8
assoc .pot=PowerPoint.Template.8
assoc .pps=PowerPoint.SlideShow.8
assoc .ppt=PowerPoint.Show.8
assoc .rtf=Word.RTF.8
assoc .wbk=Word.Backup.8
assoc .xlc=Excel.Chart.8
assoc .xlm=Excel.Macrosheet
assoc .xls=Excel.Sheet.8
assoc .xlt=Excel.Template
assoc .xlw=Excel.Workspace

However, this felt a bit ad-hoc to me, and in particular some of the extensions that were affected for me were not on the list. Thus, like any good programmer I went off and whipped up a utility designed to undo the damage inflicted on my poor defenseless HKEY_CLASSES_ROOT by Parallels.



Using it is pretty simple: just click "Scan" and then "Fix Selected" repeatedly until the scan finds nothing more of interest. Now, before I give you a download link to the utility, I need to give you the standard disclaimer to which you must agree in order to use it:

This utility is provided on an 'as is' basis, without warranties of any kind, and no warranty, express or implied, is given that the operation of the utility is correct or safe to use. I do not accept any liability for any error or omission. Use of the utility is at your own risk.

Sorry for the legalese but this utility is modifying your registry and hence (though I feel it is highly unlikely) could get something quite, quite wrong. You would be also wise to have a backup of the HKEY_CLASSES_ROOT registry branch to restore in case you aren't happy with the changes made by the utility.

That said, without further ado here is the utility and its source code (C# 3.0), both of which (being of my sole authorship) I release into the public domain. I hope someone else finds it useful!