All posts by Max

Being concrete about the benefits of tax efficient index investment

In my last post I discussed the methods that a UK individual could use to make investments. There were plenty of different methods, all with their own unique tradeoffs.

In this post I'm going to focus just on the issue of tax efficiency. Let me remind you that I'm definitely not a tax professional and this post just reflects my current understanding of the situation. You probably shouldn't rely on it to be correct, and should seek independent advice before using any of this info.

That being said, what I have done is write a simulation of an unleveraged FTSE 100 investment from 1989 to 2016-11-01, as achieved via four methods:

  1. Index-tracking ETF
  2. Index future
  3. Spread bet
  4. CFD

With a 100,000 GBP initial investment, the most efficient investing method (spread betting) has a final account value of 610,718 GBP: 163k higher than the least efficient investment method (index futures), which had a final value of 447,629 GBP. That's a 36% difference!

CFDs and ETFs came somewhere in the middle: the CFD investor would have had 521,090 GBP at the end, while the ETF holder would have 491,957 GBP — and this is with the generous assumption that the fees charged by the ETF provider are 0%.

Spread bets win for a simple reason: they don't pay any capital gains or income tax at all. With an unleveraged investment, the high financing costs of a spread bet are irrelevant. Note that my model assumes that your spread bet provider pays you the full value of a dividend if you hold a long position. It is by no means guaranteed that this applies to your provider, but there are a few companies out there that do things this way: I will cover a few in the last part of this post.

Why are index futures so inefficient? The reason is that index future returns are net of the risk free rate. This means that you have to stick your unmargined money in a bank account to earn back the risk free rate, and that means you end up paying income tax — the most onerous of all the taxes. Over the sample period the futures strategy ends up paying 115,309 GBP in income tax alone. The capital gains tax obligations are a relatively modest 19,236 GBP. The trading costs of this option are also relatively high (thanks to the quarterly contract roll), at 2,156 GBP but this is dwarfed by the tax charges.

Note that my backtest period includes a period of rather high interest rates in the UK (rates fluctuated around 10% at the beginning of the period). It's likely that investing in futures is more tax-efficient nowadays than it was historically.

CFDs benefit from being able to treat dividend payments on the underlying as capital gains. The CFD investor would have paid only 49,914 GBP in capital gains tax over the period, and no income tax at all. The fact that this number is roughly half the total tax burden of the index future investor reflects the fact that the higher rate of income tax is about twice the rate of capital gains tax.

Finally, the ETF investor would have paid a mix: 46,314 GBP in income tax, and 16,897 GBP in capital gains. This is a total tax burden not much higher than that paid by the CFD investor, but it differs in that the CFD investor's capital gains liabilities mostly arise towards the end of the test (2013 and later), while the ETF is dribbling dividend income away to the taxman almost every year since inception (actually, 1994 is the first year in which the ETF dividend income exceeds the tax-free threshold).

Naturally, for those who are willing and able to invest all their money within an ISA, all of this discussion is irrelevant — in this case, all dividends and capital gains will be tax free anyway, so you may as well just buy an ETF. Individuals who have hit their ISA contribution cap, or who want to do things that are incompatible with ISAs (e.g. hold futures and options, or use leverage) may however find this information useful.

Note finally that my tests make a number of assumptions:

  • You are a UK higher rate taxpayer
  • Today's tax regime applies across all of history
  • For the index future results: that you can save in an easy-access account offering 0.5% above LIBOR
  • You realize your gains somehow to take full advantage of your annual tax free allowance
  • I couldn't find monthly FTSE 100 index returns anywhere (I know this sounds weird, but I really did try quite hard and they were nowhere to be found) so I backed them out from Quandl's FTSE 100 index future data by assuming an annual dividend yield of 3.83%. At least this should mean that my results aren't affected by shocks to the level of market-expected dividends.

Spread-Betting Providers

From the above we can see that spread betting can be advantageous. However, this conclusion is sensitive to the amount of dividends that the provider passes on to the bettor. My computed final account value of 610,718 GBP assumes 100% of dividends are passed on, but if just 90% are passed on then you will have only 552,525 GBP at the end — still good, but not much better than a CFD investment. At 85% passthrough the final value is 525,472 GBP, and if your provider is cheeky enough to pass on only 80% then the final value would be 499,698 GBP — almost as bad as holding an ETF.

I gathered some info from around the web about the charges imposed by various spread-betting providers. For long term investors the relevant bits of info are the financing rate and the fraction of dividends that are passed through to you (for a long position). The below table compares a few providers on these criteria, again assuming an investment into the FTSE 100.

(Note that I expect that you would only pay the financing cost on the value of your position that exceeds your cash deposit, so the financing rate may not be at all important for a totally unleveraged investor.)

Provider    | Financing Cost | Dividend Passthrough
Ayondo      | 2.5%           | 100%
CityIndex   | 2.5%           | I expect 90% given "net dividends", which matches another source
CMC Markets | 3%             | 100%
CoreSpreads | 2.5%           | 90%
ETX         | 3%             | 100%, but another (older) source says 90%
GKFX        | ??             | 100%?, see also here
IG          | 2.5%           | I expect 90% given "net dividends", but other sources say 85%
InterTrader | 2.5%           | 80%
LCG         | 2.5%           | Perhaps 80%
SpreadEx    | ??             | 100%

Without considering any other factors, Ayondo seems like the best deal, with full dividend passthrough and low financing costs.

Tax-efficient and financing-efficient UK individual investing

In my last post I gave an example of a situation where individual investors might want to borrow money for investment purposes. This post will give an overview of the methods that individuals can use to achieve that leverage efficiently. I will also cover tax considerations, some of which may be relevant even to unleveraged positions. Much of what I cover here will be UK specific, particularly when it comes to taxes.

Before we begin I should probably say that I'm not a tax accountant, a lawyer, a professional financial advisor, or anything else: I'm just a guy with access to Google and an interest in efficiency. You should probably speak to a professional before acting on any of the info in this article! I do work for an investment company, but I'm certainly not speaking for them here, and the information in this post has little-to-no relevance to their business. This is simply a summary of my understanding based on my research — I haven't actually tried most of these methods in practice. I would appreciate feedback if you notice any errors.

Secured Lending

A secured loan such as a mortgage or HELOC is the form of borrowing that is probably most familiar to people. Because these loans are backed by an asset (i.e. probably your house), you can get very good interest rates: I see 2 year fixed teaser rates as low as 1.2% AER, which is less than 1% above the overnight GBP LIBOR rate of 0.225%.

The obvious downside of this form of borrowing is that the amount you can borrow is limited by the amount of home equity you have.

Margin

Many stock brokerages offer margin accounts to their customers. A margin account is one where you are allowed to borrow to invest more than you have deposited into the account. The borrowed capital is secured by the equity in the account, which must meet a minimum value threshold ("margin requirement"), normally defined as some fraction of the total notional value of the account.

The broker I'm most familiar with is Interactive Brokers (IB). Roughly speaking, their rules allow a margin account to borrow up to 100% of the value of the equity in the account (i.e. achieve 2x leverage). The interest rates charged are fairly low: right now, for GBP borrowing they charge 1.5% above LIBOR. The rates get more competitive if you borrow more — loans above GBP 80,000 only attract a charge of 1% over LIBOR.

If you're using a broker's margin facility you obviously need to accept their schedule of trading costs too. Luckily, IB's fees are just as competitive as their margin charges, and start at around 6 GBP for an equity trade.

Futures

If you want to invest in an asset on which a liquid futures contract exists, this can be a very cheap way to achieve leverage. At the time of writing, a single FTSE 100 index future contract has face value of around 73,000 GBP. The returns on this contract will closely match the returns on investing that same face value in an index tracking ETF. However, unlike with an ETF investment, if you purchase one of these futures contracts, you don't need to invest those tens of thousands of pounds upfront — instead, you just need to deposit a certain amount of margin with your broker. Right now, IB only require about 6,500 GBP of deposited margin for a FTSE position held overnight, so you can potentially achieve 10x leverage without paying any financing costs.

If you use futures to make long-term investments in assets it is important to understand how the returns you earn on futures differ from those on the underlying asset. By an arbitrage argument you can show that the price of a futures contract should be equal to the forward price F:

F = S * e^((r - q) * T)

Where S is the spot price of the underlying instrument (in our example, this would be the FTSE 100 index), T is the time to expiry of the contract, r is the risk-free rate and q is the "cost of carry". The cost of carry is essentially a measure of the return you earn just by holding a position in the underlying. For an equity index future like the FTSE the cost of carry will be positive because by holding the components of the FTSE you actually earn dividends. For commodity futures the cost of carry may be negative because you will actually have to pay to store your oil or whatever.

If the spot price of the underlying stays unchanged, the daily return R on the contract will be approximately:

R ≈ (q - r) / 365

This illustrates the key difference between holding the underlying and the future. With the future, you don't just earn the returns of the underlying asset — the value of your contract also decays each day by an amount related to the difference between the cost of carry and the risk free rate. If this decay costs you money, the future is said to be in "contango", otherwise it is in "backwardation". You can somewhat offset the decay due to the risk free rate part of this by depositing the notional amount of your investment in an account that earns the risk free rate. However, even if you do this, you wouldn't expect the returns on your position to perfectly match those of the ETF because the forward price is determined based on the expected risk free rate and cost of carry. If interest rates are unexpectedly low, or dividend payouts are unexpectedly high, then your futures investment will underperform the equivalent ETF, so you are bearing some additional risk with the futures investment.
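To make the carry arithmetic concrete, here is a minimal sketch of how the contract's fair value decays from day to day when the spot stays put. The spot level, risk-free rate and dividend yield are purely illustrative numbers, not values from my backtest.

// Minimal sketch of the forward-price arithmetic above, assuming continuous
// compounding. The spot, rate and yield are illustrative, not real data.
public class FuturesCarry {
    // Forward/futures fair value: F = S * exp((r - q) * T)
    static double forwardPrice(double spot, double r, double q, double yearsToExpiry) {
        return spot * Math.exp((r - q) * yearsToExpiry);
    }

    public static void main(String[] args) {
        double spot = 7000;   // FTSE 100 level (illustrative)
        double r = 0.10;      // risk-free rate, e.g. early-1990s levels
        double q = 0.0383;    // dividend yield (the cost of carry for an equity index)

        double today = forwardPrice(spot, r, q, 0.25);                  // 3 months to expiry
        double tomorrow = forwardPrice(spot, r, q, 0.25 - 1.0 / 365);   // one day later, spot unchanged
        double dailyReturn = tomorrow / today - 1;

        // With r > q the contract trades above spot and decays towards it:
        // dailyReturn comes out at roughly (q - r) / 365, i.e. negative here.
        System.out.printf("fair value today: %.2f%n", today);
        System.out.printf("daily decay with spot unchanged: %.5f%%%n", dailyReturn * 100);
    }
}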

The other issue with futures contracts is that if you hold them for the long term you will need to deal with the fact that they have a limited lifespan. For example, the June 2017 FTSE 100 contract expires on the 16th of June. On or before that date you will need to sell your position in the June contract and buy an equivalent position in a later contract (e.g. the September 2017 one), or else you'll stop earning any returns from the 17th June onwards. This regular roll process incurs transaction costs which act as a drag on your investment. Thankfully, futures contracts are generally very cheap to trade: brokerage fees are low (IB only charge GBP 1.70 per contract to trade FTSE 100 futures) and most futures have extremely tight bid-ask spreads that are essentially negligible from the perspective of a long term investor.

One problem that makes index future investment particularly tricky for the individual investor is that these contracts generally have rather large notional value. The ~70k GBP value of one FTSE contract mentioned above is quite typical. So if you only have a small account, you can't really use futures unless you're willing to accept enormous leverage and all the risks that entails.

Contracts For Difference

Contracts For Difference (CFDs) are an instrument you can buy from a counterparty who specialises in them. Big UK names in this area are IG, CityIndex and CMC Markets, though IB also offers them. Like a futures contract, these products let you earn the returns on a big notional investment in an asset without putting down the full amount of that investment upfront — instead, you just need to deposit some margin. Depending on the provider and the reference asset, the margin requirements can be very low: CityIndex seems to only require 0.5% margin for a UK index investment, allowing for a frankly crazy 200x level of leverage. IB only require 5% margin.

Also like a future, CFDs are not available on just any underlying. It's easy to bet on equity indexes, FX, and big-cap stocks with CFDs. It is also reasonably common to find bond or commodity CFDs, but not all providers will offer a full range here (IB don't offer any). The other characteristic that CFDs share with futures is low trading costs: for index CFDs, providers commonly only charge a spread of 1 index point, i.e. about 0.01% for the FTSE 100. IB as usual offer a good price of only 0.005% per trade for their version of the FTSE 100 CFD.

Now we turn to the differences between CFDs and futures. For starters, unlike futures, CFDs do have financing costs, and they are chunky: typical rates from CityIndex and friends are 2%-2.5% above LIBOR, with IB again offering an unusually good deal by only charging a 1.5% spread. On the plus side, if you hold a position in an asset via a CFD you will receive dividends on that underlying, something that is not true if using a future.

Spread Bets

Spread bets are a bit of a UK specific way to lever yourself up. Many companies offering CFDs in the UK also offer spread bets. These are essentially CFDs in all but name, and will face almost identical trading and financing costs as compared to the equivalent CFD product. They are also generally available on exactly the same set of underlyings. The key difference between a CFD and a spread bet is that spread bets are treated as gambling rather than investing by the tax system, with the consequence that earnings via one of these instruments are subject to neither capital gains nor income tax!

I will return to the issue of tax later, as there is quite a lot to say on the topic.

Options

Options are a slightly more complex way to gain leverage than the above alternatives. The idea here is that if you want to make a leveraged long bet on e.g. the FTSE, you can achieve that by buying a long-dated call option with a strike price somewhere around the current level of the index. Because the strike price is high, you will be able to purchase the option relatively cheaply, but you can potentially receive a very high return. For example, let's say the FTSE is around 7000: you might be able to buy an option on 1x the index expiring in two years with a strike of 7000 for around 400 GBP. If the index is up 10% to 7700 at that time, then you will earn a profit of 700-400 = 300 GBP i.e. a 75% return on your investment, so you effectively have 7.5x leverage.
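The same arithmetic as a tiny sketch (the 400 GBP premium and the index levels are the illustrative numbers from the example above, not live quotes):

// Sketch of the option-leverage arithmetic above, using the made-up numbers from the text.
public class OptionLeverage {
    public static void main(String[] args) {
        double strike = 7000, premium = 400;
        double indexStart = 7000, indexEnd = 7700;              // index up 10%

        double payoff = Math.max(indexEnd - strike, 0);         // 700 GBP at expiry
        double optionReturn = (payoff - premium) / premium;     // (700 - 400) / 400 = 75%
        double indexReturn = indexEnd / indexStart - 1;         // 10%

        System.out.printf("option return: %.0f%%, effective leverage: %.1fx%n",
                optionReturn * 100, optionReturn / indexReturn); // about 7.5x
    }
}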

Like futures or CFDs, options are only available on certain underlying assets. In the US, you can buy options on equity indexes with expiries up to three years in the future: these are known as LEAPS. Exchange-traded options with long expiries are also available in other countries: for example, the ICE lists FTSE 100 options expiring a couple of years ahead.

Also like with futures, investments via options won't earn any dividends, but neither will they attract financing charges. Individual investors may have difficulty with the fact that these options have notional values in excess of 50,000 GBP (in the US, S&P 500 mini options with smaller notionals are available, but they only list about 1 year out).

There are two other ways of purchasing options that may be more suitable for the UK small investor as they let you take a position in a smaller size. Firstly, companies offering spread bets tend to also sell options. I haven't looked into this, but given how expensive their spread bet financing is, I would not be surprised if their options were substantially overpriced. Secondly, you can purchase a "covered warrant", which is essentially an exchange-listed option targeted at individual investors. Societe Generale offers them via the London Stock Exchange: i.e. these options can be purchased just like a regular stock.

I did have a brief look into whether covered warrants offered good value for money. Specifically, I looked into the cost of SE91, a call option on the FTSE 100 with strike price 8000 and expiry December 2018 (the longest dated option available at the time of writing). When I looked at it, the warrant was quoting at around 0.25 GBP with a spread of 0.002 GBP (1%):

The equivalent exchange-traded option had a mid price of about 156 GBP on a spread of about 50 GBP (32%):

The exchange-traded option is for a notional exposure 1000x larger than the warrant, which explains the order-of-magnitude difference between the prices. Taking this into account, the warrant looks pretty expensive, with even the bid price of 0.2488 GBP being higher than the equivalent exchange-traded option ask of 0.1805 GBP.

We can quantify exactly how much more expensive the warrant is by using the Black-Scholes option valuation model. Given the level and volatility of the FTSE at the time, the model implies a fair market value for the option of 153 GBP, which lies within the bid-ask spread we actually observe on the exchange:

SocGen are trying to charge us about 250 GBP for equivalent exposure. To put this in the same terms as the financing costs for the other instruments (i.e. as a spread to LIBOR), we can tweak the borrow cost assumption in this model until we get the right price out:

So it looks like the SocGen options are effectively offering leverage at a cost of LIBOR plus 2.75%, which is not a good deal. Trading them might still make sense so long as you are intending to hold them short-term, because they have much narrower bid-ask spreads than the exchange-traded equivalent, but in this case you'd probably end up better off buying the options OTC from a spread betting company.
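For the curious, here's a rough sketch of one way to back out an implied financing spread like this: a textbook Black-Scholes call pricer (with a continuous dividend yield), plus a bisection on a spread added to the interest rate until the model reproduces the quoted price. The volatility and risk-free rate below are assumptions I've plugged in for illustration, so the spread it prints won't exactly match the 2.75% figure above.

// Sketch: back out an implied financing spread from a quoted option price.
// The vol and risk-free rate are assumptions; spot 7000, strike 8000, 2 years
// to expiry and the ~250 GBP quote per 1x exposure are taken from the text.
public class ImpliedFinancingSpread {
    // Zelen & Severo approximation to the standard normal CDF.
    static double normCdf(double x) {
        if (x < 0) return 1 - normCdf(-x);
        double t = 1 / (1 + 0.2316419 * x);
        double poly = t * (0.319381530 + t * (-0.356563782
                + t * (1.781477937 + t * (-1.821255978 + t * 1.330274429))));
        return 1 - Math.exp(-x * x / 2) / Math.sqrt(2 * Math.PI) * poly;
    }

    // Black-Scholes call on an index paying a continuous dividend yield q.
    static double call(double s, double k, double t, double r, double q, double vol) {
        double d1 = (Math.log(s / k) + (r - q + vol * vol / 2) * t) / (vol * Math.sqrt(t));
        double d2 = d1 - vol * Math.sqrt(t);
        return s * Math.exp(-q * t) * normCdf(d1) - k * Math.exp(-r * t) * normCdf(d2);
    }

    public static void main(String[] args) {
        double s = 7000, k = 8000, t = 2.0, r = 0.005, q = 0.038, vol = 0.15; // vol and r are guesses
        double quoted = 250; // what the warrant effectively charges per 1x index exposure

        // Call prices increase with the rate, so bisect on a spread added to r
        // until the model matches the quoted price.
        double lo = 0, hi = 0.10;
        for (int i = 0; i < 60; i++) {
            double mid = (lo + hi) / 2;
            if (call(s, k, t, r + mid, q, vol) < quoted) lo = mid; else hi = mid;
        }
        System.out.printf("fair value at r: %.0f GBP, implied spread: %.2f%%%n",
                call(s, k, t, r, q, vol), (lo + hi) / 2 * 100);
    }
}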

I won't consider options further as frankly speaking I find them harder to analyse than the alternatives.

UK Taxes On Investments

There are four forms of tax that are relevant to investors operating in the UK:

  • Stamp duty: payable upon purchasing an asset
  • Dividend tax: payable upon recieving dividends from an asset
  • Income tax: payable upon recieving non-dividend income from an asset
  • Capital gains tax: payable upon sale of an asset

Stamp duty is the simplest of the four. It's a flat 0.5% charge upon the purchase of shares in individual companies. It is not payable on the purchase of ETFs, futures contracts, or spread bets, so it's mostly not relevant here.

The amount of dividend tax you pay depends on your total income in a year, and can range from 0% (if your dividends amount to less than the current tax-free allowance of 5,000 GBP) to 38.1% (if you pay "additional rate" tax of 45% on income above 150,000 GBP).

Income tax is payable on income from an asset that is not considered to be a dividend. Basically, if the asset is a bond, or a fund more than 60% invested in bonds, you will have to pay income tax instead of dividend tax. Income tax can range from 0% (if you earn less than the current Personal Allowance of 11,500 GBP) to an effective 60% (if the income pushes you into the 100,000 GBP to 123,000 GBP band, where the Personal Allowance is withdrawn at 1 GBP for every 2 GBP of extra income, so each additional 1 GBP is taxed at 40% and also drags another 0.50 GBP into the 40% band). For more info see HMRC and this discussion of the marginal tax rate. It's not 100% clear to me what the tax treatment is on the final repayment of principal made by a bond issuer. I suspect the final repayment is treated as a capital gain, and for UK government debt at least it seems that no capital gains tax is payable.

Capital gains tax is payable on realised gains in excess of the annual 11,300 GBP threshold. Higher rate taxpayers (i.e. those earning above 45,000 GBP) will pay 20% on anything above this. Those who don't pay higher rate tax may only pay 10% on some amount of their gains.

Capital gains tax is perhaps the trickiest of the taxes. Firstly, you need to know that it's calculated based on your net realised gain during a year. So if you make a gain by selling some asset, you can avoid paying tax on that by selling another asset on which you have booked a loss. If you realise a net loss during a year, that can be carried forward indefinitely to be set against future capital gains.

Secondly, note that the tax free amount of 11,300 GBP is a "use it or lose it" proposition: if you don't have 11,300 GBP of gains to report in a year then you won't be able to make use of it, and it will vanish forever. This ends up being another reason to invest in a diversified portfolio of assets: if you are diversified then you're likely to have some asset that you can liquidate during a tax year to take advantage of the allowance (just be careful that you don't fall foul of the "bed and breakfasting" rules — see this guide to realizing capital gains for more info).
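As a toy illustration of those two rules (not tax advice, and the disposals below are made up):

// Toy illustration of the CGT mechanics described above: gains and losses
// realised in the same year are netted, the 11,300 GBP annual allowance is
// applied, and a higher rate taxpayer pays 20% on the remainder.
public class CapitalGainsSketch {
    static final double ALLOWANCE = 11_300, HIGHER_RATE = 0.20;

    static double cgtDue(double... realisedGainsAndLosses) {
        double net = 0;
        for (double g : realisedGainsAndLosses) net += g; // losses are negative
        return Math.max(net - ALLOWANCE, 0) * HIGHER_RATE;
    }

    public static void main(String[] args) {
        // 30k gain on one holding, 5k loss realised on another in the same year:
        System.out.println(cgtDue(30_000, -5_000)); // (25,000 - 11,300) * 20% = 2,740
        // Gains under the allowance attract no tax, but the unused allowance is lost:
        System.out.println(cgtDue(8_000));          // 0
    }
}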

One general theme of all this is that you generally end up paying less tax on capital appreciation than on dividends.

Tax Efficient Investing

For concreteness, let's say we are interested in making a (either leveraged or unleveraged) investment in equity indexes. How do these taxes apply to the investing methods discussed above, i.e. ETFs, futures, CFDs and spread bets? As already mentioned, none of these assets attract stamp duty. But what about dividend and capital gains tax?

ETFs are relatively straightforward: you pay dividend tax on the distributions, and capital gains tax upon selling an ETF that has increased in value. This may mean that it is more tax efficient to purchase an ETF that reinvests dividends for you (like CUKX) rather than one that distributes them (such as ISF).

Futures contracts are straightforward: there are no dividends, so you simply pay capital gains tax. One potential problem is that you won't have much control over when you realise gains for capital gains purposes because you'll probably be rolling the contracts quarterly anyway. Furthermore, if you have taken that portion of your equity that does not go towards the margin requirement, and invested it in an interest-bearing account, then you will have to pay income tax on any interest income. For tax purposes it might be most efficient to invest in a zero-coupon government bond which will not attract either income tax or capital gains tax, but this might be more trouble than it is worth.

The tax treatment of CFDs is interesting. All cashflows due to the CFD are considered to be capital gains by HMRC — what's surprising is that this includes both the interest you pay to support the position, and any payments you receive as a result of the underlying making a dividend payment. This makes CFDs rather attractive: you can end up paying capital gains tax rates on dividend income, and benefit from being able to use your interest payments to reduce your capital gains liability, reducing the effective cost of margin by up to 20%.

Finally we come to spread bets: as mentioned earlier, bets are subject to different rules, so you don't pay any tax at all on these. The flip side of course is that if you make a loss, you aren't able to offset it against capital gains elsewhere. It's not totally clear to me whether this treatment applies to payments made on the spread bet as a result of dividend adjustment, but it looks like it may do. This is why some spread betting providers (e.g. CoreSpreads) only pay out 80% or 90% of the value of any dividend to the punter. One last thing to note is that the spread betting providers themselves pay a betting duty of 3% on the difference between punters' losses and profits: this will of course be passed on to you in the form of higher fees.

Summary

This is a lot of info to take in, so I've tried to summarize the most important points below. Trading costs assume a 100,000 GBP investment in the FTSE 100.

                           | Secured Lending | Margin | Futures | CFDs | Spread Bets
Available underlying       | Anything | Anything | Equity indexes, debt, commodities, FX, certain equities (though liquidity may be limited) | Equity indexes, debt (sometimes), commodities (sometimes), FX, certain equities | As for CFDs
Approximate max leverage   | 4x (assuming 75% LTV) | 2x | 10x | 200x | 200x
Financing cost above LIBOR | 1% | 1% to 1.5% | 0% | 1.5% to 2.5% (20% less if treatable as capital loss) | 2% to 2.5%
Other holding costs        | 0.09% (Vanguard's VUKE ongoing charge) | 0.09% (VUKE ongoing charge) | 0.014% (quarterly roll costs) | 0% | 0%
Trading costs (FTSE 100)   | 0.09% (VUKE 0.06% bid-ask spread, 0.03% commission) | 0.09% (as for the ETF) | 0.0017% | 0.005% | 0.01%
Dividend treatment         | Paid in full by ETF provider | Paid in full by ETF provider | None, but expected dividends become a positive carry on holding the contract | Paid in full | Generally paid in full but some providers may withhold 10%-20%

A return-boosting idea

As a final note, here's something I just noticed and haven't seen mentioned anywhere else. If investing via a CFD, spread bet or futures contract, you only need to deposit margin with your broker. If you're only using 1x leverage, this means that 90% of the notional value of your investment is free for use elsewhere, so long as you are able to move it back to the margin account if needed.

What's interesting is that as an individual investor it's straightforward to find bank accounts that pay more than the risk free rate — even though these accounts do enjoy full backing from a sovereign government, and so are risk free in practice. For example, right now I can see an easy-access (aka demand deposit) account from RCI Bank accruing interest daily and paying an AER of 1.1% on balances up to 1 million pounds — i.e. about 0.9% above LIBOR. This is higher than the financing cost of a futures position (though not a CFD or spread bet), so it seems to me that there is reason to believe that the returns on a futures investment will actually beat out the equivalent ETF, so long as you do invest the "spare" equity in this way.

The case for leverage in personal investing

The standard advice for personal investing that I see all around the web is to put your money into one or more low cost equity index tracking funds. Commentators also sometimes recommend an allocation to bonds (e.g. a 60/40 split between stocks and bonds), though this advice seems to become less popular with every passing month of the bull market.

However, the more I learn about investment, the more I come to think that this answer is suboptimal. To see why, let's consider a simplified world where we have exactly two assets to which we can allocate our wealth: stocks and bonds.

Portfolio Theory

The historical evidence (see e.g. the excellent book Expected Returns) is that the returns on bonds and equities are uncorrelated, with bonds having lower volatility (aka standard deviation) than equities — US treasuries experienced an annualized volatility of 4.7% a year between 1990 and 2009, while US equities had a volatility of 15.5% over the same period.

Given their lower volatility, bonds are clearly less risky than equities. However, you would hope that if you invest in equities rather than bonds, then your willingness to accept the inherently higher risks is compensated for by a higher expected return. This idea can be captured mathematically as the Sharpe ratio, which measures the reward you receive per unit of risk taken. Specifically, the Sharpe ratio S is equal to the ratio between the expected "excess return" of the investment, and the standard deviation of those returns. The excess return is defined as the amount of the expected return R above a risk-free rate Rf (e.g. the rate of return you can get by lending overnight in the money markets). Putting it all together we get this formula for S, where σ is the standard deviation of the excess returns:

S = (R - Rf) / σ

Thanks Wikipedia :)

All other things being equal, you probably want to invest in assets with as high a Sharpe ratio as possible.

It can be tricky to figure out what the Sharpe ratio is for investments of interest, but history suggests that stocks and bonds both have similar Sharpe ratios of roughly 0.3. What's more, the correlation between their returns is close to 0. This last fact is important because portfolio theory shows that you can form an investment with a high Sharpe ratio by holding a diversified portfolio of two or more uncorrelated assets with low Sharpe ratios. Assuming that historical returns, volatilities and correlations are good guides to the future, the portfolio of stocks and bonds with the highest Sharpe is one that holds roughly $3 of bonds for every $1 of stocks: this ratio comes about because the volatility of stocks is approximately 3 times that of bonds. This optimal portfolio has a Sharpe ratio of about 0.42, i.e. sqrt(2) ≈ 1.41 times that of either asset class by itself.
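As a minimal illustration of the optimisation (not the code behind my tool), the following sketch grid-searches the bond/stock split for the highest Sharpe ratio, assuming both assets have a Sharpe ratio of 0.3, zero correlation, and the volatilities quoted above; the 0.1% grid step is arbitrary.

// Sketch of two-asset mean-variance optimisation, assuming equal Sharpe
// ratios of 0.3, zero correlation and the historical volatilities above.
public class MaxSharpe {
    public static void main(String[] args) {
        double bondVol = 0.047, stockVol = 0.155, assetSharpe = 0.3;
        double bondExcess = assetSharpe * bondVol, stockExcess = assetSharpe * stockVol;

        double bestW = 0, bestSharpe = 0;
        for (double w = 0; w <= 1.0; w += 0.001) {               // w = weight in bonds
            double excess = w * bondExcess + (1 - w) * stockExcess;
            double vol = Math.sqrt(Math.pow(w * bondVol, 2)          // correlation assumed 0,
                                 + Math.pow((1 - w) * stockVol, 2)); // so no cross term
            double sharpe = excess / vol;
            if (sharpe > bestSharpe) { bestSharpe = sharpe; bestW = w; }
        }
        // Prints roughly 77% bonds / 23% stocks with a Sharpe of about 0.42,
        // close to the 3:1 bonds-to-stocks split described in the text.
        System.out.printf("bonds: %.0f%%, stocks: %.0f%%, Sharpe: %.2f%n",
                bestW * 100, (1 - bestW) * 100, bestSharpe);
    }
}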

You can get a feel for how this optimization process works by playing with my online portfolio theory tool.

Leverage

What's striking about this optimal portfolio is that it's very different from the normal advice: it puts 75% of capital into bonds, much more than even the most conservative conventional advice of a 40% allocation. The obvious objection to the bond-heavy optimal portfolio is of course that it will have very low expected return compared to one with a bigger weighting on equities. This is absolutely true: we expect the volatility of the optimal portfolio to be around 0.75*(bond volatility) + 0.25*(stock volatility) = 0.75*4.7% + 0.25*15.5% = 7.4%. Because this is roughly half the volatility of equities by themselves, we'd expect the excess return on the portfolio to therefore only be about (0.42/0.3)/2 = 1.41/2 = 0.7 times that of a pure-equity allocation, which definitely sounds like bad news for the optimal portfolio.

However, this is a solvable problem — to recover the high expected returns we desire, we simply have to borrow money to invest greater notional amounts into the portfolio. Using 2x leverage (i.e. borrowing so as to invest $2 for every $1 of capital we actually control) would scale the volatility of our portfolio up from 7.4% to an equity-like 14.8%. If we assume that we could borrow at the risk-free rate, then because our portfolio has a Sharpe ratio higher than that of plain equities, the expected excess return in this scenario would be 41% higher than that of equities alone. So we earn better returns than with a pure-equity play even though we are running similar risks.

Everyone has probably been told at some point that diversification is good, but the way it is usually explained is by saying that diversification reduces your risk, which sounds worthy but sort of boring ☺. When you realise that this risk reduction means that you free up some "risk budget" which you can use to achieve extra returns via leverage, diversification starts sounding more exciting!

Of course, in reality we are unlikely to be able to borrow at the risk-free rate, but there are ways to borrow that are only slightly more expensive than this — depending on the currency, companies and investment funds regularly borrow at rates as low as 0.5% above the risk-free rate. Even as an individual investor, there are ways in which you can borrow quite cost-effectively — this is a topic I will cover in a future post. The cost of leverage is of course a constraint that we should bear in mind, though: the fact that we face borrowing costs rules out activities such as levering up short term bonds (e.g. US treasury bills), which have very low volatility (because they take on very little interest rate risk).

Risk Parity

The approach to investing outlined above can be roughly summarized as:

  1. Assume that all investible assets have the same Sharpe ratio
  2. Therefore decide to allocate to them in an amount inversely proportional to their volatility (following the advice of portfolio theory and mean-variance optimization for Sharpe ratio maximization)
  3. Leverage up the resulting portfolio to achieve a particular desired level of risk

This method is also known as "risk parity". Famously, it's the strategy used by Bridgewater's All Weather hedge fund, which has returned a Sharpe of roughly 0.5 since inception in 1996. (All Weather invests in asset classes other than stocks and bonds, so we would expect it to have a higher Sharpe than our earlier prediction of 0.42, simply due to the extra diversification.)

What's particularly interesting about risk parity is that it's not actually immediately obvious that taking bigger risks with your money leads to a sufficient extra level of return to compensate you for those risks. For example, take the case of stocks and bonds. From 1990 to 2009, US equities returned a (geometric) mean of 8.5% per year while treasuries returned 6.8%: so equities did earn a higher return, but one that doesn't seem commensurate with the 3 times higher volatility experienced. Furthermore, global equities (which had similar volatility to US equities) actually only returned 5.9% i.e. considerably less than US bonds!

The fact that risk-taking is under-compensated is actually a well-known anomaly: an excellent paper on the subject is Betting Against Beta which suggests the reason may be because many investors are either unwilling or unable to use leverage. Whatever the cause, it's good news for risk parity, because this means that the low-risk assets you are levering up actually have a higher Sharpe ratio than the high-risk assets that you (relatively speaking) disprefer, so your portfolio's Sharpe ratio will be even better than you would naively expect. This subject is explored further in this readable paper on risk parity from AQR.

Faster ordered maps for Java

Sorted maps are useful alternatives to standard unordered hashmaps. Not only do they tend to make your programs more deterministic, they also make some kinds of queries very efficient. For example, one thing we frequently want to do at work is find the most recent observation of a sparse timeseries as of a particular time. If the series is represented as an ordered mapping from time to value, then this question is easily answered in log time by a bisection on the mapping.
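For example, with the JDK's built-in sorted map this as-of query is just a floorEntry call:

import java.util.TreeMap;

// Sketch of the "latest observation as of time t" query using the JDK's
// sorted map: floorEntry does the log-time bisection for us.
public class AsOfLookup {
    public static void main(String[] args) {
        TreeMap<Long, Double> series = new TreeMap<>(); // time (e.g. epoch millis) -> value
        series.put(1_000L, 1.10);
        series.put(5_000L, 1.25);
        series.put(9_000L, 1.17);

        // Most recent observation at or before t = 7,000: the 5,000 -> 1.25 entry.
        System.out.println(series.floorEntry(7_000L)); // prints 5000=1.25
    }
}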

A disadvantage to using ordered maps is that they tend to have higher constant factors than simple unordered ones. Java is no exception to this rule: as we will see below, the standard HashMap outperforms the ordered TreeMap equivalent by two or three times.

My new open source Java library, btreemap, is an attempt to ameliorate these constant factors. As the name suggests, it is based on B-tree technology rather than the red-black trees that are used in TreeMap. These "mechanically sympathetic" balanced tree data structures improve cache locality by using tree nodes with high fanout.

My library offers both boxed (BTreeMap) and type-specialized (e.g. IntIntBTreeMap) unboxed variants of the core data structure. Benchmarking it against some competing sorted collections (including fastutil and MapDB 1) reveals it beats them by a good margin, though it still carries a performance penalty versus the simple HashMap:

So switching from TreeMap to IntIntBTreeMap may be worth a 2x performance increase. Nice!

These benchmarks:

  • Use JMH to ensure e.g. that the JVM is warmed up
  • Were run on an int to int map with 100k keys
  • Do not include lowerKey numbers for HashMap or Int2IntRBTreeMap, which do not support the operation
  • Are in memory-only and do not make use of the persistence features of MapDB

Performance benefits depend on the amount of data you are working with. Small working sets may fit into level 1 or 2 cache so will pay a relatively small penalty for a lack of cache locality. These graphs show how throughput depends on the number of keys there are in the working set, where keys are distributed between a varying number of fixed size (100 key) maps. B-trees do not start to show a performance advantage until we reach 10k keys or so:

There are a few interesting things about the implementation:

  • The obvious way to represent a tree node is as an object with two fields: a fixed size array of children, and a size (number of children that are present). However, in Java this means taking two indirections when you want to access a child (you need to first load the address of the array from the object, then load the child at an offset from that base address). Instead, I define tree nodes as a class with one field for each possible child, and then use sun.misc.Unsafe for fast random access to these fields. This change made get about 10% faster in my testing.
  • The internal nodes store the links to their children in sorted order. Therefore, you'd expect binary search to be a good way to find the child associated with a particular key. In practice I found that linear search was at least 20% faster, probably due to branch prediction improvements (see the sketch after this list).
  • To avoid copy-pasting the code for each unboxed version of the data structure, I had to come up with a horrible templating language partly based on JTwig. Value types can't come fast enough!
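As a rough illustration of the search trade-off from the second bullet, here's what linear search over a node's small, sorted key array looks like next to the JDK's binary search. This is just a sketch, not code from the library.

import java.util.Arrays;

// Sketch of the two ways to locate a key within one node's sorted key array.
// Both return the insertion point in the same encoding as Arrays.binarySearch.
public class NodeSearch {
    static int linearSearch(int[] keys, int size, int key) {
        for (int i = 0; i < size; i++) {
            if (keys[i] == key) return i;          // found
            if (keys[i] > key) return -(i + 1);    // would be inserted before index i
        }
        return -(size + 1);
    }

    public static void main(String[] args) {
        int[] keys = {3, 8, 15, 22, 40, 0, 0, 0};  // node with capacity 8, 5 keys in use
        System.out.println(linearSearch(keys, 5, 15));           // 2
        System.out.println(Arrays.binarySearch(keys, 0, 5, 16)); // -4 (insertion point 3)
    }
}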

4 things you didn't know you could do with Java

Java is often described as a simple programming language. While this is arguably true, it has still retained the ability to surprise me after using it full time for years.

This blog post describes four features that are obscure enough that they've surprised either me or one of my seasoned colleagues.

Abstract over thrown exception type

Yes, throws clauses may contain type variables. This means that this sort of thing is admissible:

interface ExceptionalSupplier<T, E extends Throwable> {
    T supply() throws E;
}
 
class FileStreamSupplier
  implements ExceptionalSupplier<FileInputStream, IOException> {
    @Override
    public FileInputStream supply() throws IOException {
        return new FileInputStream(new File("foo.txt"));
    }
}

As the example suggests, this pattern is frequently useful if you want to do functional abstraction without having to rethrow exceptions as RuntimeException everywhere.
The place people are most likely to encounter this for the first time is when looking at the Throwables utilities in Guava.

Intersection types

This refers to the ability to write down a type that is the set intersection of two other types — so the type A & B is only inhabited by values that are instances of both A and B. Java lets you use this sort of type within generic bounds, but not anywhere else:

class IntersectionType {
    public <T extends List & Iterator> void consume(T weirdThing) {
        weirdThing.iterator().next();
        weirdThing.next();
    }
}

One place this comes up in practice is where you need to know that something is both of some useful type and also, for resource management purposes, Closeable.

Constructor type parameters

Constructors are somewhat analogous to static methods. Did you know that just like static methods, constructors can take type arguments? Observe:

class ConstructorTyArgs {
    private final List<String> strings;
 
    <T> ConstructorTyArgs(List<T> xs, Function<T, String> f) {
        strings = xs.stream().map(f).collect(Collectors.toList());
    }
 
    public static void useSite() {
        new <Integer> ConstructorTyArgs(
            Arrays.asList(1, 2, 3),
            x -> x.toString() + "!");
    }
}

This feature is useless enough that I've never felt any desire to do this. In fact, I only noticed it when I was reading a formal grammar for Java.

Note that the type parameters you write here are not the same as the type parameters of the enclosing class (if any). This means that unfortunately there is no way to write the static create method below as a constructor, since it requires refining the bounds on the class type parameters:

class Comparablish<T> {
    private final T value;
    private final Comparator<T> comparator;
 
    public Comparablish(T value, Comparator<T> comparator) {
        this.value = value;
        this.comparator = comparator;
    }
    
    static <T extends Comparable<T>> Comparablish<T> create(T value) {
        return new Comparablish<T>(value, Comparator.naturalOrder());
    }
}

Inline classes

I'm sure everyone is aware that you can declare anonymous inner classes within a method body like this:

class AnonymousInnerClass {
    public int method() {
        return new Object() {
            int foo() { return 1; }
        }.foo();
    }
}

But did you know that you can also declare named inner classes too?

class InlineClass {
    public int method() {
        class MyIterator implements Iterator<Integer> {
            private int i = 0;
            @Override public Integer next() { return i++; }
            @Override public boolean hasNext() { return false; }
        }
 
        return new MyIterator().next() + new MyIterator().next();
    }
}

The inner class (which may not be static) can close over local variables and is subject to the same scoping rules as a variable. In particular, this means that a named inner class can use itself recursively from within its definition, but you can't declare a mutually recursive group of multiple classes like this:

public int rec() {
    class A {
        public int f() { return new B().g(); }
    }
    class B {
        public int g() { return new A().f(); }
    }
 
    return new B().g();
}

A Cambridge Computer Science degree summarised in 58 crib sheets


From 2005 to 2008 I was an undergraduate studying Computer Science at Cambridge.
My method of preparing for the exams was to summarise each lecture course into just a few sides of A4, which I'd then commit to memory in their entirety.

To make them shorter and hence easier to memorise, I'd omit all but truly essential information from each crib sheet. For example, I wouldn't include any formula if it was easily derivable from first principles, and I certainly didn't waste any words on conceptual explanations. As a consequence, these sheets certainly aren't the best choice for those learning a subject for the first time, but they might come in handy as a refresher for those with some familiarity with the subject.

So without further ado, here is my summary of a complete Cambridge Computer Science degree in 58 crib sheets:

Advanced System Topics pdf lyx
Algorithms pdf doc
Algorithms II pdf doc
Artificial Intelligence I pdf doc
Bioinformatics pdf lyx
Business Studies pdf lyx
C And C++ pdf doc
Comparative Architectures pdf lyx
Compiler Construction pdf doc
Computation Theory pdf doc
Computer Design pdf doc
Computer Graphics pdf doc
Computer Systems Modelling pdf lyx
Computer Vision pdf lyx
Concepts In Programming Languages pdf doc
Concurrent Systems And Applications pdf doc
Databases pdf doc
Denotational Semantics pdf lyx
Digital Communications pdf doc
Digital Communications II pdf lyx
Digital Electronics pdf doc
Digital Signal Processing pdf lyx
Discrete Mathematics I pdf doc
Discrete Mathematics II pdf doc
Distributed Systems pdf lyx
ECAD pdf doc
Economics And Law pdf doc
Floating Point Computation pdf doc
Foundations Of Computer Science pdf doc
Foundations Of Functional Programming pdf doc
Human Computer Interaction pdf lyx
Information Retrieval pdf lyx
Information Theory And Coding pdf lyx
Introduction To Security pdf doc
Logic And Proof pdf doc
Mathematical Methods For CS pdf doc
Mathematics I pdf doc
Mathematics II pdf doc
Mathematics III pdf doc
Mechanics And Relativity pdf doc
Natural Language Processing pdf lyx
Operating Systems pdf doc
Optimising Compilers pdf lyx
Oscillations And Waves pdf doc
Probability pdf doc
Professional Practice And Ethics pdf doc
Programming In Java pdf doc
Prolog pdf doc
Quantum And Statistical Mechanics pdf doc
Regular Languages And Finite Automata pdf doc
Semantics Of Programming Languages pdf doc
Software Design pdf doc
Software Engineering pdf doc
Specification And Verification I pdf lyx
Specification And Verification II pdf lyx
Topics In Concurrency pdf lyx
Types pdf lyx
VLSI Design pdf doc

Because I only created crib sheets for subjects that I thought I might potentially choose to answer questions on during the exam, this list does not cover every available course (though it's probably at least 70% of them). The other thing to note is that Cambridge requires Computer Science students to take some courses in natural science during their first year: the crib sheets that I've included (e.g. "Mechanics And Relativity" and "Oscillations And Waves") reflect my specialization in physics.

Datastructures for external memory


Something I recently became interested in is map data structures for external memory — i.e. ways of storing indexed data that are optimized for storage on disk.

In a typical analysis of algorithm time complexity, you assume it takes constant time to access memory or perform a basic CPU operation such as addition. This is of course not wholly accurate: in particular, cache effects mean that memory access time varies wildly depending on what exact address you are querying. In a system where your algorithm may access external memory, this becomes even more true — a CPU that takes 1ns to perform an addition may easily find itself waiting 5ms (i.e. 5 million ns) for a read from a spinning disk to complete.

An alternative model of complexity is the Disk Access Machine (DAM). In this model, reading one block of memory (of fixed size B) has constant time cost, and all other operations are free. Just like its conventional cousin this is clearly a simplification of reality, but it's one that lets us succinctly quantify the disk usage of various data structures.

At the time of writing, this is the performance we can expect from the storage hierarchy:


Category          | Representative device                   | Sequential Read Bandwidth | Sequential Write Bandwidth | 4KB Read IOPS | 4KB Write IOPS
Mechanical disk   | Western Digital Black WD4001FAEX (4TB)  | 130MB/s                   | 130MB/s                    | 110           | 150
SATA-attached SSD | Samsung 850 Pro (1TB)                   | 550MB/s                   | 520MB/s                    | 10,000        | 36,000
PCIe-attached SSD | Intel 750 (1.2TB)                       | 2,400MB/s                 | 1,200MB/s                  | 440,000       | 290,000
Main memory       | Skylake @ 3200MHz                       | 42,000MB/s                | 48,000MB/s                 | 16,100,000 (62ns/operation) |


(In the above table, all IOPS figures are reported assuming a queue depth of 1, so will tend to be worst case numbers for the SSDs.)

Observe that the implied bandwidth of random reads from a mechanical disk is (110 * 4KB/s) i.e. 440KB/s — approximately 300 times slower than the sequential read case. In contrast, random read bandwidth from a PCIe-attached SSD is (440,000 * 4KB/s) = 1.76GB/s i.e. only about 1.4 times slower than the sequential case. So you still pay a penalty for random access even on SSDs, but it's much lower than the equivalent cost on spinning disks.

One way to think about the IOPS numbers above is to break them down into that part of the IOPS that we can attribute to the time necessary to transfer the 4KB block (i.e. 4KB/Bandwidth) and whatever is left, which we can call the seek time (i.e. (1/IOPS) - (4KB/Bandwidth)):

Category          | Implied Seek Time From Read | Implied Seek Time From Write | Mean Implied Seek Time
Mechanical Disk   | 9.06ms                      | 6.63ms                       | 7.85ms
SATA-attached SSD | 92.8us                      | 20.2us                       | 56.5us
PCIe-attached SSD | 645ns                       | 193ns                        | 419ns

If we are using the DAM to model programs running on top of one of these storage mechanisms, which block size B should we choose such that algorithm costs derived from the DAM are a good guide to real-world time costs? Let's say that our DAM cost for some algorithm is N block reads. Consider two scenarios:

  • If these reads are all contiguous, then the true time cost (in seconds) of the reads will be N*(B/Bandwidth) + Seek Time
  • If they are all random, then the true time cost is N*((B/Bandwidth) + Seek Time), i.e. (N - 1)*Seek Time more than the sequential case

The fact that the same DAM cost can correspond to two very different true time costs suggests that we should try to choose a block size that minimises the difference between the two possible true costs. With this in mind, a sensible choice is to set B equal to the product of the seek time and the bandwidth of the device. If we do this, then in the random-access scenario (where the DAM most underestimates the cost):

  • Realized IOPS will be at least half of peak IOPS for the storage device.
  • Realized bandwidth will be at least half of peak bandwidth for the storage device.

If we choose B smaller than the bandwidth/seek time product then we'll get IOPS closer to device maximum, but only at the cost of worse bandwidth. Likewise, larger blocks than this will reduce IOPS but boost bandwidth. The proposed choice of B penalises both IOPS and bandwidth equally. Applying this idea to the storage devices above:

Category          | Implied Block Size From Read | Implied Block Size From Write | Mean Implied Block Size
Mechanical Disk   | 1210KB                       | 883KB                         | 1040KB
SATA-attached SSD | 52.3KB                       | 10.8KB                        | 31.6KB
PCIe-attached SSD | 1.59KB                       | 243B                          | 933B
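Here's the arithmetic behind these two tables, applied to the mechanical disk's read figures:

// The seek-time and block-size arithmetic above, for the mechanical disk's
// read figures (130MB/s sequential, 110 IOPS at 4KB).
public class DamBlockSize {
    public static void main(String[] args) {
        double bandwidth = 130 * 1024 * 1024.0;      // bytes/second, treating 1MB as 2^20 bytes
        double iops = 110;                           // 4KB random reads per second
        double transfer = 4096 / bandwidth;          // time to move one 4KB block
        double seek = 1 / iops - transfer;           // whatever is left is "seek"
        double blockSize = seek * bandwidth;         // proposed B = seek time * bandwidth

        // Prints roughly 9.06ms and ~1200KB, in line with the tables above.
        System.out.printf("seek: %.2f ms, implied block size: %.0f KB%n",
                seek * 1e3, blockSize / 1024);
    }
}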

On SSDs the smallest writable/readable unit of storage is the page. On current generation devices, a page tends to be around 8KB in size. It's gratifying to see that this is within an order of magnitude of our SSD block size estimates here.

Interestingly, the suggested block sizes for mechanical disks are much larger than the typical block sizes used in operating systems and databases, where 4KB virtual memory/database pages are common (and certainly much larger than the 512B sector size of most spinning disks). I am of course not the first to observe that typical database page sizes appear to be far too small.

Applying the DAM

Now we've decided how we can apply the DAM to estimate disk costs that will translate (at least roughly) to real-world costs, we can actually apply the model to the analysis of some algorithms. Before we begin, some interesting features of the DAM:

  • Binary search is not optimal. Binary-searching N items takes O(log (N/B)) block reads, but O(log_B N) search is possible with other algorithms.
  • Sorting by inserting items one at a time into a B-tree and then traversing the tree is not optimal. The proposed approach takes O(N log_B N) but it's possible to sort in O((N/B) * log (N/B)).
  • Unlike with the standard cost model, many map data structures have different costs for lookup and insertion in the DAM, which means that e.g. adding UNIQUE constraints to database indexes can actually change the complexity of inserting into the index (since you have to do lookup in such an index before you know whether an insert should succeed).

Now let's cover a few map data structures. We'll see that the maps that do well in the DAM model will be those that are best able to sequentialize their access patterns to exploit the block structure of memory.

2-3 Tree

The 2-3 tree is a balanced tree structure where every leaf node is at the same depth, and all internal nodes have either 1 or 2 keys — and therefore have either 2 or 3 children. Leaf nodes have either 1 or 2 key/value pairs.

Lookup in this tree is entirely straightforward and has complexity O(log N). Insertion into the tree proceeds recursively starting from the root node:

  1. If inserting into a leaf, we add the data item to the leaf. Note that this may mean that the leaf temporarily contains 3 key/value pairs, which is more than the usual limit.
  2. If inserting into an internal node, we recursively add the data item to the appropriate child. After doing this, the child may contain 3 keys, in which case we pull one up to this node, creating a new sibling in the process. If this node already contained 2 keys this will in turn cause it to become oversized. An example of how this might look is:

  3. If, after the recursion completes, the root node contains 3 keys, then we pull a new root node (with one key) out of the old root, like so:


It's easy to see that this keeps the tree balanced. This insertion process also clearly has O(log N) time complexity, just like lookup. The data structure makes no attempt to exploit the fact that memory is block structured, so both insertion and lookup have identical complexity in the DAM and the standard cost model.

B-Tree

The B-tree (and the very closely related B+tree) is probably the most popular structure for external memory storage. It can be seen as a simple generalisation of the 2-3 tree where, instead of each internal node having 1 or 2 keys, it instead has between m and 2m keys for any m > 0. We then set m to the maximum value so that one internal node fits exactly within our block size B, i.e. m = O(B).

In the DAM cost model, lookup in a B-tree has time complexity O(log_B N). This is because we can access each internal node's set of at least m keys using a single block read — i.e. in O(1) — and this lets us make a choice between at least m+1 = O(B) child nodes.

For similar reasons to the lookup case, inserting into a B-tree also has time cost O(log_B N) in the DAM.
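To get a feel for why the base of the logarithm matters, here's a tiny calculation of block reads per lookup for a binary-ish tree versus a B-tree, assuming (for illustration) about 1000 keys per block and a billion entries:

// Rough feel for O(log_B N) vs O(log N): block reads per lookup, assuming
// ~1000 keys fit in one block and the map holds a billion entries.
public class TreeHeights {
    static double logBase(double base, double x) { return Math.log(x) / Math.log(base); }

    public static void main(String[] args) {
        double n = 1e9;
        System.out.printf("binary/2-3 tree: ~%.0f block reads%n", logBase(2, n));       // ~30
        System.out.printf("B-tree (B ~ 1000): ~%.0f block reads%n", logBase(1000, n));  // ~3
    }
}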

Buffered Repository Tree

A buffered repository tree, or BRT, is a generalization of a 2-3 tree where each internal node is associated with an additional buffer of size k = O(B). When choosing k a sensible choice is to make it just large enough to use all the space within a block that is not occupied by the keys of the internal node.

When inserting into this tree, we do not actually modify the tree structure immediately. Instead, a record of the insert just gets appended to the root node's buffer until that buffer becomes full. Once it is full, we're sure to be able to spill at least k/3 insertions to one child node. These inserts will be buffered at the lower level in turn, and may trigger recursive spills to yet-deeper levels.

What is the time complexity of insertion? Some insertions will be very fast because they just append to the buffer, while others will involve extensive spilling. To smooth over these differences, we therefore consider the amortized cost of an insertion. If we insert N elements into the tree, then at each of the O(log (N/B)) levels of the tree we'll spill at most O(N/(k/3)) = O(N/B) times. This gives a total cost for the insertions of O((N/B) log (N/B)), which is an amortized cost of O((log (N/B))/B).

Lookup proceeds pretty much as normal, except that the buffer at each level must be searched before any child nodes are considered. In the DAM, this additional search has cost O(1), so lookup cost becomes O(log (N/B)).

Essentially, what we've done with this structure is greatly speed up insertions by exploiting the fact that the DAM lets us batch up writes into groups of size O(B) for free. This is our first example of a structure whose insertion cost is lower than its lookup cost.
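The buffering mechanic is perhaps easiest to see in code. The following Python sketch is my own simplification (it ignores node splitting and rebalancing entirely), but it shows inserts accumulating in a node's buffer and spilling to whichever child has the most pending items once the buffer fills:

BUFFER_CAPACITY = 4   # stands in for k = O(B); tiny here so the example is easy to trace

class Node:
    def __init__(self, keys=None, children=None):
        self.keys = keys or []          # separator keys (internal nodes only)
        self.children = children or []  # no children => this node is a leaf
        self.buffer = {}                # pending key -> value inserts
        self.data = {}                  # materialised key/value pairs (leaves only)

    def child_for(self, key):
        for i, k in enumerate(self.keys):
            if key < k:
                return self.children[i]
        return self.children[-1]

    def insert(self, key, value):
        if not self.children:                    # leaf: apply the write directly
            self.data[key] = value
            return
        self.buffer[key] = value                 # otherwise just queue it up...
        if len(self.buffer) > BUFFER_CAPACITY:   # ...and spill once the buffer is full
            self.spill()

    def spill(self):
        # Group the pending inserts by destination child and flush the largest group,
        # which must hold at least a third of the buffered items.
        groups = {}
        for k in self.buffer:
            groups.setdefault(id(self.child_for(k)), []).append(k)
        for k in max(groups.values(), key=len):
            self.child_for(k).insert(k, self.buffer.pop(k))   # may recursively spill

# A fixed two-leaf tree, just to demonstrate the buffering behaviour.
root = Node(keys=[50], children=[Node(), Node()])
for i in range(20):
    root.insert((i * 7) % 100, i)
print(len(root.buffer), len(root.children[0].data), len(root.children[1].data))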

B-ε Tree

It turns out that it's possible to see the B-tree and the BRT as the two most extreme examples of a whole family of data structures. Specifically, both the B-tree and the BRT are instances of a more general notion called a B-ε tree, where ε is a real variable ranging between 0 and 1.

A B-ε tree is a generalisation of a 2-3 tree where each internal node has between m and 2m keys, where 0 < m = O(B^ε). Each node is also accompanied by a buffer of size k = O(B). This buffer space is used to queue pending inserts, just like in the BRT.

One possible implementation strategy is to set m so that one block is entirely full with keys when ε = 1, and so that m = 2 when ε = 0. The k value can then be chosen to exactly occupy any space within the block that is not being used for keys (so in particular, if ε = 1 then k = 0). With these definitions it's clear that the ε = 1 case corresponds to a B-tree and ε = 0 gives you a BRT.

As you would expect, the B-ε insertion algorithm operates in essentially the same manner as described above for the BRT. To derive the time complexity of insertion, we once again look at the amortized cost. Observe that the structure will have O(log_{B^ε} (N/B)) = O((log_B (N/B))/ε) = O((log_B N)/ε) levels and that on each spill we'll be able to push down at least O(B^(1-ε)) elements to a child. This means that after inserting N elements into the tree, we'll spill at most O(N/B^(1-ε)) = O(N*B^(ε-1)) times. This gives a total cost for the insertions of O(N*B^(ε-1)*(log_B N)/ε), which is an amortized cost of O((B^(ε-1)/ε)*log_B N).

The time complexity of lookups is just the number of levels in the tree, i.e. O((log_B N)/ε).
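To get a feel for the trade-off that ε controls, here is a small Python evaluation of the two formulas above (constant factors are ignored, and the values of B and N are arbitrary choices for illustration):

import math

B, N = 1024, 10**9   # arbitrary illustrative values

def lookup_cost(eps):
    return math.log(N, B) / eps                       # O((log_B N)/ε)

def insert_cost(eps):
    return (B ** (eps - 1) / eps) * math.log(N, B)    # O((B^(ε-1)/ε) * log_B N)

for eps in (1.0, 0.5, 0.25):
    print(f"eps={eps}: lookup ~{lookup_cost(eps):.2f}, insert ~{insert_cost(eps):.4f}")
# eps=1.0 behaves like a B-tree (lookup and insert cost the same); shrinking eps
# buys progressively cheaper inserts at the price of more expensive lookups.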

Fractal Tree

These complexity results for the B-ε tree suggest a tantalising possibility: if we set ε = ½ we'll have a data structure whose asymptotic insert time will be strictly better (by a factor of sqrt B) than that of B-trees, but which has exactly the same asymptotic lookup time. This data structure is given the exotic name of a "fractal tree". Unfortunately, the idea is patented by the founders of Tokutek (now Percona), so it's only used commercially in Percona products like TokuDB. If you want to read more about what you are missing out on, there's a good article on the company blog and a whitepaper.

Log-Structured Merge Tree

The final data structure we'll consider, the log-structured merge tree (LSMT), rivals the popularity of the venerable B-tree and is the technology underlying most "NoSQL" stores.

In a LSMT, you maintain your data in a list of B-trees of varying sizes. Lookups are accomplished by checking each B-tree in turn. To avoid lookups having to check too many B-trees, we arrange that we never have too many small B-trees in the collection.

There are two classes of LSMT that fit this general scheme: size-tiered and levelled.

In a levelled LSMT, your collection is a list of B-trees of size at most O(B), O(B*k), O(B*k^2), O(B*k^3), etc for some growth factor k. Call these level 0, 1, 2 and so on. New items are inserted into the level 0 tree. When this tree exceeds its size bound, it is merged into the level 1 tree, which may trigger recursive merges in turn.

Observe that if we insert N items into a levelled LSMT, there will be O(log_k (N/B)) B-trees and the last one will have O(N/B) items in it. Therefore lookup has complexity O(log_B N * log_k (N/B)). To derive the update cost, observe that the items in the last level have been merged down the full O(log_k (N/B)) levels, and they will have been merged into on average O(k) times in each level before moving down to the next. Therefore the amortized insertion cost is O((k * log_k (N/B)) / B).

If we set k = sqrt B then lookup and insert complexity simplify to O((log_B N)^2) and O(log_B N / sqrt B) respectively.

In a size-tiered LSMT things are slightly different. In this scheme we have a staging buffer of size O(B) and more than one tree at each level: specifically, at level i >= 0, we have up to k B-trees of size exactly O(B*k^i). New items are inserted into the staging buffer. When it runs out of space, we turn it into a B-tree and insert it into level 0. If this would cause us to have more than k trees in the level, we merge the k trees together into one tree of size O(B*k) that we can try to insert into level 1, which may in turn trigger recursive merges.

The complexity arguments we made for levelled LSMT carry over almost unchanged into this new setting, showing that the two schemes have identical costs. LSMTs match the insert performance of fractal trees, but suffer the cost of an extra log factor when doing lookup. To try to improve lookup time, in practice most LSMT implementations store each B-tree along with a Bloom filter which allows them to avoid accessing a tree entirely when a key of interest is certainly not included in it.
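To make the levelled scheme concrete, here is a toy Python sketch of my own in which sorted lists stand in for the on-disk B-trees and deletes/tombstones are ignored:

import bisect

BASE_SIZE = 4        # size bound for level 0; stands in for O(B)
GROWTH_FACTOR = 4    # k: each level may be k times larger than the one above it

class LevelledLSMT:
    def __init__(self):
        self.levels = [[]]   # levels[i] is a sorted (key, value) list bounded by BASE_SIZE * k^i

    def insert(self, key, value):
        self._merge_into([(key, value)], 0)

    def _merge_into(self, run, i):
        if i == len(self.levels):
            self.levels.append([])
        # Newer data (the incoming run) wins whenever a key appears in both.
        merged = sorted({**dict(self.levels[i]), **dict(run)}.items())
        if len(merged) > BASE_SIZE * GROWTH_FACTOR ** i:
            self.levels[i] = []          # level overflowed: merge everything into the next level
            self._merge_into(merged, i + 1)
        else:
            self.levels[i] = merged

    def lookup(self, key):
        for level in self.levels:        # smaller levels hold newer data, so check them first
            j = bisect.bisect_left(level, (key,))
            if j < len(level) and level[j][0] == key:
                return level[j][1]
        return None

t = LevelledLSMT()
for i in range(100):
    t.insert(i % 37, i)
assert t.lookup(5) == 79   # the most recent value written for key 5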

There are several good overviews of LSMTs available online.

Experiments

To validate my knowledge of these data structures, I wrote a Python program that tries to perform an apples-to-apples comparison of various B-ε tree variants. The code implements the data structure and also logs how many logical blocks it would need to touch if the tree were actually implemented on a block-structured device (in reality I just represent it as a Python object). I assume that as many nodes towards the top of the tree as possible are stored in memory and so don't hit the block device.

I simulate a machine with 1MB of memory and 32KB pages. Keys are assumed to be 16 bytes and values 240 bytes. With these assumptions we can see how the number of block device pages we need to write to varies with the number of keys in the tree for each data structure:

[figure: uncached_writes]

These experimental results match what we would expect from the theoretical analysis: the BRT has a considerable advantage over the alternatives when it comes to writes, B-trees are the worst, and fractal trees occupy the middle ground.

The equivalent results for reads are as follows:

[figure: uncached_reads]

This is essentially a mirror image of the write results, showing that we're fundamentally making a trade-off here.

Summary

We can condense everything we've learnt above into the following table:

Structure Lookup Insert
2-3 Tree O(log N) O(log N)
B-ε Tree O((log_B N)/ε) O((B^(ε-1)/ε)*log_B N)
B-Tree (ε=1) O(log_B N) O(log_B N)
Fractal Tree (ε=½) O(log_B N) O(log_B N / sqrt B)
Buffered Repository Tree (ε=0) O(log (N/B)) O((log (N/B))/B)
Log Structured Merge Tree O((log_B N)^2) O(log_B N / sqrt B)

These results suggest that you should always prefer to use a fractal tree to any of a B-tree, LSMT or 2-3 tree. In the real world, things may not be so clear cut: in particular, because of the fractal tree patent situation, it may be difficult to find a free and high-quality implementation of that data structure.

Most engineering effort nowadays is being directed at improving implementations of B-trees and LSMTs, so you probably want to choose one of these two options depending on whether your workload is read or write heavy, respectively. Some would argue, however, that all database workloads are essentially write bound, given that you can usually optimize a slow read workload by simply adding some additional indexes.

Compression of floating point timeseries

I recently had cause to investigate fast methods of storing and transferring financial timeseries. Naively, timeseries can be represented in memory or on disk as simple dense arrays of floating point numbers. This is an attractive representation with many nice properties:

  • Straightforward and widely used.
  • You have random access to the nth element of the timeseries with no further indexes required.
  • Excellent locality-of-reference for applications that process timeseries in time order, which is the common case.
  • Often natively supported by CPU vector instructions.

However, it is not a particularly space-efficient representation. Financial timeseries have considerable structure (e.g. Vodafone's price on T is likely to be very close to the price on T+1), and this structure can be exploited by compression algorithms to greatly reduce storage requirements. This is important either when you need to store a large number of timeseries, or need to transfer a smaller number of timeseries over a bandwidth-constrained network link.

Timeseries compression has received quite a bit of attention from both the academic/scientific programming community (see e.g. FPC and PFOR) and also practitioner communities such as the demoscene (see this presentation by a member of Farbrausch). This post summarises my findings about the effect that a number of easy-to-implement "filters" have on the final compression ratio.

In the context of compression algorithms, filters are simple invertible transformations that are applied to the stream in the hopes of making the stream more compressible by subsequent compressors. Perhaps the canonical example of a filter is the Burrows-Wheeler transform, which has the effect of moving runs of similar letters together. Some filters will turn a decompressed input stream (from the user) of length N into an output stream (fed to the compressor) of length N, but in general filters will actually have the effect of making the stream longer. The hope is that the gains to compressibility are enough to recover the bytes lost to any encoding overhead imposed by the filter.

In my application, I was using the compression as part of an RPC protocol that would be used interactively, so I wanted to keep decompression time very low, and for ease-of-deployment I wanted to get results in the context of Java without making use of any native code. Consequently I was interested in which choice of filter and compression algorithm would give a good tradeoff between performance and compression ratio.

I determined this experimentally. In my experiments, I used timeseries associated with 100 very liquid US stocks retrieved from Yahoo Finance, amounting to 69MB of CSVs split across 6 fields per stock (open/high/low/close and adjusted close prices, plus volume). This amounted to 12.9 million floating point numbers.

Choice of compressor

To decide which compressors were contenders, I compressed these price timeseries with a few pure-Java implementations of the algorithms:

Compressor Compression time (s) Decompression time (s) Compression ratio
None 0.0708 0.0637 1.000
Snappy (org.iq80.snappy:snappy-0.3) 0.187 0.115 0.843
Deflate BEST_SPEED (JDK 8) 4.59 4.27 0.602
Deflate DEFAULT_COMPRESSION (JDK 8) 5.46 4.29 0.582
Deflate BEST_COMPRESSION (JDK 8) 7.33 4.28 0.580
BZip2 MIN_BLOCKSIZE (org.apache.commons:commons-compress-1.10) 1.79 0.756 0.540
BZip2 MAX_BLOCKSIZE (org.apache.commons:commons-compress-1.10) 1.73 0.870 0.515
XZ PRESET_MIN (org.apache.commons:commons-compress-1.10 + org.tukaani:xz-1.5) 2.66 1.20 0.469
XZ PRESET_DEFAULT (org.apache.commons:commons-compress-1.10 + org.tukaani:xz-1.5) 9.56 1.15 0.419
XZ PRESET_MAX (org.apache.commons:commons-compress-1.10 + org.tukaani:xz-1.5) 9.83 1.13 0.419

These numbers were gathered from a custom benchmark harness which simply compresses and then decompresses the whole dataset once. However, I saw the same broad trends confirmed by a JMH benchmark of the same combined operation:

Compressor Compress/decompress time (s) JMH compress/decompress time (s)
None 0.135 0.127 ± 0.002
Snappy (org.iq80.snappy:snappy-0.3) 0.302 0.215 ± 0.003
Deflate BEST_SPEED (JDK 8) 8.86 8.55 ± 0.15
Deflate DEFAULT_COMPRESSION (JDK 8) 9.75 9.35 ± 0.09
Deflate BEST_COMPRESSION (JDK 8) 11.6 11.4 ± 0.1
BZip2 MIN_BLOCKSIZE (org.apache.commons:commons-compress-1.10) 2.55 3.10 ± 0.04
BZip2 MAX_BLOCKSIZE (org.apache.commons:commons-compress-1.10) 2.6 3.77 ± 0.31
XZ PRESET_MIN (org.apache.commons:commons-compress-1.10 + org.tukaani:xz-1.5) 3.86 4.08 ± 0.12
XZ PRESET_DEFAULT (org.apache.commons:commons-compress-1.10 + org.tukaani:xz-1.5) 10.7 11.1 ± 0.1
XZ PRESET_MAX (org.apache.commons:commons-compress-1.10 + org.tukaani:xz-1.5) 11.0 11.5 ± 0.4

What we see here is rather impressive performance from BZip2 and Snappy. I expected Snappy to do well, but BZip2's good showing surprised me. In some previous (unpublished) microbenchmarks I've not seen GZipInputStream (a thin wrapper around Deflate with DEFAULT_COMPRESSION) be quite so slow, and my results also seem to contradict other Java compression benchmarks.

One contributing factor may be that the structure of the timeseries I was working with in that unpublished benchmark was quite different: there was a lot more repetition (runs of NaNs and zeroes), and compression ratios were consequently higher.

In any event, based on these results I decided to continue my evaluation with both Snappy and BZip2 MIN_BLOCKSIZE. It's interesting to compare these two compressors because, unlike BZ2, Snappy doesn't perform any entropy encoding.

Filters

The two filters that I evaluated were transposition and zig-zagged delta encoding.

Transposition

The idea behind transposition (also known as "shuffling") is as follows. Let's say that we have three floating point numbers, each occupying 4 bytes:

On a big-endian system this will be represented in memory row-wise by the 4 consecutive bytes of the first float (MSB first), followed by the 4 bytes of the second float, and so on. In contrast, a transposed representation of the same data would encode all of the MSBs first, followed by all of the second-most-significant bytes, and so on, in a column-wise fashion:

The reason you might think that writing the data column-wise would improve compression is that you might expect e.g. the most significant bytes of a series of floats in a timeseries to be very similar to each other. By moving these similar bytes closer together you increase the chance that compression algorithms will be able to find repeating patterns in them, undisturbed by the essentially random content of the LSBs.
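A byte-level transpose is only a few lines of Python. This sketch (illustrative only, using the standard struct module and big-endian single-precision floats as in the description above) shows the round trip:

import struct

def transpose_bytes(floats):
    """Serialize the floats big-endian, then emit all the 1st bytes, then all the 2nd bytes, ..."""
    rows = [struct.pack('>f', x) for x in floats]            # one 4-byte row per float
    return bytes(row[i] for i in range(4) for row in rows)   # read the rows column-wise

def untranspose_bytes(data):
    n = len(data) // 4
    cols = [data[i * n:(i + 1) * n] for i in range(4)]
    return [struct.unpack('>f', bytes(col[j] for col in cols))[0] for j in range(n)]

xs = [1.0, 1.5, 2.0]
assert untranspose_bytes(transpose_bytes(xs)) == xs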

Field transposition

Analogous to the byte-level transposition described above, another thing we might try is transposition at the level of a float subcomponent. Recall that floating point numbers are divided into sign, exponent and mantissa components. For single precision floats this looks like:

Inspired by this, another thing we might try is transposing the data field-wise — i.e. serializing all the signs first, followed by all the exponents, then all the mantissas:

(Note that I'm inserting padding bits to keep multibit fields byte aligned — more on this later on.)

We might expect this transposition technique to improve compression by preventing changes in unrelated fields from causing us to be unable to spot patterns in the evolution of a certain field. A good example of where this might be useful is the sign bit: for many timeseries of interest we expect the sign bit to be uniformly 1 or 0 (i.e. all negative or all positive numbers). If we encoded the float without splitting it into fields, then that one very predictable bit would be mixed in with 31 much more varied bits, which makes it much harder to spot this pattern.
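To make the field split concrete, here is one way (a sketch, not the actual experiment code) to pull a single-precision float apart into its sign, exponent and mantissa using Python:

import struct

def split_float(x):
    """Return (sign, exponent, mantissa) for a single-precision float."""
    bits = struct.unpack('>I', struct.pack('>f', x))[0]
    sign     = bits >> 31            # 1 bit
    exponent = (bits >> 23) & 0xFF   # 8 bits
    mantissa = bits & 0x7FFFFF       # 23 bits
    return sign, exponent, mantissa

# Field-transposing a series means concatenating all the signs, then all the
# exponents, then all the mantissas, rather than interleaving them per float.
series = [152.25, 153.5, 151.0]
signs, exponents, mantissas = zip(*(split_float(x) for x in series))
print(signs)      # (0, 0, 0): a very predictable stream on its own
print(exponents)  # identical values for prices this close together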

Delta encoding

In delta encoding, you encode consecutive elements of a sequence not by their absolute value but rather by how much larger they are than the previous element in the sequence. You might expect this to aid later compression of timeseries data because, although a timeseries might have an overall trend, you would expect the day-to-day variation to be essentially unchanging. For example, Vodafone's stock price might be generally trending up from 150p at the start of the year to 200p at the end, but you expect it won't usually change by more than 10p on any individual day within that year. Therefore, by delta-encoding the sequence you would expect to increase the probability of the sequence containing a repeated substring and hence its compressibility.

This idea can be combined with transposition, by applying the transposition to the deltas rather than the raw data to be compressed. If you do go this route, you might then apply a trick called zig-zagging (used in e.g. protocol buffers) and store your deltas such that small negative numbers are represented as small positive ints. Specifically, you might store the delta -1 as 1, 1 as 2, -2 as 3, 2 as 4 and so on. The reasoning behind this is that you expect your deltas to be both positive and negative, but certainly clustered around 0. By using zig-zagging, you tend to cause the MSB of your deltas to become 0, which then in turn leads to extremely compressible long runs of zeroes in your transposed version of those deltas.
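The delta-plus-zig-zag step is easy to state in code. The sketch below works on plain integers and is an illustration of the idea rather than the exact encoding used in the experiments:

def zigzag(n):
    """Map signed deltas onto small unsigned ints: 0->0, -1->1, 1->2, -2->3, 2->4, ..."""
    return n << 1 if n >= 0 else ((-n) << 1) - 1

def unzigzag(z):
    return (z >> 1) ^ -(z & 1)

def zigzag_deltas(values):
    out, prev = [], 0
    for v in values:                 # the first delta is taken from an implicit initial 0
        out.append(zigzag(v - prev))
        prev = v
    return out

# Deltas of a slowly-drifting series cluster around zero, so the zig-zagged stream
# is full of tiny numbers whose high-order bytes are all zero.
print(zigzag_deltas([1000, 1003, 1001, 1001, 1005]))   # [2000, 6, 3, 0, 8]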

Special cases

One particular floating point number is worth discussing: NaN. It is very common for financial timeseries to contain a few NaNs scattered throughout. For example, when a stock exchange is on holiday no official close prices will be published for the day, and this tends to be represented as a NaN in a timeseries of otherwise similar prices.

Because NaNs are both common and very dissimilar to other numbers that we might encounter, we might want to encode them with a special short representation. Specifically, I implemented a variant of the field transposition above, where the sign bit is actually stored extended to a two bit "descriptor" value with the following interpretation:

Bit 1 Bit 2 Interpretation
0 0 Zero
0 1 NaN
1 0 Positive
1 1 Negative

The mantissa and exponent are not stored if the descriptor is 0 or 1.

Note that this representation of NaNs erases the distinction between different NaN values, when in reality there are e.g. 16,777,214 distinct single precision NaNs. This technically makes this a lossy compression technique, but in practice it is rarely important to be able to distinguish between different NaN values. (The only application that I'm aware of that actually depends on the distinction between NaNs is LuaJIT.)
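A sketch of how the two-bit descriptor might be computed for each value (illustrative only, with the bit assignments taken from the table above):

import math

# Descriptor values from the table above.
ZERO, NAN, POSITIVE, NEGATIVE = 0b00, 0b01, 0b10, 0b11

def describe(x):
    """Return (descriptor, needs_payload): zero and NaN need no exponent/mantissa."""
    if x == 0.0:                     # note: -0.0 is classified as zero here, another mildly lossy corner
        return ZERO, False
    if math.isnan(x):
        return NAN, False
    return (NEGATIVE if x < 0 else POSITIVE), True

print([describe(x) for x in [0.0, float('nan'), 152.25, -3.5]])
# [(0, False), (1, False), (2, True), (3, True)]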

Methodology

In my experiments (available on Github) I tried all combinations of the following compression pipeline:

  1. Field transposition: on (start by splitting each number into 3 fields) or off (treat whole floating point number as a single field)?
  2. (Only if field transposition is being used.) Special cases: on or off?
  3. Delta encoding: off (store raw field contents) or on (store each field as an offset from the previous field)? When delta encoding was turned on, I additionally used zig-zagging.
  4. Byte transposition: given that I have a field, should I transpose the bytes of that field? In fact, I exhaustively investigated all possible byte-aligned transpositions of each field.
  5. Compressor: BZ2 or Snappy?

I denote a byte-level transposition as a list of numbers summing to the number of bytes in one data item. So for example, a transposition for 4-byte numbers which wrote all of the LSBs first, followed by the all of the next-most-significant bytes etc would be written as [1, 1, 1, 1], while one that broke each 4-byte quantity into two 16-bit chunks would be written [2, 2], and the degenerate case of no transposition would be [4]. Note that numbers occur in the list in increasing order of the significance of the bytes in the item that they manipulate.
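The transposition notation can be turned into code directly. Here is a sketch of my own (not the experiment code on Github) that applies a scheme such as [2, 1] or [1, 1, 1, 1] to a list of equal-length byte strings, each given least-significant byte first to match the notation:

def transpose(scheme, items):
    """Apply a byte transposition. 'items' are equal-length byte strings (one per value,
    least-significant byte first); 'scheme' lists the chunk widths, LSB chunk first."""
    assert sum(scheme) == len(items[0])
    out, offset = bytearray(), 0
    for width in scheme:                       # one output column per chunk
        for item in items:
            out += item[offset:offset + width]
        offset += width
    return bytes(out)

# [4] is the degenerate "no transposition" case; [1, 1, 1, 1] fully disperses the bytes.
items = [b'\x01\x02\x03\x04', b'\x11\x12\x13\x14']
print(transpose([4], items))            # b'\x01\x02\x03\x04\x11\x12\x13\x14'
print(transpose([1, 1, 1, 1], items))   # b'\x01\x11\x02\x12\x03\x13\x04\x14'
print(transpose([2, 2], items))         # b'\x01\x02\x11\x12\x03\x04\x13\x14'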

As discussed above, in the case where a field (such as the mantissa) wasn't an exact multiple of 8 bits wide, my filters padded the field to the nearest byte boundary before sending it to the compressor. This means that the filtering process actually makes the data substantially larger (e.g. 52-bit double mantissas are padded to 56 bits, becoming 7.7% larger in the process). This not only makes the filtering code simpler, but also turns out to be essential for good compression when using Snappy, which is only able to detect byte-aligned repetition.

Without further ado, let's look at the results.

Dense timeseries

I begin by looking at dense timeseries where NaNs do not occur. With such data, it's clear that we won't gain from the "special cases" encoding above, so results in this section are derived from a version of the compression code where we just use 1 bit to encode the sign.

Single-precision floats

The minimum compressed size (in bytes) achieved for each combination of parameters is as follows:

Exponent Method Mantissa Method BZ2 Snappy
Delta Delta 6364312 9067141
Delta Literal 6283216 8622587
Literal Delta 6372444 9071864
Literal Literal 6306624 8626114

(This table says that, for example, if we delta-encode the float exponents but literal-encode the mantissas, then the best transposition scheme achieved a compressed size of 6,283,216 bytes.)

The story here is that delta encoding is strictly better than literal encoding for the exponent, but conversely literal encoding is better for the mantissa. In fact, if we look at the performance of each possible mantissa transposition, we can see that delta encoding tends to underperform in those cases where the MSB is split off into its own column, rather than being packaged up with the second-most-significant byte. This result is consistent across both BZ2 and Snappy.

Mantissa Transposition Mantissa Method BZ2 Snappy
[1, 1, 1] Delta 7321926 9199766
[1, 1, 1] Literal 7226293 9154821
[1, 2] Delta 7394645 9317258
[1, 2] Literal 7824557 9420206
[2, 1] Delta 6514554 9099462
[2, 1] Literal 6283216 8622587
[3] Delta 6364312 9067141
[3] Literal 6753718 9475316

The other interesting feature of these results is that transposition tends to hurt BZ2 compression ratios. It always makes things worse with delta encoding, and even with literal encoding only one particular transposition ([2, 1]) actually strongly improves a BZ2 result. Things are a bit different for Snappy: although once again delta is always worse with transposition enabled, transposition always aids Snappy in the literal case — though once again the effect is strongest with [2, 1] transposition.

The strong showing for [2, 1] transposition suggests to me that the lower-order bits of the mantissa are more correlated with each other than they are with the MSB. This sort of makes sense: because equities trade with a fixed tick size, prices are actually quantised into a relatively small number of distinct values, which tends to cause the lower-order bits of the mantissa to become correlated.

Finally, we can ask what would happen if we didn't make the mantissa/exponent distinction at all and instead just packed those two fields together:

Method BZ2 Snappy
Delta 6254386 8847839
Literal 6366714 8497983

These numbers don't show any clear preference for either of the two approaches. For BZ2, delta performance is improved by not doing the splitting, at the cost of larger outputs when using the literal method, while for Snappy we have the opposite: literal performance is improved while delta performance is harmed. What is true is that, in the best case, the compressed sizes we observe here are better than the best achievable sizes in the split case.

In some ways it is quite surprising that delta encoding ever beats literal encoding in this scenario, because it's not clear that the deltas we compute here are actually meaningful and hence likely to generally be small.

We can also analyse the best available transpositions in this case. Considering BZ2 first:

BZ2 Rank Delta Transposition Size Literal Transposition Size
1 [4] 6254386 [2, 2] 6366714
2 [3, 1] 6260446 [2, 1, 1] 6438215
3 [2, 2] 6395954 [3, 1] 7033810
4 [2, 1, 1] 6402354 [4] 7109668
5 [1, 1, 1, 1] 7136612 [1, 1, 1, 1] 7327437
6 [1, 2, 1] 7227282 [1, 1, 2] 7405128
7 [1, 1, 2] 7285350 [1, 2, 1] 8039403
8 [1, 3] 7337277 [1, 3] 8386647

Just as above, delta encoding does best when no transposition at all is used, and generally gets worse as the transposition gets more and more "fragmented". On the other hand, literal encoding does well with transpositions that tend to keep together the first two bytes (i.e. the exponent + the leading bits of the mantissa).

Now let's look at the performance of the unsplit data when compressed with Snappy:

Snappy Rank Delta Transposition Size Literal Transposition Size
1 [3, 1] 8847839 [2, 1, 1] 8497983
2 [2, 1, 1] 8883392 [1, 1, 1, 1] 9033135
3 [1, 1, 1, 1] 8979988 [1, 2, 1] 9311863
4 [1, 2, 1] 9093582 [3, 1] 9404027
5 [2, 2] 9650796 [2, 2] 10107042
6 [4] 9659888 [1, 1, 2] 10842190
7 [1, 1, 2] 9847987 [4] 10942085
8 [1, 3] 10524215 [1, 3] 11722159

The Snappy results are very different from the BZ2 case. Here, the same sort of transpositions tend to do well with both the literal and delta methods. The kinds of transpositions that are successful are those that keep together the exponent and the leading bits of the mantissa, though even fully-dispersed transpositions like [1, 1, 1, 1] put in a strong showing.

That's a lot of data, but what's the bottom line? For Snappy, splitting the floats into mantissa and exponent before processing does seem to produce slightly more consistently small outputs than working with unsplit data. The BZ2 situation is less clear, but only because the exact choice doesn't seem to make a ton of difference. Therefore, my recommendation for single-precision floats is to delta-encode exponents, and to use literal encoding for mantissas with [2, 1] transposition.

Double-precision floats

While there were only 4 different transpositions for single-precision floats, there are 2 ways to transpose a double-precision exponent, and 64 ways to transpose the mantissa. This makes the parameter search for double precision considerably more computationally expensive. The results are:

Exponent Method Mantissa Method BZ2 Snappy
Delta Delta 6485895 12463583
Delta Literal 6500390 10437550
Literal Delta 6456152 12469132
Literal Literal 6475579 10439869

These results show interesting differences between BZ2 and Snappy. For BZ2 there is not much in it, but it's consistently always better to literal-encode the exponent and delta-encode the mantissa. For Snappy, things are exactly the other way around: delta-encoding the exponent and literal-encoding the mantissa is optimal.

The choice of exponent transposition scheme has the following effect:

Exponent Transposition Exponent Method BZ2 Snappy
[1, 1] Delta 6485895 10437550
[1, 1] Literal 6492837 10439869
[2] Delta 6496032 10560467
[2] Literal 6456152 10564737

It's not clear, but [1, 1] transposition might be optimal. Bear in mind that double exponents are only 11 bits long, so the lower 5 bits of the LSB being encoded here will always be 0. Using [1, 1] transposition might better help the compressor get a handle on this pattern.

When looking at the best mantissa transpositions, there are so many possible transpositions that we'll consider BZ2 and Snappy one by one, examining just the top 10 transposition choices for each. BZ2 first:

BZ2 Rank Delta Mantissa Transposition Size Literal Mantissa Transposition Size
1 [7] 6456152 [6, 1] 6475579
2 [6, 1] 6498686 [1, 5, 1] 6495555
3 [4, 3] 6962167 [2, 4, 1] 6510416
4 [4, 2, 1] 6994270 [1, 1, 4, 1] 6510416
5 [1, 6] 7040401 [3, 3, 1] 6692613
6 [3, 4] 7054835 [2, 1, 3, 1] 6692613
7 [1, 5, 1] 7092230 [1, 1, 1, 3, 1] 6692613
8 [2, 5] 7108000 [1, 2, 3, 1] 6692613
9 [2, 4, 1] 7176760 [1, 6] 6926231
10 [3, 3, 1] 7210100 [7] 6931475

We can see that literal encoding tends to beat delta encoding, though the very best size was in fact achieved via a simple untransposed delta representation. In both the literal and the delta case, the encodings that do well tend to keep the middle 5 bytes of the mantissa grouped together, which supports our idea that these bytes tend to be highly correlated, with most of the information being encoded in the MSB.

Turning to Snappy:

Snappy Rank Delta Mantissa Transposition Size Literal Mantissa Transposition Size
1 [5, 1, 1] 12463583 [3, 2, 1, 1] 10437550
2 [1, 3, 2, 1] 12678887 [1, 1, 1, 2, 1, 1] 10437550
3 [6, 1] 12737838 [1, 2, 2, 1, 1] 10437550
4 [4, 2, 1] 12749067 [2, 1, 2, 1, 1] 10437550
5 [7] 12766804 [1, 1, 1, 1, 1, 1, 1] 10598739
6 [1, 3, 1, 1, 1] 12820360 [2, 1, 1, 1, 1, 1] 10598739
7 [3, 1, 2, 1] 12824154 [3, 1, 1, 1, 1] 10598739
8 [4, 1, 1, 1] 12890981 [1, 2, 1, 1, 1, 1] 10598739
9 [3, 3, 1] 12915900 [1, 1, 1, 1, 2, 1] 10715444
10 [3, 1, 1, 1, 1] 12946767 [3, 1, 2, 1] 10715444

The Snappy results are strikingly different from the BZ2 ones. In this case, just like BZ2, literal encoding tends to beat delta encoding, but the difference is much more pronounced than in the BZ2 case. Furthermore, the kinds of transpositions that minimize the size of the literal encoded data here are very different from the transpositions that were successful with BZ2: in that case we wanted to keep the middle bytes together, while here the scheme [1, 1, 1, 1, 1, 1, 1], where every byte has its own column, is not far from optimal.

And now considering results for the case where we do not split the floating point number into mantissa/exponent components:

Method BZ2 Snappy
Delta 6401147 11835533
Literal 6326562 9985079

These results show a clear preference for literal encoding, which is definitely what we expect, given that delta encoding is not obviously meaningful for an unsplit number. We also see results that are universally better than those for the split case: it seems that splitting the number into fields is actually a fairly large pessimisation! This is probably caused by the internal fragmentation implied by our byte-alignment of the data, which is a much greater penalty for doubles than it was for singles. It would be interesting to repeat the experiment without byte-alignment.

We can examine which transposition schemes do best in the unsplit case. BZ2 first:

BZ2 Rank Delta Transposition Size Literal Transposition Size
1 [8] 6401147 [6, 2] 6326562
2 [7, 1] 6407935 [1, 5, 2] 6351132
3 [6, 2] 6419062 [2, 4, 2] 6360533
4 [6, 1, 1] 6440333 [1, 1, 4, 2] 6360533
5 [4, 3, 1] 6834375 [3, 3, 2] 6562859
6 [4, 4] 6866167 [1, 2, 3, 2] 6562859
7 [4, 2, 1, 1] 6903123 [2, 1, 3, 2] 6562859
8 [4, 2, 2] 6903322 [1, 1, 1, 3, 2] 6562859
9 [1, 7] 6979216 [6, 1, 1] 6598104
10 [1, 6, 1] 6983998 [1, 5, 1, 1] 6621003

Recall that double precision floating point numbers have 11 bits of exponent and 52 bits of mantissa. We can actually see that showing up in the literal results above: the transpositions that do best are those that either pack together the exponent and the first bits of the mantissa, or have a separate column for just the exponent information (e.g. [1, 5, 2] or [2, 4, 2]).

And Snappy:

Snappy Rank Delta Transposition Size Literal Transposition Size
1 [5, 1, 1, 1] 11835533 [1, 1, 1, 2, 1, 1, 1] 9985079
2 [1, 3, 2, 1, 1] 12137223 [2, 1, 2, 1, 1, 1] 9985079
3 [6, 1, 1] 12224432 [3, 2, 1, 1, 1] 9985079
4 [4, 2, 1, 1] 12253670 [1, 2, 2, 1, 1, 1] 9985079
5 [1, 3, 1, 1, 1, 1] 12280165 [1, 1, 1, 1, 1, 1, 1, 1] 10232996
6 [3, 1, 2, 1, 1] 12281694 [2, 1, 1, 1, 1, 1, 1] 10232996
7 [5, 1, 2] 12338551 [1, 2, 1, 1, 1, 1, 1] 10232996
8 [4, 1, 1, 1, 1] 12399017 [3, 1, 1, 1, 1, 1] 10232996
9 [3, 1, 1, 1, 1, 1] 12409691 [1, 1, 1, 1, 2, 1, 1] 10343471
10 [2, 2, 2, 1, 1] 12434147 [2, 1, 1, 2, 1, 1] 10343471

Here we see the same pattern as we did above: Snappy seems to prefer "more transposed" transpositions than BZ2 does, and we even see a strong showing for the maximal split [1, 1, 1, 1, 1, 1, 1, 1].

To summarize: for doubles, it seems that regardless of which compressor you use, you are better off not splitting into mantissa/exponent portions, and just literal encoding the whole thing. If using Snappy, [1, 1, 1, 1, 1, 1, 1, 1] transposition seems to be the way to go, but the situation is less clear with BZ2: [6, 2] did well in our tests but it wasn't a runaway winner.

If for some reason you did want to use splitting and you are also going to use BZ2, then [6, 1] literal encoding for the mantissa and literal encoding for the exponent seems like a sensible choice. If you are a Snappy user, then I would suggest that a principled choice would be to use [1, 1, 1, 1, 1, 1, 1] literal encoding for the mantissa and likewise [1, 1] literal encoding for the exponent.

Sparse timeseries

Let's now look at the sparse timeseries case, where many of the values in the timeseries are NaN. In this case, we're interested in evaluating how useful the "special case" optimization above is in improving compression ratios.

To evaluate this, I replaced a fraction of numbers in my test dataset with NaNs and looked at the best possible size result for a few such fractions. The compressed size in each case is:

NaN Fraction No Split Split w/ Special Cases Split w/o Special Cases
0.00 9985079 10445497 10437550
0.10 10819117 9909251 11501526
0.50 8848393 6238097 10255811
0.75 5732076 3628543 7218287

Note that for convenience here the only compressor I tested was Snappy — i.e. BZ2 was not tested. I also didn't implement special cases in the no-split case, because an artifact of my implementation is that the special-casing is done at the same time as the float is split into its three component fields (sign, mantissa, exponent).

As we introduce small numbers of NaNs to the data, both the no-split and non-special-cased data get larger. This is expected, because we're replacing predictable timeseries values at random with totally dissimilar values and hence adding entropy. The special-cased split shrinks because this increasing entropy is compensated for by the very short codes we have chosen for NaNs (for which we do pay a very small penalty in the NaN-less case). At very high numbers of NaNs, the compressed data for all methods shrinks as NaNs become the rule rather than the exception.

A NaN fraction of 10% or more is probably realistic for real-world financial data, so it definitely does seem like implementing the special cases is worthwhile. The improvement would probably be considerably less if we looked at BZ2-based results, though.

Closing thoughts

One general observation is that delta encoding is very rarely the best choice, and when it is the best, the gains are usually marginal when compared to literal encoding. This is interesting because Fabian Giesen came to exactly the same conclusion (that delta encoding is redundant when you can do transposition) in the excellent presentation that I linked to earlier.

By applying these techniques to the dataset I was dealing with at work, I was able to get a nice compression ratio on the order of 10%-20% over and above what I could achieve with naive use of Snappy, so I consider the work a success, but don't intend to do any more research in the area. However, there are definitely more experiments that could be done in this vein. In particular, interesting questions are:

  • How robust are the findings of this post when applied to other datasets?
  • What if we don't byte-align everything? Does that improve the BZ2 case? (My preliminary experiments showed that it made Snappy considerably worse.)
  • Why exactly are the results for BZ2 and Snappy so different? Presumably it relates to the lack of an entropy encoder in Snappy, but it is not totally clear to me how this leads to the results above.

Easy publishing to Maven Central with Gradle

I recently released my first open source library for Java, MDBI. I learnt a lot about the Java open-source ecosystem as part of this process, and this blog summarises that in the hope that it will be useful to others. Specifically, the post will explain how to set up a project using the modern Gradle build system to build code and deploy it to the standard Maven Central repository from the command line really easily.

Getting started

In the Haskell ecosystem, everyone uses Cabal and Hackage, which are developed by the same people and tightly integrated. In contrast, Java's ecosystem is a bit more fragmented: build systems and package repositories are managed by different organisations, and you need to do a bit of integration work to join everything up.

In particular, in order to get started we're going to have to sign up with two different websites: Sonatype OSS and Bintray:

  • No-one can publish directly to Maven Central: instead you need to publish your project to an "approved repository", from where it will be synced to Central. Sonatype OSS is an approved repository that Sonatype (the company that runs Maven Central) provide free of charge specifically for open-source projects. We will use this to get our artifacts into Central, so go and follow the sign-up instructions now.

    Your application will be manually reviewed by a Sonatype employee and approved within one or two working days. If you want an example of what this process looks like you can take a look at the ticket I raised for my MDBI project.

  • Sonatype OSS is a functional enough way to get your artifacts onto Central, but it has some irritating features. In particular, when you want to make a release you need to first push your artifacts to OSS, and then use an ugly and confusing web interface called Sonatype Nexus to actually "promote" this to Central. I wanted the release to Central to be totally automated, and the easiest way to do that is to have a 3rd party deal with pushing to and then promoting from OSS. For this reason, you should also sign up with Bintray (you can do this with one click if you have a GitHub account).

    Bintray is run by a company called JFrog and basically seems to be a Nexus alternative. JFrog run a Maven repository called JCenter, and it's easy to publish to that via Bintray. Once it's on JCenter we'll be able to push and promote it on Sonatype OSS fully automatically.

We also need to create a Bintray "package" within your Bintray Maven repository. Do this via the Bintray interface — it should be self-explanatory. Use the button on the package page to request it be linked to JCenter (this was approved within a couple of hours for me).

We'll also need a GPG public/private key pair. Let's set that up now:

  1. Open up a terminal and run gpg --gen-key. Accept all the defaults about the algorithm to use, and enter a name, email and passphrase of your choosing.
  2. If you run gpg --list-public-keys you should see something like this:

    /Users/mbolingbroke/.gnupg/pubring.gpg
    --------------------------------------
    pub   2048R/3117F02B 2015-11-18
    uid                  Max Bolingbroke <batterseapower@hotmail.com>
    sub   2048R/15245385 2015-11-18

    Whatever is in place of 3117F02B is the name of your key. I'll call this $KEYNAME from now on.

  3. Run gpg --keyserver hkp://pool.sks-keyservers.net --send-keys $KEYNAME to publish your key.
  4. Run gpg -a --export-key $KEYNAME and gpg -a --export-secret-key $KEYNAME to get your public and secret keys as ASCII text. Edit your Bintray account and paste these into the "GPG Signing" part of the settings.
  5. Edit your personal Maven repository on Bintray and select the option to "GPG Sign uploaded files automatically". Don't use Bintray's public/private key pair.

Now that you have your Bintray and OSS accounts, we can move on to setting up Gradle.

Gradle setup

The key problem we're trying to solve with our Gradle build is producing a set of JARs that meet the Maven Central requirements. What this boils down to is ensuring that we provide:

  • The actual JAR file that people will run.
  • Source JARs containing the code that we built.
  • Javadoc JARs containing the compiled HTML help files.
  • GPG signatures for all of the above. (This is why we created a GPG key above.)
  • A POM file containing project metadata.

To satisfy these requirements we're going to use gradle-nexus-plugin. The resulting (unsigned, but otherwise Central-compliant) artifacts will then be uploaded to Bintray (and eventually Sonatype OSS + Central) using gradle-bintray-plugin. I also use one more plugin — Palantir's gradle-gitsemver — to avoid having to update the Gradle file whenever the version number changes. Our Gradle file begins by pulling all those plugins in:

buildscript {
    repositories {
        jcenter()
        maven { url "http://dl.bintray.com/palantir/releases" }
    }
    dependencies {
        classpath 'com.bmuschko:gradle-nexus-plugin:2.3.1'
        classpath 'com.jfrog.bintray.gradle:gradle-bintray-plugin:1.4'
        classpath 'com.palantir:gradle-gitsemver:0.7.0'
    }
}

apply plugin: 'java'
apply plugin: 'com.bmuschko.nexus'
apply plugin: 'com.jfrog.bintray'
apply plugin: 'gitsemver'

Now we have the usual Gradle configuration describing how to build the JAR. Note the use of the semverVersion() function (provided by the gradle-gitsemver plugin), which returns a version number derived from the most recent Git tag of the form vX.Y.Z. Despite the name of the plugin, there is no requirement to actually adhere to the principles of Semantic Versioning to use it: the only requirements for the version numbers are syntactic.

version semverVersion()
group 'uk.co.omega-prime'

def projectName = 'mdbi'
def projectDescription = 'Max\'s DataBase Interface: a simple but powerful JDBC wrapper inspired by JDBI'

sourceCompatibility = 1.8

jar {
    baseName = projectName
    manifest {
        attributes 'Implementation-Title': projectName,
                   'Implementation-Version': version
    }
}

repositories {
    mavenCentral()
}

dependencies {
    compile group: 'com.google.code.findbugs', name: 'jsr305', version: '3.0.1'
    testCompile group: 'org.xerial', name: 'sqlite-jdbc', version: '3.8.11.2'
    testCompile group: 'junit', name: 'junit', version: '4.12'
}

(Obviously your group, project name, description, dependencies etc will differ from this. Hopefully it's clear which parts of this example Gradle file you'll need to change for your project and which you can copy verbatim.)

Now we need to configure gradle-nexus-plugin to generate the POM. Just by the act of including the plugin we have already arranged for the appropriate JARs to be generated, but the plugin can't figure out the full contents of the POM by itself.

modifyPom {
    project {
        name projectName
        description projectDescription
        url 'http://batterseapower.github.io/mdbi/'

        scm {
            url 'https://github.com/batterseapower/mdbi'
            connection 'scm:https://batterseapower@github.com/batterseapower/mdbi.git'
            developerConnection 'scm:git://github.com/batterseapower/mdbi.git'
        }

        licenses {
            license {
                name 'The Apache Software License, Version 2.0'
                url 'http://www.apache.org/licenses/LICENSE-2.0.txt'
                distribution 'repo'
            }
        }

        developers {
            developer {
                id 'batterseapower'
                name 'Max Bolingbroke'
                email 'batterseapower@hotmail.com'
            }
        }
    }
}

nexus {
    sign = false
}

Note that I've explicitly turned off the automatic artifact signing capability of the Nexus plugin. Theoretically we should be able to keep this turned on, and sign everything locally before pushing to Bintray. This would mean that we wouldn't have to give Bintray our private key. In practice, if you sign things locally Bintray seems to mangle the signature filenames so they become unusable...

Finally, we need to configure the Bintray sync:

if (hasProperty('bintrayUsername') || System.getenv().containsKey('BINTRAY_USER')) {
    // Used by the bintray plugin
    bintray {
        user = System.getenv().getOrDefault('BINTRAY_USER', bintrayUsername)
        key  = System.getenv().getOrDefault('BINTRAY_KEY', bintrayApiKey)
        publish = true

        pkg {
            repo = 'maven'
            name = projectName
            licenses = ['Apache-2.0']
            vcsUrl = 'https://github.com/batterseapower/mdbi.git'

            version {
                name = project.version
                desc = projectDescription
                released = new Date()

                mavenCentralSync {
                    user     = System.getenv().getOrDefault('SONATYPE_USER', nexusUsername)
                    password = System.getenv().getOrDefault('SONATYPE_PASSWORD', nexusPassword)
                }
            }
        }

        configurations = ['archives']
    }
}

We do this conditionally because we still want people to be able to use the Gradle file even if they don't have a username and password set up. In order to make these credentials available to the script when run on your machine, you need to create a ~/.gradle/gradle.properties file with contents like this:

# These 3 are optional: they'll be needed if you ever use the nexus plugin with 'sign = true' (the default)
signing.keyId=<GPG $KEYNAME from earlier>
signing.password=<your GPG passphrase>
signing.secretKeyRingFile=<absolute path to your ~/.gnupg/secring.gpg file (or whatever you called it)>

nexusUsername=<username for Sonatype OSS>
nexusPassword=<password for Sonatype OSS>

bintrayUsername=<username for Bintray>
bintrayApiKey=<Bintray API key, found in the "API key" section of https://bintray.com/profile/edit>

You can see the complete, commented, Gradle file that I'm using in my project on Github.

Your first release

We should now be ready to go (assuming your Sonatype OSS and JCenter setup requests have been approved). Let's make a release! Go to the terminal and type:

git tag v1.0.0
gradle bintrayUpload

If everything works, you'll get a BUILD SUCCESSFUL message after a minute or so. Your new version should be visible on the Bintray package page (and JCenter) immediately, and will appear on Maven Central shortly afterwards.

If you want to go the whole hog and have your continuous integration (e.g. the excellent Travis) make these automatic deploys after every passing build, this guide for SBT looks useful. However, I didn't go this route so I can't say how it could be adapted for Gradle.

A nice benefit of publishing to Maven Central is that javadoc.io will host your docs for free totally automatically. Check it out!

Overall I found the process of publishing Java open source to Maven Central needlessly confusing, with many more moving parts than I was expecting. The periods of waiting for 3rd parties to approve my project were also a little frustrating, though in fairness the turnaround time was quite impressive given that they were doing the work for free. Hopefully this guide will help make the process a little less frustrating for other Gradle users in the future.