
A Cambridge Computer Science degree summarised in 58 crib sheets


From 2005 to 2008 I was an undergraduate studying Computer Science at Cambridge.
My method of preparing for the exams was to summarise each lecture course into just a few sides of A4, which I'd then commit to memory in their entirety.

To make them shorter and hence easier to memorise, I'd omit all but truly essential information from each crib sheet. For example, I wouldn't include any formula if it was easily derivable from first principles, and I certainly didn't waste any words on conceptual explanations. As a consequence, these sheets certainly aren't the best choice for those learning a subject for the first time, but they might come in handy as a refresher for those with some familiarity with the subject.

So without further ado, here is my summary of a complete Cambridge Computer Science degree in 58 crib sheets:

Advanced System Topics pdf lyx
Algorithms pdf doc
Algorithms II pdf doc
Artificial Intelligence I pdf doc
Bioinformatics pdf lyx
Business Studies pdf lyx
C And C++ pdf doc
Comparative Architectures pdf lyx
Compiler Construction pdf doc
Computation Theory pdf doc
Computer Design pdf doc
Computer Graphics pdf doc
Computer Systems Modelling pdf lyx
Computer Vision pdf lyx
Concepts In Programming Languages pdf doc
Concurrent Systems And Applications pdf doc
Databases pdf doc
Denotational Semantics pdf lyx
Digital Communications pdf doc
Digital Communications II pdf lyx
Digital Electronics pdf doc
Digital Signal Processing pdf lyx
Discrete Mathematics I pdf doc
Discrete Mathematics II pdf doc
Distributed Systems pdf lyx
ECAD pdf doc
Economics And Law pdf doc
Floating Point Computation pdf doc
Foundations Of Computer Science pdf doc
Foundations Of Functional Programming pdf doc
Human Computer Interaction pdf lyx
Information Retrieval pdf lyx
Information Theory And Coding pdf lyx
Introduction To Security pdf doc
Logic And Proof pdf doc
Mathematical Methods For CS pdf doc
Mathematics I pdf doc
Mathematics II pdf doc
Mathematics III pdf doc
Mechanics And Relativity pdf doc
Natural Language Processing pdf lyx
Operating Systems pdf doc
Optimising Compilers pdf lyx
Oscillations And Waves pdf doc
Probability pdf doc
Professional Practice And Ethics pdf doc
Programming In Java pdf doc
Prolog pdf doc
Quantum And Statistical Mechanics pdf doc
Regular Languages And Finite Automata pdf doc
Semantics Of Programming Languages pdf doc
Software Design pdf doc
Software Engineering pdf doc
Specification And Verification I pdf lyx
Specification And Verification II pdf lyx
Topics In Concurrency pdf lyx
Types pdf lyx
VLSI Design pdf doc

Because I only created crib sheets for subjects that I thought I might potentially choose to answer questions on during the exam, this list does not cover every available course (though it's probably at least 70% of them). The other thing to note is that Cambridge requires Computer Science students to take some courses in natural science during their first year: the crib sheets that I've included (e.g. "Mechanics And Relativity" and "Oscillations And Waves") reflect my specialization in physics.

Datastructures for external memory


Something I recently became interested in is map data structures for external memory — i.e. ways of storing indexed data that are optimized for storage on disk.

In a typical analysis of algorithm time complexity, you assume it takes constant time to access memory or perform a basic CPU operation such as addition. This is of course not wholly accurate: in particular, cache effects mean that memory access time varies wildly depending on what exact address you are querying. In a system where your algorithm may access external memory, this becomes even more true — a CPU that takes 1ns to perform an addition may easily find itself waiting 5ms (i.e. 5 million ns) for a read from a spinning disk to complete.

An alternative model of complexity is the Disk Access Machine (DAM). In this model, reading one block of memory (of fixed size B) has constant time cost, and all other operations are free. Just like its conventional cousin this is clearly a simplification of reality, but it's one that lets us succinctly quantify the disk usage of various data structures.

At the time of writing, this is the performance we can expect from the storage hierarchy:


Category          | Representative device                  | Sequential Read Bandwidth | Sequential Write Bandwidth | 4KB Read IOPS | 4KB Write IOPS
Mechanical disk   | Western Digital Black WD4001FAEX (4TB) | 130MB/s                   | 130MB/s                    | 110           | 150
SATA-attached SSD | Samsung 850 Pro (1TB)                  | 550MB/s                   | 520MB/s                    | 10,000        | 36,000
PCIe-attached SSD | Intel 750 (1.2TB)                      | 2,400MB/s                 | 1,200MB/s                  | 440,000       | 290,000
Main memory       | Skylake @ 3200MHz                      | 42,000MB/s                | 48,000MB/s                 | 16,100,000 (62ns/operation) |


(In the above table, all IOPS figures are reported assuming a queue depth of 1, so will tend to be worst case numbers for the SSDs.)

Observe that the implied bandwidth of random reads from a mechanical disk is (110 * 4KB)/s i.e. 440KB/s — approximately 300 times slower than the sequential read case. In contrast, random read bandwidth from a PCIe-attached SSD is (440,000 * 4KB)/s = 1.76GB/s i.e. only about 1.4 times slower than the sequential case. So you still pay a penalty for random access even on SSDs, but it's much lower than the equivalent cost on spinning disks.

One way to think about the IOPS numbers above is to break them down into that part of the IOPS that we can attribute to the time necessary to transfer the 4KB block (i.e. 4KB/Bandwidth) and whatever is left, which we can call the seek time (i.e. (1/IOPS) - (4KB/Bandwidth)):

Category          | Implied Seek Time From Read | Implied Seek Time From Write | Mean Implied Seek Time
Mechanical Disk   | 9.06ms                      | 6.63ms                       | 7.85ms
SATA-attached SSD | 92.8us                      | 20.2us                       | 56.5us
PCIe-attached SSD | 645ns                       | 193ns                        | 419ns
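
To make the arithmetic concrete, here's a quick Python sketch (my own, not part of any benchmark) that derives those implied seek times as 1/IOPS - (4KB/Bandwidth), using the device figures from the first table. Treating KB and MB as powers of two reproduces the table to within rounding:

KB, MB = 1024, 1024 * 1024
devices = {
    # name: (read MB/s, write MB/s, 4KB read IOPS, 4KB write IOPS)
    'Mechanical disk':   (130,   130,   110,     150),
    'SATA-attached SSD': (550,   520,   10_000,  36_000),
    'PCIe-attached SSD': (2_400, 1_200, 440_000, 290_000),
}

for name, (rbw, wbw, riops, wiops) in devices.items():
    read_seek  = 1 / riops - 4 * KB / (rbw * MB)   # seconds
    write_seek = 1 / wiops - 4 * KB / (wbw * MB)
    print(f'{name}: read {read_seek * 1e6:.2f}us, write {write_seek * 1e6:.2f}us')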

If we are using the DAM to model programs running on top of one of these storage mechanisms, which block size B should we choose such that algorithm costs derived from the DAM are a good guide to real-world time costs? Let's say that our DAM cost for some algorithm is N block reads. Consider two scenarios:

  • If these reads are all contiguous, then the true time cost (in seconds) of the reads will be N*(B/Bandwidth) + Seek Time
  • If they are all random, then the true time cost is N*((B/Bandwidth) + Seek Time), i.e. (N - 1)*Seek Time more than the sequential case

The fact that the same DAM cost can correspond to two very different true time costs suggests that we should try to choose a block size that minimises the difference between the two possible true costs. With this in mind, a sensible choice is to set B equal to the product of the seek time and the bandwidth of the device. If we do this, then in the random-access scenario (where the DAM most underestimates the cost):

  • Realized IOPS will be at least half of peak IOPS for the storage device.
  • Realized bandwidth will be at least half of peak bandwidth for the storage device.

If we choose B smaller than the bandwidth/seek time product then we'll get IOPS closer to device maximum, but only at the cost of worse bandwidth. Likewise, larger blocks than this will reduce IOPS but boost bandwidth. The proposed choice of B penalises both IOPS and bandwidth equally. Applying this idea to the storage devices above:

Category          | Implied Block Size From Read | Implied Block Size From Write | Mean Implied Block Size
Mechanical Disk   | 1210KB                       | 883KB                         | 1040KB
SATA-attached SSD | 52.3KB                       | 10.8KB                        | 31.6KB
PCIe-attached SSD | 1.59KB                       | 243B                          | 933B
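
Continuing the same sketch, the implied block sizes are then just those seek times multiplied by the corresponding bandwidths (again my own illustration; the figures land close to the table above):

for name, (rbw, wbw, riops, wiops) in devices.items():
    read_seek  = 1 / riops - 4 * KB / (rbw * MB)
    write_seek = 1 / wiops - 4 * KB / (wbw * MB)
    read_block  = read_seek  * rbw * MB            # bytes
    write_block = write_seek * wbw * MB
    print(f'{name}: read {read_block / KB:.1f}KB, write {write_block / KB:.1f}KB')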

On SSDs the smallest writable/readable unit of storage is the page. On current generation devices, a page tends to be around 8KB in size. It's gratifying to see that this is within an order of magnitude of our SSD block size estimates here.

Interestingly, the suggested block sizes for mechanical disks are much larger than the typical block sizes used in operating systems and databases, where 4KB virtual memory/database pages are common (and certainly much larger than the 512B sector size of most spinning disks). I am of course not the first to observe that typical database page sizes appear to be far too small.

Applying the DAM

Now that we've decided how to apply the DAM so that its disk cost estimates translate (at least roughly) into real-world costs, we can use the model to analyse some algorithms. Before we begin, here are some interesting features of the DAM:

  • Binary search is not optimal. Binary-searching N items takes O(log (N/B)) block reads, but O(log_B N) search is possible with other algorithms.
  • Sorting by inserting items one at a time into a B-tree and then traversing the tree is not optimal. The proposed approach takes O(N log_B N) but it's possible to sort in O((N/B) * log (N/B)).
  • Unlike with the standard cost model, many map data structures have different costs for lookup and insertion in the DAM, which means that e.g. adding UNIQUE constraints to database indexes can actually change the complexity of inserting into the index (since you have to do lookup in such an index before you know whether an insert should succeed).

Now let's cover a few map data structures. We'll see that the maps that do well in the DAM model will be those that are best able to sequentialize their access patterns to exploit the block structure of memory.

2-3 Tree

The 2-3 tree is a balanced tree structure where every leaf node is at the same depth, and all internal nodes have either 1 or 2 keys — and therefore have either 2 or 3 children. Leaf nodes have either 1 or 2 key/value pairs.

Lookup in this tree is entirely straightforward and has complexity O(log N). Insertion into the tree proceeds recursively starting from the root node:

  1. If inserting into a leaf, we add the data item to the leaf. Note that this may mean that the leaf temporarily contains 3 key/value pairs, which is more than the usual limit.
  2. If inserting into an internal node, we recursively add the data item to the appropriate child. After doing this, the child may contain 3 keys, in which case we pull one up to this node, creating a new sibling in the process. If this node already contained 2 keys this will in turn cause it to become oversized. An example of how this might look is:

    [Figure: 23-internal, splitting an oversized internal node and pulling its middle key up]
  3. If, after the recursion completes, the root node contains 3 keys, then we pull a new root node (with one key) out of the old root, like so:

    [Figure: 23-root, pulling a new one-key root out of an oversized root node]

It's easy to see that this keeps the tree balanced. This insertion process also clearly has O(log N) time complexity, just like lookup. The data structure makes no attempt to exploit the fact that memory is block structured, so both insertion and lookup have identical complexity in the DAM and the standard cost model.
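
To make the insertion procedure concrete, here's a minimal Python sketch of my own. For brevity it stores bare keys in every node (rather than key/value pairs only at the leaves), but the overflow-and-split logic is exactly the one described above:

import bisect

class Node:
    def __init__(self, keys, children=None):
        self.keys = keys              # 1 or 2 keys normally; 3 only transiently
        self.children = children      # None for a leaf, else len(keys) + 1 children

def split(node):
    """Split an oversized (3-key) node into two 1-key nodes plus its middle key."""
    k0, k1, k2 = node.keys
    if node.children is None:
        return Node([k0]), k1, Node([k2])
    c = node.children                 # an oversized internal node has 4 children
    return Node([k0], c[:2]), k1, Node([k2], c[2:])

def _insert(node, key):
    if node.children is None:         # leaf: just add the key (may overflow to 3)
        bisect.insort(node.keys, key)
        return
    i = bisect.bisect_left(node.keys, key)
    child = node.children[i]          # recurse into the appropriate child
    _insert(child, key)
    if len(child.keys) == 3:          # child overflowed: split it, pull the middle key up
        left, mid, right = split(child)
        node.keys.insert(i, mid)
        node.children[i:i + 1] = [left, right]

def insert(root, key):
    """Insert key and return the (possibly new) root of the tree."""
    _insert(root, key)
    if len(root.keys) == 3:           # the root itself overflowed: pull out a new root
        left, mid, right = split(root)
        root = Node([mid], [left, right])
    return root

# Example: every leaf stays at the same depth as keys are added.
root = Node([10])
for k in [20, 5, 15, 25, 30, 1]:
    root = insert(root, k)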

B-Tree

The B-tree (and the very closely related B+tree) is probably the most popular structure for external memory storage. It can be seen as a simple generalisation of the 2-3 tree where, instead of each internal node having 1 or 2 keys, it instead has between m and 2m keys for any m > 0. We then set m to the maximum value so that one internal node fits exactly within our block size B, i.e. m = O(B).

In the DAM cost model, lookup in a B-tree has time complexity O(log_B N). This is because we can access each internal node's set of at least m keys using a single block read — i.e. in O(1) — and this lets us make a choice between at least m+1 = O(B) child nodes.

For similar reasons to the lookup case, inserting into a B-tree also has time cost O(log_B N) in the DAM.
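
To get a feel for what O(log_B N) means in practice, here's a tiny back-of-the-envelope calculation (my own numbers, using the 32KB blocks and 16-byte keys from the experiments later in this post):

import math

B = 32 * 1024 // 16                 # ~2048 keys fit in one block, so the fan-out is O(B)
N = 1_000_000_000
print(math.ceil(math.log(N, B)))    # about 3 block reads to find one of a billion keys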

Buffered Repository Tree

A buffered repository tree, or BRT, is a generalization of a 2-3 tree where each internal node is associated with an additional buffer of size k = O(B). When choosing k a sensible choice is to make it just large enough to use all the space within a block that is not occupied by the keys of the internal node.

When inserting into this tree, we do not actually modify the tree structure immediately. Instead, a record of the insert just gets appended to the root node's buffer until that buffer becomes full. Once it is full, we're sure to be able to spill at least k/3 insertions to one child node (the root has at most 3 children, so by the pigeonhole principle at least a third of the buffered inserts must be destined for the same child). These inserts will be buffered at the lower level in turn, and may trigger recursive spills to yet-deeper levels.

What is the time complexity of insertion? Some insertions will be very fast because they just append to the buffer, while others will involve extensive spilling. To smooth over these differences, we therefore consider the amortized cost of an insertion. If we insert N elements into the tree, then at each of the O(log (N/B)) levels of the tree we'll spill at most O(N/(k/3)) = O(N/B) times. This gives a total cost for the insertions of O((N/B) log (N/B)), which is an amortized cost of O((log (N/B))/B).

Lookup proceeds pretty much as normal, except that the buffer at each level must be searched before any child nodes are considered. In the DAM, this additional search has cost O(1), so lookup cost becomes O(log (N/B)).

Essentially what we've done with this structure is greatly sped up the insertion rate by exploiting the fact that the DAM lets us batch up writes into groups of size O(B) for free. This is our first example of a structure whose insertion cost is lower than its lookup cost.

B-ε Tree

It turns out that it's possible to see the B-tree and the BRT as the two most extreme examples of a whole family of data structures. Specifically, both the B-tree and the BRT are instances of a more general notion called a B-ε tree, where ε is a real variable ranging between 0 and 1.

A B-ε tree is a generalisation of a 2-3 tree where each internal node has between m and 2m keys, where 0 < m = O(B^ε). Each node is also accompanied by a buffer of size k = O(B). This buffer space is used to queue pending inserts, just like in the BRT.

One possible implementation strategy is to set m so that one block is entirely full with keys when ε = 1, and so that m = 2 when ε = 0. The k value can then be chosen to exactly occupy any space within the block that is not being used for keys (so in particular, if ε = 1 then k = 0). With these definitions it's clear that the ε = 1 case corresponds to a B-tree and ε = 0 gives you a BRT.
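
As an illustration, here's one hypothetical way to compute m and k from B and ε following that strategy (the function and its rounding choices are mine, not taken from any particular implementation):

def b_epsilon_params(block_bytes, key_bytes, epsilon):
    """Internal nodes hold between m and 2m keys with m = O(B^epsilon);
       whatever block space is left over becomes the insert buffer k."""
    keys_per_block = block_bytes // key_bytes        # the epsilon = 1 case fills the block
    m = max(2, int(keys_per_block ** epsilon) // 2)  # epsilon = 0 gives m = 2, as described
    k = block_bytes - 2 * m * key_bytes              # leftover space queues pending inserts
    return m, k

# With 32KB blocks and 16-byte keys (the figures used in the experiments below):
for eps in (0.0, 0.5, 1.0):
    print(eps, b_epsilon_params(32 * 1024, 16, eps))   # eps=1 gives k=0 (a B-tree), eps=0 gives m=2 (a BRT)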

As you would expect, the B-ε insertion algorithm operates in essentially the same manner as described above for the BRT. To derive the time complexity of insertion, we once again look at the amortized cost. Observe that the structure will have O(log_(B^ε) (N/B)) = O((log_B (N/B))/ε) = O((log_B N)/ε) levels and that on each spill we'll be able to push down at least O(B^(1-ε)) elements to a child. This means that after inserting N elements into the tree, we'll spill at most O(N/B^(1-ε)) = O(N * B^(ε-1)) times. This gives a total cost for the insertions of O(N * B^(ε-1) * (log_B N)/ε), which is an amortized cost of O((B^(ε-1)/ε) * log_B N).

The time complexity of lookups is just the number of levels in the tree, i.e. O((log_B N)/ε).

Fractal Tree

These complexity results for the B-ε tree suggest a tantalising possibility: if we set ε = ½ we'll have a data structure whose asymptotic insert time will be strictly better (by a factor of sqrt B) than that of B-trees, but which have exactly the same asymptotic lookup time. This data structure is given the exotic name of a "fractal tree". Unfortunately, the idea is patented by the founders of Tokutek (now Percona), so they're only used commercially in Percona products like TokuDB. If you want to read more about what you are missing out on, there's a good article on the company blog and a whitepaper.

Log-Structured Merge Tree

The final data structure we'll consider, the log-structured merge tree (LSMT), rivals the popularity of the venerable B-tree and is the technology underlying most "NoSQL" stores.

In an LSMT, you maintain your data in a list of B-trees of varying sizes. Lookups are accomplished by checking each B-tree in turn. To avoid lookups having to check too many B-trees, we arrange that we never have too many small B-trees in the collection.

There are two classes of LSMT that fit this general scheme: size-tiered and levelled.

In a levelled LSMT, your collection is a list of B-trees of size at most O(B), O(B*k), O(B*k^2), O(B*k^3), etc for some growth factor k. Call these level 0, 1, 2 and so on. New items are inserted into the level 0 tree. When this tree exceeds its size bound, it is merged into the level 1 tree, which may trigger recursive merges in turn.

Observe that if we insert N items into a levelled LSMT, there will be O(log_k (N/B)) B-trees and the last one will have O(N/B) items in it. Therefore lookup has complexity O(log_B N * log_k (N/B)). To derive the update cost, observe that the items in the last level have been merged down the full O(log_k (N/B)) levels, and on average they will have been merged into O(k) times at each level before moving down to the next. Therefore the amortized insertion cost is O((k * log_k (N/B)) / B).

If we set k = sqrt B then the lookup and insert complexities simplify to O((log_B N)^2) and O(log_B N / sqrt B) respectively.

In a size-tiered LSMT things are slightly different. In this scheme we have a staging buffer of size O(B) and more than one tree at each level: specifically, at level i >= 0, we have up to k B-trees of size exactly O(B*k^i). New items are inserted into the staging buffer. When it runs out of space, we turn it into a B-tree and insert it into level 0. If this would cause us to have more than k trees in the level, we merge the k trees together into one tree of size O(B*k) that we can try to insert into level 1, which may in turn trigger recursive merges.
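
Here's a toy sketch (my own) of the size-tiered scheme, with sorted Python lists standing in for the B-trees. It's only meant to show the buffering and merge-and-push-down mechanics, not to be efficient:

import bisect

class SizeTieredLSMT:
    """Toy size-tiered LSMT: buffer_size plays the role of O(B), k is the growth factor."""
    def __init__(self, buffer_size=4, k=4):
        self.buffer_size, self.k = buffer_size, k
        self.buffer = []              # in-memory staging buffer
        self.levels = []              # levels[i]: up to k sorted runs of size ~buffer_size * k^i

    def insert(self, key):
        self.buffer.append(key)
        if len(self.buffer) >= self.buffer_size:
            run = sorted(self.buffer)             # "turn the buffer into a B-tree"
            self.buffer = []
            self._add_run(0, run)

    def _add_run(self, level, run):
        if len(self.levels) <= level:
            self.levels.append([])
        self.levels[level].append(run)
        if len(self.levels[level]) > self.k:      # too many runs at this level:
            merged = sorted(x for r in self.levels[level] for x in r)
            self.levels[level] = []
            self._add_run(level + 1, merged)      # merge them all and push the result down

    def lookup(self, key):
        if key in self.buffer:
            return True
        for level in self.levels:                 # check each run of each level in turn
            for run in level:
                i = bisect.bisect_left(run, key)
                if i < len(run) and run[i] == key:
                    return True
        return False

t = SizeTieredLSMT()
for x in range(100):
    t.insert(x)
print(t.lookup(42), t.lookup(1000))   # True False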

The complexity arguments we made for levelled LSMT carry over almost unchanged into this new setting, showing that the two schemes have identical costs. LSMTs match the insert performance of fractal trees, but suffer the cost of an extra log factor when doing lookup. To try to improve lookup time, in practice most LSMT implementations store each B-tree along with a Bloom filter which allows them to avoid accessing a tree entirely when a key of interest is certainly not included in it.

There are several good overviews of LSMTs available online.

Experiments

To validate my knowledge of these data structures, I wrote a Python program that tries to perform an apples-to-apples comparison of various B-ε tree variants. The code implements the data structures and also logs how many logical blocks it would need to touch if the tree were actually implemented on a block-structured device (in reality I just represent it as a Python object). I assume that as many of the nodes towards the top of the tree as possible are stored in memory and so don't hit the block device.

I simulate a machine with 1MB of memory and 32KB pages. Keys are assumed to be 16 bytes and values 240 bytes. With these assumptions we can see how the number of block device pages we need to write to varies with the number of keys in the tree for each data structure:

[Figure: uncached_writes, block device pages written vs. number of keys for each data structure]

These experimental results match what we would expect from the theoretical analysis: the BRT has a considerable advantage over the alternatives when it comes to writes, B-trees are the worst, and fractal trees occupy the middle ground.

The equivalent results for reads are as follows:

[Figure: uncached_reads, the equivalent plot for block device pages read]

This is essentially a mirror image of the write results, showing that we're fundamentally making a trade-off here.

Summary

We can condense everything we've learnt above into the following table:

Structure                      | Lookup         | Insert
2-3 Tree                       | O(log N)       | O(log N)
B-ε Tree                       | O((log_B N)/ε) | O((B^(ε-1)/ε) * log_B N)
B-Tree (ε=1)                   | O(log_B N)     | O(log_B N)
Fractal Tree (ε=½)             | O(log_B N)     | O(log_B N / sqrt B)
Buffered Repository Tree (ε=0) | O(log (N/B))   | O((log (N/B))/B)
Log Structured Merge Tree      | O((log_B N)^2) | O(log_B N / sqrt B)

These results suggest that you should always prefer to use a fractal tree to any of a B-tree, LSMT or 2-3 tree. In the real world, things may not be so clear cut: in particular, because of the fractal tree patent situation, it may be difficult to find a free and high-quality implementation of that data structure.

Most engineering effort nowadays is being directed at improving implementations of B-trees and LSMTs, so you probably want to choose one of these two options depending on whether your workload is read or write heavy, respectively. Some would argue, however, that all database workloads are essentially write bound, given that you can usually optimize a slow read workload by simply adding some additional indexes.

Easy publishing to Maven Central with Gradle

I recently released my first open source library for Java, MDBI. I learnt a lot about the Java open-source ecosystem as part of this process, and this blog summarises that in the hope that it will be useful to others. Specifically, the post will explain how to set up a project using the modern Gradle build system to build code and deploy it to the standard Maven Central repository from the command line really easily.

Getting started

In the Haskell ecosystem, everyone uses Cabal and Hackage, which are developed by the same people and tightly integrated. In contrast, Java's ecosystem is a bit more fragmented: build systems and package repositories are managed by different organisations, and you need to do a bit of integration work to join everything up.

In particular, in order to get started we're going to have to sign up with two different websites: Sonatype OSS and Bintray:

  • No-one can publish directly to Maven Central: instead you need to publish your project to an "approved repository", from where it will be synced to Central. Sonatype OSS is an approved repository that Sonatype (the company that runs Maven Central) provide free of charge specifically for open-source projects. We will use this to get our artifacts into Central, so go and follow the sign-up instructions now.

    Your application will be manually reviewed by a Sonatype employee and approved within one or two working days. If you want an example of what this process looks like you can take a look at the ticket I raised for my MDBI project.

  • Sonatype OSS is a functional enough way to get your artifacts onto Central, but it has some irritating features. In particular, when you want to make a release you need to first push your artifacts to OSS, and then use an ugly and confusing web interface called Sonatype Nexus to actually "promote" this to Central. I wanted the release to Central to be totally automated, and the easiest way to achieve that is to have a 3rd party deal with pushing to and then promoting from OSS. For this reason, you should also sign up with Bintray (you can do this with one click if you have a GitHub account).

    Bintray is run by a company called JFrog and basically seems to be a Nexus alternative. JFrog run a Maven repository called JCenter, and it's easy to publish to that via Bintray. Once it's on JCenter we'll be able to push and promote it on Sonatype OSS fully automatically.

We also need to create a Bintray "package" within your Bintray Maven repository. Do this via the Bintray interface — it should be self-explanatory. Use the button on the package page to request it be linked to JCenter (this was approved within a couple of hours for me).

We'll also need a GPG public/private key pair. Let's set that up now:

  1. Open up a terminal and run gpg --gen-key. Accept all the defaults about the algorithm to use, and enter a name, email and
    passphrase of your choosing.
  2. If you run gpg --list-public-keys you should see something like this:

    /Users/mbolingbroke/.gnupg/pubring.gpg
    --------------------------------------
    pub   2048R/3117F02B 2015-11-18
    uid                  Max Bolingbroke <batterseapower@hotmail.com>
    sub   2048R/15245385 2015-11-18

    Whatever is in place of 3117F02B is the name of your key. I'll call this $KEYNAME from now on.

  3. Run gpg --keyserver hkp://pool.sks-keyservers.net --send-keys $KEYNAME to publish your key.
  4. Run gpg -a --export-key $KEYNAME and gpg -a --export-secret-key $KEYNAME to get your public and secret keys as ASCII text. Edit your Bintray account and paste these into the "GPG Signing" part of the settings.
  5. Edit your personal Maven repository on Bintray and select the option to "GPG Sign uploaded files automatically". Don't use Bintray's public/private key pair.

Now that you have your Bintray and OSS accounts, we can move on to setting up Gradle.

Gradle setup

The key problem we're trying to solve with our Gradle build is producing a set of JARs that meet the Maven Central requirements. What this boils down to is ensuring that we provide:

  • The actual JAR file that people will run.
  • Source JARs containing the code that we built.
  • Javadoc JARs containing the compiled HTML help files.
  • GPG signatures for all of the above. (This is why we created a GPG key above.)
  • A POM file containing project metadata.

To satisfy these requirements we're going to use gradle-nexus-plugin. The resulting (unsigned, but otherwise Central-compliant) artifacts will then be uploaded to Bintray (and eventually Sonatype OSS + Central) using gradle-bintray-plugin. I also use one more plugin — Palantir's gradle-gitsemver — to avoid having to update the Gradle file whenever the version number changes. Our Gradle file begins by pulling all those plugins in:

buildscript {
    repositories {
        jcenter()
        maven { url "http://dl.bintray.com/palantir/releases" }
    }
    dependencies {
        classpath 'com.bmuschko:gradle-nexus-plugin:2.3.1'
        classpath 'com.jfrog.bintray.gradle:gradle-bintray-plugin:1.4'
        classpath 'com.palantir:gradle-gitsemver:0.7.0'
    }
}

apply plugin: 'java'
apply plugin: 'com.bmuschko.nexus'
apply plugin: 'com.jfrog.bintray'
apply plugin: 'gitsemver'

Now we have the usual Gradle configuration describing how to build the JAR. Note the use of the semverVersion() function (provided by the gradle-gitsemver plugin), which returns a version number derived from the most recent Git tag of the form vX.Y.Z. Despite the name of the plugin, there is no requirement to actually adhere to the principles of Semantic Versioning to use it: the only requirements for the version numbers are syntactic.

version semverVersion()
group 'uk.co.omega-prime'

def projectName = 'mdbi'
def projectDescription = 'Max\'s DataBase Interface: a simple but powerful JDBC wrapper inspired by JDBI'

sourceCompatibility = 1.8

jar {
    baseName = projectName
    manifest {
        attributes 'Implementation-Title': projectName,
                   'Implementation-Version': version
    }
}

repositories {
    mavenCentral()
}

dependencies {
    compile group: 'com.google.code.findbugs', name: 'jsr305', version: '3.0.1'
    testCompile group: 'org.xerial', name: 'sqlite-jdbc', version: '3.8.11.2'
    testCompile group: 'junit', name: 'junit', version: '4.12'
}

(Obviously your group, project name, description, dependencies etc will differ from this. Hopefully it's clear which parts of this example Gradle file you'll need to change for your project and which you can copy verbatim.)

Now we need to configure gradle-nexus-plugin to generate the POM. Just by the act of including the plugin we have already arranged for the appropriate JARs to be generated, but the plugin can't figure out the full contents of the POM by itself.

modifyPom {
    project {
        name projectName
        description projectDescription
        url 'http://batterseapower.github.io/mdbi/'

        scm {
            url 'https://github.com/batterseapower/mdbi'
            connection 'scm:https://batterseapower@github.com/batterseapower/mdbi.git'
            developerConnection 'scm:git://github.com/batterseapower/mdbi.git'
        }

        licenses {
            license {
                name 'The Apache Software License, Version 2.0'
                url 'http://www.apache.org/licenses/LICENSE-2.0.txt'
                distribution 'repo'
            }
        }

        developers {
            developer {
                id 'batterseapower'
                name 'Max Bolingbroke'
                email 'batterseapower@hotmail.com'
            }
        }
    }
}

nexus {
    sign = false
}

Note that I've explicitly turned off the automatic artifact signing capability of the Nexus plugin. Theoretically we should be able to keep this turned on, and sign everything locally before pushing to Bintray. This would mean that we wouldn't have to give Bintray our private key. In practice, if you sign things locally Bintray seems to mangle the signature filenames so they become unusable...

Finally, we need to configure the Bintray sync:

if (hasProperty('bintrayUsername') || System.getenv().containsKey('BINTRAY_USER')) {
    // Used by the bintray plugin
    bintray {
        user = System.getenv().getOrDefault('BINTRAY_USER', bintrayUsername)
        key  = System.getenv().getOrDefault('BINTRAY_KEY', bintrayApiKey)
        publish = true

        pkg {
            repo = 'maven'
            name = projectName
            licenses = ['Apache-2.0']
            vcsUrl = 'https://github.com/batterseapower/mdbi.git'

            version {
                name = project.version
                desc = projectDescription
                released = new Date()

                mavenCentralSync {
                    user     = System.getenv().getOrDefault('SONATYPE_USER', nexusUsername)
                    password = System.getenv().getOrDefault('SONATYPE_PASSWORD', nexusPassword)
                }
            }
        }

        configurations = ['archives']
    }
}

We do this conditionally because we still want people to be able to use the Gradle file even if they don't have a username and password set up. In order to make these credentials available to the script when run on your machine, you need to create a ~/.gradle/gradle.properties file with contents like this:

# These 3 are optional: they'll be needed if you ever use the nexus plugin with 'sign = true' (the default)
signing.keyId=<GPG $KEYNAME from earlier>
signing.password=<your GPG passphrase>
signing.secretKeyRingFile=<absolute path to your ~/.gnupg/secring.gpg file (or whatever you called it)>

nexusUsername=<username for Sonatype OSS>
nexusPassword=<password for Sonatype OSS>

bintrayUsername=<username for Bintray>
bintrayApiKey=<Bintray API key, found in the "API key" section of https://bintray.com/profile/edit>

You can see the complete, commented, Gradle file that I'm using in my project on Github.

Your first release

We should now be ready to go (assuming your Sonatype OSS and JCenter setup requests have been approved). Let's make a release! Go to the terminal and type:

git tag v1.0.0
gradle bintrayUpload

If everything works, you'll get a BUILD SUCCESSFUL message after a minute or so. Your new version should be visible on the Bintray package page (and JCenter) immediately, and will appear on Maven Central shortly afterwards.

If you want to go the whole hog and have your continuous integration (e.g. the excellent Travis) make these automatic deploys after every passing build, this guide for SBT looks useful. However, I didn't go this route so I can't say how it could be adapted for Gradle.

A nice benefit of publishing to Maven Central is that javadoc.io will host your docs for free totally automatically. Check it out!

Overall I found the process of publishing Java open source to Maven Central needlessly confusing, with many more moving parts than I was expecting. The periods of waiting for 3rd parties to approve my project were also a little frustrating, though in fairness the turnaround time was quite impressive given that they were doing the work for free. Hopefully this guide will help make the process a little less frustrating for other Gradle users in the future.

The Sad State of Symbol Aliases

This post continues my quest to condense and write down some of the folklore surrounding assemblers and linkers. In this case, I recently came across a situation where it would be useful to be able to generate an object file that contained an alias for a symbol defined elsewhere. For example, I want an object file to export a symbol foo that aliases bar, such that when any use site of foo is linked against the object file that use site then behaves exactly as if it had referenced bar instead.

This could be done straightforwardly (just export foo with the same value as bar) except for the wrinkle that in general bar is not defined in the object file exporting foo, so we don't know its value yet.

This article picks apart support for this feature on a platform-by-platform basis. Long story short: this is supported by the object file format on OS X and Windows, but you can't get to it from the assembly code level. Linux has no support at all.

OS X

Buried deep within the Mach-O specification is a mention of the symbol table entry type N_INDR. Quoth the standard: "The symbol is defined to be the same as another symbol. The n_value field is an index into the string table specifying the name of the other symbol. When that symbol is linked, both this and the other symbol have the same defined type and value".

This is great stuff, and exactly what we want! The fly in the ointment is that the latest version of Apple's assembler has no support for actually generating such indirections. The source tree does contain a tool called indr which is capable of generating these indirections in a limited capacity, but it is not distributed with OS X and anyway not general enough for our needs. Happily, Apple's linker does seem to include support for N_INDR, so everything should work OK if you managed to generate an object file making use of that type.

Windows

Interestingly, Windows DLLs support something called "forwarders" which give us the behaviour we want for dynamically exported symbols. You can create such DLLs with special syntax in your .def file EXPORTS section. This is not relevant to our problem though, because there is no equivalent at the object file level.

Page 44 of the PE/COFF specification talks about symbol tables. Reading carefully, we find a mention of "Weak Externals" on page 51:

“Weak externals” are a mechanism for object files that allows flexibility at link time. A module can contain an unresolved external symbol (sym1), but it can also include an auxiliary record that indicates that if sym1 is not present at link time, another external symbol (sym2) is used to resolve references instead. If a definition of sym1 is linked, then an external reference to the symbol is resolved normally. If a definition of sym1 is not linked, then all references to the weak external for sym1 refer to sym2 instead. The external symbol, sym2, must always be linked; typically, it is defined in the module that contains the weak reference to sym1.

This is not exactly what we had in mind, but it can be abused for the same effect. Nothing will go wrong unless someone else defines a symbol with the same name as our alias in another object file.

As far as I can see, the GNU assembler can't be persuaded to generate this. The assembler does have some support for generating weak externals, but it only uses it to implement the .weak directive (with ELF-style semantics) on Windows. And as we shall shortly see, ELF semantics are not what we want at all...

Linux

Turning to page 1-16 of the ELF specification we find the definition of the ELF symbol table. As far as I can tell, there is no support whatsoever for this use case. Bah.

We might be tempted to search for some equivalent to the weak externals feature on Windows. Unfortunately, ELF weak symbols have a rather different semantics:

  1. An undefined weak symbol will not cause the linker to error out if a definition is not found. Instead, the symbol will be filled in with a default value of 0.
  2. A defined weak symbol has a lower link precedence than a strong symbol of the same name, and will not cause the linker to generate an error about duplicate symbol definitions in the case of such a conflict.

The difference between this and the Windows situation is that Windows basically lets us change the default value filled in by the linker in the case of no definition being found to an arbitrary symbol.

GCC

GCC supports an alias attribute that does exactly what I want. Unfortunately, despite a few people trying to do exactly this, the GCC developers have elected to reject the construct:

This is because it's meaningless to define an alias to an undefined symbol. On Solaris, the native assembler would have caught this error, but GNU as does not.

This comment refers to the fact that assembly like this:

.globl reexport
.globl export
.equiv export, reexport

does not fail to compile with the GNU assembler, but generates an object file that does not define any symbols despite referencing the reexport symbol.

Conclusion

A sufficiently motivated hacker could support a (weak) aliasing feature along the lines described above in the GNU assembler on Windows and OS X without problems. However, there seems to be no way to support it on Linux within the bounds of the ELF specification.

Unusually, Linux is the platform that lags behind the others in linker features! I usually find that quite the opposite is true.

Security implications of PEP 383

I've been looking into improving GHC's support for non-ASCII text, and my investigations have led me to PEP 383.

One motivation behind this PEP is as follows: on Unix, the names of files, command line arguments, and environment variables should probably be treated as sequences of bytes. However, for good reasons it is quite natural for programs to act on them as if they were strings. This means that we have to choose some text encoding to use to interpret those byte sequences.

Unfortunately, whatever encoding you choose to use, it is quite likely that some byte sequences you encounter in practice will not in fact decode nicely using that encoding. An example would be a Big5 filename supplied as a command line argument to a program run in the UTF-8 locale.

In this case, what should happen? One sensible thing to do would be to fail, but this might be surprising. Python 3, with PEP 383, chooses to encode the non-decodable bytes as part of the string using surrogates. So if we try to parse a Big5 filename as a string we get a string full of surrogates representing the raw bytes we had to begin with.

This is a good thing to do because if that string is then immediately fed back into a function that just decodes the filename for use on the file system, the original byte sequence can be exactly reconstituted by decoding the surrogates back into bytes and using the locale encoding for the rest. If the user attempts to do something else with a string containing surrogates (such as e.g. display it to the terminal), then an exception will be raised.
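
You can see this round-tripping behaviour directly in Python 3. The following snippet (my own illustration) decodes the Big5 bytes for 你好 using UTF-8 with the surrogateescape error handler, which is what PEP 383 specifies for OS-facing interfaces like sys.argv:

big5_bytes = '你好'.encode('big5')                          # the raw bytes the OS would hand us
as_utf8    = big5_bytes.decode('utf-8', 'surrogateescape')  # what Python 3 stores in the string
properly   = big5_bytes.decode('big5')                      # what a Big5-aware decode would give

print(as_utf8 == properly)                                  # False: comparisons see different strings
print(as_utf8.encode('utf-8', 'surrogateescape') == big5_bytes)  # True: the original bytes survive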

This is a reasonably neat solution to a hard problem. However, it has weird implications. For example, consider this script that uses a black list to control access to some files:

#!/usr/bin/env python3

import sys

file = sys.argv[1]

blacklist = open("blacklist.big5", encoding='big5').read().split()
print("Blacklist is:\n" + repr(blacklist))

if file in blacklist:
    print("Blacklisted file, not allowed!")
else:
    print("OK, I'll let you in!")
    print(open(file).read())

Let's say that the blacklist contains a single entry, for the file 你好 (encoded in Big5, naturally).

Seems simple enough, right? Although I store file names as Big5, I compare Python's Unicode strings. And indeed this program works perfectly when run from a terminal in the Big5 locale, with Big5 file names.

However, consider what happens when the terminal is set to UTF-8 and we invoke the script with the command line argument 你好 (encoded in Big5 of course, because the file name on disk is still Big5 even though we changed the terminal locale). In this case, Python 3 will attempt to decode the file name as UTF-8. Naturally, it will fail, so the Big5 filename will be represented in memory with surrogates.

Now for the punchline: when we come to compare that string (containing surrogates) with the entry from the blacklist (without surrogates) they will not be equal. Yet, when we go on to open the file, the filename (with surrogates) is decoded perfectly back into valid Big5 and hence we get the contents of the blacklisted file.

In my opinion, the fact that the current encoding affects the results of string comparisons is deeply weird behaviour and could probably be the cause of subtle security bugs. This is just one reason that I'm wary about adopting PEP 383-like behaviour for GHC.

P.S. For those who believe that my code is broken because you should only compare normalised unicode strings, I will add that even after using unicodedata.normalize to normalise to NFC I get the same problem.

P.P.S I will further note that you get the same issue even if the blacklist and filename had been in UTF-8, but this time it gets broken from a terminal in the Big5 locale. I didn't show it this way around because I understand that Python 3 may only have just recently started using the locale to decode argv, rather than being hardcoded to UTF-8.

The case of the mysterious "thread blocked indefinitely in an MVar operation" exception

I recently tracked down the cause of the persistent "thread blocked indefinitely in an MVar operation" exceptions I was getting from the GHC runtime when I used my parallel-io library. There doesn't seem to be much information about this exception on the web, so I'm documenting my experiences here to help those unfortunate souls that come across it.

The essence of the parallel-io library is a combinator like this:

parallel :: [IO a] -> IO [a]

The IO actions in the input list are run (possibly in parallel) and the results returned as a list. The library ensures that we never go beyond a certain (user specified) degree of parallelism, but that's by-the-by for this post.

There are a number of interesting design questions about the design of such a combinator, but the one that tripped me up is the decision about what to do with exceptions raised by the IO actions. What I used to do was have parallel run all the IO actions wrapped in a try, and wait for all of the actions to either throw an exception or complete normally. The combinator would then either return a list of the results, or, if at least one action threw an exception, the exception would be rethrown on the thread that had invoked parallel.

At first glance, this seems very reasonable. However, this turns out to be a disastrous choice that all-but-guarantees your program will go down in flames with "thread blocked indefinitely in an MVar operation".

Consider this simple program:

import Control.Concurrent
import Control.Exception
import Control.Monad (liftM2)

-- The main action executes on thread 1:
main = do
    res1 <- newEmptyMVar
    res2 <- newEmptyMVar
    
    wait <- newEmptyMVar
    
    -- Spawn thread 2:
    forkIO $ saveExceptionsTo res1 $ do
        () <- takeMVar wait
        putMVar res1 (Right "Hello")
    
    -- Spawn thread 3:
    forkIO $ saveExceptionsTo res2 $ do
        throwIO $ ErrorCall "Oops!"
        
        putMVar wait () -- Unblocks thread 2
        putMVar res2 (Right "World")
    
    -- Wait for the results of both threads:
    liftM2 (,) (takeMVar res1) (takeMVar res2) >>= print

saveExceptionsTo :: MVar (Either SomeException a) -> IO () -> IO ()
saveExceptionsTo mvar act = act `catch` \e -> putMVar mvar (Left e)

This code is similar to what the parallel combinator does in that the main thread (thread 1) waits for thread 2 and thread 3 to both return a result or exception to res1 and res2 respectively. After it has both, it shows the exception or result.

The fly in the ointment is that thread 2 is waiting on thread 3, so the thread dependency graph looks something like this:

(You should read an arrow that points from thread A to thread B as meaning that thread B is waiting on thread A to unblock it.)

Unfortunately, our programmer has written buggy code, and the code running in thread 3 throws an exception before it can wake up thread 2. Thread 3 writes to the res2 MVar just fine, but thread 1 is still blocked on the res1 MVar, which will never be written to. At this point, everything comes crashing down with "thread blocked indefinitely in an MVar operation".

The most annoying part is that the exception that originally caused the error is squirrelled away in an MVar somewhere, and we never get to see it!

This goes to show that the proposed semantics for exceptions above is dangerous. In the presence of dependencies between the actions in the parallel call, it tends to hide exceptions that we really want to see.

The latest version of the parallel-io library solves this by the simple method of using the asynchronous exceptions feature of GHC to rethrow any exceptions coming from an action supplied to parallel as soon as they occur. This unblocks thread 1 and lets us deal with the exception in whatever way is appropriate for our situation.

Although this bug is simple to describe and obvious in retrospect, it took a hell of a time to diagnose. It never ceases to amaze me exactly how difficult it is to write reliable concurrent programs!

Interpreting GHC cost centre stacks

A little-known feature of GHC is the ability to run a program (that has been compiled with profiling) with the +RTS -xc option. Now, every time an exception is raised a stack trace will be dumped to stderr. This is currently the only way to get stack traces for your Haskell code. Unfortunately:

  • This stack trace is constructed from the cost-centre stack, so you'll only see names from modules you compiled in profiling mode with -auto-all or similar
  • The stack trace is shown regardless of whether the exception in question is actually handled or not - so you are likely to see lots of spurious stack traces where exceptions have been caught and handled

It's still better than the alternative (i.e. wailing about the lack of proper Haskell stack traces) though.

Now, the question is how to interpret the mess you get out. Here is an example stack trace from my current debugging session (I've added linebreaks to aid comprehension):


<Control.Concurrent.ParallelIO.Local.parallelE,
Control.Concurrent.ParallelIO.Local.parallel,
Control.Concurrent.ParallelIO.Local.parallel_,
Control.Concurrent.ParallelIO.Local.withPool,
Development.Shake.shake,Development.Shake.CAF>

(I'm trying to figure out why OpenShake reliably dies with the dreaded "blocked indefinitely on MVar" exception.)

The answer lies in the GHC source code (see fprintCCS in Profiling.c). The items at the front of the stack trace are at the top of the stack and hence correspond approximately to the location from which the exception was thrown.

Looking forward, it would be awesome if e.g. the Google Summer of Code could sponsor a student to take my work on GHC ticket 3693 further. Personally I would find stack trace support for Haskell totally invaluable, and I'm sure I'm not alone.