Bitesize Functional Programming: Comprehensive Comprehensions

As my final year undergraduate project I’ve implemented Philip Wadler’s and Simon Peyton Jones’ Comprehensive Comprehensions in the Glasgow Haskell Compiler. I’m happy to report that my patch was accepted for inclusion in the compiler, so this is a feature you’ll really be able to use come the release of 6.10!

So, what are comprehensive comprehensions? Well, first let’s just review what list comprehensions are:

triples = [(a,b,c) | a <- [1..20]
                   , b <- [a..20]
                   , let sum = a^2 + b^2
                   , c <- [b..20]
                   , sum == c^2]

This Haskell code draws numbers from 1 to 20 into the variable a, and for each one of those elements draws the numbers from a to 20 into b. We then compute the sum of the squares of each of those numbers, and finally give up in all but the case where we can find some c in the range from b to 20 such that that sum is the same as the square of c. Assuming that we have passed this test, we output a tuple containing a, b and c. Clearly, this implements some code for finding small Pythagorean triples.

Pythagorean Triples

So, the result will be:

[(3,4,5), (5,12,13), (6,8,10), (8,15,17), (9,12,15), (12,16,20)]

We can build an analogy between what we can do with list comprehensions and what we can do with parts of SQL. Drawing an item from a list is like SELECTion, filtering out certain items is similar to a WHERE clause and so on. However, there are some prominent features of SQL that aren’t so easy to express with just these vanilla list comprehensions. How could we write the following query in Haskell?

SELECT Name, SUM(Salary)
FROM Employees
GROUP BY Dept
ORDER BY SUM(Salary) ASC
LIMIT 5

Well, the answer is that we can use our lovely new generalized list comprehensions and write this:

[(the dept, sum salary)
  | (name, dept, salary) <- employees
  , then group by dept
  , then sortWith by sum salary
  , then take 5]

Notice that the existing keyword “then” is being used to introduce the sorting and grouping functionality that we require: this happens to be a bit different from the syntax in the paper, so be careful you don’t get caught out by this.

What isn’t obvious from the presentation above is just how general these comprehensive comprehensions are! Unlike SQL you can sort or group at any point in the query without making use of subqueries. Unlike SQL you can use any aggregation function, not just one it has defined like COUNT or SUM. Most significantly, unlike SQL you can actually use alternative grouping and sorting functions! So, you could choose a grouping function that actually looked for runs in the thing you are grouping by, or a “sorting” function that just returns the first 5 elements of the list: if you look at the example above we used just this feature to implement the LIMIT clause of SQL.

These new pieces of syntax are pretty nice, and speaking from personal experience it’s quite hard to do heavy duty list manipulation without them once you’ve grown used to having them available. In particular, I found them to have a killer application in the code I used for benchmarking the syntax extension, for slicing and dicing lists of the runtime and heap usage of various program runs.

If you want the full story on how to use the syntax, you can check out the documentation. Go forth and conquer with your new, readable, list-munging powers! I’ll leave you with a picture of the guys you have it all to thank for, those giants of functional programming Simon Peyton Jones and Philip Wadler:

Simon Peyton Jones and Philip Wadler

Mike K says:

20 March 2008 00:10 UTC

Seems very powerful indeed. Could you elaborate a little of how this goes along with laziness? I'm especially intrigued how one can perform a lazy grouped query, where data "flowing in" is not ordered by groups.

Max says:

20 March 2008 08:33 UTC

Well, the first stage of the default grouping operation is to sort the incoming data by the group key. This means that at least that component of the list will be forced at that point, so some laziness is discarded.
Of course, this means that in particular you cannot group infinite comprehensions! I think this is reasonable however :-). It is not really clear what the semantics would be.
On the other hand, if you know the data flowing in IS sorted you can group by a "runs" function and it will work exactly as you expect. This will of course work with infinite lists.

Mark Carmichael says:

14 March 2008 12:39 UTC

Speaking as an admirer of Haskell from afar, this looks like a lovely addition.
A nit: in the version of the diagram (250px-pythagorean.png) I'm seeing now,
the 'b' label is on the shortest segment of the triangle. Perhaps it should
be swapped with 'a' for clarity?

14 March 2008 12:51 UTC

There are some comments on this post being made on Reddit. For posterity, you can find these here.