Armin Biere gave a lecture at the Pragmatics of SAT workshop (proceeedings here) in Vienna about all the things inside lingeling which won a lot of awards[PDF] this year. If you weren’t there, you missed an amazing presentation. In this blog post I’ll reflect on a particular part of the presentation dealing with a memory trickery that has been intriguing me for a long while but I did not implement. Before I begin let me say: the presentation was awesome, and it’s not by chance that lingeling won so many awards.
The idea
The idea used by lingeling I want to talk about is easy to explain (though not easy to invent, as is usually the case). If you look at typical CNF problems, the majority of the clauses will be binary, i.e. only contain 2 literals. These clauses used to be stored exactly the same way as normal clauses: in the heap we allocate 2 literals and we put a pointer into the watchlist to these literals.
An improvement over this idea are the so-called implicit clauses. For the binary clause “x V y” we put into the watchlist of x the literal y, and in the watchlist of y the literal x. There is no other place we store these binary clauses, hence the word “implicit”. For other clauses, we still put pointers into the watchlist and allocate space, as usual. The problem with this approach is that the pointer to a clause is 32b (we use an 32b offset on 64b machines) but for each clause we also store a so-called blocking literal in the watchlist, which is also 32b. That makes the entries in the watchlist 64b long for normal clauses, and 32b long for binary clauses.
The idea is to have differing sizes of elements in a watchlist. If e.g. the first bit of the element is a 1, the next 63b relate to a long clause, and if it’s a 0, then the next 31b relate to a binary clause. In case 80% of the clauses are binary, this saves 50% of the space in 80% of the cases. Not bad at all.
The advantages
The advantages of using this idea are twofold. First, as already mentioned, memory use is lowered. This is non-trivial as memory usage of the watchlists can be enormous and although many other improvements can be switched off (such as e.g. stamping), storage of the clauses can never be switched off. Secondly, not having holes in memory leads to much better cache usage which in turn can bring real speedup. In case you think this is not important, you might enjoy knowing that the HHVM module of Facebook was made over 2x faster by making sure that important cache lines are not knocked out[PDF].
The disadvantages
In case you have an array that has varying size elements in it, some non-trivial complications arise. Let me list just a few that come to my mind.
Sorting the list is no longer trivial You cannot just swap elements: they might not fit. One way to do sorting efficiently is to move all the data into another, equally-spaced container and sort there, then move it back in. However, keep in mind that the reason why quicksort is so fast is that it can do in-place sort. Merge-sorting would be another option, but it copies elements and it’s not by chance it’s not the default sort in most cases. Also, you would have to re-write merge-sort of course, to deal with the varying-sized elements.
In case you think that sorting is not needed, maybe you forgot to consider the lightning-quick subsumption you can do between implicit binary&tertiary clauses using sort to give just one example.
Removing an element is no longer an O(1) operation In case you need to remove element X in a watchlist, you can simply swap the last element to the position of X and make the array one smaller. This trick is used quite extensively since the order of the watchlist is usually irrelevant.
Loops need to be re-written All your loops that go through the watchlist need to be re-written and have to have a switch() in them with some pointer arithmetic (do we advance by 32b or 64b?). In case you think you don’t need to go through the watchlist too often anyway, think again. Any time you need to do anything with clauses you will have to go through the watchlists, since they are the only place where binary (and tertiary, if your system is optimised enough) clauses are stored. This means your watchlist-gothrough function will be absolutely everywhere in the code unless you don’t want to implement any pre- or in-processing.
You might think you could write a function and just pass a pointer to another function to it that does the ‘real’ job, essentially hiding the complexity in a function that you only need to write once. There are three problems with this. First, this will be a tight loop and so performance is important, which you will loose as you will need to dereference the passed-in function pointer every time. This can be overcome with the use of templates but it won’t make the code pretty. Secondly, your original, hiding function will need to be written more than a single time. For example, some such executions will need to count time (operations) and some won’t. You will need to count pointer dereferences (normal clause is fetched) and binary clauses (no pointer dereferenced) in a significantly different way as a cache-miss is very expensive and a clause-access will cause such a cache-miss most of the time. For performance reasons you will need variations that don’t dereference long clauses, variations that allow for manipulation of the array, variations that don’t, etc.
Maintaining datastructure consistency becomes harder Unless you use hiding functions, which is non-trivial as explained above (and maybe impossible in e.g. plain C), the complexity to maintain consistency of the datastructure will be all over the code. Even if done very carefully, the constraints on the datastructure may end up being implicitly, rather than explicitly, represented in the code. It will make it easier to create bugs and harder to find them.
Conclusions
This idea of using less memory for binary clauses in the watchlists is very interesting and has intrigued me for a long while — Armin was kind enough to tell me about this a long time ago. It has the potential to save a lot of memory and to keep things more packed in the datastructure that is arguably the most accessed during solving and inprocessing. However, I was always daunted by the obstacles I saw in front of me — though I might simply need to understand C++11 and templates better to make it work.
Currently, I feel like there are plenty of other optimisations that I could implement from the talk of Armin, e.g. that all watchlists are stored in the same array, using offsets and a hand-rolled memory manager. That seems to have a potential of also improving the memory usage and speed while being easier to implement and easier to hide in a class.