All posts by msoos

CryptoMiniSat in SAT Competition’11

I have submitted three versions of CryptoMiniSat to the 2011 SAT Competition, all available here. The releases are commonly called “Strange Night”, after the music track of the same name. Inside the archive you will find 5 binaries and a PDF description. The five binaries are made up of two single-threaded, two multi-threaded, and a unified single- and multi-threaded binary. This latter is called techdemo, as it is more of a technological demonstrator than anything else, and was developed in collaboration with George Katsirelos and Laurent Simon. All versions are collaborative, however, as they all have the hands of some people from the CryptoMiniSat development mailing list on them.

To be honest, I am not overly optimistic about these binaries, especially since I couldn’t get the 3.0 version of CryptoMiniSat ready for the competition, which I have been working on for the past two months. I have tried to backport some features to the 2.9.0 branch to make these releases, and so the resulting executables are pretty unstable. The techdemo version is a broken 3.0 release, as it is missing the signature feature of fully shared clauses between threads. Since I have finally obtained a 4-core CPU (with hyper-threading for 8 “independent” threads), I am all the more optimistic about a fully-shared scheme. The upcoming 3.0 release will be specifically made to work in multi-threaded environments, and will suffer if launched in single-threaded mode. This architectural decision was made because multi-threading nowadays no longer seems optional: it’s a must. On the CryptoMiniSat mailing list, some people are reporting performance on 64-core machines; optimising for a single-threaded environment (and treating multi-threading as an afterthought) in these times seems like a broken approach.

For those interested in the actual source code of what has been submitted, everything is available from GIT, as usual:

  1. Strange Night1 – single-threaded
  2. Strange Night1 – multi-threaded
  3. Strange Night2 – single-threaded
  4. Strange Night2 – multi-threaded
  5. Tech Demo

Since these are pretty unstable, I wouldn’t use them in a production environment… happy bug-hunting ;)

Visiting Linz

Lately I had the pleasure of visiting Linz, Armin Biere’s workplace, where I gave a quick talk on SAT solver architectures. To me, it was really interesting to think through that presentation, not because it was entirely new or exciting, but because it recapped many issues that have been bothering me lately. Namely, that it’s difficult to make a really versatile SAT solver, because the low-level choices that must be made (e.g. the watchlist scheme) determine so many things when one makes higher-level architectural decisions such as clause sharing, or even something as simple as hyper-binary resolution. As for the latter, thanks to Armin Biere’s thoughts I have finally managed to realise why my hyper-binary resolution was so slow: I recently decided not to fully propagate binary clauses before propagating normal (i.e. larger) clauses, which meant that doing hyper-binary resolution was much slower, as I had to re-calculate the binary implication graph. Not fully propagating binary clauses before normal clauses also seems to have influenced my much higher-level choice of using cached implications, as they (intuitively, and also in practice) help much more if binary clauses are not fully propagated first. This influence is interesting to think through, as something this trivial shouldn’t, at least in principle, affect such a high-level decision.
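To make the idea concrete, here is a minimal sketch (not CryptoMiniSat’s actual code; all names and the integer literal encoding are made up for illustration) of what “fully propagating binary clauses first” means: binary implications are followed to a fixpoint with a cheap BFS before any longer clause would be examined, so the binary implication graph is explored exactly once and need not be re-derived for hyper-binary resolution.

```cpp
#include <cassert>
#include <cstdint>
#include <queue>
#include <vector>

// Toy sketch: propagate binary implications to a fixpoint before
// longer clauses are touched. A literal is an int: +v means "var v
// is true", -v means "var v is false" (0 is unused).
struct BinFirstProp {
    std::vector<std::vector<int>> binImplPos, binImplNeg; // indexed by var
    std::vector<int8_t> value; // 0 = unset, +1 = true, -1 = false

    explicit BinFirstProp(int nVars)
        : binImplPos(nVars + 1), binImplNeg(nVars + 1), value(nVars + 1, 0) {}

    std::vector<int>& implList(int lit) {
        return lit > 0 ? binImplPos[lit] : binImplNeg[-lit];
    }

    void addBinClause(int a, int b) { // clause (a OR b): -a -> b, -b -> a
        implList(-a).push_back(b);
        implList(-b).push_back(a);
    }

    // Set 'lit' and follow binary implications exhaustively (BFS).
    // Returns false on conflict. Longer clauses would only be looked
    // at after this queue has fully drained.
    bool propagateBinaries(int lit, std::vector<int>& trail) {
        std::queue<int> q;
        q.push(lit);
        while (!q.empty()) {
            int l = q.front(); q.pop();
            int v = l > 0 ? l : -l;
            int8_t sign = l > 0 ? 1 : -1;
            if (value[v] == sign) continue;      // already set this way
            if (value[v] == -sign) return false; // conflict
            value[v] = sign;
            trail.push_back(l);
            for (int implied : implList(l)) q.push(implied);
        }
        return true;
    }
};
```

With the binary layer fully explored like this, any literal reached here is known to follow from `lit` through binary clauses alone, which is exactly the information hyper-binary resolution needs.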

Also thanks to Armin Biere, I have managed to grasp a better understanding of lingeling and its superior watchlist scheme. Some of the architectural elements of lingeling’s watchlist scheme are really interesting, and when they get published I will definitely port some of them to CryptoMiniSat. It seems to use much less space, and stores information in a more cache-friendly manner, aiding the processor in its job. A related, interesting pointer that I have learnt is this paper that originally introduced blocking literals, but also talks about a range of other ideas that can help. All in all, it was great to visit Linz and the group of Armin Biere, as I have managed to learn a lot from him and his colleagues.

Delayed updates

In this post I will try to explain how locality of reference can be exploited by grouping operations according to the memory locations they access. In the SAT solver I write, there is a loop that does something called propagation. The goal of propagation is to set a number of variables to given values through a pretty cache-intensive process. While doing this, as the variables are set, some statistics are calculated. For example, I used to have something like this, called for every variable set:

void enqueueNewFact(Lit lit, PropagatedBy someData) {
    // record the assignment itself
    trail.push_back(lit);
    value[lit.var()] = lit.sign();

    // statistics updates
    agility *= conf.agilityG;
    if (varPolarity[lit.var()] != lit.sign()) agility += 0.0001;
    varPolarity[lit.var()] = lit.sign();
    varSet[lit.var()]++;
    level[lit.var()] = getLevel();
}

The problem with this function is that it accesses a lot of data: trail[], value[], level[], varPolarity[] and varSet[]. All of these must be in cache to be updated, and if they aren’t, they must be pulled in from lower levels of cache or from main memory. Since enqueueNewFact() is called often, these are mostly in cache all the time, taking space away from what really needs it: the data used during the propagate() routine.

What I have just recently realised is that I don’t use the values of level[], varPolarity[], varSet[] or agility at all during the propagate() routine. Instead, they are used when the recursive calling of propagate() has finished and a certain technical point (called a conflict) is reached. If this is indeed the case, why are we updating them all the time, immediately when enqueueNewFact() is called? If you look at the code, for various reasons we must save a trail (called, conveniently, trail[]) of what we have done. This trail contains all the information needed by the statistics-update functions: the var() and sign() stored in each Lit are all that’s necessary.

So, I have made a new function, delayedEnqueueUpdate(), that calculates all the statistics only when they are needed: when a conflict is reached. It has a simple piece of state, lastDelayedEnqueueUpdate, that saves the trail position at which it was called last time. The default is trail position 0, and whenever we reach a conflict, we update it once the stats have been calculated:

void delayedEnqueueUpdate() {
    int pseudoLevel = lastDelayedEnqueueUpdateLevel;
    for (size_t i = lastDelayedEnqueueUpdate; i < trail.size(); i++) {
        Lit lit = trail[i];

        // the statistics formerly updated in enqueueNewFact()
        agility *= conf.agilityG;
        if (varPolarity[lit.var()] != lit.sign()) agility += 0.0001;
        varPolarity[lit.var()] = lit.sign();
        varSet[lit.var()]++;

        // recover the decision level this literal was enqueued at
        if (pseudoLevel < (int)trail_lim.size() && trail_lim[pseudoLevel] == (int)i)
            pseudoLevel++;
        level[lit.var()] = pseudoLevel;
    }
    lastDelayedEnqueueUpdate = trail.size();
    lastDelayedEnqueueUpdateLevel = decisionLevel();
}

This code, apart from pseudoLevel and lastDelayedEnqueueUpdateLevel (both of which are SAT solver technicalities), is rather easy to understand. Basically, we take the position we stopped at last time, lastDelayedEnqueueUpdate, and work our way from there to the end of the trail, trail.size(). Once we have worked our way through, we set lastDelayedEnqueueUpdate to the current end point, trail.size().

All this seems useless if we don’t take cache usage into consideration: we are not saving any CPU instructions. If anything, we add instructions to execute, as we now need an extra for loop (propagate() is essentially a recursive for loop). From a cache usage point of view, however, we are saving a lot. When doing propagation, the propagate() routine can have all the cache it needs. Then, when that cache-intensive process has finished, we pull in the set of arrays varPolarity[], varSet[] and level[], update them all in one go, and then continue operation.

We have managed to group operations according to the memory they use and update. This makes the program more cache-friendly, which almost always leads to a measurable speed increase.

On the New Watchlist Scheme

All right, so I am not so cool because I work on weekends, but I got this new watchlist scheme implemented. First off, let me say it was relatively easy to implement, which is a testament to the extensibility of CryptoMiniSat: it only took me ~4 hours to get it done. Secondly, I realised that I had forgotten about XOR clauses when I thought through the new scheme. I used to store the current value of the XOR clause at the XOR clause itself (using the signs of the literals, being the hacker I am), and update it at every call to propXorClause(). Since we need to keep XOR clauses static just like normal clauses, this is no longer possible, so a separate data structure needs to be built up for this. However, it won’t need too many updates, so all is fine, but it does make things a tad more complicated.

When first implemented, the new scheme performed much worse than the normal one: we lost ~20% in terms of time. I never really give up on my ideas, however, so I started thinking about what the advantages of this new scheme are, and how we could make them work for us. First of all, since the watchlists are now more expensive to go through, we might as well spend more time at the clause, looking for a watch that is not likely to trigger again. In other words, when we are at the clause looking for a new watch, don’t just look for a literal that is either Undefined or True: strongly prefer to find one that is True. The idea here is that if we set the watch to be a literal that is True, during backtrack we might still have that literal be True (as we might not have undone that assignment), and so we won’t have to bother propagating it next time. This leads to more time going through the clause, but once the clause is already in the cache, going through it takes almost no time at all. This gained about ~5% of time: good, but not enough.

For the next idea, we have to notice that we are now going through each clause from the beginning, and the literals there don’t change, in contrast to the old scheme where we constantly move literals to the beginning of the clause. In almost every modern solver, we have something called polarity caching: we memorise the last polarity (True or False) of every variable, and when branching on it next time, we take the cached polarity. In CryptoMiniSat there is a slight salt added, so with low probability we take the opposite branch, but still: the cached polarity is a pretty good indicator of whether a literal will be True or False. If this is indeed the case, and we preferably search for True literals at the beginning of the clause… why not move the literals that have a True cached polarity to the beginning of the clause? So, I added a trivial (~20 line) algorithm that is executed very infrequently (~once per 100,000 conflicts) to do this. An interesting property of the new watchlist scheme is that this can be done without detaching and re-attaching the clauses. Once implemented, the time to solve decreased by ~13%. Overall, then, the speed of the new watchlist scheme is close to the original.
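The re-sorting pass could look roughly like this (a hypothetical condensed version of the idea, not the solver’s actual code): `std::stable_partition` does the moving, and the two watched positions stay put, which is why no detach/re-attach is needed.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Simplified encoding: literal +v / -v. polarity[v] is the cached
// polarity of variable v: true means the solver last assigned v = true.
// Move literals whose cached polarity would make them True to the
// front of the non-watched part of the clause; positions 0 and 1
// (the watches) are left untouched.
void sortByCachedPolarity(std::vector<int>& clause,
                          const std::vector<bool>& polarity) {
    if (clause.size() <= 3) return; // nothing worth reordering
    std::stable_partition(
        clause.begin() + 2, clause.end(),
        [&](int lit) {
            int v = lit > 0 ? lit : -lit;
            // is the literal satisfied by the cached polarity?
            return (lit > 0) == polarity[v];
        });
}
```

Run infrequently (the post says ~once per 100,000 conflicts), the cost of this pass is negligible, while the prefer-True watch search above finds its candidate near the front of the clause most of the time.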

The advantages of this new scheme are manifold. Since we are only content when finding a True watched literal in the clause, we decrease the overall number of memory requests per thread, which should help the totality of threads make fewer memory requests, leading to less strain on the memory architecture. Not to mention that we hardly ever touch the clauses themselves, so write-back to memory is dramatically decreased, probably by 80% or more. And finally, we can share the clauses between threads, leading to fewer overall memory fetches.

The only drawback of this new scheme seems to be that we cannot optimise for every thread: cached polarities are stored on a per-thread basis. However, one would expect that most of the time the cached polarities of the different threads are close. Therefore, this should work relatively well. Also, we could do voting for polarities among threads, and then sort literals in the clauses according to the votes cast — we would then optimise for the whole set of threads instead of only the master thread.

While doing the above, I have also played around with OProfile, a Linux tool that can report (among other things) an approximation of the number of L2 cache misses incurred by a program. This helped me put the right prefetch builtins in the right places, and the number of cache misses is now half of what it was without prefetch instructions (for both watchlist schemes). CryptoMiniSat 2.9.0 had some prefetch instructions added at the very last moment, but they weren’t placed optimally. The new way they are organised is much better, leaving more time for the CPU to prefetch the cache lines (which can take many cycles).

PS: The current architecture doesn’t allow CryptoMiniSat to share clauses, but even this way multi-threading is already faster by ~5% due to less memory write-back and less memory contention, thanks to the optimised watched literal picking.

CryptoMiniSat 2.9.0 released

After a long break, CryptoMiniSat 2.9.0 has finally been released. The Windows executables need this to work, and the Linux binaries need a recent (>=2.6.26) kernel version. The program has evolved substantially since the last release — more than half of its codebase has been changed. It now works in multi-threaded mode, and uses lazy cached implications for all sorts of interesting purposes from simple transitive on-the-fly self-subsuming resolution to very efficient literal dependency analysis. Currently, the program solves ~225 problems from the 2009 SAT Competition examples given the same time and similar computing power. This is a nice improvement over other SAT solvers such as MiniSat (205), PrecoSat (210) or lingeling (207).

Intensive testing has been carried out on this version of the program, to the extent of finding and identifying a previously unknown gcc bug affecting all gcc versions 4.5.0 and later. Many have played a part in finding this and other bugs, including, but not limited to, Martin Maurer, Vegard Nossum, Oliver Kullmann, Robert Aston, and others. This release has been made possible thanks to them.

I hope this version of the code will be useful to many, not only end users but also researchers. In fact, I decided to release this version (relatively) early to let researchers merge their changes for the upcoming SAT Competition 2011. I am very interested in any and all versions of CryptoMiniSat that will be submitted to the competition. In case you personally want to change it and find something difficult to understand, just drop a mail to the nowadays very active development mailing list. With this release, I can confidently say that CryptoMiniSat is becoming not only a programming project but also a community. If you feel you could contribute, or you are simply interested in what is going on, join us.

As for the technical details, the new CryptoMiniSat uses OpenMP for multi-threading, which means it is fully platform-independent. During multi-threaded solving we share unit and binary clauses between threads; the latter are checked for redundancy using lazy cached implications. We use implicit binary clauses, which don’t help as much in terms of memory as I had hoped, but they help immensely for certain operations, in particular subsumption checks. Literal dependencies are now calculated using the lazy cached implications, and the dominating literal with the largest implication tree is selected as the dominant literal: when picking variables, we pick the dominating literal 50% of the time instead of the original literal. Variable elimination heuristics have also been highly tuned, thanks to the much better statistics at hand due to implicit binary clauses.
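A minimal sketch of the 50% dominator picking (all names and the literal encoding, literal = 2*var + sign bit, are hypothetical; the xorshift RNG stands in for the solver’s own randomness, and the real dominator computation from the implication trees is omitted):

```cpp
#include <cassert>
#include <vector>

// Branching helper: when the heuristic proposes literal 'lit', return
// its dominator (the literal with the largest binary-implication tree
// among those implying 'lit') half of the time, and 'lit' itself the
// other half.
struct DomPicker {
    std::vector<int> dominator; // indexed by encoded literal; dominator
                                // of l, or l itself if none is known
    unsigned rng = 12345;

    bool flipCoin() { // xorshift32, a stand-in for the solver's RNG
        rng ^= rng << 13; rng ^= rng >> 17; rng ^= rng << 5;
        return rng & 1;
    }

    int pick(int lit) {
        return flipCoin() ? dominator[lit] : lit;
    }
};
```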

Unfortunately, this version doesn’t offer all the things I wanted it to offer: a multi-threaded library interface, extended resolution, distributed solving through MPI, a unified restart strategy, and others. All of these (except MPI), and more, are however available from the public GIT repository. They work relatively stably, but have terrible speed: the version with all of these only solves ~214 problems from the set above. Therefore, I will personally submit version 2.9.0 to the SAT Competition unless some grave bug is found in it. I might also submit a corrected version of the upcoming 3.0.0 release as well, but as it usually takes a month to tune the solver and I am currently very busy with some great projects at my new workplace, a 3.0.0 version probably won’t happen before the deadline.

I hope you will enjoy this new version of CryptoMiniSat! In case of questions, problems or bugs, please contact either me, or the development mailing list and join in to collaborate :)