All posts by msoos

Changing the Watchlist Scheme

Okay, what if we changed the watchlist scheme… a bit? The way the watchlist scheme operates in SAT solvers is the following. We have a set of literals lit1, lit2, lit3, lit4 that defines the clause we want to be watched for propagation and conflict. We take two literals from this set, and put them in a data structure. It’s not important which ones we take, but we have to know which ones we took. So, we take the 1st and the 2nd (because we are lazy), and we put them in a set of lists called the watchlists. We now have a pointer to the clause in watchlist[~lit1] and in watchlist[~lit2], where ~ denotes negation. If we now set lit1 = false, we have to have a look at the clauses in watchlist[~lit1] and see whether any of them propagates or conflicts. If a clause does neither, we remove it from watchlist[~lit1], find another of its literals that is not false, e.g. lit3, and put the clause into watchlist[~lit3]. It is important that lit3 must not be lit1 or lit2. Obviously, we know what lit1 is (we are visiting its watchlist, after all), but how could we know what lit2 is? Well, we always update the clause such that lit1 and lit2 are the first two literals. We do this for every clause in every watchlist we visit, until a conflict is found or there is nothing left to propagate.
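
To make this concrete, here is a minimal sketch of the propagation loop just described. All the names (Lit, Clause, watchlist, value(), enqueue()) are made-up placeholders, not CryptoMiniSat’s actual internals:

#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical minimal types; a real solver packs these much tighter.
struct Lit { int x; };                                // var*2 + sign encoding
inline Lit operator~(Lit l) { return Lit{l.x ^ 1}; }  // negation flips a bit
inline bool operator==(Lit a, Lit b) { return a.x == b.x; }
enum class Val { True, False, Unset };

struct Clause { std::vector<Lit> lits; };             // lits[0], lits[1] watched

extern Val value(Lit l);                              // current assignment
extern void enqueue(Lit l, Clause* reason);           // record a propagation
extern std::vector<std::vector<Clause*>> watchlist;   // indexed by literal

// Called when literal p has just become true (i.e. ~p became false).
// Returns the conflicting clause, or nullptr if there was no conflict.
Clause* propagateWatches(Lit p) {
    std::vector<Clause*>& ws = watchlist[p.x];
    for (std::size_t i = 0; i < ws.size(); ) {
        Clause& c = *ws[i];
        // Keep the invariant: the falsified watch (~p) sits at position 1.
        if (c.lits[0] == ~p) std::swap(c.lits[0], c.lits[1]);
        // If the other watch is already true, the clause is satisfied.
        if (value(c.lits[0]) == Val::True) { ++i; continue; }
        // Try to find a replacement watch among the remaining literals.
        bool relocated = false;
        for (std::size_t k = 2; k < c.lits.size(); ++k) {
            if (value(c.lits[k]) != Val::False) {
                std::swap(c.lits[1], c.lits[k]);          // the clause update!
                watchlist[(~c.lits[1]).x].push_back(&c);  // move the watch
                ws[i] = ws.back(); ws.pop_back();         // drop from old list
                relocated = true;
                break;
            }
        }
        if (relocated) continue;
        // No replacement found: the clause is unit or conflicting.
        if (value(c.lits[0]) == Val::Unset) enqueue(c.lits[0], &c);
        else return &c;                                   // all-false: conflict
        ++i;
    }
    return nullptr;
}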

So, why is this inefficient? Well, what needs to be noticed is that we update the clause non-stop in order to keep track of lit1 and lit2. We can implement this by changing the order of the literals (that’s how it’s done in most cases), or by updating some data in the clause. Either way, this needs an update of the clause, and consequently of the whole cacheline on which the clause (and probably other clauses, as these are memory-packed) resides. Let’s say the average clause length is 10, but the maximum clause length is 2^16. Since a literal is 32 bits, an average clause, with all the extra info it needs to keep, is around 10*32 + 32 + 32 = 384 bits long. Cache line size for Core i7 processors is 64 Bytes (= 512 bits), though the new architectures I believe are bumping this to 128 Bytes. We are therefore asking the CPU to grab 64 Bytes from memory, store it in the CPU cache, and then put the changed value back into main memory, as the cache is woefully small to keep all the clauses we will visit. If you take a CPU with 6 cores and try to run this algorithm, you are probably looking at a memory bandwidth nightmare. The clauses cannot be shared, because updating the lit1 and lit2 positions is not an option: these would be different for each thread. Furthermore, the CPU is not only taking data from main memory, but also putting data back into main memory at an astonishing rate.
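
As a back-of-the-envelope illustration of the arithmetic above (the field names are made up for the example):

#include <cstdint>

// Illustrative layout matching the numbers above: 10 literals of 32 bits
// each, plus 64 bits of extra info, is 10*32 + 32 + 32 = 384 bits = 48 bytes.
struct PackedClause {
    uint32_t size;        // number of literals (up to 2^16), plus flags
    uint32_t activity;    // extra bookkeeping kept with the clause
    uint32_t lits[10];    // average case: 10 literals; lits[0], lits[1] watched
};
static_assert(sizeof(PackedClause) == 48, "matches the 384-bit figure above");

// Swapping lits[0]/lits[1] dirties the whole cacheline holding the clause,
// so the line must eventually be written back to main memory.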

Let’s take a Core i7 processor off the shelf, and see what its architecture looks like. We have 1MByte of L2 cache for each core, and 8MByte of L3 cache shared by all the CPU cores. This is like a finger pointing at us: keep a big piece of memory that is constant and can be shared, and a much smaller piece of thread-local memory. Also: please lighten the load on the memory access sub-architecture, as that isn’t going to keep up with all 6 cores churning through the clauses. Well, an interesting solution is to split each clause into two pieces: one that stores which literals are lit1 and lit2, and one that stores the literals and the extra information. The first piece, let’s call it threadContext, stores the positions of lit1 and lit2, each of which is 16 bits (enough for the maximal clause length), so 4 bytes in total. The clause itself will then not need any update at all. An average problem instance is typically beaten down to a size of <300’000 clauses by CryptoMiniSat in <500 seconds. Optimally, we would like to access these clauses from main memory or L3 cache (this latter is a dream, really), and keep their threadContext in the L2 cache of the core. The latter takes 300’000*4 = 1.2MByte for a typical problem. There are also some other things to keep in L2, such as the current variable assignment and the current trail of assignments, but each of these should be much smaller. The total would probably be ~2MByte. Oh well, we didn’t fit inside the 1MByte of L2 cache, but still. The net effect of this change should be positive. Furthermore, even if we only use one core, more memory bandwidth should be available, as we no longer move much (or hardly any) information back into main memory.
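
A sketch of the proposed split, with illustrative names, under the assumptions above:

#include <cstdint>

// The shared piece: read-only, identical for all threads, so a single copy
// can live in L3/main memory and be used by every core.
struct SharedClause {
    uint32_t size;            // number of literals
    const uint32_t* lits;     // the literals, never reordered
};

// The thread-local piece: the two 16-bit positions of lit1 and lit2 inside
// the clause, i.e. 4 bytes per clause, 300'000 * 4 = 1.2MByte per thread.
struct ThreadContext {
    uint16_t watch1;
    uint16_t watch2;
};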

So why wouldn’t this idea work? Well, for starters, we would access two pieces of memory whenever visiting a clause: instead of only accessing the clause, we now also have to access threadContext to be able to work with the clause. I am willing to argue that this is not a problem at all. The reason is really simple: since threadContext is small, and since it is being accessed all the time, it will surely be in cache, probably L2, maybe L3. The clause itself, however, is almost surely in main memory, or failing that, in L3. All we need to do is emit a non-blocking memory-prefetch assembly instruction for both entities when we want to process the clause, and then work with the clause. Since threadContext is almost sure to arrive at the CPU earlier than the clause itself, and since the two fetches can proceed in parallel, the speed difference will be either negligible or nonexistent. The other argument we could come up with is related to the following design feature: swapping the literals lit1 and lit2 to the head of the clause allows the CPU to travel through less of the clause and carry out fewer instructions (less checking for lit1 and lit2). This could mean fewer cachelines needed, and fewer instructions carried out. Since memory is the bottleneck, the number of instructions is not of any real concern. More cachelines are more of a problem, but if we access the right literal at the right point in the clause, then parts of the clause might not be needed at all, and we would only need to grab more cachelines than with the original scheme if we happen to wrap around the clause: starting at the middle, reaching the end without finding a new unassigned literal, and having to go back to the beginning. But this shouldn’t occur too often, and the clause might be on the same cacheline anyway (with <50% chance for a 10-long clause and a 128-Byte cacheline).
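
The prefetching could be sketched like this, using the GCC/Clang __builtin_prefetch builtin (threadContext, sharedClauses and processClause are the hypothetical names from above):

extern ThreadContext* threadContext;      // thread-local, hot in L2/L3
extern SharedClause* sharedClauses;       // shared, mostly in main memory
extern void processClause(uint32_t idx);  // the actual propagation work

void visitClause(uint32_t idx) {
    // Non-blocking prefetches: both fetches proceed in parallel, and the
    // tiny threadContext entry will almost surely arrive first.
    __builtin_prefetch(&threadContext[idx]);
    __builtin_prefetch(sharedClauses[idx].lits);
    // ...work on other queued clauses can overlap with the fetches here...
    processClause(idx);
}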

The last, and most burning question is: will this ever be implemented? Well, if we are really to scale to 6- and 12-core CPUs, I think eventually a better way to share cache will be needed. This is just one way of doing it — probably a crazy complicated and inefficient way at that. Only a good implementation can tell if this approach is worthwhile to pursue, and the 2011 SAT Competition is far too close for me to get this right. Oh well, let’s get back to removing that “last” bug in CryptoMiniSat 2.9.0….

Edited to Add (21/01/2011): I forgot that watchlists also need to be in cache. However, most of the watchlists needed can be prefetched: when a literal is propagated, we can prefetch the corresponding watchlist immediately. The only problem is when a watchlist is needed because a clause’s watched literal is changed. This poses a headache that I haven’t yet thought through.

The variable speed of SAT solving

(Figure: timings from the article "Attacking Bivium Using SAT Solvers". The authors didn't seem to have randomised the problems enough: the time to solve should increase exponentially, but instead it goes up and down like a roller coaster.)

Vegard Nossum asked me about the varying time it took for CryptoMiniSat to solve a certain satisfiable instance. This inspired me to write an overly long reply, which I think might interest others. So, why does the solving time vary for a specific instance if we permute the clauses? Intuitively, just by permuting the clauses, the problem itself doesn’t get any easier or harder, so why should that make any difference at all? I will now try to go through the reasons, though they can almost all be summed up as follows: the SAT solver has absolutely no clue what it is solving, and so to solve a problem it relies on heuristics and randomisation, both of which are influenced by its starting seed, which in turn is influenced by the order of the clauses in the CNF file. And now on to the long explanation.

Let’s divide problems into two categories: satisfiable and unsatisfiable instances. First, let me talk about satisfiable instances. Let’s suppose that we have been solving the problem for some time, and we have connected the dots (variables) well enough through learnt clauses. All that the SAT solver is waiting for is a good guess to set 3-4 variables right, and then it will propagate all variables to the right setting to satisfy all clauses. The problem is therefore to get to the point of only needing to guess 3-4 variables, i.e. to connect the dots, which we do through resolution. We can sacrifice some or all of the dot-connecting (i.e. resolution) for some more guessing, and this can in fact be implemented by simply forgoing some of the conflict analysis we do. In the extreme case, all conflict analysis can be disabled, and then the solver would not restart its search. The solver would in essence be brute-forcing the instance through BCP (boolean constraint propagation). It is easy to see why the latter is prone to variation: depending on the ordering of the variables in the search tree, the tree could be orders of magnitude smaller or larger, and the first solution can be at any point in the search tree, essentially at a random place.
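
The extreme, analysis-free case sketched above would look something like this (all the helpers are hypothetical):

extern bool propagateAll();    // BCP to fixpoint; false on conflict
extern int pickUnsetVar();     // variable ordering heuristic; -1 if none left
extern void pushDecision(int var, bool sign);
extern void popDecision();     // undoes the decision and its propagations

// No conflict analysis, no restarts: a plain depth-first walk of the search
// tree. The ordering chosen by pickUnsetVar() decides whether this tree is
// tiny or astronomically large, hence the huge variation in solving time.
bool bruteForceSearch() {
    if (!propagateAll()) return false;
    int v = pickUnsetVar();
    if (v == -1) return true;           // all variables set: solution found
    pushDecision(v, true);
    if (bruteForceSearch()) return true;
    popDecision();
    pushDecision(v, false);
    if (bruteForceSearch()) return true;
    popDecision();
    return false;
}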

If we decide to carry out resolution on satisfiable problems to counter the variation of the search tree, it is interesting to realise the following: in most cases, we cannot forgo guessing. The reason is simple yet leads to quite interesting properties. Essentially, a SAT instance can have multiple solutions. If there are two solutions, e.g. 01100... and 10011..., i.e. the solutions are the complete inverse of one another, then the SAT solver will not be able to prove any variable to take any fixed value. The best it could do is create binary clauses, in the example case for instance

var1=0 -> var2=1, var3=1, var4=0...
and
var1=1 -> var2=0, var3=0, var4=1...

If we do enough resolutions, the “guessing” part will eventually become a solution-selector, i.e. it will select a solution from the set of available solutions. If there are 2^20 solutions, evenly distributed in the search space, we might need to set 20 variables before a solution is found through a simple application of BCP. Naturally, SAT solvers don’t normally do this, as they are content with finding one satisfying assignment, reporting that to the user, and exiting. It would be interesting to know the average number of variables that still needed to be guessed at the point the SAT solver actually found the solution, but this has not been measured yet as far as I know. Then we would know the trade-off the SAT solver employs between resolution and searching when solving satisfiable instances.

All right, so much for satisfiable instances. What happens with UNSAT instances? Well, the solver must either go through the whole tree and realise there is no solution, or do resolution until it reaches an empty clause, essentially building a resolution tree with an empty clause at its root. Since both of these can be done at the same time, there is a similar trade-off as above, but this time it’s somewhat upside-down. First of all, the search tree can be smaller or larger depending on the ordering of the variables (as before), and secondly, the resolution tree can be smaller or larger depending on the order of resolutions. The minimal resolution tree is (I believe) NP-hard to find, which doesn’t help, but there is at least a minimal resolution tree that limits us, and a minimal search tree, which we must go through completely, that limits us too. So, in contrast to finding satisfying solutions, both of these are complete in some sense, which should make searching for them robust in terms of speed. Finding a satisfying solution is not complete, because, as I noted above, SAT solvers don’t find all satisfying solutions; if they did, they would actually have the same properties as solving an unsatisfiable problem, as they would have to prove that there are no more solutions remaining, essentially proving unsatisfiability.

The above reasoning is probably the reason why SAT solvers tend to behave more robustly when they encounter an unsatisfiable problem. In particular, CryptoMiniSat’s running time seems more robust when solving unsatisfiable problems. Also, for problems that contain a lot of randomly placed solutions in the search space, such as the (in)famous vmpc_33 and vmpc_34 problems, the solving times seem wholly unpredictable for at least two interesting SAT solvers: Lingeling and CryptoMiniSat.

Extended resolution is working!

The subtitle of this post should really say: and I was wrong, again. First, let me explain what I have been wrong about — wrong assumptions are always the most instructive.

I have been convinced for the past year that it’s best to have two types of restart and corresponding learnt clause usefulness strategies: the glue-based restart & garbage collection of Glucose, and a variation of the geometric restart and activity-based garbage collection of MiniSat. The glue-based variation was observed to be very good for industrial instances with many variables, many of which can be fixed easily, and the MiniSat-type variation was observed to be good for cryptography-type instances with a typically low number of variables, almost none of which can be fixed. However, I had the chance to talk to Armin Biere a couple of weeks ago, and while talking with him, I realised that maybe one of Lingeling’s advantages is its very interesting single restart strategy, based on agility (presentation, PDF), which seems to let it use glue-based garbage collection all the time, succeeding in solving both types of instances. So, inspired by the CCC crowd, in only 10 minutes I implemented a simple version of this restart strategy in CryptoMiniSat, and voilà, it seems to work on both types of instances, using only glues!
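
For the curious, the agility measure can be sketched roughly as follows. The decay constant and the restart threshold are illustrative guesses on my part, not Lingeling’s actual values (see the presentation for those):

// Exponentially decaying average of how often an assignment flips the
// variable's previously saved phase. High agility = the search is moving
// through new territory, so restarting would throw away useful momentum.
double agility = 0.0;
const double g = 1.0 / 8192.0;      // decay rate: an assumption

void onAssignment(bool flippedSavedPhase) {
    agility *= (1.0 - g);
    if (flippedSavedPhase) agility += g;
}

bool restartAllowed() {
    return agility < 0.20;          // threshold: also an assumption
}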

Now that this dumb assumption of mine is out of the way, let me talk about what is really interesting. As per my last post, I have managed to add a lightweight form of Extended Resolution (ER). Since then, I have improved the heuristics here and there, and I have launched the solver on a somewhat strange problem (aloul-chnl11-13.cnf) that only has 2×130 variables, but seems to be hard: very few SAT solvers were able to solve this instance within 10000 secs during the last SAT Competition. So, after CryptoMiniSat cut the problem into two distinct parts, each containing 130 variables, it started adding variables to better express the learnt clauses… and the average learnt clause size dropped from ~30 literals to ~13 literals in comparison with the version that didn’t add variables. Furthermore, and this is the really interesting part, the solver was able to prove that some of these newly introduced variables must take a certain value, or that some of them were equi- or antivalent to others. So, after the problem had grown to contain ~3000 variables, the solver was able to solve it in less than 5 minutes. This is, by the way, a problem that neither the newest MiniSat, nor Lingeling, nor Glucose can solve. Interestingly, the Extended Resolution version of Glucose can solve it, in almost the same time as CryptoMiniSat with ER added…

CCC madness

The Chaos Computer Club (CCC) is having its yearly conference starting today. It’s a madhouse here, which is great for ideas, so I have been implementing a rush of ideas. First off, it seems that complex problems with few variables are way too difficult to solve even with distributed solving. I have been trying to solve some difficult problems, but no matter how much computer power I throw at them (and I have been throwing >2 years’ worth, with >300 CPU cores running), not much progress is being made. I have lately been attributing this “failure” to one prime problem: lack of expressiveness.

Basically, I feel like the SAT solver is trying to express some complex functions with learnt clauses. For instance, an adder. However, the only thing it can do is describe this function using a Karnaugh map. Anybody who has tried to express an adder without introducing new variables is well aware that it’s a difficult task. So, we need new variables. This is where some collaboration I have lately been doing with some researchers (Laurent Simon and George Katsirelos) comes into play. Following in their footsteps, I realised we could introduce new definitions, using a new targeting method we developed. So, I did just that, and here is a nice result for a very difficult problem:

num: 7750 lit1: 741 lit2: 4641
num: 2339 lit1: 1396 lit2: 1670
num: 2172 lit1: -719 lit2: 741
num: 2169 lit1: -719 lit2: 4641
num: 2061 lit1: 1669 lit2: 1670
num: 2010 lit1: 1670 lit2: 1734
num: 1973 lit1: 1670 lit2: 1882
...
time: 2.53 seconds

Here, literals lit1 and lit2 are both present in num clauses. So, for example, literals 741 and 4641 are both present in 7750 clauses. Which means that if we introduce a definition with a new variable 10000:

10000 = 741 OR 4641

(which requires the two binary clauses 10000 OR -741 and 10000 OR -4641, plus the 3-long clause -10000 OR 741 OR 4641) we can remove 7750 literals from the problem, while increasing the propagation potential. For example, if there was a clause

741 OR 4641 OR 2 OR 3 = TRUE

we can make it into

10000 OR 2 OR 3 = TRUE

What I have just described is a (dumb) version of extended resolution, the holy grail that many have tried and only few have succeeded at. It is a holy grail because it is provably more powerful than normal resolution, but heuristics for when and how to introduce the new variables have been difficult to come up with. What I have just described would at least reduce the problem size: removing e.g. ~7700 literals while only introducing 3 short clauses seems to be worth the effort. It might not simulate extended resolution (ER), but it should speed up the solving, while giving a (lowly) shot at ER.
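
A naive version of the pair-counting that produced the table above could look like this (clauses as vectors of DIMACS-style signed literals; a real implementation would have to be much more frugal with memory):

#include <algorithm>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

using Pair = std::pair<int, int>;

// Count, for every pair of literals, the number of clauses containing both.
std::map<Pair, int> countPairs(const std::vector<std::vector<int>>& clauses) {
    std::map<Pair, int> count;
    for (const std::vector<int>& cl : clauses)
        for (std::size_t i = 0; i < cl.size(); ++i)
            for (std::size_t j = i + 1; j < cl.size(); ++j)
                ++count[std::minmax(cl[i], cl[j])];
    return count;
}

// For the winning pair (a, b): add the definition x = a OR b as the clauses
// (x, -a), (x, -b), (-x, a, b), then replace "a, b" by "x" wherever both
// occur, shrinking each such clause by one literal: 7750 literals in total
// for the top pair in the table above.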

PS: If all of this sounds mad, that may be attributed to the ~500 people around me loudly developing and discussing issues ranging from peer-to-peer routing to FPGA programming.

Distributed SAT solving is fun

I have lately been trying to keep the promise I made on the webpage of CryptoMiniSat: “The long-term goals of CryptoMiniSat are to be an efficient sequential, parallel and distributed solver”. The first step was to make a relatively efficient sequential solver. With the SAT Race’10 win, that was at least partially accomplished. The next step was to make the program parallel. Now, I have made it distributed. The parallel version was made using OpenMP, and you can try it using the “master” branch from the git repository. The distributed version uses MPI, which means it can probably work on most cluster/supercomputer architectures. This distributed version is currently alpha, and is not yet in the git repo.

My first impression of the distributed version is: it’s awesome to solve using 10-20 machines in parallel. The speed is just unbelievable. It’s also great to have the multi-threaded and distributed options in one program, as I can now launch just one MPI process on each machine, and the 8 cores on each machine can use threads to work together, instead of using the somewhat slower MPI. I haven’t measured the speedup, and I haven’t measured the MPI load, so I cannot say anything yet about efficiency. All I can say is that using 160 threads is a little faster than using only one.

As for technical details, I am only sharing unit and binary clauses. It’s a trivial and lazy choice, but it makes it easier to check for redundant clauses (thanks to Armin Biere for highlighting to me that this is important), and it works well with CryptoMiniSat’s expand-contract loop, where a set of simplification operations is carried out at regular intervals to squeeze the most unit and binary clauses out of the current knowledge stored inside the clause database.
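
Since the distributed version is not yet in the repo, here is only a toy sketch of the kind of exchange involved. The fixed-size buffer and the use of MPI_Allgather are purely illustrative, not how CryptoMiniSat actually does it:

#include <mpi.h>
#include <vector>

extern void addUnitIfNew(int lit);   // checks for redundancy before adding

// Toy round of unit-clause sharing: every process broadcasts its newly
// found unit literals (DIMACS-style ints, 0 = padding) to all the others.
void shareUnits(std::vector<int> myNewUnits, int numProcs) {
    const int MAX_UNITS = 1024;                  // arbitrary fixed size
    myNewUnits.resize(MAX_UNITS, 0);             // pad (or truncate)
    std::vector<int> all(MAX_UNITS * numProcs);
    MPI_Allgather(myNewUnits.data(), MAX_UNITS, MPI_INT,
                  all.data(), MAX_UNITS, MPI_INT, MPI_COMM_WORLD);
    for (int lit : all)
        if (lit != 0) addUnitIfNew(lit);         // skip padding, de-duplicate
}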

I am looking forward to breaking a somewhat problematic crypto-problem with the new distributed CryptoMiniSat. Hopefully this will work, though maybe it’s some way off — I think the original time needed to break the problem might be in the hundreds of days with single-threaded non-distributed CryptoMiniSat. Fingers are crossed that the speedup is non-linear with the distributed version, and I can allocate enough computers for enough time to the job.

Edited to add — The solver has now been running stable on 100 computers, distributed. This is getting to be real fun. Next step is to deploy it on a 342-node cluster, each node double-core. Having 684 threads collaborate will feel mighty ;) Also, it will inspire me to build more heterogeneity into the architecture. It was never designed to run on this many nodes…