All posts by msoos

Changing the Watchlist Scheme

Okay, what if we changed the watchlist scheme… a bit? The way the watchlist scheme operates in SAT solvers is the following. We have a set of literals lit1, lit2, lit3, lit4 that defines the clause we want to be watched for propagation and conflict. We take two literals from this set, and put them in a data structure. It’s not important which ones we take, but we have to know which ones we took. So, we take the 1st and the 2nd (because we are lazy), and we put them in a set of lists called the watchlists. We now have a pointer to the clause in watchlist[~lit1] and in watchlist[~lit2], where ~ denotes negation. If we now set lit1 = false, we have to have a look at the clauses in watchlist[~lit1] and see whether any of them propagates or conflicts. If a clause does neither, we remove it from watchlist[~lit1], find another of its literals that is not false, e.g. lit3, and put the clause into watchlist[~lit3]. It is important that lit3 must not be lit1 or lit2. Obviously, we know what lit1 is (we are visiting its watchlist, after all), but how could we know what lit2 is? Well, we always update the clause such that lit1 and lit2 are the first two literals. We do this for every clause in every watchlist we visit, until a conflict is found or there is nothing left to propagate.
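
To make this concrete, here is a minimal sketch of the propagation loop just described. All the names (Lit, Clause, watchlist, value(), enqueue()) are made-up placeholders, not CryptoMiniSat’s actual internals:

#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical minimal types; a real solver packs these much tighter.
struct Lit { int x; };                                // var*2 + sign encoding
inline Lit operator~(Lit l) { return Lit{l.x ^ 1}; }  // negation flips a bit
inline bool operator==(Lit a, Lit b) { return a.x == b.x; }
enum class Val { True, False, Unset };

struct Clause { std::vector<Lit> lits; };             // lits[0], lits[1] watched

extern Val value(Lit l);                              // current assignment
extern void enqueue(Lit l, Clause* reason);           // record a propagation
extern std::vector<std::vector<Clause*>> watchlist;   // indexed by literal

// Called when literal p has just become true (i.e. ~p became false).
// Returns the conflicting clause, or nullptr if there was no conflict.
Clause* propagateWatches(Lit p) {
    std::vector<Clause*>& ws = watchlist[p.x];
    for (std::size_t i = 0; i < ws.size(); ) {
        Clause& c = *ws[i];
        // Keep the invariant: the falsified watch (~p) sits at position 1.
        if (c.lits[0] == ~p) std::swap(c.lits[0], c.lits[1]);
        // If the other watch is already true, the clause is satisfied.
        if (value(c.lits[0]) == Val::True) { ++i; continue; }
        // Try to find a replacement watch among the remaining literals.
        bool relocated = false;
        for (std::size_t k = 2; k < c.lits.size(); ++k) {
            if (value(c.lits[k]) != Val::False) {
                std::swap(c.lits[1], c.lits[k]);          // the clause update!
                watchlist[(~c.lits[1]).x].push_back(&c);  // move the watch
                ws[i] = ws.back(); ws.pop_back();         // drop from old list
                relocated = true;
                break;
            }
        }
        if (relocated) continue;
        // No replacement found: the clause is unit or conflicting.
        if (value(c.lits[0]) == Val::Unset) enqueue(c.lits[0], &c);
        else return &c;                                   // all-false: conflict
        ++i;
    }
    return nullptr;
}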

So, why is this inefficient? Well, what needs to be noticed is that we update the clause non-stop in order to keep track of lit1 and lit2. We can implement this by changing the order of the literals (that’s how it’s done in most cases), or by updating some data in the clause. Either way, this needs an update of the clause, and consequently of the whole cacheline on which the clause (and probably other clauses, as these are memory-packed) resides. Let’s say the average clause length is 10, but the maximum clause length is 2^16. Since a literal is 32 bits, an average clause, with all the extra info it needs to keep, is around 10*32 + 32 + 32 = 384 bits long. Cache line size for Core i7 processors is 64 Bytes (= 512 bits), though the new architectures I believe are bumping this to 128 Bytes. We are therefore asking the CPU to grab 64 Bytes from memory, store it in the CPU cache, and then put the changed value back into main memory, as the cache is woefully small to keep all the clauses we will visit. If you take a CPU with 6 cores and try to run this algorithm, you are probably looking at a memory bandwidth nightmare. The clauses cannot be shared, because updating the lit1 and lit2 positions is not an option: these would be different for each thread. Furthermore, the CPU is not only taking data from main memory, but also putting data back into main memory at an astonishing rate.
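
As a back-of-the-envelope illustration of the arithmetic above (the field names are made up for the example):

#include <cstdint>

// Illustrative layout matching the numbers above: 10 literals of 32 bits
// each, plus 64 bits of extra info, is 10*32 + 32 + 32 = 384 bits = 48 bytes.
struct PackedClause {
    uint32_t size;        // number of literals (up to 2^16), plus flags
    uint32_t activity;    // extra bookkeeping kept with the clause
    uint32_t lits[10];    // average case: 10 literals; lits[0], lits[1] watched
};
static_assert(sizeof(PackedClause) == 48, "matches the 384-bit figure above");

// Swapping lits[0]/lits[1] dirties the whole cacheline holding the clause,
// so the line must eventually be written back to main memory.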

Let’s take a Core i7 processor off the shelf, and see what its architecture looks like. We have 1MByte of L2 cache for each core, and 8MByte of L3 cache shared by all the CPU cores. This is like a finger pointing at us: keep a big piece of memory that is constant and can be shared, and a much smaller piece of thread-local memory. Also: please lighten the load on the memory access sub-architecture, as that isn’t going to keep up with all 6 cores churning through the clauses. Well, an interesting solution is to split each clause into two pieces: one that stores which literals are lit1 and lit2, and one that stores the literals and the extra information. The first piece, let’s call it threadContext, stores the positions of lit1 and lit2, each of which is 16 bits (enough for the maximal clause length), so 4 bytes in total. The clause itself will then not need any update at all. An average problem instance is typically beaten down to a size of <300’000 clauses by CryptoMiniSat in <500 seconds. Optimally, we would like to access these clauses from main memory or L3 cache (this latter is a dream, really), and keep their threadContext in the L2 cache of the core. The latter takes 300’000*4 = 1.2MByte for a typical problem. There are also some other things to keep in L2, such as the current variable assignment and the current trail of assignments, but each of these should be much smaller. The total would probably be ~2MByte. Oh well, we didn’t fit inside the 1MByte of L2 cache, but still. The net effect of this change should be positive. Furthermore, even if we only use one core, more memory bandwidth should be available, as we no longer move much (or hardly any) information back into main memory.
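
A sketch of the proposed split, with illustrative names, under the assumptions above:

#include <cstdint>

// The shared piece: read-only, identical for all threads, so a single copy
// can live in L3/main memory and be used by every core.
struct SharedClause {
    uint32_t size;            // number of literals
    const uint32_t* lits;     // the literals, never reordered
};

// The thread-local piece: the two 16-bit positions of lit1 and lit2 inside
// the clause, i.e. 4 bytes per clause, 300'000 * 4 = 1.2MByte per thread.
struct ThreadContext {
    uint16_t watch1;
    uint16_t watch2;
};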

So why wouldn’t this idea work? Well, for starters, we would access two pieces of memory whenever visiting a clause: instead of only accessing the clause, we now also have to access threadContext to be able to work with the clause. I am willing to argue that this is not a problem at all. The reason is really simple: since threadContext is small, and since it is being accessed all the time, it will surely be in cache, probably L2, maybe L3. The clause itself, however, is almost surely in main memory, or failing that, in L3. All we need to do is emit a non-blocking memory-prefetch assembly instruction for both entities when we want to process the clause, and then work with the clause. Since threadContext is almost sure to arrive at the CPU earlier than the clause itself, and since the two fetches can proceed in parallel, the speed difference will be either negligible or nonexistent. The other argument we could come up with is related to the following design feature: swapping the literals lit1 and lit2 to the head of the clause allows the CPU to travel through less of the clause and carry out fewer instructions (less checking for lit1 and lit2). This could mean fewer cachelines needed, and fewer instructions carried out. Since memory is the bottleneck, the number of instructions is not of any real concern. More cachelines are more of a problem, but if we access the right literal at the right point in the clause, then parts of the clause might not be needed at all, and we would only need to grab more cachelines than with the original scheme if we happen to wrap around the clause: starting at the middle, reaching the end without finding a new unassigned literal, and having to go back to the beginning. But this shouldn’t occur too often, and the clause might be on the same cacheline anyway (with <50% chance for a 10-long clause and a 128-Byte cacheline).
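
The prefetching could be sketched like this, using the GCC/Clang __builtin_prefetch builtin (threadContext, sharedClauses and processClause are the hypothetical names from above):

extern ThreadContext* threadContext;      // thread-local, hot in L2/L3
extern SharedClause* sharedClauses;       // shared, mostly in main memory
extern void processClause(uint32_t idx);  // the actual propagation work

void visitClause(uint32_t idx) {
    // Non-blocking prefetches: both fetches proceed in parallel, and the
    // tiny threadContext entry will almost surely arrive first.
    __builtin_prefetch(&threadContext[idx]);
    __builtin_prefetch(sharedClauses[idx].lits);
    // ...work on other queued clauses can overlap with the fetches here...
    processClause(idx);
}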

The last, and most burning question is: will this ever be implemented? Well, if we are really to scale to 6- and 12-core CPUs, I think eventually a better way to share cache will be needed. This is just one way of doing it — probably a crazy complicated and inefficient way at that. Only a good implementation can tell if this approach is worthwhile to pursue, and the 2011 SAT Competition is far too close for me to get this right. Oh well, let’s get back to removing that “last” bug in CryptoMiniSat 2.9.0….

Edited to Add (21/01/2011): I forgot that watchlists also need to be in cache. However, most of the watchlists needed can be prefetched: when a literal is propagated, we can prefetch the corresponding watchlist immediately. The only problem is when a watchlist is needed because a clause’s watched literal is changed. This poses a headache that I haven’t yet thought through.

The variable speed of SAT solving

(Figure: timings from the article "Attacking Bivium Using SAT Solvers". The authors didn't seem to have randomised the problems enough: the time to solve should increase exponentially, but instead it goes up and down like a roller coaster.)

Vegard Nossum asked me about the varying time it took for CryptoMiniSat to solve a certain satisfiable instance. This inspired me to write an overly long reply, which I think might interest others. So, why does the solving time vary for a specific instance if we permute the clauses? Intuitively, just by permuting the clauses, the problem itself doesn’t get any easier or harder, so why should that make any difference at all? I will now try to go through the reasons, though they can almost all be summed up as follows: the SAT solver has absolutely no clue what it is solving, and so to solve a problem it relies on heuristics and randomisation, both of which are influenced by its starting seed, which in turn is influenced by the order of the clauses in the CNF file. And now on to the long explanation.

Let’s divide problems into two categories: satisfiable and unsatisfiable instances. First, let me talk about satisfiable instances. Let’s suppose that we have been solving the problem for some time, and we have connected the dots (variables) well enough through learnt clauses. All that the SAT solver is waiting for is a good guess to set 3-4 variables right, and then it will propagate all variables to the right setting to satisfy all clauses. The problem is therefore to get to the point of only needing to guess 3-4 variables, i.e. to connect the dots, which we do through resolution. We can sacrifice some or all of the dot-connecting (i.e. resolution) for some more guessing, and this can in fact be implemented by simply forgoing some of the conflict analysis we do. In the extreme case, all conflict analysis can be disabled, and then the solver would not restart its search. The solver would in essence be brute-forcing the instance through BCP (boolean constraint propagation). It is easy to see why the latter is prone to variation: depending on the ordering of the variables in the search tree, the tree could be orders of magnitude smaller or larger, and the first solution can be at any point in the search tree, essentially at a random place.
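
The extreme, analysis-free case sketched above would look something like this (all the helpers are hypothetical):

extern bool propagateAll();    // BCP to fixpoint; false on conflict
extern int pickUnsetVar();     // variable ordering heuristic; -1 if none left
extern void pushDecision(int var, bool sign);
extern void popDecision();     // undoes the decision and its propagations

// No conflict analysis, no restarts: a plain depth-first walk of the search
// tree. The ordering chosen by pickUnsetVar() decides whether this tree is
// tiny or astronomically large, hence the huge variation in solving time.
bool bruteForceSearch() {
    if (!propagateAll()) return false;
    int v = pickUnsetVar();
    if (v == -1) return true;           // all variables set: solution found
    pushDecision(v, true);
    if (bruteForceSearch()) return true;
    popDecision();
    pushDecision(v, false);
    if (bruteForceSearch()) return true;
    popDecision();
    return false;
}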

If we decide to carry out resolution on satisfiable problems to counter the variation of the search tree, it is interesting to realise the following: in most cases, we cannot forgo guessing. The reason is simple yet leads to quite interesting properties. Essentially, a SAT instance can have multiple solutions. If there are two solutions, e.g. 01100... and 10011..., i.e. the solutions are the complete inverse of one another, then the SAT solver will not be able to prove any variable to take any fixed value. The best it could do is create binary clauses, in the example case for instance

var1=0 -> var2=1, var3=1, var4=0...
and
var1=1 -> var2=0, var3=0, var4=1...

If we do enough resolutions, the “guessing” part will eventually become a solution-selector, i.e. it will select a solution from the set of available solutions. If there are 2^20 solutions, evenly distributed in the search space, we might need to set 20 variables before a solution is found through a simple application of BCP. Naturally, SAT solvers don’t normally do this, as they are content with finding one satisfying assignment, reporting that to the user, and exiting. It would be interesting to know the average number of variables that still needed to be guessed at the point the SAT solver actually found the solution, but this has not been measured yet as far as I know. Then we would know the trade-off the SAT solver employs between resolution and searching when solving satisfiable instances.

All right, so much for satisfiable instances. What happens with UNSAT instances? Well, the solver must either go through the whole tree and realise there is no solution, or do resolution until it reaches an empty clause, essentially building a resolution tree with an empty clause at its root. Since both of these can be done at the same time, there is a similar trade-off as above, but this time it’s somewhat upside-down. First of all, the search tree can be smaller or larger depending on the ordering of the variables (as before), and secondly, the resolution tree can be smaller or larger depending on the order of resolutions. The minimal resolution tree is (I believe) NP-hard to find, which doesn’t help, but there is at least a minimal resolution tree that limits us, and a minimal search tree, which we must go through completely, that limits us too. So, in contrast to finding satisfying solutions, both of these are complete in some sense, which should make searching for them robust in terms of speed. Finding a satisfying solution is not complete, because, as I noted above, SAT solvers don’t find all satisfying solutions; if they did, they would actually have the same properties as solving an unsatisfiable problem, as they would have to prove that there are no more solutions remaining, essentially proving unsatisfiability.

The above reasoning is probably the reason why SAT solvers tend to behave more robustly when they encounter an unsatisfiable problem. In particular, CryptoMiniSat’s running time seems more robust when solving unsatisfiable problems. Also, for problems that contain a lot of randomly placed solutions in the search space, such as the (in)famous vmpc_33 and vmpc_34 problems, the solving times seem wholly unpredictable for at least two interesting SAT solvers: Lingeling and CryptoMiniSat.

Extended resolution is working!

The subtitle of this post should really say: and I was wrong, again. First, let me explain what I have been wrong about — wrong assumptions are always the most instructive.

I have been convinced for the past year that it’s best to have two types of restart and corresponding learnt clause usefulness strategies: the glue-based restart & garbage collection of Glucose, and a variation of the geometric restart and activity-based garbage collection of MiniSat. The glue-based variation was observed to be very good for industrial instances with many variables, many of which can be fixed easily, and the MiniSat-type variation was observed to be good for cryptography-type instances with a typically low number of variables, almost none of which can be fixed. However, I had the chance to talk to Armin Biere a couple of weeks ago, and while talking with him, I realised that maybe one of Lingeling’s advantages is its very interesting single restart strategy, based on agility (presentation, PDF), which seems to let it use glue-based garbage collection all the time, succeeding in solving both types of instances. So, inspired by the CCC crowd, in only 10 minutes I implemented a simple version of this restart strategy in CryptoMiniSat, and voilà, it seems to work on both types of instances, using only glues!
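
For the curious, the agility measure can be sketched roughly as follows. The decay constant and the restart threshold are illustrative guesses on my part, not Lingeling’s actual values (see the presentation for those):

// Exponentially decaying average of how often an assignment flips the
// variable's previously saved phase. High agility = the search is moving
// through new territory, so restarting would throw away useful momentum.
double agility = 0.0;
const double g = 1.0 / 8192.0;      // decay rate: an assumption

void onAssignment(bool flippedSavedPhase) {
    agility *= (1.0 - g);
    if (flippedSavedPhase) agility += g;
}

bool restartAllowed() {
    return agility < 0.20;          // threshold: also an assumption
}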

Now that this dumb assumption of mine is out of the way, let me talk about what is really interesting. As per my last post, I have managed to add a lightweight form of Extended Resolution (ER). Since then, I have improved the heuristics here and there, and I have launched the solver on a somewhat strange problem (aloul-chnl11-13.cnf) that only has 2×130 variables, but seems to be hard: very few SAT solvers were able to solve this instance within 10000 secs during the last SAT Competition. So, after CryptoMiniSat cut the problem into two distinct parts, each containing 130 variables, it started adding variables to better express the learnt clauses… and the average learnt clause size dropped from ~30 literals to ~13 literals in comparison with the version that didn’t add variables. Furthermore, and this is the really interesting part, the solver was able to prove that some of these newly introduced variables must take a certain value, or that some of them were equi- or antivalent to others. So, after the problem had grown to contain ~3000 variables, the solver was able to solve it in less than 5 minutes. This is, by the way, a problem that neither the newest MiniSat, nor Lingeling, nor Glucose can solve. Interestingly, the Extended Resolution version of Glucose can solve it, in almost the same time as CryptoMiniSat with ER added…

CCC madness

The Chaos Computer Club (CCC) is having its yearly conference starting today. It’s a madhouse here, which is great for ideas, so I have been implementing a rush of ideas. First off, it seems that complex problems with few variables are way too difficult to solve even with distributed solving. I have been trying to solve some difficult problems, but no matter how much computer power I throw at them (and I have been throwing >2 years’ worth, with >300 CPU cores running), not much progress is being made. I have lately been attributing this “failure” to one prime problem: lack of expressiveness.

Basically, I feel like the SAT solver is trying to express some complex functions with learnt clauses. For instance, an adder. However, the only thing it can do is describe this function using a Karnaugh map. Anybody who has tried to express an adder without introducing new variables is well aware that it’s a difficult task. So, we need new variables. This is where some collaboration I have lately been doing with some researchers (Laurent Simon and George Katsirelos) comes into play. Following in their footsteps, I realised we could introduce new definitions, using a new targeting method we developed. So, I did just that, and here is a nice result for a very difficult problem:

num: 7750 lit1: 741 lit2: 4641
num: 2339 lit1: 1396 lit2: 1670
num: 2172 lit1: -719 lit2: 741
num: 2169 lit1: -719 lit2: 4641
num: 2061 lit1: 1669 lit2: 1670
num: 2010 lit1: 1670 lit2: 1734
num: 1973 lit1: 1670 lit2: 1882
...
time: 2.53 seconds

Here, literals lit1 and lit2 are both present in num clauses. So, for example, literals 741 and 4641 are both present in 7750 clauses. Which means that if we introduce a definition with a new variable 10000:

10000 = 741 OR 4641

(which requires the two binary clauses 10000 OR -741 and 10000 OR -4641, plus the 3-long clause -10000 OR 741 OR 4641) we can remove 7750 literals from the problem, while increasing the propagation potential. For example, if there was a clause

741 OR 4641 OR 2 OR 3 = TRUE

we can make it into

10000 OR 2 OR 3 = TRUE

What I have just described is a (dumb) version of extended resolution, the holy grail that many have tried and only few have succeeded at. It is a holy grail because it is provably more powerful than normal resolution, but heuristics for when and how to introduce the new variables have been difficult to come up with. What I have just described would at least reduce the problem size: removing e.g. ~7700 literals while only introducing 3 short clauses seems to be worth the effort. It might not simulate extended resolution (ER), but it should speed up the solving, while giving a (lowly) shot at ER.
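
A naive version of the pair-counting that produced the table above could look like this (clauses as vectors of DIMACS-style signed literals; a real implementation would have to be much more frugal with memory):

#include <algorithm>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

using Pair = std::pair<int, int>;

// Count, for every pair of literals, the number of clauses containing both.
std::map<Pair, int> countPairs(const std::vector<std::vector<int>>& clauses) {
    std::map<Pair, int> count;
    for (const std::vector<int>& cl : clauses)
        for (std::size_t i = 0; i < cl.size(); ++i)
            for (std::size_t j = i + 1; j < cl.size(); ++j)
                ++count[std::minmax(cl[i], cl[j])];
    return count;
}

// For the winning pair (a, b): add the definition x = a OR b as the clauses
// (x, -a), (x, -b), (-x, a, b), then replace "a, b" by "x" wherever both
// occur, shrinking each such clause by one literal: 7750 literals in total
// for the top pair in the table above.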

PS: If all of this sounds mad, that may be attributed to the ~500 people around me loudly developing and discussing issues ranging from peer-to-peer routing to FPGA programming.

Distributed SAT solving is fun

I have lately been trying to keep the promise I made on the webpage of CryptoMiniSat: “The long-term goals of CryptoMiniSat are to be an efficient sequential, parallel and distributed solver”. The first step was to make a relatively efficient sequential solver. With the SAT Race’10 win, that was at least partially accomplished. The next step was to make the program parallel. Now, I have made it distributed. The parallel version was made using OpenMP, and you can try it using the “master” branch from the git repository. The distributed version uses MPI, which means it can probably work on most cluster/supercomputer architectures. This distributed version is currently alpha, and is not yet in the git repo.

My first impression of the distributed version is: it’s awesome to solve using 10-20 machines in parallel. The speed is just unbelievable. It’s also great to have the multi-threaded and distributed options in one program, as I can now launch just one MPI process on each machine, and the 8 cores on each machine can use threads to work together, instead of using the somewhat slower MPI. I haven’t measured the speedup, and I haven’t measured the MPI load, so I cannot say anything yet about efficiency. All I can say is that using 160 threads is a little faster than using only one.

As for technical details, I am only sharing unit and binary clauses. It’s a trivial and lazy choice, but it makes it easier to check for redundant clauses (thanks to Armin Biere for highlighting to me that this is important), and it works well with CryptoMiniSat’s expand-contract loop, where a set of simplification operations is carried out at regular intervals to squeeze the most unit and binary clauses out of the current knowledge stored inside the clause database.
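
Since the distributed version is not yet in the repo, here is only a toy sketch of the kind of exchange involved. The fixed-size buffer and the use of MPI_Allgather are purely illustrative, not how CryptoMiniSat actually does it:

#include <mpi.h>
#include <vector>

extern void addUnitIfNew(int lit);   // checks for redundancy before adding

// Toy round of unit-clause sharing: every process broadcasts its newly
// found unit literals (DIMACS-style ints, 0 = padding) to all the others.
void shareUnits(std::vector<int> myNewUnits, int numProcs) {
    const int MAX_UNITS = 1024;                  // arbitrary fixed size
    myNewUnits.resize(MAX_UNITS, 0);             // pad (or truncate)
    std::vector<int> all(MAX_UNITS * numProcs);
    MPI_Allgather(myNewUnits.data(), MAX_UNITS, MPI_INT,
                  all.data(), MAX_UNITS, MPI_INT, MPI_COMM_WORLD);
    for (int lit : all)
        if (lit != 0) addUnitIfNew(lit);         // skip padding, de-duplicate
}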

I am looking forward to breaking a somewhat problematic crypto-problem with the new distributed CryptoMiniSat. Hopefully this will work, though maybe it’s some way off — I think the original time needed to break the problem might be in the hundreds of days with single-threaded non-distributed CryptoMiniSat. Fingers are crossed that the speedup is non-linear with the distributed version, and I can allocate enough computers for enough time to the job.

Edited to add — The solver has now been running stable on 100 computers, distributed. This is getting to be real fun. Next step is to deploy it on a 342-node cluster, each node double-core. Having 684 threads collaborate will feel mighty ;) Also, it will inspire me to build more heterogeneity into the architecture. It was never designed to run on this many nodes…