All posts by msoos

On using less memory for binary clauses in lingeling’s watchlists

Armin Biere gave a lecture at the Pragmatics of SAT workshop (proceeedings here) in Vienna about all the things inside lingeling which won a lot of awards[PDF] this year. If you weren’t there, you missed an amazing presentation. In this blog post I’ll reflect on a particular part of the presentation dealing with a memory trickery that has been intriguing me for a long while but I did not implement. Before I begin let me say: the presentation was awesome, and it’s not by chance that lingeling won so many awards.

The idea

The idea used by lingeling I want to talk about is easy to explain (though not easy to invent, as is usually the case). If you look at typical CNF problems, the majority of the clauses will be binary, i.e. only contain 2 literals. These clauses used to be stored exactly the same way as normal clauses: in the heap we allocate 2 literals and we put a pointer into the watchlist to these literals.

An improvement over this idea are the so-called implicit clauses. For the binary clause “x V y” we put into the watchlist of x the literal y, and in the watchlist of y the literal x. There is no other place we store these binary clauses, hence the word “implicit”. For other clauses, we still put pointers into the watchlist and allocate space, as usual. The problem with this approach is that the pointer to a clause is 32b (we use an 32b offset on 64b machines) but for each clause we also store a so-called blocking literal in the watchlist, which is also 32b. That makes the entries in the watchlist 64b long for normal clauses, and 32b long for binary clauses.

The idea is to have differing sizes of elements in a watchlist. If e.g. the first bit of the element is a 1, the next 63b relate to a long clause, and if it’s a 0, then the next 31b relate to a binary clause. In case 80% of the clauses are binary, this saves 50% of the space in 80% of the cases. Not bad at all.

The advantages

The advantages of using this idea are twofold. First, as already mentioned, memory use is lowered. This is non-trivial as memory usage of the watchlists can be enormous and although many other improvements can be switched off (such as e.g. stamping), storage of the clauses can never be switched off. Secondly, not having holes in memory leads to much better cache usage which in turn can bring real speedup. In case you think this is not important, you might enjoy knowing that the HHVM module of Facebook was made over 2x faster by making sure that important cache lines are not knocked out[PDF].

The disadvantages

In case you have an array that has varying size elements in it, some non-trivial complications arise. Let me list just a few that come to my mind.

Sorting the list is no longer trivial You cannot just swap elements: they might not fit. One way to do sorting efficiently is to move all the data into another, equally-spaced container and sort there, then move it back in. However, keep in mind that the reason why quicksort is so fast is that it can do in-place sort. Merge-sorting would be another option, but it copies elements and it’s not by chance it’s not the default sort in most cases. Also, you would have to re-write merge-sort of course, to deal with the varying-sized elements.

In case you think that sorting is not needed, maybe you forgot to consider the lightning-quick subsumption you can do between implicit binary&tertiary clauses using sort to give just one example.

Removing an element is no longer an O(1) operation In case you need to remove element X in a watchlist, you can simply swap the last element to the position of X and make the array one smaller. This trick is used quite extensively since the order of the watchlist is usually irrelevant.

Loops need to be re-written All your loops that go through the watchlist need to be re-written and have to have a switch() in them with some pointer arithmetic (do we advance by 32b or 64b?). In case you think you don’t need to go through the watchlist too often anyway, think again. Any time you need to do anything with clauses you will have to go through the watchlists, since they are the only place where binary (and tertiary, if your system is optimised enough) clauses are stored. This means your watchlist-gothrough function will be absolutely everywhere in the code unless you don’t want to implement any pre- or in-processing.

You might think you could write a function and just pass a pointer to another function to it that does the ‘real’ job, essentially hiding the complexity in a function that you only need to write once. There are three problems with this. First, this will be a tight loop and so performance is important, which you will loose as you will need to dereference the passed-in function pointer every time. This can be overcome with the use of templates but it won’t make the code pretty. Secondly, your original, hiding function will need to be written more than a single time. For example, some such executions will need to count time (operations) and some won’t. You will need to count pointer dereferences (normal clause is fetched) and binary clauses (no pointer dereferenced) in a significantly different way as a cache-miss is very expensive and a clause-access will cause such a cache-miss most of the time. For performance reasons you will need variations that don’t dereference long clauses, variations that allow for manipulation of the array, variations that don’t, etc.

Maintaining datastructure consistency becomes harder Unless you use hiding functions, which is non-trivial as explained above (and maybe impossible in e.g. plain C), the complexity to maintain consistency of the datastructure will be all over the code. Even if done very carefully, the constraints on the datastructure may end up being implicitly, rather than explicitly, represented in the code. It will make it easier to create bugs and harder to find them.

Conclusions

This idea of using less memory for binary clauses in the watchlists is very interesting and has intrigued me for a long while — Armin was kind enough to tell me about this a long time ago. It has the potential to save a lot of memory and to keep things more packed in the datastructure that is arguably the most accessed during solving and inprocessing. However, I was always daunted by the obstacles I saw in front of me — though I might simply need to understand C++11 and templates better to make it work.

Currently, I feel like there are plenty of other optimisations that I could implement from the talk of Armin, e.g. that all watchlists are stored in the same array, using offsets and a hand-rolled memory manager. That seems to have a potential of also improving the memory usage and speed while being easier to implement and easier to hide in a class.

CryptoMiniSat 4.2 released

CryptoMiniSat 4.2 has been released. This release brings multi-threading, some bug fixes, and a lot of code cleanup.

Multi-threading

Multi-threading has been implemented using the std::thread class of C++11. This makes it very portable and at the same time easy to use. Multi-threading is very simple and only shares unitary and binary learnt clauses. This is in comparison to other approaches that have some form of complex clause-sharing algorithms, sometimes even sharing clause databases. However, this system works when it’s used as a library, even with assumptions. Simply call set_num_threads(N) before calling any other functions.

The method used to speed up the system is the portfolio method, i.e. there are many threads started with different parameters and they share some information among them. The threads are configured as:

switch(thread_num) {
case 1: {
    conf.restartType = restart_type_geom;
    conf.polarity_mode = CMSat::polarmode_neg;
    conf.varElimRatioPerIter = 1;
    break;
}
case 2: {
    conf.shortTermHistorySize = 80;
    conf.clauseCleaningType = CMSat::clean_glue_based;
    conf.restartType = CMSat::restart_type_glue;
    conf.increaseClean = 1.08;
    conf.ratioRemoveClauses = 0.55;
    break;
}
case 3: {
    conf.doVarElim = 0;
    conf.doGateFind = 0;
    conf.more_red_minim_limit_cache = 400;
    conf.more_red_minim_limit_binary = 200;
    conf.probe_bogoprops_timeoutM = 3500;
    conf.restartType = CMSat::restart_type_agility;
    conf.ratioRemoveClauses = 0.6;
    break;
}
case 4: {
    conf.simplify_at_startup = 1;
    conf.regularly_simplify_problem = 0;
    conf.varElimRatioPerIter = 1;
    conf.restartType = restart_type_geom;
    conf.clauseCleaningType = CMSat::clean_sum_activity_based;
    conf.polarity_mode = CMSat::polarmode_neg;
    conf.ratioRemoveClauses = 0.65;
    break;
}
case 5: {
    conf.doGateFind = 0;
    conf.more_red_minim_limit_cache = 100;
    conf.more_red_minim_limit_binary = 100;
    conf.probe_bogoprops_timeoutM = 4000;
    conf.ratioRemoveClauses = 0.6;
    break;
}
case 6: {
    conf.numCleanBetweenSimplify = 1;
    conf.skip_some_bve_resolvents = 1;
    conf.ratioRemoveClauses = 0.7;
    break;
}
case 7: {
    conf.clauseCleaningType = CMSat::clean_sum_confl_depth_based;
    conf.ratioRemoveClauses = 0.55;
    break;
}
case 8: {
    conf.polarity_mode = CMSat::polarmode_pos;
    conf.ratioRemoveClauses = 0.6;
    break;
}
case 9: {
    conf.do_bva = 0;
    conf.doGateFind = 0;
    conf.more_red_minim_limit_cache = 800;
    conf.more_red_minim_limit_binary = 400;
    conf.polarity_mode = CMSat::polarmode_neg;
    conf.ratioRemoveClauses = 0.6;
    break;
}
case 10: {
    conf.do_bva = 0;
    conf.doGateFind = 0;
    conf.restartType = CMSat::restart_type_agility;
    conf.clauseCleaningType = CMSat::clean_glue_based;
    conf.ratioRemoveClauses = 0.6;
    break;
}
case 11: {
    conf.simplify_at_startup = 1;
    conf.propBinFirst = 1;
    conf.doLHBR = 1;
    conf.increaseClean = 1.12;
    conf.ratioRemoveClauses = 0.7;
    break;
}
default: {
    conf.clauseCleaningType = CMSat::clean_glue_based;
    conf.ratioRemoveClauses = 0.7;
    break;
}
}

These configurations have been chosen because they seemed to have quite orthogonal parameters. Only 12 threads are properly configured, the rest are not really configured and are only cleaning a lot more clauses than normal (so as not to run out of memory). In a certain sense, the above is the “secret sauce” that makes the parallel system work.

Code cleanup

The code has been greatly refactored. This is an ongoing effort, but its fruits are already quite visible. In general, variable and function names are more meaningful, function sizes have been drastically cut and the expressiveness of the code has been improved.

Unfortunately, C++ (and C) are quite limiting in a number of ways, and so CryptoMiniSat might move to other languages such as Go. Go for example provides reflection and significantly improved compile times. These two are very useful for development: the former greatly simplifies testing while the latter allows for quicker build (and thus debug) cycles.

[paypal-donation]

SMT Competition’14 and STP

The 2014 SMT competition‘s QF_BV divison is over (benchmark tarball here), and the results are (copied from here):

Solver Errors Solved Not Solved CPU time (solved instances)
Boolector 0 2361 127 138077.59
STP-CryptoMiniSat4 0 2283 205 190660.82
[MathSAT] 0 2199 289 262349.39
[Z3] 0 2180 308 214087.66
CVC4 0 2166 322 87954.62
4Simp 0 2121 367 187966.86
SONOLAR 0 2026 462 174134.49
Yices2 0 1770 718 159991.55
abziz_min_features 9 2155 324 134385.22
abziz_all_features 9 2093 386 122540.04

First, let me congratulate Boolector by Armin Biere. STP came a not-so-close second. STP was meant to be submitted twice, once with MiniSat and once with CyrptoMiniSat4 — this is the only reason why the STP submission is called STP-CryptoMiniSat4. Unfortunately, the other submission could not be made because of some compilation problems.

About the results

There are a couple of things I would like to draw your attention to. First is the cumulated solving time of solved instances. Note how it decreases compared to MathSat and Z3. Note also the differences in the DIFF of the number of solved instances. It’s relatively small between 4Simp…MathSat(<40) and increases dramatically with STP and Boolector (with around 80 each). Another thing of interest is that STP is surrounded by commercial products: Boolector, Z3 and MathSAT are all products you have to pay for to use in a commercial setting. In contrast, the biggest issue with STP is that if the optional(!) CryptoMiniSat4 is linked in, it's LGPLv2 instead of MIT licensed. In other words, a lot of effort is in there, and it's all free. Sometimes free as in beer, sometimes free as in freedom.

About STP

Lately, a lot of work has been done on STP. If you look a the github repository, it has about 80 issues resolved, about 40 open, lots of pull requests, and a lot of commits. A group of people, namely (in nickname order): Dan Liew, Vijay Ganesh, Khoo Yit Phang, Ryan Govostes, Trevor Hansen, a lot of external contributors (e.g. Stephen McCamant) and myself have been working on it pretty intensively. It now has a an automated test facility and a robust build system besides all the code cleanup and other improvements.

It’s a great team and I’m happy to be a minor part of it. If you feel like you could contribute, don’t hesitate to fork the repo, make some changes, and ask for a pull request. The discussions on the pull requests can be pretty elaborate which makes for a nice learning experience for all involved. Come and join — next year hopefully we’ll win :)

Multi-threading and startup speed

When doing multi-threading, parsing the input and inserting it into the solver threads is actually one of the most tricky part. Of course one can simply do

void add_clause_to_all_threads(Clause clause) {
  for(int i = 0; i < threads; i++) {
    solver[i].add_clause(clause);
  }
}

It works and my SAT solver entry to the competition used this method. However, if we have 12 threads, the startup time will be 12x more than usual since only one thread is adding the clauses to all the solvers. If startup used to be 3s, it's suddenly more than half a minute.

A simple solution

A simple solution to this problem is to create threads and add the clauses to the threads individually. The problem with this is that creating 12 threads for each clause and destroying the threads for each is actually more time than to do what's above.

The correct solution

The real solution is then to store the data to be added to the threads, and once a good number of them have been stored, add all the cached clauses to the solvers: create the 12 threads, for each thread add the clauses, and destroy the threads. This solution amortizes the creation&destruction of threads at the expense of storing clauses in a cache.

CryptoMiniSat now uses a clause cache for exactly this purpose. Naturally, this also necessitates a cache for variable creation, though that is only a counter, so not a big deal.

Final polish of my SAT competition entry

[wpdm_file id=2] UPDATE: Fixed crash, UPDATE2: fixed looking for Gaussian elimination library

In the last days of the competition deadline, I have improved the following on the SAT solver:

  • Adding N new variables is now possible in one go. This reduces variable addition overhead, especially when having many threads.
  • Memory allocation overhead per new clause addition is now much smaller thanks to judicious use of globally allocated temporaries. Using such temporaries is very dangerous as CryptoMiniSat uses addClause() from multiple places in the code in a recursive manner. However, I only use the temporaries for add_clause_outer(), so things should be fine.
  • If more than 0.5M variables or 1.5M binary clauses are in the problem, on-the-fly hyper-binary resolution and transitive reduction is turned off during 1st-decision-level search. This is kind of like probing, but during search. However, unlike probing, it cannot time-out and switch off these systems in case they takes too much time.
  • Diversified the threads’ parameters. Also, I reduced the number of threads to 8. I wanted to run with 12, but it’s a bit dangerous from a memory-usage perspective: there is only 24GB available for 12 cores, which means 2GB/core. Some problems take more than 2GB just to parse into the watchlists.
  • Check total memory usage at startup of threads and if too much, halve the total number of threads. This is an emergency measure in case things go wild due to very weird CNF.

Overall these are small changes but allow for a much faster startup. For example a notorious problem, AProVE07-11.cnf, now starts in 2.5s instead of 4.5s. As far as I can tell, this is very similar to the startup time of lingeling on this instance. However, lingeling only uses 1/3rd the memory (~250MB) thanks to a more tight memory manager plus I suspect it doesn’t have a couple of datastructs that I keep around.

The other difference, about OTF hyper-binary resolution, allows for large problems to actually get to the point of solving instead of getting stuck at adding and removing useless binary clauses. Let’s hope all goes well for the competition :)