
1 Introduction

State-of-the-art automated reasoning solvers are critical software systems used throughout formal methods. However, even skilled and trusted developers of such tools can inadvertently introduce errors. Two approaches have evolved to provide assurances that automated reasoning tools behave as intended. The first involves the use of theorem provers to formally verify the correctness of solver implementations [20, 30]. This approach guarantees correct outputs for all inputs, but struggles to scale to complex systems such as SAT solvers. The second approach is based on certifying algorithms [38], where a solver is required to produce a certificate alongside its output [6, 10, 24, 35, 37, 53, 58]. A certificate checker (also called a proof checker)—which is often formally verified—then checks the correctness of this certificate, ensuring that the system’s output adheres to the desired specifications. This latter method has gained significant traction in the SAT solving community, where a SAT solver returns either a satisfying assignment, which is easy to check by evaluation, or a proof of unsatisfiability as a certificate [58]. However, neither of these approaches has been applied to probabilistic systems that rely on randomized algorithms. In fact, McConnell et al. [38] argue that randomized algorithms resist deterministic certification.

In this paper, we propose a hybrid approach that harnesses the power of both theorem-proving and certificate-based approaches to certify probabilistic systems. We present our approach on \(\textsf{ApproxMC}\), a probabilistic automated reasoning system that computes approximate model counts for Boolean formulas. Model counting is a fundamental problem in computer science that serves as a key component in a wide range of applications, including control improvisation [22], network reliability [14, 56], neural network verification [5], and probabilistic reasoning [11, 18, 45, 46]. Therefore, it is crucial that the results computed by an approximate model counter, such as \(\textsf{ApproxMC}\), can be trusted.

Two key questions must be tackled by our approach. First, what does it mean to trust a random run of \(\textsf{ApproxMC}\)? Here, we propose a verification modulo randomness approach, i.e., our certification results are modulo a trusted random bit generator. Second, how do we handle the huge volume of (incremental) CNF-XOR satisfiability solver calls that are tightly integrated into \(\textsf{ApproxMC}\) [49, 50]? Here, we design the certificate format to require only the results of solver calls that are crucial for \(\textsf{ApproxMC}\)’s correctness. In particular, \(\textsf{ApproxMC}\) makes \(\mathcal {O}(\varepsilon ^{-2} \cdot \log n \cdot \log \delta ^{-1} )\) calls to its solver, where n is the number of (projected) variables of the formula, \(\varepsilon \) is the tolerance parameter, and \(\delta \) is the confidence parameter (see Sect. 3 for definitions); our crucial insight is that to certify \(\textsf{ApproxMC}\), we only need to check the correctness of \(\mathcal {O}(\log \delta ^{-1} )\) UNSAT calls, which is independent of n. We then observe that existing CNF-XOR UNSAT checkers fail to scale to formulas that are handled by \(\textsf{ApproxMC}\). To address this, we adapt existing solving and verified proof checking pipelines to natively support proof certificates for CNF-XOR unsatisfiability. With this design, our framework is able to independently check certificates generated by a state-of-the-art (but untrusted) implementation of \(\textsf{ApproxMC}\), with all of the latter’s optimizations enabled. Overall, the key idea is to combine a static, once-off, formal proof of the algorithm’s correctness guarantee in Isabelle/HOL [42, 43] with dynamic, per-run certification of \(\textsf{ApproxMC}\)’s calls to an external CNF-XOR solver.

In summary, our contributions are as follows:

  1. An abstract specification of \(\textsf{ApproxMC}\) and a formal proof of its probably approximately correct (PAC) guarantee in Isabelle/HOL (Sect. 4.1).

  2. A refinement of the abstract specification to a concrete certificate format and checker implementation for \(\textsf{ApproxMC}\) (Sects. 4.2 and 4.3).

  3. Updates to various tools to realize a formally verified proof checking pipeline with native support for CNF-XOR unsatisfiability (Sect. 4.4).

  4. Empirical evaluation of the framework on an extensive suite of model counting benchmarks to demonstrate its practical utility (Sect. 5).

Fig. 1. The certified approximate model counting workflow.

Our workflow for certified approximate model counting is shown in Fig. 1. In the first step, a trusted external tool generates uniform random bits, which are handed both to an untrusted certificate generator \(\textsf{ApproxMCCert}\) and to the verified certificate checker \(\textsf{CertCheck}\) (extracted from Isabelle/HOL); the random bits are used identically by \(\textsf{ApproxMCCert}\) and \(\textsf{CertCheck}\) to generate random XOR constraints as part of the counting algorithm. In the second step, \(\textsf{ApproxMCCert}\) generates a partial certificate, which is subsequently checked in the third step by \(\textsf{CertCheck}\); the certificate is partial because it does not contain CNF-XOR unsatisfiability proofs. Instead, \(\textsf{CertCheck}\) calls an external CNF-XOR unsatisfiability checking pipeline (with verified proof checking in CakeML [36, 53]). In the final step, an approximate model count is returned upon successful certification.

As part of our commitment to reproducibility, all code and proofs have been made available with a permissive open-source license [2, 21, 54].

Impact. Although our main objective was to enhance end-user trust in answers to their counting queries, undertaking this project led to unexpected benefits that are worth highlighting. While modifying \(\textsf{ApproxMC}\)’s underlying solver, CryptoMiniSat [52], to emit certificates (Sect. 4.4), we discovered a bug in CryptoMiniSat’s XOR manipulation system. The bug was introduced during the development of the part of the BIRD system [50] that keeps the clausal versions of all XOR constraints (as well as their compact XOR versions) in memory at all times. This allows a substantial level of interaction between XOR and clausal constraints. However, it also incurs a large overhead, since encoding the XORs in clausal form often requires hundreds of thousands of clauses. The compromise made by the developers was to detach the clausal representation of XORs from the watchlists. However, that appears to have led to a level of complexity that both allowed the bug to occur and, more importantly, made it impossible to discover via CryptoMiniSat’s standard fuzzing pipeline. Our version of CryptoMiniSat fixes this by not keeping a clausal encoding of all XORs; instead, clausal encodings are introduced (and deleted) whenever they are needed for the proof.

Furthermore, we have also found minor flaws in the theoretical analysis of \(\textsf{ApproxMC}\) (see Remark 2) and in the implementation, e.g., the sampling of random bits was slightly biased, and an infinite loop could be triggered on certain random seeds. None of these bugs were known to the authors of \(\textsf{ApproxMC}\) or were previously reported by users of the tool. All of these issues have been fixed and upstreamed to the respective tools’ codebases.

2 Related Work

This discussion is focused on formally verified algorithms and proof checkers. Readers are referred to Chakraborty et al. [13] and references therein for related literature on approximate model counting.

Certified Model Counting. Prior certificate-based research in model counting focuses on deterministic methods. Work on certified exact model counting has focused either on the development of proof systems, such as MICE [19] and CPOG [10], along with their respective toolchains, or on analyzing the complexity of the proof system [7]. Some efforts have been directed toward certifying deterministic approximate counting algorithms, which, however, require access to a \(\mathrm {\Sigma ^{P}_2}\) oracle and have not yielded practical implementations [40]. Our work develops the first certification framework for randomized approximate model counting.

Formalization of Randomized Algorithms. Various randomized algorithms have been formally analyzed in Isabelle/HOL, including randomized quicksort, random binary tree data structures [15], and approximation of frequency moments in data streams [31,32,33,34]. These prior efforts, as well as ours, build upon the foundations for measure and probability theory in Isabelle/HOL [16, 28]. Properties of approximate membership query structures (including Bloom filters) have been verified in Coq [26]. Pioneering work on formal verification of randomized algorithms, including the Miller-Rabin primality test, was carried out by Joe Hurd in HOL4 [29]. A common objective of these prior efforts and ours is to put the guarantees of randomized algorithms on formal foundations.

Verified Proof Checking. Formally verified proof checkers have been developed for several (deterministic) algorithms and theories, such as the CNF unsatisfiability checkers used by the SAT community [27, 37, 53]. Within Isabelle/HOL, the Pastèque tool [35] checks proofs in the practical algebraic calculus, which can be used to validate algebraic reasoning; the CeTA tool [55] is based on an extensive library of results for certifying properties of rewriting systems; and the LEDA project developed specialized proof checkers for graph algorithms [1]. CoqQFBV [47] is similar in design to our approach in that a higher-level Coq-generated tool for verified bit-blasting is used in concert with a lower-level verified proof checker for CNF formulas.

CNF-XOR Unsatisfiability Checking. Given \(\textsf{ApproxMC}\)’s reliance on CNF-XOR formulas, certification of CNF-XOR unsatisfiability emerged as a key challenge in our work. To this end, we provide a brief overview of three prior state-of-the-art approaches for certified CNF-XOR reasoning.

  1. The first approach uses proof generation and certification of XOR reasoning based on Binary Decision Diagrams (BDDs) [48]. It uses CryptoMiniSat [52], a SAT solver specifically designed to handle CNF-XOR instances, together with TBUDDY [9] to produce FRAT proof certificates [3] for CryptoMiniSat’s XOR reasoning; FRAT-rs [3] is used as the elaboration backend, and a verified LRAT proof checker [27, 53] can be used to check the elaborated proofs.

  2. The second approach, due to Gocht and Nordström [24], relies on pseudo-Boolean reasoning and its associated proof system to justify both CNF and parity reasoning. This approach was demonstrated on MiniSat equipped with an XOR reasoning engine, with \(\textsf{VeriPB}\) as a proof checker; pseudo-Boolean proofs are also supported by a verified proof checker [23].

  3. The third approach is to rely on standard SAT solvers together with standard CNF proof formats and (verified) checkers [27, 37, 53].

3 Background

This section gives a brief introduction to \(\textsf{ApproxMC}\) (Sect. 3.1) and to theorem-proving in Isabelle/HOL (Sect. 3.2).

3.1 Approximate Model Counting

Given a Boolean formula F, the model counting problem is to calculate the number of models (also called solutions or satisfying assignments) of F. Model counting is known to be #P-complete, and therefore has been a target of sustained interest for randomized approximation techniques over the past four decades. The current state-of-the-art approximate approach, \(\textsf{ApproxMC}\) [12], is a hashing-based framework that relies on reducing the model counting problem to SAT queries, which are handled by an underlying solver. Importantly, \(\textsf{ApproxMC}\) is a probably approximately correct (PAC) projected model counter, i.e., it takes in a formula F, a projection set \(S\subseteq \textsf{Vars }(F)\), a tolerance parameter \(\varepsilon > 0\), and a confidence parameter \(\delta \in (0,1]\), and returns a count c satisfying the PAC guarantee: \(\Pr \left[ \frac{|{\textsf{sol}({F})}_{\mathsf {\downarrow }S}|}{1+\varepsilon } \le c \le (1+\varepsilon )|{\textsf{sol}({F})}_{\mathsf {\downarrow }S}|\right] \ge 1 - \delta \), where \(|{\textsf{sol}({F})}_{\mathsf {\downarrow }S}|\) denotes the number of the solutions of F projected on \(S\).
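For a concrete (purely illustrative) reading of this guarantee, the following Python check restates the \((1+\varepsilon )\)-interval condition; the numbers reuse the pigeon-hole example of Fig. 2, which has 180 solutions and uses \(\varepsilon =0.8\).

    def within_pac_interval(c, true_count, eps):
        """The (1+eps)-approximation condition from the PAC guarantee above."""
        return true_count / (1 + eps) <= c <= (1 + eps) * true_count

    # 180/1.8 = 100 and 180*1.8 = 324, so any reported count in [100, 324] is acceptable;
    # ApproxMC guarantees such a count with probability at least 1 - delta.
    print(within_pac_interval(c=200, true_count=180, eps=0.8))   # True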

[Algorithms 1 and 2: pseudocode for \(\textsf{ApproxMC}\) and \(\textsf{ApproxMCCore}\)]

An outline of \(\textsf{ApproxMC}\) is shown in Algorithms 1 and 2. At a high level, the key idea of \(\textsf{ApproxMC}\) is to partition the set of solutions into small cells of roughly equal size using XOR-based hash families [12, 25], to randomly pick one of the cells, and to enumerate all the solutions in the chosen small cell up to a threshold \(\textsf{thresh}\) via calls to \(\textsf{BoundedSAT}(F, S, \textsf{thresh})\). The estimated count is obtained by scaling the number of solutions in the randomly chosen cell by the number of cells, and the success probability of this estimation is amplified to the desired level by taking the median result from several trials.
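To make this outline concrete, here is a minimal Python-style sketch of the hashing-and-median structure; the oracle parameters sample_xors and bounded_sat are illustrative stand-ins for the random hash-constraint generator and the \(\textsf{BoundedSAT}\) call, and the sketch omits all of the optimizations used by the actual \(\textsf{ApproxMC}\) implementation (discussed below).

    from statistics import median

    def approxmc_core(F, S, thresh, sample_xors, bounded_sat):
        """One round: add XOR constraints until the chosen cell is small, then scale up."""
        n = len(S)
        for m in range(1, n + 1):                    # number of XOR constraints to add
            xors = sample_xors(S, m)                 # random hash: selects one cell
            cell = bounded_sat(F, S, xors, thresh)   # solutions in that cell, up to thresh
            if len(cell) < thresh:
                return (2 ** m) * len(cell)          # cell size scaled by the number of cells
        return None                                  # no suitable m found in this round

    def approxmc(F, S, thresh, rounds, sample_xors, bounded_sat):
        """Amplify the constant per-round success probability by taking the median."""
        estimates = [approxmc_core(F, S, thresh, sample_xors, bounded_sat)
                     for _ in range(rounds)]
        return median(e for e in estimates if e is not None)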

Syntactically, the solution space partition and random cell selection are accomplished by introducing randomly generated XOR constraints of the form \((\bigoplus _{y \in Y} y) = b\) for a random subset \(Y \subseteq S\) and random bit b. A crucial fact about random XOR constraints exploited by \(\textsf{ApproxMC}\) is their 2-universality when viewed as a hash family on assignments—briefly, given any two distinct Boolean assignments over the variable set \(S\), each one satisfies a randomly chosen XOR constraint with probability \(\frac{1}{2}\), and the two events are independent.
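The 2-universality property can be checked by brute force on a tiny example. The following Python snippet (a sanity check only, not part of the formal development) enumerates all XOR constraints \((\bigoplus _{y \in Y} y) = b\) over a three-variable set and confirms that two fixed distinct assignments each satisfy a random constraint with probability \(\frac{1}{2}\), and that the two events are independent.

    from itertools import chain, combinations

    S = ("x1", "x2", "x3")

    def all_xors(variables):
        """All pairs (Y, b) with Y a subset of the variables and b a bit."""
        subsets = chain.from_iterable(combinations(variables, r) for r in range(len(variables) + 1))
        return [(set(Y), b) for Y in subsets for b in (0, 1)]

    def satisfies(assignment, Y, b):
        """Does the assignment satisfy (XOR of the variables in Y) = b?"""
        return sum(assignment[v] for v in Y) % 2 == b

    sigma1 = {"x1": 0, "x2": 1, "x3": 1}
    sigma2 = {"x1": 1, "x2": 1, "x3": 0}   # any assignment distinct from sigma1 works

    xors = all_xors(S)
    p1 = sum(satisfies(sigma1, Y, b) for Y, b in xors) / len(xors)
    p2 = sum(satisfies(sigma2, Y, b) for Y, b in xors) / len(xors)
    p12 = sum(satisfies(sigma1, Y, b) and satisfies(sigma2, Y, b) for Y, b in xors) / len(xors)
    print(p1, p2, p12)   # 0.5 0.5 0.25: each probability is 1/2 and the events are independent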

Accordingly, the \(\textsf{BoundedSAT}\) queries made in Algorithms 1 and 2 are conjunctions of the input formula and random XOR constraints, i.e., CNF-XOR formulas. The current implementation of \(\textsf{ApproxMC}\) relies on CryptoMiniSat for its ability to handle CNF-XOR formulas efficiently and incrementally [49, 50]. Furthermore, the real-world implementation also relies on three key optimizations. (1) The search for the correct value of m in Algorithm 2 (\(\textsf{FindM}\)) combines a linear neighborhood search, a galloping search, and a binary search [12]; a sketch of this search strategy is given below. (2) The underlying SAT solver is used as a library, allowing solving under a set of assumptions, a technique introduced as part of MiniSat [17]. This allows the solver to keep learned lemmas between subsequent calls to solve(), significantly improving solving speed, which is especially helpful for proving unsatisfiability. (3) To improve the speed of finding satisfying assignments, a cache of past solutions is retained [49], which is especially helpful when the optimal number of XOR constraints to add is N, but N+1 have been added and were found to be too many. In these cases, all solutions that are valid with N+1 XORs are also solutions with N XORs and can be reused.
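Since the cell size is monotonically non-increasing as XORs are added, the smallest suitable m can be located with an exponential (galloping) probe followed by a binary search. The sketch below illustrates this search strategy over an abstract monotone predicate; it is a simplification of \(\textsf{ApproxMC}\)’s actual \(\textsf{FindM}\) routine, which additionally uses a linear search around the previous round’s value of m.

    def find_first_true(pred, lo, hi):
        """Smallest m in [lo, hi] with pred(m) True, assuming pred is monotone
        (False ... False True ... True); returns None if pred is False on all of [lo, hi].
        Here pred(m) stands for 'the cell induced by m random XORs has fewer than thresh solutions'."""
        if lo > hi:
            return None
        # Galloping phase: probe with exponentially growing steps until pred flips or hi is reached.
        prev, m, step = lo - 1, lo, 1
        while not pred(m):
            if m == hi:
                return None            # pred is False on the entire interval
            prev, m = m, min(m + step, hi)
            step *= 2
        # The first True index lies in (prev, m]; narrow it down by binary search.
        lo_b, hi_b = prev + 1, m
        while lo_b < hi_b:
            mid = (lo_b + hi_b) // 2
            if pred(mid):
                hi_b = mid
            else:
                lo_b = mid + 1
        return lo_b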

3.2 Formalization in Isabelle/HOL

Notation. All Isabelle/HOL syntax is typeset in typewriter font with boldface Isar keywords; ⋀ and ⟹ are the universal quantifier and implication of Isabelle’s metalogic, respectively. Type variables are written as 'a, 'b, etc. The type of (total) functions from 'a to 'b is written as 'a ⇒ 'b, and the type of partial functions, which are only defined on some elements of type 'a, is 'a ⇀ 'b. For clarity, we often annotate terms with their type using the notation t :: 'a. For types such as reals, integers, or natural numbers, the interval from a to b (inclusive) is written as {a..b}; the same interval except endpoint b is {a..<b}. More comprehensive introductions can be found in standard references [4, 42].

Locales and Probability. Isabelle/HOL is equipped with locales [4], a system of user-declared modules consisting of syntactic parameters, assumptions on those parameters, and module-specific theorems. These modules can be instantiated and inherited, giving users a powerful means of managing mathematical relationships. The following snippet, taken from the Isabelle/HOL standard library, shows an example locale declaration for probability spaces followed by an interpretation command claiming that the measure space associated with any probability mass function (PMF) is a probability space [28].

[Isabelle/HOL listing]

Thanks to the locale interpretation, all definitions and theorems associated with probability spaces can be used with PMFs. For example, the probability of an event occurring under is . The support of PMF is , which is for all PMFs considered in this work.

4 Approximate Model Counting in Isabelle/HOL

This section outlines our formalization of \(\textsf{ApproxMC}\) in Isabelle/HOL and its verified certificate checker implementation. The proof follows a refinement-based approach, starting with an abstract mathematical specification of \(\textsf{ApproxMC}\), where its probabilistic approximation guarantees can be formalized without low-level implementation details getting in the way (Sect. 4.1). Then, the abstract specification is progressively concretized to a verified certificate checker which we call \(\textsf{CertCheck}\) (Sect. 4.2) and we extend \(\textsf{ApproxMC}\) to \(\textsf{ApproxMCCert}\), a certificate-generating counter (Sect. 4.3). As part of \(\textsf{CertCheck}\), we also built a native CNF-XOR unsatisfiability checker, which is external to Isabelle/HOL, but is also based on formally verified proof checking (Sect. 4.4).

4.1 Abstract Specification and Probabilistic Analysis

Throughout this section, the type abstracts the syntactic representation of variables. For example, in the DIMACS CNF format, variables are represented with positive numbers, while in other settings, it may be more convenient to use strings as variable names. A solution (or model) is a Boolean-valued function on variables and a projection set is a (finite) set of variables. The main result of this section is formalized in a locale with two parameters , , and an assumption relating the two:

[Isabelle/HOL listing]

Here, type abstracts the syntactic representation of formulas, is the set of all solutions of a formula , and is a formula whose set of solutions satisfies both and the XOR constraint . An instantiation of the locale would need to provide implementations of , and prove that they satisfy the latter assumed property.

The PAC theorem for \(\textsf{ApproxMC}\) is formalized as follows:

[Isabelle/HOL listing]

Here, is the true count of projected solutions, i.e., the cardinality of the set , interpreted as a real number. The conclusion says that returns an -approximate count with probability at least . The argument is a user-specifiable minimum number of iterations of \(\textsf{ApproxMCCore}\) calls inside \(\textsf{ApproxMC}\); in practice, a sufficient number of rounds is automatically determined using the median method. Since the locale can be instantiated for any Boolean theory in which XOR constraints can be syntactically encoded, this theorem shows that the approximate model counting algorithm of Chakraborty et al. [12] works for any such theory.

The rest of this section gives an overview of our proof of . Technical differences compared to the original proofs are discussed in remarks.

Formalized Analysis of ApproxMCCore. For simplicity, we write for the type of solutions projected onto set and for n-dimensional bit-vectors, i.e., the type of Boolean-valued functions on domain \(0,1,\dots ,n-1\). A hash function maps projected solutions into n-dimensional bit-vectors. Let be any set of solutions, such as . Abstractly, \(\textsf{ApproxMCCore}\) is a way of approximating the cardinality of the projected set , given an oracle that can count up to a specified threshold number of solutions. Without loss of generality, assume (otherwise, the oracle returns the exact count).

Remark 1

The simple type theory of Isabelle/HOL does not support dependent function types like and . Our formalization represents functions with type as partial functions along with an assumption that their function domain is equal to .

For any fixed bit-vector , the sets of hash functions , , and used in the analysis are defined as follows, where counts the number of entries of such that the hash value agrees with on their first entries (also called the i-th slices).

[Isabelle/HOL listing]

For any input hash function , the following function (cf. Algorithm 2 Lines 2–5) finds the first index , if one exists in , where . It returns the approximate model count as a multiplier ( ) and cell size ( ). The failure event is the set of hash functions such that returns a non- -factor-approximate count.

[Isabelle/HOL listing]
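For intuition only, the following Python sketch mirrors this search over slice counts; slice_count and the index range are illustrative simplifications of the formalized definitions, which operate on the abstract hash-family types.

    def slice_count(solutions, h, alpha, i):
        """Number of solutions whose hash agrees with the fixed bit-vector alpha
        on the first i entries (the i-th slice)."""
        return sum(1 for s in solutions
                   if all(h(s)[j] == alpha[j] for j in range(i)))

    def core_estimate(solutions, h, alpha, n, thresh):
        """Return (multiplier, cell size) at the first index whose slice count
        falls below thresh, or None if no such index exists."""
        for i in range(1, n):
            c = slice_count(solutions, h, alpha, i)
            if c < thresh:
                return (2 ** i, c)   # the estimate is the product of the two
        return None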

The key lemma for (shown with proof sketch below) is that, for hash functions , which are randomly sampled from an appropriate hash family , the probability of the aforementioned failure event is bounded above by 0.36 [12]. The lemma uses Isabelle/HOL’s formalization of hash families, which is seeded [31], i.e., is a PMF on seeds and is a -universal hash family for seeds drawn from ; is a PMF which samples a random seed and then returns the hash function associated with that seed according to the family .

[Isabelle/HOL listing]

Proof

The proof of proceeds via several sub-lemmas [12], which we discuss inline below. We first show that an index exists with the following properties ( is the Isar keyword for existential claims):

[Isabelle/HOL listing]

This is proved by noting that there exists satisfying the first two properties separately in the finite interval , so there must be an satisfying all three properties in that interval.

Next, the failure event (which is a set of hash functions) is proved to be contained in the union of four separate events involving using the properties from and unfolding the respective definitions of , , and :

[Isabelle/HOL listing]

Finally, we bound the probability for each of the four events separately.

[Isabelle/HOL listing]

Lemma follows from , , and the union bound on probabilities.    \(\square \)

Remark 2

Our implicit construction of in avoids an explicit calculation from , and  [12], which is more intricate to analyze. Additionally, in , the first bound for works only when , an omitted condition from the pen-and-paper proof [12, Lemma 2]; we also verified a looser bound of without this condition, but this leads to a weaker overall guarantee for \(\textsf{ApproxMCCore}\) (which we do not use subsequently).

Formalized Analysis of ApproxMC. Random XORs and XOR-based hash families are defined as follows:

[Isabelle/HOL listing]

Here, is the PMF which samples a pair of a uniformly randomly chosen subset of the (projection) variables and the outcome of a fair coin flip; is the PMF that samples independent XORs according to . Given randomly chosen seed , the associated hash function takes a projected solution to the bit-vector whose bit indicates whether the -th XOR is satisfied by .

The following definition of (cf. Algorithm 2) randomly samples XOR constraints over the variables and runs ( instantiated with XOR-based hash families using ). The top-level function (cf. Algorithm 1) selects appropriate values for and the number of rounds for amplification using the median method.

[Isabelle/HOL listing]

The main result follows from 2-universality of XOR-based hash families and the facts that returns a correct value of and chooses a sufficient number of rounds for the median method.

Library Contributions. We added reusable results to Isabelle/HOL’s probability libraries, such as the Paley–Zygmund inequality (a concentration inequality used in the analysis of \(\textsf{ApproxMCCore}\)) and a slightly modified (tighter) analysis of the median method based on the prior formalization by Karayel [31, 33]; the latter modification does not change the asymptotic analysis of the method, but it is needed because \(\textsf{ApproxMC}\) implementations use the tighter calculation to reduce the number of rounds for success probability amplification.
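For intuition, a standard Hoeffding-style argument (coarser than the tighter analysis formalized in the library) already explains the \(\mathcal {O}(\log \delta ^{-1} )\) number of rounds: if each round is \((1+\varepsilon )\)-correct with probability at least \(1 - 0.36\), and \(X_i\) indicates that round i fails, then the median of t independent rounds can fail only when at least t/2 rounds fail, so

\[
\Pr [\text {median fails}] \;\le \; \Pr \Big [\textstyle \sum _{i=1}^{t} X_i \ge t/2\Big ] \;\le \; \exp \big (-2t\,(1/2 - 0.36)^2\big ) \;\le \; \delta \quad \text {whenever} \quad t \ge \frac{\ln (1/\delta )}{2\,(0.14)^2}.
\]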

We also formalized the 3-universality of XOR-based hash families [25], which implies its 2-universality, as needed by \(\textsf{ApproxMC}\). The proof is sketched in the online extended version of this paper. Our (new) proof is of independent interest as it is purely combinatorial, using a highly symmetric case analysis which helps to reduce formalization effort because many cases can be proved using without-loss-of-generality-style reasoning in Isabelle/HOL.

4.2 Concretization to a Certificate Checker

The specification from Sect. 4.1 leaves several details abstract. For example, refers to set cardinalities and uses a bounded solution counter as an oracle, neither of which are a priori computable terms. This section gives a concrete implementation strategy where the abstract details are obtained from certificates generated by an untrusted external implementation, and checked using verified code. The main result is formalized in a locale with two key extensions compared to from Sect. 4.1: (i) the locale, switching from set-based to computable list-based representations for the projection set and XORs; (ii) the additional locale parameters determining whether a formula is satisfied by a specified assignment, and that syntactically blocks a solution from further consideration.

[Isabelle/HOL listing]

The correctness of the checker (shown below) has two conjuncts in its conclusion. In both conjuncts, models an external (untrusted) implementation returning a certificate and is a random seed passed to both and . The checker either returns an error string ( ) or a certified count. The soundness guarantee (left conjunct) says that the probability of the checker returning an incorrect count (without error) is bounded above by . Note that for a buggy counter that always returns an invalid certificate, returns an error for all random seeds, i.e., it returns a count (whether correct or not) with probability 0. Thus, the promise-completeness guarantee (right conjunct) says that if the function is promised to return valid certificates for all seeds , then the checker returns a correct count with probability .

[Isabelle/HOL listing]

Additional differences in compared to are: (iii) the oracle function , which is assumed to be an interface to an external unsatisfiability checker; (iv) the additional certificate arguments and ; and (v) the eager sampling of XORs using random bits ( ), compared to which samples lazily.

Remark 3

Note that and are locale parameters with assumptions that must be proven when is instantiated to a Boolean theory; in contrast, appears as an assumption. The pragmatic reason for this difference is that and can be readily implemented in Isabelle/HOL with decent performance. In contrast, developing efficient verified unsatisfiability proof checkers and formats, e.g., for CNFs, is still an active area of research [3, 27, 37, 53]. Leaving outside the scope of Isabelle/HOL allows us to rely on these orthogonal verification efforts (as we do in Sect. 4.4).

Fig. 2. An example pigeon-hole formula (2 pigeons, 5 holes, 180 solutions) in DIMACS format and a valid certificate for the checker at \(\varepsilon =0.8\) and \(\delta =0.2\). The certificate is shown with colored comments and with redundant spaces added for clarity. In clauses, the negative (resp. positive) integers are negated (resp. positive) literals, with a 0 terminator; solutions are lists of literals assigned to true. Part of the certificate is checked with an external UNSAT proof checking pipeline.

From to . We briefly list the steps in transporting the PAC guarantee from to , with reference to the differences labeled (i)–(v) above. The proof follows a sequence of small refinement steps which are individually straightforward as they do not involve significant probabilistic reasoning. First, cf. (v), a variant of is formalized where all XORs are eagerly sampled upfront, as opposed to lazily at each call to . Without loss of generality, it suffices to sample \(\times \) XORs. Next, cf. (i), the representations are swapped to executable ones, e.g., the projection set is represented as a list of distinct elements. Accordingly, the left-hand side of each XOR is represented as a list of bits, where the -th bit indicates whether the -th entry of is included in the XOR. Note that it suffices to sample \(\times \) \(\times \) bits for \(\textsf{ApproxMC}\). Finally, cf. (iv), partial certificates are introduced. The key observation is that the final value of in from Sect. 4.1 can be readily certified because it is the first entry where adding XORs causes the solution count to fall below —the solution count is monotonically decreasing as more XORs are added. Thus, for a claimed value of it suffices to check, cf. (ii) and (iii) that the following three conditions hold. (1) Firstly, . (2) Secondly, the solution count after adding XORs reaches or exceeds , which can be certified ( ) by a list of solutions of length at least , which are distinct after projection on . (3) Thirdly, if , then the solution count after adding XORs is below , which can be certified ( ) by a list of solutions of length below , which are distinct after projection, and where the formula after excluding all those projected solutions ( ) is unsatisfiable ( ).

An example partial certificate is shown in Fig. 2. Note that we call these partial certificates because of the reliance on an external pipeline for checking unsatisfiability, as illustrated in the example.
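To summarize, a per-round check of a partial certificate reduces to the three conditions above. The following Python sketch captures this logic at a high level; the helpers sat_check, proj_distinct, and unsat_oracle are illustrative stand-ins for the corresponding locale parameters and the external UNSAT checking pipeline, and the precise bounds and guards follow the formalization rather than this sketch.

    def check_round(m, n, thresh, sols_before, sols_after, sat_check, proj_distinct, unsat_oracle):
        """Per-round partial-certificate check for a claimed number m of XOR constraints.

        sols_before: claimed solutions of the formula with m-1 XORs added
        sols_after:  claimed solutions of the formula with m XORs added
        sat_check(i, s):       does assignment s satisfy the formula with i XORs added?
        proj_distinct(sols):   are the solutions distinct after projection onto the projection set?
        unsat_oracle(i, sols): is the formula with i XORs unsatisfiable once sols are blocked?
        """
        # (1) The claimed m must lie in the admissible range.
        if not (1 <= m <= n):
            return False
        # (2) With m-1 XORs the cell is still large: at least thresh distinct projected solutions.
        if len(sols_before) < thresh or not proj_distinct(sols_before):
            return False
        if not all(sat_check(m - 1, s) for s in sols_before):
            return False
        # (3) With m XORs the cell is small: fewer than thresh solutions, and no others exist,
        #     certified by an UNSAT proof for the formula with all listed solutions blocked
        #     (in the formalization this check is guarded by a side condition on m).
        if len(sols_after) >= thresh or not proj_distinct(sols_after):
            return False
        if not all(sat_check(m, s) for s in sols_after):
            return False
        if not unsat_oracle(m, sols_after):
            return False
        return True   # the round then contributes the estimate 2**m * len(sols_after)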

Code Extraction for CertCheck. To obtain an executable implementation of , we instantiated the Isabelle/HOL formalization with a concrete syntax and semantics for CNF-XOR formulas, and extracted source code using Isabelle/HOL’s Standard ML extraction mechanism. The extracted implementation is compiled together with user interface code, e.g., file I/O, parsing, and interfacing with a trusted random bit generator and CNF-XOR unsatisfiability checking, as shown in Fig. 1. The resulting tool is called \(\textsf{CertCheck}\).

4.3 Extending ApproxMC to ApproxMCCert

To demonstrate the feasibility of building a (partial) certificate generation tool, we modified the mainline implementation of \(\textsf{ApproxMC}\) to accept and use an externally generated source of random bits. We also modified it to write its internally calculated values of and a log of the respective models reported by its internal solver to a file. The resulting tool is called \(\textsf{ApproxMCCert}\). An implementation of \(\textsf{ApproxMC}\) (and thus \(\textsf{ApproxMCCert}\)) requires logarithmically many solver calls to find the correct value of and it can employ many search strategies [12]. The partial certificate format is agnostic to how is found, requiring certification only for the final value of in each round.

Remark 4

It is worth remarking that \(\textsf{CertCheck}\) requires checking the validity of \(\mathcal {O}(\varepsilon ^{-2} \cdot \log \delta ^{-1} )\) solutions (each of size n, the number of variables), and unsatisfiability for \(\mathcal {O}(\log \delta ^{-1} )\) formulas, while \(\textsf{ApproxMC}\) requires \(\mathcal {O}(\varepsilon ^{-2} \cdot \log n \cdot \log \delta ^{-1} )\) calls to its underlying solver. In the next section, we instantiate with a CNF-XOR unsatisfiability checking pipeline that generates proofs which are checkable by a verified checker in polynomial time (in the size of the proofs).

4.4 CNF-XOR Unsatisfiability Checking

A crucial aspect of \(\textsf{CertCheck}\) is its reliance on an external checker for unsatisfiability of CNF-XOR formulas. As mentioned in Sect. 2, there are several prior approaches for certified CNF-XOR reasoning that can be plugged into \(\textsf{CertCheck}\).

We opted to build our own native extension of FRAT [3] because none of the previous options scaled to the level of efficient XOR proof checking needed for certifying \(\textsf{ApproxMC}\) (as evidenced later in Sect. 5). For brevity, the new input and proof format(s) are illustrated with inline comments in Fig. 3. We defer a format specification to the tool repository.

In a nutshell, CryptoMiniSat has been improved so that, given an input CNF-XOR formula, it emits an unsatisfiability proof in our extended FRAT-XOR format. Then, our FRAT-xor tool elaborates the proof into XLRUP, our extension of Reverse Unit Propagation (RUP) proofs [27] with XOR reasoning. The latter format can be checked using cake_xlrup, our formally verified proof checker. Such an extension to FRAT was suggested as a possibility by Baek et al. [3], and we bear their claim out in practice.

Fig. 3. (top left) A sample input CNF-XOR formula, where XOR lines start with x and indicate the literals that XOR to 1, e.g., the line x 1 2 -3 represents the XOR constraint \(x_1 \oplus x_2 \oplus \bar{x}_3 = 1\); (bottom left) a FRAT-XOR proof; (right) an XLRUP proof. The steps in bold indicate newly added XOR reasoning. Note that the XOR steps are (mostly) syntactically and semantically unchanged going from FRAT-XOR to XLRUP, so we focus on the latter here. The meaning of each XLRUP step (and analogously each FRAT-XOR step) is annotated in color-coded comments above the respective line.
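As an illustration of the input convention described in the caption (not part of the actual toolchain), a hypothetical Python parser for such XOR lines could normalize each line into a (variables, parity) pair, using the fact that a negated literal \(\bar{x} = 1 \oplus x\) simply flips the right-hand side.

    def parse_xor_line(line):
        """Parse an XOR line such as "x 1 2 -3" into (variables, rhs), meaning
        that the XOR of the listed variables equals rhs.  The line asserts that
        the listed literals XOR to 1; each negated literal -v contributes the
        variable v and flips the right-hand side."""
        tokens = line.split()
        assert tokens[0] == "x"
        variables, rhs = set(), 1
        for tok in tokens[1:]:
            lit = int(tok)
            if lit == 0:              # tolerate a trailing 0 terminator, as on clause lines
                break
            variables ^= {abs(lit)}   # repeating a variable cancels it out
            if lit < 0:
                rhs ^= 1
        return variables, rhs

    print(parse_xor_line("x 1 2 -3"))   # ({1, 2, 3}, 0), i.e., x1 + x2 + x3 = 0 over GF(2)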

Extending FRAT-rs to FRAT-xor. Our FRAT-xor tool adds XOR support to FRAT-rs [3], an existing tool for checking and elaborating FRAT proofs. This extension is designed to be lightweight: FRAT-xor does not track XORs nor check the correctness of any XOR-related steps; instead, it defers that job to an underlying verified proof checker. Our main changes were: (i) adding parsing support for XORs; (ii) ensuring that clauses implied from XORs can be properly used for further clausal steps, including automatic elaboration of RUP [3]; and (iii) ensuring that the clauses used to imply XORs are trimmed from the proof at the proper points, i.e., after their last usage by either a clausal or an XOR step.

Extending cake_lpr to cake_xlrup. We also modified cake_lpr [53], a verified proof checker for CNF unsatisfiability, to support reasoning over XOR constraints. The new tool supports: (i) clause-to-clause reasoning via RUP steps; (ii) deriving new XORs by adding together XORs; and (iii) XOR-to-clause and clause-to-XOR implications. The main challenge here was to represent XORs efficiently at the byte level so as to take advantage of native machine instructions in XOR addition steps. The final verified correctness theorem for cake_xlrup is similar to that of cake_lpr [53] (omitted here).
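To illustrate the XOR-to-XOR step in (ii), the following conceptual Python sketch adds two XOR constraints represented as (variable set, parity) pairs: shared variables cancel and the right-hand sides are XORed. This is only the mathematical operation; cake_xlrup’s byte-level representation and its verified checking logic are more involved.

    def add_xors(xor_a, xor_b):
        """Add two XOR constraints, each given as (variables, rhs) meaning
        "XOR of variables = rhs". Shared variables cancel; the right-hand sides are XORed."""
        vars_a, rhs_a = xor_a
        vars_b, rhs_b = xor_b
        return vars_a ^ vars_b, rhs_a ^ rhs_b

    # Example: (x1 + x2 + x3 = 0) plus (x2 + x3 + x4 = 1) yields (x1 + x4 = 1), with + denoting XOR.
    print(add_xors(({1, 2, 3}, 0), ({2, 3, 4}, 1)))   # ({1, 4}, 1)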

Modifications to CryptoMiniSat. A refactoring of CryptoMiniSat was performed in response to the bug described in Sect. 1 and in order to add FRAT-XOR proof logging. As part of this rewrite, an XOR constraint propagation engine was added back, having been removed as part of BIRD [50]; that system did not need it, as it also kept all XOR constraints in blasted (clausal) form. Furthermore, XOR constraints are now identified by IDs instead of the previously used pointer to a TBUDDY BDD, and all XOR manipulations, such as XOR-ing together XOR constraints, constant folding [57], and deletion of satisfied XOR constraints, had to be recorded in the emitted FRAT-XOR proof log. Further, CryptoMiniSat had to be modified to track which clause IDs were responsible for recovered XOR constraints. To make sure our changes were correct, we modified CryptoMiniSat’s fuzzing pipeline to include XOR constraint-generating problems and to check the generated proofs using our certification tools.

5 Experimental Evaluation

To evaluate the practicality of partial certificate generation (\(\textsf{ApproxMCCert}\)) and certificate checking (\(\textsf{CertCheck}\)), we conducted an extensive evaluation over a publicly available benchmark set [41] of 1896 problem instances that were used in previous evaluations of \(\textsf{ApproxMC}\) [49, 51]. The benchmark set consists of (projected) model counting problems arising from applications such as probabilistic reasoning, plan recognition, DQMR networks, ISCAS89 combinatorial circuits, quantified information flow, program synthesis, functional synthesis, and logistics. Most instances are satisfiable with large model counts; only approximately 6% are unsatisfiable, serving to test corner cases.

To demonstrate the effectiveness of our new CNF-XOR unsatisfiability checking pipeline, we also compared it to the three prior state-of-the-art approaches discussed in Sect. 2. The approaches are labeled as follows:

  • \(\textsf{CMS}\)+\(\textsf{frat}\)-\(\textsf{xor}\). Our new (default) pipeline based on FRAT-XOR (Sect. 4.4); here, CMS is short for CryptoMiniSat.

  • \(\textsf{CMS}\)+\(\textsf{tbuddy}\). The pipeline consisting of CryptoMiniSat with TBUDDY, FRAT-rs, and a verified CNF proof checker (Sect. 2, item 1).

  • \(\textsf{MiniSatXOR}\)+\(\textsf{pbp}\). The pipeline consisting of MiniSat with an XOR engine, \(\textsf{VeriPB}\), and its verified proof checker (Sect. 2, item 2).

  • \(\textsf{CaDiCaL}\)+\(\textsf{lrat}\). A state-of-the-art SAT solver, \(\textsf{CaDiCaL}\) [8, 44], which generates proofs checkable by a verified CNF proof checker (Sect. 2, item 3).

We experimented with each of these approaches as the CNF-XOR unsatisfiability checking pipeline for \(\textsf{CertCheck}\), checking the same suite of certificates produced by \(\textsf{ApproxMCCert}\).

The empirical evaluation was conducted on a high-performance computer cluster where every node consists of an AMD EPYC-Milan processor featuring \(2\times 64\) real cores and 512 GB of RAM. For each instance and tool (\(\textsf{ApproxMC}\), \(\textsf{ApproxMCCert}\), or \(\textsf{CertCheck}\)), we set a timeout of 5000 s and a memory limit of 16 GB, and we used the default values of \(\delta =0.2\) and \(\varepsilon =0.8\) for all tools, following previous experimental conventions [49]. For each tool, we report the PAR-2 score, which is commonly used in the SAT competition: the average runtime over the relevant instances for that tool, where solved/certified instances contribute their runtime and unsolved/uncertified instances count as double the time limit (i.e., 10000 s).
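For reference, a minimal sketch of this PAR-2 computation (with hypothetical runtimes, not data from our evaluation):

    def par2(runtimes_of_solved, num_instances, timeout=5000.0):
        """PAR-2: average over all relevant instances, where each unsolved/uncertified
        instance is charged twice the timeout."""
        unsolved = num_instances - len(runtimes_of_solved)
        return (sum(runtimes_of_solved) + unsolved * 2 * timeout) / num_instances

    print(par2([12.3, 480.0, 4999.0], num_instances=5))   # two unsolved instances charged 10000 s each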

Our empirical evaluation sought to answer the following questions:

  • RQ1 How does the performance of \(\textsf{ApproxMCCert}\) and \(\textsf{CertCheck}\) compare to that of \(\textsf{ApproxMC}\)?

  • RQ2 How does the performance of \(\textsf{CMS}\)+\(\textsf{frat}\)-\(\textsf{xor}\) compare to prior state-of-the-art approaches for CNF-XOR UNSAT checking for use in \(\textsf{CertCheck}\)?

RQ1 Feasibility of Certificate Generation and Checking. We present the results for \(\textsf{ApproxMC}\), \(\textsf{ApproxMCCert}\), and \(\textsf{CertCheck}\) in Table 1. For certificate generation, our main observation is that \(\textsf{ApproxMCCert}\) is able to solve and generate certificates for \(99.3\%\) (i.e., 1202 out of 1211) of the instances that \(\textsf{ApproxMC}\) can solve alone, and their PAR-2 scores (out of 1896 instances) are similar. Indeed, in the per-instance scatter plot of \(\textsf{ApproxMC}\) and \(\textsf{ApproxMCCert}\) runtimes in Fig. 4, we see that for almost all instances, the overhead of certificate generation in \(\textsf{ApproxMCCert}\) is fairly small. This is compelling evidence for the practicality of adopting certificate generation for approximate counters with our approach.

Table 1. Performance comparison of \(\textsf{ApproxMC}\), \(\textsf{ApproxMCCert}\), and \(\textsf{CertCheck}\). The PAR-2 score is calculated out of 1896 instances for \(\textsf{ApproxMC}\) and \(\textsf{ApproxMCCert}\), and out of the 1202 instances with certificates for \(\textsf{CertCheck}\).
Fig. 4. Per-instance runtime (s) comparison for \(\textsf{ApproxMCCert}\) and \(\textsf{ApproxMC}\).

Turning to the feasibility of certificate checking, we observe in Table 1 that \(\textsf{CertCheck}\) is able to fully certify \(84.7\%\) of the instances with certificates (i.e., 1018 out of 1202). Of the remaining instances, \(\textsf{CertCheck}\) timed out for 46 instances and ran out of memory for 138 (no certificate errors were reported in our latest versions of the tools). On average, \(\textsf{CertCheck}\) requires 4.6 times the runtime of \(\textsf{ApproxMCCert}\) across all certified instances. Note that each instance of \(\textsf{CertCheck}\) requires nine separate calls to the CNF-XOR unsatisfiability checking pipeline (because \(\delta = 0.2\)). It is worth emphasizing that in other certificate checking setups, such as the SAT competitions, one would typically provide an order of magnitude more time and memory to the checkers than to the solvers. Thus, \(\textsf{CertCheck}\) performs well even though our time and memory limits are stringent. Furthermore, we believe that \(\textsf{CertCheck}\)’s ability to achieve a fairly low PAR-2 score (computed out of 1202 instances) is compelling evidence for the practicality of certificate checking in approximate counting. Future work could explore parallelized certificate checking, since the rounds used in \(\textsf{CertCheck}\) can be checked independently of one another.

RQ2 Comparison of CNF-XOR Unsatisfiability Checkers. We present results using various alternative unsatisfiability checking pipelines as part of \(\textsf{CertCheck}\) in Table 2. Here, we observe that the use of \(\textsf{CMS}\)+\(\textsf{frat}\)-\(\textsf{xor}\) allows \(\textsf{CertCheck}\) to fully certify significantly more instances than can be certified by prior approaches, and with a much lower PAR-2 score.

Fig. 5. (left) Runtime performance comparison between CNF-XOR unsatisfiability checkers. (right) Per-instance CNF-XOR unsatisfiability proof size (bytes) comparison for \(\textsf{CMS}\)+\(\textsf{frat}\)-\(\textsf{xor}\) and \(\textsf{CMS}\)+\(\textsf{tbuddy}\).

Table 2. Performance comparison of CNF-XOR unsatisfiability checkers in \(\textsf{CertCheck}\). The PAR-2 score is calculated out of the 1202 instances with certificates for all checkers.

Figure 5 (left) visualizes the performance gap between \(\textsf{CMS}\)+\(\textsf{frat}\)-\(\textsf{xor}\) and the prior methods using a CDF (cumulative distribution function) plot; a point (x, y) indicates that the corresponding tool certifies y instances when given a timeout of x seconds per instance. This plot provides strong justification for our decision (Sect. 4.4) to develop \(\textsf{CMS}\)+\(\textsf{frat}\)-\(\textsf{xor}\) for native CNF-XOR unsatisfiability proof checking. The ability to log XOR proof steps compactly in our new CNF-XOR unsatisfiability proof format is also significant. This is illustrated in Fig. 5 (right), which gives a scatter plot comparing FRAT (resp. FRAT-XOR) proof sizes generated by \(\textsf{CMS}\)+\(\textsf{tbuddy}\) (resp. \(\textsf{CMS}\)+\(\textsf{frat}\)-\(\textsf{xor}\)) within 600 s on instances that were successfully certified by \(\textsf{CMS}\)+\(\textsf{tbuddy}\). Recall that the solver in \(\textsf{CMS}\)+\(\textsf{tbuddy}\) supports XOR reasoning and uses TBUDDY to emit its proof log in terms of a clausal proof system, i.e., without native XOR proof steps. Overall, our new proof format achieves an average 30-fold reduction in proof size, with the maximum reduction reaching up to 8,251 times.

6 Conclusion and Future Work

This work shows that it is feasible to use proof assistants to formalize practical randomized automated reasoning algorithms. Such formalizations are valuable—our end-to-end certification approach for \(\textsf{ApproxMCCert}\) has led to bug-fixes for both \(\textsf{ApproxMC}\) and its underlying CryptoMiniSat solver.

An interesting line of future work would be to support recently proposed techniques such as sparse hashing [39] or rounding [60] in the context of \(\textsf{ApproxMC}\). Furthermore, this work leaves preprocessing techniques, such as independent support identification, out of scope. It is worth noting that efficient identification of the independent support set, in conjunction with a new rounding-based algorithm [60], significantly boosts the counting performance of \(\textsf{ApproxMC}\); in the experimental setting of Table 1, this combination solves 1787 instances with a PAR-2 score of 625. Thus, certifying these extensions is a tantalizing avenue for future research. Another potential line of future work involves developing extensions for theories other than CNF-XOR model counting [59].