
1 Introduction

Most U.S. jurisdictions use computers to tabulate votes. Like all computers, vote tabulators are vulnerable to bugs, human error, and deliberate malfeasance—a fact that has been exploited (rhetorically, if not in reality) to undermine trust in U.S. elections [3, 4, 9, 10].

To deserve public trust, elections must be trustworthy, despite relying on untrustworthy software, hardware, and people: they should provide convincing affirmative evidence that the reported winners really won [1, 2, 20]. Risk-limiting audits (RLAs) are a useful tool for conducting such evidence-based elections. RLAs have a specified maximum chance—the risk limit \(\alpha \)—of not correcting the reported outcome if it is wrong, and never change the reported outcome if it is correct. Below, we present methods that reduce the number of ballots that must be manually inspected in an RLA based on stratified samples when the reported outcomes are correct.

In a ballot-level comparison RLA, manual interpretations of the votes on randomly sampled ballot cards are compared to their corresponding cast vote records (CVRs), the system’s interpretation of the votes on those cards. In a ballot-polling RLA, votes are read manually from randomly selected cards, but those votes are not compared to the system’s interpretation of the cards. All else equal, ballot-level comparison RLAs are more efficient than ballot-polling RLAs, but they require the voting system to export CVRs in a way that the corresponding card can be uniquely identified. Not all voting systems can.

Stratified random sampling can be mandatory or expedient in RLAs. Some states’ laws require audit samples to be drawn independently across jurisdictions (e.g., California Election Code § 336.5 and § 15360), in which case the audit sample for any contest that crosses jurisdictional boundaries is stratified. Stratifying on the technology used to tabulate votes can increase efficiency by allowing hybrid audits [7, 11], which use ballot-level comparison in strata where the voting technology supports it and ballot-polling elsewhere. Another reason to use stratification is to allow RLAs to start before all ballots have been tabulated [17].

The next section briefly reviews prior work on stratified audits. Section 3 introduces notation and stratified risk measurement, then presents our improvements: (i) sharper P-values from new risk-measuring functions; (ii) sequential stratified sampling that adapts to the observed data in each stratum to increase efficiency; and (iii) a computationally efficient method for an arbitrary number of strata. Section 4 evaluates the innovations using case studies and simulations. Section 5 discusses the results and gives recommendations for practice.

2 Past Work

The first RLAs involved stratified batch comparison, using the maximum error across strata and contests as the test statistic [5, 13, 14, 15], a rigorous but inefficient approach. Higgins et al. [6] computed sharper P-values for the same test statistic using dynamic programming. SUITE [7, 11] uses union-intersection tests to represent the null hypothesis that one or more reported winners actually lost as a union of intersections of hypotheses about individual strata; it involves optimization problems that are hard to solve when there are more than two strata.

More recently, SHANGRLA [18] has reduced RLAs to a canonical form: testing whether the means of finite, bounded lists of numbers (representing ballot cards) are all less than 1/2, which allows advances in statistical inference about bounded populations to be applied directly to RLAs. Stark [18] showed that union-intersection tests can be used with SHANGRLA to allow any risk-measuring function to be used in any stratum in stratified audits.

Stark [19] provided a new approach to union-intersection tests using nonnegative supermartingales (NNSMs): intersection supermartingales, which open the possibility of reducing sample sizes by adaptive stratum selection (using the first t sampled cards to select the stratum from which to draw the \((t+1)\)th card). Stark [19] does not provide an algorithm for stratum selection or evaluate the performance of the approach; this paper does both.

3 Stratified Audits

We shall formalize stratified audits using the SHANGRLA framework [18], which unifies comparison and polling audits. We then show how to construct a stratified comparison audit using SHANGRLA, how to measure the risk based on a stratified sample, and how adaptive sequential stratified sampling can improve efficiency.

3.1 Assorters and Assertions

Ballot cards are denoted \(\{b_i\}_{i=1}^N\). An assorter A assigns a number \(A(b_i) \equiv x_i \in [0,u]\) to ballot card \(b_i\) [18] and the value \(A(c_i)\) to CVR i. The value an assorter assigns to a card depends on the votes on the card, the social choice function, and possibly on the machine interpretation of that card and others (for comparison audits). Stark [18] describes how to define a set of assorters for many social choice functions (including majority, multiwinner majority, supermajority, Borda count, approval voting, all scoring rules, D’Hondt, STAR-Voting, and IRV) such that the reported winner(s) really won if the mean of every assorter in the set is greater than 1/2. The claim that an assorter mean is \(> 1/2\) is called an assertion. An RLA with risk limit \(\alpha \) confirms the outcome of a contest if it rejects the complementary null that the assorter mean is \(\le 1/2\) at significance level \(\alpha \) for every assorter relevant to that contest.
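For concreteness, consider the plurality example in [18]: to check the assertion that reported winner Alice got more votes than reported loser Bob (placeholder names for this illustration), the assorter assigns

$$A(b_i) := \frac{1 + \mathbb{1}\{b_i \text{ shows a vote for Alice}\} - \mathbb{1}\{b_i \text{ shows a vote for Bob}\}}{2},$$

so \(A(b_i) \in \{0, 1/2, 1\}\) with \(u = 1\), and the assorter mean exceeds 1/2 if and only if Alice received more votes than Bob.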

In a stratified audit, the population of ballot cards is partitioned into K disjoint strata. Stratum k contains \(N_k\) ballot cards, so \(N = \sum _k N_k\). The weight of stratum k is \(w_k := N_k/N\); the weight vector is \(\boldsymbol{w} := [w_1,...,w_K]^T\). For each assorter A there is a set of assorter values \(\{x_i\}_{i=1}^N\). Each assorter may have its own upper bound \(u_k\) in stratum k. The true mean of the assorter values in stratum k is \(\mu _k\); \(\boldsymbol{\mu } := [\mu _1,...,\mu _K]^T\). The overall assorter mean is

$$\mu := \frac{1}{N} \sum _{i=1}^N x_i = \sum _{k=1}^K \frac{N_k}{N} \mu _k = \boldsymbol{w}^T \boldsymbol{\mu }.$$

Let \(\boldsymbol{\theta } = [\theta _{1},...,\theta _K]^T\) with \(0 \le \theta _k \le u_k\). A single intersection null is of the form \(\boldsymbol{\mu } \le \boldsymbol{\theta }\), i.e., \(\cap _{k=1}^K \{\mu _k \le \theta _k\}\). The union-intersection form of the complementary null that the outcome is incorrect is:

$$\begin{aligned} H_0: \bigcup _{\boldsymbol{\theta }: \boldsymbol{w}^T \boldsymbol{\theta } \le \frac{1}{2}} \bigcap _{k=1}^K \{\mu _k \le \theta _k \}. \end{aligned}$$
(1)

From stratum k we have \(n_k\) samples \(X^{n_k}_k := \{X_{1k},...,X_{n_k k}\}\) drawn by simple random sampling, with or without replacement, independently across strata. Section 3.3 shows how to use single-stratum hypothesis tests (of the null \(\mu _k \le \theta _k\)) to test (1). First, we show how to write stratified comparison audits in this form.

3.2 Stratified Comparison Audits

In SHANGRLA, comparison audits involve translating the original assertions about the true votes into assertions about the reported results and discrepancies between the true votes and the machine’s record of the votes [18, Section 3.2]. For each assertion, the corresponding overstatement assorter assigns ballot card \(b_i\) a bounded, nonnegative number that depends on the votes on that card, that card’s CVR, and the reported results. The original assertion is true if the average of the overstatement assorter values is greater than 1/2.

We now show that for stratified audits, the math is simpler if, as before, we assign a nonnegative number to each card that depends on the votes and reported votes, but instead of comparing the average of the resulting list to 1/2, we compare it to a threshold that depends on the hypothesized stratum mean \(\theta _k\).

Let \(u^A_k\) be the upper bound on the original assorter for stratum k and \(\omega _{ik} := A(c_{ik}) - A(b_{ik}) \in [-u^A_k, u^A_k]\) be the overstatement for the ith card in stratum k, where \(A(c_{ik})\) is the value of the assorter applied to the CVR and \(A(b_{ik})\) is the value of the assorter for the true votes on that card. Let \(\bar{A}^b_k\), \(\bar{A}^c_k\), and \(\bar{\omega }_k = \bar{A}^c_k - \bar{A}^b_k\) be the true assorter mean, reported assorter mean, and average overstatement, all for stratum k.

For a particular \(\boldsymbol{\theta }\), the intersection null claims that in stratum k, \(\bar{A}^b_k \le \theta _k\). Adding \(u_k^A-\bar{A}_k^c\) to both sides of the inequality yields

$$u^A_k - \bar{\omega }_k \le \theta _k + u^A_k - \bar{A}^c_k.$$

Letting \(u_k := 2 u^A_k\), take \(B_{ik} := u^A_k - \omega _{ik} \in [0, u_k]\) and \(\bar{B}_k := \frac{1}{N_k} \sum _{i=1}^{N_k} B_{ik}\). Then \(\{B_{ik}\}\) is a bounded list of nonnegative numbers, and the assertion in stratum k is true if \(\bar{B}_k > \beta _k := \theta _k + u^A_k - \bar{A}_k^c\), where all terms on the right are known. Testing whether \(\bar{B}_k \le \beta _k\) is the canonical problem solved by ALPHA [19]. The intersection null can be written

$$\bar{B}_k \le \beta _k ~~\text{ for } \text{ all }~~ k \in \{1,\ldots ,K\}.$$

Define \(\boldsymbol{u} := [u_1, \ldots , u_K]^T\). As before, we can reject the complementary null if we can reject all intersection nulls \(\boldsymbol{\theta }\) for which \(\boldsymbol{0} \le \boldsymbol{\theta } \le \boldsymbol{u}\) and \(\boldsymbol{w}^T \boldsymbol{\theta } \le 1/2\).
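As a minimal sketch (not the authors' implementation), the bookkeeping in this subsection can be written directly in code: given the assorter applied to every CVR in stratum k and to the manually read cards in the sample, compute \(u_k\), the observed \(B_{ik}\), and the threshold \(\beta _k(\theta _k)\). The function names and interfaces are hypothetical.

    import numpy as np

    def stratum_comparison_setup(A_cvr_all, uA):
        # Known before sampling: A_cvr_all is the assorter applied to every CVR
        # in stratum k (length N_k); uA is the upper bound u^A_k on that assorter.
        A_c_bar = np.mean(A_cvr_all)                   # reported assorter mean for the stratum
        u_k = 2.0 * uA                                 # upper bound on the B_ik
        beta = lambda theta_k: theta_k + uA - A_c_bar  # assertion holds iff mean(B) > beta(theta_k)
        return u_k, beta

    def observed_B(A_cvr_sample, A_manual_sample, uA):
        # B_ik = u^A_k - omega_ik for the audited cards, where
        # omega_ik = A(c_ik) - A(b_ik) is the overstatement.
        omega = np.asarray(A_cvr_sample) - np.asarray(A_manual_sample)
        return uA - omega

Testing whether the mean of the full (unobserved) list \(\{B_{ik}\}\) exceeds \(\beta _k(\theta _k)\) using the sampled values is then the canonical one-stratum problem addressed by ALPHA [19].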

3.3 Union-intersection Tests

A union-intersection test for (1) combines evidence across strata to see whether any intersection null in the union is plausible given the data, that is, to check whether the P-value of any intersection null in the union is greater than the risk limit.

Consider a fixed vector \(\boldsymbol{\theta }\) of within-stratum nulls. Let \(P(\boldsymbol{\theta })\) be a valid P-value for the intersection null \(\boldsymbol{\mu } \le \boldsymbol{\theta }\). Many functions can be used to construct \(P(\boldsymbol{\theta })\) from tests in individual strata; two are presented below. We can reject the union-intersection null (1) if we can reject the intersection null for all feasible \(\boldsymbol{\theta }\) in the half-space \(\boldsymbol{w}^T \boldsymbol{\theta } \le 1/2\). Equivalently, \(P(\boldsymbol{\theta })\) maximized over feasible \(\boldsymbol{\theta }\) is a P-value for (1):

$$\begin{aligned} P^* := \max _{\boldsymbol{\theta }}~ \{ P(\boldsymbol{\theta }): \boldsymbol{0} \le \boldsymbol{\theta } \le \boldsymbol{u} ~ \text{ and } ~ \boldsymbol{w}^T \boldsymbol{\theta } \le {1}/{2}\}. \end{aligned}$$

This method is fully general in that it can construct a valid P-value for (1) from stratified samples and any mix of risk-measuring functions that are individually valid under simple random sampling. However, the tractability of the optimization problem depends on the within-stratum risk-measuring functions and the form of P used to pool risk. So does the efficiency of the audit.
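For very small K, one naive way to see what the optimization involves (purely for illustration; it is not how the audits in Sect. 4 are run) is to evaluate a given intersection P-value \(P(\boldsymbol{\theta })\) on a grid over the feasible region. Because a grid can miss the true maximum, a rigorous audit needs an exact maximization or a provable upper bound, as discussed in Sects. 3.6 and 4.3.

    import itertools
    import numpy as np

    def p_star_grid(P, w, u, n_grid=50):
        # Approximate P* = max P(theta) over {0 <= theta <= u, w^T theta <= 1/2}
        # by brute force.  Cost grows as n_grid**K, so this is only feasible for
        # very small K, and it can only under-estimate the true maximum.
        w, u = np.asarray(w, float), np.asarray(u, float)
        grids = [np.linspace(0.0, u_k, n_grid) for u_k in u]
        best = 0.0
        for theta in itertools.product(*grids):
            theta = np.asarray(theta)
            if w @ theta <= 0.5:
                best = max(best, P(theta))
        return best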

We next give two valid combining rules \(P(\boldsymbol{\theta })\). Section 3.6 presents some choices for within-stratum risk measurement to construct \(P(\boldsymbol{\theta })\).

3.4 Combining Functions

Ottoboni et al. [11] and Stark [18] calculate P for the intersection null using Fisher’s combining function. Let \(p_k(\theta _k)\) be a P-value for the single-stratum null \(H_{0k}: \mu _k \le \theta _k\). Define the pooling function

$$P_F(\boldsymbol{\theta }) := 1 - \chi ^2_{2K} \left( -2 \sum _{k=1}^K \log p_k(\theta _k) \right) ,$$

where \(\chi ^2_{2K}\) is the CDF of the chi-squared distribution with 2K degrees of freedom. The term inside the CDF, \(-2 \sum _{k=1}^K \log p_k(\theta _k)\), is Fisher’s combining function. Because samples are independent across strata, \(\{p_k(\theta _k)\}_{k=1}^K\) are independent random variables, so under the intersection null Fisher’s combining function is stochastically dominated by the chi-squared distribution with 2K degrees of freedom [11], and \(P_F(\boldsymbol{\theta })\) is a valid P-value for that null. The maximum over \(\boldsymbol{\theta }\), \(P_F^*\), is a valid P-value for (1).
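A sketch of the Fisher pooling step, assuming the within-stratum P-values \(p_k(\theta _k)\) have already been computed for a fixed \(\boldsymbol{\theta }\):

    import numpy as np
    from scipy.stats import chi2

    def fisher_combined_pvalue(p_values):
        # P_F(theta): pool independent within-stratum P-values p_k(theta_k)
        # with Fisher's combining function and the chi-squared(2K) tail.
        p = np.asarray(p_values, dtype=float)
        stat = -2.0 * np.sum(np.log(p))
        return chi2.sf(stat, df=2 * len(p))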

3.5 Intersection Supermartingales

Stark [19] derives a simple form for the P-value for an intersection null when supermartingales are used as test statistics within strata. Let \(M_{n_k}^k(\theta _k)\) be a supermartingale constructed from \(n_k\) samples drawn from stratum k when the null \(\mu _k \le \theta _k\) is true. Then the product of these supermartingales is also a supermartingale under the intersection null, so its reciprocal (truncated above at 1) is a valid P-value [19, 23]:

$$P_M(\boldsymbol{\theta }) := 1 \wedge \prod _{k=1}^K M_{n_k}^k(\theta _k)^{-1}.$$

Maximizing \(P_M(\boldsymbol{\theta })\) (equivalently, minimizing the intersection supermartingale) yields \(P_M^*\), a valid P-value for (1).
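The corresponding pooling step for supermartingales is even simpler; a sketch, assuming the current within-stratum supermartingale values \(M_{n_k}^k(\theta _k)\) are given:

    import numpy as np

    def martingale_combined_pvalue(M_values):
        # P_M(theta): truncated reciprocal of the product of the
        # within-stratum supermartingale values at a fixed theta.
        M = np.prod(np.asarray(M_values, dtype=float))
        return min(1.0, 1.0 / M)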

3.6 Within-Stratum P-values

The class of within-stratum P-values that can be used to construct \(P_F\) is very large, but \(P_M\) is limited to functions that are supermartingales under the null. Possibilities include:

  • SUITE, which computes \(P_F^*\) for two-stratum hybrid audits. The P-value in the CVR stratum uses the MACRO test statistic [16]; the P-value in the no-CVR stratum takes a maximum over many values of Wald’s SPRT indexed by a nuisance parameter representing the number of non-votes in the stratum. The maximization in MACRO and the maximization over the nuisance parameter in the SPRT make SUITE less efficient than newer methods based on SHANGRLA [18].

  • ALPHA, which constructs a betting supermartingale as in Waudby-Smith and Ramdas [22], but with an alternate parameterization [19]. Such methods are among the most efficient for RLAs [19, 23], but the efficiency depends on how the tuning parameter \(\tau _{ik}\) is chosen. Stark [19] offers a sensible strategy based on setting \(\tau _{ik}\) to a stabilized estimate of the true mean \(\mu _k\). We implement that approach and a modification that is more efficient for comparison audits. Both \(P_M^*\) and \(P_F^*\) can be computed from stratum-wise ALPHA supermartingales. However, finding the maximum P-value over the union is prohibitively slow when \(K>2\).

  • Empirical Bernstein (EB), which is a supermartingale presented in Howard et al. [8] and Waudby-Smith and Ramdas [22]. Although they are generally not as efficient as ALPHA and other betting supermartingales [22], EB supermartingales have an exponential analytical form that makes \(\log P_M(\boldsymbol{\theta })\) or \(\log P_F(\boldsymbol{\theta })\) linear or piecewise linear in \(\boldsymbol{\theta }\). Hence, \(P_M^*\) and \(P_F^*\) can be computed quickly for large K by solving a linear program. A sketch of the ALPHA and EB within-stratum supermartingales follows this list.
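The following sketch shows one way to compute these two within-stratum supermartingales for sampling with replacement, as we read the forms in [19] and [8, 22]; sampling without replacement requires updating the null mean after each draw, and the exact tuning rules used in Sect. 4 are those of Appendix A, so the parameters passed in here are placeholders.

    import numpy as np

    def alpha_supermartingale(x, theta, u, tau):
        # ALPHA for the null 'stratum mean <= theta', values in [0, u], theta in (0, u).
        # tau is a predictable sequence in (theta, u]: a stabilized running estimate
        # of the stratum mean ("ALPHA-ST") or values biased towards u ("ALPHA-UB").
        # Returns the running values M_1, ..., M_n.
        x, tau = np.asarray(x, float), np.asarray(tau, float)
        terms = (x * tau / theta + (u - x) * (u - tau) / (u - theta)) / u
        return np.cumprod(terms)

    def eb_supermartingale(x, theta, lam):
        # Empirical Bernstein for the null 'stratum mean <= theta', with x and theta
        # rescaled to [0, 1] and lam a predictable sequence in [0, 1) (the paper
        # truncates the predictable-mixture rule of [22] at 0.75).  Note that
        # log M is linear in theta, the property exploited in Sect. 4.3.
        x, lam = np.asarray(x, float), np.asarray(lam, float)
        n = len(x)
        mu_hat = (0.5 + np.cumsum(x)) / (np.arange(n) + 2.0)    # running mean estimate
        mu_prev = np.concatenate(([0.5], mu_hat[:-1]))          # predictable centering
        psi = -np.log(1.0 - lam) - lam                          # empirical-Bernstein psi term
        log_terms = lam * (x - theta) - (x - mu_prev) ** 2 * psi
        return np.exp(np.cumsum(log_terms))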

We compare the efficiency of these risk-measuring functions in Sects. 4.1 and 4.2.

3.7 Sequential Stratum Selection

The use of sequential sampling in combination with stratification presents a new possibility for reducing workload: sample more from strata that are providing evidence against the intersection null and less from strata that are not helping. To set the stage, suppose we are conducting a ballot-polling audit with two strata of equal size and testing the intersection null \(\boldsymbol{\theta } = [0.25, 0.75]^T\). We have drawn 50 ballot cards from each stratum and found sample assorter means of \([0.5, 0.6]^T\). Given the data, it seems plausible that drawing more samples from the first stratum will strengthen the evidence that \(\mu _1 > 0.25\), but additional sampling from the second stratum might not provide evidence that \(\mu _2 > 0.75\): to reject the intersection null, it might help to draw disproportionately from the first stratum. Perhaps surprisingly, such adaptive sampling yields valid inferences when the P-value is constructed from supermartingales and the stratum selection function depends only on past data. We now sketch why this is true.

For \(t \in \mathbb {N}\) and a particular vector of hypothesized stratum means \({\boldsymbol{\theta }}\), let

$${\kappa }_t({\boldsymbol{\theta }}) \in \{1,...,K\}$$

denote the stratum from which the t-th sample was drawn for testing the hypothesis \(\boldsymbol{\mu } \le {\boldsymbol{\theta }}\). We call \({\kappa }({\boldsymbol{\theta }}) := ({\kappa }_t({\boldsymbol{\theta }}))_{t \in \mathbb {N}}\) the stratum selector for null \({\boldsymbol{\theta }}\). Crucially, \({\kappa }({\boldsymbol{\theta }})\) is a predictable sequence with respect to \((X_t)_{t \in \mathbb {N}}\) in the sense that \({\kappa }_t({\boldsymbol{\theta }})\) can depend on \(X^{t-1} := \{X_1,\ldots ,X_{t-1}\}\) but not on \(X_i\) for \(i \ge t\); it could be deterministic given \(X^{t-1}\) or may also depend on auxiliary randomness.

For example, a stratum selector could ignore past data and select strata in a deterministic round-robin sequence or at random with probability proportional to stratum size. Alternatively, a rule might select strata adaptively, for instance picking a stratum at random with probability proportional to the current value of each within-stratum supermartingale, so that strata with larger \(M_{t_k}^k(\theta _k)\) are more likely to be chosen—an “exploration–exploitation” strategy. In what follows we suppress the dependence on \({\boldsymbol{\theta }}\) except when it is explicitly required for clarity.
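For instance, a minimal sketch of the probabilistic "exploration–exploitation" selector just mentioned (an illustration, not necessarily the rule used in Sect. 4):

    import numpy as np

    def proportional_to_evidence_selector(current_M, rng=None):
        # Choose the next stratum at random with probability proportional to the
        # current within-stratum supermartingale values.  The choice depends only
        # on past data plus outside randomness, so the selector is predictable.
        rng = rng or np.random.default_rng()
        M = np.asarray(current_M, dtype=float)
        return rng.choice(len(M), p=M / M.sum())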

Now, let \(M_t^{{\kappa }}({\boldsymbol{\theta }}) := \prod _{i=0}^t Z_i\) be the test statistic for testing the null hypothesis that the vector of stratumwise means is less than or equal to \({\boldsymbol{\theta }}\). This is a supermartingale if the individual terms \(Z_i\) satisfy a simple condition. Let \(Z_0=1\) and \(Z_i \ge 0\) for all i. If

$$\begin{aligned} \mathbb {E}_{\boldsymbol{\theta }}[Z_t | X^{t-1}] \le 1, \end{aligned}$$
(2)

then \((M_t^{{\kappa }}({\boldsymbol{\theta }}))_{t \in \mathbb {N}_0}\) is a nonnegative supermartingale starting at 1 under the null. By Ville’s inequality [21], the thresholded inverse \((1 \wedge M_t^{{\kappa }}({\boldsymbol{\theta }})^{-1})_{t \in \mathbb {N}_0}\) is an anytime P-value sequence when \(\boldsymbol{\mu } \le \boldsymbol{\theta }\).

Condition (2) holds if the \(Z_i\) are terms extracted from a set of within-stratum supermartingales using a predictable stratum selector: Let

$$\begin{aligned} \nu _t^{\kappa }:= \# \{i \le t : {\kappa }_i = {\kappa }_t \} \end{aligned}$$
(3)

be the number of draws from stratum \({\kappa }_t\) up to and including time t. Suppose that for \(k \in \{1, \ldots , K\}\), \(M_t^{k}( \theta _k) := \prod _{i=1}^t Y_i^{k}(\theta _k)\) is a nonnegative supermartingale starting at 1 when \(X_{ik}\) is the ith draw from stratum k and the kth stratum mean is \(\mu _k \le \theta _k\). Then if

$$\begin{aligned} Z_i := Y_{\nu _i^\kappa }^{{\kappa }_i}(\theta _{{\kappa }_i}), \end{aligned}$$
(4)

condition (2) holds and the interleaved test statistic \(M_t^{{\kappa }}({\boldsymbol{\theta }})\) is an intersection supermartingale under the null. We compare two stratum selection rules in Sect. 4.1.
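A minimal sketch of the interleaving in (3)–(4), assuming a per-stratum routine that draws the next card from stratum k and returns the next multiplicative factor \(Y^k_j(\theta _k)\) (e.g., one ALPHA or EB term), together with any predictable stratum selector; the function names are hypothetical.

    import numpy as np

    def interleaved_supermartingale(draw_factor, select_stratum, n_draws, K):
        # draw_factor(k, j): the factor Y_j^k(theta_k) for the j-th draw from stratum k
        #   (assumed to draw the card and update the within-stratum supermartingale).
        # select_stratum(history): predictable selector; sees only the list of
        #   (stratum, factor) pairs observed so far.
        counts = np.zeros(K, dtype=int)        # nu_t: draws per stratum so far
        history, M, running = [], 1.0, []
        for t in range(n_draws):
            k = select_stratum(history)        # kappa_t depends only on the past
            counts[k] += 1
            Z = draw_factor(k, counts[k])      # Z_t = Y_{nu_t}^{kappa_t}(theta_{kappa_t})
            M *= Z
            history.append((k, Z))
            running.append(M)                  # M_t^kappa(theta)
        return np.array(running)

By Ville's inequality, the running product can be compared to \(1/\alpha \) at any time, as in Sect. 3.5.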

4 Evaluations

4.1 Combination and Allocation Rules

We simulated a variety of two-stratum ballot-level comparison audits at risk limit \(\alpha = 5\%\), with assorters defined as in Sect. 3.2. The strata each contained \(N_k = 1000\) ballot cards, all with valid votes. Cards were sampled without replacement. The stratum-wise true margins were \([0\%, 20\%]\), \([0\%, 10\%]\), or \([0\%, 2\%]\), corresponding to global margins of 10%, 5%, and 1%, respectively. Stratum-wise reported margins were also \([0\%, 20\%]\), \([0\%, 10\%]\), or \([0\%, 2\%]\), so error was always confined to the second stratum. Each reported margin was audited against each true margin in 300 simulations. Risk was measured by ALPHA or EB, combined either as intersection supermartingales (\(P_M^*\)) or with Fisher’s combining function (\(P_F^*\)), with one of two stratum selectors: proportional allocation or lower-sided testing.

In proportional allocation, the number of samples from each stratum is in proportion to the number of cards in the stratum. Allocation by lower-sided testing involves testing the null \(\mu _k \ge \theta _{k}\) sequentially at level 5% using the same supermartingale (ALPHA or EB) used to test the main (upper-sided) hypothesis of interest. This allocation rule ignores samples from a given stratum once the lower-sided hypothesis test rejects, since there is strong evidence that the null is true in that stratum. This “hard stop” algorithm is unlikely to be optimal, but it leads to a computationally efficient implementation and illustrates the potential improvement in workload from adaptive stratum selection.
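A simplified sketch of one step of such a hard-stop rule, as we read the description above (how draws are shared among the still-open strata is an assumption; here they are allocated in proportion to stratum size):

    import numpy as np

    def hard_stop_next_stratum(M_lower, N, alpha_alloc=0.05, rng=None):
        # M_lower[k]: current value of the lower-sided supermartingale testing
        #   mu_k >= theta_k.  A stratum is closed once that test rejects
        #   (M_lower[k] >= 1/alpha_alloc), i.e. once there is strong evidence
        #   that the main null mu_k <= theta_k is true there.
        # N[k]: number of cards in stratum k (assumed allocation among open strata).
        rng = rng or np.random.default_rng()
        M_lower, N = np.asarray(M_lower, float), np.asarray(N, float)
        open_mask = M_lower < 1.0 / alpha_alloc
        if not open_mask.any():
            return None                        # every stratum has been closed
        probs = np.where(open_mask, N, 0.0)
        return rng.choice(len(N), p=probs / probs.sum())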

Fig. 1. Probability that the audit will stop (y-axis) at or before different given sample sizes (x-axis) under different allocation rules (indicated by line color: orange for lower-sided testing and blue for proportional allocation) for different combining functions (indicated by line type: solid for Fisher’s combining function and dashed for the intersection supermartingale) at risk limit \(\alpha = 5\%\). The true margins are in the rows (1% or 5%) while the reported margin is always 10%. Overstatement errors are confined to one stratum. ALPHA-ST = ALPHA with shrink-truncate \(\tau _{ik}\); ALPHA-UB = ALPHA with \(\tau _{ik}\) biased towards \(u_{k}\).

Tuning parameters were chosen as follows. ALPHA supermartingales were specified either with \(\tau _{ik}\) as described in Stark [19, Section 2.5.2] (ALPHA-ST, “shrink-truncate”) or with a strategy that biases \(\tau _{ik}\) towards \(u_k\) (ALPHA-UB, “upward bias”). The ALPHA-UB strategy helps in comparison audits because the distribution of assorter values consists of a point mass at \(u^A_k = u_k/2\) and typically small masses (with weight equal to the overstatement rates) at 0 and another small value. This concentration of mass makes it advantageous to bet more aggressively that the next draw will be above the null mean, which amounts to biasing \(\tau _{ik}\) towards the upper bound \(u_{k}\). Before running EB, the population and null were transformed to [0,1] by dividing by \(u_k\). The EB supermartingale parameters \(\lambda _{ik}\) were then specified following the “predictable mixture” strategy [22, Section 3.2], truncated to be below 0.75. Appendix A gives more details of the ALPHA-ST and ALPHA-UB strategies and the computations.

Sample size distributions for some combinations of reported and true margins are plotted in Fig. 1 as (simulated) probabilities of stopping at or before a given sample size. Table 1 gives estimated expected and 90th percentile sample sizes for each scenario and method. Table 2 lists aggregate scores, computed by finding the ratio of the workload for each method over the smallest workload in each scenario, then averaging over scenarios by taking the geometric mean of these ratios.

Intersection supermartingales tend to dominate Fisher pooling unless the stratum selector is chosen poorly (e.g., the bottom-right panel of Fig. 1 and the last row of Table 2). Stratum selection with the lower-sided testing procedure is about as efficient as proportional allocation for the ALPHA supermartingales, but far more efficient than proportional allocation for EB. The biggest impact of the allocation rule occurred for EB combined by intersection supermartingales when the reported margin was 0.01 and the true margin was 0.1: proportional allocation produced an expected workload of 752 cards, while lower-sided testing produced an expected workload of 271 cards—a 74% reduction. Table 2 shows that ALPHA-UB with intersection supermartingale combining and lower-sided testing is the best method overall; ALPHA-UB with intersection combining and proportional allocation is a close second; EB with intersection combining and lower-sided testing is also relatively sharp; ALPHA-ST with Fisher combining is least efficient.

We also ran simulations at risk limits 1% and 10%, which did not change the relative performance of the methods. However, compared to a 5% risk limit, a 10% risk limit requires counting about 17% fewer cards and a 1% risk limit requires about 38% more, on average across scenarios and methods.

Table 1. Expected and 90th percentile sample sizes for various risk-measurement functions, reported margins, and true margins, estimated from 300 simulated audits at risk-limit \(\alpha = 5\%\). The best result for each combination of reported margin, true margin, and summary statistic is highlighted. Comparison audit sample sizes are deterministic when there is no error, so the expected value and 90th percentile are equal when the reported and true margins are equal.
Table 2. Score for each method: the geometric mean of the expected workload over the minimum expected workload in each scenario. A lower score is better: a 1.00 would mean that the method always had the minimum expected workload. The best score is highlighted. A score of 2 means that workloads were twice as large as the best method, on average, across simulations and scenarios.

4.2 Comparison to SUITE

SUITE was used in a pilot RLA of the 2018 gubernatorial election in Michigan [7]. Three jurisdictions—Kalamazoo, Rochester Hills, and Lansing—were audited, but only Kalamazoo successfully ran a hybrid audit. We recalculated the risk on audit data from the closest race in Kalamazoo (Whitmer vs. Schuette) using ALPHA with the optimized intersection supermartingale P-value \(P_M^*\), ALPHA with the optimized Fisher P-value \(P_F^*\), EB with \(P_F^*\), and EB with \(P_M^*\), and compared these with the SUITE P-value. Because we could not access the original order of the sampled ballots in the ballot-polling stratum, we simulated P-values for 10,000 random orderings of the sampled ballots, holding the sample's vote totals fixed. We computed the mean, standard deviation, and 90th percentile of these P-values for each method.

To get the ALPHA P-values, we used ALPHA-UB in the CVR stratum and ALPHA-ST in the no-CVR stratum. For EB P-values, we used the predictable mixture parameters of Waudby-Smith and Ramdas [22] to choose \(\lambda _{ik}\), truncating at 0.75 in both strata. Sample allocation was dictated by the original pilot audit: 8 cards from the CVR stratum (5,294 votes cast; diluted margin 0.55) and 32 from the no-CVR stratum (22,732 votes cast; diluted margin 0.57).

Table 3 presents P-values for each method. For ALPHA, the mean \(P_F^*\) is about half the SUITE P-value; for \(P_M^*\), the mean is more than an order of magnitude smaller than the SUITE P-value. The P-value distributions for ALPHA are concentrated near the mean. On the other hand, the EB \(P_M^*\) and \(P_F^*\) P-values are both an order of magnitude larger than the SUITE P-value and their distributions are substantially more dispersed than the distributions of ALPHA P-values.

Table 3. Measured risks (P-values) computed from the 2018 Kalamazoo MI audit data. For SUITE, the original P-value is shown. For replications, the mean, standard deviation (SD), and 90th percentile of P-values in 10,000 reshufflings of the sampled ballot-polling data are shown.

4.3 A Highly Stratified Audit

As mentioned in Sect. 3.6, many within-stratum risk-measuring functions do not yield tractable expressions for \(P_F(\boldsymbol{\theta })\) or \(P_M(\boldsymbol{\theta })\) as a function of \(\boldsymbol{\theta }\), making it hard to find the maximum P-value over the union unless K is small. Indeed, previous implementations of SUITE only work for \(K = 2\). However, the combined log-P-value for EB is linear in \(\boldsymbol{\theta }\) for \(P_M^*\) and piecewise linear for \(P_F^*\). Maximizing the combined log-P-value over the union of intersections is then a linear program that can be solved efficiently even when K is large.
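To make this concrete, here is a sketch for the \(P_M^*\) case (the \(P_F^*\) case is piecewise linear and needs a slightly larger program). From the EB exponent, \(\log M(\boldsymbol{\theta }) = c_0 - \sum _k L_k \theta _k\) with \(L_k = \sum _i \lambda _{ik} \ge 0\), so maximizing \(P_M(\boldsymbol{\theta })\) amounts to maximizing \(\sum _k L_k \theta _k\) subject to \(\boldsymbol{w}^T \boldsymbol{\theta } \le 1/2\) and \(\boldsymbol{0} \le \boldsymbol{\theta } \le \boldsymbol{u}\). The sketch below uses SciPy rather than the R implementation described later in this section.

    import numpy as np
    from scipy.optimize import linprog

    def worst_case_theta(lambda_sums, w, u):
        # Linear program for the theta that maximizes the EB-based P_M
        # (equivalently, minimizes log M(theta)).
        # lambda_sums[k] = sum_i lambda_ik, the coefficient of -theta_k in log M.
        L, w, u = (np.asarray(a, dtype=float) for a in (lambda_sums, w, u))
        res = linprog(c=-L,                              # linprog minimizes, so negate
                      A_ub=w.reshape(1, -1), b_ub=[0.5], # budget w^T theta <= 1/2
                      bounds=list(zip(np.zeros_like(u), u)),
                      method="highs")
        return res.x

In this form the solution has a fractional-knapsack structure: the maximizing \(\boldsymbol{\theta }\) raises \(\theta _k\) to its upper bound in decreasing order of \(L_k/w_k\) until the constraint \(\boldsymbol{w}^T \boldsymbol{\theta } \le 1/2\) binds.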

To demonstrate, we simulated a stratified ballot-polling audit of the 2020 presidential election in California, in which \(N = 17,500,881\) ballots were cast across \(K = 58\) counties (the strata), using a risk limit of 5%. The simulations assumed that the reported results were correct, and checked whether reported winner Joseph R. Biden really beat reported loser Donald J. Trump. The audit assumed that every ballot consisted of one card; workloads would be proportionately higher if the sample were drawn from a collection of cards that includes some cards that do not contain the contest. Sample sizes were set to be proportional to turnout, plus 10 cards, ensuring that at least 10 cards were sampled from every county. Risk was measured within strata by EB with predictable mixture \(\lambda _{ik}\) thresholded at 0.9 [22]. Within-stratum P-values were combined using \(P_F^*\) (\(P_M^*\) did not work well for EB with proportional allocation in simulations). To approximate the distribution of sample sizes needed to stop, we simulated 30 audits at each increment of 5,000 cards from 5,580 to 100,580 cards. We then simulated 300 audits at 70,580 cards, roughly the 90th percentile according to the smaller simulations.

In 91% of the 300 runs, the audit stopped by the time 70,580 cards had been drawn statewide. Drawing 70,580 ballots by our modified proportional allocation rule produces within-county sample sizes ranging from 13 (Alpine County, with the fewest voters) to 17,067 (Los Angeles County, with the most). A comparison or hybrid audit using sampling without replacement would presumably require inspecting substantially fewer ballots. It took about 3.5 s to compute each P-value in R (4.1.2) using a linear program solver from the lpSolve package (5.6.15) on a mid-range laptop (2021 Apple MacBook Pro).

5 Discussion

ALPHA intersection supermartingales were the most efficient methods, both in the simulations and in re-analyzing the SUITE pilot audit in Michigan. Lower-sided testing allocation was better than proportional allocation, especially for EB. Fisher pooling limits the damage that a poor allocation rule can do, but is less efficient than intersection supermartingales with a good stratum selection rule. For comparison audits, it helps to bet more aggressively than ALPHA-ST by using ALPHA-UB or EB. However, EB was not efficient relative to SUITE when replicating the Michigan hybrid audit, because of poor performance in the ballot-polling stratum.

Our general recommendation for hybrid audits is: (i) use an intersection supermartingale test with (ii) adaptive stratum selection and (iii) ALPHA-UB (or another method that can exploit low sample variance to bet more aggressively) as the risk-measuring function in the comparison stratum and (iv) ALPHA-ST (or a method that “learns” the population mean) as the risk-measuring function in the ballot-polling stratum. When the number of strata is large, audits can leverage the log-linear form of the EB supermartingale to quickly find the maximum P-value, as illustrated by our simulated audit spread across California’s 58 counties.

In future work, we hope to construct better stratum allocation rules and characterize (if not construct) optimal rules. The log-linear structure of the EB supermartingale may make it simpler to derive optimal allocation rules.

While stratum selection is not an instance of a traditional multi-armed bandit (MAB) problem, there are connections, and successful strategies for MAB might help. For instance, stratum selection could be probabilistic and involve continuous exploration and exploitation, in contrast to the “hard stop” rules we used in our simulations here.