1 Introduction

Despite significant progress in theorem provers and successful formalizations of research-level mathematics [17], theorem provers have not yet enjoyed broad adoption among mathematicians [12, 29, 40]. A common criticism levelled against theorem provers by mathematicians is that formalizations are hard to write and read: they are written in unfamiliar languages, contain clutter that is irrelevant to the core mathematical ideas, may require knowledge of various specialized proof tactics, and are simply longer overall (see de Bruijn factor [39]).

Natural theorem provers are a direct answer to this critique; they aim to check texts as written by mathematicians: in natural language and with many proof gaps. Similar ambitions can already be found in the work of pioneers in theorem proving, such as P. Abrahams’s 1960s Proofchecker [1], which was intended to check the reasoning of textbooks as-is. Some theorem provers are partly natural, such as the influential Mizar system [16] which uses a quasi-natural input language and allows obvious inferences [34] as proof gaps. There are significant challenges to the natural approach, both in the processing of natural language and in the high degree of proof automation required to fill proof gaps. However, advances in automated theorem proving and computer hardware have made this approach more feasible.

\(\mathbb {N}\)aproche [9, 26] is a natural theorem prover based on A. Paskevich’s implementation of SAD [27, 38], extending it with, e.g., set-theoretic primitives, more efficient checking, and an integrated development environment. Students have completed formalizations in Naproche in various areas of mathematics, such as analysis, axiomatic set theory, representation theory, and combinatorics. However, typical formalizations in \(\mathbb {N}\)aproche use ad hoc axiomatic preliminaries and struggle to scale beyond chapter-length. Medium-sized formalizations of around 3000 lines can take half an hour to check and proving new theorems becomes increasingly difficult.

Naproche-ZFFootnote 1 is a reimplementation of key ideas in \(\mathbb {N}\)aproche with larger formalizations in mind. We shall compare the two systems throughout this paper. Naproche-ZF is developed in tandem with a growing modular standard library (see Footnote 1) containing formalizations of foundational material on sets, relations, functions, orders, ordinals, algebraic structures, topological spaces, and more.

2 Controlled Natural Language

The input language of Naproche-ZF is a new controlled natural language (CNL): it is a carefully chosen and formally specified subset of mathematical English that is embedded into for mathematical notation and document structuring. Most mathematicians are familiar with , which makes the language easier to learn. Ideally formalizing in a CNL should feel like writing with a strict style guide. The following example shows a theorem formalized in Naproche-ZF, first the source and then the rendering after typesetting.

figure e

Naproche-ZF treats everything outside of fixed formal environments such as definition, theorem, and proof, as comments. This facilitates writing literate formalizations that mix informal commentary and formal mathematics in the same document (e.g. [8], cf. literate programming [21]). Other theorem provers also support literate formalization or advanced typesetting; examples include Literate Agda, Isabelle’s documents, and a Mizar-to- translator [2]. An advantage of CNLs in literate formalization is that it is rarely necessary to restate theorems and proofs steps, since the formal statement is already readable.

Parsing. Natural mathematical language is dynamic: definitions introduce new lexical items, which can be symbols, words, or phrases. Dynamism complicates the processing of mathematical language. \(\mathbb {N}\)aproche and Naproche-ZF take different approaches to parsing their input languages and modelling dynamism.

\(\mathbb {N}\)aproche’s parser is defined with monadic parser combinators [18]. It statefully modifies itself as it encounters definitions and translates to an internal formula representation on the fly. Such tight coupling makes it harder to extend its CNL. There are also cases where parsing takes exponential time.

Naproche-ZF splits this process into phases. First it finds all lexical items using a scanner written with applicative regex combinators [7]. For nouns and verbs it then guesses the plural forms with basic smart paradigms [10] in the sense of GF [31]. The resulting lexicon becomes a parameter for the grammar of the CNL. Using the Earley [13] Haskell library, the CNL is specified as a context-free grammar in an embedded domain-specific language and the derived Earley parser [11] parses in cubic time in the worst-case or in quadratic time if the grammar is unambiguous. This grammar-oriented approach makes the initial design and future extensions of a CNL easier compared to parser combinators: new rules can be stated declaratively and there is no need to worry about exponential parsing times or eliminating left-recursion in the grammar.

Accuracy. \(\mathbb {N}\)aproche supports a plain text dialect [28] and a dialect. Naproche-ZF drops support for the plain text format and uses markup in its CNL to avoid ambiguities. For instance, “a” can be ambiguous in \(\mathbb {N}\)aproche (variable vs. determiner), but is clarified by distinguishing “a” (“$a$”) from “a” (“a”). Such distinctions and avoidance of the backtracking behaviour of combinator parsers significantly improve error specificity and locality. For instance, \(\mathbb {N}\)aproche often mistakes an unexpected word as a new variable name and will usually offer nonspecific and mislocated error messages along the lines of an “unexpected "."”, whereas Naproche-ZF can reliably offer a list of valid tokens at the location of the error. Naproche-ZF uses a stateful tokenizer to handle nested math and text modes, to support, e.g., “ ” within set comprehensions.

\(\mathbb {N}\)aproche accommodates grammatical number via a synonym instruction. For example, one uses “[synonym number/-​s]” to identify “natural number” and “natural numbers”. Thus \(\mathbb {N}\)aproche accepts ungrammatical sentences such as “x,y is natural numbers”. Naproche-ZF guesses plural forms via smart paradigms and requires number agreement. Overall, the grammar of Naproche-ZF is stricter with the aim of avoiding ambiguity, ungrammaticality, as well as pitfalls observed in formalizations written by students, where statements had unintuitive meanings in \(\mathbb {N}\)aproche.

Naproche-ZF supports various idioms. For example, binary relations can be chained and multiple terms can be related to each other (“\(a,b < c < d\)”), they can appear in bounded quantifiers (“For all \(x\in X\) ...”), and sets can be used as binary relations (“\(x\mathrel {R} y\)”). Naproche-ZF also models some pragmatic phenomena [32, 35]: for example, an existential claim in a proof implicitly introduces a local constant, the same way that the “Consider ...” step does.

3 Semantics and Proof Checking

Translation. Naproche-ZF translates from its CNL to a set theory in higher-order logic (HOL) with Henkin semantics [5], using generalized de Bruijn indices [19, 20] to handle quantifiers and other binders. However, the system emphasizes reasoning within the first-order fragment where possible to use the strength of mature first-order automated theorem provers (ATPs).

Adjectives, verbs, and nouns are translated as predicates. Bounding phrases in quantifications are translated to type guards. For instance, “Every natural number is an ordinal” is translated to “\(\forall n.\ \textsf {natural}\_\textsf {number}(n)\rightarrow \textsf {ordinal}(n)\)”.

Proof automation is currently first-order only, using strong first-order ATPs such as E [36] and Vampire [22, 33]. Every nontrivial proof step or intermediate claim leads to a proof task that is exported to an ATP. By default an ATP is given 10 s per tasks, but most tasks can be solved within fractions of a second. Naproche-ZF will also integrate with ATPs supporting HOL via TPTP THF0 [4]. For an impression of how ATPs are used, consider the step “Then \(B\in 2^A\)” in the following formalization of Cantor’s theorem.

figure j

This step leads to a proof task in which all preceding first-order definitions and theorems, as well as all local assumptions, definitions, and claims can be used as premises. Here the exported TPTP [37] problem contains a few hundred premises, shown below with the conjecture and recent theorems at the top, along with local premises at the bottom. Note that Naproche-ZF transformed the local definition of B via set comprehension into the first-order premise cantor1 by an automatic application of the axiom of separation.

figure k

Sets. Naproche-ZF’s built-in constructs are geared towards higher-order set theories [5, 15] extending Zermelo–Fraenkel (ZF) set theory. ZF with the axiom of choice (ZFC) is the de facto foundation of informal mathematics, but additional axioms such as the universe axiom of Tarski–Grothendieck set theory (TG) can be convenient for, e.g., category theory. Variants of TG are used by Mizar, Egal, and Megalodon [6]. Second-order axioms of ZF have corresponding built-in syntax or proof steps in Naproche-ZF that make it possible to use them with first-order proof automation. As we have seen in Cantor’s theorem above, set comprehensions are automatically eliminated in some situations using the axioms of separation or n-ary replacement. Naproche-ZF has a built-in proof method for \(\in \)-induction and will also feature a mechanism for defining one’s own induction principles (proved as higher-order theorems). There is potential for interoperability or integration with systems based on set theory, such as Mizar, Isabelle/ZF, and Megalodon. Naproche-ZF has an experimental export feature that translates theorem statements to Megalodon.

Structures. \(\mathbb {N}\)aproche has no dedicated features for mathematical structures, which means that users have to set up structures themselves. Dealing with notation becomes cumbersome, as you have to explicitly annotate which structure an operation belongs to. In Naproche-ZF one can define structures directly. They are encoded as record datatypes in set theory, with structure operations acting as field projections. The noun phrase of a structure is translated as a predicate and structure axioms are translated to first-order introduction and elimination rules. This first-order encoding enables structures subtyping and multiple inheritance. In the example below, topological spaces inherit from the built-in onesorted structure, which has only a projection “\(\left| -\right| \)” to the carrier set as an operation and has no axioms.

figure l

Structure operations are typeset using macros that take the structure as an optional argument. In theorems and proofs the structure argument may be omitted when a suitable structure was instantiated beforehand. For example, if a theorem statement has a premise of the form “Let X be a topological space”, one can subsequently write \(\mathcal {O}\) ( ) for \(\mathcal {O}_X\) ( ). When multiple structures with the same operation are instantiated, the last instantiation shadows the previous ones.

Imports. The import mechanism of \(\mathbb {N}\)aproche works similar to an include directive, which led to redundant checking of shared imports. Naproche-ZF tracks imports in a graph to avoid this. The import mechanism in Naproche-ZF uses the command “ ” which may be hidden in the rendered document.

Proofs. The simplest way of proving a theorem in Naproche-ZF is to leave it entirely to the ATP by not writing an explicit proof. One can also provide a proof of the form “Follows by \(\langle \textit{justification}\rangle \)”, where

$$ \langle \textit{justification}\rangle = \text {``set extensionality''} \mid {\text {``assumption''}} \mid {\text {``definition''}} \mid \langle \textit{ref}\rangle . $$

The justification “by set extensionality” splits an atomic equation into two goals expressing mutual set inclusion, which is convenient when the ATP is reluctant about using extensionality. Next, “by assumption” and “by definition” each restrict the available premises for the ATP to just the local assumptions or previous definition, respectively. Finally, a \(\langle \textit{ref}\rangle \) is an explicit reference to previous theorems, reusing commonly used citation commands, such as from the Cleveref package. Only these explicitly cited theorems are then used as premises for the ATP, together with the local assumptions and relevant definitions. Most other proof steps can also be justified in this manner, but we will disregard justifications below for the sake of brevity.

Next, one can state intermediate claims using one of many equivalent phrases such as “We have \( \varPhi \)” or “Thus \( \varPhi \)”. This creates an ATP task for the claim and then adds the claim as an additional assumption for the remainder of the proof. Claims may be justified by a subproof.

One can also perform goal-directed proof steps, similar to many other formalization languages. There are straightforward proof steps like “Assume \( \varPhi \)”, “Suppose not”, and case distinctions. One can obtain witness with “Consider \(x, y, z\sim X\) such that \( \varPhi \)” or simplify universal goals with “Fix \(x, y, z \sim X\) such that \( \varPhi \)”, where both the bound by an arbitrary relation symbol \(\sim \) and the such-that refinement are optional.

A proof step of the form “It suffices to show \( \varPhi \)” creates a proof task of showing that \( \varPhi \) implies the current goal, and then sets \( \varPhi \) as the new goal.

Naproche-ZF supports calculational reasoning in the align* environment: each equation may be followed by an with a citation. Currently calculational reasoning works with equations and biconditionals. Other systems such as Lean and Isabelle have similar features that also support inequalities [3]. We will extend calculational reasoning as needed alongside future formalizations.

Premise Selection. Irrelevant premises make it harder for ATPs to find proofs. Premise selection [24] is a process that attempts to identify the relevant premises in a problem. \(\mathbb {N}\)aproche lacks premise selection, which is a major barrier to scaling beyond chapter-sized formalizations. Work is in progress to add premise selection to Naproche-ZF similar to the premise selection of Sledgehammer [23, 30]. The checker already includes a basic MePo-like [25] filter. There is also experimental premise selection using graph neural networks (GNNs), thanks to the help of Mirek Olšák and Josef Urban. The first-order problems exported by Naproche-ZF are structurally similar to the Mizar corpus on which GNN-based filters have performed well [14]. Training data for the GNN can be extracted from explicitly justified proof steps (those that use “by \(\langle \textit{ref}\rangle \)”). We also expect that premise selection trained on the Mizar corpus would perform well for Naproche-ZF. GNN-based premise selection is experimental and not yet integrated into the checker. We plan to scale up premise selection as we slowly grow the standard library.

Performance. An apples-to-apples performance comparison of \(\mathbb {N}\)aproche and Naproche-ZF is difficult since there are no exactly parallel formalizations in the two systems. Naproche-ZF processes texts faster overall, in part due to having a parser with better asymptotic behaviour. The total checking time however is dominated by proof searches in external ATPs. ATP tasks are single-threaded in \(\mathbb {N}\)aproche, whereas Naproche-ZF uses a thread pool to make use of modern multi-core CPUs. Moreover, when an ATP task fails in \(\mathbb {N}\)aproche, it retries the task after unfolding definitions. When developing large formalizations, this behaviour can lead to sudden explosions in checking time, as proofs that used to be fast suddenly become slow because they are retried multiple times. This behaviour dates back to SAD, where it was useful in the context of smaller formalizations and weaker ATPs. Naproche-ZF calls the ATP once per problem (but does use portfolio modes of ATPs). The standard library of Naproche-ZF is currently at a modest 4600 lines (excl. comments and blank lines). Using Vampire as the ATP, it takes less than 10 s to check on an Intel i7-13700K and less than 22 s on an Apple M1. In comparison, \(\mathbb {N}\)aproche can take over 10 times as long when checking formalizations of similar length.

Naproche-ZF optionally caches the initial segment of successful ATP proofs between runs, resuming checking at the first failed proof, which saves time during proof writing. The cache of a proof is invalidated if the premises differ to avoid reproducibility problems: we do not use monotonicity of entailment since adding irrelevant premises can make ATP proofs fail.

4 Conclusion and Future Work

Even in its early state, Naproche-ZF is a new theorem prover that shows that natural theorem provers can scale beyond chapter-sized formalizations. It features an extensible grammar-based approach to natural language, familiar set-theoretical foundation in higher-order logic, and proof automation powered by strong first-order ATPs. Use of concurrency, more control over the proof search process, premise selection, faster parsing, a module system, and other refinements result in a performance improvement by an order of magnitude compared to its predecessor \(\mathbb {N}\)aproche.

Naproche-ZF is still experimental research-quality software and requires more features, grammar refinement, bug fixes, user testing, and documentation to become user friendly software. Naproche-ZF’s checker is a command line tool and lacks an integrated development environment (IDE), which would make the system more user friendly. Currently, user interaction with the ATP within Naproche-ZF is limited: failed or slow proofs sometimes require digging through large logs and experimenting with the ATP on the command line. An IDE for Naproche-ZF should also facilitate better interaction with the ATP, e.g. by giving Sledgehammer-like suggestions after finding proofs.

Naproche-ZF would also benefit from improvements to general purpose proof automation (e.g. better premise selection) and from including special-purpose proof automation, e.g. for arithmetic.

The included standard library is still fairly small and it would be nice to update student formalizations completed in older versions of Naproche to also work in Naproche-ZF.

Currently the files are typeset as-is. It would be worthwhile to generate richer HTML (with MathML) or PDF documents, by, e.g., linking lexical items to their definition or enabling progressive disclosure of more complicated proofs.