Abstract
We introduce a machine-learning-based tool for the Lean proof assistant that suggests relevant premises for theorems being proved by a user. The design principles for the tool are (1) tight integration with the proof assistant, (2) ease of use and installation, (3) a lightweight and fast approach. For this purpose, we designed a custom version of the random forest model, trained in an online fashion. It is implemented directly in Lean, which was possible thanks to the rich and efficient metaprogramming features of Lean 4. The random forest is trained on data extracted from mathlib – Lean’s mathematics library. We experiment with various options for producing training features and labels. The advice from a trained model is accessible to the user via a dedicated tactic that can be called in an editor while constructing a proof interactively.
The results were supported by the Hoskinson Center for Formal Mathematics (BP, RFM, EA), the Kościuszko Foundation (BP), and the Principal’s Career Development Scholarship of the University of Edinburgh (RFM).
1 Introduction
Formalizing mathematics in proof assistants is an ambitious and hard undertaking. One of the major challenges in constructing formal proofs of theorems depending on multiple other results is the prerequisite of having a good familiarity with the structure and contents of the library. Tools for helping users search through formal libraries are currently limited.
In the case of the Lean proof assistant [13], users may look for relevant lemmas in its formal library, mathlib [5], either by (1) using general textual search tools and keywords, (2) browsing the related source files manually, or (3) using mathlib’s built-in search tactics.
Approaches (1) and (2) are often slow and tedious. The limitation of approach (3) is that these tactics propose only lemmas that strictly match the goal at the current proof state. This is often very useful, but it also means that they frequently fail to direct the user to relevant lemmas that do not match the current goal exactly. They may also suggest too many trivial lemmas if the goal is simple.
The aim of this project is to make progress towards improving the situation of a Lean user looking for relevant lemmas while building proofs. We develop a new tool that efficiently computes a ranking of potentially useful lemmas selected by a machine learning (ML) model trained on data extracted from mathlib. This ranking can be accessed and used interactively via a dedicated tactic.
The project described here belongs to the already quite broad body of work dealing with the problem of fact selection for theorem proving [1, 7, 9, 11, 12, 15, 16]. This problem, commonly referred to as the premise selection problem, is crucial when performing automated reasoning in large formal libraries – both in the context of automated (ATP) and interactive (ITP) theorem proving, and regardless of the underlying logical calculus. Most of the existing work on premise selection focuses on the ATP context. Our main contribution is the development of a premise selection tool that is practically usable in a proof assistant (in this case, Lean), tightly integrated with it, lightweight, extendable, and equipped with a convenient interface. The tool is available in a public GitHub repository: https://github.com/BartoszPiotrowski/lean-premise-selection.
2 Dataset Collection
A crucial requirement of a useful ML model is a high-quality dataset of training examples. It should represent the learning task well and be suitable for the ML architecture being applied.
In this work, we use simple ML architectures that cannot process raw theorem statements and require featurization as a preprocessing step. The features need to be meaningful yet simple so that the model can use them appropriately. Our approach is described in Sect. 2.1. The notion of relevant premise may be understood differently depending on the context. In Sect. 2.2, we describe the different specifications of this notion that we used in our experiments.
The tool developed in this work is implemented and meant to be used in Lean 4 together with mathlib 4. However, since, at the time of writing, Lean 4’s version of the library is still being ported from Lean 3, we use mathlib3port (Footnote 1) as our main data source.
2.1 Features
The features, similar to those used in [8, 15], consist of the symbols used in the theorem statement with different degrees of structure. In particular, three types of features are used: names, bigrams and trigrams.
As an illustration, take this theorem about groups with zero:
This statement comes from one of the source files of mathlib. When producing the features for it, we do not use it directly as printed above but rather we take its elaborated counterpart – a much more detailed version where all the hidden assumptions are made explicit by Lean’s elaborator so that the expression precisely conforms to Lean’s dependent type theory.
The most basic form of featurization is the bag-of-words model, where we simply collect all the names (and numerical constants) involved in the theorem.
Following this definition, we obtain the names visible in the source version of the statement (Footnote 2), plus many more hidden names that only appear in the elaborated expression, e.g., one related to interpreting numerical literals as natural numbers.
During featurization we distinguish features coming from the conclusion and from the hypotheses (assumptions) of the theorem, and we mark them by prepending a corresponding prefix.
For our running example, all this results in the following list of names:
It would be desirable, however, to keep track of which symbols appear next to each other in the syntactic trees of the theorem’s hypotheses and conclusion. Thus, we extract bigrams formed by a head symbol and each of its arguments (separated by / below).
Similarly, we also consider trigrams, taking all paths of length 3 from the syntactic tree of the expression.
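To make the three feature types concrete, here is an illustrative Python sketch (the actual implementation is in Lean 4; the tiny tuple-based expression representation and the symbol names are hypothetical) that extracts names, bigrams, and trigrams from a syntax tree:

```python
# Illustrative sketch, not the actual Lean implementation: an expression
# is a nested tuple (head_symbol, arg, arg, ...); a bare string is a leaf.

def top(expr):
    """Head symbol of an expression (the leaf itself for leaves)."""
    return expr if isinstance(expr, str) else expr[0]

def names(expr):
    """Bag-of-words: all symbols occurring in the expression."""
    if isinstance(expr, str):
        return {expr}
    head, *args = expr
    return {head}.union(*(names(a) for a in args))

def bigrams(expr):
    """Pairs head/arg for every node and each of its arguments."""
    if isinstance(expr, str):
        return set()
    head, *args = expr
    result = {f"{head}/{top(a)}" for a in args}
    return result.union(*(bigrams(a) for a in args))

def trigrams(expr):
    """All paths of length 3 down the syntax tree."""
    if isinstance(expr, str):
        return set()
    head, *args = expr
    result = set()
    for a in args:
        if not isinstance(a, str):
            sub_head, *sub_args = a
            for b in sub_args:
                result.add(f"{head}/{sub_head}/{top(b)}")
        result |= trigrams(a)
    return result

# A hypothetical elaborated statement: Eq (mul a (inv a)) one
e = ("Eq", ("mul", "a", ("inv", "a")), "one")
print(sorted(names(e)))     # all five symbols
print(sorted(bigrams(e)))   # e.g. 'Eq/mul', 'mul/inv'
print(sorted(trigrams(e)))  # e.g. 'Eq/mul/inv'
```

On this toy expression, the bigrams capture which symbol is applied directly to which, and the trigrams capture two-step paths such as 'mul/inv/a'.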
2.2 Relevant Premises
To obtain the list of all the premises used in a proof of a given theorem, it suffices to traverse the theorem’s proof term (Footnote 3) and keep track of all the constants whose type is a proposition. For instance, the raw list of premises that appear in the proof of our example theorem is:
For more complicated examples, this approach results in a large number of premises, including lemmas used implicitly by tactics (for instance, those picked by the simplification tactic) and simple facts that a user would rarely write explicitly. Three different filters are applied to mitigate this issue: all, source, and math. They are described below and their overall effect is shown in Table 1.
1. The all filter preserves almost all premises from the original, raw list, removing only those that were generated automatically by Lean; these contain a leading underscore in their names. In our example, there are no such premises. Examples from this filter are not appropriate for training an ML advisor for interactive use, as the suggestions would contain many lemmas used implicitly by tactics. Yet, such an advisor could be used for automated ITP approaches such as hammers [3].
2. The source filter leaves only those premises that appear in the proof’s source code. The idea is to model the explicit usage of premises by a user. Following our example, we would take the following proof as a string and list only the three premises appearing there:
3. The math filter preserves only lemmas that are clearly of a mathematical nature, discarding basic, technical ones. The names of all theorems and definitions from mathlib are extracted and used as a white list. In particular, this means that many basic lemmas from Lean’s core library are filtered out.
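The effect of the three filters can be sketched as follows (an illustrative Python sketch, not the actual Lean implementation; the premise names, proof string, and white list below are hypothetical):

```python
# Illustrative sketch of the three premise filters; the premise names and
# the mathlib white list below are hypothetical.

def filter_all(premises):
    # Drop auto-generated premises, recognizable by a leading underscore
    # somewhere in their (possibly dot-separated) name.
    return [p for p in premises
            if not any(part.startswith("_") for part in p.split("."))]

def filter_source(premises, proof_source):
    # Keep only premises that literally appear in the proof's source code.
    return [p for p in premises if p in proof_source]

def filter_math(premises, mathlib_names):
    # Keep only premises on the mathlib white list.
    return [p for p in premises if p in mathlib_names]

raw = ["mul_inv_cancel", "_private.aux_lemma", "eq_comm", "rfl"]
proof = "by rw [mul_inv_cancel, eq_comm]"
mathlib = {"mul_inv_cancel", "eq_comm"}

print(filter_all(raw))            # drops '_private.aux_lemma'
print(filter_source(raw, proof))  # ['mul_inv_cancel', 'eq_comm']
print(filter_math(raw, mathlib))  # ['mul_inv_cancel', 'eq_comm']
```

Note how the filters are increasingly restrictive on this toy example: all only removes the auto-generated name, while source and math also discard the technical rfl.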
In addition to our base datasets containing one data point per theorem, we also created a dataset (labeled intermediate) representing intermediate proof states. In the base datasets we recorded the features of the initial proof state (the hypotheses and the conclusion of the theorem to be proved) and the premises used in the full proof. In the intermediate dataset we instead record the features of a proof state encountered while constructing a proof, and the premises used in the next proof step only.
To this end, we used LeanInk (Footnote 4), a helper tool for Alectryon [17] – a toolkit that aids exploration of tactical proof scripts without running the proof assistant. Given a Lean file, LeanInk generates all the states that a user might be able to see in the infoview (a panel in Lean that displays goal states and other information about the prover’s state) by clicking on the file. The file is split into fragments, each containing a string of Lean code, represented by a list of tokens, together with the proof states before and after. In this way, the file can be loaded statically, simulating the effect of running Lean. Furthermore, it can be configured to keep track of typing information, which is key to detecting which tokens are premises. We modified LeanInk so that every fragment that appears inside a proof is treated as its own theorem by our extractor. We gather all the premises found in the list of tokens and featurize the hypotheses and goals in the “before” proof state.
This dataset consists of 91 292 examples and 143 165 premises, which gives an average of around 1.57 premises per example. It represents a more fine-grained use of the premises, which does not exactly correspond to our main objective of providing rankings of premises on the level of theorem statements. We treat it as an auxiliary dataset potentially useful for augmenting our base datasets.
3 Machine Learning Models
The task modelled here with ML is predicting a ranking of likely useful premises (lemmas and theorems) conditioned on the features of the statement of the theorem being proved by a user. The nature of this problem is different from common applications of classical ML: both the number of features and the number of labels (premises) to predict are large, and the training examples are sparse in the feature space. Thus, we could not directly rely on traditional implementations of ML algorithms, and using custom-built versions was necessary. As one of our design requirements was tight integration with the proof assistant, we implemented the ML algorithms directly in Lean 4, without needing to call external tools. This also served as a test for the maturity and efficiency of Lean 4 as a programming language.
In Sects. 3.1 and 3.2 we describe two machine learning algorithms implemented in this work: k-nearest neighbours (k-NN) and random forest.
3.1 k-Nearest Neighbours
This is a classical and conceptually simple ML algorithm [6], which has already been used multiple times for premise selection [2, 9, 10]. It belongs to the lazy learning category, meaning that no prediction model is trained beforehand; instead, the whole dataset is an input to the algorithm at prediction time.
Given an unlabeled example, k-NN produces a prediction by extracting the labels of the k most similar examples in the dataset and returning an averaged (or most frequent) label. In our case, the labels are lists of premises. We compose multiple labels into a ranking of premises according to the frequency of appearance in the concatenated labels.
The similarity measure in the feature space counts how many features are shared between two data points, but additionally puts more weight on those features that are rarer in the whole training dataset \(\mathcal {D}\). The similarity of the two examples \(x_1\) and \(x_2\) associated with sets of features \(f_1\) and \(f_2\), respectively, is given below.

$$\text {sim}(x_1, x_2) = \sum _{f \in f_1 \cap f_2} \log \left( \frac{|\mathcal {D}|}{|\mathcal {D}_f|}\right) $$

where \(\mathcal {D}_f\) are those training examples that contain the feature f.
The advantages of k-NN are its simplicity and the lack of training. A disadvantage, however, is the need to traverse the whole training dataset in order to produce a single prediction (a ranking). This may be slow, and thus not optimal for interactive usage in proof assistants.
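As a sketch of the prediction step, assuming an IDF-style logarithmic weighting for the rarity of features (the exact weighting used in the implementation may differ), the ranking could be computed as follows in Python (feature and premise names are hypothetical):

```python
import math
from collections import Counter

def knn_rank(query_feats, dataset, k):
    """dataset: list of (feature_set, premise_list) pairs.
    Return premises ranked by frequency among the k nearest examples."""
    n = len(dataset)
    # How many training examples contain each feature (|D_f|).
    df = Counter(f for feats, _ in dataset for f in feats)

    def sim(feats):
        # Shared features, weighted so that rarer features count more.
        return sum(math.log(n / df[f]) for f in query_feats & feats)

    nearest = sorted(dataset, key=lambda ex: sim(ex[0]), reverse=True)[:k]
    # Compose the k labels into one ranking by frequency of appearance.
    votes = Counter(p for _, premises in nearest for p in premises)
    return [p for p, _ in votes.most_common()]

data = [({"n:mul", "n:inv"}, ["mul_inv_cancel"]),
        ({"n:mul", "n:one"}, ["mul_one"]),
        ({"n:add"}, ["add_comm"])]
print(knn_rank({"n:mul", "n:inv"}, data, k=2))
```

The example with the rare feature `n:inv` dominates the similarity score, so its premise heads the ranking.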
3.2 Random Forest
As an alternative to k-NN, we use random forest [4] – an ML algorithm from the eager learning category, with a separate training phase resulting in a prediction model consisting of a collection of decision trees. The leaves of the trees contain labels, and their nodes contain decision rules based on the features. In our case, the labels are sets of premises, and the rules are simple tests that check if a given feature appears in an example.
When predicting, unlabeled examples are passed down the trees to the leaves, the reached labels are recorded, and the final prediction is averaged across the trees via voting. The trees are trained in such a way as to avoid correlations between them, and the averaged prediction from them is of better quality than the prediction from a single tree.
Our version of random forest, adapted to deal with sparse binary features and a large number of labels, is similar to the one used in [19], where the task was to predict the next tactic progressing a proof in the Coq proof assistant. There the features were also sparse; the difference is that here we need to predict sets of labels (premises), not just one label (the next tactic).
Our random forest is trained in an online manner, i.e., it is updated sequentially with single training examples – not with the entire training dataset at once, as is typically done. The rationale for this is to make it easy to update the model with data coming from new theorems proved by a user. This allows the model to immediately provide suggestions taking these recently added theorems into account (Footnote 5).
Algorithm 1 provides a sketch of how a training example updates a tree – for all the details see the actual implementation in our public GitHub repository (Footnote 6). A crucial part of the algorithm is the MakeSplitRule function creating node-splitting rules. Searching for rules resulting in optimal splits would be costly, so this function relies on heuristics.
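The update of a single tree can be sketched as follows (an illustrative Python simplification of Algorithm 1; the leaf-capacity threshold and the balance-based splitting heuristic are assumptions, not the actual MakeSplitRule):

```python
CAPACITY = 4  # assumed maximal number of examples stored in a leaf

class Leaf:
    def __init__(self, examples=None):
        # Each example is a (feature_set, premise_set) pair.
        self.examples = examples if examples is not None else []

class Node:
    def __init__(self, feature, absent, present):
        self.feature, self.absent, self.present = feature, absent, present

def make_split_rule(examples):
    # Heuristic stand-in for MakeSplitRule: pick the feature that splits
    # the stored examples most evenly.
    feats = {f for fs, _ in examples for f in fs}
    half = len(examples) / 2
    return min(feats,
               key=lambda f: abs(sum(f in fs for fs, _ in examples) - half))

def update(tree, example):
    """Online update: route the example to a leaf; split the leaf if full."""
    if isinstance(tree, Node):
        if tree.feature in example[0]:
            tree.present = update(tree.present, example)
        else:
            tree.absent = update(tree.absent, example)
        return tree
    tree.examples.append(example)
    if len(tree.examples) > CAPACITY:
        f = make_split_rule(tree.examples)
        return Node(f,
                    Leaf([e for e in tree.examples if f not in e[0]]),
                    Leaf([e for e in tree.examples if f in e[0]]))
    return tree

tree = Leaf()
for ex in [({"a"}, {"p1"}), ({"b"}, {"p2"}), ({"a", "b"}, {"p3"}),
           ({"c"}, {"p4"}), ({"a", "c"}, {"p5"})]:
    tree = update(tree, ex)
print(type(tree).__name__)  # prints "Node"
```

Because each example only touches one root-to-leaf path, an update is cheap, which is what makes the online mode practical.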
Figure 1 schematically depicts how a simple decision tree from a trained random forest predicts a set of premises for an input example.
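The prediction step can be illustrated by the following Python sketch (the tiny trees and premise names are hypothetical; the actual implementation is in Lean):

```python
from collections import Counter

# A decision node is (feature, left, right): go left when the feature is
# absent from the example, right when present.  A leaf is a set of premises.

def predict_tree(tree, features):
    while isinstance(tree, tuple):
        feature, absent, present = tree
        tree = present if feature in features else absent
    return tree  # set of premises stored in the reached leaf

def predict_forest(trees, features):
    # Each tree votes for the premises in its leaf; rank by vote count.
    votes = Counter(p for t in trees for p in predict_tree(t, features))
    return [p for p, _ in votes.most_common()]

t1 = ("n:mul", {"add_comm"}, ("n:inv", {"mul_one"}, {"mul_inv_cancel"}))
t2 = ("n:inv", {"mul_one"}, {"mul_inv_cancel", "inv_inv"})
print(predict_forest([t1, t2], {"n:mul", "n:inv"}))
```

Premises suggested by several trees accumulate more votes and end up higher in the ranking.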
4 Evaluation Setup and Results
To assess the performance of the ML algorithms, the data points extracted from mathlib were split into training and testing sets. The testing examples come from the modules that are not dependencies of any other modules (there are 592 of them). This simulates a realistic scenario in which a user utilizing the suggestion tool develops a new mathlib module. The rest of the modules (2436) served as the source of training examples.
Two measures of the quality of the rankings produced by ML are defined: Cover and Cover\(_+\). Assuming a theorem T depends on the set of premises P of size n, and R is the ranking of premises predicted by the ML advisor for T, these measures are defined as follows:

$$\text {Cover}(P, R) = \frac{|P \cap R\texttt {[:}n\texttt {]}|}{n} \qquad \text {Cover}_+(P, R) = \frac{|P \cap R\texttt {[:}n+10\texttt {]}|}{n}$$

where \(R\texttt {[:} k\texttt {]}\) is a set of k initial premises from ranking R. Both Cover and Cover\(_+\) return values in [0, 1]. Cover gives the score of 1 only for a “perfect” prediction where the premises actually used in the proof form an initial segment of the ranking. Cover\(_+\) may also give a perfect score to less precise predictions. The rationale for Cover\(_+\) is that the user in practice may look through 10 or more suggested premises. This is often more than the n premises actually used in the proof, so we consider initial segments of length \(n + 10\) in Cover\(_+\).
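In code, the two measures amount to the following (a Python sketch of the definitions above, with hypothetical premise names):

```python
def cover(used, ranking):
    """Fraction of the n actually used premises found among the first n
    entries of the predicted ranking."""
    n = len(used)
    return len(used & set(ranking[:n])) / n

def cover_plus(used, ranking, slack=10):
    """Like cover, but looks at the first n + slack ranking entries."""
    n = len(used)
    return len(used & set(ranking[: n + slack])) / n

used = {"mul_one", "mul_inv_cancel"}
ranking = ["mul_one", "inv_inv", "one_mul", "mul_inv_cancel"]
print(cover(used, ranking))       # 0.5
print(cover_plus(used, ranking))  # 1.0
```

Here Cover penalizes `mul_inv_cancel` appearing at position 4 rather than in the first two entries, while Cover\(_+\) forgives it because the premise still lands within the first n + 10 suggestions.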
Both k-NN and random forest are evaluated on data subject to all three premise filters described in Sect. 2.2. For each of these variants of data, three combinations of features are tested: (1) names only, (2) names and bigrams, (3) names, bigrams, and trigrams. The hyper-parameters for the ML algorithms were selected by an experiment on a smaller dataset. For k-NN, the number of neighbours was fixed to 100. For random forest, the number of trees was set to 300, each example was used for training a particular decision tree with probability equal to 0.3, and the training algorithm passed through the whole training data 3 times.
Table 2 shows the results of the experiment. In terms of the Cover metric, random forest performed better than k-NN for all data configurations. However, for the Cover\(_+\) metric, k-NN surpassed random forest under the math filter.
It turned out that the union of names and bigrams constitutes the best features for all the filters and both ML algorithms. This likely means that the more complex trigrams did not help the algorithms generalize but rather caused over-fitting on the training set.
The results for the all filter appear to be much higher than for the other two filters. However, this is because applying the all filter leaves many simple examples containing just a few common, basic premises (e.g., just a single rfl lemma), which inflate the average score.
Overall, random forest with names\(\, + \,\)bigrams (n+b) features gives the best results. An additional practical advantage of this model over k-NN is the speed of outputting predictions. For instance, for the source filter and n+b features, the average times of predicting a ranking of premises per theorem were 0.28 s and 5.65 s for random forest and k-NN, respectively.
Additionally, we evaluated the ML models on the intermediate dataset, using n+b features. The random forest achieved Cover = 0.09 and Cover\(_+\) = 0.24, whereas k-NN resulted in Cover = 0.08 and Cover\(_+\) = 0.21 on the testing part of the data. Then, we used the intermediate dataset in an attempt to improve the testing results on the base dataset with the source filter (as intermediate only contains premises exposed in the source files). We used the intermediate data as a pre-training dataset, first training a random forest on it and later on the base data. We also used intermediate to augment the base data, mixing the two together. However, neither the pre-training mode nor the augmentation mode yielded statistically significant improvements in testing performance. It is possible that the prediction quality improved from a practical perspective, being more proof-state-dependent and not only theorem-dependent, but this did not manifest in our theorem-dependent evaluation.
The evaluation may be reproduced by following the instructions in the linked source code (Footnote 7).
5 Interactive Tool
The ML predictor is wrapped in an interactive tactic that users can type into their proof script. It invokes the predictor and produces a list of suggestions, which is displayed in the infoview. The display makes use of the new remote-procedure-call (RPC) feature in Lean 4 [14] to asynchronously run various tactics for each suggestion. Given a suggested premise p, the system attempts to apply it through several tactic forms and returns the first successful tactic application that advances the state. This is then displayed to the user as shown in Fig. 2, and the user can select the resulting tactic to insert into the proof script. By using an asynchronous approach, we can display results rapidly without waiting for a slow tactic search to complete.
6 Future Work
There are several directions in which the current work may be developed.
The results may be improved by augmenting the dataset with, for instance, synthetic theorems, as well as developing better features, utilizing the well-defined structure of Lean expressions.
The evaluation may be extended to assess proof-state-level performance, and to compare with Lean’s standard suggestion tactics. It could be beneficial to combine these tactics – which use strict matching – with our tool, which is based on statistical matching.
Applying modern neural architectures in place of the simpler ML algorithms used here is a promising path [7, 12, 16, 18]. It would depart from our philosophy of a lightweight, self-contained approach as the suggestions would come from an external tool, possibly placed on a remote server. However, given the strength of the current neural networks, we could hope for higher-quality predictions. Moreover, neural models do not require hand-engineered features. The results presented here could serve as a baseline for comparison.
Finally, premise selection is an important component of ITP hammer systems [3]. The presented tool may be readily used for a hammer in Lean, which has not yet been developed.
Notes
- 1.
https://github.com/leanprover-community/mathlib3port (commit f4e5dfe).
- 2.
In fact, we use translations of these symbols from the elaborated counterpart of the theorem; so, for instance, we use the corresponding constant name instead of the notation \(\ne \), etc.
- 3.
A proof term is an internal Lean expression whose type is the theorem, constructed based on the proof written by a user, possibly using tactics.
- 4.
- 5.
This mode, however, has not yet been tested in the current stage of this work.
- 6.
The decision tree implementation is in a file PremiseSelection/Tree.lean.
- 7.
References
Alama, J., Heskes, T., Kühlwein, D., Tsivtsivadze, E., Urban, J.: Premise selection for mathematics by corpus analysis and kernel methods. J. Autom. Reason. 52(2), 191–213 (2014). https://doi.org/10.1007/s10817-013-9286-5
Blanchette, J.C., Greenaway, D., Kaliszyk, C., Kühlwein, D., Urban, J.: A learning-based fact selector for Isabelle/HOL. J. Autom. Reason. 57(3), 219–244 (2016). https://doi.org/10.1007/s10817-016-9362-8
Blanchette, J.C., Kaliszyk, C., Paulson, L.C., Urban, J.: Hammering towards QED. J. Formaliz. Reason. 9(1), 101–148 (2016). https://doi.org/10.6092/issn.1972-5787/4593
Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001)
The mathlib Community: The Lean mathematical library. In: Proceedings of the 9th ACM SIGPLAN International Conference on Certified Programs and Proofs, CPP 2020, pp. 367–381. Association for Computing Machinery, New York, NY, USA (2020). https://doi.org/10.1145/3372885.3373824
Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning. SSS, Springer, New York (2009). https://doi.org/10.1007/978-0-387-84858-7
Irving, G., Szegedy, C., Alemi, A.A., Eén, N., Chollet, F., Urban, J.: DeepMath - deep sequence models for premise selection. In: Lee, D.D., Sugiyama, M., von Luxburg, U., Guyon, I., Garnett, R. (eds.) Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5–10, 2016, Barcelona, Spain, pp. 2235–2243 (2016). https://proceedings.neurips.cc/paper/2016/hash/f197002b9a0853eca5e046d9ca4663d5-Abstract.html
Jakubův, J., Urban, J.: ENIGMA: efficient learning-based inference guiding machine. In: Geuvers, H., England, M., Hasan, O., Rabe, F., Teschke, O. (eds.) CICM 2017. LNCS (LNAI), vol. 10383, pp. 292–302. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-62075-6_20
Kaliszyk, C., Urban, J.: Learning-assisted automated reasoning with Flyspeck. J. Autom. Reason. 53(2), 173–213 (2014). https://doi.org/10.1007/s10817-014-9303-3
Kaliszyk, C., Urban, J.: MizAR 40 for Mizar 40. J. Autom. Reason. 55(3), 245–256 (2015). https://doi.org/10.1007/s10817-015-9330-8
Kühlwein, D., van Laarhoven, T., Tsivtsivadze, E., Urban, J., Heskes, T.: Overview and evaluation of premise selection techniques for large theory mathematics. In: Gramlich, B., Miller, D., Sattler, U. (eds.) IJCAR 2012. LNCS (LNAI), vol. 7364, pp. 378–392. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-31365-3_30
Mikula, M., et al.: Magnushammer: a transformer-based approach to premise selection. CoRR abs/2303.04488 (2023). https://doi.org/10.48550/arXiv.2303.04488
de Moura, L., Ullrich, S.: The Lean 4 theorem prover and programming language. In: Platzer, A., Sutcliffe, G. (eds.) CADE 2021. LNCS (LNAI), vol. 12699, pp. 625–635. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-79876-5_37
Nawrocki, W., Ayers, E.W., Ebner, G.: An extensible user interface for Lean 4. In: 14th International Conference on Interactive Theorem Proving, ITP 2023, July 31-August 4, 2023, Białystok, Poland. Schloss Dagstuhl - Leibniz-Zentrum für Informatik (2023)
Piotrowski, B., Urban, J.: ATPboost: Learning Premise Selection in Binary Setting with ATP Feedback. In: Galmiche, D., Schulz, S., Sebastiani, R. (eds.) IJCAR 2018. LNCS (LNAI), vol. 10900, pp. 566–574. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-94205-6_37
Piotrowski, B., Urban, J.: Stateful premise selection by recurrent neural networks. In: Albert, E., Kovács, L. (eds.) LPAR 2020: 23rd International Conference on Logic for Programming, Artificial Intelligence and Reasoning, Alicante, Spain, 22–27 May 2020. EPiC Series in Computing, vol. 73, pp. 409–422. EasyChair (2020). https://doi.org/10.29007/j5hd
Pit-Claudel, C.: Untangling mechanized proofs. In: Proceedings of the 13th ACM SIGPLAN International Conference on Software Language Engineering, SLE 2020, pp. 155–174. Association for Computing Machinery, New York, NY, USA (2020). https://doi.org/10.1145/3426425.3426940, https://pit-claudel.fr/clement/papers/alectryon-SLE20.pdf
Tworkowski, S., et al.: Formal premise selection with language models. In: The 7th Conference on Artificial Intelligence and Theorem Proving, AITP 2022, September 4–9, 2022, Aussois and online, France (2022). http://aitp-conference.org/2022/abstract/AITP_2022_paper_32.pdf
Zhang, L., Blaauwbroek, L., Piotrowski, B., Černý, P., Kaliszyk, C., Urban, J.: Online machine learning techniques for Coq: a comparison. In: Kamareddine, F., Sacerdoti Coen, C. (eds.) CICM 2021. LNCS (LNAI), vol. 12833, pp. 67–83. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-81097-9_5
Rights and permissions
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Copyright information
© 2023 The Author(s)
About this paper
Cite this paper
Piotrowski, B., Mir, R.F., Ayers, E. (2023). Machine-Learned Premise Selection for Lean. In: Ramanayake, R., Urban, J. (eds) Automated Reasoning with Analytic Tableaux and Related Methods. TABLEAUX 2023. Lecture Notes in Computer Science(), vol 14278. Springer, Cham. https://doi.org/10.1007/978-3-031-43513-3_10
DOI: https://doi.org/10.1007/978-3-031-43513-3_10
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-43512-6
Online ISBN: 978-3-031-43513-3
eBook Packages: Computer Science (R0)