Saved in:
Bibliographic Details
Main Authors: Dekoninck, Jasper, Petrov, Ivo, Minchev, Kristian, Balunovic, Mislav, Vechev, Martin, Marinov, Miroslav, Drencheva, Maria, Konova, Lyuba, Shumanov, Milen, Tsvetkov, Kaloyan, Drenchev, Nikolay, Todorov, Lazar, Nikolova, Kalina, Georgiev, Nikolay, Kalinkova, Vanesa, Ismoldayev, Margulan
Format: Preprint
Published: 2025
Subjects:
Online Access:https://arxiv.org/abs/2506.21621
Tags: Add Tag
No Tags, Be the first to tag this record!
Table of Contents:
  • In recent months, large language models (LLMs) have made significant progress in mathematical proof generation, but further advancement is hindered by the lack of a large-scale, high-quality dataset of human-evaluated proofs. While expensive to create, such a dataset is essential for driving improvements in training and enabling a rigorous analysis of proof generation capabilities. In this work, we present the Open Proof Corpus (OPC), a dataset comprising over 5,000 human-evaluated proofs produced by state-of-the-art LLMs. The OPC was specifically designed for broad applicability and downstream usage in proof generation research and is the first to include a substantial number of correct, LLM-generated solutions to problems from prestigious mathematics competitions such as the USAMO and IMO. Using the OPC, we explore critical questions in automated proof generation: (1) the performance gap between natural language and formal proof generation, (2) the discrepancy between final-answer accuracy and full-proof validity, and (3) the impact of best-of-n selection on proof quality. Finally, to showcase the utility of the OPC, we finetune an 8B-parameter model on the dataset, obtaining a model that performs on par with the best model, Gemini-2.5-Pro, on the task of evaluating proof correctness.