Saved in:
| Main Authors: | , , , , , , , , |
|---|---|
| Format: | Preprint |
| Published: |
2026
|
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2602.09464 |
| Tags: |
Add Tag
No Tags, Be the first to tag this record!
|
| _version_ | 1866911437260062720 |
|---|---|
| author | Zhao, Haoyu Yang, Ziran Li, Jiawei He, Deyuan Li, Zenan Jin, Chi Veeravalli, Venugopal V. Gupta, Aarti Arora, Sanjeev |
| author_facet | Zhao, Haoyu Yang, Ziran Li, Jiawei He, Deyuan Li, Zenan Jin, Chi Veeravalli, Venugopal V. Gupta, Aarti Arora, Sanjeev |
| contents | Vericoding refers to the generation of formally verified code from rigorous specifications. Recent AI models show promise in vericoding, but a unified methodology for cross-paradigm evaluation is lacking. Existing benchmarks test only individual languages/tools (e.g., Dafny, Verus, and Lean) and each covers very different tasks, so the performance numbers are not directly comparable. We address this gap with AlgoVeri, a benchmark that evaluates vericoding of $77$ classical algorithms in Dafny, Verus, and Lean. By enforcing identical functional contracts, AlgoVeri reveals critical capability gaps in verification systems. While frontier models achieve tractable success in Dafny ($40.3$% for Gemini-3 Flash), where high-level abstractions and SMT automation simplify the workflow, performance collapses under the systems-level memory constraints of Verus ($24.7$%) and the explicit proof construction required by Lean (7.8%). Beyond aggregate metrics, we uncover a sharp divergence in test-time compute dynamics: Gemini-3 effectively utilizes iterative repair to boost performance (e.g., tripling pass rates in Dafny), whereas GPT-OSS saturates early. Finally, our error analysis shows that language design affects the refinement trajectory: while Dafny allows models to focus on logical correctness, Verus and Lean trap models in persistent syntactic and semantic barriers. All data and evaluation code can be found at https://github.com/haoyuzhao123/algoveri. |
| format | Preprint |
| id |
arxiv_https___arxiv_org_abs_2602_09464 |
| institution | arXiv |
| publishDate | 2026 |
| record_format | arxiv |
| spellingShingle | AlgoVeri: An Aligned Benchmark for Verified Code Generation on Classical Algorithms Zhao, Haoyu Yang, Ziran Li, Jiawei He, Deyuan Li, Zenan Jin, Chi Veeravalli, Venugopal V. Gupta, Aarti Arora, Sanjeev Software Engineering Artificial Intelligence Computation and Language Vericoding refers to the generation of formally verified code from rigorous specifications. Recent AI models show promise in vericoding, but a unified methodology for cross-paradigm evaluation is lacking. Existing benchmarks test only individual languages/tools (e.g., Dafny, Verus, and Lean) and each covers very different tasks, so the performance numbers are not directly comparable. We address this gap with AlgoVeri, a benchmark that evaluates vericoding of $77$ classical algorithms in Dafny, Verus, and Lean. By enforcing identical functional contracts, AlgoVeri reveals critical capability gaps in verification systems. While frontier models achieve tractable success in Dafny ($40.3$% for Gemini-3 Flash), where high-level abstractions and SMT automation simplify the workflow, performance collapses under the systems-level memory constraints of Verus ($24.7$%) and the explicit proof construction required by Lean (7.8%). Beyond aggregate metrics, we uncover a sharp divergence in test-time compute dynamics: Gemini-3 effectively utilizes iterative repair to boost performance (e.g., tripling pass rates in Dafny), whereas GPT-OSS saturates early. Finally, our error analysis shows that language design affects the refinement trajectory: while Dafny allows models to focus on logical correctness, Verus and Lean trap models in persistent syntactic and semantic barriers. All data and evaluation code can be found at https://github.com/haoyuzhao123/algoveri. |
| title | AlgoVeri: An Aligned Benchmark for Verified Code Generation on Classical Algorithms |
| topic | Software Engineering Artificial Intelligence Computation and Language |
| url | https://arxiv.org/abs/2602.09464 |