Staff View: :: Library Catalog

Saved in:

Bibliographic Details
Main Authors:	Zhao, Haoyu, Yang, Ziran, Li, Jiawei, He, Deyuan, Li, Zenan, Jin, Chi, Veeravalli, Venugopal V., Gupta, Aarti, Arora, Sanjeev
Format:	Preprint
Published:	2026
Subjects:	Software Engineering Artificial Intelligence Computation and Language
Online Access:	https://arxiv.org/abs/2602.09464
Tags:	Add Tag No Tags, Be the first to tag this record!

_version_	1866911437260062720
author	Zhao, Haoyu Yang, Ziran Li, Jiawei He, Deyuan Li, Zenan Jin, Chi Veeravalli, Venugopal V. Gupta, Aarti Arora, Sanjeev
author_facet	Zhao, Haoyu Yang, Ziran Li, Jiawei He, Deyuan Li, Zenan Jin, Chi Veeravalli, Venugopal V. Gupta, Aarti Arora, Sanjeev
contents	Vericoding refers to the generation of formally verified code from rigorous specifications. Recent AI models show promise in vericoding, but a unified methodology for cross-paradigm evaluation is lacking. Existing benchmarks test only individual languages/tools (e.g., Dafny, Verus, and Lean) and each covers very different tasks, so the performance numbers are not directly comparable. We address this gap with AlgoVeri, a benchmark that evaluates vericoding of $77$ classical algorithms in Dafny, Verus, and Lean. By enforcing identical functional contracts, AlgoVeri reveals critical capability gaps in verification systems. While frontier models achieve tractable success in Dafny ($40.3$% for Gemini-3 Flash), where high-level abstractions and SMT automation simplify the workflow, performance collapses under the systems-level memory constraints of Verus ($24.7$%) and the explicit proof construction required by Lean (7.8%). Beyond aggregate metrics, we uncover a sharp divergence in test-time compute dynamics: Gemini-3 effectively utilizes iterative repair to boost performance (e.g., tripling pass rates in Dafny), whereas GPT-OSS saturates early. Finally, our error analysis shows that language design affects the refinement trajectory: while Dafny allows models to focus on logical correctness, Verus and Lean trap models in persistent syntactic and semantic barriers. All data and evaluation code can be found at https://github.com/haoyuzhao123/algoveri.
format	Preprint
id	arxiv_https___arxiv_org_abs_2602_09464
institution	arXiv
publishDate	2026
record_format	arxiv
spellingShingle	AlgoVeri: An Aligned Benchmark for Verified Code Generation on Classical Algorithms Zhao, Haoyu Yang, Ziran Li, Jiawei He, Deyuan Li, Zenan Jin, Chi Veeravalli, Venugopal V. Gupta, Aarti Arora, Sanjeev Software Engineering Artificial Intelligence Computation and Language Vericoding refers to the generation of formally verified code from rigorous specifications. Recent AI models show promise in vericoding, but a unified methodology for cross-paradigm evaluation is lacking. Existing benchmarks test only individual languages/tools (e.g., Dafny, Verus, and Lean) and each covers very different tasks, so the performance numbers are not directly comparable. We address this gap with AlgoVeri, a benchmark that evaluates vericoding of $77$ classical algorithms in Dafny, Verus, and Lean. By enforcing identical functional contracts, AlgoVeri reveals critical capability gaps in verification systems. While frontier models achieve tractable success in Dafny ($40.3$% for Gemini-3 Flash), where high-level abstractions and SMT automation simplify the workflow, performance collapses under the systems-level memory constraints of Verus ($24.7$%) and the explicit proof construction required by Lean (7.8%). Beyond aggregate metrics, we uncover a sharp divergence in test-time compute dynamics: Gemini-3 effectively utilizes iterative repair to boost performance (e.g., tripling pass rates in Dafny), whereas GPT-OSS saturates early. Finally, our error analysis shows that language design affects the refinement trajectory: while Dafny allows models to focus on logical correctness, Verus and Lean trap models in persistent syntactic and semantic barriers. All data and evaluation code can be found at https://github.com/haoyuzhao123/algoveri.
title	AlgoVeri: An Aligned Benchmark for Verified Code Generation on Classical Algorithms
topic	Software Engineering Artificial Intelligence Computation and Language
url	https://arxiv.org/abs/2602.09464

Similar Items