Saved in:
Bibliographic Details
Main Authors: Zhao, Haoyu, Yang, Ziran, Li, Jiawei, He, Deyuan, Li, Zenan, Jin, Chi, Veeravalli, Venugopal V., Gupta, Aarti, Arora, Sanjeev
Format: Preprint
Published: 2026
Subjects:
Online Access:https://arxiv.org/abs/2602.09464
Tags: Add Tag
No Tags, Be the first to tag this record!
_version_ 1866911437260062720
author Zhao, Haoyu
Yang, Ziran
Li, Jiawei
He, Deyuan
Li, Zenan
Jin, Chi
Veeravalli, Venugopal V.
Gupta, Aarti
Arora, Sanjeev
author_facet Zhao, Haoyu
Yang, Ziran
Li, Jiawei
He, Deyuan
Li, Zenan
Jin, Chi
Veeravalli, Venugopal V.
Gupta, Aarti
Arora, Sanjeev
contents Vericoding refers to the generation of formally verified code from rigorous specifications. Recent AI models show promise in vericoding, but a unified methodology for cross-paradigm evaluation is lacking. Existing benchmarks test only individual languages/tools (e.g., Dafny, Verus, and Lean) and each covers very different tasks, so the performance numbers are not directly comparable. We address this gap with AlgoVeri, a benchmark that evaluates vericoding of $77$ classical algorithms in Dafny, Verus, and Lean. By enforcing identical functional contracts, AlgoVeri reveals critical capability gaps in verification systems. While frontier models achieve tractable success in Dafny ($40.3$% for Gemini-3 Flash), where high-level abstractions and SMT automation simplify the workflow, performance collapses under the systems-level memory constraints of Verus ($24.7$%) and the explicit proof construction required by Lean (7.8%). Beyond aggregate metrics, we uncover a sharp divergence in test-time compute dynamics: Gemini-3 effectively utilizes iterative repair to boost performance (e.g., tripling pass rates in Dafny), whereas GPT-OSS saturates early. Finally, our error analysis shows that language design affects the refinement trajectory: while Dafny allows models to focus on logical correctness, Verus and Lean trap models in persistent syntactic and semantic barriers. All data and evaluation code can be found at https://github.com/haoyuzhao123/algoveri.
format Preprint
id arxiv_https___arxiv_org_abs_2602_09464
institution arXiv
publishDate 2026
record_format arxiv
spellingShingle AlgoVeri: An Aligned Benchmark for Verified Code Generation on Classical Algorithms
Zhao, Haoyu
Yang, Ziran
Li, Jiawei
He, Deyuan
Li, Zenan
Jin, Chi
Veeravalli, Venugopal V.
Gupta, Aarti
Arora, Sanjeev
Software Engineering
Artificial Intelligence
Computation and Language
Vericoding refers to the generation of formally verified code from rigorous specifications. Recent AI models show promise in vericoding, but a unified methodology for cross-paradigm evaluation is lacking. Existing benchmarks test only individual languages/tools (e.g., Dafny, Verus, and Lean) and each covers very different tasks, so the performance numbers are not directly comparable. We address this gap with AlgoVeri, a benchmark that evaluates vericoding of $77$ classical algorithms in Dafny, Verus, and Lean. By enforcing identical functional contracts, AlgoVeri reveals critical capability gaps in verification systems. While frontier models achieve tractable success in Dafny ($40.3$% for Gemini-3 Flash), where high-level abstractions and SMT automation simplify the workflow, performance collapses under the systems-level memory constraints of Verus ($24.7$%) and the explicit proof construction required by Lean (7.8%). Beyond aggregate metrics, we uncover a sharp divergence in test-time compute dynamics: Gemini-3 effectively utilizes iterative repair to boost performance (e.g., tripling pass rates in Dafny), whereas GPT-OSS saturates early. Finally, our error analysis shows that language design affects the refinement trajectory: while Dafny allows models to focus on logical correctness, Verus and Lean trap models in persistent syntactic and semantic barriers. All data and evaluation code can be found at https://github.com/haoyuzhao123/algoveri.
title AlgoVeri: An Aligned Benchmark for Verified Code Generation on Classical Algorithms
topic Software Engineering
Artificial Intelligence
Computation and Language
url https://arxiv.org/abs/2602.09464