Saved in:
Bibliographic Details
Main Authors: Bursuc, Sergiu, Ehrenborg, Theodore, Lin, Shaowei, Astefanoaei, Lacramioara, Chiosa, Ionel Emilian, Kukovec, Jure, Singh, Alok, Butterley, Oliver, Bizid, Adem, Dougherty, Quinn, Zhao, Miranda, Tan, Max, Tegmark, Max
Format: Preprint
Published: 2025
Subjects:
Online Access:https://arxiv.org/abs/2509.22908
Tags: Add Tag
No Tags, Be the first to tag this record!
_version_ 1866912609552302080
author Bursuc, Sergiu
Ehrenborg, Theodore
Lin, Shaowei
Astefanoaei, Lacramioara
Chiosa, Ionel Emilian
Kukovec, Jure
Singh, Alok
Butterley, Oliver
Bizid, Adem
Dougherty, Quinn
Zhao, Miranda
Tan, Max
Tegmark, Max
author_facet Bursuc, Sergiu
Ehrenborg, Theodore
Lin, Shaowei
Astefanoaei, Lacramioara
Chiosa, Ionel Emilian
Kukovec, Jure
Singh, Alok
Butterley, Oliver
Bizid, Adem
Dougherty, Quinn
Zhao, Miranda
Tan, Max
Tegmark, Max
contents We present and test the largest benchmark for vericoding, LLM-generation of formally verified code from formal specifications - in contrast to vibe coding, which generates potentially buggy code from a natural language description. Our benchmark contains 12,504 formal specifications, with 3,029 in Dafny, 2,334 in Verus/Rust and 7,141 in Lean. Of these, 6,174 are new unseen problems. We find vericoding success rates of 27% in Lean, 44% in Verus/Rust and 82% in Dafny using off-the-shelf LLMs. Adding natural-language descriptions does not significantly improve performance. We also find that LLM progress has improved progress on pure Dafny verification from 68% to 96% over the past year. The benchmark and vericoding results are shared at https://github.com/Beneficial-AI-Foundation/vericoding-benchmark
format Preprint
id arxiv_https___arxiv_org_abs_2509_22908
institution arXiv
publishDate 2025
record_format arxiv
spellingShingle A benchmark for vericoding: formally verified program synthesis
Bursuc, Sergiu
Ehrenborg, Theodore
Lin, Shaowei
Astefanoaei, Lacramioara
Chiosa, Ionel Emilian
Kukovec, Jure
Singh, Alok
Butterley, Oliver
Bizid, Adem
Dougherty, Quinn
Zhao, Miranda
Tan, Max
Tegmark, Max
Software Engineering
Machine Learning
Programming Languages
We present and test the largest benchmark for vericoding, LLM-generation of formally verified code from formal specifications - in contrast to vibe coding, which generates potentially buggy code from a natural language description. Our benchmark contains 12,504 formal specifications, with 3,029 in Dafny, 2,334 in Verus/Rust and 7,141 in Lean. Of these, 6,174 are new unseen problems. We find vericoding success rates of 27% in Lean, 44% in Verus/Rust and 82% in Dafny using off-the-shelf LLMs. Adding natural-language descriptions does not significantly improve performance. We also find that LLM progress has improved progress on pure Dafny verification from 68% to 96% over the past year. The benchmark and vericoding results are shared at https://github.com/Beneficial-AI-Foundation/vericoding-benchmark
title A benchmark for vericoding: formally verified program synthesis
topic Software Engineering
Machine Learning
Programming Languages
url https://arxiv.org/abs/2509.22908