Staff View: :: Library Catalog

Saved in:

Bibliographic Details
Main Authors:	Bursuc, Sergiu, Ehrenborg, Theodore, Lin, Shaowei, Astefanoaei, Lacramioara, Chiosa, Ionel Emilian, Kukovec, Jure, Singh, Alok, Butterley, Oliver, Bizid, Adem, Dougherty, Quinn, Zhao, Miranda, Tan, Max, Tegmark, Max
Format:	Preprint
Published:	2025
Subjects:	Software Engineering Machine Learning Programming Languages
Online Access:	https://arxiv.org/abs/2509.22908
Tags:	Add Tag No Tags, Be the first to tag this record!

_version_	1866912609552302080
author	Bursuc, Sergiu Ehrenborg, Theodore Lin, Shaowei Astefanoaei, Lacramioara Chiosa, Ionel Emilian Kukovec, Jure Singh, Alok Butterley, Oliver Bizid, Adem Dougherty, Quinn Zhao, Miranda Tan, Max Tegmark, Max
author_facet	Bursuc, Sergiu Ehrenborg, Theodore Lin, Shaowei Astefanoaei, Lacramioara Chiosa, Ionel Emilian Kukovec, Jure Singh, Alok Butterley, Oliver Bizid, Adem Dougherty, Quinn Zhao, Miranda Tan, Max Tegmark, Max
contents	We present and test the largest benchmark for vericoding, LLM-generation of formally verified code from formal specifications - in contrast to vibe coding, which generates potentially buggy code from a natural language description. Our benchmark contains 12,504 formal specifications, with 3,029 in Dafny, 2,334 in Verus/Rust and 7,141 in Lean. Of these, 6,174 are new unseen problems. We find vericoding success rates of 27% in Lean, 44% in Verus/Rust and 82% in Dafny using off-the-shelf LLMs. Adding natural-language descriptions does not significantly improve performance. We also find that LLM progress has improved progress on pure Dafny verification from 68% to 96% over the past year. The benchmark and vericoding results are shared at https://github.com/Beneficial-AI-Foundation/vericoding-benchmark
format	Preprint
id	arxiv_https___arxiv_org_abs_2509_22908
institution	arXiv
publishDate	2025
record_format	arxiv
spellingShingle	A benchmark for vericoding: formally verified program synthesis Bursuc, Sergiu Ehrenborg, Theodore Lin, Shaowei Astefanoaei, Lacramioara Chiosa, Ionel Emilian Kukovec, Jure Singh, Alok Butterley, Oliver Bizid, Adem Dougherty, Quinn Zhao, Miranda Tan, Max Tegmark, Max Software Engineering Machine Learning Programming Languages We present and test the largest benchmark for vericoding, LLM-generation of formally verified code from formal specifications - in contrast to vibe coding, which generates potentially buggy code from a natural language description. Our benchmark contains 12,504 formal specifications, with 3,029 in Dafny, 2,334 in Verus/Rust and 7,141 in Lean. Of these, 6,174 are new unseen problems. We find vericoding success rates of 27% in Lean, 44% in Verus/Rust and 82% in Dafny using off-the-shelf LLMs. Adding natural-language descriptions does not significantly improve performance. We also find that LLM progress has improved progress on pure Dafny verification from 68% to 96% over the past year. The benchmark and vericoding results are shared at https://github.com/Beneficial-AI-Foundation/vericoding-benchmark
title	A benchmark for vericoding: formally verified program synthesis
topic	Software Engineering Machine Learning Programming Languages
url	https://arxiv.org/abs/2509.22908

Similar Items