Staff View: :: Library Catalog

Saved in:

Bibliographic Details
Main Authors:	Rao, Balaji, Harrison, John, Kong, Soonho, Lee, Juneyoung, Lipizzi, Carlo
Format:	Preprint
Published:	2026
Subjects:	Programming Languages Artificial Intelligence Cryptography and Security Logic in Computer Science
Online Access:	https://arxiv.org/abs/2603.14628
Tags:	Add Tag No Tags, Be the first to tag this record!

_version_	1866914488014340096
author	Rao, Balaji Harrison, John Kong, Soonho Lee, Juneyoung Lipizzi, Carlo
author_facet	Rao, Balaji Harrison, John Kong, Soonho Lee, Juneyoung Lipizzi, Carlo
contents	Neurosymbolic approaches leveraging Large Language Models (LLMs) with formal methods have recently achieved strong results on mathematics-oriented theorem-proving benchmarks. However, success on competition-style mathematics does not by itself demonstrate the ability to construct proofs about real-world implementations. We address this gap with a benchmark derived from an industrial cryptographic library whose assembly routines are already verified in HOL Light. s2n-bignum is a library used at AWS for providing fast assembly routines for cryptography, and its correctness is established by formal verification. The task of formally verifying this library has been a significant achievement for the Automated Reasoning Group. It involved two tasks: (1) precisely specifying the correct behavior of a program as a mathematical proposition, and (2) proving that the proposition is correct. In the case of s2n-bignum, both tasks were carried out by human experts. In \textit{s2n-bignum-bench}, we provide the formal specification and ask the LLM to generate a proof script that is accepted by HOL Light within a fixed proof-check timeout. To our knowledge, \textit{s2n-bignum-bench} is the first public benchmark focused on machine-checkable proof synthesis for industrial low-level cryptographic assembly routines in HOL Light. This benchmark provides a challenging and practically relevant testbed for evaluating LLM-based theorem proving beyond competition mathematics. The code to set up and use the benchmark is available here: \href{https://github.com/kings-crown/s2n-bignum-bench}{s2n-bignum-bench}.
format	Preprint
id	arxiv_https___arxiv_org_abs_2603_14628
institution	arXiv
publishDate	2026
record_format	arxiv
spellingShingle	s2n-bignum-bench: A practical benchmark for evaluating low-level code reasoning of LLMs Rao, Balaji Harrison, John Kong, Soonho Lee, Juneyoung Lipizzi, Carlo Programming Languages Artificial Intelligence Cryptography and Security Logic in Computer Science Neurosymbolic approaches leveraging Large Language Models (LLMs) with formal methods have recently achieved strong results on mathematics-oriented theorem-proving benchmarks. However, success on competition-style mathematics does not by itself demonstrate the ability to construct proofs about real-world implementations. We address this gap with a benchmark derived from an industrial cryptographic library whose assembly routines are already verified in HOL Light. s2n-bignum is a library used at AWS for providing fast assembly routines for cryptography, and its correctness is established by formal verification. The task of formally verifying this library has been a significant achievement for the Automated Reasoning Group. It involved two tasks: (1) precisely specifying the correct behavior of a program as a mathematical proposition, and (2) proving that the proposition is correct. In the case of s2n-bignum, both tasks were carried out by human experts. In \textit{s2n-bignum-bench}, we provide the formal specification and ask the LLM to generate a proof script that is accepted by HOL Light within a fixed proof-check timeout. To our knowledge, \textit{s2n-bignum-bench} is the first public benchmark focused on machine-checkable proof synthesis for industrial low-level cryptographic assembly routines in HOL Light. This benchmark provides a challenging and practically relevant testbed for evaluating LLM-based theorem proving beyond competition mathematics. The code to set up and use the benchmark is available here: \href{https://github.com/kings-crown/s2n-bignum-bench}{s2n-bignum-bench}.
title	s2n-bignum-bench: A practical benchmark for evaluating low-level code reasoning of LLMs
topic	Programming Languages Artificial Intelligence Cryptography and Security Logic in Computer Science
url	https://arxiv.org/abs/2603.14628

Similar Items