Taula de continguts: :: Library Catalog

Guardat en:

Dades bibliogràfiques
Autor principal:	Rahnama, Moses
Format:	Recurso digital
Idioma:
Publicat:	Zenodo 2026
Matèries:	term rewriting systems strong normalization dependency pairs LLM large language models AI safety benchmarks compositional measures impossibility theorems automated termination proving Lean 4 step duplication
Accés en línia:	https://doi.org/10.5281/zenodo.20034530
Etiquetes:	Afegir etiqueta Sense etiquetes, Sigues el primer a etiquetar aquest registre!

Taula de continguts:

The primitive recursor is an irreducible floor of bounded sequential computation: F(x,y,Z) → x        F(x,y,S(n)) → G(y,F(x,y,n)). Variants of the same counted-step skeleton appear as primitive recursors in Lean, Rocq/Coq, and Agda, as fold/iterator combinators in functional programming, and as trip-counted loop-carried state in compiler IRs and verified systems such as CompCert and EverCrypt. The base/step/counter pattern advances a system from one computational state to the next. This two-rule system has no binding, no hidden axioms, and an externally certified termination witness. Proving termination for this simple yet foundational rule is where frontier LLMs collapse at formal reasoning.  We introduce the Primitive Recursor Test (PRT), a 27-model, 1,188-session benchmark over nine task families, graded on termination verdict, mathematical adequacy, and boundary-admissibility. On the KO7 kernel, which TTT2 certifies in 0.06 seconds, 94/324 (29.0%) claimed the system does not terminate, only 3/324 sessions (0.9%) retrieve the correct transformed-call witness, while 175/324 (54.0%) answer yes with a method that is provably false on this kernel; 276/324 (85.2%) of the answers are outright mathematically false. Under a fruit-renaming control, admissible retrieval falls to 0/162 while the termination-verdict rate stays fixed at 111/162. The failure is a retrieval-and-exclusion gap. When dependency pairs are placed in a five-method menu (as Method D), 104/108 sessions (96.3%) classify the correct method fully correctly, with Method D drawing the highest termination and admissibility rates across the five candidates. A boundary result sharpens this: under the direct whole-term observer, the step-duplicating recursor and a circular-reference kernel satisfy the same linear-growth predicate and are observationally indistinguishable under this class of measure. We call this a representation-shift bottleneck: the first adequate and boundary-admissible witness sits across a transformed recursive-call proof-language shift that the models rarely make. Model reasoning instability becomes even more apparent when, across different sessions of the same test, in at least one of the three open-ended kernel tests, 27/27 models propose two or more distinct method classes, 24/27 propose three or more, and 14/27 flip their termination verdict. Institutional hazard. In proof-assistant workflows that underwrite flight-control software and formally verified hardware, the danger is false formal legitimacy: proof-shaped output that is mathematically false or boundary-external but looks valid at the surface. All session data, test files, prompts, Lean artifacts, and TTT2 artifacts are publicly released at https://github.com/MosesRahnama/PRT-Benchmark, with a Croissant 1.0 metadata file, datasheet, data statement, citation file, and JSON-LD landing page.

Ítems similars