MARC Görünümü: :: Library Catalog

Kaydedildi:

Detaylı Bibliyografya
Yazar:	Rahnama, Moses
Materyal Türü:	Recurso digital
Dil:
Baskı/Yayın Bilgisi:	Zenodo 2026
Konular:	term rewriting systems strong normalization dependency pairs LLM large language models AI safety benchmarks compositional measures impossibility theorems automated termination proving Lean 4 step duplication
Online Erişim:	https://doi.org/10.5281/zenodo.20034530
Etiketler:	Etiketle Etiket eklenmemiş, İlk siz ekleyin!

_version_	1866901813363474432
author	Rahnama, Moses
author_facet	Rahnama, Moses
contents	<p>The primitive recursor is an irreducible floor of bounded sequential computation:</p> <p>F(x,y,Z) → x        F(x,y,S(n)) → G(y,F(x,y,n)).</p> <p>Variants of the same counted-step skeleton appear as primitive recursors in Lean, Rocq/Coq, and Agda, as fold/iterator combinators in functional programming, and as trip-counted loop-carried state in compiler IRs and verified systems such as CompCert and EverCrypt. The base/step/counter pattern advances a system from one computational state to the next. This two-rule system has no binding, no hidden axioms, and an externally certified termination witness. Proving termination for this simple yet foundational rule is where <strong>frontier LLMs collapse at formal reasoning.</strong> </p> <p>We introduce the <em>Primitive Recursor Test</em> (PRT), a 27-model, 1,188-session benchmark over nine task families, graded on termination verdict, mathematical adequacy, and boundary-admissibility. On the KO7 kernel, which TTT2 certifies in 0.06 seconds, 94/324 (29.0%) claimed the system does not terminate, only 3/324 sessions (0.9%) retrieve the correct transformed-call witness, while 175/324 (54.0%) answer yes with a method that is provably false on this kernel; 276/324 (85.2%) of the answers are outright mathematically false. Under a fruit-renaming control, admissible retrieval falls to 0/162 while the termination-verdict rate stays fixed at 111/162. The failure is a retrieval-and-exclusion gap. When dependency pairs are placed in a five-method menu (as Method D), 104/108 sessions (96.3%) classify the correct method fully correctly, with Method D drawing the highest termination and admissibility rates across the five candidates.</p> <p>A boundary result sharpens this: under the direct whole-term observer, the step-duplicating recursor and a <em>circular-reference kernel</em> satisfy the same linear-growth predicate and are observationally indistinguishable under this class of measure. We call this a <em>representation-shift bottleneck</em>: the first adequate and boundary-admissible witness sits across a transformed recursive-call proof-language shift that the models rarely make. Model reasoning instability becomes even more apparent when, across different sessions of the same test, in at least one of the three open-ended kernel tests, 27/27 models propose two or more distinct method classes, 24/27 propose three or more, and 14/27 flip their termination verdict.</p> <p><strong>Institutional hazard.</strong> In proof-assistant workflows that underwrite flight-control software and formally verified hardware, the danger is <em>false formal legitimacy</em>: proof-shaped output that is mathematically false or boundary-external but looks valid at the surface.</p> <p>All session data, test files, prompts, Lean artifacts, and TTT2 artifacts are publicly released at https://github.com/MosesRahnama/PRT-Benchmark, with a Croissant 1.0 metadata file, datasheet, data statement, citation file, and JSON-LD landing page.</p>
format	Recurso digital
id	zenodo_https___doi_org_10_5281_zenodo_20034530
institution	Zenodo
language
publishDate	2026
publisher	Zenodo
record_format	zenodo
spellingShingle	Failing at the Floor: LLM Formal Reasoning Collapse on the Primitive Duplicating Recursor Rahnama, Moses term rewriting systems strong normalization dependency pairs LLM large language models AI safety benchmarks compositional measures impossibility theorems automated termination proving Lean 4 step duplication <p>The primitive recursor is an irreducible floor of bounded sequential computation:</p> <p>F(x,y,Z) → x        F(x,y,S(n)) → G(y,F(x,y,n)).</p> <p>Variants of the same counted-step skeleton appear as primitive recursors in Lean, Rocq/Coq, and Agda, as fold/iterator combinators in functional programming, and as trip-counted loop-carried state in compiler IRs and verified systems such as CompCert and EverCrypt. The base/step/counter pattern advances a system from one computational state to the next. This two-rule system has no binding, no hidden axioms, and an externally certified termination witness. Proving termination for this simple yet foundational rule is where <strong>frontier LLMs collapse at formal reasoning.</strong> </p> <p>We introduce the <em>Primitive Recursor Test</em> (PRT), a 27-model, 1,188-session benchmark over nine task families, graded on termination verdict, mathematical adequacy, and boundary-admissibility. On the KO7 kernel, which TTT2 certifies in 0.06 seconds, 94/324 (29.0%) claimed the system does not terminate, only 3/324 sessions (0.9%) retrieve the correct transformed-call witness, while 175/324 (54.0%) answer yes with a method that is provably false on this kernel; 276/324 (85.2%) of the answers are outright mathematically false. Under a fruit-renaming control, admissible retrieval falls to 0/162 while the termination-verdict rate stays fixed at 111/162. The failure is a retrieval-and-exclusion gap. When dependency pairs are placed in a five-method menu (as Method D), 104/108 sessions (96.3%) classify the correct method fully correctly, with Method D drawing the highest termination and admissibility rates across the five candidates.</p> <p>A boundary result sharpens this: under the direct whole-term observer, the step-duplicating recursor and a <em>circular-reference kernel</em> satisfy the same linear-growth predicate and are observationally indistinguishable under this class of measure. We call this a <em>representation-shift bottleneck</em>: the first adequate and boundary-admissible witness sits across a transformed recursive-call proof-language shift that the models rarely make. Model reasoning instability becomes even more apparent when, across different sessions of the same test, in at least one of the three open-ended kernel tests, 27/27 models propose two or more distinct method classes, 24/27 propose three or more, and 14/27 flip their termination verdict.</p> <p><strong>Institutional hazard.</strong> In proof-assistant workflows that underwrite flight-control software and formally verified hardware, the danger is <em>false formal legitimacy</em>: proof-shaped output that is mathematically false or boundary-external but looks valid at the surface.</p> <p>All session data, test files, prompts, Lean artifacts, and TTT2 artifacts are publicly released at https://github.com/MosesRahnama/PRT-Benchmark, with a Croissant 1.0 metadata file, datasheet, data statement, citation file, and JSON-LD landing page.</p>
title	Failing at the Floor: LLM Formal Reasoning Collapse on the Primitive Duplicating Recursor
topic	term rewriting systems strong normalization dependency pairs LLM large language models AI safety benchmarks compositional measures impossibility theorems automated termination proving Lean 4 step duplication
url	https://doi.org/10.5281/zenodo.20034530

Benzer Materyaller