Kaydedildi:
Detaylı Bibliyografya
Yazar: Rahnama, Moses
Materyal Türü: Recurso digital
Dil:
Baskı/Yayın Bilgisi: Zenodo 2026
Konular:
Online Erişim:https://doi.org/10.5281/zenodo.20034530
Etiketler: Etiketle
Etiket eklenmemiş, İlk siz ekleyin!
_version_ 1866901813363474432
author Rahnama, Moses
author_facet Rahnama, Moses
contents <p>The primitive recursor is an irreducible floor of bounded sequential computation:</p> <p>F(x,y,Z) → x        F(x,y,S(n)) → G(y,F(x,y,n)).</p> <p>Variants of the same counted-step skeleton appear as primitive recursors in Lean, Rocq/Coq, and Agda, as fold/iterator combinators in functional programming, and as trip-counted loop-carried state in compiler IRs and verified systems such as CompCert and EverCrypt. The base/step/counter pattern advances a system from one computational state to the next. This two-rule system has no binding, no hidden axioms, and an externally certified termination witness. Proving termination for this simple yet foundational rule is where <strong>frontier LLMs collapse at formal reasoning.</strong> </p> <p>We introduce the <em>Primitive Recursor Test</em> (PRT), a 27-model, 1,188-session benchmark over nine task families, graded on termination verdict, mathematical adequacy, and boundary-admissibility. On the KO7 kernel, which TTT2 certifies in 0.06 seconds, 94/324 (29.0%) claimed the system does not terminate, only 3/324 sessions (0.9%) retrieve the correct transformed-call witness, while 175/324 (54.0%) answer yes with a method that is provably false on this kernel; 276/324 (85.2%) of the answers are outright mathematically false. Under a fruit-renaming control, admissible retrieval falls to 0/162 while the termination-verdict rate stays fixed at 111/162. The failure is a retrieval-and-exclusion gap. When dependency pairs are placed in a five-method menu (as Method D), 104/108 sessions (96.3%) classify the correct method fully correctly, with Method D drawing the highest termination and admissibility rates across the five candidates.</p> <p>A boundary result sharpens this: under the direct whole-term observer, the step-duplicating recursor and a <em>circular-reference kernel</em> satisfy the same linear-growth predicate and are observationally indistinguishable under this class of measure. We call this a <em>representation-shift bottleneck</em>: the first adequate and boundary-admissible witness sits across a transformed recursive-call proof-language shift that the models rarely make. Model reasoning instability becomes even more apparent when, across different sessions of the same test, in at least one of the three open-ended kernel tests, 27/27 models propose two or more distinct method classes, 24/27 propose three or more, and 14/27 flip their termination verdict.</p> <p><strong>Institutional hazard.</strong> In proof-assistant workflows that underwrite flight-control software and formally verified hardware, the danger is <em>false formal legitimacy</em>: proof-shaped output that is mathematically false or boundary-external but looks valid at the surface.</p> <p>All session data, test files, prompts, Lean artifacts, and TTT2 artifacts are publicly released at https://github.com/MosesRahnama/PRT-Benchmark, with a Croissant 1.0 metadata file, datasheet, data statement, citation file, and JSON-LD landing page.</p>
format Recurso digital
id zenodo_https___doi_org_10_5281_zenodo_20034530
institution Zenodo
language
publishDate 2026
publisher Zenodo
record_format zenodo
spellingShingle Failing at the Floor: LLM Formal Reasoning Collapse on the Primitive Duplicating Recursor
Rahnama, Moses
term rewriting systems
strong normalization
dependency pairs
LLM
large language models
AI safety benchmarks
compositional measures
impossibility theorems
automated termination proving
Lean 4
step duplication
<p>The primitive recursor is an irreducible floor of bounded sequential computation:</p> <p>F(x,y,Z) → x        F(x,y,S(n)) → G(y,F(x,y,n)).</p> <p>Variants of the same counted-step skeleton appear as primitive recursors in Lean, Rocq/Coq, and Agda, as fold/iterator combinators in functional programming, and as trip-counted loop-carried state in compiler IRs and verified systems such as CompCert and EverCrypt. The base/step/counter pattern advances a system from one computational state to the next. This two-rule system has no binding, no hidden axioms, and an externally certified termination witness. Proving termination for this simple yet foundational rule is where <strong>frontier LLMs collapse at formal reasoning.</strong> </p> <p>We introduce the <em>Primitive Recursor Test</em> (PRT), a 27-model, 1,188-session benchmark over nine task families, graded on termination verdict, mathematical adequacy, and boundary-admissibility. On the KO7 kernel, which TTT2 certifies in 0.06 seconds, 94/324 (29.0%) claimed the system does not terminate, only 3/324 sessions (0.9%) retrieve the correct transformed-call witness, while 175/324 (54.0%) answer yes with a method that is provably false on this kernel; 276/324 (85.2%) of the answers are outright mathematically false. Under a fruit-renaming control, admissible retrieval falls to 0/162 while the termination-verdict rate stays fixed at 111/162. The failure is a retrieval-and-exclusion gap. When dependency pairs are placed in a five-method menu (as Method D), 104/108 sessions (96.3%) classify the correct method fully correctly, with Method D drawing the highest termination and admissibility rates across the five candidates.</p> <p>A boundary result sharpens this: under the direct whole-term observer, the step-duplicating recursor and a <em>circular-reference kernel</em> satisfy the same linear-growth predicate and are observationally indistinguishable under this class of measure. We call this a <em>representation-shift bottleneck</em>: the first adequate and boundary-admissible witness sits across a transformed recursive-call proof-language shift that the models rarely make. Model reasoning instability becomes even more apparent when, across different sessions of the same test, in at least one of the three open-ended kernel tests, 27/27 models propose two or more distinct method classes, 24/27 propose three or more, and 14/27 flip their termination verdict.</p> <p><strong>Institutional hazard.</strong> In proof-assistant workflows that underwrite flight-control software and formally verified hardware, the danger is <em>false formal legitimacy</em>: proof-shaped output that is mathematically false or boundary-external but looks valid at the surface.</p> <p>All session data, test files, prompts, Lean artifacts, and TTT2 artifacts are publicly released at https://github.com/MosesRahnama/PRT-Benchmark, with a Croissant 1.0 metadata file, datasheet, data statement, citation file, and JSON-LD landing page.</p>
title Failing at the Floor: LLM Formal Reasoning Collapse on the Primitive Duplicating Recursor
topic term rewriting systems
strong normalization
dependency pairs
LLM
large language models
AI safety benchmarks
compositional measures
impossibility theorems
automated termination proving
Lean 4
step duplication
url https://doi.org/10.5281/zenodo.20034530