Guardat en:
| Autor principal: | |
|---|---|
| Format: | Recurso digital |
| Idioma: | |
| Publicat: |
Zenodo
2026
|
| Matèries: | |
| Accés en línia: | https://doi.org/10.5281/zenodo.20034530 |
| Etiquetes: |
Afegir etiqueta
Sense etiquetes, Sigues el primer a etiquetar aquest registre!
|
Taula de continguts:
- <p>The primitive recursor is an irreducible floor of bounded sequential computation:</p> <p>F(x,y,Z) → x F(x,y,S(n)) → G(y,F(x,y,n)).</p> <p>Variants of the same counted-step skeleton appear as primitive recursors in Lean, Rocq/Coq, and Agda, as fold/iterator combinators in functional programming, and as trip-counted loop-carried state in compiler IRs and verified systems such as CompCert and EverCrypt. The base/step/counter pattern advances a system from one computational state to the next. This two-rule system has no binding, no hidden axioms, and an externally certified termination witness. Proving termination for this simple yet foundational rule is where <strong>frontier LLMs collapse at formal reasoning.</strong> </p> <p>We introduce the <em>Primitive Recursor Test</em> (PRT), a 27-model, 1,188-session benchmark over nine task families, graded on termination verdict, mathematical adequacy, and boundary-admissibility. On the KO7 kernel, which TTT2 certifies in 0.06 seconds, 94/324 (29.0%) claimed the system does not terminate, only 3/324 sessions (0.9%) retrieve the correct transformed-call witness, while 175/324 (54.0%) answer yes with a method that is provably false on this kernel; 276/324 (85.2%) of the answers are outright mathematically false. Under a fruit-renaming control, admissible retrieval falls to 0/162 while the termination-verdict rate stays fixed at 111/162. The failure is a retrieval-and-exclusion gap. When dependency pairs are placed in a five-method menu (as Method D), 104/108 sessions (96.3%) classify the correct method fully correctly, with Method D drawing the highest termination and admissibility rates across the five candidates.</p> <p>A boundary result sharpens this: under the direct whole-term observer, the step-duplicating recursor and a <em>circular-reference kernel</em> satisfy the same linear-growth predicate and are observationally indistinguishable under this class of measure. We call this a <em>representation-shift bottleneck</em>: the first adequate and boundary-admissible witness sits across a transformed recursive-call proof-language shift that the models rarely make. Model reasoning instability becomes even more apparent when, across different sessions of the same test, in at least one of the three open-ended kernel tests, 27/27 models propose two or more distinct method classes, 24/27 propose three or more, and 14/27 flip their termination verdict.</p> <p><strong>Institutional hazard.</strong> In proof-assistant workflows that underwrite flight-control software and formally verified hardware, the danger is <em>false formal legitimacy</em>: proof-shaped output that is mathematically false or boundary-external but looks valid at the surface.</p> <p>All session data, test files, prompts, Lean artifacts, and TTT2 artifacts are publicly released at https://github.com/MosesRahnama/PRT-Benchmark, with a Croissant 1.0 metadata file, datasheet, data statement, citation file, and JSON-LD landing page.</p>