Bibliographic Details
Main author: Cantrell, Cole
Format: Digital resource
Language:
Published: Zenodo 2026
Subjects:
Online access: https://doi.org/10.5281/zenodo.20124875
Contents:
  • Best-of-N reasoning over multiple chains-of-thought is a standard test-time compute strategy, but naive implementations run all candidate chains to completion before selecting a winner. This work introduces a self-calibrating divergence detector that identifies when chain trajectories have meaningfully separated, paired with a hybrid disposition policy that hard-kills clearly failed chains and scaffolds borderline ones at a verifier-cost discount. The detector uses a z-scored gap statistic against a per-problem null distribution established during the shared-prefix grace period, eliminating the need for dataset-specific detection thresholds.

    In static-label simulations on PRM800K, the mechanism reduces step-equivalent best-of-N compute by 22.8% at 99.6% winner accuracy, matching the candidate-pool oracle of 99.6% (100.0% of oracle). On Math-Shepherd, where auto-rollout labels produce a much lower candidate-pool ceiling, the same architecture with identical parameters reduces compute by 13.6% at 58.4% winner accuracy, which is 96.4% of the dataset’s 60.6% oracle. Across a 3×3 sensitivity sweep of the disposition parameters, winner accuracy is unchanged and compute saving varies by at most 3.3 percentage points, with zero killed-correct chains throughout. The mechanism is compute-cheap (negligible overhead relative to the inference it monitors), training-free, and reaches the candidate-pool oracle on PRM800K and 96.4% of the oracle on Math-Shepherd within sensitivity bounds.

    The contribution is the mechanism, not the saving number: three layers of structure beyond exponential smoothing (per-problem null-calibrated detection, hybrid level-based disposition, literature-grounded scaffold cost) that together produce a deployment-oriented adaptive branching scheme with no learned components.
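The detection and disposition idea in the abstract can be illustrated with a minimal Python sketch. This is not the author's implementation: the abstract does not define the gap statistic, the thresholds, or the scoring source, so everything named here is an assumption made for illustration. In particular, the gap is taken to be a chain's score deficit relative to the current best chain, the null sample is a list of gaps observed during the shared-prefix grace period, and `kill_z` and `scaffold_z` are hypothetical level thresholds.

```python
from statistics import mean, pstdev


def gap_z_score(null_gaps, current_gap, eps=1e-8):
    """Z-score a gap against the per-problem null sample.

    null_gaps: gaps collected during the grace period (assumed input).
    eps guards against a zero standard deviation.
    """
    mu = mean(null_gaps)
    sigma = pstdev(null_gaps)
    return (current_gap - mu) / (sigma + eps)


def dispose(chain_score, best_score, null_gaps, kill_z=3.0, scaffold_z=1.5):
    """Return 'kill', 'scaffold', or 'keep' for one chain.

    kill_z and scaffold_z are hypothetical thresholds, not values
    taken from the source.
    """
    z = gap_z_score(null_gaps, best_score - chain_score)
    if z >= kill_z:
        return "kill"      # clearly failed: stop spending compute
    if z >= scaffold_z:
        return "scaffold"  # borderline: continue at reduced cost
    return "keep"          # still within the null spread


# Hypothetical gaps seen while the chains still shared a prefix.
null_gaps = [0.02, 0.04, 0.03, 0.03]

# Hypothetical per-chain scores at the current step.
scores = {"a": 0.90, "b": 0.855, "c": 0.50}
best = max(scores.values())
decisions = {name: dispose(s, best, null_gaps) for name, s in scores.items()}
print(decisions)
```

Because the thresholds are expressed in standard deviations of each problem's own null sample, the same `kill_z` and `scaffold_z` can be reused across problems with different score scales, which is the property the abstract attributes to per-problem calibration.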