Saved in:
| Main Author: | |
|---|---|
| Format: | Preprint |
| Published: |
2026
|
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2602.19509 |
| Tags: |
Add Tag
No Tags, Be the first to tag this record!
|
| _version_ | 1866913023713607680 |
|---|---|
| author | Khaled, Arindam |
| author_facet | Khaled, Arindam |
| contents | We observe that LLM cascading and routing implicitly solves an anytime computation problem -- a class of algorithms, well-studied in classical AI, that improve solutions as additional computation is allocated. We formalize this connection and propose Pyramid MoA, a hierarchical Mixture-of-Agents architecture governed by a decision-theoretic router that escalates queries only when necessary. We establish a Probabilistic Anytime Property with provable monotonicity guarantees and derive a generalized escalation rule from Value of Computation theory that accounts for imperfect oracles, extending the Hansen-Zilberstein monitoring framework to stochastic LLM inference. On MBPP, the router intercepts 81.6% of bugs; on GSM8K/MMLU, the system nearly matches the 68.1% Oracle baseline while achieving up to 42.9% compute savings. The router transfers zero-shot to unseen benchmarks: matching Oracle accuracy on HumanEval (81.1%) and MATH 500 (58.0%) with significant cost reductions. We further discover a context-conditioned anchoring effect across four benchmarks: passing correct SLM reasoning improves Oracle accuracy by up to +19.2pp, while incorrect reasoning degrades it by up to -18.0pp, revealing a fundamental tension in hierarchical MoA architectures. |
| format | Preprint |
| id |
arxiv_https___arxiv_org_abs_2602_19509 |
| institution | arXiv |
| publishDate | 2026 |
| record_format | arxiv |
| spellingShingle | Pyramid MoA: A Probabilistic Framework for Cost-Optimized Anytime Inference Khaled, Arindam Computation and Language Artificial Intelligence Machine Learning We observe that LLM cascading and routing implicitly solves an anytime computation problem -- a class of algorithms, well-studied in classical AI, that improve solutions as additional computation is allocated. We formalize this connection and propose Pyramid MoA, a hierarchical Mixture-of-Agents architecture governed by a decision-theoretic router that escalates queries only when necessary. We establish a Probabilistic Anytime Property with provable monotonicity guarantees and derive a generalized escalation rule from Value of Computation theory that accounts for imperfect oracles, extending the Hansen-Zilberstein monitoring framework to stochastic LLM inference. On MBPP, the router intercepts 81.6% of bugs; on GSM8K/MMLU, the system nearly matches the 68.1% Oracle baseline while achieving up to 42.9% compute savings. The router transfers zero-shot to unseen benchmarks: matching Oracle accuracy on HumanEval (81.1%) and MATH 500 (58.0%) with significant cost reductions. We further discover a context-conditioned anchoring effect across four benchmarks: passing correct SLM reasoning improves Oracle accuracy by up to +19.2pp, while incorrect reasoning degrades it by up to -18.0pp, revealing a fundamental tension in hierarchical MoA architectures. |
| title | Pyramid MoA: A Probabilistic Framework for Cost-Optimized Anytime Inference |
| topic | Computation and Language Artificial Intelligence Machine Learning |
| url | https://arxiv.org/abs/2602.19509 |