| Main Authors: | Jung, Sungwoo; Son, Seonil |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | Artificial Intelligence; Computation and Language |
| Online Access: | https://arxiv.org/abs/2604.07236 |
| _version_ | 1866911628890472448 |
|---|---|
| author | Jung, Sungwoo; Son, Seonil |
| author_facet | Jung, Sungwoo; Son, Seonil |
| contents | Agent harnesses -- the stateful programs that wrap a language model and decide what it sees at each step -- are now known to change end-to-end performance on a fixed model by as much as six times. That raises a question asked less often than it should be: how much of an agent's competence does the harness itself already carry, and how much genuinely still needs the LLM? We externalize a planning harness for noisy Collaborative Battleship into four progressively richer layers -- posterior belief tracking, declarative planning, symbolic reflection, and an LLM-backed revision gate -- under a common runtime, taking \emph{win rate} as the primary metric and \emph{F1} as secondary, and pre-specifying \emph{heavy lifting} as the single largest positive marginal to the primary metric. Across 54 games, declarative planning carries the heavy lifting ($+24.1$pp win rate over a belief-only harness, zero LLM calls); symbolic reflection is mechanistically real but calibration-sensitive, with signed board-level effects up to $\pm0.140$ F1 that cancel on aggregate; and LLM-backed revision activates on only $4.3\%$ of turns with a bounded, non-monotonic effect. The contribution is methodological: once harness layers are made externally measurable, the LLM's role can be quantified as residual rather than assumed central. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2604_07236 |
| institution | arXiv |
| publishDate | 2026 |
| record_format | arxiv |
| title | How Much Heavy Lifting Can an Agent Harness Do?: Measuring the LLM's Residual Role in a Planning Agent |
| topic | Artificial Intelligence; Computation and Language |
| url | https://arxiv.org/abs/2604.07236 |