Staff View: Review Beats Planning: Dual-Model Interaction Patterns for Code Synthesis :: Library Catalog

Bibliographic Details
Main Author:	Miller, Jan
Format:	Preprint
Published:	2026
Subjects:	Software Engineering D.2.5; I.2.2
Online Access:	https://arxiv.org/abs/2603.03406

_version_	1866917311773933568
author	Miller, Jan
author_facet	Miller, Jan
contents	How should two language models interact to produce better code than either can alone? The conventional approach -- a reasoning model plans, a code specialist implements -- seems natural but fails: on HumanEval+, plan-then-code degrades performance by 2.4 percentage points versus the code specialist alone. We show that reversing the interaction changes everything. When the code specialist generates freely and the reasoning model reviews instead of plans, the same two models on the same hardware achieve 90.2% pass@1 -- exceeding GPT-4o (87.2%) and O1 Preview (89.0%) -- on ~$2/hr of commodity GPU. Cross-benchmark validation across 542 problems (HumanEval+ and MBPP+) reveals a moderating variable: review effectiveness scales with specification richness, yielding 4x more improvement on richly-specified problems (+9.8pp) than on lean ones (+2.3pp), while remaining net-positive in both cases. The practical implication is twofold: compose models by their cognitive strengths (reviewers review, coders code), and invest in specification quality to amplify the returns.
format	Preprint
id	arxiv_https___arxiv_org_abs_2603_03406
institution	arXiv
publishDate	2026
record_format	arxiv
title	Review Beats Planning: Dual-Model Interaction Patterns for Code Synthesis
topic	Software Engineering D.2.5; I.2.2
url	https://arxiv.org/abs/2603.03406

Similar Items