Bibliographic Details
Main Authors: Ye, Ruijie, Zhang, Jiayi, Liu, Zhuoxin, Zhu, Zihao, Yang, Siyuan, Li, Li, Fu, Tianfu, Dernoncourt, Franck, Zhao, Yue, Zhu, Jiacheng, Rossi, Ryan, Chai, Wenhao, Tu, Zhengzhong
Format: Preprint
Published: 2026
Subjects: Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2602.09084
_version_ 1866912916953890816
author Ye, Ruijie
Zhang, Jiayi
Liu, Zhuoxin
Zhu, Zihao
Yang, Siyuan
Li, Li
Fu, Tianfu
Dernoncourt, Franck
Zhao, Yue
Zhu, Jiacheng
Rossi, Ryan
Chai, Wenhao
Tu, Zhengzhong
contents We study instruction-based image editing under professional workflows and identify three persistent challenges: (i) editors often over-edit, modifying content beyond the user's intent; (ii) existing models are largely single-turn, while multi-turn edits can alter object faithfulness; and (iii) evaluation at around 1K resolution is misaligned with real workflows that often operate on ultra high-definition images (e.g., 4K). We propose Agent Banana, a hierarchical agentic planner-executor framework for high-fidelity, object-aware, deliberative editing. Agent Banana introduces two key mechanisms: (1) Context Folding, which compresses long interaction histories into structured memory for stable long-horizon control; and (2) Image Layer Decomposition, which performs localized layer-based edits to preserve non-target regions while enabling native-resolution outputs. To support rigorous evaluation, we build HDD-Bench, a high-definition, dialogue-based benchmark featuring verifiable stepwise targets and native 4K images (11.8M pixels) for diagnosing long-horizon failures. On HDD-Bench, Agent Banana achieves the best multi-turn consistency and background fidelity (e.g., IC 0.871, SSIM-OM 0.84, LPIPS-OM 0.12) while remaining competitive on instruction following, and also attains strong performance on standard single-turn editing benchmarks. We hope this work advances reliable, professional-grade agentic image editing and its integration into real workflows.
format Preprint
id arxiv_https___arxiv_org_abs_2602_09084
institution arXiv
publishDate 2026
record_format arxiv
title Agent Banana: High-Fidelity Image Editing with Agentic Thinking and Tooling
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2602.09084