Staff View: :: Library Catalog

Bibliographic Details
Main Authors:	Zhang, Zhixiong; Li, Yizhuo; Ding, Shuangrui; Zang, Yuhang; Ding, Shengyuan; Xing, Long; Wang, Yibin; Zhang, Qiaosheng; Wang, Jiaqi
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2605.20110

_version_	1866916028610510848
author	Zhang, Zhixiong; Li, Yizhuo; Ding, Shuangrui; Zang, Yuhang; Ding, Shengyuan; Xing, Long; Wang, Yibin; Zhang, Qiaosheng; Wang, Jiaqi
author_facet	Zhang, Zhixiong; Li, Yizhuo; Ding, Shuangrui; Zang, Yuhang; Ding, Shengyuan; Xing, Long; Wang, Yibin; Zhang, Qiaosheng; Wang, Jiaqi
contents	Referring segmentation grounds natural-language queries to pixel-level masks, but extending it to complex scenarios with multiple instances, cross-category groups, or open-ended target sets remains challenging. Previous Large Vision Language Model (LVLM)-based methods represent referred targets with one or more special tokens sequentially, treating multiple targets as separate outputs rather than a coherent set and offering little incentive to capture set-level properties such as completeness and mutual exclusivity. We reformulate open-ended referring segmentation as explicit set-level concept prediction and propose Set-Concept Segmentation (SetCon), which uses LVLM-generated natural-language concepts, instead of segmentation-specific tokens, as semantic conditions for joint mask-set decoding. A hierarchical semantic decomposition first predicts a shared set-level concept defining the target scope and then refines it into fine-grained concept groups aligned with target subsets. To support this, a two-stage annotation pipeline augments existing reasoning segmentation datasets with hierarchical semantic supervision (236k samples, 784k concept phrases). SetCon achieves state-of-the-art results on image benchmarks (+3.3 gIoU on gRefCOCO, +12.1 gIoU on MUSE), with margins that grow as the number of referred targets increases. The concept interface also transfers to video under a detect-and-track setting, yielding new state-of-the-art results on seven referring video benchmarks, including +10.9 J&F on MeViS and +12.4 J&F on Ref-SeCVOS.
format	Preprint
id	arxiv_https___arxiv_org_abs_2605_20110
institution	arXiv
publishDate	2026
record_format	arxiv
title	SetCon: Towards Open-Ended Referring Segmentation via Set-Level Concept Prediction
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2605.20110