Staff View: :: Library Catalog

Saved in:

Bibliographic Details
Main Authors:	Zhou, Yikang, Zhang, Tao, Gong, Dengxian, Wu, Yuanzheng, Tian, Ye, Wang, Haochen, Yuan, Haobo, Wang, Jiacong, Qi, Lu, Fei, Hao, Wang, Anran, Wang, Zhuochen, Wang, Yujing, Chen, Cheng, Ji, Shunping, Li, Xiangtai
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2601.16093
Tags:	Add Tag No Tags, Be the first to tag this record!

_version_	1866911392437633024
author	Zhou, Yikang Zhang, Tao Gong, Dengxian Wu, Yuanzheng Tian, Ye Wang, Haochen Yuan, Haobo Wang, Jiacong Qi, Lu Fei, Hao Wang, Anran Wang, Zhuochen Wang, Yujing Chen, Cheng Ji, Shunping Li, Xiangtai
author_facet	Zhou, Yikang Zhang, Tao Gong, Dengxian Wu, Yuanzheng Tian, Ye Wang, Haochen Yuan, Haobo Wang, Jiacong Qi, Lu Fei, Hao Wang, Anran Wang, Zhuochen Wang, Yujing Chen, Cheng Ji, Shunping Li, Xiangtai
contents	Pixel-wise capabilities are essential for building interactive intelligent systems. However, pixel-wise multi-modal LLMs (MLLMs) remain difficult to scale due to complex region-level encoders, specialized segmentation decoders, and incompatible training objectives. To address these challenges, we present SAMTok, a discrete mask tokenizer that converts any region mask into two special tokens and reconstructs the mask using these tokens with high fidelity. By treating masks as new language tokens, SAMTok enables base MLLMs (such as the QwenVL series) to learn pixel-wise capabilities through standard next-token prediction and simple reinforcement learning, without architectural modifications and specialized loss design. SAMTok builds on SAM2 and is trained on 209M diverse masks using a mask encoder and residual vector quantizer to produce discrete, compact, and information-rich tokens. With 5M SAMTok-formatted mask understanding and generation data samples, QwenVL-SAMTok attains state-of-the-art or comparable results on region captioning, region VQA, grounded conversation, referring segmentation, scene graph parsing, and multi-round interactive segmentation. We further introduce a textual answer-matching reward that enables efficient reinforcement learning for mask generation, delivering substantial improvements on GRES and GCG benchmarks. Our results demonstrate a scalable and straightforward paradigm for equipping MLLMs with strong pixel-wise capabilities. Our code and models are available.
format	Preprint
id	arxiv_https___arxiv_org_abs_2601_16093
institution	arXiv
publishDate	2026
record_format	arxiv
spellingShingle	SAMTok: Representing Any Mask with Two Words Zhou, Yikang Zhang, Tao Gong, Dengxian Wu, Yuanzheng Tian, Ye Wang, Haochen Yuan, Haobo Wang, Jiacong Qi, Lu Fei, Hao Wang, Anran Wang, Zhuochen Wang, Yujing Chen, Cheng Ji, Shunping Li, Xiangtai Computer Vision and Pattern Recognition Pixel-wise capabilities are essential for building interactive intelligent systems. However, pixel-wise multi-modal LLMs (MLLMs) remain difficult to scale due to complex region-level encoders, specialized segmentation decoders, and incompatible training objectives. To address these challenges, we present SAMTok, a discrete mask tokenizer that converts any region mask into two special tokens and reconstructs the mask using these tokens with high fidelity. By treating masks as new language tokens, SAMTok enables base MLLMs (such as the QwenVL series) to learn pixel-wise capabilities through standard next-token prediction and simple reinforcement learning, without architectural modifications and specialized loss design. SAMTok builds on SAM2 and is trained on 209M diverse masks using a mask encoder and residual vector quantizer to produce discrete, compact, and information-rich tokens. With 5M SAMTok-formatted mask understanding and generation data samples, QwenVL-SAMTok attains state-of-the-art or comparable results on region captioning, region VQA, grounded conversation, referring segmentation, scene graph parsing, and multi-round interactive segmentation. We further introduce a textual answer-matching reward that enables efficient reinforcement learning for mask generation, delivering substantial improvements on GRES and GCG benchmarks. Our results demonstrate a scalable and straightforward paradigm for equipping MLLMs with strong pixel-wise capabilities. Our code and models are available.
title	SAMTok: Representing Any Mask with Two Words
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2601.16093

Similar Items