:: Library Catalog

Bibliographic Details
Main Authors:	Tang, Lv; Zheng, Tianyi; Li, Bo; Li, Xingyu
Format:	Preprint
Published:	2026
Subjects:	Machine Learning; Artificial Intelligence; Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2602.01554

Similar Items

InfoTok: Adaptive Discrete Video Tokenizer via Information-Theoretic Compression
by: Ye, Haotian, et al.
Published: (2025)

Visual Text Compression as Measure Transport
by: Tang, Lv, et al.
Published: (2026)

UniTok: A Unified Tokenizer for Visual Generation and Understanding
by: Ma, Chuofan, et al.
Published: (2025)

Mitigating Visual Hallucinations via Semantic Curriculum Preference Optimization in MLLMs
by: Li, Yuanshuai, et al.
Published: (2025)

UniWeTok: An Unified Binary Tokenizer with Codebook Size $\mathit{2^{128}}$ for Unified Multimodal Large Language Model
by: Zhuang, Shaobin, et al.
Published: (2026)

IDPruner: Harmonizing Importance and Diversity in Visual Token Pruning for MLLMs
by: Tan, Yifan, et al.
Published: (2026)

VidTok: A Versatile and Open-Source Video Tokenizer
by: Tang, Anni, et al.
Published: (2024)

MacTok: Robust Continuous Tokenization for Image Generation
by: Zeng, Hengyu, et al.
Published: (2026)

HieraTok: Multi-Scale Visual Tokenizer Improves Image Reconstruction and Generation
by: Chen, Cong, et al.
Published: (2025)

VISA: Group-wise Visual Token Selection and Aggregation via Graph Summarization for Efficient MLLMs Inference
by: Jiang, Pengfei, et al.
Published: (2025)

TokLIP: Marry Visual Tokens to CLIP for Multimodal Comprehension and Generation
by: Lin, Haokun, et al.
Published: (2025)

CrystaL: Spontaneous Emergence of Visual Latents in MLLMs
by: Zhang, Yang, et al.
Published: (2026)

Efficient3D: A Unified Framework for Adaptive and Debiased Token Reduction in 3D MLLMs
by: Lin, Yuhui, et al.
Published: (2026)

SemHiTok: A Unified Image Tokenizer via Semantic-Guided Hierarchical Codebook for Multimodal Understanding and Generation
by: Chen, Zisheng, et al.
Published: (2025)

Beyond Attention or Similarity: Maximizing Conditional Diversity for Token Pruning in MLLMs
by: Zhang, Qizhe, et al.
Published: (2025)

Discrete Diffusion Models with MLLMs for Unified Medical Multimodal Generation
by: Mao, Jiawei, et al.
Published: (2025)

ToolTok: Tool Tokenization for Efficient and Generalizable GUI Agents
by: Wang, Xiaoce, et al.
Published: (2026)

PyraTok: Language-Aligned Pyramidal Tokenizer for Video Understanding and Generation
by: Susladkar, Onkar, et al.
Published: (2026)

On the Limits of Token Reduction for Efficient Unified Vision Language Training
by: Chen, Siyi, et al.
Published: (2026)

SweetTok: Semantic-Aware Spatial-Temporal Tokenizer for Compact Video Discretization
by: Tan, Zhentao, et al.
Published: (2024)

Do MLLMs Really See It: Reinforcing Visual Attention in Multimodal LLMs
by: Ou, Siqu, et al.
Published: (2026)

Towards Unified Surgical Scene Understanding: Bridging Reasoning and Grounding via MLLMs
by: Huang, Jincai, et al.
Published: (2026)

Lifting the Veil on Visual Information Flow in MLLMs: Unlocking Pathways to Faster Inference
by: Yin, Hao, et al.
Published: (2025)

AdaTok: Adaptive Token Compression with Object-Aware Representations for Efficient Multimodal LLMs
by: Zhang, Xinliang, et al.
Published: (2025)

V2Flow: Unifying Visual Tokenization and Large Language Model Vocabularies for Autoregressive Image Generation
by: Zhang, Guiwei, et al.
Published: (2025)

Moment-Video: Diagnosing Temporal Fidelity of Video MLLMs on Momentary Visual Events
by: Liu, Xiaolin, et al.
Published: (2026)

InfoDisent: Explainability of Image Classification Models by Information Disentanglement
by: Struski, Łukasz, et al.
Published: (2024)

MergeVQ: A Unified Framework for Visual Generation and Representation with Disentangled Token Merging and Quantization
by: Li, Siyuan, et al.
Published: (2025)

Geodesics with Unified Tangent-constrained Priors and Curvature Regularization
by: Di, Chong, et al.
Published: (2026)

What does RL improve for Visual Reasoning? A Frankenstein-Style Analysis
by: Li, Xirui, et al.
Published: (2026)

UniHOI: Unified Human-Object Interaction Understanding via Unified Token Space
by: Yang, Panqi, et al.
Published: (2025)

A More Word-like Image Tokenization for MLLMs
by: Lee, Hyun, et al.
Published: (2026)

Bridging the Semantic-Action Gap in Visual Token Pruning for Efficient VLA Inference
by: Liu, Ziyan, et al.
Published: (2025)

QMoP: Query Guided Mixture-of-Projector for Efficient Visual Token Compression
by: Li, Zhongyang, et al.
Published: (2026)

SSMamba: A Self-Supervised Hybrid State Space Model for Pathological Image Classification
by: Chai, Enhui, et al.
Published: (2026)

VITAL: Visual-Semantic Dual Supervision for Enhanced and Interpretable Latent Reasoning in Medical MLLMs
by: Li, Qiaoru, et al.
Published: (2026)

How Do Medical MLLMs Fail? A Study on Visual Grounding in Medical Images
by: Liu, Guimeng, et al.
Published: (2026)

UniToken: Harmonizing Multimodal Understanding and Generation through Unified Visual Encoding
by: Jiao, Yang, et al.
Published: (2025)

Landmark-Guided Cross-Speaker Lip Reading with Mutual Information Regularization
by: Wu, Linzhi, et al.
Published: (2024)

Learning Shared RGB-D Fields: Unified Self-supervised Pre-training for Label-efficient LiDAR-Camera 3D Perception
by: Xu, Xiaohao, et al.
Published: (2024)