| Main Authors: | Tang, Lv; Zheng, Tianyi; Li, Bo; Li, Xingyu |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2602.01554 |
Similar Items
InfoTok: Adaptive Discrete Video Tokenizer via Information-Theoretic Compression
by: Ye, Haotian, et al.
Published: (2025)
Visual Text Compression as Measure Transport
by: Tang, Lv, et al.
Published: (2026)
UniTok: A Unified Tokenizer for Visual Generation and Understanding
by: Ma, Chuofan, et al.
Published: (2025)
Mitigating Visual Hallucinations via Semantic Curriculum Preference Optimization in MLLMs
by: Li, Yuanshuai, et al.
Published: (2025)
UniWeTok: An Unified Binary Tokenizer with Codebook Size $\mathit{2^{128}}$ for Unified Multimodal Large Language Model
by: Zhuang, Shaobin, et al.
Published: (2026)
IDPruner: Harmonizing Importance and Diversity in Visual Token Pruning for MLLMs
by: Tan, Yifan, et al.
Published: (2026)
VidTok: A Versatile and Open-Source Video Tokenizer
by: Tang, Anni, et al.
Published: (2024)
MacTok: Robust Continuous Tokenization for Image Generation
by: Zeng, Hengyu, et al.
Published: (2026)
HieraTok: Multi-Scale Visual Tokenizer Improves Image Reconstruction and Generation
by: Chen, Cong, et al.
Published: (2025)
VISA: Group-wise Visual Token Selection and Aggregation via Graph Summarization for Efficient MLLMs Inference
by: Jiang, Pengfei, et al.
Published: (2025)
TokLIP: Marry Visual Tokens to CLIP for Multimodal Comprehension and Generation
by: Lin, Haokun, et al.
Published: (2025)
CrystaL: Spontaneous Emergence of Visual Latents in MLLMs
by: Zhang, Yang, et al.
Published: (2026)
Efficient3D: A Unified Framework for Adaptive and Debiased Token Reduction in 3D MLLMs
by: Lin, Yuhui, et al.
Published: (2026)
SemHiTok: A Unified Image Tokenizer via Semantic-Guided Hierarchical Codebook for Multimodal Understanding and Generation
by: Chen, Zisheng, et al.
Published: (2025)
Beyond Attention or Similarity: Maximizing Conditional Diversity for Token Pruning in MLLMs
by: Zhang, Qizhe, et al.
Published: (2025)
Discrete Diffusion Models with MLLMs for Unified Medical Multimodal Generation
by: Mao, Jiawei, et al.
Published: (2025)
ToolTok: Tool Tokenization for Efficient and Generalizable GUI Agents
by: Wang, Xiaoce, et al.
Published: (2026)
PyraTok: Language-Aligned Pyramidal Tokenizer for Video Understanding and Generation
by: Susladkar, Onkar, et al.
Published: (2026)
On the Limits of Token Reduction for Efficient Unified Vision Language Training
by: Chen, Siyi, et al.
Published: (2026)
SweetTok: Semantic-Aware Spatial-Temporal Tokenizer for Compact Video Discretization
by: Tan, Zhentao, et al.
Published: (2024)
Do MLLMs Really See It: Reinforcing Visual Attention in Multimodal LLMs
by: Ou, Siqu, et al.
Published: (2026)
Towards Unified Surgical Scene Understanding: Bridging Reasoning and Grounding via MLLMs
by: Huang, Jincai, et al.
Published: (2026)
Lifting the Veil on Visual Information Flow in MLLMs: Unlocking Pathways to Faster Inference
by: Yin, Hao, et al.
Published: (2025)
AdaTok: Adaptive Token Compression with Object-Aware Representations for Efficient Multimodal LLMs
by: Zhang, Xinliang, et al.
Published: (2025)
V2Flow: Unifying Visual Tokenization and Large Language Model Vocabularies for Autoregressive Image Generation
by: Zhang, Guiwei, et al.
Published: (2025)
Moment-Video: Diagnosing Temporal Fidelity of Video MLLMs on Momentary Visual Events
by: Liu, Xiaolin, et al.
Published: (2026)
InfoDisent: Explainability of Image Classification Models by Information Disentanglement
by: Struski, Łukasz, et al.
Published: (2024)
MergeVQ: A Unified Framework for Visual Generation and Representation with Disentangled Token Merging and Quantization
by: Li, Siyuan, et al.
Published: (2025)
Geodesics with Unified Tangent-constrained Priors and Curvature Regularization
by: Di, Chong, et al.
Published: (2026)
What does RL improve for Visual Reasoning? A Frankenstein-Style Analysis
by: Li, Xirui, et al.
Published: (2026)
UniHOI: Unified Human-Object Interaction Understanding via Unified Token Space
by: Yang, Panqi, et al.
Published: (2025)
A More Word-like Image Tokenization for MLLMs
by: Lee, Hyun, et al.
Published: (2026)
Bridging the Semantic-Action Gap in Visual Token Pruning for Efficient VLA Inference
by: Liu, Ziyan, et al.
Published: (2025)
QMoP: Query Guided Mixture-of-Projector for Efficient Visual Token Compression
by: Li, Zhongyang, et al.
Published: (2026)
SSMamba: A Self-Supervised Hybrid State Space Model for Pathological Image Classification
by: Chai, Enhui, et al.
Published: (2026)
VITAL: Visual-Semantic Dual Supervision for Enhanced and Interpretable Latent Reasoning in Medical MLLMs
by: Li, Qiaoru, et al.
Published: (2026)
How Do Medical MLLMs Fail? A Study on Visual Grounding in Medical Images
by: Liu, Guimeng, et al.
Published: (2026)
UniToken: Harmonizing Multimodal Understanding and Generation through Unified Visual Encoding
by: Jiao, Yang, et al.
Published: (2025)
Landmark-Guided Cross-Speaker Lip Reading with Mutual Information Regularization
by: Wu, Linzhi, et al.
Published: (2024)
Learning Shared RGB-D Fields: Unified Self-supervised Pre-training for Label-efficient LiDAR-Camera 3D Perception
by: Xu, Xiaohao, et al.
Published: (2024)