| Main Authors: | Tang, Lv; Zheng, Tianyi; Liu, Yang; Li, Bo; Li, Xingyu |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Online Access: | https://arxiv.org/abs/2605.06708 |
Similar Items
InfoTok: Information-Theoretic Regularization for Capacity-Constrained Shared Visual Tokenization in Unified MLLMs
by: Tang, Lv, et al.
Published: (2026)
FCoT-VL:Advancing Text-oriented Large Vision-Language Models with Efficient Visual Token Compression
by: Li, Jianjian, et al.
Published: (2025)
Interactive Visual Assessment for Text-to-Image Generation Models
by: Mi, Xiaoyue, et al.
Published: (2024)
What does RL improve for Visual Reasoning? A Frankenstein-Style Analysis
by: Li, Xirui, et al.
Published: (2026)
SSMamba: A Self-Supervised Hybrid State Space Model for Pathological Image Classification
by: Chai, Enhui, et al.
Published: (2026)
QMoP: Query Guided Mixture-of-Projector for Efficient Visual Token Compression
by: Li, Zhongyang, et al.
Published: (2026)
LensVLM: Selective Context Expansion for Compressed Visual Representation of Text
by: Xie, Roy, et al.
Published: (2026)
UAR-NVC: A Unified AutoRegressive Framework for Memory-Efficient Neural Video Compression
by: Wang, Jia, et al.
Published: (2025)
Dr. Seg: Revisiting GRPO Training for Visual Large Language Models through Perception-Oriented Design
by: Sun, Haoxiang, et al.
Published: (2026)
IPCV: Information-Preserving Compression for MLLM Visual Encoders
by: Chen, Yuan, et al.
Published: (2025)
Causal-Story: Local Causal Attention Utilizing Parameter-Efficient Tuning For Visual Story Synthesis
by: Song, Tianyi, et al.
Published: (2023)
OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning
by: Fu, Ling, et al.
Published: (2024)
Benchmarking and Analyzing Generative Data for Visual Recognition
by: Li, Bo, et al.
Published: (2023)
Mixture-of-Visual-Thoughts: Exploring Context-Adaptive Reasoning Mode Selection for General Visual Reasoning
by: Li, Zejun, et al.
Published: (2025)
Mitigating Visual Hallucinations via Semantic Curriculum Preference Optimization in MLLMs
by: Li, Yuanshuai, et al.
Published: (2025)
Do Vision-Language Models Measure Up? Benchmarking Visual Measurement Reading with MeasureBench
by: Lin, Fenfen, et al.
Published: (2025)
Fourier Compressor: Frequency-Domain Visual Token Compression for Vision-Language Models
by: Wang, Huanyu, et al.
Published: (2025)
VisualThink-VLA: Visual Intermediate Reasoning for Effective and Low-Latency Vision-Language-Action Policies
by: Gao, Mingjian, et al.
Published: (2026)
Bridging the Semantic-Action Gap in Visual Token Pruning for Efficient VLA Inference
by: Liu, Ziyan, et al.
Published: (2025)
Adversarial Training with OCR Modality Perturbation for Scene-Text Visual Question Answering
by: Shen, Zhixuan, et al.
Published: (2024)
Contribution-aware Token Compression for Efficient Video Understanding via Reinforcement Learning
by: Ma, Yinchao, et al.
Published: (2026)
Empowering Backbone Models for Visual Text Generation with Input Granularity Control and Glyph-Aware Training
by: Li, Wenbo, et al.
Published: (2024)
Powerful Teachers Matter: Text-Guided Multi-view Knowledge Distillation with Visual Prior Enhancement
by: Zhang, Xin, et al.
Published: (2026)
Visual Position Prompt for MLLM based Visual Grounding
by: Tang, Wei, et al.
Published: (2025)
Multi-Grained Compositional Visual Clue Learning for Image Intent Recognition
by: Tang, Yin, et al.
Published: (2025)
Hybrid-Level Instruction Injection for Video Token Compression in Multi-modal Large Language Models
by: Liu, Zhihang, et al.
Published: (2025)
Thinking with Gaze: Sequential Eye-Tracking as Visual Reasoning Supervision for Medical VLMs
by: Li, Yiwei, et al.
Published: (2026)
On the Faithfulness of Visual Thinking: Measurement and Enhancement
by: Liu, Zujing, et al.
Published: (2025)
T2I-VeRW: Part-level Fine-grained Perception for Text-to-Image Vehicle Retrieval
by: Wang, Xiao, et al.
Published: (2026)
Towards Visual Text Grounding of Multimodal Large Language Model
by: Li, Ming, et al.
Published: (2025)
Global Context Compression with Interleaved Vision-Text Transformation
by: Jiao, Dian, et al.
Published: (2026)
Audio-centric Video Understanding Benchmark without Text Shortcut
by: Yang, Yudong, et al.
Published: (2025)
Task-Aware KV Compression For Cost-Effective Long Video Understanding
by: Qin, Minghao, et al.
Published: (2025)
Deep Compression Autoencoder for Efficient High-Resolution Diffusion Models
by: Chen, Junyu, et al.
Published: (2024)
Voxel-based Point Cloud Geometry Compression with Space-to-Channel Context
by: Liu, Bojun, et al.
Published: (2025)
Semantic-Geometric Dual Compression: Training-Free Visual Token Reduction for Ultra-High-Resolution Remote Sensing Understanding
by: Li, Yueying, et al.
Published: (2026)
Revisiting Visual Understanding in Multimodal Reasoning through a Lens of Image Perturbation
by: Li, Yuting, et al.
Published: (2025)
Exploring Visual Prompting: Robustness Inheritance and Beyond
by: Li, Qi, et al.
Published: (2025)
Multi-modal and Multi-scale Spatial Environment Understanding for Immersive Visual Text-to-Speech
by: Liu, Rui, et al.
Published: (2024)
Text2Seg: Remote Sensing Image Semantic Segmentation via Text-Guided Visual Foundation Models
by: Zhang, Jielu, et al.
Published: (2023)