:: Library Catalog

Bibliographic Details
Main Authors:	Huang, Zhaohong; Liu, Wenjing; Zhang, Yuxin; Chao, Fei; Ji, Rongrong
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2604.05601

Similar Items

Prototype-Based Test-Time Adaptation of Vision-Language Models
by: Huang, Zhaohong, et al.
Published: (2026)

GS-Bias: Global-Spatial Bias Learner for Single-Image Test-Time Adaptation of Vision-Language Models
by: Huang, Zhaohong, et al.
Published: (2025)

VISA: Group-wise Visual Token Selection and Aggregation via Graph Summarization for Efficient MLLMs Inference
by: Jiang, Pengfei, et al.
Published: (2025)

DS$^2$Net: Detail-Semantic Deep Supervision Network for Medical Image Segmentation
by: Huang, Zhaohong, et al.
Published: (2025)

MLLM-Selector: Necessity and Diversity-driven High-Value Data Selection for Enhanced Visual Instruction Tuning
by: Ma, Yiwei, et al.
Published: (2025)

TextRefiner: Internal Visual Feature as Efficient Refiner for Vision-Language Models Prompt Tuning
by: Xie, Jingjing, et al.
Published: (2024)

IDPruner: Harmonizing Importance and Diversity in Visual Token Pruning for MLLMs
by: Tan, Yifan, et al.
Published: (2026)

Boosting Multimodal Large Language Models with Visual Tokens Withdrawal for Rapid Inference
by: Lin, Zhihang, et al.
Published: (2024)

Fooling the LVLM Judges: Visual Biases in LVLM-Based Evaluation
by: Hwang, Yerin, et al.
Published: (2025)

CoIDO: Efficient Data Selection for Visual Instruction Tuning via Coupled Importance-Diversity Optimization
by: Yan, Yichen, et al.
Published: (2025)

Learning Image Demoireing from Unpaired Real Data
by: Zhong, Yunshan, et al.
Published: (2024)

Event-Anchored Frame Selection for Effective Long-Video Understanding
by: Chen, Wang, et al.
Published: (2026)

KTV: Keyframes and Key Tokens Selection for Efficient Training-Free Video LLMs
by: Song, Baiyang, et al.
Published: (2026)

Parallel Vision Token Scheduling for Fast and Accurate Multimodal LMMs Inference
by: Zhan, Wengyi, et al.
Published: (2025)

AdaFlow: Efficient Long Video Editing via Adaptive Attention Slimming And Keyframe Selection
by: Zhang, Shuheng, et al.
Published: (2025)

FlexSelect: Flexible Token Selection for Efficient Long Video Understanding
by: Zhang, Yunzhu, et al.
Published: (2025)

Vision Remember: Recovering Visual Information in Efficient LVLM with Vision Feature Resampling
by: Feng, Ze, et al.
Published: (2025)

TinyChart: Efficient Chart Understanding with Visual Token Merging and Program-of-Thoughts Learning
by: Zhang, Liang, et al.
Published: (2024)

POPEN: Preference-Based Optimization and Ensemble for LVLM-Based Reasoning Segmentation
by: Zhu, Lanyun, et al.
Published: (2025)

ASAP: Attention-Shift-Aware Pruning for Efficient LVLM Inference
by: Pathak, Surendra, et al.
Published: (2026)

Magic Tokens: Select Diverse Tokens for Multi-modal Object Re-Identification
by: Zhang, Pingping, et al.
Published: (2024)

QuadMamba: Learning Quadtree-based Selective Scan for Visual State Space Model
by: Xie, Fei, et al.
Published: (2024)

MBQuant: A Novel Multi-Branch Topology Method for Arbitrary Bit-width Network Quantization
by: Zhong, Yunshan, et al.
Published: (2023)

Importance-Based Token Merging for Efficient Image and Video Generation
by: Wu, Haoyu, et al.
Published: (2024)

Towards Accurate Post-Training Quantization of Vision Transformers via Error Reduction
by: Zhong, Yunshan, et al.
Published: (2024)

Semantic Alignment and Reinforcement for Data-Free Quantization of Vision Transformers
by: Zhong, Yunshan, et al.
Published: (2024)

SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference
by: Zhang, Yuan, et al.
Published: (2024)

HAWK: Head Importance-Aware Visual Token Pruning in Multimodal Models
by: Zhu, Qihui, et al.
Published: (2026)

D2Pruner: Debiased Importance and Structural Diversity for MLLM Token Pruning
by: Zhang, Evelyn, et al.
Published: (2025)

Selective Visual Prompting in Vision Mamba
by: Yao, Yifeng, et al.
Published: (2024)

VASparse: Towards Efficient Visual Hallucination Mitigation via Visual-Aware Token Sparsification
by: Zhuang, Xianwei, et al.
Published: (2025)

Fwd2Bot: LVLM Visual Token Compression with Double Forward Bottleneck
by: Bulat, Adrian, et al.
Published: (2025)

Evidence Packing for Cross-Domain Image Deepfake Detection with LVLMs
by: Liu, Yuxin, et al.
Published: (2026)

Visual Implicit Autoregressive Modeling
by: Jiang, Pengfei, et al.
Published: (2026)

Revisit What You See: Revealing Visual Semantics in Vision Tokens to Guide LVLM Decoding
by: Cho, Beomsik, et al.
Published: (2025)

Select2Col: Leveraging Spatial-Temporal Importance of Semantic Information for Efficient Collaborative Perception
by: Liu, Yuntao, et al.
Published: (2023)

TR-PTS: Task-Relevant Parameter and Token Selection for Efficient Tuning
by: Luo, Siqi, et al.
Published: (2025)

Compressor-VLA: Instruction-Guided Visual Token Compression for Efficient Robotic Manipulation
by: Gao, Juntao, et al.
Published: (2025)

FlashVLM: Text-Guided Visual Token Selection for Large Multimodal Models
by: Cai, Kaitong, et al.
Published: (2025)

ObjectAdd: Adding Objects into Image via a Training-Free Diffusion Modification Fashion
by: Zhang, Ziyue, et al.
Published: (2024)