| Main Authors: | Li, Shuai, Xu, Jian, Li, Xiao-Hui, Deng, Chao, Huang, Lin-Lin |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2504.00654 |
Similar Items
Efficient Whole Slide Pathology VQA via Token Compression
by: Lyu, Weimin, et al.
Published: (2025)
[CLS] Token Tells Everything Needed for Training-free Efficient MLLMs
by: Wang, Ao, et al.
Published: (2024)
QTG-VQA: Question-Type-Guided Architectural for VideoQA Systems
by: He, Zhixian, et al.
Published: (2024)
PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering
by: Zhang, Xiaoman, et al.
Published: (2023)
WSI-VQA: Interpreting Whole Slide Images by Generative Visual Question Answering
by: Chen, Pingyi, et al.
Published: (2024)
QG-CoC: Question-Guided Chain-of-Captions for Large Multimodal Models
by: Kao, Kuei-Chun, et al.
Published: (2025)
SCRA-VQA: Summarized Caption-Rerank for Augmented Large Language Models in Visual Question Answering
by: Zhang, Yan, et al.
Published: (2025)
Dynamic-VLM: Simple Dynamic Visual Token Compression for VideoLLM
by: Wang, Han, et al.
Published: (2024)
VQA$^2$: Visual Question Answering for Video Quality Assessment
by: Jia, Ziheng, et al.
Published: (2024)
Visual Robustness Benchmark for Visual Question Answering (VQA)
by: Ishmam, Md Farhan, et al.
Published: (2024)
IDPruner: Harmonizing Importance and Diversity in Visual Token Pruning for MLLMs
by: Tan, Yifan, et al.
Published: (2026)
Recoverable Compression: A Multimodal Vision Token Recovery Mechanism Guided by Text Information
by: Chen, Yi, et al.
Published: (2024)
QMoP: Query Guided Mixture-of-Projector for Efficient Visual Token Compression
by: Li, Zhongyang, et al.
Published: (2026)
Magic-MM-Embedding: Towards Visual-Token-Efficient Universal Multimodal Embedding with MLLMs
by: Li, Qi, et al.
Published: (2026)
VISA: Group-wise Visual Token Selection and Aggregation via Graph Summarization for Efficient MLLMs Inference
by: Jiang, Pengfei, et al.
Published: (2025)
VisualQuest: A Benchmark for Abstract Visual Reasoning in MLLMs
by: Xiao, Kelaiti, et al.
Published: (2025)
Enhancing Visual Question Answering with Multimodal LLMs via Chain-of-Question Guided Retrieval-Augmented Generation
by: Xu, Quanxing, et al.
Published: (2026)
RedundancyLens: Revealing and Exploiting Visual Token Processing Redundancy for Efficient Decoder-Only MLLMs
by: Li, Hongliang, et al.
Published: (2025)
MacVQA: Adaptive Memory Allocation and Global Noise Filtering for Continual Visual Question Answering
by: Li, Zhifei, et al.
Published: (2026)
OpenView: Empowering MLLMs with Out-of-view VQA
by: Chen, Qixiang, et al.
Published: (2025)
GridPrune: From "Where to Look" to "What to Select" in Visual Token Pruning for MLLMs
by: Duan, Yuxiang, et al.
Published: (2025)
SEA: Supervised Embedding Alignment for Token-Level Visual-Textual Integration in MLLMs
by: Yin, Yuanyang, et al.
Published: (2024)
Adversarial Attacks on VQA-NLE: Exposing and Alleviating Inconsistencies in Visual Question Answering Explanations
by: Yeh, Yahsin, et al.
Published: (2025)
Compression Tells Intelligence: Visual Coding, Visual Token Technology, and the Unification
by: Jin, Xin, et al.
Published: (2026)
BERT-VQA: Visual Question Answering on Plots
by: Vu, Tai, et al.
Published: (2025)
Efficient3D: A Unified Framework for Adaptive and Debiased Token Reduction in 3D MLLMs
by: Lin, Yuhui, et al.
Published: (2026)
Detect2Interact: Localizing Object Key Field in Visual Question Answering (VQA) with LLMs
by: Wang, Jialou, et al.
Published: (2024)
Semantic and Visual Evidence for Efficient Long-Video Reasoning: A Solution for the HD-EPIC VQA Challenge
by: Xu, Yinsong, et al.
Published: (2026)
MaS-VQA: A Mask-and-Select Framework for Knowledge-Based Visual Question Answering
by: Mao, Xianwei, et al.
Published: (2026)
RailVQA: A Benchmark and Framework for Efficient Interpretable Visual Cognition in Automatic Train Operation
by: Zhang, Sen, et al.
Published: (2026)
MI-Pruner: Crossmodal Mutual Information-guided Token Pruner for Efficient MLLMs
by: Li, Jiameng, et al.
Published: (2026)
CommVQA: Situating Visual Question Answering in Communicative Contexts
by: Naik, Nandita Shankar, et al.
Published: (2024)
When MLLMs Meet Compression Distortion: A Coding Paradigm Tailored to MLLMs
by: Liu, Jinming, et al.
Published: (2025)
StackOverflowVQA: Stack Overflow Visual Question Answering Dataset
by: Mirzaei, Motahhare, et al.
Published: (2024)
SoTA with Less: MCTS-Guided Sample Selection for Data-Efficient Visual Reasoning Self-Improvement
by: Wang, Xiyao, et al.
Published: (2025)
Can Visual Input Be Compressed? A Visual Token Compression Benchmark for Large Multimodal Models
by: Peng, Tianfan, et al.
Published: (2025)
LLaVA-SP: Enhancing Visual Representation with Visual Spatial Tokens for MLLMs
by: Lou, Haoran, et al.
Published: (2025)
TokenCarve: Information-Preserving Visual Token Compression in Multimodal Large Language Models
by: Tan, Xudong, et al.
Published: (2025)
MedXplain-VQA: Multi-Component Explainable Medical Visual Question Answering
by: Nguyen, Hai-Dang, et al.
Published: (2025)
DisasterVQA: A Visual Question Answering Benchmark Dataset for Disaster Scenes
by: Al-Mohannadi, Aisha, et al.
Published: (2026)