| Main Authors: | Wang, Peiqi, Peng, ShengYun, Zhang, Xuewen, Yu, Hanchao, Yang, Yibo, Huang, Lifu, Liu, Fujun, Wang, Qifan |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2505.18855 |
Similar Items
CompCap: Improving Multimodal Large Language Models with Composite Captions
by: Chen, Xiaohui, et al.
Published: (2024)
Modality-Specialized Synergizers for Interleaved Vision-Language Generalists
by: Xu, Zhiyang, et al.
Published: (2024)
Vision-Flan: Scaling Human-Labeled Tasks in Visual Instruction Tuning
by: Xu, Zhiyang, et al.
Published: (2024)
RoRA-VLM: Robust Retrieval-Augmented Vision Language Models
by: Qi, Jingyuan, et al.
Published: (2024)
A Survey on Hallucination in Large Vision-Language Models
by: Liu, Hanchao, et al.
Published: (2024)
Error-driven Data-efficient Large Multimodal Model Tuning
by: Yao, Barry Menglong, et al.
Published: (2024)
Natural Language Inference Improves Compositionality in Vision-Language Models
by: Cascante-Bonilla, Paola, et al.
Published: (2024)
MMedPO: Aligning Medical Vision-Language Models with Clinical-Aware Multimodal Preference Optimization
by: Zhu, Kangyu, et al.
Published: (2024)
Grounded-VideoLLM: Sharpening Fine-grained Temporal Grounding in Video Large Language Models
by: Wang, Haibo, et al.
Published: (2024)
Unleashing Implicit Rewards: Prefix-Value Learning for Distribution-Level Optimization
by: Gao, Shiping, et al.
Published: (2026)
Black-Box Visual Prompt Engineering for Mitigating Object Hallucination in Large Vision Language Models
by: Woo, Sangmin, et al.
Published: (2025)
Video-Based Reward Modeling for Computer-Use Agents
by: Song, Linxin, et al.
Published: (2026)
Unlocking Exocentric Video-Language Data for Egocentric Video Representation Learning
by: Dou, Zi-Yi, et al.
Published: (2024)
Self-Supervised Pre-Training for Table Structure Recognition Transformer
by: Peng, ShengYun, et al.
Published: (2024)
How Do Large Language Models Learn Concepts During Continual Pre-Training?
by: Yao, Barry Menglong, et al.
Published: (2026)
MMed-RAG: Versatile Multimodal RAG System for Medical Vision Language Models
by: Xia, Peng, et al.
Published: (2024)
AR-RAG: Autoregressive Retrieval Augmentation for Image Generation
by: Qi, Jingyuan, et al.
Published: (2025)
VEGAS: Mitigating Hallucinations in Large Vision-Language Models via Vision-Encoder Attention Guided Adaptive Steering
by: Wang, Zihu, et al.
Published: (2025)
Scaling Inference-Time Search with Vision Value Model for Improved Visual Comprehension
by: Wang, Xiyao, et al.
Published: (2024)
OmChat: A Recipe to Train Multimodal Language Models with Strong Long Context and Video Understanding
by: Zhao, Tiancheng, et al.
Published: (2024)
Calibrated Self-Rewarding Vision Language Models
by: Zhou, Yiyang, et al.
Published: (2024)
Toward Interactive Regional Understanding in Vision-Large Language Models
by: Lee, Jungbeom, et al.
Published: (2024)
Time-R1: Post-Training Large Vision Language Model for Temporal Video Grounding
by: Wang, Ye, et al.
Published: (2025)
Probing and Inducing Combinational Creativity in Vision-Language Models
by: Peng, Yongqian, et al.
Published: (2025)
VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model
by: Shen, Haozhan, et al.
Published: (2025)
Scaffolding Coordinates to Promote Vision-Language Coordination in Large Multi-Modal Models
by: Lei, Xuanyu, et al.
Published: (2024)
Frame-Voyager: Learning to Query Frames for Video Large Language Models
by: Yu, Sicheng, et al.
Published: (2024)
On Asymmetric Optimization of Reasoning and Perception in Vision-Language Model Post-Training
by: Wu, Xueqing, et al.
Published: (2026)
Multimodal Abstractive Summarization of Instructional Videos with Vision-Language Models
by: Nazir, Maham, et al.
Published: (2026)
Instruction-Aligned Visual Attention for Mitigating Hallucinations in Large Vision-Language Models
by: Li, Bin, et al.
Published: (2025)
SOLO: A Single Transformer for Scalable Vision-Language Modeling
by: Chen, Yangyi, et al.
Published: (2024)
Nemesis: Normalizing the Soft-prompt Vectors of Vision-Language Models
by: Fu, Shuai, et al.
Published: (2024)
Adaptive Vision-Language Model Routing for Computer Use Agents
by: Liu, Xunzhuo, et al.
Published: (2026)
Enhancing Large Vision Language Models with Self-Training on Image Comprehension
by: Deng, Yihe, et al.
Published: (2024)
SEASON: Mitigating Temporal Hallucination in Video Large Language Models via Self-Diagnostic Contrastive Decoding
by: Wu, Chang-Hsun, et al.
Published: (2025)
Evaluating Vision-Language Models for Emotion Recognition
by: Bhattacharyya, Sree, et al.
Published: (2025)
Fewer Denoising Steps or Cheaper Per-Step Inference: Towards Compute-Optimal Diffusion Model Deployment
by: Du, Zhenbang, et al.
Published: (2025)
ReVisionLLM: Recursive Vision-Language Model for Temporal Grounding in Hour-Long Videos
by: Hannan, Tanveer, et al.
Published: (2024)
CLIP-Adapter: Better Vision-Language Models with Feature Adapters
by: Gao, Peng, et al.
Published: (2021)
VisionZip: Longer is Better but Not Necessary in Vision Language Models
by: Yang, Senqiao, et al.
Published: (2024)