| Main Authors: | Gong, Yifei, Wu, Xing, Liu, Wenda, Tu, Kang |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2604.07960 |
Similar Items
VIXEN: Visual Text Comparison Network for Image Difference Captioning
by: Black, Alexander, et al.
Published: (2024)
Unleashing Spatial Reasoning in Multimodal Large Language Models via Textual Representation Guided Reasoning
by: Hua, Jiacheng, et al.
Published: (2026)
ReCAD: Reinforcement Learning Enhanced Parametric CAD Model Generation with Vision-Language Models
by: Li, Jiahao, et al.
Published: (2025)
Reinforcement Fine-Tuning Powers Reasoning Capability of Multimodal Large Language Models
by: Sun, Haoyuan, et al.
Published: (2025)
VEGA: Learning Interleaved Image-Text Comprehension in Vision-Language Large Models
by: Zhou, Chenyu, et al.
Published: (2024)
Unraveling Cross-Modality Knowledge Conflicts in Large Vision-Language Models
by: Zhu, Tinghui, et al.
Published: (2024)
Large Multi-modal Models Can Interpret Features in Large Multi-modal Models
by: Zhang, Kaichen, et al.
Published: (2024)
ODE: Open-Set Evaluation of Hallucinations in Multimodal Large Language Models
by: Tu, Yahan, et al.
Published: (2024)
GeoCAD: Local Geometry-Controllable CAD Generation with Large Language Models
by: Zhang, Zhanwei, et al.
Published: (2025)
Exploring Multimodal Large Language Models for Radiology Report Error-checking
by: Wu, Jinge, et al.
Published: (2023)
Exploring Advanced Large Language Models with LLMsuite
by: Roffo, Giorgio
Published: (2024)
Exploring the Role of Explicit Temporal Modeling in Multimodal Large Language Models for Video Understanding
by: Li, Yun, et al.
Published: (2025)
EmotionHallucer: Evaluating Emotion Hallucinations in Multimodal Large Language Models
by: Xing, Bohao, et al.
Published: (2025)
Learning Domain Knowledge in Multimodal Large Language Models through Reinforcement Fine-Tuning
by: Cao, Qinglong, et al.
Published: (2026)
HyperGVL: Benchmarking and Improving Large Vision-Language Models in Hypergraph Understanding and Reasoning
by: Wei, Yanbin, et al.
Published: (2026)
LADR: Locality-Aware Dynamic Rescue for Efficient Text-to-Image Generation with Diffusion Large Language Models
by: Wang, Chenglin, et al.
Published: (2026)
PUMGPT: A Large Vision-Language Model for Product Understanding
by: Xue, Wei, et al.
Published: (2023)
MIRL: Mutual Information-Guided Reinforcement Learning for Vision-Language Models
by: Zhang, Yin, et al.
Published: (2026)
Scaling Agentic Reinforcement Learning for Tool-Integrated Reasoning in VLMs
by: Lu, Meng, et al.
Published: (2025)
ML-Mamba: Efficient Multi-Modal Large Language Model Utilizing Mamba-2
by: Huang, Wenjun, et al.
Published: (2024)
Prompt4Trust: A Reinforcement Learning Prompt Augmentation Framework for Clinically-Aligned Confidence Calibration in Multimodal Large Language Models
by: Kriz, Anita, et al.
Published: (2025)
OmniParser V2: Structured-Points-of-Thought for Unified Visual Text Parsing and Its Generality to Multimodal Large Language Models
by: Yu, Wenwen, et al.
Published: (2025)
A Cross-Modal Rumor Detection Scheme via Contrastive Learning by Exploring Text and Image internal Correlations
by: Ma, Bin, et al.
Published: (2025)
Evaluating Multimodal Large Language Models on Vertically Written Japanese Text
by: Sasagawa, Keito, et al.
Published: (2025)
Reinforced Visual Perception with Tools
by: Zhou, Zetong, et al.
Published: (2025)
Jailbreaking Safeguarded Text-to-Image Models via Large Language Models
by: Jiang, Zhengyuan, et al.
Published: (2025)
Vision-centric Token Compression in Large Language Model
by: Xing, Ling, et al.
Published: (2025)
Geometry-Aware Losses for Structure-Preserving Text-to-Sign Language Generation
by: Wu, Zetian, et al.
Published: (2025)
VidCoM: Fast Video Comprehension through Large Language Models with Multimodal Tools
by: Qi, Ji, et al.
Published: (2023)
Growing a Multi-head Twig via Distillation and Reinforcement Learning to Accelerate Large Vision-Language Models
by: Shao, Zhenwei, et al.
Published: (2025)
Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning
by: Zhai, Yuexiang, et al.
Published: (2024)
MuLTI: Efficient Video-and-Language Understanding with Text-Guided MultiWay-Sampler and Multiple Choice Modeling
by: Xu, Jiaqi, et al.
Published: (2023)
Generative Sign-description Prompts with Multi-positive Contrastive Learning for Sign Language Recognition
by: Liang, Siyu, et al.
Published: (2025)
MME-SCI: A Comprehensive and Challenging Science Benchmark for Multimodal Large Language Models
by: Ruan, Jiacheng, et al.
Published: (2025)
EFUF: Efficient Fine-grained Unlearning Framework for Mitigating Hallucinations in Multimodal Large Language Models
by: Xing, Shangyu, et al.
Published: (2024)
Expressive and Generalizable Low-rank Adaptation for Large Models via Slow Cascaded Learning
by: Li, Siwei, et al.
Published: (2024)
VideoAVE: A Multi-Attribute Video-to-Text Attribute Value Extraction Dataset and Benchmark Models
by: Cheng, Ming, et al.
Published: (2025)
Where do Large Vision-Language Models Look at when Answering Questions?
by: Xing, Xiaoying, et al.
Published: (2025)
Do Vision-Language Models Really Understand Visual Language?
by: Hou, Yifan, et al.
Published: (2024)
VerlTool: Towards Holistic Agentic Reinforcement Learning with Tool Use
by: Jiang, Dongfu, et al.
Published: (2025)