| Main Authors: | Giannone, Giorgio, Li, Ruoteng, Feng, Qianli, Perevodchikov, Evgeny, Chen, Rui, Martinez, Aleix |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2501.04568 |
Similar Items
SPA-VL: A Comprehensive Safety Preference Alignment Dataset for Vision Language Model
by: Zhang, Yongting, et al.
Published: (2024)
WildVision: Evaluating Vision-Language Models in the Wild with Human Preferences
by: Lu, Yujie, et al.
Published: (2024)
Skip \n: A Simple Method to Reduce Hallucination in Large Vision-Language Models
by: Han, Zongbo, et al.
Published: (2024)
VLFeedback: A Large-Scale AI Feedback Dataset for Large Vision-Language Models Alignment
by: Li, Lei, et al.
Published: (2024)
EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents
by: Yang, Rui, et al.
Published: (2025)
GaussianVision: Vision-Language Alignment from Compressed Image Representations using 2D Gaussian Splatting
by: Omri, Yasmine, et al.
Published: (2025)
Talking to DINO: Bridging Self-Supervised Vision Backbones with Language for Open-Vocabulary Segmentation
by: Barsellotti, Luca, et al.
Published: (2024)
IE-Critic-R1: Advancing the Explanatory Measurement of Text-Driven Image Editing for Human Perception Alignment
by: Qu, Bowen, et al.
Published: (2025)
CDH-Bench: A Commonsense-Driven Hallucination Benchmark for Evaluating Visual Fidelity in Vision-Language Models
by: Chen, Kesheng, et al.
Published: (2026)
VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks
by: Jiang, Ziyan, et al.
Published: (2024)
FineBench: Benchmarking and Enhancing Vision-Language Models for Fine-grained Human Activity Understanding
by: Faure, Gueter Josmy, et al.
Published: (2026)
Cross-Modal Adapter for Vision-Language Retrieval
by: Jiang, Haojun, et al.
Published: (2022)
DOTA: Distributional Test-Time Adaptation of Vision-Language Models
by: Han, Zongbo, et al.
Published: (2024)
Seeing Through Words, Speaking Through Pixels: Deep Representational Alignment Between Vision and Language Models
by: He, Zoe Wanying, et al.
Published: (2025)
Flexible Tool Selection through Low-dimensional Attribute Alignment of Vision and Language
by: Hao, Guangfu, et al.
Published: (2025)
Improving Medical Large Vision-Language Models with Abnormal-Aware Feedback
by: Zhou, Yucheng, et al.
Published: (2025)
Enhancing Visual-Language Modality Alignment in Large Vision Language Models via Self-Improvement
by: Wang, Xiyao, et al.
Published: (2024)
Unified Vision-Language Modeling via Concept Space Alignment
by: Qiu, Yifu, et al.
Published: (2026)
Self-Supervised Visual Preference Alignment
by: Zhu, Ke, et al.
Published: (2024)
Preference Tuning with Human Feedback on Language, Speech, and Vision Tasks: A Survey
by: Winata, Genta Indra, et al.
Published: (2024)
Can Vision Language Models Judge Action Quality? An Empirical Evaluation
by: Freitas, Miguel Monte e, et al.
Published: (2026)
Loc3R-VLM: Language-based Localization and 3D Reasoning with Vision-Language Models
by: Qu, Kevin, et al.
Published: (2026)
Revisiting the Role of Language Priors in Vision-Language Models
by: Lin, Zhiqiu, et al.
Published: (2023)
Panoptic Vision-Language Feature Fields
by: Chen, Haoran, et al.
Published: (2023)
Polos: Multimodal Metric Learning from Human Feedback for Image Captioning
by: Wada, Yuiga, et al.
Published: (2024)
Retrieval-based Disentangled Representation Learning with Natural Language Supervision
by: Zhou, Jiawei, et al.
Published: (2022)
Vision Token Reduction via Attention-Driven Self-Compression for Efficient Multimodal Large Language Models
by: Deniz, Omer Faruk, et al.
Published: (2026)
Mitigating Object Hallucinations in Large Vision-Language Models with Assembly of Global and Local Attention
by: An, Wenbin, et al.
Published: (2024)
Symmetrical Visual Contrastive Optimization: Aligning Vision-Language Models with Minimal Contrastive Images
by: Wu, Shengguang, et al.
Published: (2025)
Analyzing Fine-Grained Alignment and Enhancing Vision Understanding in Multimodal Language Models
by: Jiang, Jiachen, et al.
Published: (2025)
OMIBench: Benchmarking Olympiad-Level Multi-Image Reasoning in Large Vision-Language Model
by: Chen, Qiguang, et al.
Published: (2026)
TRAVEL: Training-Free Retrieval and Alignment for Vision-and-Language Navigation
by: Rajabi, Navid, et al.
Published: (2025)
Probing Collision Grounding in Vision-Language Models for Safe Human-Robot Collaboration
by: Wang, Jun, et al.
Published: (2026)
DynaMath: A Dynamic Visual Benchmark for Evaluating Mathematical Reasoning Robustness of Vision Language Models
by: Zou, Chengke, et al.
Published: (2024)
Weaving Context Across Images: Improving Vision-Language Models through Focus-Centric Visual Chains
by: Zhang, Juntian, et al.
Published: (2025)
DRESS: Instructing Large Vision-Language Models to Align and Interact with Humans via Natural Language Feedback
by: Chen, Yangyi, et al.
Published: (2023)
Multi-Object Hallucination in Vision-Language Models
by: Chen, Xuweiyi, et al.
Published: (2024)
Detecting and Mitigating Hallucination in Large Vision Language Models via Fine-Grained AI Feedback
by: Xiao, Wenyi, et al.
Published: (2024)
Mem4Nav: Boosting Vision-and-Language Navigation in Urban Environments with a Hierarchical Spatial-Cognition Long-Short Memory System
by: He, Lixuan, et al.
Published: (2025)
VELMA: Verbalization Embodiment of LLM Agents for Vision and Language Navigation in Street View
by: Schumann, Raphael, et al.
Published: (2023)