| Main Authors: | Aghdam, Amir, Hu, Vincent Tao, Ommer, Björn |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Online Access: | https://arxiv.org/abs/2506.22967 |
Similar Items
Leum-VL Technical Report
by: He, Yuxuan, et al.
Published: (2026)
MemeCraft: Contextual and Stance-Driven Multimodal Meme Generation
by: Wang, Han, et al.
Published: (2024)
A Roadmap for Multilingual, Multimodal Domain Independent Deception Detection
by: Boumber, Dainis, et al.
Published: (2024)
Perception-Consistency Multimodal Large Language Models Reasoning via Caption-Regularized Policy Optimization
by: Tu, Songjun, et al.
Published: (2025)
Universal Adversarial Attack on Aligned Multimodal LLMs
by: Rahmatullaev, Temurbek, et al.
Published: (2025)
Cinéaste: A Fine-grained Contextual Movie Question Answering Benchmark
by: Shah, Nisarg A., et al.
Published: (2025)
ICG: Improving Cover Image Generation via MLLM-based Prompting and Personalized Preference Alignment
by: Bian, Zhipeng, et al.
Published: (2026)
Labels or Input? Rethinking Augmentation in Multimodal Hate Detection
by: Singh, Sahajpreet, et al.
Published: (2025)
PhysicsArena: The First Multimodal Physics Reasoning Benchmark Exploring Variable, Process, and Solution Dimensions
by: Dai, Song, et al.
Published: (2025)
Hateful Meme Detection through Context-Sensitive Prompting and Fine-Grained Labeling
by: Ouyang, Rongxin, et al.
Published: (2024)
RCI: A Score for Evaluating Global and Local Reasoning in Multimodal Benchmarks
by: Agarwal, Amit, et al.
Published: (2025)
Game-RL: Synthesizing Multimodal Verifiable Game Data to Boost VLMs' General Reasoning
by: Tong, Jingqi, et al.
Published: (2025)
ScreenSpot-Pro: GUI Grounding for Professional High-Resolution Computer Use
by: Li, Kaixin, et al.
Published: (2025)
Evaluating Voice Command Pipelines for Drone Control: From STT and LLM to Direct Classification and Siamese Networks
by: Simões, Lucca Emmanuel Pineli, et al.
Published: (2024)
OpenMap: Instruction Grounding via Open-Vocabulary Visual-Language Mapping
by: Li, Danyang, et al.
Published: (2025)
Taking Flight with Dialogue: Enabling Natural Language Control for PX4-based Drone Agent
by: Lim, Shoon Kit, et al.
Published: (2025)
StratXplore: Strategic Novelty-seeking and Instruction-aligned Exploration for Vision and Language Navigation
by: Gopinathan, Muraleekrishna, et al.
Published: (2024)
Aligning by Misaligning: Boundary-aware Curriculum Learning for Multimodal Alignment
by: Ye, Hua, et al.
Published: (2025)
Memory-Efficient Differentially Private Training with Gradient Random Projection
by: Mulrooney, Alex, et al.
Published: (2025)
PhysNote: Self-Knowledge Notes for Evolvable Physical Reasoning in Vision-Language Model
by: Zhang, Sinin, et al.
Published: (2026)
GeoVision Labeler: Zero-Shot Geospatial Classification with Vision and Language Models
by: Hacheme, Gilles Quentin, et al.
Published: (2025)
K-MetBench: A Multi-Dimensional Benchmark for Fine-Grained Evaluation of Expert Reasoning, Locality, and Multimodality in Meteorology
by: Kim, Soyeon, et al.
Published: (2026)
VidNum-1.4K: A Comprehensive Benchmark for Video-based Numerical Reasoning
by: Cui, Shaoyang, et al.
Published: (2026)
Learning the meanings of function words from grounded language using a visual question answering model
by: Portelance, Eva, et al.
Published: (2023)
Physics-R1: An Audited Olympiad Corpus and Recipe for Visual Physics Reasoning
by: Yang, Shan
Published: (2026)
Spatially-Aware Speaker for Vision-and-Language Navigation Instruction Generation
by: Gopinathan, Muraleekrishna, et al.
Published: (2024)
Bounding Hallucinations: Information-Theoretic Guarantees for RAG Systems via Merlin-Arthur Protocols
by: Deiseroth, Björn, et al.
Published: (2025)
TensLoRA: Tensor Alternatives for Low-Rank Adaptation
by: Marmoret, Axel, et al.
Published: (2025)
Correspondence of high-dimensional emotion structures elicited by video clips between humans and Multimodal LLMs
by: Asanuma, Haruka, et al.
Published: (2025)
PCRI: Measuring Context Robustness in Multimodal Models for Enterprise Applications
by: Patel, Hitesh Laxmichand, et al.
Published: (2025)
Survey Transfer Learning: Recycling Data with Silicon Responses
by: Amini, Ali
Published: (2025)
TriAlignGR: Triangular Multitask Alignment with Multimodal Deep Interest Mining for Generative Recommendation
by: Zeng, Yangchen, et al.
Published: (2026)
ReSpace: Text-Driven Autoregressive 3D Indoor Scene Synthesis and Editing
by: Bucher, Martin JJ., et al.
Published: (2025)
A Closer Look at Bias and Chain-of-Thought Faithfulness of Large (Vision) Language Models
by: Balasubramanian, Sriram, et al.
Published: (2025)
MIRA: Empowering One-Touch AI Services on Smartphones with MLLM-based Instruction Recommendation
by: Bian, Zhipeng, et al.
Published: (2025)
From Benchmarking to Reasoning: A Dual-Aspect, Large-Scale Evaluation of LLMs on Vietnamese Legal Text
by: Le, Van-Truong
Published: (2026)
Defending against Backdoor Attacks via Module Switching
by: Li, Weijun, et al.
Published: (2025)
Enhanced Kalman with Adaptive Appearance Motion SORT for Grounded Generic Multiple Object Tracking
by: Anh, Duy Le Dinh, et al.
Published: (2024)
What Makes a Good Story and How Can We Measure It? A Comprehensive Survey of Story Evaluation
by: Yang, Dingyi, et al.
Published: (2024)
Enhancing Sports Strategy with Video Analytics and Data Mining: Assessing the effectiveness of Multimodal LLMs in tennis video analysis
by: Teo, Charlton
Published: (2025)