:: Library Catalog

Cover Image

Saved in:

Bibliographic Details
Main Authors:	Aghdam, Amir, Hu, Vincent Tao, Ommer, Björn
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition Machine Learning Multimedia I.2.10; I.2.7
Online Access:	https://arxiv.org/abs/2506.22967
Tags:	Add Tag No Tags, Be the first to tag this record!

Similar Items

Leum-VL Technical Report
by: He, Yuxuan, et al.
Published: (2026)

MemeCraft: Contextual and Stance-Driven Multimodal Meme Generation
by: Wang, Han, et al.
Published: (2024)

A Roadmap for Multilingual, Multimodal Domain Independent Deception Detection
by: Boumber, Dainis, et al.
Published: (2024)

Perception-Consistency Multimodal Large Language Models Reasoning via Caption-Regularized Policy Optimization
by: Tu, Songjun, et al.
Published: (2025)

Universal Adversarial Attack on Aligned Multimodal LLMs
by: Rahmatullaev, Temurbek, et al.
Published: (2025)

Cinéaste: A Fine-grained Contextual Movie Question Answering Benchmark
by: Shah, Nisarg A., et al.
Published: (2025)

ICG: Improving Cover Image Generation via MLLM-based Prompting and Personalized Preference Alignment
by: Bian, Zhipeng, et al.
Published: (2026)

Labels or Input? Rethinking Augmentation in Multimodal Hate Detection
by: Singh, Sahajpreet, et al.
Published: (2025)

PhysicsArena: The First Multimodal Physics Reasoning Benchmark Exploring Variable, Process, and Solution Dimensions
by: Dai, Song, et al.
Published: (2025)

Hateful Meme Detection through Context-Sensitive Prompting and Fine-Grained Labeling
by: Ouyang, Rongxin, et al.
Published: (2024)

RCI: A Score for Evaluating Global and Local Reasoning in Multimodal Benchmarks
by: Agarwal, Amit, et al.
Published: (2025)

Game-RL: Synthesizing Multimodal Verifiable Game Data to Boost VLMs' General Reasoning
by: Tong, Jingqi, et al.
Published: (2025)

ScreenSpot-Pro: GUI Grounding for Professional High-Resolution Computer Use
by: Li, Kaixin, et al.
Published: (2025)

Evaluating Voice Command Pipelines for Drone Control: From STT and LLM to Direct Classification and Siamese Networks
by: Simões, Lucca Emmanuel Pineli, et al.
Published: (2024)

OpenMap: Instruction Grounding via Open-Vocabulary Visual-Language Mapping
by: Li, Danyang, et al.
Published: (2025)

Taking Flight with Dialogue: Enabling Natural Language Control for PX4-based Drone Agent
by: Lim, Shoon Kit, et al.
Published: (2025)

StratXplore: Strategic Novelty-seeking and Instruction-aligned Exploration for Vision and Language Navigation
by: Gopinathan, Muraleekrishna, et al.
Published: (2024)

Aligning by Misaligning: Boundary-aware Curriculum Learning for Multimodal Alignment
by: Ye, Hua, et al.
Published: (2025)

Memory-Efficient Differentially Private Training with Gradient Random Projection
by: Mulrooney, Alex, et al.
Published: (2025)

PhysNote: Self-Knowledge Notes for Evolvable Physical Reasoning in Vision-Language Model
by: Zhang, Sinin, et al.
Published: (2026)

GeoVision Labeler: Zero-Shot Geospatial Classification with Vision and Language Models
by: Hacheme, Gilles Quentin, et al.
Published: (2025)

K-MetBench: A Multi-Dimensional Benchmark for Fine-Grained Evaluation of Expert Reasoning, Locality, and Multimodality in Meteorology
by: Kim, Soyeon, et al.
Published: (2026)

VidNum-1.4K: A Comprehensive Benchmark for Video-based Numerical Reasoning
by: Cui, Shaoyang, et al.
Published: (2026)

Learning the meanings of function words from grounded language using a visual question answering model
by: Portelance, Eva, et al.
Published: (2023)

Physics-R1: An Audited Olympiad Corpus and Recipe for Visual Physics Reasoning
by: Yang, Shan
Published: (2026)

Spatially-Aware Speaker for Vision-and-Language Navigation Instruction Generation
by: Gopinathan, Muraleekrishna, et al.
Published: (2024)

Bounding Hallucinations: Information-Theoretic Guarantees for RAG Systems via Merlin-Arthur Protocols
by: Deiseroth, Björn, et al.
Published: (2025)

TensLoRA: Tensor Alternatives for Low-Rank Adaptation
by: Marmoret, Axel, et al.
Published: (2025)

Correspondence of high-dimensional emotion structures elicited by video clips between humans and Multimodal LLMs
by: Asanuma, Haruka, et al.
Published: (2025)

PCRI: Measuring Context Robustness in Multimodal Models for Enterprise Applications
by: Patel, Hitesh Laxmichand, et al.
Published: (2025)

Survey Transfer Learning: Recycling Data with Silicon Responses
by: Amini, Ali
Published: (2025)

TriAlignGR: Triangular Multitask Alignment with Multimodal Deep Interest Mining for Generative Recommendation
by: Zeng, Yangchen, et al.
Published: (2026)

ReSpace: Text-Driven Autoregressive 3D Indoor Scene Synthesis and Editing
by: Bucher, Martin JJ., et al.
Published: (2025)

A Closer Look at Bias and Chain-of-Thought Faithfulness of Large (Vision) Language Models
by: Balasubramanian, Sriram, et al.
Published: (2025)

MIRA: Empowering One-Touch AI Services on Smartphones with MLLM-based Instruction Recommendation
by: Bian, Zhipeng, et al.
Published: (2025)

From Benchmarking to Reasoning: A Dual-Aspect, Large-Scale Evaluation of LLMs on Vietnamese Legal Text
by: Le, Van-Truong
Published: (2026)

Defending against Backdoor Attacks via Module Switching
by: Li, Weijun, et al.
Published: (2025)

Enhanced Kalman with Adaptive Appearance Motion SORT for Grounded Generic Multiple Object Tracking
by: Anh, Duy Le Dinh, et al.
Published: (2024)

What Makes a Good Story and How Can We Measure It? A Comprehensive Survey of Story Evaluation
by: Yang, Dingyi, et al.
Published: (2024)

Enhancing Sports Strategy with Video Analytics and Data Mining: Assessing the effectiveness of Multimodal LLMs in tennis video analysis
by: Teo, Charlton
Published: (2025)