| Main Authors: | Dell'Erba, Samuele, Bagdanov, Andrew D. |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2511.20821 |
Similar Items
WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation
by: Niu, Yuwei, et al.
Published: (2025)
Visual Aesthetic Benchmark: Can Frontier Models Judge Beauty?
by: Feng, Yichen, et al.
Published: (2026)
Spatially-Aware Speaker for Vision-and-Language Navigation Instruction Generation
by: Gopinathan, Muraleekrishna, et al.
Published: (2024)
FlexDoc: Parameterized Sampling for Diverse Multilingual Synthetic Documents for Training Document Understanding Models
by: Dua, Karan, et al.
Published: (2025)
Chat-Driven Text Generation and Interaction for Person Retrieval
by: Xie, Zequn, et al.
Published: (2025)
MOVE: A Mixture-of-Vision-Encoders Approach for Domain-Focused Vision-Language Processing
by: Skripkin, Matvey, et al.
Published: (2025)
GLoT: A Novel Gated-Logarithmic Transformer for Efficient Sign Language Translation
by: Shahin, Nada, et al.
Published: (2025)
OpenMap: Instruction Grounding via Open-Vocabulary Visual-Language Mapping
by: Li, Danyang, et al.
Published: (2025)
Taking Flight with Dialogue: Enabling Natural Language Control for PX4-based Drone Agent
by: Lim, Shoon Kit, et al.
Published: (2025)
ADAT: Time-Series-Aware Adaptive Transformer Architecture for Sign Language Translation
by: Shahin, Nada, et al.
Published: (2025)
ICG: Improving Cover Image Generation via MLLM-based Prompting and Personalized Preference Alignment
by: Bian, Zhipeng, et al.
Published: (2026)
StratXplore: Strategic Novelty-seeking and Instruction-aligned Exploration for Vision and Language Navigation
by: Gopinathan, Muraleekrishna, et al.
Published: (2024)
Bridge Diffusion Model: Bridge Chinese Text-to-Image Diffusion Model with English Communities
by: Liu, Shanyuan, et al.
Published: (2023)
RoboScript: Code Generation for Free-Form Manipulation Tasks across Real and Simulation
by: Chen, Junting, et al.
Published: (2024)
OkanNet: A Lightweight Deep Learning Architecture for Classification of Brain Tumor from MRI Images
by: Uçar, Okan, et al.
Published: (2026)
OmniFusion Technical Report
by: Goncharova, Elizaveta, et al.
Published: (2024)
HATL: Hierarchical Adaptive-Transfer Learning Framework for Sign Language Machine Translation
by: Shahin, Nada, et al.
Published: (2026)
NumeriKontrol: Adding Numeric Control to Diffusion Transformers for Instruction-based Image Editing
by: Xu, Zhenyu, et al.
Published: (2025)
Physics-R1: An Audited Olympiad Corpus and Recipe for Visual Physics Reasoning
by: Yang, Shan
Published: (2026)
UAV-assisted Visual SLAM Generating Reconstructed 3D Scene Graphs in GPS-denied Environments
by: Radwan, Ahmed, et al.
Published: (2024)
YETI (YET to Intervene) Proactive Interventions by Multimodal AI Agents in Augmented Reality Tasks
by: Bandyopadhyay, Saptarashmi, et al.
Published: (2025)
Game-RL: Synthesizing Multimodal Verifiable Game Data to Boost VLMs' General Reasoning
by: Tong, Jingqi, et al.
Published: (2025)
Exposing Blindspots: Cultural Bias Evaluation in Generative Image Models
by: Seo, Huichan, et al.
Published: (2025)
A Surveillance Based Interactive Robot
by: Kavimandan, Kshitij, et al.
Published: (2025)
Human-Robot Dialogue Annotation for Multi-Modal Common Ground
by: Bonial, Claire, et al.
Published: (2024)
SCOUT: A Situated and Multi-Modal Human-Robot Dialogue Corpus
by: Lukin, Stephanie M., et al.
Published: (2024)
PhysicsArena: The First Multimodal Physics Reasoning Benchmark Exploring Variable, Process, and Solution Dimensions
by: Dai, Song, et al.
Published: (2025)
Ego-Motion Aware Target Prediction Module for Robust Multi-Object Tracking
by: Mahdian, Navid, et al.
Published: (2024)
Vision-based Situational Graphs Exploiting Fiducial Markers for the Integration of Semantic Entities
by: Tourani, Ali, et al.
Published: (2023)
Privacy-Preserving Structureless Visual Localization via Image Obfuscation
by: Panek, Vojtech, et al.
Published: (2026)
Pro-DG: Procedural Diffusion Guidance for Architectural Facade Generation
by: Plocharski, Aleksander, et al.
Published: (2025)
Who Sees What? Structured Thought-Action Sequences for Epistemic Reasoning in LLMs
by: Annese, Luca, et al.
Published: (2025)
IntrinsiX: High-Quality PBR Generation using Image Priors
by: Kocsis, Peter, et al.
Published: (2025)
GroundCap: A Visually Grounded Image Captioning Dataset
by: Oliveira, Daniel A. P., et al.
Published: (2025)
Learning the meanings of function words from grounded language using a visual question answering model
by: Portelance, Eva, et al.
Published: (2023)
Combining Absolute and Semi-Generalized Relative Poses for Visual Localization
by: Panek, Vojtech, et al.
Published: (2024)
ReSpace: Text-Driven Autoregressive 3D Indoor Scene Synthesis and Editing
by: Bucher, Martin JJ., et al.
Published: (2025)
VLEU: a Method for Automatic Evaluation for Generalizability of Text-to-Image Models
by: Cao, Jingtao, et al.
Published: (2024)
Analyzing Quality, Bias, and Performance in Text-to-Image Generative Models
by: Masrourisaadat, Nila, et al.
Published: (2024)
Look and Tell: A Dataset for Multimodal Grounding Across Egocentric and Exocentric Views
by: Deichler, Anna, et al.
Published: (2025)