:: Library Catalog

Bibliographic Details
Main Authors:	Dell'Erba, Samuele, Bagdanov, Andrew D.
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition; Artificial Intelligence; Computation and Language; Machine Learning; I.4.9; I.2.10; I.2.7
Online Access:	https://arxiv.org/abs/2511.20821

Similar Items

WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation
by: Niu, Yuwei, et al.
Published: (2025)

Visual Aesthetic Benchmark: Can Frontier Models Judge Beauty?
by: Feng, Yichen, et al.
Published: (2026)

Spatially-Aware Speaker for Vision-and-Language Navigation Instruction Generation
by: Gopinathan, Muraleekrishna, et al.
Published: (2024)

FlexDoc: Parameterized Sampling for Diverse Multilingual Synthetic Documents for Training Document Understanding Models
by: Dua, Karan, et al.
Published: (2025)

Chat-Driven Text Generation and Interaction for Person Retrieval
by: Xie, Zequn, et al.
Published: (2025)

MOVE: A Mixture-of-Vision-Encoders Approach for Domain-Focused Vision-Language Processing
by: Skripkin, Matvey, et al.
Published: (2025)

GLoT: A Novel Gated-Logarithmic Transformer for Efficient Sign Language Translation
by: Shahin, Nada, et al.
Published: (2025)

OpenMap: Instruction Grounding via Open-Vocabulary Visual-Language Mapping
by: Li, Danyang, et al.
Published: (2025)

Taking Flight with Dialogue: Enabling Natural Language Control for PX4-based Drone Agent
by: Lim, Shoon Kit, et al.
Published: (2025)

ADAT: Time-Series-Aware Adaptive Transformer Architecture for Sign Language Translation
by: Shahin, Nada, et al.
Published: (2025)

ICG: Improving Cover Image Generation via MLLM-based Prompting and Personalized Preference Alignment
by: Bian, Zhipeng, et al.
Published: (2026)

StratXplore: Strategic Novelty-seeking and Instruction-aligned Exploration for Vision and Language Navigation
by: Gopinathan, Muraleekrishna, et al.
Published: (2024)

Bridge Diffusion Model: Bridge Chinese Text-to-Image Diffusion Model with English Communities
by: Liu, Shanyuan, et al.
Published: (2023)

RoboScript: Code Generation for Free-Form Manipulation Tasks across Real and Simulation
by: Chen, Junting, et al.
Published: (2024)

OkanNet: A Lightweight Deep Learning Architecture for Classification of Brain Tumor from MRI Images
by: Uçar, Okan, et al.
Published: (2026)

OmniFusion Technical Report
by: Goncharova, Elizaveta, et al.
Published: (2024)

HATL: Hierarchical Adaptive-Transfer Learning Framework for Sign Language Machine Translation
by: Shahin, Nada, et al.
Published: (2026)

NumeriKontrol: Adding Numeric Control to Diffusion Transformers for Instruction-based Image Editing
by: Xu, Zhenyu, et al.
Published: (2025)

Physics-R1: An Audited Olympiad Corpus and Recipe for Visual Physics Reasoning
by: Yang, Shan
Published: (2026)

UAV-assisted Visual SLAM Generating Reconstructed 3D Scene Graphs in GPS-denied Environments
by: Radwan, Ahmed, et al.
Published: (2024)

YETI (YET to Intervene) Proactive Interventions by Multimodal AI Agents in Augmented Reality Tasks
by: Bandyopadhyay, Saptarashmi, et al.
Published: (2025)

Game-RL: Synthesizing Multimodal Verifiable Game Data to Boost VLMs' General Reasoning
by: Tong, Jingqi, et al.
Published: (2025)

Exposing Blindspots: Cultural Bias Evaluation in Generative Image Models
by: Seo, Huichan, et al.
Published: (2025)

A Surveillance Based Interactive Robot
by: Kavimandan, Kshitij, et al.
Published: (2025)

Human-Robot Dialogue Annotation for Multi-Modal Common Ground
by: Bonial, Claire, et al.
Published: (2024)

SCOUT: A Situated and Multi-Modal Human-Robot Dialogue Corpus
by: Lukin, Stephanie M., et al.
Published: (2024)

PhysicsArena: The First Multimodal Physics Reasoning Benchmark Exploring Variable, Process, and Solution Dimensions
by: Dai, Song, et al.
Published: (2025)

Ego-Motion Aware Target Prediction Module for Robust Multi-Object Tracking
by: Mahdian, Navid, et al.
Published: (2024)

Vision-based Situational Graphs Exploiting Fiducial Markers for the Integration of Semantic Entities
by: Tourani, Ali, et al.
Published: (2023)

Privacy-Preserving Structureless Visual Localization via Image Obfuscation
by: Panek, Vojtech, et al.
Published: (2026)

Pro-DG: Procedural Diffusion Guidance for Architectural Facade Generation
by: Plocharski, Aleksander, et al.
Published: (2025)

Who Sees What? Structured Thought-Action Sequences for Epistemic Reasoning in LLMs
by: Annese, Luca, et al.
Published: (2025)

IntrinsiX: High-Quality PBR Generation using Image Priors
by: Kocsis, Peter, et al.
Published: (2025)

GroundCap: A Visually Grounded Image Captioning Dataset
by: Oliveira, Daniel A. P., et al.
Published: (2025)

Learning the meanings of function words from grounded language using a visual question answering model
by: Portelance, Eva, et al.
Published: (2023)

Combining Absolute and Semi-Generalized Relative Poses for Visual Localization
by: Panek, Vojtech, et al.
Published: (2024)

ReSpace: Text-Driven Autoregressive 3D Indoor Scene Synthesis and Editing
by: Bucher, Martin JJ., et al.
Published: (2025)

VLEU: a Method for Automatic Evaluation for Generalizability of Text-to-Image Models
by: Cao, Jingtao, et al.
Published: (2024)

Analyzing Quality, Bias, and Performance in Text-to-Image Generative Models
by: Masrourisaadat, Nila, et al.
Published: (2024)

Look and Tell: A Dataset for Multimodal Grounding Across Egocentric and Exocentric Views
by: Deichler, Anna, et al.
Published: (2025)