:: Library Catalog

Bibliographic Details
Main Authors:	Sanguigni, Fulvio; Lobba, Davide; Ren, Bin; Cornia, Marcella; Sebe, Nicu; Cucchiara, Rita
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2603.22607

Similar Items

Inverse Virtual Try-On: Generating Multi-Category Product-Style Images from Clothed Individuals
by: Lobba, Davide, et al.
Published: (2025)

Fashion-RAG: Multimodal Fashion Image Editing via Retrieval-Augmented Generation
by: Sanguigni, Fulvio, et al.
Published: (2025)

What Changed? Detecting and Evaluating Instruction-Guided Image Edits with Multimodal Large Language Models
by: Baraldi, Lorenzo, et al.
Published: (2025)

Hierarchical Cross-Attention Network for Virtual Try-On
by: Tang, Hao, et al.
Published: (2024)

Multimodal-Conditioned Latent Diffusion Models for Fashion Image Editing
by: Baldrati, Alberto, et al.
Published: (2024)

Image Captioning Evaluation in the Age of Multimodal LLMs: Challenges and Future Perspectives
by: Sarto, Sara, et al.
Published: (2025)

Fluent and Accurate Image Captioning with a Self-Trained Reward Model
by: Moratelli, Nicholas, et al.
Published: (2024)

Unveiling the Truth: Exploring Human Gaze Patterns in Fake Images
by: Cartella, Giuseppe, et al.
Published: (2024)

Recurrence-Enhanced Vision-and-Language Transformers for Robust Multimodal Document Retrieval
by: Caffagni, Davide, et al.
Published: (2025)

Recurrence Meets Transformers for Universal Multimodal Retrieval
by: Caffagni, Davide, et al.
Published: (2025)

Revisiting Image Captioning Training Paradigm via Direct CLIP-based Optimization
by: Moratelli, Nicholas, et al.
Published: (2024)

Personalizing Multimodal Large Language Models for Image Captioning: An Experimental Analysis
by: Bucciarelli, Davide, et al.
Published: (2024)

BRIDGE: Bridging Gaps in Image Captioning Evaluation with Stronger Visual Cues
by: Sarto, Sara, et al.
Published: (2024)

Training-Free Open-Vocabulary Segmentation with Offline Diffusion-Augmented Prototype Generation
by: Barsellotti, Luca, et al.
Published: (2024)

LLaVA-MORE: A Comparative Study of LLMs and Visual Backbones for Enhanced Visual Instruction Tuning
by: Cocchi, Federico, et al.
Published: (2025)

Personalized Instance-based Navigation Toward User-Specific Objects in Realistic Environments
by: Barsellotti, Luca, et al.
Published: (2024)

Hallucination Early Detection in Diffusion Models
by: Betti, Federico, et al.
Published: (2026)

Optimizing Resource Consumption in Diffusion Models through Hallucination Early Detection
by: Betti, Federico, et al.
Published: (2024)

Tiny Inference-Time Scaling with Latent Verifiers
by: Bucciarelli, Davide, et al.
Published: (2026)

Mitigating Hallucinations in Multimodal LLMs via Object-aware Preference Optimization
by: Compagnoni, Alberto, et al.
Published: (2025)

Parents and Children: Distinguishing Multimodal DeepFakes from Natural Images
by: Amoroso, Roberto, et al.
Published: (2023)

Positive-Augmented Contrastive Learning for Vision-and-Language Evaluation and Training
by: Sarto, Sara, et al.
Published: (2024)

Towards Retrieval-Augmented Architectures for Image Captioning
by: Sarto, Sara, et al.
Published: (2024)

Embodied Agents for Efficient Exploration and Smart Scene Description
by: Bigazzi, Roberto, et al.
Published: (2023)

Augmenting Multimodal LLMs with Self-Reflective Tokens for Knowledge-based Visual Question Answering
by: Cocchi, Federico, et al.
Published: (2024)

A Unified Masked Jigsaw Puzzle Framework for Vision and Language Models
by: Ye, Weixin, et al.
Published: (2026)

Multi-Class Unlearning for Image Classification via Weight Filtering
by: Poppi, Samuele, et al.
Published: (2023)

OmniTry: Virtual Try-On Anything without Masks
by: Feng, Yutong, et al.
Published: (2025)

TryOffAnyone: Tiled Cloth Generation from a Dressed Person
by: Xarchakos, Ioannis, et al.
Published: (2024)

Wiki-LLaVA: Hierarchical Retrieval-Augmented Generation for Multimodal LLMs
by: Caffagni, Davide, et al.
Published: (2024)

Learning to Mask and Permute Visual Tokens for Vision Transformer Pre-Training
by: Baraldi, Lorenzo, et al.
Published: (2023)

Spot the Difference: A Novel Task for Embodied Agents in Changing Environments
by: Landi, Federico, et al.
Published: (2022)

Contrasting Deepfakes Diffusion via Contrastive Learning and Global-Local Similarities
by: Baraldi, Lorenzo, et al.
Published: (2024)

Embodied Navigation at the Art Gallery
by: Bigazzi, Roberto, et al.
Published: (2022)

TryOffDiff: Virtual-Try-Off via High-Fidelity Garment Reconstruction using Diffusion Models
by: Velioglu, Riza, et al.
Published: (2024)

Safe-CLIP: Removing NSFW Concepts from Vision-and-Language Models
by: Poppi, Samuele, et al.
Published: (2023)

Explore and Explain: Self-supervised Navigation and Recounting
by: Bigazzi, Roberto, et al.
Published: (2020)

Modeling Human Gaze Behavior with Diffusion Models for Unified Scanpath Prediction
by: Cartella, Giuseppe, et al.
Published: (2025)

Shape-Guided Clothing Warping for Virtual Try-On
by: Han, Xiaoyu, et al.
Published: (2025)

ReAG: Reasoning-Augmented Generation for Knowledge-based Visual Question Answering
by: Compagnoni, Alberto, et al.
Published: (2025)