| Main Authors: | Baldrati, Alberto, Morelli, Davide, Cornia, Marcella, Bertini, Marco, Cucchiara, Rita |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Online Access: | https://arxiv.org/abs/2403.14828 |
Similar Items
Fashion-RAG: Multimodal Fashion Image Editing via Retrieval-Augmented Generation
by: Sanguigni, Fulvio, et al.
Published: (2025)
Parents and Children: Distinguishing Multimodal DeepFakes from Natural Images
by: Amoroso, Roberto, et al.
Published: (2023)
Image Captioning Evaluation in the Age of Multimodal LLMs: Challenges and Future Perspectives
by: Sarto, Sara, et al.
Published: (2025)
Personalizing Multimodal Large Language Models for Image Captioning: An Experimental Analysis
by: Bucciarelli, Davide, et al.
Published: (2024)
Fluent and Accurate Image Captioning with a Self-Trained Reward Model
by: Moratelli, Nicholas, et al.
Published: (2024)
Dress-ED: Instruction-Guided Editing for Virtual Try-On and Try-Off
by: Sanguigni, Fulvio, et al.
Published: (2026)
Recurrence Meets Transformers for Universal Multimodal Retrieval
by: Caffagni, Davide, et al.
Published: (2025)
Recurrence-Enhanced Vision-and-Language Transformers for Robust Multimodal Document Retrieval
by: Caffagni, Davide, et al.
Published: (2025)
Mitigating Hallucinations in Multimodal LLMs via Object-aware Preference Optimization
by: Compagnoni, Alberto, et al.
Published: (2025)
Unveiling the Truth: Exploring Human Gaze Patterns in Fake Images
by: Cartella, Giuseppe, et al.
Published: (2024)
Tiny Inference-Time Scaling with Latent Verifiers
by: Bucciarelli, Davide, et al.
Published: (2026)
Revisiting Image Captioning Training Paradigm via Direct CLIP-based Optimization
by: Moratelli, Nicholas, et al.
Published: (2024)
Inverse Virtual Try-On: Generating Multi-Category Product-Style Images from Clothed Individuals
by: Lobba, Davide, et al.
Published: (2025)
BRIDGE: Bridging Gaps in Image Captioning Evaluation with Stronger Visual Cues
by: Sarto, Sara, et al.
Published: (2024)
Training-Free Open-Vocabulary Segmentation with Offline Diffusion-Augmented Prototype Generation
by: Barsellotti, Luca, et al.
Published: (2024)
What Changed? Detecting and Evaluating Instruction-Guided Image Edits with Multimodal Large Language Models
by: Baraldi, Lorenzo, et al.
Published: (2025)
Augmenting Multimodal LLMs with Self-Reflective Tokens for Knowledge-based Visual Question Answering
by: Cocchi, Federico, et al.
Published: (2024)
Towards Retrieval-Augmented Architectures for Image Captioning
by: Sarto, Sara, et al.
Published: (2024)
iSEARLE: Improving Textual Inversion for Zero-Shot Composed Image Retrieval
by: Agnolucci, Lorenzo, et al.
Published: (2024)
Modeling Human Gaze Behavior with Diffusion Models for Unified Scanpath Prediction
by: Cartella, Giuseppe, et al.
Published: (2025)
Multi-Class Unlearning for Image Classification via Weight Filtering
by: Poppi, Samuele, et al.
Published: (2023)
Wiki-LLaVA: Hierarchical Retrieval-Augmented Generation for Multimodal LLMs
by: Caffagni, Davide, et al.
Published: (2024)
Personalized Instance-based Navigation Toward User-Specific Objects in Realistic Environments
by: Barsellotti, Luca, et al.
Published: (2024)
Contrasting Deepfakes Diffusion via Contrastive Learning and Global-Local Similarities
by: Baraldi, Lorenzo, et al.
Published: (2024)
ReAG: Reasoning-Augmented Generation for Knowledge-based Visual Question Answering
by: Compagnoni, Alberto, et al.
Published: (2025)
The Revolution of Multimodal Large Language Models: A Survey
by: Caffagni, Davide, et al.
Published: (2024)
Positive-Augmented Contrastive Learning for Vision-and-Language Evaluation and Training
by: Sarto, Sara, et al.
Published: (2024)
Embodied Agents for Efficient Exploration and Smart Scene Description
by: Bigazzi, Roberto, et al.
Published: (2023)
RaTA-Tool: Retrieval-based Tool Selection with Multimodal Large Language Models
by: Mattioli, Gabriele, et al.
Published: (2026)
Seeing Beyond Words: Self-Supervised Visual Learning for Multimodal Large Language Models
by: Caffagni, Davide, et al.
Published: (2025)
Safe-CLIP: Removing NSFW Concepts from Vision-and-Language Models
by: Poppi, Samuele, et al.
Published: (2023)
Improving Zero-shot Generalization of Learned Prompts via Unsupervised Knowledge Distillation
by: Mistretta, Marco, et al.
Published: (2024)
LLaVA-MORE: A Comparative Study of LLMs and Visual Backbones for Enhanced Visual Instruction Tuning
by: Cocchi, Federico, et al.
Published: (2025)
Learning to Mask and Permute Visual Tokens for Vision Transformer Pre-Training
by: Baraldi, Lorenzo, et al.
Published: (2023)
Spot the Difference: A Novel Task for Embodied Agents in Changing Environments
by: Landi, Federico, et al.
Published: (2022)
Embodied Navigation at the Art Gallery
by: Bigazzi, Roberto, et al.
Published: (2022)
Explore and Explain: Self-supervised Navigation and Recounting
by: Bigazzi, Roberto, et al.
Published: (2020)
Fashionability-Enhancing Outfit Image Editing with Conditional Diffusion Models
by: Qin, Qice, et al.
Published: (2024)
Trends, Applications, and Challenges in Human Attention Modelling
by: Cartella, Giuseppe, et al.
Published: (2024)
TPP-Gaze: Modelling Gaze Dynamics in Space and Time with Neural Temporal Point Processes
by: D'Amelio, Alessandro, et al.
Published: (2024)