| Main Authors: | You, Xiaoxing, Huang, Qiang, Li, Lingyu, Zhang, Chi, Liu, Xiaopeng, Zhang, Min, Yu, Jun |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2511.21002 |
Similar Items
Cut to the Chase: Training-free Multimodal Summarization via Chain-of-Events
by: You, Xiaoxing, et al.
Published: (2026)
MG$^2$-RAG: Multi-Granularity Graph for Multimodal Retrieval-Augmented Generation
by: Dai, Sijun, et al.
Published: (2026)
EAMA : Entity-Aware Multimodal Alignment Based Approach for News Image Captioning
by: Zhang, Junzhe, et al.
Published: (2024)
Knowledge Guided Entity-aware Video Captioning and A Basketball Benchmark
by: Xi, Zeyu, et al.
Published: (2024)
Retrieval-Augmented Egocentric Video Captioning
by: Xu, Jilan, et al.
Published: (2024)
Understanding Retrieval Robustness for Retrieval-Augmented Image Captioning
by: Li, Wenyan, et al.
Published: (2024)
ViTOC: Vision Transformer and Object-aware Captioner
by: Huang, Feiyang
Published: (2024)
DIR: Retrieval-Augmented Image Captioning with Comprehensive Understanding
by: Wu, Hao, et al.
Published: (2024)
Towards Retrieval-Augmented Architectures for Image Captioning
by: Sarto, Sara, et al.
Published: (2024)
MegaRAG: Multimodal Knowledge Graph-Based Retrieval Augmented Generation
by: Hsiao, Chi-Hsiang, et al.
Published: (2025)
EVENT-Retriever: Event-Aware Multimodal Image Retrieval for Realistic Captions
by: Vo, Dinh-Khoi, et al.
Published: (2025)
DWE+: Dual-Way Matching Enhanced Framework for Multimodal Entity Linking
by: Song, Shezheng, et al.
Published: (2024)
Open Multimodal Retrieval-Augmented Factual Image Generation
by: Tian, Yang, et al.
Published: (2025)
VimRAG: Navigating Massive Visual Context in Retrieval-Augmented Generation via Multimodal Memory Graph
by: Wang, Qiuchen, et al.
Published: (2026)
IFCap: Image-like Retrieval and Frequency-based Entity Filtering for Zero-shot Captioning
by: Lee, Soeun, et al.
Published: (2024)
Beyond Vision: Contextually Enriched Image Captioning with Multi-Modal Retrieval
by: Quy, Nguyen Lam Phu, et al.
Published: (2025)
RAGTrack: Language-aware RGBT Tracking with Retrieval-Augmented Generation
by: Li, Hao, et al.
Published: (2026)
LLMs Should Incorporate Explicit Mechanisms for Human Empathy
by: You, Xiaoxing, et al.
Published: (2026)
Retrieval Augmented Comic Image Generation
by: Shui, Yunhao, et al.
Published: (2025)
Mitigating Hallucinations in Large Vision-Language Models via Entity-Centric Multimodal Preference Optimization
by: Wu, Jiulong, et al.
Published: (2025)
Caption-Matching: A Multimodal Approach for Cross-Domain Image Retrieval
by: Iijima, Lucas, et al.
Published: (2024)
Modeling Image-Caption Rating from Comparative Judgments
by: Minni, Kezia, et al.
Published: (2026)
RACap: Relation-Aware Prompting for Lightweight Retrieval-Augmented Image Captioning
by: Long, Xiaosheng, et al.
Published: (2025)
Frequency-Aware Vision-Language Multimodality Generalization Network for Remote Sensing Image Classification
by: Zhang, Junjie, et al.
Published: (2025)
Enhancing Image Quality Assessment Ability of LMMs via Retrieval-Augmented Generation
by: Fu, Kang, et al.
Published: (2026)
OPCap:Object-aware Prompting Captioning
by: Huang, Feiyang
Published: (2024)
Mitigating Image Captioning Hallucinations in Vision-Language Models
by: Zhao, Fei, et al.
Published: (2025)
Utility-Oriented Visual Evidence Selection for Multimodal Retrieval-Augmented Generation
by: Luo, Weiqing, et al.
Published: (2026)
Negative Entity Suppression for Zero-Shot Captioning with Synthetic Images
by: Lu, Zimao, et al.
Published: (2025)
Exploring Annotation-free Image Captioning with Retrieval-augmented Pseudo Sentence Generation
by: Li, Zhiyuan, et al.
Published: (2023)
Context-aware Difference Distilling for Multi-change Captioning
by: Tu, Yunbin, et al.
Published: (2024)
When Abundance Conceals Weakness: Knowledge Conflict in Multilingual Models
by: Zhao, Jiaqi, et al.
Published: (2026)
Text-Only Training for Image Captioning with Retrieval Augmentation and Modality Gap Correction
by: Fonseca, Rui, et al.
Published: (2025)
Knowledge-aware Text-Image Retrieval for Remote Sensing Images
by: Mi, Li, et al.
Published: (2024)
Diffusion Based Augmentation for Captioning and Retrieval in Cultural Heritage
by: Cioni, Dario, et al.
Published: (2023)
Pix2Cap-COCO: Advancing Visual Comprehension via Pixel-Level Captioning
by: You, Zuyao, et al.
Published: (2025)
LAKE-RED: Camouflaged Images Generation by Latent Background Knowledge Retrieval-Augmented Diffusion
by: Zhao, Pancheng, et al.
Published: (2024)
FACE-net: Factual Calibration and Emotion Augmentation for Retrieval-enhanced Emotional Video Captioning
by: Chen, Weidong, et al.
Published: (2026)
Generative Giants, Retrieval Weaklings: Why do Multimodal Large Language Models Fail at Multimodal Retrieval?
by: Feng, Hengyi, et al.
Published: (2025)
KITTEN: A Knowledge-Intensive Evaluation of Image Generation on Visual Entities
by: Huang, Hsin-Ping, et al.
Published: (2024)