:: Library Catalog

Bibliographic Details
Main Authors:	Tu, Yunbin; Li, Liang; Su, Li; Huang, Qingming
Format:	Preprint
Published:	2024
Subjects:	Computer Vision and Pattern Recognition; Artificial Intelligence; Computation and Language
Online Access:	https://arxiv.org/abs/2412.13543

Similar Items

Distractors-Immune Representation Learning with Cross-modal Contrastive Regularization for Change Captioning
by: Tu, Yunbin, et al.
Published: (2024)

Context-aware Difference Distilling for Multi-change Captioning
by: Tu, Yunbin, et al.
Published: (2024)

Text-only Synthesis for Image Captioning
by: Zhou, Qing, et al.
Published: (2024)

ViPCap: Retrieval Text-Based Visual Prompts for Lightweight Image Captioning
by: Kim, Taewhan, et al.
Published: (2024)

Towards Retrieval-Augmented Architectures for Image Captioning
by: Sarto, Sara, et al.
Published: (2024)

FigEx2: Visual-Conditioned Panel Detection and Captioning for Scientific Compound Figures
by: Song, Jifeng, et al.
Published: (2026)

From Pixels to Posts: Retrieval-Augmented Fashion Captioning and Hashtag Generation
by: Gondal, Moazzam Umer, et al.
Published: (2025)

CapGeo: A Caption-Assisted Approach to Geometric Reasoning
by: Li, Yuying, et al.
Published: (2025)

VCap: Hypergeometric Rewards for Weak-to-Strong Visual Captioning
by: Lu, Xingyu, et al.
Published: (2026)

UI-Vision: A Desktop-centric GUI Benchmark for Visual Perception and Interaction
by: Nayak, Shravan, et al.
Published: (2025)

Mask-aware Text-to-Image Retrieval: Referring Expression Segmentation Meets Cross-modal Retrieval
by: Shen, Li-Cheng, et al.
Published: (2025)

VoCoT: Unleashing Visually Grounded Multi-Step Reasoning in Large Multi-Modal Models
by: Li, Zejun, et al.
Published: (2024)

LAViTeR: Learning Aligned Visual and Textual Representations Assisted by Image and Caption Generation
by: Hashemi, Mohammad Abuzar, et al.
Published: (2021)

The Role of Data Curation in Image Captioning
by: Li, Wenyan, et al.
Published: (2023)

A Survey of Multimodal Large Language Model from A Data-centric Perspective
by: Bai, Tianyi, et al.
Published: (2024)

Region-R1: Reinforcing Query-Side Region Cropping for Multi-Modal Re-Ranking
by: Hu, Chan-Wei, et al.
Published: (2026)

Unsupervised Memorability Modeling from Tip-of-the-Tongue Retrieval Queries
by: Bhattacharyya, Sree, et al.
Published: (2025)

CapRL: Stimulating Dense Image Caption Capabilities via Reinforcement Learning
by: Xing, Long, et al.
Published: (2025)

QA-Dragon: Query-Aware Dynamic RAG System for Knowledge-Intensive Visual Question Answering
by: Jiang, Zhuohang, et al.
Published: (2025)

CLiViS: Unleashing Cognitive Map through Linguistic-Visual Synergy for Embodied Visual Reasoning
by: Li, Kailing, et al.
Published: (2025)

MVMR: A New Framework for Evaluating Faithfulness of Video Moment Retrieval against Multiple Distractors
by: Yang, Nakyeong, et al.
Published: (2023)

BRIDGE: Bridging Gaps in Image Captioning Evaluation with Stronger Visual Cues
by: Sarto, Sara, et al.
Published: (2024)

Retrieval-Augmented Long-Context Translation for Cultural Image Captioning: Gators submission for AmericasNLP 2026 shared task
by: Dhawan, Aashish, et al.
Published: (2026)

HNC: Leveraging Hard Negative Captions towards Models with Fine-Grained Visual-Linguistic Comprehension Capabilities
by: Dönmez, Esra, et al.
Published: (2026)

Stepping Stones: A Progressive Training Strategy for Audio-Visual Semantic Segmentation
by: Ma, Juncheng, et al.
Published: (2024)

SMART: Shot-Aware Multimodal Video Moment Retrieval with Audio-Enhanced MLLM
by: Yu, An, et al.
Published: (2025)

Audio-centric Video Understanding Benchmark without Text Shortcut
by: Yang, Yudong, et al.
Published: (2025)

Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step
by: Guo, Ziyu, et al.
Published: (2025)

Unveiling the Invisible: Captioning Videos with Metaphors
by: Kalarani, Abisek Rajakumar, et al.
Published: (2024)

WsiCaption: Multiple Instance Generation of Pathology Reports for Gigapixel Whole-Slide Images
by: Chen, Pingyi, et al.
Published: (2023)

ChestX-Reasoner: Advancing Radiology Foundation Models with Reasoning through Step-by-Step Verification
by: Fan, Ziqing, et al.
Published: (2025)

Retrieving Counterfactuals Improves Visual In-Context Learning
by: Xiong, Guangzhi, et al.
Published: (2026)

Updating CLIP to Prefer Descriptions Over Captions
by: Zur, Amir, et al.
Published: (2024)

Daily-Omni: Towards Audio-Visual Reasoning with Temporal Alignment across Modalities
by: Zhou, Ziwei, et al.
Published: (2025)

Do Large Multimodal Models Solve Caption Generation for Scientific Figures? Lessons Learned from SciCap Challenge 2023
by: Hsu, Ting-Yao E., et al.
Published: (2025)

AdaCodec: A Predictive Visual Code for Video MLLMs
by: Hou, Haowen, et al.
Published: (2026)

IFCap: Image-like Retrieval and Frequency-based Entity Filtering for Zero-shot Captioning
by: Lee, Soeun, et al.
Published: (2024)

LaMP-Cap: Personalized Figure Caption Generation With Multimodal Figure Profiles
by: Ng, Ho Yin 'Sam', et al.
Published: (2025)

CAPEEN: Image Captioning with Early Exits and Knowledge Distillation
by: Bajpai, Divya Jyoti, et al.
Published: (2024)

ChartCap: Mitigating Hallucination of Dense Chart Captioning
by: Lim, Junyoung, et al.
Published: (2025)