| Main Authors: | Tu, Yunbin; Li, Liang; Su, Li; Huang, Qingming |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2412.13543 |
Similar Items
Distractors-Immune Representation Learning with Cross-modal Contrastive Regularization for Change Captioning
by: Tu, Yunbin, et al.
Published: (2024)
Context-aware Difference Distilling for Multi-change Captioning
by: Tu, Yunbin, et al.
Published: (2024)
Text-only Synthesis for Image Captioning
by: Zhou, Qing, et al.
Published: (2024)
ViPCap: Retrieval Text-Based Visual Prompts for Lightweight Image Captioning
by: Kim, Taewhan, et al.
Published: (2024)
Towards Retrieval-Augmented Architectures for Image Captioning
by: Sarto, Sara, et al.
Published: (2024)
FigEx2: Visual-Conditioned Panel Detection and Captioning for Scientific Compound Figures
by: Song, Jifeng, et al.
Published: (2026)
From Pixels to Posts: Retrieval-Augmented Fashion Captioning and Hashtag Generation
by: Gondal, Moazzam Umer, et al.
Published: (2025)
CapGeo: A Caption-Assisted Approach to Geometric Reasoning
by: Li, Yuying, et al.
Published: (2025)
VCap: Hypergeometric Rewards for Weak-to-Strong Visual Captioning
by: Lu, Xingyu, et al.
Published: (2026)
UI-Vision: A Desktop-centric GUI Benchmark for Visual Perception and Interaction
by: Nayak, Shravan, et al.
Published: (2025)
Mask-aware Text-to-Image Retrieval: Referring Expression Segmentation Meets Cross-modal Retrieval
by: Shen, Li-Cheng, et al.
Published: (2025)
VoCoT: Unleashing Visually Grounded Multi-Step Reasoning in Large Multi-Modal Models
by: Li, Zejun, et al.
Published: (2024)
LAViTeR: Learning Aligned Visual and Textual Representations Assisted by Image and Caption Generation
by: Hashemi, Mohammad Abuzar, et al.
Published: (2021)
The Role of Data Curation in Image Captioning
by: Li, Wenyan, et al.
Published: (2023)
A Survey of Multimodal Large Language Model from A Data-centric Perspective
by: Bai, Tianyi, et al.
Published: (2024)
Region-R1: Reinforcing Query-Side Region Cropping for Multi-Modal Re-Ranking
by: Hu, Chan-Wei, et al.
Published: (2026)
Unsupervised Memorability Modeling from Tip-of-the-Tongue Retrieval Queries
by: Bhattacharyya, Sree, et al.
Published: (2025)
CapRL: Stimulating Dense Image Caption Capabilities via Reinforcement Learning
by: Xing, Long, et al.
Published: (2025)
QA-Dragon: Query-Aware Dynamic RAG System for Knowledge-Intensive Visual Question Answering
by: Jiang, Zhuohang, et al.
Published: (2025)
CLiViS: Unleashing Cognitive Map through Linguistic-Visual Synergy for Embodied Visual Reasoning
by: Li, Kailing, et al.
Published: (2025)
MVMR: A New Framework for Evaluating Faithfulness of Video Moment Retrieval against Multiple Distractors
by: Yang, Nakyeong, et al.
Published: (2023)
BRIDGE: Bridging Gaps in Image Captioning Evaluation with Stronger Visual Cues
by: Sarto, Sara, et al.
Published: (2024)
Retrieval-Augmented Long-Context Translation for Cultural Image Captioning: Gators submission for AmericasNLP 2026 shared task
by: Dhawan, Aashish, et al.
Published: (2026)
HNC: Leveraging Hard Negative Captions towards Models with Fine-Grained Visual-Linguistic Comprehension Capabilities
by: Dönmez, Esra, et al.
Published: (2026)
Stepping Stones: A Progressive Training Strategy for Audio-Visual Semantic Segmentation
by: Ma, Juncheng, et al.
Published: (2024)
SMART: Shot-Aware Multimodal Video Moment Retrieval with Audio-Enhanced MLLM
by: Yu, An, et al.
Published: (2025)
Audio-centric Video Understanding Benchmark without Text Shortcut
by: Yang, Yudong, et al.
Published: (2025)
Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step
by: Guo, Ziyu, et al.
Published: (2025)
Unveiling the Invisible: Captioning Videos with Metaphors
by: Kalarani, Abisek Rajakumar, et al.
Published: (2024)
WsiCaption: Multiple Instance Generation of Pathology Reports for Gigapixel Whole-Slide Images
by: Chen, Pingyi, et al.
Published: (2023)
ChestX-Reasoner: Advancing Radiology Foundation Models with Reasoning through Step-by-Step Verification
by: Fan, Ziqing, et al.
Published: (2025)
Retrieving Counterfactuals Improves Visual In-Context Learning
by: Xiong, Guangzhi, et al.
Published: (2026)
Updating CLIP to Prefer Descriptions Over Captions
by: Zur, Amir, et al.
Published: (2024)
Daily-Omni: Towards Audio-Visual Reasoning with Temporal Alignment across Modalities
by: Zhou, Ziwei, et al.
Published: (2025)
Do Large Multimodal Models Solve Caption Generation for Scientific Figures? Lessons Learned from SciCap Challenge 2023
by: Hsu, Ting-Yao E., et al.
Published: (2025)
AdaCodec: A Predictive Visual Code for Video MLLMs
by: Hou, Haowen, et al.
Published: (2026)
IFCap: Image-like Retrieval and Frequency-based Entity Filtering for Zero-shot Captioning
by: Lee, Soeun, et al.
Published: (2024)
LaMP-Cap: Personalized Figure Caption Generation With Multimodal Figure Profiles
by: Ng, Ho Yin 'Sam', et al.
Published: (2025)
CAPEEN: Image Captioning with Early Exits and Knowledge Distillation
by: Bajpai, Divya Jyoti, et al.
Published: (2024)
ChartCap: Mitigating Hallucination of Dense Chart Captioning
by: Lim, Junyoung, et al.
Published: (2025)