:: Library Catalog

Bibliographic Details
Main Authors:	Li, Li; Peng, Yingzhe; Yang, Xu; Cheng, Ruoxi; Xu, Haiyang; Yan, Ming; Huang, Fei
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2507.08710

Similar Items

Adaptively Clustering Neighbor Elements for Image-Text Generation
by: Wang, Zihua, et al.
Published: (2023)

HICEScore: A Hierarchical Metric for Image Captioning Evaluation
by: Zeng, Zequn, et al.
Published: (2024)

VCapsBench: A Large-scale Fine-grained Benchmark for Video Caption Quality Evaluation
by: Zhang, Shi-Xue, et al.
Published: (2025)

MIBench: Evaluating Multimodal Large Language Models over Multiple Images
by: Liu, Haowei, et al.
Published: (2024)

TinyChart: Efficient Chart Understanding with Visual Token Merging and Program-of-Thoughts Learning
by: Zhang, Liang, et al.
Published: (2024)

Evaluating Remote Sensing Image Captions Beyond Metric Biases
by: Chen, Ziyun, et al.
Published: (2026)

Efficient and Effective In-context Demonstration Selection with Coreset
by: Wang, Zihua, et al.
Published: (2025)

DEVICE: Depth and Visual Concepts Aware Transformer for OCR-based Image Captioning
by: Xu, Dongsheng, et al.
Published: (2023)

CaptionQA: Is Your Caption as Useful as the Image Itself?
by: Yang, Shijia, et al.
Published: (2025)

ICCV23 Visual-Dialog Emotion Explanation Challenge: SEU_309 Team Technical Report
by: Yuan, Yixiao, et al.
Published: (2024)

Mobile-Agent-V: A Video-Guided Approach for Effortless and Efficient Operational Knowledge Injection in Mobile Automation
by: Wang, Junyang, et al.
Published: (2025)

SimInversion: A Simple Framework for Inversion-Based Text-to-Image Editing
by: Qian, Qi, et al.
Published: (2024)

G-VEval: A Versatile Metric for Evaluating Image and Video Captions Using GPT-4o
by: Tong, Tony Cheng, et al.
Published: (2024)

BUS: Efficient and Effective Vision-language Pre-training with Bottom-Up Patch Summarization
by: Jiang, Chaoya, et al.
Published: (2023)

Cockatiel: Ensembling Synthetic and Human Preferenced Training for Detailed Video Caption
by: Qin, Luozheng, et al.
Published: (2025)

Do Vision Encoders Truly Explain Object Hallucination?: Mitigating Object Hallucination via Simple Fine-Grained CLIPScore
by: Oh, Hongseok, et al.
Published: (2025)

STAMP: Training Explicit Memory for Mobile GUI Agents in Controllable and Scalable Virtual Environments
by: Wang, Junyang, et al.
Published: (2026)

Efficient Vision-and-Language Pre-training with Text-Relevant Image Patch Selection
by: Ye, Wei, et al.
Published: (2024)

A Conformal Risk Control Framework for Granular Word Assessment and Uncertainty Calibration of CLIPScore Quality Estimates
by: Gomes, Gonçalo, et al.
Published: (2025)

Unifying Latent and Lexicon Representations for Effective Video-Text Retrieval
by: Liu, Haowei, et al.
Published: (2024)

Semantics-enhanced Cross-modal Masked Image Modeling for Vision-Language Pre-training
by: Liu, Haowei, et al.
Published: (2024)

Mobile-Agent-E: Self-Evolving Mobile Assistant for Complex Tasks
by: Wang, Zhenhailong, et al.
Published: (2025)

Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception
by: Wang, Junyang, et al.
Published: (2024)

ProxyImg: Towards Highly-Controllable Image Representation via Hierarchical Disentangled Proxy Embedding
by: Chen, Ye, et al.
Published: (2026)

OmniCaptioner: One Captioner to Rule Them All
by: Lu, Yiting, et al.
Published: (2025)

TOD3Cap: Towards 3D Dense Captioning in Outdoor Scenes
by: Jin, Bu, et al.
Published: (2024)

Precision or Recall? An Analysis of Image Captions for Training Text-to-Image Generation Model
by: Cheng, Sheng, et al.
Published: (2024)

BTCChat: Advancing Remote Sensing Bi-temporal Change Captioning with Multimodal Large Language Model
by: Li, Yujie, et al.
Published: (2025)

Dual-frequency Selected Knowledge Distillation with Statistical-based Sample Rectification for PolSAR Image Classification
by: Xin, Xinyue, et al.
Published: (2025)

AeroLite: Tag-Guided Lightweight Generation of Aerial Image Captions
by: Zi, Xing, et al.
Published: (2025)

Context-aware Difference Distilling for Multi-change Captioning
by: Tu, Yunbin, et al.
Published: (2024)

SymDPO: Boosting In-Context Learning of Large Multimodal Models with Symbol Demonstration Direct Preference Optimization
by: Jia, Hongrui, et al.
Published: (2024)

mPLUG-DocOwl2: High-resolution Compressing for OCR-free Multi-page Document Understanding
by: Hu, Anwen, et al.
Published: (2024)

Set Prediction Guided by Semantic Concepts for Diverse Video Captioning
by: Lu, Yifan, et al.
Published: (2023)

Scene Graph-guided SegCaptioning Transformer with Fine-grained Alignment for Controllable Video Segmentation and Captioning
by: Zhang, Xu, et al.
Published: (2026)

EXPERT: An Explainable Image Captioning Evaluation Metric with Structured Explanations
by: Kim, Hyunjong, et al.
Published: (2025)

Aesthetic Image Captioning with Saliency Enhanced MLLMs
by: Tao, Yilin, et al.
Published: (2025)

CG-MLLM: Captioning and Generating 3D content via Multi-modal Large Language Models
by: Huang, Junming, et al.
Published: (2026)

Distractors-Immune Representation Learning with Cross-modal Contrastive Regularization for Change Captioning
by: Tu, Yunbin, et al.
Published: (2024)

Lever LM: Configuring In-Context Sequence to Lever Large Vision Language Models
by: Yang, Xu, et al.
Published: (2023)