| Main Authors: | Li, Li, Peng, Yingzhe, Yang, Xu, Cheng, Ruoxi, Xu, Haiyang, Yan, Ming, Huang, Fei |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2507.08710 |
Similar Items
Adaptively Clustering Neighbor Elements for Image-Text Generation
by: Wang, Zihua, et al.
Published: (2023)
HICEScore: A Hierarchical Metric for Image Captioning Evaluation
by: Zeng, Zequn, et al.
Published: (2024)
VCapsBench: A Large-scale Fine-grained Benchmark for Video Caption Quality Evaluation
by: Zhang, Shi-Xue, et al.
Published: (2025)
MIBench: Evaluating Multimodal Large Language Models over Multiple Images
by: Liu, Haowei, et al.
Published: (2024)
TinyChart: Efficient Chart Understanding with Visual Token Merging and Program-of-Thoughts Learning
by: Zhang, Liang, et al.
Published: (2024)
Evaluating Remote Sensing Image Captions Beyond Metric Biases
by: Chen, Ziyun, et al.
Published: (2026)
Efficient and Effective In-context Demonstration Selection with Coreset
by: Wang, Zihua, et al.
Published: (2025)
DEVICE: Depth and Visual Concepts Aware Transformer for OCR-based Image Captioning
by: Xu, Dongsheng, et al.
Published: (2023)
CaptionQA: Is Your Caption as Useful as the Image Itself?
by: Yang, Shijia, et al.
Published: (2025)
ICCV23 Visual-Dialog Emotion Explanation Challenge: SEU_309 Team Technical Report
by: Yuan, Yixiao, et al.
Published: (2024)
Mobile-Agent-V: A Video-Guided Approach for Effortless and Efficient Operational Knowledge Injection in Mobile Automation
by: Wang, Junyang, et al.
Published: (2025)
SimInversion: A Simple Framework for Inversion-Based Text-to-Image Editing
by: Qian, Qi, et al.
Published: (2024)
G-VEval: A Versatile Metric for Evaluating Image and Video Captions Using GPT-4o
by: Tong, Tony Cheng, et al.
Published: (2024)
BUS: Efficient and Effective Vision-language Pre-training with Bottom-Up Patch Summarization
by: Jiang, Chaoya, et al.
Published: (2023)
Cockatiel: Ensembling Synthetic and Human Preferenced Training for Detailed Video Caption
by: Qin, Luozheng, et al.
Published: (2025)
Do Vision Encoders Truly Explain Object Hallucination?: Mitigating Object Hallucination via Simple Fine-Grained CLIPScore
by: Oh, Hongseok, et al.
Published: (2025)
STAMP: Training Explicit Memory for Mobile GUI Agents in Controllable and Scalable Virtual Environments
by: Wang, Junyang, et al.
Published: (2026)
Efficient Vision-and-Language Pre-training with Text-Relevant Image Patch Selection
by: Ye, Wei, et al.
Published: (2024)
A Conformal Risk Control Framework for Granular Word Assessment and Uncertainty Calibration of CLIPScore Quality Estimates
by: Gomes, Gonçalo, et al.
Published: (2025)
Unifying Latent and Lexicon Representations for Effective Video-Text Retrieval
by: Liu, Haowei, et al.
Published: (2024)
Semantics-enhanced Cross-modal Masked Image Modeling for Vision-Language Pre-training
by: Liu, Haowei, et al.
Published: (2024)
Mobile-Agent-E: Self-Evolving Mobile Assistant for Complex Tasks
by: Wang, Zhenhailong, et al.
Published: (2025)
Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception
by: Wang, Junyang, et al.
Published: (2024)
ProxyImg: Towards Highly-Controllable Image Representation via Hierarchical Disentangled Proxy Embedding
by: Chen, Ye, et al.
Published: (2026)
OmniCaptioner: One Captioner to Rule Them All
by: Lu, Yiting, et al.
Published: (2025)
TOD3Cap: Towards 3D Dense Captioning in Outdoor Scenes
by: Jin, Bu, et al.
Published: (2024)
Precision or Recall? An Analysis of Image Captions for Training Text-to-Image Generation Model
by: Cheng, Sheng, et al.
Published: (2024)
BTCChat: Advancing Remote Sensing Bi-temporal Change Captioning with Multimodal Large Language Model
by: Li, Yujie, et al.
Published: (2025)
Dual-frequency Selected Knowledge Distillation with Statistical-based Sample Rectification for PolSAR Image Classification
by: Xin, Xinyue, et al.
Published: (2025)
AeroLite: Tag-Guided Lightweight Generation of Aerial Image Captions
by: Zi, Xing, et al.
Published: (2025)
Context-aware Difference Distilling for Multi-change Captioning
by: Tu, Yunbin, et al.
Published: (2024)
SymDPO: Boosting In-Context Learning of Large Multimodal Models with Symbol Demonstration Direct Preference Optimization
by: Jia, Hongrui, et al.
Published: (2024)
mPLUG-DocOwl2: High-resolution Compressing for OCR-free Multi-page Document Understanding
by: Hu, Anwen, et al.
Published: (2024)
Set Prediction Guided by Semantic Concepts for Diverse Video Captioning
by: Lu, Yifan, et al.
Published: (2023)
Scene Graph-guided SegCaptioning Transformer with Fine-grained Alignment for Controllable Video Segmentation and Captioning
by: Zhang, Xu, et al.
Published: (2026)
EXPERT: An Explainable Image Captioning Evaluation Metric with Structured Explanations
by: Kim, Hyunjong, et al.
Published: (2025)
Aesthetic Image Captioning with Saliency Enhanced MLLMs
by: Tao, Yilin, et al.
Published: (2025)
CG-MLLM: Captioning and Generating 3D content via Multi-modal Large Language Models
by: Huang, Junming, et al.
Published: (2026)
Distractors-Immune Representation Learning with Cross-modal Contrastive Regularization for Change Captioning
by: Tu, Yunbin, et al.
Published: (2024)
Lever LM: Configuring In-Context Sequence to Lever Large Vision Language Models
by: Yang, Xu, et al.
Published: (2023)