:: Library Catalog

Bibliographic Details
Main Authors:	Dipta, Shubhashis Roy; Wu, Tz-Ying; Tripathi, Subarna
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition; Computation and Language
Online Access:	https://arxiv.org/abs/2509.16538

Similar Items

Q2E: Query-to-Event Decomposition for Zero-Shot Multilingual Text-to-Video Retrieval
by: Dipta, Shubhashis Roy, et al.
Published: (2025)

Harnessing Object Grounding for Time-Sensitive Video Understanding
by: Wu, Tz-Ying, et al.
Published: (2025)

Toward Scalable Video Narration: A Training-free Approach Using Multimodal Large Language Models
by: Wu, Tz-Ying, et al.
Published: (2025)

Ego-VPA: Egocentric Video Understanding with Parameter-efficient Adaptation
by: Wu, Tz-Ying, et al.
Published: (2024)

Search2Motion: Training-Free Object-Level Motion Control via Attention-Consensus Search
by: Liu, Sainan, et al.
Published: (2026)

EASG-Bench: Video Q&A Benchmark with Egocentric Action Scene Graphs
by: Rodin, Ivan, et al.
Published: (2025)

VC4VG: Optimizing Video Captions for Text-to-Video Generation
by: Du, Yang, et al.
Published: (2025)

Many Dialects, Many Languages, One Cultural Lens: Evaluating Multilingual VLMs for Bengali Culture Understanding Across Historically Linked Languages and Regional Dialects
by: Sayeedi, Nurul Labib, et al.
Published: (2026)

VideoSAGE: Video Summarization with Graph Representation Learning
by: Chaves, Jose M. Rojas, et al.
Published: (2024)

SViTT-Ego: A Sparse Video-Text Transformer for Egocentric Video
by: Valdez, Hector A., et al.
Published: (2024)

Grounded Visual Factualization: Factual Anchor-Based Finetuning for Enhancing MLLM Factual Consistency
by: Morbiato, Filippo, et al.
Published: (2025)

Contrastive Language Video Time Pre-training
by: Liu, Hengyue, et al.
Published: (2024)

Can VLMs Recall Factual Associations From Visual References?
by: Ashok, Dhananjay, et al.
Published: (2025)

An Examination of the Robustness of Reference-Free Image Captioning Evaluation Metrics
by: Ahmadi, Saba, et al.
Published: (2023)

Surveying the Landscape of Image Captioning Evaluation: A Comprehensive Taxonomy, Trends and Metrics Analysis
by: Berger, Uri, et al.
Published: (2024)

FLEUR: An Explainable Reference-Free Evaluation Metric for Image Captioning Using a Large Multimodal Model
by: Lee, Yebin, et al.
Published: (2024)

Unveiling the Invisible: Captioning Videos with Metaphors
by: Kalarani, Abisek Rajakumar, et al.
Published: (2024)

OmniCaptioner: One Captioner to Rule Them All
by: Lu, Yiting, et al.
Published: (2025)

If We May De-Presuppose: Robustly Verifying Claims through Presupposition-Free Question Decomposition
by: Dipta, Shubhashis Roy, et al.
Published: (2025)

Figuring out Figures: Using Textual References to Caption Scientific Figures
by: Cao, Stanley, et al.
Published: (2024)

PALADIN: Robust Neural Fingerprinting for Text-to-Image Diffusion Models
by: L, Murthy, et al.
Published: (2025)

Exo2EgoDVC: Dense Video Captioning of Egocentric Procedural Activities Using Web Instructional Videos
by: Ohkawa, Takehiko, et al.
Published: (2023)

A Video is Worth 10,000 Words: Training and Benchmarking with Diverse Captions for Better Long Video Retrieval
by: Gwilliam, Matthew, et al.
Published: (2023)

DiffVC: A Non-autoregressive Framework Based on Diffusion Model for Video Captioning
by: Wang, Junbo, et al.
Published: (2026)

SPECS: Specificity-Enhanced CLIP-Score for Long Image Caption Evaluation
by: Chen, Xiaofu, et al.
Published: (2025)

"See the World, Discover Knowledge": A Chinese Factuality Evaluation for Large Vision Language Models
by: Gu, Jihao, et al.
Published: (2025)

LLMVA-GEBC: Large Language Model with Video Adapter for Generic Event Boundary Captioning
by: Tang, Yolo Yunlong, et al.
Published: (2023)

ProTeCt: Prompt Tuning for Taxonomic Open Set Classification
by: Wu, Tz-Ying, et al.
Published: (2023)

G-VEval: A Versatile Metric for Evaluating Image and Video Captions Using GPT-4o
by: Tong, Tony Cheng, et al.
Published: (2024)

Whats in a Video: Factorized Autoregressive Decoding for Online Dense Video Captioning
by: Piergiovanni, AJ, et al.
Published: (2024)

ByDeWay: Boost Your multimodal LLM with DEpth prompting in a Training-Free Way
by: Roy, Rajarshi, et al.
Published: (2025)

Distinctive Image Captioning: Leveraging Ground Truth Captions in CLIP Guided Reinforcement Learning
by: Chaffin, Antoine, et al.
Published: (2024)

Precision or Recall? An Analysis of Image Captions for Training Text-to-Image Generation Model
by: Cheng, Sheng, et al.
Published: (2024)

Video Summarization: Towards Entity-Aware Captions
by: Ayyubi, Hammad A., et al.
Published: (2023)

PromptGuard at BLP-2025 Task 1: A Few-Shot Classification Framework Using Majority Voting and Keyword Similarity for Bengali Hate Speech Detection
by: Hossan, Rakib, et al.
Published: (2025)

Wolf: Dense Video Captioning with a World Summarization Framework
by: Li, Boyi, et al.
Published: (2024)

SST-EM: Advanced Metrics for Evaluating Semantic, Spatial and Temporal Aspects in Video Editing
by: Biyyala, Varun, et al.
Published: (2025)

Unblocking Fine-Grained Evaluation of Detailed Captions: An Explaining AutoRater and Critic-and-Revise Pipeline
by: Gordon, Brian, et al.
Published: (2025)

LLM-Bootstrapped Targeted Finding Guidance for Factual MLLM-based Medical Report Generation
by: Yang, Cunyuan, et al.
Published: (2026)

FinCap: Topic-Aligned Captions for Short-Form Financial YouTube Videos
by: Sukhani, Siddhant, et al.
Published: (2025)