Library Catalog

Bibliographic Details
Main Authors:	Zhang, Huaying; Hashimoto, Atsushi; Hirasawa, Tosho
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2512.15006

Similar Items

COM Kitchens: An Unedited Overhead-view Video Dataset as a Vision-Language Benchmark
by: Maeda, Koki, et al.
Published: (2024)

SciPostGen: Bridging the Gap between Scientific Papers and Poster Layouts
by: Inadumi, Shun, et al.
Published: (2025)

AdaCoder: Adaptive Prompt Compression for Programmatic Visual Question Answering
by: Ukai, Mahiro, et al.
Published: (2024)

FIQ: Fundamental Question Generation with the Integration of Question Embeddings for Video Question Answering
by: Oh, Ju-Young, et al.
Published: (2025)

Assessing the Capabilities of LLMs in Humor: A Multi-dimensional Analysis of Oogiri Generation and Evaluation
by: Sakabe, Ritsu, et al.
Published: (2025)

VideoExplorer: Think With Videos For Agentic Long-Video Understanding
by: Yuan, Huaying, et al.
Published: (2025)

Task-Aware KV Compression For Cost-Effective Long Video Understanding
by: Qin, Minghao, et al.
Published: (2025)

Video-XL-2: Towards Very Long-Video Understanding Through Task-Aware KV Sparsification
by: Qin, Minghao, et al.
Published: (2025)

Knowledge-Intensive Video Generation
by: Wang, Chenxu, et al.
Published: (2026)

SBS Figures: Pre-training Figure QA from Stage-by-Stage Synthesized Images
by: Shinoda, Risa, et al.
Published: (2024)

A Knowledge Noise Mitigation Framework for Knowledge-based Visual Question Answering
by: Liu, Zhiyue, et al.
Published: (2025)

EvalVerse: Pipeline-Aware and Expert-Calibrated Benchmarking for Professional Cinematic Video Generation
by: Yang, Songlin, et al.
Published: (2026)

Knowledge Detection by Relevant Question and Image Attributes in Visual Question Answering
by: Ahir, Param, et al.
Published: (2023)

Weak-eval-Strong: Evaluating and Eliciting Lateral Thinking of LLMs with Situation Puzzles
by: Chen, Qi, et al.
Published: (2024)

YTCommentQA: Video Question Answerability in Instructional Videos
by: Yang, Saelyne, et al.
Published: (2024)

Evaluating Image Hallucination in Text-to-Image Generation with Question-Answering
by: Lim, Youngsun, et al.
Published: (2024)

Fine-Grained Knowledge Structuring and Retrieval for Visual Question Answering
by: Zhang, Zhengxuan, et al.
Published: (2025)

MomentSeeker: A Task-Oriented Benchmark For Long-Video Moment Retrieval
by: Yuan, Huaying, et al.
Published: (2025)

Eliciting In-Context Learning in Vision-Language Models for Videos Through Curated Data Distributional Properties
by: Yu, Keunwoo Peter, et al.
Published: (2023)

Product of Experts for Visual Generation
by: Zhang, Yunzhi, et al.
Published: (2025)

QEVA: A Reference-Free Evaluation Metric for Narrative Video Summarization with Multimodal Question Answering
by: Jung, Woojun, et al.
Published: (2026)

Learning Question-Aware Keyframe Selection with Synthetic Supervision for Video Question Answering
by: Kwon, Minchan, et al.
Published: (2026)

Multi-granularity Contrastive Cross-modal Collaborative Generation for End-to-End Long-term Video Question Answering
by: Yu, Ting, et al.
Published: (2024)

AEGIS: Authenticity Evaluation Benchmark for AI-Generated Video Sequences
by: Li, Jieyu, et al.
Published: (2025)

OmniGen: Unified Image Generation
by: Xiao, Shitao, et al.
Published: (2024)

Deep Expert Injection for Anchoring Retinal VLMs with Domain-Specific Knowledge
by: Lu, Shuai, et al.
Published: (2026)

VQA$^2$: Visual Question Answering for Video Quality Assessment
by: Jia, Ziheng, et al.
Published: (2024)

StaR-KVQA: Structured Reasoning Traces for Implicit-Knowledge Visual Question Answering
by: Wen, Zhihao, et al.
Published: (2025)

Commonsense Video Question Answering through Video-Grounded Entailment Tree Reasoning
by: Liu, Huabin, et al.
Published: (2025)

Enhancing Long Video Question Answering with Scene-Localized Frame Grouping
by: Yang, Xuyi, et al.
Published: (2025)

Knowledge Graphs of Driving Scenes to Empower the Emerging Capabilities of Neurosymbolic AI
by: Wickramarachchi, Ruwan, et al.
Published: (2024)

Ego-Grounding for Personalized Question-Answering in Egocentric Videos
by: Xiao, Junbin, et al.
Published: (2026)

Evaluation of Safety Cognition Capability in Vision-Language Models for Autonomous Driving
by: Zhang, Enming, et al.
Published: (2025)

ARMOR: Empowering Multimodal Understanding Model with Interleaved Multimodal Generation Capability
by: Sun, Jianwen, et al.
Published: (2025)

ME-Mamba: Multi-Expert Mamba with Efficient Knowledge Capture and Fusion for Multimodal Survival Analysis
by: Zhang, Chengsheng, et al.
Published: (2025)

HalDec-Bench: Benchmarking Hallucination Detector in Image Captioning
by: Saito, Kuniaki, et al.
Published: (2025)

HalDec-Bench: Benchmarking Hallucination Detector in Image Captioning
by: Saito, Kuniaki, et al.
Published: (2026)

MaS-VQA: A Mask-and-Select Framework for Knowledge-Based Visual Question Answering
by: Mao, Xianwei, et al.
Published: (2026)

Question-Answering Dense Video Events
by: Qin, Hangyu, et al.
Published: (2024)

VidPrism: Heterogeneous Mixture of Experts for Image-to-Video Transfer
by: Lin, Rui, et al.
Published: (2026)