| Main Authors: | Zhang, Huaying, Hashimoto, Atsushi, Hirasawa, Tosho |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2512.15006 |
Similar Items
COM Kitchens: An Unedited Overhead-view Video Dataset as a Vision-Language Benchmark
by: Maeda, Koki, et al.
Published: (2024)
SciPostGen: Bridging the Gap between Scientific Papers and Poster Layouts
by: Inadumi, Shun, et al.
Published: (2025)
AdaCoder: Adaptive Prompt Compression for Programmatic Visual Question Answering
by: Ukai, Mahiro, et al.
Published: (2024)
FIQ: Fundamental Question Generation with the Integration of Question Embeddings for Video Question Answering
by: Oh, Ju-Young, et al.
Published: (2025)
Assessing the Capabilities of LLMs in Humor: A Multi-dimensional Analysis of Oogiri Generation and Evaluation
by: Sakabe, Ritsu, et al.
Published: (2025)
VideoExplorer: Think With Videos For Agentic Long-Video Understanding
by: Yuan, Huaying, et al.
Published: (2025)
Task-Aware KV Compression For Cost-Effective Long Video Understanding
by: Qin, Minghao, et al.
Published: (2025)
Video-XL-2: Towards Very Long-Video Understanding Through Task-Aware KV Sparsification
by: Qin, Minghao, et al.
Published: (2025)
Knowledge-Intensive Video Generation
by: Wang, Chenxu, et al.
Published: (2026)
SBS Figures: Pre-training Figure QA from Stage-by-Stage Synthesized Images
by: Shinoda, Risa, et al.
Published: (2024)
A Knowledge Noise Mitigation Framework for Knowledge-based Visual Question Answering
by: Liu, Zhiyue, et al.
Published: (2025)
EvalVerse: Pipeline-Aware and Expert-Calibrated Benchmarking for Professional Cinematic Video Generation
by: Yang, Songlin, et al.
Published: (2026)
Knowledge Detection by Relevant Question and Image Attributes in Visual Question Answering
by: Ahir, Param, et al.
Published: (2023)
Weak-eval-Strong: Evaluating and Eliciting Lateral Thinking of LLMs with Situation Puzzles
by: Chen, Qi, et al.
Published: (2024)
YTCommentQA: Video Question Answerability in Instructional Videos
by: Yang, Saelyne, et al.
Published: (2024)
Evaluating Image Hallucination in Text-to-Image Generation with Question-Answering
by: Lim, Youngsun, et al.
Published: (2024)
Fine-Grained Knowledge Structuring and Retrieval for Visual Question Answering
by: Zhang, Zhengxuan, et al.
Published: (2025)
MomentSeeker: A Task-Oriented Benchmark For Long-Video Moment Retrieval
by: Yuan, Huaying, et al.
Published: (2025)
Eliciting In-Context Learning in Vision-Language Models for Videos Through Curated Data Distributional Properties
by: Yu, Keunwoo Peter, et al.
Published: (2023)
Product of Experts for Visual Generation
by: Zhang, Yunzhi, et al.
Published: (2025)
QEVA: A Reference-Free Evaluation Metric for Narrative Video Summarization with Multimodal Question Answering
by: Jung, Woojun, et al.
Published: (2026)
Learning Question-Aware Keyframe Selection with Synthetic Supervision for Video Question Answering
by: Kwon, Minchan, et al.
Published: (2026)
Multi-granularity Contrastive Cross-modal Collaborative Generation for End-to-End Long-term Video Question Answering
by: Yu, Ting, et al.
Published: (2024)
AEGIS: Authenticity Evaluation Benchmark for AI-Generated Video Sequences
by: Li, Jieyu, et al.
Published: (2025)
OmniGen: Unified Image Generation
by: Xiao, Shitao, et al.
Published: (2024)
Deep Expert Injection for Anchoring Retinal VLMs with Domain-Specific Knowledge
by: Lu, Shuai, et al.
Published: (2026)
VQA$^2$: Visual Question Answering for Video Quality Assessment
by: Jia, Ziheng, et al.
Published: (2024)
StaR-KVQA: Structured Reasoning Traces for Implicit-Knowledge Visual Question Answering
by: Wen, Zhihao, et al.
Published: (2025)
Commonsense Video Question Answering through Video-Grounded Entailment Tree Reasoning
by: Liu, Huabin, et al.
Published: (2025)
Enhancing Long Video Question Answering with Scene-Localized Frame Grouping
by: Yang, Xuyi, et al.
Published: (2025)
Knowledge Graphs of Driving Scenes to Empower the Emerging Capabilities of Neurosymbolic AI
by: Wickramarachchi, Ruwan, et al.
Published: (2024)
Ego-Grounding for Personalized Question-Answering in Egocentric Videos
by: Xiao, Junbin, et al.
Published: (2026)
Evaluation of Safety Cognition Capability in Vision-Language Models for Autonomous Driving
by: Zhang, Enming, et al.
Published: (2025)
ARMOR: Empowering Multimodal Understanding Model with Interleaved Multimodal Generation Capability
by: Sun, Jianwen, et al.
Published: (2025)
ME-Mamba: Multi-Expert Mamba with Efficient Knowledge Capture and Fusion for Multimodal Survival Analysis
by: Zhang, Chengsheng, et al.
Published: (2025)
HalDec-Bench: Benchmarking Hallucination Detector in Image Captioning
by: Saito, Kuniaki, et al.
Published: (2025)
HalDec-Bench: Benchmarking Hallucination Detector in Image Captioning
by: Saito, Kuniaki, et al.
Published: (2026)
MaS-VQA: A Mask-and-Select Framework for Knowledge-Based Visual Question Answering
by: Mao, Xianwei, et al.
Published: (2026)
Question-Answering Dense Video Events
by: Qin, Hangyu, et al.
Published: (2024)
VidPrism: Heterogeneous Mixture of Experts for Image-to-Video Transfer
by: Lin, Rui, et al.
Published: (2026)