| Main Authors: | Mao, Yuanyuan, Lin, Xin, Ni, Qin, He, Liang |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2402.07402 |
Similar Items
Question-Answering Dense Video Events
by: Qin, Hangyu, et al.
Published: (2024)
Memory-Centric Embodied Question Answering
by: Zhai, Mingliang, et al.
Published: (2025)
VidCtx: Context-aware Video Question Answering with Image Models
by: Goulas, Andreas, et al.
Published: (2024)
Can I Trust Your Answer? Visually Grounded Video Question Answering
by: Xiao, Junbin, et al.
Published: (2023)
POLYCHARTQA: Benchmarking Large Vision-Language Models with Multilingual Chart Question Answering
by: Xu, Yichen, et al.
Published: (2025)
MuKV: Multi-Grained KV Cache Compression for Long Streaming Video Question-Answering
by: Xiao, Junbin, et al.
Published: (2026)
CMMU: A Benchmark for Chinese Multi-modal Multi-type Question Understanding and Reasoning
by: He, Zheqi, et al.
Published: (2024)
Multi-Domain Audio Question Answering Benchmark Toward Acoustic Content Reasoning
by: Yang, Chao-Han Huck, et al.
Published: (2025)
ReAG: Reasoning-Augmented Generation for Knowledge-based Visual Question Answering
by: Compagnoni, Alberto, et al.
Published: (2025)
Navigating the Mirage: A Dual-Path Agentic Framework for Robust Misleading Chart Question Answering
by: Zhang, Yanjie, et al.
Published: (2026)
Self-Adaptive Sampling for Efficient Video Question-Answering on Image–Text Models
by: Han, Wei, et al.
Published: (2023)
Exposing Cross-Modal Consistency for Fake News Detection in Short-Form Videos
by: Tian, Chong, et al.
Published: (2026)
Why We Feel: Breaking Boundaries in Emotional Reasoning with Multimodal Large Language Models
by: Lin, Yuxiang, et al.
Published: (2025)
Mitigating Easy Option Bias in Multiple-Choice Question Answering
by: Zhang, Hao, et al.
Published: (2025)
Attention of a Kiss: Exploring Attention Maps in Video Diffusion for XAIxArts
by: Cole, Adam, et al.
Published: (2025)
Learning Trimodal Relation for Audio-Visual Question Answering with Missing Modality
by: Park, Kyu Ri, et al.
Published: (2024)
AdaCoder: Adaptive Prompt Compression for Programmatic Visual Question Answering
by: Ukai, Mahiro, et al.
Published: (2024)
Towards Automatic Soccer Commentary Generation with Knowledge-Enhanced Visual Reasoning
by: Jin, Zeyu, et al.
Published: (2026)
Emotion-LLaMA: Multimodal Emotion Recognition and Reasoning with Instruction Tuning
by: Cheng, Zebang, et al.
Published: (2024)
Solving Copyright Infringement on Short Video Platforms: Novel Datasets and an Audio Restoration Deep Learning Pipeline
by: Oh, Minwoo, et al.
Published: (2025)
Boosting Audio Visual Question Answering via Key Semantic-Aware Cues
by: Li, Guangyao, et al.
Published: (2024)
SHMamba: Structured Hyperbolic State Space Model for Audio-Visual Question Answering
by: Yang, Zhe, et al.
Published: (2024)
SEER: Semantic Enhancement and Emotional Reasoning Network for Multimodal Fake News Detection
by: Zhu, Peican, et al.
Published: (2025)
Official-NV: An LLM-Generated News Video Dataset for Multimodal Fake News Detection
by: Wang, Yihao, et al.
Published: (2024)
Plasticity-Aware Mixture of Experts for Learning Under QoE Shifts in Adaptive Video Streaming
by: He, Zhiqiang, et al.
Published: (2025)
MindFuse: Towards GenAI Explainability in Marketing Strategy Co-Creation
by: Farseev, Aleksandr, et al.
Published: (2025)
PediatricsMQA: a Multi-modal Pediatrics Question Answering Benchmark
by: Bahaj, Adil, et al.
Published: (2025)
Has Multimodal Learning Delivered Universal Intelligence in Healthcare? A Comprehensive Survey
by: Lin, Qika, et al.
Published: (2024)
EgoEsportsQA: An Egocentric Video Benchmark for Perception and Reasoning in Esports
by: Ma, Jianzhe, et al.
Published: (2026)
Adaptive Multi-Agent Reasoning for Text-to-Video Retrieval
by: Wu, Jiaxin, et al.
Published: (2025)
VRBench: A Benchmark for Multi-Step Reasoning in Long Narrative Videos
by: Yu, Jiashuo, et al.
Published: (2025)
HiQuE: Hierarchical Question Embedding Network for Multimodal Depression Detection
by: Jung, Juho, et al.
Published: (2024)
A New Dataset and Benchmark for Grounding Multimodal Misinformation
by: Yang, Bingjian, et al.
Published: (2025)
VideoForest: Person-Anchored Hierarchical Reasoning for Cross-Video Question Answering
by: Meng, Yiran, et al.
Published: (2025)
QMAVIS: Long Video-Audio Understanding using Fusion of Large Multimodal Models
by: Lin, Zixing, et al.
Published: (2026)
FineFake: A Knowledge-Enriched Dataset for Fine-Grained Multi-Domain Fake News Detection
by: Zhou, Ziyi, et al.
Published: (2024)
Semantic-Guided Unsupervised Video Summarization
by: Liu, Haizhou, et al.
Published: (2026)
Augmenting Multimodal LLMs with Self-Reflective Tokens for Knowledge-based Visual Question Answering
by: Cocchi, Federico, et al.
Published: (2024)
Towards Real-world Video Face Restoration: A New Benchmark
by: Chen, Ziyan, et al.
Published: (2024)
Towards Open-Vocabulary Video Semantic Segmentation
by: Li, Xinhao, et al.
Published: (2024)