:: Library Catalog

Bibliographic Details
Main Authors:	Mao, Yuanyuan; Lin, Xin; Ni, Qin; He, Liang
Format:	Preprint
Published:	2024
Subjects:	Multimedia; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2402.07402

Similar Items

Question-Answering Dense Video Events
by: Qin, Hangyu, et al.
Published: (2024)

Memory-Centric Embodied Question Answering
by: Zhai, Mingliang, et al.
Published: (2025)

VidCtx: Context-aware Video Question Answering with Image Models
by: Goulas, Andreas, et al.
Published: (2024)

Can I Trust Your Answer? Visually Grounded Video Question Answering
by: Xiao, Junbin, et al.
Published: (2023)

POLYCHARTQA: Benchmarking Large Vision-Language Models with Multilingual Chart Question Answering
by: Xu, Yichen, et al.
Published: (2025)

MuKV: Multi-Grained KV Cache Compression for Long Streaming Video Question-Answering
by: Xiao, Junbin, et al.
Published: (2026)

CMMU: A Benchmark for Chinese Multi-modal Multi-type Question Understanding and Reasoning
by: He, Zheqi, et al.
Published: (2024)

Multi-Domain Audio Question Answering Benchmark Toward Acoustic Content Reasoning
by: Yang, Chao-Han Huck, et al.
Published: (2025)

ReAG: Reasoning-Augmented Generation for Knowledge-based Visual Question Answering
by: Compagnoni, Alberto, et al.
Published: (2025)

Navigating the Mirage: A Dual-Path Agentic Framework for Robust Misleading Chart Question Answering
by: Zhang, Yanjie, et al.
Published: (2026)

Self-Adaptive Sampling for Efficient Video Question-Answering on Image–Text Models
by: Han, Wei, et al.
Published: (2023)

Exposing Cross-Modal Consistency for Fake News Detection in Short-Form Videos
by: Tian, Chong, et al.
Published: (2026)

Why We Feel: Breaking Boundaries in Emotional Reasoning with Multimodal Large Language Models
by: Lin, Yuxiang, et al.
Published: (2025)

Mitigating Easy Option Bias in Multiple-Choice Question Answering
by: Zhang, Hao, et al.
Published: (2025)

Attention of a Kiss: Exploring Attention Maps in Video Diffusion for XAIxArts
by: Cole, Adam, et al.
Published: (2025)

Learning Trimodal Relation for Audio-Visual Question Answering with Missing Modality
by: Park, Kyu Ri, et al.
Published: (2024)

AdaCoder: Adaptive Prompt Compression for Programmatic Visual Question Answering
by: Ukai, Mahiro, et al.
Published: (2024)

Towards Automatic Soccer Commentary Generation with Knowledge-Enhanced Visual Reasoning
by: Jin, Zeyu, et al.
Published: (2026)

Emotion-LLaMA: Multimodal Emotion Recognition and Reasoning with Instruction Tuning
by: Cheng, Zebang, et al.
Published: (2024)

Solving Copyright Infringement on Short Video Platforms: Novel Datasets and an Audio Restoration Deep Learning Pipeline
by: Oh, Minwoo, et al.
Published: (2025)

Boosting Audio Visual Question Answering via Key Semantic-Aware Cues
by: Li, Guangyao, et al.
Published: (2024)

SHMamba: Structured Hyperbolic State Space Model for Audio-Visual Question Answering
by: Yang, Zhe, et al.
Published: (2024)

SEER: Semantic Enhancement and Emotional Reasoning Network for Multimodal Fake News Detection
by: Zhu, Peican, et al.
Published: (2025)

Official-NV: An LLM-Generated News Video Dataset for Multimodal Fake News Detection
by: Wang, Yihao, et al.
Published: (2024)

Plasticity-Aware Mixture of Experts for Learning Under QoE Shifts in Adaptive Video Streaming
by: He, Zhiqiang, et al.
Published: (2025)

MindFuse: Towards GenAI Explainability in Marketing Strategy Co-Creation
by: Farseev, Aleksandr, et al.
Published: (2025)

PediatricsMQA: a Multi-modal Pediatrics Question Answering Benchmark
by: Bahaj, Adil, et al.
Published: (2025)

Has Multimodal Learning Delivered Universal Intelligence in Healthcare? A Comprehensive Survey
by: Lin, Qika, et al.
Published: (2024)

EgoEsportsQA: An Egocentric Video Benchmark for Perception and Reasoning in Esports
by: Ma, Jianzhe, et al.
Published: (2026)

Adaptive Multi-Agent Reasoning for Text-to-Video Retrieval
by: Wu, Jiaxin, et al.
Published: (2025)

VRBench: A Benchmark for Multi-Step Reasoning in Long Narrative Videos
by: Yu, Jiashuo, et al.
Published: (2025)

HiQuE: Hierarchical Question Embedding Network for Multimodal Depression Detection
by: Jung, Juho, et al.
Published: (2024)

A New Dataset and Benchmark for Grounding Multimodal Misinformation
by: Yang, Bingjian, et al.
Published: (2025)

VideoForest: Person-Anchored Hierarchical Reasoning for Cross-Video Question Answering
by: Meng, Yiran, et al.
Published: (2025)

QMAVIS: Long Video-Audio Understanding using Fusion of Large Multimodal Models
by: Lin, Zixing, et al.
Published: (2026)

FineFake: A Knowledge-Enriched Dataset for Fine-Grained Multi-Domain Fake News Detection
by: Zhou, Ziyi, et al.
Published: (2024)

Semantic-Guided Unsupervised Video Summarization
by: Liu, Haizhou, et al.
Published: (2026)

Augmenting Multimodal LLMs with Self-Reflective Tokens for Knowledge-based Visual Question Answering
by: Cocchi, Federico, et al.
Published: (2024)

Towards Real-world Video Face Restoration: A New Benchmark
by: Chen, Ziyan, et al.
Published: (2024)

Towards Open-Vocabulary Video Semantic Segmentation
by: Li, Xinhao, et al.
Published: (2024)