Staff View: Open-Ended and Knowledge-Intensive Video Question Answering

Bibliographic Details
Main Authors:	Alam, Md Zarif Ul; Zamani, Hamed
Format:	Preprint
Published:	2025
Subjects:	Information Retrieval
Online Access:	https://arxiv.org/abs/2502.11747

_version_	1866912235582914560
author	Alam, Md Zarif Ul; Zamani, Hamed
author_facet	Alam, Md Zarif Ul; Zamani, Hamed
contents	Video question answering that requires external knowledge beyond the visual content remains a significant challenge in AI systems. While models can effectively answer questions based on direct visual observations, they often falter when faced with questions requiring broader contextual knowledge. To address this limitation, we investigate knowledge-intensive video question answering (KI-VideoQA) through the lens of multi-modal retrieval-augmented generation, with a particular focus on handling open-ended questions rather than just multiple-choice formats. Our comprehensive analysis examines various retrieval augmentation approaches using cutting-edge retrieval and vision language models, testing both zero-shot and fine-tuned configurations. We investigate several critical dimensions: the interplay between different information sources and modalities, strategies for integrating diverse multi-modal contexts, and the dynamics between query formulation and retrieval result utilization. Our findings reveal that while retrieval augmentation shows promise in improving model performance, its success is heavily dependent on the chosen modality and retrieval methodology. The study also highlights the critical role of query construction and retrieval depth optimization in effective knowledge integration. Through our proposed approach, we achieve a substantial 17.5% improvement in accuracy on multiple-choice questions in the KnowIT VQA dataset, establishing new state-of-the-art performance levels.
format	Preprint
id	arxiv_https___arxiv_org_abs_2502_11747
institution	arXiv
publishDate	2025
record_format	arxiv
title	Open-Ended and Knowledge-Intensive Video Question Answering
topic	Information Retrieval
url	https://arxiv.org/abs/2502.11747

Similar Items