Staff View: RAG-Adapter: A Plug-and-Play RAG-enhanced Framework for Long Video Understanding :: Library Catalog

Bibliographic Details
Main Authors:	Tan, Xichen; Ye, Yunfan; Luo, Yuanjing; Wan, Qian; Liu, Fang; Cai, Zhiping
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2503.08576

_version_	1866910870236299264
author	Tan, Xichen; Ye, Yunfan; Luo, Yuanjing; Wan, Qian; Liu, Fang; Cai, Zhiping
author_facet	Tan, Xichen; Ye, Yunfan; Luo, Yuanjing; Wan, Qian; Liu, Fang; Cai, Zhiping
contents	Multi-modal Large Language Models (MLLMs) capable of video understanding are advancing rapidly. To assess their video comprehension capabilities, long video understanding benchmarks such as Video-MME and MLVU have been proposed. However, these benchmarks rely on uniform frame sampling for testing, which causes significant information loss and makes the evaluations less accurate in reflecting the true abilities of MLLMs. To address this, we propose RAG-Adapter, a plug-and-play framework that reduces information loss during testing by sampling the frames most relevant to the given question. Additionally, we introduce a Grouped-supervised Contrastive Learning (GCL) method that further enhances the sampling effectiveness of RAG-Adapter through fine-tuning on our constructed MMAT dataset. Finally, we test numerous baseline MLLMs on various video understanding benchmarks and find that RAG-Adapter sampling consistently outperforms uniform sampling (e.g., the accuracy of GPT-4o increases by 9.3 percent on Video-MME), providing a more accurate testing method for long video benchmarks.
format	Preprint
id	arxiv_https___arxiv_org_abs_2503_08576
institution	arXiv
publishDate	2025
record_format	arxiv
title	RAG-Adapter: A Plug-and-Play RAG-enhanced Framework for Long Video Understanding
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2503.08576

Similar Items