Staff View: :: Library Catalog

Saved in:

Bibliographic Details
Main Authors:	Mu, Kuan-Chen, Chin, Zhi-Yi, Chiu, Wei-Chen
Format:	Preprint
Published:	2024
Subjects:	Computer Vision and Pattern Recognition Computation and Language
Online Access:	https://arxiv.org/abs/2410.04511
Tags:	Add Tag No Tags, Be the first to tag this record!

_version_	1866909337457262592
author	Mu, Kuan-Chen Chin, Zhi-Yi Chiu, Wei-Chen
author_facet	Mu, Kuan-Chen Chin, Zhi-Yi Chiu, Wei-Chen
contents	The recent development of Video-based Large Language Models (VideoLLMs), has significantly advanced video summarization by aligning video features and, in some cases, audio features with Large Language Models (LLMs). Each of these VideoLLMs possesses unique strengths and weaknesses. Many recent methods have required extensive fine-tuning to overcome the limitations of these models, which can be resource-intensive. In this work, we observe that the strengths of one VideoLLM can complement the weaknesses of another. Leveraging this insight, we propose a novel video summarization framework inspired by the Mixture of Experts (MoE) paradigm, which operates as an inference-time algorithm without requiring any form of fine-tuning. Our approach integrates multiple VideoLLMs to generate comprehensive and coherent textual summaries. It effectively combines visual and audio content, provides detailed background descriptions, and excels at identifying keyframes, which enables more semantically meaningful retrieval compared to traditional computer vision approaches that rely solely on visual information, all without the need for additional fine-tuning. Moreover, the resulting summaries enhance performance in downstream tasks such as summary video generation, either through keyframe selection or in combination with text-to-image models. Our language-driven approach offers a semantically rich alternative to conventional methods and provides flexibility to incorporate newer VideoLLMs, enhancing adaptability and performance in video summarization tasks.
format	Preprint
id	arxiv_https___arxiv_org_abs_2410_04511
institution	arXiv
publishDate	2024
record_format	arxiv
spellingShingle	Realizing Video Summarization from the Path of Language-based Semantic Understanding Mu, Kuan-Chen Chin, Zhi-Yi Chiu, Wei-Chen Computer Vision and Pattern Recognition Computation and Language The recent development of Video-based Large Language Models (VideoLLMs), has significantly advanced video summarization by aligning video features and, in some cases, audio features with Large Language Models (LLMs). Each of these VideoLLMs possesses unique strengths and weaknesses. Many recent methods have required extensive fine-tuning to overcome the limitations of these models, which can be resource-intensive. In this work, we observe that the strengths of one VideoLLM can complement the weaknesses of another. Leveraging this insight, we propose a novel video summarization framework inspired by the Mixture of Experts (MoE) paradigm, which operates as an inference-time algorithm without requiring any form of fine-tuning. Our approach integrates multiple VideoLLMs to generate comprehensive and coherent textual summaries. It effectively combines visual and audio content, provides detailed background descriptions, and excels at identifying keyframes, which enables more semantically meaningful retrieval compared to traditional computer vision approaches that rely solely on visual information, all without the need for additional fine-tuning. Moreover, the resulting summaries enhance performance in downstream tasks such as summary video generation, either through keyframe selection or in combination with text-to-image models. Our language-driven approach offers a semantically rich alternative to conventional methods and provides flexibility to incorporate newer VideoLLMs, enhancing adaptability and performance in video summarization tasks.
title	Realizing Video Summarization from the Path of Language-based Semantic Understanding
topic	Computer Vision and Pattern Recognition Computation and Language
url	https://arxiv.org/abs/2410.04511

Similar Items