Staff View: Wolf: Dense Video Captioning with a World Summarization Framework :: Library Catalog

Bibliographic Details
Main Authors:	Li, Boyi; Zhu, Ligeng; Tian, Ran; Tan, Shuhan; Chen, Yuxiao; Lu, Yao; Cui, Yin; Veer, Sushant; Ehrlich, Max; Philion, Jonah; Weng, Xinshuo; Xue, Fuzhao; Fan, Linxi; Zhu, Yuke; Kautz, Jan; Tao, Andrew; Liu, Ming-Yu; Fidler, Sanja; Ivanovic, Boris; Darrell, Trevor; Malik, Jitendra; Han, Song; Pavone, Marco
Format:	Preprint
Published:	2024
Subjects:	Machine Learning; Computation and Language; Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2407.18908

_version_	1866910884955160576
author	Li, Boyi; Zhu, Ligeng; Tian, Ran; Tan, Shuhan; Chen, Yuxiao; Lu, Yao; Cui, Yin; Veer, Sushant; Ehrlich, Max; Philion, Jonah; Weng, Xinshuo; Xue, Fuzhao; Fan, Linxi; Zhu, Yuke; Kautz, Jan; Tao, Andrew; Liu, Ming-Yu; Fidler, Sanja; Ivanovic, Boris; Darrell, Trevor; Malik, Jitendra; Han, Song; Pavone, Marco
author_facet	Li, Boyi; Zhu, Ligeng; Tian, Ran; Tan, Shuhan; Chen, Yuxiao; Lu, Yao; Cui, Yin; Veer, Sushant; Ehrlich, Max; Philion, Jonah; Weng, Xinshuo; Xue, Fuzhao; Fan, Linxi; Zhu, Yuke; Kautz, Jan; Tao, Andrew; Liu, Ming-Yu; Fidler, Sanja; Ivanovic, Boris; Darrell, Trevor; Malik, Jitendra; Han, Song; Pavone, Marco
contents	We propose Wolf, a WOrLd summarization Framework for accurate video captioning. Wolf is an automated captioning framework that adopts a mixture-of-experts approach, leveraging complementary strengths of Vision Language Models (VLMs). By utilizing both image and video models, our framework captures different levels of information and summarizes them efficiently. Our approach can be applied to enhance video understanding, auto-labeling, and captioning. To evaluate caption quality, we introduce CapScore, an LLM-based metric to assess the similarity and quality of generated captions compared to the ground truth captions. We further build four human-annotated datasets in three domains: autonomous driving, general scenes, and robotics, to facilitate comprehensive comparisons. We show that Wolf achieves superior captioning performance compared to state-of-the-art approaches from the research community (VILA1.5, CogAgent) and commercial solutions (Gemini-Pro-1.5, GPT-4V). For instance, in comparison with GPT-4V, Wolf improves CapScore both quality-wise by 55.6% and similarity-wise by 77.4% on challenging driving videos. Finally, we establish a benchmark for video captioning and introduce a leaderboard, aiming to accelerate advancements in video understanding, captioning, and data alignment. Webpage: https://wolfv0.github.io/.
format	Preprint
id	arxiv_https___arxiv_org_abs_2407_18908
institution	arXiv
publishDate	2024
record_format	arxiv
title	Wolf: Dense Video Captioning with a World Summarization Framework
topic	Machine Learning; Computation and Language; Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2407.18908
