Bibliographic Details
Main Authors: Li, Boyi, Zhu, Ligeng, Tian, Ran, Tan, Shuhan, Chen, Yuxiao, Lu, Yao, Cui, Yin, Veer, Sushant, Ehrlich, Max, Philion, Jonah, Weng, Xinshuo, Xue, Fuzhao, Fan, Linxi, Zhu, Yuke, Kautz, Jan, Tao, Andrew, Liu, Ming-Yu, Fidler, Sanja, Ivanovic, Boris, Darrell, Trevor, Malik, Jitendra, Han, Song, Pavone, Marco
Format: Preprint
Published: 2024
Subjects: Machine Learning; Computation and Language; Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2407.18908
_version_ 1866910884955160576
author Li, Boyi
Zhu, Ligeng
Tian, Ran
Tan, Shuhan
Chen, Yuxiao
Lu, Yao
Cui, Yin
Veer, Sushant
Ehrlich, Max
Philion, Jonah
Weng, Xinshuo
Xue, Fuzhao
Fan, Linxi
Zhu, Yuke
Kautz, Jan
Tao, Andrew
Liu, Ming-Yu
Fidler, Sanja
Ivanovic, Boris
Darrell, Trevor
Malik, Jitendra
Han, Song
Pavone, Marco
contents We propose Wolf, a WOrLd summarization Framework for accurate video captioning. Wolf is an automated captioning framework that adopts a mixture-of-experts approach, leveraging complementary strengths of Vision Language Models (VLMs). By utilizing both image and video models, our framework captures different levels of information and summarizes them efficiently. Our approach can be applied to enhance video understanding, auto-labeling, and captioning. To evaluate caption quality, we introduce CapScore, an LLM-based metric to assess the similarity and quality of generated captions compared to the ground truth captions. We further build four human-annotated datasets in three domains: autonomous driving, general scenes, and robotics, to facilitate comprehensive comparisons. We show that Wolf achieves superior captioning performance compared to state-of-the-art approaches from the research community (VILA1.5, CogAgent) and commercial solutions (Gemini-Pro-1.5, GPT-4V). For instance, in comparison with GPT-4V, Wolf improves CapScore both quality-wise by 55.6% and similarity-wise by 77.4% on challenging driving videos. Finally, we establish a benchmark for video captioning and introduce a leaderboard, aiming to accelerate advancements in video understanding, captioning, and data alignment. Webpage: https://wolfv0.github.io/.
format Preprint
id arxiv_https___arxiv_org_abs_2407_18908
institution arXiv
publishDate 2024
record_format arxiv
title Wolf: Dense Video Captioning with a World Summarization Framework
topic Machine Learning
Computation and Language
Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2407.18908