Bibliographic Details
Main Authors: Chen, Tsai-Shien, Siarohin, Aliaksandr, Menapace, Willi, Deyneka, Ekaterina, Chao, Hsiang-wei, Jeon, Byung Eun, Fang, Yuwei, Lee, Hsin-Ying, Ren, Jian, Yang, Ming-Hsuan, Tulyakov, Sergey
Format: Preprint
Published: 2024
Subjects: Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2402.19479
author Chen, Tsai-Shien
Siarohin, Aliaksandr
Menapace, Willi
Deyneka, Ekaterina
Chao, Hsiang-wei
Jeon, Byung Eun
Fang, Yuwei
Lee, Hsin-Ying
Ren, Jian
Yang, Ming-Hsuan
Tulyakov, Sergey
contents The quality of the data and annotation upper-bounds the quality of a downstream model. While there exist large text corpora and image-text pairs, high-quality video-text data is much harder to collect. First of all, manual labeling is more time-consuming, as it requires an annotator to watch an entire video. Second, videos have a temporal dimension, consisting of several scenes stacked together and showing multiple actions. Accordingly, to establish a video dataset with high-quality captions, we propose an automatic approach leveraging multimodal inputs, such as textual video description, subtitles, and individual video frames. Specifically, we curate 3.8M high-resolution videos from the publicly available HD-VILA-100M dataset. We then split them into semantically consistent video clips and apply multiple cross-modality teacher models to obtain captions for each video. Next, we finetune a retrieval model on a small subset where the best caption of each video is manually selected, and then apply the model to the whole dataset to select the best caption as the annotation. In this way, we get 70M videos paired with high-quality text captions. We dub the dataset Panda-70M. We show the value of the proposed dataset on three downstream tasks: video captioning, video and text retrieval, and text-driven video generation. The models trained on the proposed data score substantially better on the majority of metrics across all the tasks.
format Preprint
id arxiv_https___arxiv_org_abs_2402_19479
institution arXiv
publishDate 2024
record_format arxiv
title Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2402.19479