Bibliographic Details
Main Authors: Ge, Yuying; Ge, Yixiao; Li, Chen; Wang, Teng; Pu, Junfu; Li, Yizhuo; Qiu, Lu; Ma, Jin; Duan, Lisheng; Zuo, Xinyu; Luo, Jinwen; Gu, Weibo; Li, Zexuan; Zhang, Xiaojing; Tao, Yangyu; Hu, Han; Wang, Di; Shan, Ying
Format: Preprint
Published: 2025
Subjects: Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2507.20939
author Ge, Yuying
Ge, Yixiao
Li, Chen
Wang, Teng
Pu, Junfu
Li, Yizhuo
Qiu, Lu
Ma, Jin
Duan, Lisheng
Zuo, Xinyu
Luo, Jinwen
Gu, Weibo
Li, Zexuan
Zhang, Xiaojing
Tao, Yangyu
Hu, Han
Wang, Di
Shan, Ying
contents Real-world user-generated short videos, especially those distributed on platforms such as WeChat Channel and TikTok, dominate the mobile internet. However, current large multimodal models lack the temporally structured, detailed, and in-depth video comprehension capabilities that underpin effective video search and recommendation, as well as emerging video applications. Understanding real-world shorts is challenging because of their complex visual elements, high information density in both visuals and audio, and fast pacing that focuses on emotional expression and viewpoint delivery. This requires advanced reasoning to integrate multimodal information, including visual, audio, and textual signals. In this work, we introduce ARC-Hunyuan-Video, a multimodal model that processes visual, audio, and textual signals from raw video inputs end-to-end for structured comprehension. The model is capable of multi-granularity timestamped video captioning and summarization, open-ended video question answering, temporal video grounding, and video reasoning. Leveraging high-quality data from an automated annotation pipeline, our compact 7B-parameter model is trained through a multi-stage regimen: pre-training, instruction fine-tuning, cold start, reinforcement learning (RL) post-training, and final instruction fine-tuning. Quantitative evaluations on our newly introduced benchmark, ShortVid-Bench, and qualitative comparisons demonstrate its strong performance in real-world video comprehension, and it supports zero-shot use or fine-tuning with a few samples for diverse downstream applications. The real-world production deployment of our model has yielded measurable improvements in user engagement and satisfaction, supported by its efficiency: stress tests indicate an inference time of just 10 seconds for a one-minute video on an H20 GPU.
format Preprint
id arxiv_https___arxiv_org_abs_2507_20939
institution arXiv
publishDate 2025
record_format arxiv
title ARC-Hunyuan-Video-7B: Structured Video Comprehension of Real-World Shorts
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2507.20939