| Main Authors: | Ge, Yuying, Ge, Yixiao, Li, Chen, Wang, Teng, Pu, Junfu, Li, Yizhuo, Qiu, Lu, Ma, Jin, Duan, Lisheng, Zuo, Xinyu, Luo, Jinwen, Gu, Weibo, Li, Zexuan, Zhang, Xiaojing, Tao, Yangyu, Hu, Han, Wang, Di, Shan, Ying |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2507.20939 |
Similar Items
ARC-Chapter: Structuring Hour-Long Videos into Navigable Chapters and Hierarchical Summaries
by: Pu, Junfu, et al.
Published: (2025)
Divot: Diffusion Powers Video Tokenizer for Comprehension and Generation
by: Ge, Yuying, et al.
Published: (2024)
AnimeShooter: A Multi-Shot Animation Dataset for Reference-Guided Video Generation
by: Qiu, Lu, et al.
Published: (2025)
DiCoDe: Diffusion-Compressed Deep Tokens for Autoregressive Video Generation with Language Models
by: Li, Yizhuo, et al.
Published: (2024)
Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning?
by: Cheng, Junhao, et al.
Published: (2025)
TimeLens: Rethinking Video Temporal Grounding with Multimodal LLMs
by: Zhang, Jun, et al.
Published: (2025)
Aligning Latent Spaces with Flow Priors
by: Li, Yizhuo, et al.
Published: (2025)
Moto: Latent Motion Token as the Bridging Language for Learning Robot Manipulation from Videos
by: Chen, Yi, et al.
Published: (2024)
Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1
by: Chen, Yi, et al.
Published: (2025)
OmniScript: Towards Audio-Visual Script Generation for Long-Form Cinematic Video
by: Pu, Junfu, et al.
Published: (2026)
BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning
by: Liu, Ruyang, et al.
Published: (2023)
SEED-Bench-2-Plus: Benchmarking Multimodal Large Language Models with Text-Rich Visual Comprehension
by: Li, Bohao, et al.
Published: (2024)
GenHancer: Imperfect Generative Models are Secretly Strong Vision-Centric Enhancers
by: Ma, Shijie, et al.
Published: (2025)
SEED-Data-Edit Technical Report: A Hybrid Dataset for Instructional Image Editing
by: Ge, Yuying, et al.
Published: (2024)
TokLIP: Marry Visual Tokens to CLIP for Multimodal Comprehension and Generation
by: Lin, Haokun, et al.
Published: (2025)
AudioStory: Generating Long-Form Narrative Audio with Large Language Models
by: Guo, Yuxin, et al.
Published: (2025)
HunyuanVideo: A Systematic Framework For Large Video Generative Models
by: Kong, Weijie, et al.
Published: (2024)
EgoPlan-Bench2: A Benchmark for Multimodal Large Language Model Planning in Real-World Scenarios
by: Qiu, Lu, et al.
Published: (2024)
AnimeGamer: Infinite Anime Life Simulation with Next Game State Prediction
by: Cheng, Junhao, et al.
Published: (2025)
SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation
by: Ge, Yuying, et al.
Published: (2024)
SEED-Story: Multimodal Long Story Generation with Large Language Model
by: Yang, Shuai, et al.
Published: (2024)
HunyuanVideo 1.5 Technical Report
by: Wu, Bing, et al.
Published: (2025)
VideoMaker: Zero-shot Customized Video Generation with the Inherent Force of Video Diffusion Models
by: Wu, Tao, et al.
Published: (2024)
Supervised Fine-tuning in turn Improves Visual Foundation Models
by: Jiang, Xiaohu, et al.
Published: (2024)
GRPO-CARE: Consistency-Aware Reinforcement Learning for Multimodal Reasoning
by: Chen, Yi, et al.
Published: (2025)
HunyuanVideo-Foley: Multimodal Diffusion with Representation Alignment for High-Fidelity Foley Audio Generation
by: Shan, Sizhe, et al.
Published: (2025)
UniT: Toward a Unified Physical Language for Human-to-Humanoid Policy Learning and World Modeling
by: Chen, Boyu, et al.
Published: (2026)
EgoPlan-Bench: Benchmarking Multimodal Large Language Models for Human-Level Planning
by: Chen, Yi, et al.
Published: (2023)
HunyuanCustom: A Multimodal-Driven Architecture for Customized Video Generation
by: Hu, Teng, et al.
Published: (2025)
ST-LLM: Large Language Models Are Effective Temporal Learners
by: Liu, Ruyang, et al.
Published: (2024)
Empirical models for calculating soil wetting patterns under surface drip irrigation systems: A comprehensive analysis
by: Ge Li, et al.
Published: (2024)
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark
by: Li, Kunchang, et al.
Published: (2023)
DIAL: Decoupling Intent and Action via Latent World Modeling for End-to-End VLA
by: Chen, Yi, et al.
Published: (2026)
HaploOmni: Unified Single Transformer for Multimodal Video Understanding and Generation
by: Xiao, Yicheng, et al.
Published: (2025)
VideoChat: Chat-Centric Video Understanding
by: Li, KunChang, et al.
Published: (2023)
Hunyuan-MT Technical Report
by: Zheng, Mao, et al.
Published: (2025)
HunyuanOCR Technical Report
by: Hunyuan Vision Team, et al.
Published: (2025)
FastMTP: Accelerating LLM Inference with Enhanced Multi-Token Prediction
by: Cai, Yuxuan, et al.
Published: (2025)
R2E-VID: Two-Stage Robust Routing via Temporal Gating for Elastic Edge-Cloud Video Inference
by: Yang, Zheming, et al.
Published: (2026)
Optimized Live 4K Video Multicast
by: He, Zhaoyuan, et al.
Published: (2023)