:: Library Catalog

Bibliographic Details
Main Authors:	Ge, Yuying; Ge, Yixiao; Li, Chen; Wang, Teng; Pu, Junfu; Li, Yizhuo; Qiu, Lu; Ma, Jin; Duan, Lisheng; Zuo, Xinyu; Luo, Jinwen; Gu, Weibo; Li, Zexuan; Zhang, Xiaojing; Tao, Yangyu; Hu, Han; Wang, Di; Shan, Ying
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2507.20939

Similar Items

ARC-Chapter: Structuring Hour-Long Videos into Navigable Chapters and Hierarchical Summaries
by: Pu, Junfu, et al.
Published: (2025)

Divot: Diffusion Powers Video Tokenizer for Comprehension and Generation
by: Ge, Yuying, et al.
Published: (2024)

AnimeShooter: A Multi-Shot Animation Dataset for Reference-Guided Video Generation
by: Qiu, Lu, et al.
Published: (2025)

DiCoDe: Diffusion-Compressed Deep Tokens for Autoregressive Video Generation with Language Models
by: Li, Yizhuo, et al.
Published: (2024)

Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning?
by: Cheng, Junhao, et al.
Published: (2025)

TimeLens: Rethinking Video Temporal Grounding with Multimodal LLMs
by: Zhang, Jun, et al.
Published: (2025)

Aligning Latent Spaces with Flow Priors
by: Li, Yizhuo, et al.
Published: (2025)

Moto: Latent Motion Token as the Bridging Language for Learning Robot Manipulation from Videos
by: Chen, Yi, et al.
Published: (2024)

Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1
by: Chen, Yi, et al.
Published: (2025)

OmniScript: Towards Audio-Visual Script Generation for Long-Form Cinematic Video
by: Pu, Junfu, et al.
Published: (2026)

BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning
by: Liu, Ruyang, et al.
Published: (2023)

SEED-Bench-2-Plus: Benchmarking Multimodal Large Language Models with Text-Rich Visual Comprehension
by: Li, Bohao, et al.
Published: (2024)

GenHancer: Imperfect Generative Models are Secretly Strong Vision-Centric Enhancers
by: Ma, Shijie, et al.
Published: (2025)

SEED-Data-Edit Technical Report: A Hybrid Dataset for Instructional Image Editing
by: Ge, Yuying, et al.
Published: (2024)

TokLIP: Marry Visual Tokens to CLIP for Multimodal Comprehension and Generation
by: Lin, Haokun, et al.
Published: (2025)

AudioStory: Generating Long-Form Narrative Audio with Large Language Models
by: Guo, Yuxin, et al.
Published: (2025)

HunyuanVideo: A Systematic Framework For Large Video Generative Models
by: Kong, Weijie, et al.
Published: (2024)

EgoPlan-Bench2: A Benchmark for Multimodal Large Language Model Planning in Real-World Scenarios
by: Qiu, Lu, et al.
Published: (2024)

AnimeGamer: Infinite Anime Life Simulation with Next Game State Prediction
by: Cheng, Junhao, et al.
Published: (2025)

SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation
by: Ge, Yuying, et al.
Published: (2024)

SEED-Story: Multimodal Long Story Generation with Large Language Model
by: Yang, Shuai, et al.
Published: (2024)

HunyuanVideo 1.5 Technical Report
by: Wu, Bing, et al.
Published: (2025)

VideoMaker: Zero-shot Customized Video Generation with the Inherent Force of Video Diffusion Models
by: Wu, Tao, et al.
Published: (2024)

Supervised Fine-tuning in turn Improves Visual Foundation Models
by: Jiang, Xiaohu, et al.
Published: (2024)

GRPO-CARE: Consistency-Aware Reinforcement Learning for Multimodal Reasoning
by: Chen, Yi, et al.
Published: (2025)

HunyuanVideo-Foley: Multimodal Diffusion with Representation Alignment for High-Fidelity Foley Audio Generation
by: Shan, Sizhe, et al.
Published: (2025)

UniT: Toward a Unified Physical Language for Human-to-Humanoid Policy Learning and World Modeling
by: Chen, Boyu, et al.
Published: (2026)

EgoPlan-Bench: Benchmarking Multimodal Large Language Models for Human-Level Planning
by: Chen, Yi, et al.
Published: (2023)

HunyuanCustom: A Multimodal-Driven Architecture for Customized Video Generation
by: Hu, Teng, et al.
Published: (2025)

ST-LLM: Large Language Models Are Effective Temporal Learners
by: Liu, Ruyang, et al.
Published: (2024)

Empirical models for calculating soil wetting patterns under surface drip irrigation systems: A comprehensive analysis
by: Ge Li, et al.
Published: (2024)

MVBench: A Comprehensive Multi-modal Video Understanding Benchmark
by: Li, Kunchang, et al.
Published: (2023)

DIAL: Decoupling Intent and Action via Latent World Modeling for End-to-End VLA
by: Chen, Yi, et al.
Published: (2026)

HaploOmni: Unified Single Transformer for Multimodal Video Understanding and Generation
by: Xiao, Yicheng, et al.
Published: (2025)

VideoChat: Chat-Centric Video Understanding
by: Li, KunChang, et al.
Published: (2023)

Hunyuan-MT Technical Report
by: Zheng, Mao, et al.
Published: (2025)

HunyuanOCR Technical Report
by: Hunyuan Vision Team, et al.
Published: (2025)

FastMTP: Accelerating LLM Inference with Enhanced Multi-Token Prediction
by: Cai, Yuxuan, et al.
Published: (2025)

R2E-VID: Two-Stage Robust Routing via Temporal Gating for Elastic Edge-Cloud Video Inference
by: Yang, Zheming, et al.
Published: (2026)

Optimized Live 4K Video Multicast
by: He, Zhaoyuan, et al.
Published: (2023)