| Main Authors: | Liu, Kai, Li, Jungang, Sun, Yuchong, Wu, Shengqiong, Gao, Jianzhang, Zhang, Daoan, Zhang, Wei, Jin, Sheng, Yu, Sicheng, Zhan, Geng, Ji, Jiayi, Zhou, Fan, Zheng, Liang, Yan, Shuicheng, Fei, Hao, Chua, Tat-Seng |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2512.22905 |
Similar Items
JavisDiT++: Unified Modeling and Optimization for Joint Audio-Video Generation
by: Liu, Kai, et al.
Published: (2026)
Vitron: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing
by: Fei, Hao, et al.
Published: (2024)
JavisDiT: Joint Audio-Video Diffusion Transformer with Hierarchical Spatio-Temporal Prior Synchronization
by: Liu, Kai, et al.
Published: (2025)
Enhancing Video-Language Representations with Structural Spatio-Temporal Alignment
by: Fei, Hao, et al.
Published: (2024)
Towards Semantic Equivalence of Tokenization in Multimodal LLM
by: Wu, Shengqiong, et al.
Published: (2024)
NExT-GPT: Any-to-Any Multimodal LLM
by: Wu, Shengqiong, et al.
Published: (2023)
Dysen-VDM: Empowering Dynamics-aware Text-to-Video Diffusion with LLMs
by: Fei, Hao, et al.
Published: (2023)
Universal Scene Graph Generation
by: Wu, Shengqiong, et al.
Published: (2025)
Modeling Cross-vision Synergy for Unified Large Vision Model
by: Wu, Shengqiong, et al.
Published: (2026)
Combating Multimodal LLM Hallucination via Bottom-Up Holistic Reasoning
by: Wu, Shengqiong, et al.
Published: (2024)
Global Commander and Local Operative: A Dual-Agent Framework for Scene Navigation
by: Jin, Kaiming, et al.
Published: (2026)
Any2Caption: Interpreting Any Condition to Caption for Controllable Video Generation
by: Wu, Shengqiong, et al.
Published: (2025)
A Reason-then-Describe Instruction Interpreter for Controllable Video Generation
by: Wu, Shengqiong, et al.
Published: (2025)
Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey
by: Wang, Yaoting, et al.
Published: (2025)
Grammar Induction from Visual, Speech and Text
by: Zhao, Yu, et al.
Published: (2024)
XNLP: An Interactive Demonstration System for Universal Structured NLP
by: Fei, Hao, et al.
Published: (2023)
Synergizing Understanding and Generation with Interleaved Analyzing-Drafting Thinking
by: Wu, Shengqiong, et al.
Published: (2026)
Turing Patterns for Multimedia: Reaction-Diffusion Multi-Modal Fusion for Language-Guided Video Moment Retrieval
by: Fang, Xiang, et al.
Published: (2026)
Understanding Long Videos via LLM-Powered Entity Relation Graphs
by: Chu, Meng, et al.
Published: (2025)
Compose Your Aesthetics: Empowering Text-to-Image Models with the Principles of Art
by: Jin, Zhe, et al.
Published: (2025)
On Generative Agents in Recommendation
by: Zhang, An, et al.
Published: (2023)
Learning 4D Panoptic Scene Graph Generation from Rich 2D Visual Scene
by: Wu, Shengqiong, et al.
Published: (2025)
UniVA: Universal Video Agent towards Open-Source Next-Generation Video Generalist
by: Liang, Zhengyang, et al.
Published: (2025)
Extending Visual Dynamics for Video-to-Music Generation
by: Liu, Xiaohao, et al.
Published: (2025)
Language Representations Can be What Recommenders Need: Findings and Potentials
by: Sheng, Leheng, et al.
Published: (2024)
Towards Goal-oriented Intelligent Tutoring Systems in Online Education
by: Deng, Yang, et al.
Published: (2023)
Disentangling Masked Autoencoders for Unsupervised Domain Generalization
by: Zhang, An, et al.
Published: (2024)
Can I Trust Your Answer? Visually Grounded Video Question Answering
by: Xiao, Junbin, et al.
Published: (2023)
Zero-1-to-A: Zero-Shot One Image to Animatable Head Avatars Using Video Diffusion
by: Zhou, Zhenglin, et al.
Published: (2025)
SafeMLRM: Demystifying Safety in Multi-modal Large Reasoning Models
by: Fang, Junfeng, et al.
Published: (2025)
An LMM for Efficient Video Understanding via Reinforced Compression of Video Cubes
by: Qi, Ji, et al.
Published: (2025)
Reinforcing Chain-of-Thought Reasoning with Self-Evolving Rubrics
by: Sheng, Leheng, et al.
Published: (2026)
Lingua-SafetyBench: A Benchmark for Safety Evaluation of Multilingual Vision-Language Models
by: Shi, Enyi, et al.
Published: (2026)
3D-TAFS: A Training-free Framework for 3D Affordance Segmentation
by: Chu, Meng, et al.
Published: (2024)
OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding
by: Zhang, Tao, et al.
Published: (2024)
Learning to Ask Critical Questions for Assisting Product Search
by: Li, Zixuan, et al.
Published: (2024)
UniVST: A Unified Framework for Training-free Localized Video Style Transfer
by: Song, Quanjian, et al.
Published: (2024)
Video-of-Thought: Step-by-Step Video Reasoning from Perception to Cognition
by: Fei, Hao, et al.
Published: (2024)
Auto-Encoding Morph-Tokens for Multimodal LLM
by: Pan, Kaihang, et al.
Published: (2024)
Training-Free Multimodal Large Language Model Orchestration
by: Xie, Tianyu, et al.
Published: (2025)