| Main Authors: | Liu, Dingming, Li, Shaowei, Zhou, Ruoyan, Liang, Lili, Hong, Yongguan, Chao, Fei, Ji, Rongrong |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2404.12903 |
Similar Items
Routing Experts: Learning to Route Dynamic Experts in Multi-modal Large Language Models
by: Wu, Qiong, et al.
Published: (2024)
Exploring Phrase-Level Grounding with Text-to-Image Diffusion Model
by: Yang, Danni, et al.
Published: (2024)
Not All Attention is Needed: Parameter and Computation Efficient Transfer Learning for Multi-modal Large Language Models
by: Wu, Qiong, et al.
Published: (2024)
Control-A-Video: Controllable Text-to-Video Diffusion Models with Motion Prior and Reward Feedback Learning
by: Chen, Weifeng, et al.
Published: (2023)
MMAPS: End-to-End Multi-Grained Multi-Modal Attribute-Aware Product Summarization
by: Chen, Tao, et al.
Published: (2023)
SurgSora: Object-Aware Diffusion Model for Controllable Surgical Video Generation
by: Chen, Tong, et al.
Published: (2024)
Semantic Compensation via Adversarial Removal for Robust Zero-Shot ECG Diagnosis
by: Liu, Hongjun, et al.
Published: (2026)
Grounded Chain-of-Thought for Multimodal Large Language Models
by: Wu, Qiong, et al.
Published: (2025)
3DMambaIPF: A State Space Model for Iterative Point Cloud Filtering via Differentiable Rendering
by: Zhou, Qingyuan, et al.
Published: (2024)
DreamHead: Learning Spatial-Temporal Correspondence via Hierarchical Diffusion for Audio-driven Talking Head Synthesis
by: Hong, Fa-Ting, et al.
Published: (2024)
Prompt-A-Video: Prompt Your Video Diffusion Model via Preference-Aligned LLM
by: Ji, Yatai, et al.
Published: (2024)
3MDiT: Unified Tri-Modal Diffusion Transformer for Text-Driven Synchronized Audio-Video Generation
by: Li, Yaoru, et al.
Published: (2025)
AudCast: Audio-Driven Human Video Generation by Cascaded Diffusion Transformers
by: Guan, Jiazhi, et al.
Published: (2025)
Parallel Vision Token Scheduling for Fast and Accurate Multimodal LMMs Inference
by: Zhan, Wengyi, et al.
Published: (2025)
RenCon 2025: Revival of the Expressive Performance Rendering Competition
by: Zhang, Huan, et al.
Published: (2026)
Accelerating Multimodal Large Language Models via Dynamic Visual-Token Exit and the Empirical Findings
by: Wu, Qiong, et al.
Published: (2024)
Generalized Video Anomaly Event Detection: Systematic Taxonomy and Comparison of Deep Models
by: Liu, Yang, et al.
Published: (2023)
AutoAWG: Adverse Weather Generation with Adaptive Multi-Controls for Automotive Videos
by: Hu, Jiagao, et al.
Published: (2026)
ControlFoley: Unified and Controllable Video-to-Audio Generation with Cross-Modal Conflict Handling
by: Yang, Jianxuan, et al.
Published: (2026)
Towards Universal Modal Tracking with Online Dense Temporal Token Learning
by: Zheng, Yaozong, et al.
Published: (2025)
Knowledge-aware Diffusion-Enhanced Multimedia Recommendation
by: Mo, Xian, et al.
Published: (2025)
TeMTG: Text-Enhanced Multi-Hop Temporal Graph Modeling for Audio-Visual Video Parsing
by: Chen, Yaru, et al.
Published: (2025)
Error Analyses of Auto-Regressive Video Diffusion Models: A Unified Framework
by: Wang, Jing, et al.
Published: (2025)
Archiving Body Movements: Collective Generation of Chinese Calligraphy
by: Zhou, Aven Le, et al.
Published: (2023)
Synchronized Video Storytelling: Generating Video Narrations with Structured Storyline
by: Yang, Dingyi, et al.
Published: (2024)
GestureHYDRA: Semantic Co-speech Gesture Synthesis via Hybrid Modality Diffusion Transformer and Cascaded-Synchronized Retrieval-Augmented Generation
by: Yang, Quanwei, et al.
Published: (2025)
Virbo: Multimodal Multilingual Avatar Video Generation in Digital Marketing
by: Zhang, Juan, et al.
Published: (2024)
Co-Speech Gesture Video Generation via Motion-Decoupled Diffusion Model
by: He, Xu, et al.
Published: (2024)
Lumos-1: On Autoregressive Video Generation with Discrete Diffusion from a Unified Model Perspective
by: Yuan, Hangjie, et al.
Published: (2025)
Towards Multimodal Empathetic Response Generation: A Rich Text-Speech-Vision Avatar-based Benchmark
by: Zhang, Han, et al.
Published: (2025)
Investigating Conceptual Blending of a Diffusion Model for Improving Nonword-to-Image Generation
by: Matsuhira, Chihaya, et al.
Published: (2024)
MMoFusion: Multi-modal Co-Speech Motion Generation with Diffusion Model
by: Wang, Sen, et al.
Published: (2024)
Diffusion Models for Joint Audio-Video Generation
by: La Torre, Alejandro Paredes
Published: (2026)
Controllable Complex Human Motion Video Generation via Text-to-Skeleton Cascades
by: Taghipour, Ashkan, et al.
Published: (2026)
FlexCache: Flexible Approximate Cache System for Video Diffusion
by: Sun, Desen, et al.
Published: (2024)
Feedback-Driven Rate Control for Learned Video Compression
by: Xu, Zhiheng, et al.
Published: (2026)
Divide and Conquer: Multimodal Video Deepfake Detection via Cross-Modal Fusion and Localization
by: Li, Qingcao, et al.
Published: (2026)
JavisDiT++: Unified Modeling and Optimization for Joint Audio-Video Generation
by: Liu, Kai, et al.
Published: (2026)
Towards User-level QoE: Large-scale Practice in Personalized Optimization of Adaptive Video Streaming
by: Jia, Lianchen, et al.
Published: (2025)
DreamFoley: Scalable VLMs for High-Fidelity Video-to-Audio Generation
by: Li, Fu, et al.
Published: (2025)