| Main Authors: | Li, Yan, Zhou, Ziya, Wang, Zhiqiang, Xue, Wei, Luo, Wenhan, Guo, Yike |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2412.03430 |
Similar Items
Audio-Omni: Extending Multi-modal Understanding to Versatile Audio Generation and Editing
by: Tian, Zeyue, et al.
Published: (2026)
OmniAudio: Generating Spatial Audio from 360-Degree Video
by: Liu, Huadai, et al.
Published: (2025)
AudioX: A Unified Framework for Anything-to-Audio Generation
by: Tian, Zeyue, et al.
Published: (2025)
Both Ears Wide Open: Towards Language-Driven Spatial Audio Generation
by: Sun, Peiwen, et al.
Published: (2024)
Diffusion Models for Joint Audio-Video Generation
by: La Torre, Alejandro Paredes
Published: (2026)
PrismAudio: Decomposed Chain-of-Thoughts and Multi-dimensional Rewards for Video-to-Audio Generation
by: Liu, Huadai, et al.
Published: (2025)
VidMuse: A Simple Video-to-Music Generation Framework with Long-Short-Term Modeling
by: Tian, Zeyue, et al.
Published: (2024)
MOVA: Towards Scalable and Synchronized Video-Audio Generation
by: OpenMOSS Team, et al.
Published: (2026)
MultiSoundGen: Video-to-Audio Generation for Multi-Event Scenarios via SlowFast Contrastive Audio-Visual Pretraining and Direct Preference Optimization
by: Yang, Jianxuan, et al.
Published: (2025)
Talker-T2AV: Joint Talking Audio-Video Generation with Autoregressive Diffusion Modeling
by: Ye, Zhen, et al.
Published: (2026)
Apollo: Unified Multi-Task Audio-Video Joint Generation
by: Wang, Jun, et al.
Published: (2026)
JavisDiT++: Unified Modeling and Optimization for Joint Audio-Video Generation
by: Liu, Kai, et al.
Published: (2026)
Draw an Audio: Leveraging Multi-Instruction for Video-to-Audio Synthesis
by: Yang, Qi, et al.
Published: (2024)
ThinkSound: Chain-of-Thought Reasoning in Multimodal Large Language Models for Audio Generation and Editing
by: Liu, Huadai, et al.
Published: (2025)
AudioStory: Generating Long-Form Narrative Audio with Large Language Models
by: Guo, Yuxin, et al.
Published: (2025)
MMAudio-LABEL: Audio Event Labeling via Audio Generation for Silent Video
by: Tateishi, Kazuya, et al.
Published: (2026)
RAP: Real-time Audio-driven Portrait Animation with Video Diffusion Transformer
by: Du, Fangyu, et al.
Published: (2025)
Diffusion Models as Masked Audio-Video Learners
by: Nunez, Elvis, et al.
Published: (2023)
VABench: A Comprehensive Benchmark for Audio-Video Generation
by: Hua, Daili, et al.
Published: (2025)
Mutual Forcing: Dual-Mode Self-Evolution for Fast Autoregressive Audio-Video Character Generation
by: Zhou, Yupeng, et al.
Published: (2026)
FoleyDirector: Fine-Grained Temporal Steering for Video-to-Audio Generation via Structured Scripts
by: Li, You, et al.
Published: (2026)
HunyuanVideo-Foley: Multimodal Diffusion with Representation Alignment for High-Fidelity Foley Audio Generation
by: Shan, Sizhe, et al.
Published: (2025)
Omni-Customizer: End-to-End MultiModal Customization for Joint Audio-Video Generation
by: Chen, Yuheng, et al.
Published: (2026)
OmniSonic: Towards Universal and Holistic Audio Generation from Video and Text
by: Pian, Weiguo, et al.
Published: (2026)
CMMD: Contrastive Multi-Modal Diffusion for Video-Audio Conditional Modeling
by: Yang, Ruihan, et al.
Published: (2023)
Video-to-Audio Generation with Hidden Alignment
by: Xu, Manjie, et al.
Published: (2024)
DeepAudio-V1: Towards Multi-Modal Multi-Stage End-to-End Video to Speech and Audio Generation
by: Zhang, Haomin, et al.
Published: (2025)
MMTrail: A Multimodal Trailer Video Dataset with Language and Music Descriptions
by: Chi, Xiaowei, et al.
Published: (2024)
Frieren: Efficient Video-to-Audio Generation Network with Rectified Flow Matching
by: Wang, Yongqi, et al.
Published: (2024)
READ: Real-time and Efficient Asynchronous Diffusion for Audio-driven Talking Head Generation
by: Wang, Haotian, et al.
Published: (2025)
ControlFoley: Unified and Controllable Video-to-Audio Generation with Cross-Modal Conflict Handling
by: Yang, Jianxuan, et al.
Published: (2026)
UniForm: A Unified Multi-Task Diffusion Transformer for Audio-Video Generation
by: Zhao, Lei, et al.
Published: (2025)
JavisDiT: Joint Audio-Video Diffusion Transformer with Hierarchical Spatio-Temporal Prior Synchronization
by: Liu, Kai, et al.
Published: (2025)
Do Joint Audio-Video Generation Models Understand Physics?
by: Cui, Zijun, et al.
Published: (2026)
Unison: Harmonizing Motion, Speech, and Sound for Human-Centric Audio-Video Generation
by: Cheng, Shihao, et al.
Published: (2026)
AudioGen-Omni: A Unified Multimodal Diffusion Transformer for Video-Synchronized Audio, Speech, and Song Generation
by: Wang, Le, et al.
Published: (2025)
Hierarchical Codec Diffusion for Video-to-Speech Generation
by: Ye, Jiaxin, et al.
Published: (2026)
TalkVerse: Democratizing Minute-Long Audio-Driven Video Generation
by: Wang, Zhenzhi, et al.
Published: (2025)
MACS: Multi-source Audio-to-image Generation with Contextual Significance and Semantic Alignment
by: Zhou, Hao, et al.
Published: (2025)
Tora3: Trajectory-Guided Audio-Video Generation with Physical Coherence
by: Liao, Junchao, et al.
Published: (2026)