Saved in:
| Main Authors: | Wang, Zi-An, Zou, Shihao, Yu, Shiyao, Zhang, Mingyuan, Dong, Chao |
|---|---|
| Format: | Preprint |
| Published: |
2025
|
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2505.23465 |
| Tags: |
Add Tag
No Tags, Be the first to tag this record!
|
Similar Items
Unison: Harmonizing Motion, Speech, and Sound for Human-Centric Audio-Video Generation
by: Cheng, Shihao, et al.
Published: (2026)
by: Cheng, Shihao, et al.
Published: (2026)
InterActHuman: Multi-Concept Human Animation with Layout-Aligned Audio Conditions
by: Wang, Zhenzhi, et al.
Published: (2025)
by: Wang, Zhenzhi, et al.
Published: (2025)
Multi-Modal Motion Retrieval by Learning a Fine-Grained Joint Embedding Space
by: Yu, Shiyao, et al.
Published: (2025)
by: Yu, Shiyao, et al.
Published: (2025)
Semantic Audio-Visual Navigation in Continuous Environments
by: Zeng, Yichen, et al.
Published: (2026)
by: Zeng, Yichen, et al.
Published: (2026)
AudioStory: Generating Long-Form Narrative Audio with Large Language Models
by: Guo, Yuxin, et al.
Published: (2025)
by: Guo, Yuxin, et al.
Published: (2025)
AV-Edit: Multimodal Generative Sound Effect Editing via Audio-Visual Semantic Joint Control
by: Guo, Xinyue, et al.
Published: (2025)
by: Guo, Xinyue, et al.
Published: (2025)
WavFlow: Audio Generation in Waveform Space
by: Zhou, Feiyan, et al.
Published: (2026)
by: Zhou, Feiyan, et al.
Published: (2026)
WAVE: Learning Unified & Versatile Audio-Visual Embeddings with Multimodal LLM
by: Tang, Changli, et al.
Published: (2025)
by: Tang, Changli, et al.
Published: (2025)
Dynamic Derivation and Elimination: Audio Visual Segmentation with Enhanced Audio Semantics
by: Liu, Chen, et al.
Published: (2025)
by: Liu, Chen, et al.
Published: (2025)
Semantic Noise Reduction via Teacher-Guided Dual-Path Audio-Visual Representation Learning
by: Wang, Linge, et al.
Published: (2026)
by: Wang, Linge, et al.
Published: (2026)
VABench: A Comprehensive Benchmark for Audio-Video Generation
by: Hua, Daili, et al.
Published: (2025)
by: Hua, Daili, et al.
Published: (2025)
MOVA: Towards Scalable and Synchronized Video-Audio Generation
by: OpenMOSS Team, et al.
Published: (2026)
by: OpenMOSS Team, et al.
Published: (2026)
Robust Audio-Visual Segmentation via Audio-Guided Visual Convergent Alignment
by: Liu, Chen, et al.
Published: (2025)
by: Liu, Chen, et al.
Published: (2025)
MMAudio-LABEL: Audio Event Labeling via Audio Generation for Silent Video
by: Tateishi, Kazuya, et al.
Published: (2026)
by: Tateishi, Kazuya, et al.
Published: (2026)
Synchronized Video-to-Audio Generation via Mel Quantization-Continuum Decomposition
by: Wang, Juncheng, et al.
Published: (2025)
by: Wang, Juncheng, et al.
Published: (2025)
PAVAS: Physics-Aware Video-to-Audio Synthesis
by: Hyun-Bin, Oh, et al.
Published: (2025)
by: Hyun-Bin, Oh, et al.
Published: (2025)
Leveraging Audio Representations for Vibration-Based Crowd Monitoring in Stadiums
by: Chang, Yen Cheng, et al.
Published: (2025)
by: Chang, Yen Cheng, et al.
Published: (2025)
DDAVS: Disentangled Audio Semantics and Delayed Bidirectional Alignment for Audio-Visual Segmentation
by: Tian, Jingqi, et al.
Published: (2025)
by: Tian, Jingqi, et al.
Published: (2025)
Mutual Forcing: Dual-Mode Self-Evolution for Fast Autoregressive Audio-Video Character Generation
by: Zhou, Yupeng, et al.
Published: (2026)
by: Zhou, Yupeng, et al.
Published: (2026)
Draw an Audio: Leveraging Multi-Instruction for Video-to-Audio Synthesis
by: Yang, Qi, et al.
Published: (2024)
by: Yang, Qi, et al.
Published: (2024)
Hierarchical Semantic Correlation-Aware Masked Autoencoder for Unsupervised Audio-Visual Representation Learning
by: Zeng, Donghuo, et al.
Published: (2026)
by: Zeng, Donghuo, et al.
Published: (2026)
Tora3: Trajectory-Guided Audio-Video Generation with Physical Coherence
by: Liao, Junchao, et al.
Published: (2026)
by: Liao, Junchao, et al.
Published: (2026)
Efficient Audio-Visual Speech Separation with Discrete Lip Semantics and Multi-Scale Global-Local Attention
by: Li, Kai, et al.
Published: (2025)
by: Li, Kai, et al.
Published: (2025)
Sound Sparks Motion: Audio and Text Tuning for Video Editing
by: Razlighi, AmirHossein Naghi, et al.
Published: (2026)
by: Razlighi, AmirHossein Naghi, et al.
Published: (2026)
OmniAudio: Generating Spatial Audio from 360-Degree Video
by: Liu, Huadai, et al.
Published: (2025)
by: Liu, Huadai, et al.
Published: (2025)
Video-to-Audio Generation with Hidden Alignment
by: Xu, Manjie, et al.
Published: (2024)
by: Xu, Manjie, et al.
Published: (2024)
MultiSoundGen: Video-to-Audio Generation for Multi-Event Scenarios via SlowFast Contrastive Audio-Visual Pretraining and Direct Preference Optimization
by: Yang, Jianxuan, et al.
Published: (2025)
by: Yang, Jianxuan, et al.
Published: (2025)
Omni-Customizer: End-to-End MultiModal Customization for Joint Audio-Video Generation
by: Chen, Yuheng, et al.
Published: (2026)
by: Chen, Yuheng, et al.
Published: (2026)
Hallo-Live: Real-Time Streaming Joint Audio-Video Avatar Generation with Asynchronous Dual-Stream and Human-Centric Preference Distillation
by: Li, Chunyu, et al.
Published: (2026)
by: Li, Chunyu, et al.
Published: (2026)
JavisDiT++: Unified Modeling and Optimization for Joint Audio-Video Generation
by: Liu, Kai, et al.
Published: (2026)
by: Liu, Kai, et al.
Published: (2026)
OmniSonic: Towards Universal and Holistic Audio Generation from Video and Text
by: Pian, Weiguo, et al.
Published: (2026)
by: Pian, Weiguo, et al.
Published: (2026)
AVBench: Human-Aligned and Automated Evaluation Benchmark for Audio-Video Generative Models
by: Yang, Jialiang, et al.
Published: (2026)
by: Yang, Jialiang, et al.
Published: (2026)
Rethinking Audio-Visual Adversarial Vulnerability from Temporal and Modality Perspectives
by: Zhang, Zeliang, et al.
Published: (2025)
by: Zhang, Zeliang, et al.
Published: (2025)
Speech Audio Generation from dynamic MRI via a Knowledge Enhanced Conditional Variational Autoencoder
by: Li, Yaxuan, et al.
Published: (2025)
by: Li, Yaxuan, et al.
Published: (2025)
LongAV-Compass: Towards Unified Evaluation of Minute-Scale Audio-Visual Generation Across T2AV, I2AV, and V2AV
by: Liu, Tengfei, et al.
Published: (2026)
by: Liu, Tengfei, et al.
Published: (2026)
MIRRORTALK: Forging Personalized Avatars Via Disentangled Style and Hierarchical Motion Control
by: Lu, Renjie, et al.
Published: (2026)
by: Lu, Renjie, et al.
Published: (2026)
TurboTalk: Progressive Distillation for One-Step Audio-Driven Talking Avatar Generation
by: Liu, Xiangyu, et al.
Published: (2026)
by: Liu, Xiangyu, et al.
Published: (2026)
video-SALMONN 2: Caption-Enhanced Audio-Visual Large Language Models
by: Tang, Changli, et al.
Published: (2025)
by: Tang, Changli, et al.
Published: (2025)
Gotta Hear Them All: Towards Sound Source Aware Audio Generation
by: Guo, Wei, et al.
Published: (2024)
by: Guo, Wei, et al.
Published: (2024)
ControlFoley: Unified and Controllable Video-to-Audio Generation with Cross-Modal Conflict Handling
by: Yang, Jianxuan, et al.
Published: (2026)
by: Yang, Jianxuan, et al.
Published: (2026)
Similar Items
-
Unison: Harmonizing Motion, Speech, and Sound for Human-Centric Audio-Video Generation
by: Cheng, Shihao, et al.
Published: (2026) -
InterActHuman: Multi-Concept Human Animation with Layout-Aligned Audio Conditions
by: Wang, Zhenzhi, et al.
Published: (2025) -
Multi-Modal Motion Retrieval by Learning a Fine-Grained Joint Embedding Space
by: Yu, Shiyao, et al.
Published: (2025) -
Semantic Audio-Visual Navigation in Continuous Environments
by: Zeng, Yichen, et al.
Published: (2026) -
AudioStory: Generating Long-Form Narrative Audio with Large Language Models
by: Guo, Yuxin, et al.
Published: (2025)