| Main Authors: | Zhou, Feiyan, Wang, Luyuan, Chen, Shoufa, Wang, Zhe, Liu, Zhiheng, Cong, Yuren, Zhang, Xiaohui, Yang, Fanny, Zeng, Belinda |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2605.18749 |
Similar Items
Wav2Sem: Plug-and-Play Audio Semantic Decoupling for 3D Speech-Driven Facial Animation
by: Li, Hao, et al.
Published: (2025)
Semantic Audio-Visual Navigation in Continuous Environments
by: Zeng, Yichen, et al.
Published: (2026)
VABench: A Comprehensive Benchmark for Audio-Video Generation
by: Hua, Daili, et al.
Published: (2025)
CoGenAV: Versatile Audio-Visual Representation Learning via Contrastive-Generative Synchronization
by: Bai, Detao, et al.
Published: (2025)
MOVA: Towards Scalable and Synchronized Video-Audio Generation
by: OpenMOSS Team, et al.
Published: (2026)
Frieren: Efficient Video-to-Audio Generation Network with Rectified Flow Matching
by: Wang, Yongqi, et al.
Published: (2024)
Robust Audio-Visual Segmentation via Audio-Guided Visual Convergent Alignment
by: Liu, Chen, et al.
Published: (2025)
Synchronized Video-to-Audio Generation via Mel Quantization-Continuum Decomposition
by: Wang, Juncheng, et al.
Published: (2025)
Mutual Forcing: Dual-Mode Self-Evolution for Fast Autoregressive Audio-Video Character Generation
by: Zhou, Yupeng, et al.
Published: (2026)
Apollo: Unified Multi-Task Audio-Video Joint Generation
by: Wang, Jun, et al.
Published: (2026)
Semantics-Aware Human Motion Generation from Audio Instructions
by: Wang, Zi-An, et al.
Published: (2025)
MultiSoundGen: Video-to-Audio Generation for Multi-Event Scenarios via SlowFast Contrastive Audio-Visual Pretraining and Direct Preference Optimization
by: Yang, Jianxuan, et al.
Published: (2025)
FoleyDirector: Fine-Grained Temporal Steering for Video-to-Audio Generation via Structured Scripts
by: Li, You, et al.
Published: (2026)
AudioStory: Generating Long-Form Narrative Audio with Large Language Models
by: Guo, Yuxin, et al.
Published: (2025)
Semantic Noise Reduction via Teacher-Guided Dual-Path Audio-Visual Representation Learning
by: Wang, Linge, et al.
Published: (2026)
MMAudio-LABEL: Audio Event Labeling via Audio Generation for Silent Video
by: Tateishi, Kazuya, et al.
Published: (2026)
AV-Edit: Multimodal Generative Sound Effect Editing via Audio-Visual Semantic Joint Control
by: Guo, Xinyue, et al.
Published: (2025)
UniTalking: A Unified Audio-Video Framework for Talking Portrait Generation
by: Li, Hebeizi, et al.
Published: (2026)
OmniAudio: Generating Spatial Audio from 360-Degree Video
by: Liu, Huadai, et al.
Published: (2025)
TalkVerse: Democratizing Minute-Long Audio-Driven Video Generation
by: Wang, Zhenzhi, et al.
Published: (2025)
LongAV-Compass: Towards Unified Evaluation of Minute-Scale Audio-Visual Generation Across T2AV, I2AV, and V2AV
by: Liu, Tengfei, et al.
Published: (2026)
Omni-Customizer: End-to-End MultiModal Customization for Joint Audio-Video Generation
by: Chen, Yuheng, et al.
Published: (2026)
HunyuanVideo-Foley: Multimodal Diffusion with Representation Alignment for High-Fidelity Foley Audio Generation
by: Shan, Sizhe, et al.
Published: (2025)
Tora3: Trajectory-Guided Audio-Video Generation with Physical Coherence
by: Liao, Junchao, et al.
Published: (2026)
SINGER: Vivid Audio-driven Singing Video Generation with Multi-scale Spectral Diffusion Model
by: Li, Yan, et al.
Published: (2024)
FlowPortrait: Reinforcement Learning for Audio-Driven Portrait Video Generation
by: Tan, Weiting, et al.
Published: (2026)
ControlFoley: Unified and Controllable Video-to-Audio Generation with Cross-Modal Conflict Handling
by: Yang, Jianxuan, et al.
Published: (2026)
OmniSonic: Towards Universal and Holistic Audio Generation from Video and Text
by: Pian, Weiguo, et al.
Published: (2026)
JavisDiT++: Unified Modeling and Optimization for Joint Audio-Video Generation
by: Liu, Kai, et al.
Published: (2026)
Hierarchical Codec Diffusion for Video-to-Speech Generation
by: Ye, Jiaxin, et al.
Published: (2026)
End-to-end Audio Deepfake Detection from RAW Waveforms: a RawNet-Based Approach with Cross-Dataset Evaluation
by: Di Pierno, Andrea, et al.
Published: (2025)
JAM-Flow: Joint Audio-Motion Synthesis with Flow Matching
by: Kwon, Mingi, et al.
Published: (2025)
ThinkSound: Chain-of-Thought Reasoning in Multimodal Large Language Models for Audio Generation and Editing
by: Liu, Huadai, et al.
Published: (2025)
WAVE: Learning Unified & Versatile Audio-Visual Embeddings with Multimodal LLM
by: Tang, Changli, et al.
Published: (2025)
VAEmo: Efficient Representation Learning for Visual-Audio Emotion with Knowledge Injection
by: Cheng, Hao, et al.
Published: (2025)
Audio-Guided Visual Perception for Audio-Visual Navigation
by: Wang, Yi, et al.
Published: (2025)
Dynamic Derivation and Elimination: Audio Visual Segmentation with Enhanced Audio Semantics
by: Liu, Chen, et al.
Published: (2025)
AVBench: Human-Aligned and Automated Evaluation Benchmark for Audio-Video Generative Models
by: Yang, Jialiang, et al.
Published: (2026)
READ: Real-time and Efficient Asynchronous Diffusion for Audio-driven Talking Head Generation
by: Wang, Haotian, et al.
Published: (2025)
Dual Audio-Centric Modality Coupling for Talking Head Generation
by: Fu, Ao, et al.
Published: (2025)