| Main Authors: | Tian, Zeyue, Yang, Binxin, Liu, Zhaoyang, Zhang, Jiexuan, Yuan, Ruibin, Yin, Hubery, Chen, Qifeng, Li, Chen, Lyu, Jing, Xue, Wei, Guo, Yike |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2604.10708 |
Similar Items
AudioX: A Unified Framework for Anything-to-Audio Generation
by: Tian, Zeyue, et al.
Published: (2025)
VidMuse: A Simple Video-to-Music Generation Framework with Long-Short-Term Modeling
by: Tian, Zeyue, et al.
Published: (2024)
LLMs Meet Multimodal Generation and Editing: A Survey
by: He, Yingqing, et al.
Published: (2024)
VAInpaint: Zero-Shot Video-Audio inpainting framework with LLMs-driven Module
by: Wu, Kam Man, et al.
Published: (2025)
Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners
by: Xing, Yazhou, et al.
Published: (2024)
OmniForcing: Unleashing Real-time Joint Audio-Visual Generation
by: Su, Yaofeng, et al.
Published: (2026)
AV-EMO-Reasoning: Benchmarking Emotional Reasoning Capabilities in Omni-modal LLMS with Audio-visual Cues
by: Zhou, Dingkun, et al.
Published: (2025)
Audio-FLAN: A Preliminary Release
by: Xue, Liumeng, et al.
Published: (2025)
OmniSonic: Towards Universal and Holistic Audio Generation from Video and Text
by: Pian, Weiguo, et al.
Published: (2026)
MMTrail: A Multimodal Trailer Video Dataset with Language and Music Descriptions
by: Chi, Xiaowei, et al.
Published: (2024)
Audio ControlNet for Fine-Grained Audio Generation and Editing
by: Zhu, Haina, et al.
Published: (2026)
OmniCustom: Sync Audio-Video Customization Via Joint Audio-Video Generation Model
by: Li, Maomao, et al.
Published: (2026)
Coherent Audio-Visual Editing via Conditional Audio Generation Following Video Edits
by: Ishii, Masato, et al.
Published: (2025)
VERSA: A Versatile Evaluation Toolkit for Speech, Audio, and Music
by: Shi, Jiatong, et al.
Published: (2024)
Omni2Sound: Towards Unified Video-Text-to-Audio Generation
by: Dai, Yusheng, et al.
Published: (2026)
Listen, Pause, and Reason: Toward Perception-Grounded Hybrid Reasoning for Audio Understanding
by: Wang, Jieyi, et al.
Published: (2026)
Omni-Customizer: End-to-End MultiModal Customization for Joint Audio-Video Generation
by: Chen, Yuheng, et al.
Published: (2026)
V2A-DPO: Omni-Preference Optimization for Video-to-Audio Generation
by: Chan, Nolan, et al.
Published: (2026)
Can LLMs "Reason" in Music? An Evaluation of LLMs' Capability of Music Understanding and Generation
by: Zhou, Ziya, et al.
Published: (2024)
Generative Audio Extension and Morphing
by: Seetharaman, Prem, et al.
Published: (2026)
MixFake: Benchmarking and Enhancing Audio Deepfake Detection in Diverse Real-world Mixed Audio
by: Li, Qingcao, et al.
Published: (2026)
MTAVG-Bench: A Diagnostic Benchmark for Multi-Talker Dialogue-Centric Audio-Video Generation
by: Zhou, Yang-Hao, et al.
Published: (2026)
XGC-AVis: Towards Audio-Visual Content Understanding with a Multi-Agent Collaborative System
by: Cao, Yuqin, et al.
Published: (2025)
Variable-Length Audio Fingerprinting
by: Chen, Hongjie, et al.
Published: (2026)
AudioGen-Omni: A Unified Multimodal Diffusion Transformer for Video-Synchronized Audio, Speech, and Song Generation
by: Wang, Le, et al.
Published: (2025)
MoTAS: MoE-Guided Feature Selection from TTS-Augmented Speech for Enhanced Multimodal Alzheimer's Early Screening
by: Shao, Yongqi, et al.
Published: (2025)
Zero-Shot Fake Video Detection by Audio-Visual Consistency
by: Li, Xiaolou, et al.
Published: (2024)
AC-Foley: Reference-Audio-Guided Video-to-Audio Synthesis with Acoustic Transfer
by: Fang, Pengjun, et al.
Published: (2026)
SLAM-LLM: A Modular, Open-Source Multimodal Large Language Model Framework and Best Practice for Speech, Language, Audio and Music Processing
by: Ma, Ziyang, et al.
Published: (2026)
FastSAG: Towards Fast Non-Autoregressive Singing Accompaniment Generation
by: Chen, Jianyi, et al.
Published: (2024)
Audio-Visual Separation with Hierarchical Fusion and Representation Alignment
by: Hu, Han, et al.
Published: (2025)
MusiCRS: Benchmarking Audio-Centric Conversational Recommendation
by: Surana, Rohan, et al.
Published: (2025)
Enkidu: Universal Frequential Perturbation for Real-Time Audio Privacy Protection against Voice Deepfakes
by: Feng, Zhou, et al.
Published: (2025)
The Interspeech 2026 Audio Reasoning Challenge: Evaluating Reasoning Process Quality for Audio Reasoning Models and Agents
by: Ma, Ziyang, et al.
Published: (2026)
Manipulated Regions Localization For Partially Deepfake Audio: A Survey
by: He, Jiayi, et al.
Published: (2025)
Trusted Fake Audio Detection Based on Dirichlet Distribution
by: Ding, Chi, et al.
Published: (2025)
Audio-CoT: Exploring Chain-of-Thought Reasoning in Large Audio Language Model
by: Ma, Ziyang, et al.
Published: (2025)
Nexus: An Omni-Perceptive And -Interactive Model for Language, Audio, And Vision
by: Liu, Che, et al.
Published: (2025)
Sound-VECaps: Improving Audio Generation with Visual Enhanced Captions
by: Yuan, Yi, et al.
Published: (2024)
AV-Edit: Multimodal Generative Sound Effect Editing via Audio-Visual Semantic Joint Control
by: Guo, Xinyue, et al.
Published: (2025)