| Main Authors: | Guan, Kaisi, Wang, Xihua, Lai, Zhengfeng, Cheng, Xin, Zhang, Peng, Liu, XiaoJiang, Song, Ruihua, Cao, Meng |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Online Access: | https://arxiv.org/abs/2510.03117 |
Similar Items
VSSFlow: Unifying Video-conditioned Sound and Speech Generation via Joint Learning
by: Cheng, Xin, et al.
Published: (2025)
LoVA: Long-form Video-to-Audio Generation
by: Cheng, Xin, et al.
Published: (2024)
ETVA: Evaluation of Text-to-Video Alignment via Fine-grained Question Generation and Answering
by: Guan, Kaisi, et al.
Published: (2025)
SyncDPO: Enhancing Temporal Synchronization in Video-Audio Joint Generation via Preference Learning
by: Cheng, Xin, et al.
Published: (2026)
HuM-Eval: A Coarse-to-Fine Framework for Human-Centric Video Evaluation
by: Zhang, Bingzi, et al.
Published: (2026)
MMAudioSep: Taming Video-to-Audio Generative Model Towards Video/Text-Queried Sound Separation
by: Takahashi, Akira, et al.
Published: (2025)
3MDiT: Unified Tri-Modal Diffusion Transformer for Text-Driven Synchronized Audio-Video Generation
by: Li, Yaoru, et al.
Published: (2025)
SpeechComposer: Unifying Multiple Speech Tasks with Prompt Composition
by: Wu, Yihan, et al.
Published: (2024)
Omni2Sound: Towards Unified Video-Text-to-Audio Generation
by: Dai, Yusheng, et al.
Published: (2026)
SoundLoCD: An Efficient Conditional Discrete Contrastive Latent Diffusion Model for Text-to-Sound Generation
by: Niu, Xinlei, et al.
Published: (2024)
Detect Any Sound: Open-Vocabulary Sound Event Detection with Multi-Modal Queries
by: Cai, Pengfei, et al.
Published: (2025)
ControlFoley: Unified and Controllable Video-to-Audio Generation with Cross-Modal Conflict Handling
by: Yang, Jianxuan, et al.
Published: (2026)
Improving Audio-Text Retrieval via Hierarchical Cross-Modal Interaction and Auxiliary Captions
by: Xin, Yifei, et al.
Published: (2023)
BTS: Bridging Text and Sound Modalities for Metadata-Aided Respiratory Sound Classification
by: Kim, June-Woo, et al.
Published: (2024)
SoundCTM: Unifying Score-based and Consistency Models for Full-band Text-to-Sound Generation
by: Saito, Koichi, et al.
Published: (2024)
BSharedRAG: Backbone Shared Retrieval-Augmented Generation for the E-commerce Domain
by: Guan, Kaisi, et al.
Published: (2024)
Instrumental Text-to-Music Generation with Auxiliary Conditioning Branches
by: Koh, Junyoung
Published: (2026)
Read, Watch and Scream! Sound Generation from Text and Video
by: Jeong, Yujin, et al.
Published: (2024)
Noise-Robust Sound Event Detection and Counting via Language-Queried Sound Separation
by: Chen, Yuanjian, et al.
Published: (2025)
AudioSpa: Spatializing Sound Events with Text
by: Feng, Linfeng, et al.
Published: (2025)
Video-Foley: Two-Stage Video-To-Sound Generation via Temporal Event Condition For Foley Sound
by: Lee, Junwon, et al.
Published: (2024)
Unison: Harmonizing Motion, Speech, and Sound for Human-Centric Audio-Video Generation
by: Cheng, Shihao, et al.
Published: (2026)
ESTVocoder: An Excitation-Spectral-Transformed Neural Vocoder Conditioned on Mel Spectrogram
by: Jiang, Xiao-Hang, et al.
Published: (2024)
Hierarchical Codec Diffusion for Video-to-Speech Generation
by: Ye, Jiaxin, et al.
Published: (2026)
Sound Sparks Motion: Audio and Text Tuning for Video Editing
by: Razlighi, AmirHossein Naghi, et al.
Published: (2026)
AudioGS: Spectrogram-Based Audio Gaussian Splatting for Sound Field Reconstruction
by: Bi, Chunhao, et al.
Published: (2026)
Intelligent Text-Conditioned Music Generation
by: Xie, Zhouyao, et al.
Published: (2024)
Domain Adaptation Method and Modality Gap Impact in Audio-Text Models for Prototypical Sound Classification
by: Acevedo, Emiliano, et al.
Published: (2025)
Leveraging LLM and Text-Queried Separation for Noise-Robust Sound Event Detection
by: Yin, Han, et al.
Published: (2024)
DiffSound: Differentiable Modal Sound Rendering and Inverse Rendering for Diverse Inference Tasks
by: Jin, Xutong, et al.
Published: (2024)
VinTAGe: Joint Video and Text Conditioning for Holistic Audio Generation
by: Kushwaha, Saksham Singh, et al.
Published: (2024)
Diffusion based Text-to-Music Generation with Global and Local Text based Conditioning
by: Zhang, Jisi, et al.
Published: (2025)
Contrastive Loss Based Frame-wise Feature disentanglement for Polyphonic Sound Event Detection
by: Guan, Yadong, et al.
Published: (2024)
Disentangling Hierarchical Features for Anomalous Sound Detection Under Domain Shift
by: Guan, Jian, et al.
Published: (2025)
First-Shot Unsupervised Anomalous Sound Detection With Unknown Anomalies Estimated by Metadata-Assisted Audio Generation
by: Zhang, Hejing, et al.
Published: (2023)
Action2Sound: Ambient-Aware Generation of Action Sounds from Egocentric Videos
by: Chen, Changan, et al.
Published: (2024)
Exploring Text-Queried Sound Event Detection with Audio Source Separation
by: Yin, Han, et al.
Published: (2024)
Video Echoed in Music: Semantic, Temporal, and Rhythmic Alignment for Video-to-Music Generation
by: Tong, Xinyi, et al.
Published: (2025)
TW-Sound580K: A Regional Audio-Text Dataset with Verification-Guided Curation for Localized Audio-Language Modeling
by: Xie, Hao-Hui, et al.
Published: (2026)
HDA-SELD: Hierarchical Cross-Modal Distillation with Multi-Level Data Augmentation for Low-Resource Audio-Visual Sound Event Localization and Detection
by: Wang, Qing, et al.
Published: (2025)