| Main Authors: | Liang, Jinhua, Chen, Yuanzhe, Yuan, Yi, Jia, Dongya, Zhuang, Xiaobin, Chen, Zhuo, Wang, Yuping, Wang, Yuxuan |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2505.16076 |
Similar Items
Sound-VECaps: Improving Audio Generation with Visual Enhanced Captions
by: Yuan, Yi, et al.
Published: (2024)
Sounding that Object: Interactive Object-Aware Image to Audio Generation
by: Li, Tingle, et al.
Published: (2025)
DiSTAR: Diffusion over a Scalable Token Autoregressive Representation for Speech Generation
by: Song, Yakun, et al.
Published: (2025)
T-CLAP: Temporal-Enhanced Contrastive Language-Audio Pretraining
by: Yuan, Yi, et al.
Published: (2024)
MagiCodec: Simple Masked Gaussian-Injected Codec for High-Fidelity Reconstruction and Generation
by: Song, Yakun, et al.
Published: (2025)
DiTAR: Diffusion Transformer Autoregressive Modeling for Speech Generation
by: Jia, Dongya, et al.
Published: (2025)
Towards Reliable Large Audio Language Model
by: Ma, Ziyang, et al.
Published: (2025)
Direct Preference Optimization for Speech Autoregressive Diffusion Models
by: Liu, Zhijun, et al.
Published: (2025)
StreamVoice: Streamable Context-Aware Language Modeling for Real-time Zero-Shot Voice Conversion
by: Wang, Zhichao, et al.
Published: (2024)
StreamVoice+: Evolving into End-to-end Streaming Zero-shot Voice Conversion
by: Wang, Zhichao, et al.
Published: (2024)
DreamAudio: Customized Text-to-Audio Generation with Diffusion Models
by: Yuan, Yi, et al.
Published: (2025)
Audio-CoT: Exploring Chain-of-Thought Reasoning in Large Audio Language Model
by: Ma, Ziyang, et al.
Published: (2025)
VoiceShop: A Unified Speech-to-Speech Framework for Identity-Preserving Zero-Shot Voice Editing
by: Anastassiou, Philip, et al.
Published: (2024)
From Audio Encoders to Piano Judges: Benchmarking Performance Understanding for Solo Piano
by: Zhang, Huan, et al.
Published: (2024)
Multi-level Temporal-channel Speaker Retrieval for Zero-shot Voice Conversion
by: Wang, Zhichao, et al.
Published: (2023)
Leveraging Pre-trained AudioLDM for Sound Generation: A Benchmark Study
by: Yuan, Yi, et al.
Published: (2023)
AudioEditor: A Training-Free Diffusion-Based Audio Editing Framework
by: Jia, Yuhang, et al.
Published: (2024)
Scaling up masked audio encoder learning for general audio classification
by: Dinkel, Heinrich, et al.
Published: (2024)
AudioLDM 2: Learning Holistic Audio Generation with Self-supervised Pretraining
by: Liu, Haohe, et al.
Published: (2023)
Measuring Audio's Impact on Correctness: Audio-Contribution-Aware Post-Training of Large Audio Language Models
by: He, Haolin, et al.
Published: (2025)
ImmersiveFlow: Stereo-to-7.1.4 spatial audio generation with flow matching
by: Liang, Zining, et al.
Published: (2026)
Seed-TTS: A Family of High-Quality Versatile Speech Generation Models
by: Anastassiou, Philip, et al.
Published: (2024)
Towards audio language modeling -- an overview
by: Wu, Haibin, et al.
Published: (2024)
TADA: Training-free Attribution and Out-of-Domain Detection of Audio Deepfakes
by: Stan, Adriana, et al.
Published: (2025)
WavCraft: Audio Editing and Generation with Large Language Models
by: Liang, Jinhua, et al.
Published: (2024)
MMAR: A Challenging Benchmark for Deep Reasoning in Speech, Audio, Music, and Their Mix
by: Ma, Ziyang, et al.
Published: (2025)
FxSearcher: gradient-free text-driven audio transformation
by: Ki, Hojoon, et al.
Published: (2025)
U-SAM: An audio language Model for Unified Speech, Audio, and Music Understanding
by: Wang, Ziqian, et al.
Published: (2025)
Audio Dialogues: Dialogues dataset for audio and music understanding
by: Goel, Arushi, et al.
Published: (2024)
audio2chart: End to End Audio Transcription into playable Guitar Hero charts
by: Tripodi, Riccardo
Published: (2025)
Online incremental learning for audio classification using a pretrained audio model
by: Mulimani, Manjunath, et al.
Published: (2025)
Joint Multi-scale Cross-lingual Speaking Style Transfer with Bidirectional Attention Mechanism for Automatic Dubbing
by: Li, Jingbei, et al.
Published: (2023)
FreeAudio: Training-Free Timing Planning for Controllable Long-Form Text-to-Audio Generation
by: Jiang, Yuxuan, et al.
Published: (2025)
Recomposer: Event-roll-guided generative audio editing
by: Ellis, Daniel P. W., et al.
Published: (2025)
AudioRepInceptionNeXt: A lightweight single-stream architecture for efficient audio recognition
by: Lau, Kin Wai, et al.
Published: (2024)
Combining audio control and style transfer using latent diffusion
by: Demerlé, Nils, et al.
Published: (2024)
EDTC: enhance depth of text comprehension in automated audio captioning
by: Tan, Liwen, et al.
Published: (2024)
From Aesthetics to Human Preferences: Comparative Perspectives of Evaluating Text-to-Music Systems
by: Zhang, Huan, et al.
Published: (2025)
AudioSetCaps: An Enriched Audio-Caption Dataset using Automated Generation Pipeline with Large Audio and Language Models
by: Bai, Jisheng, et al.
Published: (2024)
Positive and negative sampling strategies for self-supervised learning on audio-video data
by: Wang, Shanshan, et al.
Published: (2024)