| Main Authors: | Li, Xiquan, Xu, Xuenan, Ma, Ziyang, Chen, Wenxi, He, Haolin, Kong, Qiuqiang, Chen, Xie |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2604.01155 |
Similar Items
DRCap: Decoding CLAP Latents with Retrieval-Augmented Generation for Zero-shot Audio Captioning
by: Li, Xiquan, et al.
Published: (2024)
Resonate: Reinforcing Text-to-Audio Generation via Online Feedback from Large Audio Language Models
by: Li, Xiquan, et al.
Published: (2026)
SLAM-AAC: Enhancing Audio Captioning with Paraphrasing Augmentation and CLAP-Refine through LLMs
by: Chen, Wenxi, et al.
Published: (2024)
SemanticAudio: Audio Generation and Editing in Semantic Space
by: Dai, Zheqi, et al.
Published: (2026)
Audio ControlNet for Fine-Grained Audio Generation and Editing
by: Zhu, Haina, et al.
Published: (2026)
Towards Weakly Supervised Text-to-Audio Grounding
by: Xu, Xuenan, et al.
Published: (2024)
MeanAudio: Fast and Faithful Text-to-Audio Generation with Mean Flows
by: Li, Xiquan, et al.
Published: (2025)
EAT: Self-Supervised Pre-Training with Efficient Audio Transformer
by: Chen, Wenxi, et al.
Published: (2024)
Towards Reliable Large Audio Language Model
by: Ma, Ziyang, et al.
Published: (2025)
Piano Transcription by Hierarchical Language Modeling with Pretrained Roll-based Encoders
by: Li, Dichucheng, et al.
Published: (2025)
AudioComposer: Towards Fine-grained Audio Generation with Natural Language Descriptions
by: Wang, Yuanyuan, et al.
Published: (2024)
T-CLAP: Temporal-Enhanced Contrastive Language-Audio Pretraining
by: Yuan, Yi, et al.
Published: (2024)
Video-to-Audio Generation with Fine-grained Temporal Semantics
by: Hu, Yuchen, et al.
Published: (2024)
TinyMU: A Compact Audio-Language Model for Music Understanding
by: Li, Xiquan, et al.
Published: (2026)
AuDirector: A Self-Reflective Closed-Loop Framework for Immersive Audio Storytelling
by: Ren, Yiming, et al.
Published: (2026)
EmoBox: Multilingual Multi-corpus Speech Emotion Recognition Toolkit and Benchmark
by: Ma, Ziyang, et al.
Published: (2024)
Training-Free Multi-Step Audio Source Separation
by: Zang, Yongyi, et al.
Published: (2025)
ProLAP: Probabilistic Language-Audio Pre-Training
by: Manabe, Toranosuke, et al.
Published: (2025)
GSound-SIR: A Spatial Impulse Response Ray-Tracing and High-order Ambisonic Auralization Python Toolkit
by: Zang, Yongyi, et al.
Published: (2025)
FusionAudio-1.2M: Towards Fine-grained Audio Captioning with Multimodal Contextual Fusion
by: Chen, Shunian, et al.
Published: (2025)
SAC: Neural Speech Codec with Semantic-Acoustic Dual-Stream Quantization
by: Chen, Wenxi, et al.
Published: (2025)
Universal Sound Separation with Self-Supervised Audio Masked Autoencoder
by: Zhao, Junqi, et al.
Published: (2024)
SLAM-LLM: A Modular, Open-Source Multimodal Large Language Model Framework and Best Practice for Speech, Language, Audio and Music Processing
by: Ma, Ziyang, et al.
Published: (2026)
Enhance Temporal Relations in Audio Captioning with Sound Event Detection
by: Xie, Zeyu, et al.
Published: (2023)
MMEDIT: A Unified Framework for Multi-Type Audio Editing via Audio Language Model
by: Tao, Ye, et al.
Published: (2025)
Improving Audio Question Answering with Variational Inference
by: Chen, Haolin
Published: (2026)
AudioTime: A Temporally-aligned Audio-text Benchmark Dataset
by: Xie, Zeyu, et al.
Published: (2024)
Improving Audio Captioning Models with Fine-grained Audio Features, Text Embedding Supervision, and LLM Mix-up Augmentation
by: Wu, Shih-Lun, et al.
Published: (2023)
Can Audio Large Language Models Verify Speaker Identity?
by: Ren, Yiming, et al.
Published: (2025)
Audio-CoT: Exploring Chain-of-Thought Reasoning in Large Audio Language Model
by: Ma, Ziyang, et al.
Published: (2025)
Auto-ACD: A Large-scale Dataset for Audio-Language Representation Learning
by: Sun, Luoyi, et al.
Published: (2023)
AudioLDM 2: Learning Holistic Audio Generation with Self-supervised Pretraining
by: Liu, Haohe, et al.
Published: (2023)
PicoAudio: Enabling Precise Timestamp and Frequency Controllability of Audio Events in Text-to-audio Generation
by: Xie, Zeyu, et al.
Published: (2024)
Enhancing Audio Generation Diversity with Visual Information
by: Xie, Zeyu, et al.
Published: (2024)
BLAT: Bootstrapping Language-Audio Pre-training based on AudioSet Tag-guided Synthetic Data
by: Xu, Xuenan, et al.
Published: (2023)
PicoAudio2: Temporal Controllable Text-to-Audio Generation with Natural Language Description
by: Zheng, Zihao, et al.
Published: (2025)
Accessible Fine-grained Data Representation via Spatial Audio
by: Liu, Can, et al.
Published: (2026)
Towards Fine-grained Temporal Perception: Post-Training Large Audio-Language Models with Audio-Side Time Prompt
by: Shi, Yanfeng, et al.
Published: (2026)
SpotSound: Enhancing Large Audio-Language Models with Fine-Grained Temporal Grounding
by: Sun, Luoyi, et al.
Published: (2026)
Language-Queried Target Sound Extraction Without Parallel Training Data
by: Ma, Hao, et al.
Published: (2024)