| Main Authors: | Yang, Dongchao, Wang, Yuanyuan, Chong, Dading, Liu, Songxiang, Wu, Xixin, Meng, Helen |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2602.04683 |
Similar Items
UniAudio: An Audio Foundation Model Toward Universal Audio Generation
by: Yang, Dongchao, et al.
Published: (2023)
UniAudio 1.5: Large Language Model-driven Audio Codec is A Few-shot Audio Task Learner
by: Yang, Dongchao, et al.
Published: (2024)
ALMTokenizer: A Low-bitrate and Semantic-rich Audio Codec Tokenizer for Audio Language Modeling
by: Yang, Dongchao, et al.
Published: (2025)
SimpleSpeech 2: Towards Simple and Efficient Text-to-Speech with Flow-based Scalar Latent Transformer Diffusion Models
by: Yang, Dongchao, et al.
Published: (2024)
UniSep: Universal Target Audio Separation with Language Models at Scale
by: Wang, Yuanyuan, et al.
Published: (2025)
AudioComposer: Towards Fine-grained Audio Generation with Natural Language Descriptions
by: Wang, Yuanyuan, et al.
Published: (2024)
Ming-UniAudio: Speech LLM for Joint Understanding, Generation and Editing with Unified Representation
by: Yan, Canxiang, et al.
Published: (2025)
UniSRM: A Unified Speech Reward Model for Reasoning-Based Fine-grained Assessment
by: Wang, Yuanyuan, et al.
Published: (2026)
DualSpeechLM: Towards Unified Speech Understanding and Generation via Dual Speech Token Modeling with Large Language Models
by: Wang, Yuanyuan, et al.
Published: (2025)
UniTok-Audio: A Unified Audio Generation Framework via Generative Modeling on Discrete Codec Tokens
by: Liu, Chengwei, et al.
Published: (2025)
Omni-AutoThink: Adaptive Multimodal Reasoning via Reinforcement Learning
by: Yang, Dongchao, et al.
Published: (2025)
UniFlow-Audio: Unified Flow Matching for Audio Generation from Omni-Modalities
by: Xu, Xuenan, et al.
Published: (2025)
Speaking from Coarse to Fine: Improving Neural Codec Language Model via Multi-Scale Speech Coding and Generation
by: Guo, Haohan, et al.
Published: (2024)
ELEGANCE: Efficient LLM Guidance for Audio-Visual Target Speech Extraction
by: Wu, Wenxuan, et al.
Published: (2025)
MOSS-Audio-Tokenizer: Scaling Audio Tokenizers for Future Audio Foundation Models
by: Gong, Yitian, et al.
Published: (2026)
SimpleSpeech: Towards Simple and Efficient Text-to-Speech with Scalar Latent Transformer Diffusion Models
by: Yang, Dongchao, et al.
Published: (2024)
Addressing Index Collapse of Large-Codebook Speech Tokenizer with Dual-Decoding Product-Quantized Variational Auto-Encoder
by: Guo, Haohan, et al.
Published: (2024)
SoCodec: A Semantic-Ordered Multi-Stream Speech Codec for Efficient Language Model Based Text-to-Speech Synthesis
by: Guo, Haohan, et al.
Published: (2024)
Incorporating Linguistic Constraints from External Knowledge Source for Audio-Visual Target Speech Extraction
by: Wu, Wenxuan, et al.
Published: (2025)
Dasheng AudioGen: A Unified Model for Generating Coherent Audio Scenes from Text
by: Mei, Jiahao, et al.
Published: (2026)
MMEDIT: A Unified Framework for Multi-Type Audio Editing via Audio Language Model
by: Tao, Ye, et al.
Published: (2025)
Exploiting Audio-Visual Features with Pretrained AV-HuBERT for Multi-Modal Dysarthric Speech Reconstruction
by: Chen, Xueyuan, et al.
Published: (2024)
UniCodec: Unified Audio Codec with Single Domain-Adaptive Codebook
by: Jiang, Yidi, et al.
Published: (2025)
AudioLCM: Text-to-Audio Generation with Latent Consistency Models
by: Liu, Huadai, et al.
Published: (2024)
$C^2$AV-TSE: Context and Confidence-aware Audio Visual Target Speaker Extraction
by: Wu, Wenxuan, et al.
Published: (2025)
CoLM-DSR: Leveraging Neural Codec Language Modeling for Multi-Modal Dysarthric Speech Reconstruction
by: Chen, Xueyuan, et al.
Published: (2024)
Causal Tracing of Audio-Text Fusion in Large Audio Language Models
by: Chen, Wei-Chih, et al.
Published: (2026)
BATON: Aligning Text-to-Audio Model with Human Preference Feedback
by: Liao, Huan, et al.
Published: (2024)
Resonate: Reinforcing Text-to-Audio Generation via Online Feedback from Large Audio Language Models
by: Li, Xiquan, et al.
Published: (2026)
Acoustic Interference: A New Paradigm Weaponizing Acoustic Latent Semantic for Universal Jailbreak against Large Audio Language Models
by: Wang, Yanyun, et al.
Published: (2026)
Beyond Transcription: Unified Audio Schema for Perception-Aware AudioLLMs
by: Zhang, Linhao, et al.
Published: (2026)
WAKE: Watermarking Audio with Key Enrichment
by: Xu, Yaoxun, et al.
Published: (2025)
All That Glitters Is Not Audio: Rethinking Text Priors and Audio Reliance in Audio-Language Evaluation
by: Foo, Leonardo Haw-Yang, et al.
Published: (2026)
Eureka-Audio: Triggering Audio Intelligence in Compact Language Models
by: Zhang, Dan, et al.
Published: (2026)
ChronosAudio: A Comprehensive Long-Audio Benchmark for Evaluating Audio-Large Language Models
by: Luo, Kaiwen, et al.
Published: (2026)
AudioRAG+: Feedback-driven Retrieval-augmented Audio Generation with Large Audio Language Models
by: Zhao, Junqi, et al.
Published: (2025)
Are Audio-Language Models Listening? Audio-Specialist Heads for Adaptive Audio Steering
by: Glazer, Neta, et al.
Published: (2026)
DreamAudio: Customized Text-to-Audio Generation with Diffusion Models
by: Yuan, Yi, et al.
Published: (2025)
UniSync: A Unified Framework for Audio-Visual Synchronization
by: Feng, Tao, et al.
Published: (2025)
AudioToolAgent: An Agentic Framework for Audio-Language Models
by: Wijngaard, Gijs, et al.
Published: (2025)