| Main authors: | Yang, Dongchao; Liu, Songxiang; Wang, Disong; Wang, Yuanyuan; Wan, Guanglu; Meng, Helen |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online access: | https://arxiv.org/abs/2512.03783 |
Similar items
UniAudio 2.0: A Unified Audio Language Model with Text-Aligned Factorized Audio Tokenization
by: Yang, Dongchao, et al.
Published: (2026)
SimpleSpeech 2: Towards Simple and Efficient Text-to-Speech with Flow-based Scalar Latent Transformer Diffusion Models
by: Yang, Dongchao, et al.
Published: (2024)
ALMTokenizer: A Low-bitrate and Semantic-rich Audio Codec Tokenizer for Audio Language Modeling
by: Yang, Dongchao, et al.
Published: (2025)
SupCLAP: Controlling Optimization Trajectory Drift in Audio-Text Contrastive Learning with Support Vector Regularization
by: Luo, Jiehui, et al.
Published: (2025)
AuTAgent: A Reinforcement Learning Framework for Tool-Augmented Audio Reasoning
by: Tong, Siqian, et al.
Published: (2026)
AMB-DSGDN: Adaptive Modality-Balanced Dynamic Semantic Graph Differential Network for Multimodal Emotion Recognition
by: Wang, Yunsheng, et al.
Published: (2026)
Interpretable All-Type Audio Deepfake Detection with Audio LLMs via Frequency-Time Reinforcement Learning
by: Xie, Yuankun, et al.
Published: (2026)
Adaptive Vehicle Speed Classification via BMCNN with Reinforcement Learning-Enhanced Acoustic Processing
by: Zhang, Yuli, et al.
Published: (2025)
MeanFlow-Accelerated Multimodal Video-to-Audio Synthesis via One-Step Generation
by: Yang, Xiaoran, et al.
Published: (2025)
UniSRM: A Unified Speech Reward Model for Reasoning-Based Fine-grained Assessment
by: Wang, Yuanyuan, et al.
Published: (2026)
LongCat-Flash-Omni Technical Report
by: Meituan LongCat Team, et al.
Published: (2025)
MMDuet2: Enhancing Proactive Interaction of Video MLLMs with Multi-Turn Reinforcement Learning
by: Wang, Yueqian, et al.
Published: (2025)
UNIT-DSR: Dysarthric Speech Reconstruction System Using Speech Unit Normalization
by: Wang, Yuejiao, et al.
Published: (2024)
SongGen: A Single Stage Auto-regressive Transformer for Text-to-Song Generation
by: Liu, Zihan, et al.
Published: (2025)
MARS-Sep: Multimodal-Aligned Reinforced Sound Separation
by: Zhang, Zihan, et al.
Published: (2025)
GaMMA: Towards Joint Global-Temporal Music Understanding in Large Multimodal Models
by: You, Zuyao, et al.
Published: (2026)
MGM-Omni: Scaling Omni LLMs to Personalized Long-Horizon Speech
by: Wang, Chengyao, et al.
Published: (2025)
ATRI: Mitigating Multilingual Audio Text Retrieval Inconsistencies by Reducing Data Distribution Errors
by: Yin, Yuguo, et al.
Published: (2025)
AutoStyle-TTS: Retrieval-Augmented Generation based Automatic Style Matching Text-to-Speech Synthesis
by: Luo, Dan, et al.
Published: (2025)
Omni-CLST: Error-aware Curriculum Learning with guided Selective chain-of-Thought for audio question answering
by: Zhao, Jinghua, et al.
Published: (2025)
Structure-Aware Piano Accompaniment via Style Planning and Dataset-Aligned Pattern Retrieval
by: Zang, Wanyu, et al.
Published: (2026)
Disentangling Speakers in Multi-Talker Speech Recognition with Speaker-Aware CTC
by: Kang, Jiawen, et al.
Published: (2024)
SoloSpeech: Enhancing Intelligibility and Quality in Target Speech Extraction through a Cascaded Generative Pipeline
by: Wang, Helin, et al.
Published: (2025)
Speech-Hands: A Self-Reflection Voice Agentic Approach to Speech Recognition and Audio Reasoning with Omni Perception
by: Wan, Zhen, et al.
Published: (2026)
Addressing Index Collapse of Large-Codebook Speech Tokenizer with Dual-Decoding Product-Quantized Variational Auto-Encoder
by: Guo, Haohan, et al.
Published: (2024)
Speech-Audio Compositional Attacks on Multimodal LLMs and Their Mitigation with SALMONN-Guard
by: Yang, Yudong, et al.
Published: (2025)
Rebellion: Noise-Robust Reasoning Training for Audio Reasoning Models
by: Huang, Tiansheng, et al.
Published: (2025)
Boosting ASR Robustness via Test-Time Reinforcement Learning with Audio-Text Semantic Rewards
by: Fang, Linghan, et al.
Published: (2026)
AST: Adaptive, Seamless, and Training-Free Precise Speech Editing
by: Lv, Sihan, et al.
Published: (2026)
AudioSAE: Towards Understanding of Audio-Processing Models with Sparse AutoEncoders
by: Aparin, Georgii, et al.
Published: (2026)
Switchcodec: Adaptive residual-expert sparse quantization for high-fidelity neural audio coding
by: Wang, Xiangbo, et al.
Published: (2026)
Enhancing Quranic Learning: A Multimodal Deep Learning Approach for Arabic Phoneme Recognition
by: Kucukmanisa, Ayhan, et al.
Published: (2025)
Quantifying Multimodal Imbalance: A GMM-Guided Adaptive Loss for Audio-Visual Learning
by: Liu, Zhaocheng, et al.
Published: (2025)
DualSpeechLM: Towards Unified Speech Understanding and Generation via Dual Speech Token Modeling with Large Language Models
by: Wang, Yuanyuan, et al.
Published: (2025)
LTS-VoiceAgent: A Listen-Think-Speak Framework for Efficient Streaming Voice Interaction via Semantic Triggering and Incremental Reasoning
by: Zou, Wenhao, et al.
Published: (2026)
Towards Robust Speech Deepfake Detection via Human-Inspired Reasoning
by: Dvirniak, Artem, et al.
Published: (2026)
Exploiting Audio-Visual Features with Pretrained AV-HuBERT for Multi-Modal Dysarthric Speech Reconstruction
by: Chen, Xueyuan, et al.
Published: (2024)
Who Will Top the Charts? Multimodal Music Popularity Prediction via Adaptive Fusion of Modality Experts and Temporal Engagement Modeling
by: Choudhary, Yash, et al.
Published: (2025)
SpeakerLM: End-to-End Versatile Speaker Diarization and Recognition with Multimodal Large Language Models
by: Yin, Han, et al.
Published: (2025)
MSR-HuBERT: Self-supervised Pre-training for Adaptation to Multiple Sampling Rates
by: Huang, Zikang, et al.
Published: (2026)