| Main Authors: | Li, Xueqing, Ma, Hao, Li, Zehan, Chen, Rujin, Zhu, Boyu, Jing, Ruihao, Kang, Jian, Li, Jie, Zhang, Chi, Zhang, Xiao-Lei, Li, Xuelong |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2504.04721 |
Similar Items
Multilingual Speech Recognition Using Discrete Tokens with a Two-step Training Strategy
by: Li, Zehan, et al.
Published: (2025)
Rare Word Recognition and Translation Without Fine-Tuning via Task Vector in Speech Models
by: Jing, Ruihao, et al.
Published: (2025)
$\text{M}^3\text{PDB}$: A Multimodal, Multi-Label, Multilingual Prompt Database for Speech Generation
by: Zhu, Boyu, et al.
Published: (2025)
Enhancing Intelligibility for Generative Target Speech Extraction via Joint Optimization with Target Speaker ASR
by: Ma, Hao, et al.
Published: (2025)
High-Fidelity Generative Audio Compression at 0.275kbps
by: Ma, Hao, et al.
Published: (2026)
Eliminating Quantization Errors in Classification-Based Sound Source Localization
by: Feng, Linfeng, et al.
Published: (2023)
Audio-Cogito: Towards Deep Audio Reasoning in Large Audio Language Models
by: Li, Longhao, et al.
Published: (2026)
AudioSpa: Spatializing Sound Events with Text
by: Feng, Linfeng, et al.
Published: (2025)
GOAT-TTS: Expressive and Realistic Speech Generation via A Dual-Branch LLM
by: Song, Yaodong, et al.
Published: (2025)
BoSS: Beyond-Semantic Speech
by: Wang, Qing, et al.
Published: (2025)
Towards Multimodal Query-Based Spatial Audio Source Extraction
by: Yu, Chenxin, et al.
Published: (2025)
UniForm: A Unified Multi-Task Diffusion Transformer for Audio-Video Generation
by: Zhao, Lei, et al.
Published: (2025)
TELEVAL: A Dynamic Benchmark Designed for Spoken Language Models in Chinese Interactive Scenarios
by: Li, Zehan, et al.
Published: (2025)
Diffusion-Based Adversarial Purification for Speaker Verification
by: Bai, Yibo, et al.
Published: (2023)
Bridging the Modality Gap: Softly Discretizing Audio Representation for LLM-based Automatic Speech Recognition
by: Yang, Mu, et al.
Published: (2025)
DQR-TTS: Semi-supervised Text-to-speech Synthesis with Dynamic Quantized Representation
by: Wang, Jianzong, et al.
Published: (2023)
DualSpec: Text-to-spatial-audio Generation via Dual-Spectrogram Guided Diffusion Model
by: Zhao, Lei, et al.
Published: (2025)
GOAT-SLM: A Spoken Language Model with Paralinguistic and Speaker Characteristic Awareness
by: Chen, Hongjie, et al.
Published: (2025)
Speaker Contrastive Learning for Source Speaker Tracing
by: Wang, Qing, et al.
Published: (2024)
Pianoroll-Event: A Novel Score Representation for Symbolic Music
by: Qian, Lekai, et al.
Published: (2026)
Enhancing Fully Formatted End-to-End Speech Recognition with Knowledge Distillation via Multi-Codebook Vector Quantization
by: You, Jian, et al.
Published: (2025)
Deep Learning Based Stage-wise Two-dimensional Speaker Localization with Large Ad-hoc Microphone Arrays
by: Liu, Shupei, et al.
Published: (2022)
Bridging the Gap Between Semantic and User Preference Spaces for Multi-modal Music Representation Learning
by: Pan, Xiaofeng, et al.
Published: (2025)
DIFFA: Large Language Diffusion Models Can Listen and Understand
by: Zhou, Jiaming, et al.
Published: (2025)
Language-Codec: Bridging Discrete Codec Representations and Speech Language Models
by: Ji, Shengpeng, et al.
Published: (2024)
Bridging the Perception Gap: A Lightweight Coarse-to-Fine Architecture for Edge Audio Systems
by: Zhang, Hengfan, et al.
Published: (2026)
LL-SDR: Low-Latency Speech enhancement through Discrete Representations
by: Li, Jingyi, et al.
Published: (2026)
Bridging the Gap between Audio and Text using Parallel-attention for User-defined Keyword Spotting
by: Kim, Youkyum, et al.
Published: (2024)
FNSE-SBGAN: Far-field Speech Enhancement with Schrodinger Bridge and Generative Adversarial Networks
by: Lei, Tong, et al.
Published: (2025)
Melodia: Training-Free Music Editing Guided by Attention Probing in Diffusion Models
by: Yang, Yi, et al.
Published: (2025)
Polyphonia: Zero-Shot Timbre Transfer in Polyphonic Music with Acoustic-Informed Attention Calibration
by: Li, Haowen, et al.
Published: (2026)
Selective Invocation for Multilingual ASR: A Cost-effective Approach Adapting to Speech Recognition Difficulty
by: Xue, Hongfei, et al.
Published: (2025)
A Composite Predictive-Generative Approach to Monaural Universal Speech Enhancement
by: Zhang, Jie, et al.
Published: (2025)
UniFlow: Unifying Speech Front-End Tasks via Continuous Generative Modeling
by: Wang, Ziqian, et al.
Published: (2025)
Refining Self-Supervised Learnt Speech Representation using Brain Activations
by: Li, Hengyu, et al.
Published: (2024)
Bridge-SR: Schrödinger Bridge for Efficient SR
by: Li, Chang, et al.
Published: (2025)
Enhancing Non-Core Language Instruction-Following in Speech LLMs via Semi-Implicit Cross-Lingual CoT Reasoning
by: Xue, Hongfei, et al.
Published: (2025)
SCDNet: Self-supervised Learning Feature-based Speaker Change Detection
by: Li, Yue, et al.
Published: (2024)
EchoFree: Towards Ultra Lightweight and Efficient Neural Acoustic Echo Cancellation
by: Li, Xingchen, et al.
Published: (2025)
From Continuous to Discrete: Cross-Domain Collaborative General Speech Enhancement via Hierarchical Language Models
by: Mu, Zhaoxi, et al.
Published: (2025)