:: Library Catalog

Bibliographic Details
Main Authors:	Li, Xueqing, Ma, Hao, Li, Zehan, Chen, Rujin, Zhu, Boyu, Jing, Ruihao, Kang, Jian, Li, Jie, Zhang, Chi, Zhang, Xiao-Lei, Li, Xuelong
Format:	Preprint
Published:	2025
Subjects:	Audio and Speech Processing
Online Access:	https://arxiv.org/abs/2504.04721

Similar Items

Multilingual Speech Recognition Using Discrete Tokens with a Two-step Training Strategy
by: Li, Zehan, et al.
Published: (2025)

Rare Word Recognition and Translation Without Fine-Tuning via Task Vector in Speech Models
by: Jing, Ruihao, et al.
Published: (2025)

$\text{M}^3\text{PDB}$: A Multimodal, Multi-Label, Multilingual Prompt Database for Speech Generation
by: Zhu, Boyu, et al.
Published: (2025)

Enhancing Intelligibility for Generative Target Speech Extraction via Joint Optimization with Target Speaker ASR
by: Ma, Hao, et al.
Published: (2025)

High-Fidelity Generative Audio Compression at 0.275kbps
by: Ma, Hao, et al.
Published: (2026)

Eliminating Quantization Errors in Classification-Based Sound Source Localization
by: Feng, Linfeng, et al.
Published: (2023)

Audio-Cogito: Towards Deep Audio Reasoning in Large Audio Language Models
by: Li, Longhao, et al.
Published: (2026)

AudioSpa: Spatializing Sound Events with Text
by: Feng, Linfeng, et al.
Published: (2025)

GOAT-TTS: Expressive and Realistic Speech Generation via A Dual-Branch LLM
by: Song, Yaodong, et al.
Published: (2025)

BoSS: Beyond-Semantic Speech
by: Wang, Qing, et al.
Published: (2025)

Towards Multimodal Query-Based Spatial Audio Source Extraction
by: Yu, Chenxin, et al.
Published: (2025)

UniForm: A Unified Multi-Task Diffusion Transformer for Audio-Video Generation
by: Zhao, Lei, et al.
Published: (2025)

TELEVAL: A Dynamic Benchmark Designed for Spoken Language Models in Chinese Interactive Scenarios
by: Li, Zehan, et al.
Published: (2025)

Diffusion-Based Adversarial Purification for Speaker Verification
by: Bai, Yibo, et al.
Published: (2023)

Bridging the Modality Gap: Softly Discretizing Audio Representation for LLM-based Automatic Speech Recognition
by: Yang, Mu, et al.
Published: (2025)

DQR-TTS: Semi-supervised Text-to-speech Synthesis with Dynamic Quantized Representation
by: Wang, Jianzong, et al.
Published: (2023)

DualSpec: Text-to-spatial-audio Generation via Dual-Spectrogram Guided Diffusion Model
by: Zhao, Lei, et al.
Published: (2025)

GOAT-SLM: A Spoken Language Model with Paralinguistic and Speaker Characteristic Awareness
by: Chen, Hongjie, et al.
Published: (2025)

Speaker Contrastive Learning for Source Speaker Tracing
by: Wang, Qing, et al.
Published: (2024)

Pianoroll-Event: A Novel Score Representation for Symbolic Music
by: Qian, Lekai, et al.
Published: (2026)

Enhancing Fully Formatted End-to-End Speech Recognition with Knowledge Distillation via Multi-Codebook Vector Quantization
by: You, Jian, et al.
Published: (2025)

Deep Learning Based Stage-wise Two-dimensional Speaker Localization with Large Ad-hoc Microphone Arrays
by: Liu, Shupei, et al.
Published: (2022)

Bridging the Gap Between Semantic and User Preference Spaces for Multi-modal Music Representation Learning
by: Pan, Xiaofeng, et al.
Published: (2025)

DIFFA: Large Language Diffusion Models Can Listen and Understand
by: Zhou, Jiaming, et al.
Published: (2025)

Language-Codec: Bridging Discrete Codec Representations and Speech Language Models
by: Ji, Shengpeng, et al.
Published: (2024)

Bridging the Perception Gap: A Lightweight Coarse-to-Fine Architecture for Edge Audio Systems
by: Zhang, Hengfan, et al.
Published: (2026)

LL-SDR: Low-Latency Speech enhancement through Discrete Representations
by: Li, Jingyi, et al.
Published: (2026)

Bridging the Gap between Audio and Text using Parallel-attention for User-defined Keyword Spotting
by: Kim, Youkyum, et al.
Published: (2024)

FNSE-SBGAN: Far-field Speech Enhancement with Schrodinger Bridge and Generative Adversarial Networks
by: Lei, Tong, et al.
Published: (2025)

Melodia: Training-Free Music Editing Guided by Attention Probing in Diffusion Models
by: Yang, Yi, et al.
Published: (2025)

Polyphonia: Zero-Shot Timbre Transfer in Polyphonic Music with Acoustic-Informed Attention Calibration
by: Li, Haowen, et al.
Published: (2026)

Selective Invocation for Multilingual ASR: A Cost-effective Approach Adapting to Speech Recognition Difficulty
by: Xue, Hongfei, et al.
Published: (2025)

A Composite Predictive-Generative Approach to Monaural Universal Speech Enhancement
by: Zhang, Jie, et al.
Published: (2025)

UniFlow: Unifying Speech Front-End Tasks via Continuous Generative Modeling
by: Wang, Ziqian, et al.
Published: (2025)

Refining Self-Supervised Learnt Speech Representation using Brain Activations
by: Li, Hengyu, et al.
Published: (2024)

Bridge-SR: Schrödinger Bridge for Efficient SR
by: Li, Chang, et al.
Published: (2025)

Enhancing Non-Core Language Instruction-Following in Speech LLMs via Semi-Implicit Cross-Lingual CoT Reasoning
by: Xue, Hongfei, et al.
Published: (2025)

SCDNet: Self-supervised Learning Feature-based Speaker Change Detection
by: Li, Yue, et al.
Published: (2024)

EchoFree: Towards Ultra Lightweight and Efficient Neural Acoustic Echo Cancellation
by: Li, Xingchen, et al.
Published: (2025)

From Continuous to Discrete: Cross-Domain Collaborative General Speech Enhancement via Hierarchical Language Models
by: Mu, Zhaoxi, et al.
Published: (2025)