Library Catalog

Bibliographic Details
Main Authors:	Ma, Hao; Peng, Zhiyuan; Li, Xu; Shao, Mingjie; Wu, Xixin; Liu, Ju
Format:	Preprint
Published:	2024
Subjects:	Audio and Speech Processing
Online Access:	https://arxiv.org/abs/2402.17455

Similar Items

Language-Queried Target Sound Extraction Without Parallel Training Data
by: Ma, Hao, et al.
Published: (2024)

Target Speech Extraction with Pre-trained AV-HuBERT and Mask-And-Recover Strategy
by: Wu, Wenxuan, et al.
Published: (2024)

Towards Multimodal Query-Based Spatial Audio Source Extraction
by: Yu, Chenxin, et al.
Published: (2025)

Leveraging Audio-Only Data for Text-Queried Target Sound Extraction
by: Saijo, Kohei, et al.
Published: (2024)

Extending Whisper with prompt tuning to target-speaker ASR
by: Ma, Hao, et al.
Published: (2023)

Target Speech Extraction with Pre-trained Self-supervised Learning Models
by: Peng, Junyi, et al.
Published: (2024)

Context-Aware Query Refinement for Target Sound Extraction: Handling Partially Matched Queries
by: Sato, Ryo, et al.
Published: (2025)

Towards Fine-Grained and Multi-Granular Contrastive Language-Speech Pre-training
by: Yang, Yifan, et al.
Published: (2026)

Text-Queried Target Sound Event Localization
by: Zhao, Jinzheng, et al.
Published: (2024)

Training Strategies for Modality Dropout Resilient Multi-Modal Target Speaker Extraction
by: Korse, Srikanth, et al.
Published: (2025)

Advancing Multi-grained Alignment for Contrastive Language-Audio Pre-training
by: Li, Yiming, et al.
Published: (2024)

Prior-agnostic Multi-scale Contrastive Text-Audio Pre-training for Parallelized TTS Frontend Modeling
by: Wang, Quanxiu, et al.
Published: (2024)

Enhancing Intelligibility for Generative Target Speech Extraction via Joint Optimization with Target Speaker ASR
by: Ma, Hao, et al.
Published: (2025)

Bridging the gap between training and inference in LM-based TTS models
by: Zhang, Ruonan, et al.
Published: (2025)

ELEGANCE: Efficient LLM Guidance for Audio-Visual Target Speech Extraction
by: Wu, Wenxuan, et al.
Published: (2025)

SoloAudio: Target Sound Extraction with Language-oriented Audio Diffusion Transformer
by: Wang, Helin, et al.
Published: (2024)

Regularized Contrastive Pre-training for Few-shot Bioacoustic Sound Detection
by: Moummad, Ilyass, et al.
Published: (2023)

Exploiting Audio-Visual Features with Pretrained AV-HuBERT for Multi-Modal Dysarthric Speech Reconstruction
by: Chen, Xueyuan, et al.
Published: (2024)

DrawSpeech: Expressive Speech Synthesis Using Prosodic Sketches as Control Conditions
by: Chen, Weidong, et al.
Published: (2025)

CoLM-DSR: Leveraging Neural Codec Language Modeling for Multi-Modal Dysarthric Speech Reconstruction
by: Chen, Xueyuan, et al.
Published: (2024)

Incorporating Linguistic Constraints from External Knowledge Source for Audio-Visual Target Speech Extraction
by: Wu, Wenxuan, et al.
Published: (2025)

Cross-attention Inspired Selective State Space Models for Target Sound Extraction
by: Wu, Donghang, et al.
Published: (2024)

Leveraging Language Information for Target Language Extraction
by: Yıldırım, Mehmet Sinan, et al.
Published: (2025)

TSE-PI: Target Sound Extraction under Reverberant Environments with Pitch Information
by: Wang, Yiwen, et al.
Published: (2024)

Progressive Residual Extraction based Pre-training for Speech Representation Learning
by: Wang, Tianrui, et al.
Published: (2024)

TripleC Learning and Lightweight Speech Enhancement for Multi-Condition Target Speech Extraction
by: Huang, Ziling
Published: (2025)

Detect Any Sound: Open-Vocabulary Sound Event Detection with Multi-Modal Queries
by: Cai, Pengfei, et al.
Published: (2025)

Self-Guided Target Sound Extraction and Classification Through Universal Sound Separation Model and Multiple Clues
by: Kwon, Younghoo, et al.
Published: (2025)

Disambiguation of Chinese Polyphones in an End-to-End Framework with Semantic Features Extracted by Pre-trained BERT
by: Dai, Dongyang, et al.
Published: (2025)

MC-LExt: Multi-Channel Target Speaker Extraction with Onset-Prompted Speaker Conditioning Mechanism
by: Ling, Tongtao, et al.
Published: (2025)

Leveraging Sound Source Trajectories for Universal Sound Separation
by: Wu, Donghang, et al.
Published: (2024)

Leveraging LLM and Text-Queried Separation for Noise-Robust Sound Event Detection
by: Yin, Han, et al.
Published: (2024)

Enhancing Real-World Active Speaker Detection with Multi-Modal Extraction Pre-Training
by: Tao, Ruijie, et al.
Published: (2024)

Leveraging Pre-trained AudioLDM for Sound Generation: A Benchmark Study
by: Yuan, Yi, et al.
Published: (2023)

SoundBeam meets M2D: Target Sound Extraction with Audio Foundation Model
by: Hernandez-Olivan, Carlos, et al.
Published: (2024)

Flexible Multi-Channel Target Speaker Extraction Using Geometry-Conditioned Spatially Selective Non-linear Filters
by: Li, Jiatong, et al.
Published: (2026)

M3ANet: Multi-scale and Multi-Modal Alignment Network for Brain-Assisted Target Speaker Extraction
by: Fan, Cunhang, et al.
Published: (2025)

DiffSound: Differentiable Modal Sound Rendering and Inverse Rendering for Diverse Inference Tasks
by: Jin, Xutong, et al.
Published: (2024)

$C^2$AV-TSE: Context and Confidence-aware Audio Visual Target Speaker Extraction
by: Wu, Wenxuan, et al.
Published: (2025)

Temporal Adaptation of Pre-trained Foundation Models for Music Structure Analysis
by: Zhang, Yixiao, et al.
Published: (2025)