| Main Authors: | Wang, Helin, Shi, Bowen, Tjandra, Andros, Hoffman, John, Wu, Yi-Chiao, Vyas, Apoorv, Dehak, Najim, Lee, Ann, Hsu, Wei-Ning |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2601.19702 |
Similar Items
SAM Audio: Segment Anything in Audio
by: Shi, Bowen, et al.
Published: (2025)
Learning Fine-Grained Controllability on Speech Generation via Efficient Fine-Tuning
by: Chien, Chung-Ming, et al.
Published: (2024)
Audiobox TTA-RAG: Improving Zero-Shot and Few-Shot Text-To-Audio with Retrieval-Augmented Generation
by: Yang, Mu, et al.
Published: (2024)
Generative Pre-training for Speech with Flow Matching
by: Liu, Alexander H., et al.
Published: (2023)
Meta Audiobox Aesthetics: Unified Automatic Quality Assessment for Speech, Music, and Sound
by: Tjandra, Andros, et al.
Published: (2025)
The AudioMOS Challenge 2025
by: Huang, Wen-Chin, et al.
Published: (2025)
SoloAudio: Target Sound Extraction with Language-oriented Audio Diffusion Transformer
by: Wang, Helin, et al.
Published: (2024)
Noise-robust Speech Separation with Fast Generative Correction
by: Wang, Helin, et al.
Published: (2024)
MusicFlow: Cascaded Flow Matching for Text Guided Music Generation
by: Prajwal, K R, et al.
Published: (2024)
ReFESS-QI: Reference-Free Evaluation For Speech Separation With Joint Quality And Intelligibility Scoring
by: Frummer, Ari, et al.
Published: (2025)
Reconstruct! Don't Encode: Self-Supervised Representation Reconstruction Loss for High-Intelligibility and Low-Latency Streaming Neural Audio Codec
by: Lee, Junhyeok, et al.
Published: (2026)
SSR-Speech: Towards Stable, Safe and Robust Zero-shot Text-based Speech Editing and Synthesis
by: Wang, Helin, et al.
Published: (2024)
MaskVCT: Masked Voice Codec Transformer for Zero-Shot Voice Conversion With Increased Controllability via Multiple Guidances
by: Lee, Junhyeok, et al.
Published: (2025)
Data Selection Effects on Self-Supervised Learning of Audio Representations for French Audiovisual Broadcasts
by: Pelloin, Valentin, et al.
Published: (2026)
EzAudio: Enhancing Text-to-Audio Generation with Efficient Diffusion Transformer
by: Hai, Jiarui, et al.
Published: (2024)
UltraEval-Audio: A Unified Framework for Comprehensive Evaluation of Audio Foundation Models
by: Shi, Qundong, et al.
Published: (2026)
BWSNet: Automatic Perceptual Assessment of Audio Signals
by: Veillon, Clément Le Moine, et al.
Published: (2023)
Perceptual Audio Coding: A 40-Year Historical Perspective
by: Herre, Jürgen, et al.
Published: (2025)
Exploring Perceptual Audio Quality Measurement on Stereo Processing Using the Open Dataset of Audio Quality
by: Delgado, Pablo M., et al.
Published: (2025)
PromptSep: Generative Audio Separation via Multimodal Prompting
by: Wen, Yutong, et al.
Published: (2025)
Study of Pre-processing Defenses against Adversarial Attacks on State-of-the-art Speaker Recognition Systems
by: Joshi, Sonal, et al.
Published: (2021)
Diff-VS: Efficient Audio-Aware Diffusion U-Net for Vocals Separation
by: Yun-Ning, et al.
Published: (2026)
Example-Based Framework for Perceptually Guided Audio Texture Generation
by: Kamath, Purnima, et al.
Published: (2023)
Multimodal Representation Loss Between Timed Text and Audio for Regularized Speech Separation
by: Hsieh, Tsun-An, et al.
Published: (2024)
AudioJudge: Understanding What Works in Large Audio Model Based Speech Evaluation
by: Manakul, Potsawee, et al.
Published: (2025)
Vox-Profile: A Speech Foundation Model Benchmark for Characterizing Diverse Speaker and Speech Traits
by: Feng, Tiantian, et al.
Published: (2025)
U-SAM: An audio language Model for Unified Speech, Audio, and Music Understanding
by: Wang, Ziqian, et al.
Published: (2025)
Unraveling Adversarial Examples against Speaker Identification -- Techniques for Attack Detection and Victim Model Classification
by: Joshi, Sonal, et al.
Published: (2024)
Perceptual Musical Features for Interpretable Audio Tagging
by: Lyberatos, Vassilis, et al.
Published: (2023)
SPEAR: A Unified SSL Framework for Learning Speech and Audio Representations
by: Yang, Xiaoyu, et al.
Published: (2025)
XLAVS-R: Cross-Lingual Audio-Visual Speech Representation Learning for Noise-Robust Speech Perception
by: Han, HyoJung, et al.
Published: (2024)
Unified Audio Event Detection
by: Jiang, Yidi, et al.
Published: (2024)
FiPA-SR -- FiLM-Conditioned Perceptually Informed Audio Super-Resolution
by: Abreu, Wallace, et al.
Published: (2026)
Unsupervised Single-Channel Audio Separation with Diffusion Source Priors
by: Shi, Runwu, et al.
Published: (2025)
A Survey of Audio Reasoning in Multimodal Foundation Models
by: Guo, Zhihan, et al.
Published: (2026)
SAM: A Mamba-2 State-Space Audio-Language Model
by: Lee, Taehan, et al.
Published: (2025)
Audio-Mind: An Auditable Agentic Framework for Audio Understanding
by: Wang, Yucheng, et al.
Published: (2026)
MACE: Leveraging Audio for Evaluating Audio Captioning Systems
by: Dixit, Satvik, et al.
Published: (2024)
DiT-Flow: Speech Enhancement Robust to Multiple Distortions based on Flow Matching in Latent Space and Diffusion Transformers
by: Cao, Tianyu, et al.
Published: (2026)
SoloSpeech: Enhancing Intelligibility and Quality in Target Speech Extraction through a Cascaded Generative Pipeline
by: Wang, Helin, et al.
Published: (2025)