Bibliographic Details
Main Authors:	Zhang, Pengfei, Xie, Tianxin, Yang, Minghao, Liu, Li
Format:	Preprint
Published:	2026
Subjects:	Audio and Speech Processing; Artificial Intelligence; Databases; Human-Computer Interaction; Multiagent Systems; Sound; 68T07, 92C55; I.2.7; J.3; I.2.6
Online Access:	https://arxiv.org/abs/2602.15909

Similar Items

Should you use a probabilistic duration model in TTS? Probably! Especially for spontaneous speech
by: Mehta, Shivam, et al.
Published: (2024)

Matcha-TTS: A fast TTS architecture with conditional flow matching
by: Mehta, Shivam, et al.
Published: (2023)

Make Some Noise: Towards LLM audio reasoning and generation using sound tokens
by: Mehta, Shivam, et al.
Published: (2025)

SemAlignVC: Enhancing zero-shot timbre conversion using semantic alignment
by: Mehta, Shivam, et al.
Published: (2025)

Decoding EEG Speech Perception with Transformers and VAE-based Data Augmentation
by: Chen, Terrance Yu-Hao, et al.
Published: (2025)

Unified speech and gesture synthesis using flow matching
by: Mehta, Shivam, et al.
Published: (2023)

Rethinking Masking Strategies for Masked Prediction-based Audio Self-supervised Learning
by: Niizumi, Daisuke, et al.
Published: (2026)

Prevailing Research Areas for Music AI in the Era of Foundation Models
by: Wei, Megan, et al.
Published: (2024)

Fake it to make it: Using synthetic data to remedy the data shortage in joint multimodal speech-and-gesture synthesis
by: Mehta, Shivam, et al.
Published: (2024)

HELIX: Scaling Raw Audio Understanding with Hybrid Mamba-Attention Beyond the Quadratic Limit
by: Khushiyant, et al.
Published: (2026)

SeamlessEdit: Background Noise Aware Zero-Shot Speech Editing with in-Context Enhancement
by: Chen, Kuan-Yu, et al.
Published: (2025)

AURA: Agent for Understanding, Reasoning, and Automated Tool Use in Voice-Driven Tasks
by: Maben, Leander Melroy, et al.
Published: (2025)

TuneGenie: Reasoning-based LLM agents for preferential music generation
by: Pandey, Amitesh, et al.
Published: (2025)

Beyond Deep Learning: Speech Segmentation and Phone Classification with Neural Assemblies
by: Adelson, Trevor, et al.
Published: (2026)

The OCON model: an old but gold solution for distributable supervised classification
by: Giacomelli, Stefano, et al.
Published: (2024)

M6(GPT)3: Generating Multitrack Modifiable Multi-Minute MIDI Music from Text using Genetic algorithms, Probabilistic methods and GPT Models in any Progression and Time Signature
by: Poćwiardowski, Jakub, et al.
Published: (2024)

CognitiveArm: Enabling Real-Time EEG-Controlled Prosthetic Arm Using Embodied Machine Learning
by: Basit, Abdul, et al.
Published: (2025)

Window Size Versus Accuracy Experiments in Voice Activity Detectors
by: McKinnon, Max, et al.
Published: (2026)

Less Stress, More Privacy: Stress Detection on Anonymized Speech of Air Traffic Controllers
by: Viswanathan, Janaki, et al.
Published: (2025)

Investigating Prosodic Signatures via Speech Pre-Trained Models for Audio Deepfake Source Attribution
by: Phukan, Orchid Chetia, et al.
Published: (2024)

Strong Alone, Stronger Together: Synergizing Modality-Binding Foundation Models with Optimal Transport for Non-Verbal Emotion Recognition
by: Phukan, Orchid Chetia, et al.
Published: (2024)

Multi-View Multi-Task Modeling with Speech Foundation Models for Speech Forensic Tasks
by: Phukan, Orchid Chetia, et al.
Published: (2024)

Avengers Assemble: Amalgamation of Non-Semantic Features for Depression Detection
by: Phukan, Orchid Chetia, et al.
Published: (2024)

SeQuiFi: Mitigating Catastrophic Forgetting in Speech Emotion Recognition with Sequential Class-Finetuning
by: Jain, Sarthak, et al.
Published: (2024)

Representation Loss Minimization with Randomized Selection Strategy for Efficient Environmental Fake Audio Detection
by: Phukan, Orchid Chetia, et al.
Published: (2024)

Score Distillation Sampling for Audio: Source Separation, Synthesis, and Beyond
by: Richter-Powell, Jessie, et al.
Published: (2025)

Generation of Musical Timbres using a Text-Guided Diffusion Model
by: Yuan, Weixuan, et al.
Published: (2025)

Self-Improvement for Audio Large Language Model using Unlabeled Speech
by: Wang, Shaowen, et al.
Published: (2025)

MAIN-VC: Lightweight Speech Representation Disentanglement for One-shot Voice Conversion
by: Li, Pengcheng, et al.
Published: (2024)

BESTOW: Efficient and Streamable Speech Language Model with the Best of Two Worlds in GPT and T5
by: Chen, Zhehuai, et al.
Published: (2024)

Automatic Album Sequencing
by: Herrmann, Vincent, et al.
Published: (2024)

ParaNoise-SV: Integrated Approach for Noise-Robust Speaker Verification with Parallel Joint Learning of Speech Enhancement and Noise Extraction
by: Kim, Minu, et al.
Published: (2025)

Quantum-Enhanced Analysis and Grading of Vocal Performance
by: Agarwal, Rohan
Published: (2025)

A Survey on World Models Grounded in Acoustic Physical Information
by: Chen, Xiaoliang, et al.
Published: (2025)

Fine-Tuning Large Audio-Language Models with LoRA for Precise Temporal Localization of Prolonged Exposure Therapy Elements
by: BN, Suhas, et al.
Published: (2025)

OR-VSKC: Resolving Visual-Semantic Knowledge Conflicts in Operating Rooms with Synthetic Data-Guided Alignment
by: Zhao, Weiyi, et al.
Published: (2025)

FakeSound2: A Benchmark for Explainable and Generalizable Deepfake Sound Detection
by: Xie, Zeyu, et al.
Published: (2025)

Communication Access Real-Time Translation Through Collaborative Correction of Automatic Speech Recognition
by: Kuhn, Korbinian, et al.
Published: (2025)

FakeSound: Deepfake General Audio Detection
by: Xie, Zeyu, et al.
Published: (2024)

Automatic Speech Recognition (ASR) for the Diagnosis of pronunciation of Speech Sound Disorders in Korean children
by: Ahn, Taekyung, et al.
Published: (2024)