| Main Authors: | Zhang, Pengfei, Xie, Tianxin, Yang, Minghao, Liu, Li |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2602.15909 |
Similar Items
Should you use a probabilistic duration model in TTS? Probably! Especially for spontaneous speech
by: Mehta, Shivam, et al.
Published: (2024)
Matcha-TTS: A fast TTS architecture with conditional flow matching
by: Mehta, Shivam, et al.
Published: (2023)
Make Some Noise: Towards LLM audio reasoning and generation using sound tokens
by: Mehta, Shivam, et al.
Published: (2025)
SemAlignVC: Enhancing zero-shot timbre conversion using semantic alignment
by: Mehta, Shivam, et al.
Published: (2025)
Decoding EEG Speech Perception with Transformers and VAE-based Data Augmentation
by: Chen, Terrance Yu-Hao, et al.
Published: (2025)
Unified speech and gesture synthesis using flow matching
by: Mehta, Shivam, et al.
Published: (2023)
Rethinking Masking Strategies for Masked Prediction-based Audio Self-supervised Learning
by: Niizumi, Daisuke, et al.
Published: (2026)
Prevailing Research Areas for Music AI in the Era of Foundation Models
by: Wei, Megan, et al.
Published: (2024)
Fake it to make it: Using synthetic data to remedy the data shortage in joint multimodal speech-and-gesture synthesis
by: Mehta, Shivam, et al.
Published: (2024)
HELIX: Scaling Raw Audio Understanding with Hybrid Mamba-Attention Beyond the Quadratic Limit
by: Khushiyant, et al.
Published: (2026)
SeamlessEdit: Background Noise Aware Zero-Shot Speech Editing with in-Context Enhancement
by: Chen, Kuan-Yu, et al.
Published: (2025)
AURA: Agent for Understanding, Reasoning, and Automated Tool Use in Voice-Driven Tasks
by: Maben, Leander Melroy, et al.
Published: (2025)
TuneGenie: Reasoning-based LLM agents for preferential music generation
by: Pandey, Amitesh, et al.
Published: (2025)
Beyond Deep Learning: Speech Segmentation and Phone Classification with Neural Assemblies
by: Adelson, Trevor, et al.
Published: (2026)
The OCON model: an old but gold solution for distributable supervised classification
by: Giacomelli, Stefano, et al.
Published: (2024)
M6(GPT)3: Generating Multitrack Modifiable Multi-Minute MIDI Music from Text using Genetic algorithms, Probabilistic methods and GPT Models in any Progression and Time Signature
by: Poćwiardowski, Jakub, et al.
Published: (2024)
CognitiveArm: Enabling Real-Time EEG-Controlled Prosthetic Arm Using Embodied Machine Learning
by: Basit, Abdul, et al.
Published: (2025)
Window Size Versus Accuracy Experiments in Voice Activity Detectors
by: McKinnon, Max, et al.
Published: (2026)
Less Stress, More Privacy: Stress Detection on Anonymized Speech of Air Traffic Controllers
by: Viswanathan, Janaki, et al.
Published: (2025)
Investigating Prosodic Signatures via Speech Pre-Trained Models for Audio Deepfake Source Attribution
by: Phukan, Orchid Chetia, et al.
Published: (2024)
Strong Alone, Stronger Together: Synergizing Modality-Binding Foundation Models with Optimal Transport for Non-Verbal Emotion Recognition
by: Phukan, Orchid Chetia, et al.
Published: (2024)
Multi-View Multi-Task Modeling with Speech Foundation Models for Speech Forensic Tasks
by: Phukan, Orchid Chetia, et al.
Published: (2024)
Avengers Assemble: Amalgamation of Non-Semantic Features for Depression Detection
by: Phukan, Orchid Chetia, et al.
Published: (2024)
SeQuiFi: Mitigating Catastrophic Forgetting in Speech Emotion Recognition with Sequential Class-Finetuning
by: Jain, Sarthak, et al.
Published: (2024)
Representation Loss Minimization with Randomized Selection Strategy for Efficient Environmental Fake Audio Detection
by: Phukan, Orchid Chetia, et al.
Published: (2024)
Score Distillation Sampling for Audio: Source Separation, Synthesis, and Beyond
by: Richter-Powell, Jessie, et al.
Published: (2025)
Generation of Musical Timbres using a Text-Guided Diffusion Model
by: Yuan, Weixuan, et al.
Published: (2025)
Self-Improvement for Audio Large Language Model using Unlabeled Speech
by: Wang, Shaowen, et al.
Published: (2025)
MAIN-VC: Lightweight Speech Representation Disentanglement for One-shot Voice Conversion
by: Li, Pengcheng, et al.
Published: (2024)
BESTOW: Efficient and Streamable Speech Language Model with the Best of Two Worlds in GPT and T5
by: Chen, Zhehuai, et al.
Published: (2024)
Automatic Album Sequencing
by: Herrmann, Vincent, et al.
Published: (2024)
ParaNoise-SV: Integrated Approach for Noise-Robust Speaker Verification with Parallel Joint Learning of Speech Enhancement and Noise Extraction
by: Kim, Minu, et al.
Published: (2025)
Quantum-Enhanced Analysis and Grading of Vocal Performance
by: Agarwal, Rohan
Published: (2025)
A Survey on World Models Grounded in Acoustic Physical Information
by: Chen, Xiaoliang, et al.
Published: (2025)
Fine-Tuning Large Audio-Language Models with LoRA for Precise Temporal Localization of Prolonged Exposure Therapy Elements
by: BN, Suhas, et al.
Published: (2025)
OR-VSKC: Resolving Visual-Semantic Knowledge Conflicts in Operating Rooms with Synthetic Data-Guided Alignment
by: Zhao, Weiyi, et al.
Published: (2025)
FakeSound2: A Benchmark for Explainable and Generalizable Deepfake Sound Detection
by: Xie, Zeyu, et al.
Published: (2025)
Communication Access Real-Time Translation Through Collaborative Correction of Automatic Speech Recognition
by: Kuhn, Korbinian, et al.
Published: (2025)
FakeSound: Deepfake General Audio Detection
by: Xie, Zeyu, et al.
Published: (2024)
Automatic Speech Recognition (ASR) for the Diagnosis of pronunciation of Speech Sound Disorders in Korean children
by: Ahn, Taekyung, et al.
Published: (2024)