:: Library Catalog

Bibliographic Details
Main Authors:	Cha, Yoonmin; Chun, Dawit; Park, Sung
Format:	Preprint
Published:	2026
Subjects:	Sound; Artificial Intelligence; Computation and Language
Online Access:	https://arxiv.org/abs/2604.16441

Similar Items

CIPHER: Conformer-based Inference of Phonemes from High-density EEG
by: Madishetty, Varshith
Published: (2026)

PMF-CEC: Phoneme-augmented Multimodal Fusion for Context-aware ASR Error Correction with Error-specific Selective Decoding
by: He, Jiajun, et al.
Published: (2025)

BrainECHO: Semantic Brain Signal Decoding through Vector-Quantized Spectrogram Reconstruction for Whisper-Enhanced Text Generation
by: Li, Jilong, et al.
Published: (2024)

Speaker- and Text-Independent Estimation of Articulatory Movements and Phoneme Alignments from Speech
by: Weise, Tobias, et al.
Published: (2024)

PARCO: Phoneme-Augmented Robust Contextual ASR via Contrastive Entity Disambiguation
by: He, Jiajun, et al.
Published: (2025)

ProKWS: Personalized Keyword Spotting via Collaborative Learning of Phonemes and Prosody
by: Pan, Jianan, et al.
Published: (2026)

Decoder-only Conformer with Modality-aware Sparse Mixtures of Experts for ASR
by: Lee, Jaeyoung, et al.
Published: (2026)

Harf-Speech: A Clinically Aligned Framework for Arabic Phoneme-Level Speech Assessment
by: Azad, Asif, et al.
Published: (2026)

TSPC: A Two-Stage Phoneme-Centric Architecture for code-switching Vietnamese-English Speech Recognition
by: Anh, Tran Nguyen, et al.
Published: (2025)

MOSS-TTSD: Text to Spoken Dialogue Generation
by: Zhang, Yuqian, et al.
Published: (2026)

DOA: Training-Free Decoder-Only Attention Policy for Long-Form Simultaneous Translation with SpeechLLMs
by: Papi, Sara, et al.
Published: (2026)

Do LLM Decoders Listen Fairly? Benchmarking How Language Model Priors Shape Bias in Speech Recognition
by: Ginjala, Srishti, et al.
Published: (2026)

Cross-Attention is Half Explanation in Speech-to-Text Models
by: Papi, Sara, et al.
Published: (2025)

Soundwave: Less is More for Speech-Text Alignment in LLMs
by: Zhang, Yuhao, et al.
Published: (2025)

Beyond Modality Limitations: A Unified MLLM Approach to Automated Speaking Assessment with Effective Curriculum Learning
by: Fang, Yu-Hsuan, et al.
Published: (2025)

Story2MIDI: Emotionally Aligned Music Generation from Text
by: Shokri, Mohammad, et al.
Published: (2025)

ASKD-Whisper: Adaptive Self-knowledge Distillation for Efficient and Low-Latency Automatic Speech Recognition
by: Lee, Junseok, et al.
Published: (2026)

Data-efficient Targeted Token-level Preference Optimization for LLM-based Text-to-Speech
by: Kotoge, Rikuto, et al.
Published: (2025)

Emotion-Aligned Generation in Diffusion Text to Speech Models via Preference-Guided Optimization
by: Shi, Jiacheng, et al.
Published: (2025)

Multi-class Decoding of Attended Speaker Direction Using Electroencephalogram and Audio Spatial Spectrum
by: Zhang, Yuanming, et al.
Published: (2024)

MEBM-Phoneme: Multi-scale Enhanced BrainMagic for End-to-End MEG Phoneme Classification
by: Jinghua, Liang, et al.
Published: (2026)

A Penny for Your Thoughts: Decoding Speech from Inexpensive Brain Signals
by: Auster, Quentin, et al.
Published: (2025)

Raon-Speech Technical Report
by: Kim, Beomsoo, et al.
Published: (2026)

Seamless Dysfluent Speech Text Alignment for Disordered Speech Analysis
by: Ye, Zongli, et al.
Published: (2025)

AQAScore: Evaluating Semantic Alignment in Text-to-Audio Generation via Audio Question Answering
by: Kuan, Chun-Yi, et al.
Published: (2026)

MoST: Mixing Speech and Text with Modality-Aware Mixture of Experts
by: Lou, Yuxuan, et al.
Published: (2026)

Unveiling Audio Deepfake Origins: A Deep Metric learning And Conformer Network Approach With Ensemble Fusion
by: Kulkarni, Ajinkya, et al.
Published: (2025)

DiffuSpeech: Silent Thought, Spoken Answer via Unified Speech-Text Diffusion
by: Lou, Yuxuan, et al.
Published: (2026)

Breaking Through the Spike: Spike Window Decoding for Accelerated and Precise Automatic Speech Recognition
by: Zhang, Wei, et al.
Published: (2025)

A Self-Refining Framework for Enhancing ASR Using TTS-Synthesized Data
by: Chou, Cheng-Kang, et al.
Published: (2025)

Neural networks for Text-to-Speech evaluation
by: Trofimenko, Ilya, et al.
Published: (2026)

Voice Communication Analysis in Esports
by: Vinot, Aymeric, et al.
Published: (2024)

USAT: A Universal Speaker-Adaptive Text-to-Speech Approach
by: Wang, Wenbin, et al.
Published: (2024)

Length-Aware Rotary Position Embedding for Text-Speech Alignment
by: Kim, Hyeongju, et al.
Published: (2025)

Text2midi: Generating Symbolic Music from Captions
by: Bhandari, Keshav, et al.
Published: (2024)

Dual Information Speech Language Models for Emotional Conversations
by: Wang, Chun, et al.
Published: (2025)

Multi-Convformer: Extending Conformer with Multiple Convolution Kernels
by: Prabhu, Darshan, et al.
Published: (2024)

A Human-in-the-Loop Approach to Improving Cross-Text Prosody Transfer
by: Maurya, Himanshu, et al.
Published: (2024)

How "Real" is Your Real-Time Simultaneous Speech-to-Text Translation System?
by: Papi, Sara, et al.
Published: (2024)

Auffusion: Leveraging the Power of Diffusion and Large Language Models for Text-to-Audio Generation
by: Xue, Jinlong, et al.
Published: (2024)