| Main Authors: | Cha, Yoonmin, Chun, Dawit, Park, Sung |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2604.16441 |
Similar Items
CIPHER: Conformer-based Inference of Phonemes from High-density EEG
by: Madishetty, Varshith
Published: (2026)
PMF-CEC: Phoneme-augmented Multimodal Fusion for Context-aware ASR Error Correction with Error-specific Selective Decoding
by: He, Jiajun, et al.
Published: (2025)
BrainECHO: Semantic Brain Signal Decoding through Vector-Quantized Spectrogram Reconstruction for Whisper-Enhanced Text Generation
by: Li, Jilong, et al.
Published: (2024)
Speaker- and Text-Independent Estimation of Articulatory Movements and Phoneme Alignments from Speech
by: Weise, Tobias, et al.
Published: (2024)
PARCO: Phoneme-Augmented Robust Contextual ASR via Contrastive Entity Disambiguation
by: He, Jiajun, et al.
Published: (2025)
ProKWS: Personalized Keyword Spotting via Collaborative Learning of Phonemes and Prosody
by: Pan, Jianan, et al.
Published: (2026)
Decoder-only Conformer with Modality-aware Sparse Mixtures of Experts for ASR
by: Lee, Jaeyoung, et al.
Published: (2026)
Harf-Speech: A Clinically Aligned Framework for Arabic Phoneme-Level Speech Assessment
by: Azad, Asif, et al.
Published: (2026)
TSPC: A Two-Stage Phoneme-Centric Architecture for code-switching Vietnamese-English Speech Recognition
by: Anh, Tran Nguyen, et al.
Published: (2025)
MOSS-TTSD: Text to Spoken Dialogue Generation
by: Zhang, Yuqian, et al.
Published: (2026)
DOA: Training-Free Decoder-Only Attention Policy for Long-Form Simultaneous Translation with SpeechLLMs
by: Papi, Sara, et al.
Published: (2026)
Do LLM Decoders Listen Fairly? Benchmarking How Language Model Priors Shape Bias in Speech Recognition
by: Ginjala, Srishti, et al.
Published: (2026)
Cross-Attention is Half Explanation in Speech-to-Text Models
by: Papi, Sara, et al.
Published: (2025)
Soundwave: Less is More for Speech-Text Alignment in LLMs
by: Zhang, Yuhao, et al.
Published: (2025)
Beyond Modality Limitations: A Unified MLLM Approach to Automated Speaking Assessment with Effective Curriculum Learning
by: Fang, Yu-Hsuan, et al.
Published: (2025)
Story2MIDI: Emotionally Aligned Music Generation from Text
by: Shokri, Mohammad, et al.
Published: (2025)
ASKD-Whisper: Adaptive Self-knowledge Distillation for Efficient and Low-Latency Automatic Speech Recognition
by: Lee, Junseok, et al.
Published: (2026)
Data-efficient Targeted Token-level Preference Optimization for LLM-based Text-to-Speech
by: Kotoge, Rikuto, et al.
Published: (2025)
Emotion-Aligned Generation in Diffusion Text to Speech Models via Preference-Guided Optimization
by: Shi, Jiacheng, et al.
Published: (2025)
Multi-class Decoding of Attended Speaker Direction Using Electroencephalogram and Audio Spatial Spectrum
by: Zhang, Yuanming, et al.
Published: (2024)
MEBM-Phoneme: Multi-scale Enhanced BrainMagic for End-to-End MEG Phoneme Classification
by: Jinghua, Liang, et al.
Published: (2026)
A Penny for Your Thoughts: Decoding Speech from Inexpensive Brain Signals
by: Auster, Quentin, et al.
Published: (2025)
Raon-Speech Technical Report
by: Kim, Beomsoo, et al.
Published: (2026)
Seamless Dysfluent Speech Text Alignment for Disordered Speech Analysis
by: Ye, Zongli, et al.
Published: (2025)
AQAScore: Evaluating Semantic Alignment in Text-to-Audio Generation via Audio Question Answering
by: Kuan, Chun-Yi, et al.
Published: (2026)
MoST: Mixing Speech and Text with Modality-Aware Mixture of Experts
by: Lou, Yuxuan, et al.
Published: (2026)
Unveiling Audio Deepfake Origins: A Deep Metric learning And Conformer Network Approach With Ensemble Fusion
by: Kulkarni, Ajinkya, et al.
Published: (2025)
DiffuSpeech: Silent Thought, Spoken Answer via Unified Speech-Text Diffusion
by: Lou, Yuxuan, et al.
Published: (2026)
Breaking Through the Spike: Spike Window Decoding for Accelerated and Precise Automatic Speech Recognition
by: Zhang, Wei, et al.
Published: (2025)
A Self-Refining Framework for Enhancing ASR Using TTS-Synthesized Data
by: Chou, Cheng-Kang, et al.
Published: (2025)
Neural networks for Text-to-Speech evaluation
by: Trofimenko, Ilya, et al.
Published: (2026)
Voice Communication Analysis in Esports
by: Vinot, Aymeric, et al.
Published: (2024)
USAT: A Universal Speaker-Adaptive Text-to-Speech Approach
by: Wang, Wenbin, et al.
Published: (2024)
Length-Aware Rotary Position Embedding for Text-Speech Alignment
by: Kim, Hyeongju, et al.
Published: (2025)
Text2midi: Generating Symbolic Music from Captions
by: Bhandari, Keshav, et al.
Published: (2024)
Dual Information Speech Language Models for Emotional Conversations
by: Wang, Chun, et al.
Published: (2025)
Multi-Convformer: Extending Conformer with Multiple Convolution Kernels
by: Prabhu, Darshan, et al.
Published: (2024)
A Human-in-the-Loop Approach to Improving Cross-Text Prosody Transfer
by: Maurya, Himanshu, et al.
Published: (2024)
How "Real" is Your Real-Time Simultaneous Speech-to-Text Translation System?
by: Papi, Sara, et al.
Published: (2024)
Auffusion: Leveraging the Power of Diffusion and Large Language Models for Text-to-Audio Generation
by: Xue, Jinlong, et al.
Published: (2024)