Library Catalog

Bibliographic Details
Main Authors:	Wu, Yihan, Lu, Yichen, Peng, Yifan, Wang, Xihua, Song, Ruihua, Watanabe, Shinji
Format:	Preprint
Published:	2024
Subjects:	Audio and Speech Processing; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2412.19005

Similar Items

Robust Audiovisual Speech Recognition Models with Mixture-of-Experts
by: Wu, Yihan, et al.
Published: (2024)

SpeechComposer: Unifying Multiple Speech Tasks with Prompt Composition
by: Wu, Yihan, et al.
Published: (2024)

Neural Blind Source Separation and Diarization for Distant Speech Recognition
by: Bando, Yoshiaki, et al.
Published: (2024)

FastAdaSP: Multitask-Adapted Efficient Inference for Large Speech Language Model
by: Lu, Yichen, et al.
Published: (2024)

Speech-DRAME: A Framework for Human-Aligned Benchmarks in Speech Role-Play
by: Shi, Jiatong, et al.
Published: (2025)

OWLS: Scaling Laws for Multilingual Speech Recognition and Translation Models
by: Chen, William, et al.
Published: (2025)

Aligning Text-to-Music Evaluation with Human Preferences
by: Huang, Yichen, et al.
Published: (2025)

VSSFlow: Unifying Video-conditioned Sound and Speech Generation via Joint Learning
by: Cheng, Xin, et al.
Published: (2025)

LoVA: Long-form Video-to-Audio Generation
by: Cheng, Xin, et al.
Published: (2024)

Text-To-Speech Synthesis In The Wild
by: Jung, Jee-weon, et al.
Published: (2024)

Do Neural Codecs Generalize? A Controlled Study Across Unseen Languages and Non-Speech Tasks
by: Wang, Shih-Heng, et al.
Published: (2026)

The Multimodal Information Based Speech Processing (MISP) 2025 Challenge: Audio-Visual Diarization and Recognition
by: Gao, Ming, et al.
Published: (2025)

Joint Optimization of Streaming and Non-Streaming Automatic Speech Recognition with Multi-Decoder and Knowledge Distillation
by: Shakeel, Muhammad, et al.
Published: (2024)

SpoofCeleb: Speech Deepfake Detection and SASV In The Wild
by: Jung, Jee-weon, et al.
Published: (2024)

Visual Speech Recognition for Languages with Limited Labeled Data using Automatic Labels from Whisper
by: Yeo, Jeong Hun, et al.
Published: (2023)

OWSM-CTC: An Open Encoder-Only Speech Foundation Model for Speech Recognition, Translation, and Language Identification
by: Peng, Yifan, et al.
Published: (2024)

Towards Robust Speech Representation Learning for Thousands of Languages
by: Chen, William, et al.
Published: (2024)

Contextualized End-to-end Automatic Speech Recognition with Intermediate Biasing Loss
by: Shakeel, Muhammad, et al.
Published: (2024)

Thinking in Directivity: Speech Large Language Model for Multi-Talker Directional Speech Recognition
by: Xie, Jiamin, et al.
Published: (2025)

Enhancing Speech Emotion Recognition through Segmental Average Pooling of Self-Supervised Learning Features
by: Hyeon, Jonghwan, et al.
Published: (2024)

Improving Code-Switching Speech Recognition with TTS Data Augmentation
by: Yeo, Yue Heng, et al.
Published: (2026)

Contextualized Automatic Speech Recognition with Dynamic Vocabulary
by: Sudo, Yui, et al.
Published: (2024)

Clustering and Mining Accented Speech for Inclusive and Fair Speech Recognition
by: Kim, Jaeyoung, et al.
Published: (2024)

SoloSpeech: Enhancing Intelligibility and Quality in Target Speech Extraction through a Cascaded Generative Pipeline
by: Wang, Helin, et al.
Published: (2025)

Beyond Silence: Bias Analysis through Loss and Asymmetric Approach in Audio Anti-Spoofing
by: Shim, Hye-jin, et al.
Published: (2024)

Multi-Convformer: Extending Conformer with Multiple Convolution Kernels
by: Prabhu, Darshan, et al.
Published: (2024)

Speech Recognition-based Feature Extraction for Enhanced Automatic Severity Classification in Dysarthric Speech
by: Choi, Yerin, et al.
Published: (2024)

Qieemo: Speech Is All You Need in the Emotion Recognition in Conversations
by: Chen, Jinming, et al.
Published: (2025)

Personalized Adversarial Data Augmentation for Dysarthric and Elderly Speech Recognition
by: Jin, Zengrui, et al.
Published: (2022)

GMP-TL: Gender-augmented Multi-scale Pseudo-label Enhanced Transfer Learning for Speech Emotion Recognition
by: Pan, Yu, et al.
Published: (2024)

ESPnet-EZ: Python-only ESPnet for Easy Fine-tuning and Integration
by: Someki, Masao, et al.
Published: (2024)

Aligning Speech to Languages to Enhance Code-switching Speech Recognition
by: Liu, Hexin, et al.
Published: (2024)

MFHCA: Enhancing Speech Emotion Recognition Via Multi-Spatial Fusion and Hierarchical Cooperative Attention
by: Jiao, Xinxin, et al.
Published: (2024)

Discrete Speech Unit Extraction via Independent Component Analysis
by: Nakamura, Tomohiko, et al.
Published: (2025)

Contextualized Automatic Speech Recognition with Attention-Based Bias Phrase Boosted Beam Search
by: Sudo, Yui, et al.
Published: (2024)

Disentangling Speakers in Multi-Talker Speech Recognition with Speaker-Aware CTC
by: Kang, Jiawen, et al.
Published: (2024)

TinyML for Speech Recognition
by: Barovic, Andrew, et al.
Published: (2025)

Open Source State-Of-the-Art Solution for Romanian Speech Recognition
by: Pirlogeanu, Gabriel, et al.
Published: (2025)

Variational Low-Rank Adaptation for Personalized Impaired Speech Recognition
by: Pokel, Niclas, et al.
Published: (2025)

The X-LANCE Technical Report for Interspeech 2024 Speech Processing Using Discrete Speech Unit Challenge
by: Guo, Yiwei, et al.
Published: (2024)