Bibliographic Details
Main Authors:	Kässmann, Tobias; Liu, Yining; Liu, Danni
Format:	Preprint
Published:	2024
Subjects:	Sound; Computation and Language; Audio and Speech Processing
Online Access:	https://arxiv.org/abs/2407.17172

Similar Items

Configurable Multilingual ASR with Speech Summary Representations
by: Zhu, Harrison, et al.
Published: (2024)

VoiceCraft-X: Unifying Multilingual, Voice-Cloning Speech Synthesis and Speech Editing
by: Zheng, Zhisheng, et al.
Published: (2025)

FluentEditor2: Text-based Speech Editing by Modeling Multi-Scale Acoustic and Prosody Consistency
by: Liu, Rui, et al.
Published: (2024)

Sequential Editing for Lifelong Training of Speech Recognition Models
by: Kulshreshtha, Devang, et al.
Published: (2024)

SummaryMixing: A Linear-Complexity Alternative to Self-Attention for Speech Recognition and Understanding
by: Parcollet, Titouan, et al.
Published: (2023)

FASA: a Flexible and Automatic Speech Aligner for Extracting High-quality Aligned Children Speech Data
by: Liu, Dancheng, et al.
Published: (2024)

Fine-grained Speech Sentiment Analysis in Chinese Psychological Support Hotlines Based on Large-scale Pre-trained Model
by: Chen, Zhonglong, et al.
Published: (2024)

StreamUni: Achieving Streaming Speech Translation with a Unified Large Speech-Language Model
by: Guo, Shoutao, et al.
Published: (2025)

Next Tokens Denoising for Speech Synthesis
by: Liu, Yanqing, et al.
Published: (2025)

Large Language Models for Dysfluency Detection in Stuttered Speech
by: Wagner, Dominik, et al.
Published: (2024)

Adaptive Inner Speech-Text Alignment for LLM-based Speech Translation
by: Liu, Henglyu, et al.
Published: (2025)

OWSM-Biasing: Contextualizing Open Whisper-Style Speech Models for Automatic Speech Recognition with Dynamic Vocabulary
by: Sudo, Yui, et al.
Published: (2025)

S2S-Arena: Evaluating Paralinguistic Instruction Following in Speech-to-Speech Models
by: Jiang, Feng, et al.
Published: (2025)

Intra- and Inter-modal Context Interaction Modeling for Conversational Speech Synthesis
by: Jia, Zhenqi, et al.
Published: (2024)

Generative Expressive Conversational Speech Synthesis
by: Liu, Rui, et al.
Published: (2024)

Pairwise Evaluation of Accent Similarity in Speech Synthesis
by: Zhong, Jinzuomu, et al.
Published: (2025)

Revisiting Self-supervised Learning of Speech Representation from a Mutual Information Perspective
by: Liu, Alexander H., et al.
Published: (2024)

PolySpeech: Exploring Unified Multitask Speech Models for Competitiveness with Single-task Models
by: Yang, Runyan, et al.
Published: (2024)

MTP-S2UT: Enhancing Speech-to-Speech Translation Quality with Multi-token Prediction
by: Wang, Jianjin, et al.
Published: (2025)

S2SBench: A Benchmark for Quantifying Intelligence Degradation in Speech-to-Speech Large Language Models
by: Fang, Yuanbo, et al.
Published: (2025)

Autoregressive Speech Synthesis without Vector Quantization
by: Meng, Lingwei, et al.
Published: (2024)

SEAL: Speech Embedding Alignment Learning for Speech Large Language Model with Retrieval-Augmented Generation
by: Sun, Chunyu, et al.
Published: (2025)

U-GIFT: Uncertainty-Guided Firewall for Toxic Speech in Few-Shot Scenario
by: Song, Jiaxin, et al.
Published: (2025)

USAD: Universal Speech and Audio Representation via Distillation
by: Chang, Heng-Jui, et al.
Published: (2025)

SimpleSpeech 2: Towards Simple and Efficient Text-to-Speech with Flow-based Scalar Latent Transformer Diffusion Models
by: Yang, Dongchao, et al.
Published: (2024)

Post-decoder Biasing for End-to-End Speech Recognition of Multi-turn Medical Interview
by: Liu, Heyang, et al.
Published: (2024)

Empowering Whisper as a Joint Multi-Talker and Target-Talker Speech Recognition System
by: Meng, Lingwei, et al.
Published: (2024)

Leveraging Large Language Models for Spontaneous Speech-Based Suicide Risk Detection
by: Gao, Yifan, et al.
Published: (2025)

Transcript-Prompted Whisper with Dictionary-Enhanced Decoding for Japanese Speech Annotation
by: Hu, Rui, et al.
Published: (2025)

DQ-Whisper: Joint Distillation and Quantization for Efficient Multilingual Speech Recognition
by: Shao, Hang, et al.
Published: (2023)

SpeechTokenizer: Unified Speech Tokenizer for Speech Large Language Models
by: Zhang, Xin, et al.
Published: (2023)

Scheduled Interleaved Speech-Text Training for Speech-to-Speech Translation with LLMs
by: Futami, Hayato, et al.
Published: (2025)

Continuous Speech Tokenizer in Text To Speech
by: Li, Yixing, et al.
Published: (2024)

Scaling Speech-Text Pre-training with Synthetic Interleaved Data
by: Zeng, Aohan, et al.
Published: (2024)

Closing the Modality Reasoning Gap for Speech Large Language Models
by: Wang, Chaoren, et al.
Published: (2026)

Boosting Large Language Model for Speech Synthesis: An Empirical Study
by: Hao, Hongkun, et al.
Published: (2023)

Towards Emotionally Consistent Text-Based Speech Editing: Introducing EmoCorrector and The ECD-TSE Dataset
by: Liu, Rui, et al.
Published: (2025)

Swedish Whispers; Leveraging a Massive Speech Corpus for Swedish Speech Recognition
by: Vesterbacka, Leonora, et al.
Published: (2025)

Modeling Sarcastic Speech: Semantic and Prosodic Cues in a Speech Synthesis Framework
by: Li, Zhu, et al.
Published: (2025)

Self-Powered LLM Modality Expansion for Large Speech-Text Models
by: Yu, Tengfei, et al.
Published: (2024)