| Main Authors: | Kässmann, Tobias, Liu, Yining, Liu, Danni |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2407.17172 |
Similar Items
Configurable Multilingual ASR with Speech Summary Representations
by: Zhu, Harrison, et al.
Published: (2024)
VoiceCraft-X: Unifying Multilingual, Voice-Cloning Speech Synthesis and Speech Editing
by: Zheng, Zhisheng, et al.
Published: (2025)
FluentEditor2: Text-based Speech Editing by Modeling Multi-Scale Acoustic and Prosody Consistency
by: Liu, Rui, et al.
Published: (2024)
Sequential Editing for Lifelong Training of Speech Recognition Models
by: Kulshreshtha, Devang, et al.
Published: (2024)
SummaryMixing: A Linear-Complexity Alternative to Self-Attention for Speech Recognition and Understanding
by: Parcollet, Titouan, et al.
Published: (2023)
FASA: a Flexible and Automatic Speech Aligner for Extracting High-quality Aligned Children Speech Data
by: Liu, Dancheng, et al.
Published: (2024)
Fine-grained Speech Sentiment Analysis in Chinese Psychological Support Hotlines Based on Large-scale Pre-trained Model
by: Chen, Zhonglong, et al.
Published: (2024)
StreamUni: Achieving Streaming Speech Translation with a Unified Large Speech-Language Model
by: Guo, Shoutao, et al.
Published: (2025)
Next Tokens Denoising for Speech Synthesis
by: Liu, Yanqing, et al.
Published: (2025)
Large Language Models for Dysfluency Detection in Stuttered Speech
by: Wagner, Dominik, et al.
Published: (2024)
Adaptive Inner Speech-Text Alignment for LLM-based Speech Translation
by: Liu, Henglyu, et al.
Published: (2025)
OWSM-Biasing: Contextualizing Open Whisper-Style Speech Models for Automatic Speech Recognition with Dynamic Vocabulary
by: Sudo, Yui, et al.
Published: (2025)
S2S-Arena: Evaluating Paralinguistic Instruction Following in Speech-to-Speech Models
by: Jiang, Feng, et al.
Published: (2025)
Intra- and Inter-modal Context Interaction Modeling for Conversational Speech Synthesis
by: Jia, Zhenqi, et al.
Published: (2024)
Generative Expressive Conversational Speech Synthesis
by: Liu, Rui, et al.
Published: (2024)
Pairwise Evaluation of Accent Similarity in Speech Synthesis
by: Zhong, Jinzuomu, et al.
Published: (2025)
Revisiting Self-supervised Learning of Speech Representation from a Mutual Information Perspective
by: Liu, Alexander H., et al.
Published: (2024)
PolySpeech: Exploring Unified Multitask Speech Models for Competitiveness with Single-task Models
by: Yang, Runyan, et al.
Published: (2024)
MTP-S2UT: Enhancing Speech-to-Speech Translation Quality with Multi-token Prediction
by: Wang, Jianjin, et al.
Published: (2025)
S2SBench: A Benchmark for Quantifying Intelligence Degradation in Speech-to-Speech Large Language Models
by: Fang, Yuanbo, et al.
Published: (2025)
Autoregressive Speech Synthesis without Vector Quantization
by: Meng, Lingwei, et al.
Published: (2024)
SEAL: Speech Embedding Alignment Learning for Speech Large Language Model with Retrieval-Augmented Generation
by: Sun, Chunyu, et al.
Published: (2025)
U-GIFT: Uncertainty-Guided Firewall for Toxic Speech in Few-Shot Scenario
by: Song, Jiaxin, et al.
Published: (2025)
USAD: Universal Speech and Audio Representation via Distillation
by: Chang, Heng-Jui, et al.
Published: (2025)
SimpleSpeech 2: Towards Simple and Efficient Text-to-Speech with Flow-based Scalar Latent Transformer Diffusion Models
by: Yang, Dongchao, et al.
Published: (2024)
Post-decoder Biasing for End-to-End Speech Recognition of Multi-turn Medical Interview
by: Liu, Heyang, et al.
Published: (2024)
Empowering Whisper as a Joint Multi-Talker and Target-Talker Speech Recognition System
by: Meng, Lingwei, et al.
Published: (2024)
Leveraging Large Language Models for Spontaneous Speech-Based Suicide Risk Detection
by: Gao, Yifan, et al.
Published: (2025)
Transcript-Prompted Whisper with Dictionary-Enhanced Decoding for Japanese Speech Annotation
by: Hu, Rui, et al.
Published: (2025)
DQ-Whisper: Joint Distillation and Quantization for Efficient Multilingual Speech Recognition
by: Shao, Hang, et al.
Published: (2023)
SpeechTokenizer: Unified Speech Tokenizer for Speech Large Language Models
by: Zhang, Xin, et al.
Published: (2023)
Scheduled Interleaved Speech-Text Training for Speech-to-Speech Translation with LLMs
by: Futami, Hayato, et al.
Published: (2025)
Continuous Speech Tokenizer in Text To Speech
by: Li, Yixing, et al.
Published: (2024)
Scaling Speech-Text Pre-training with Synthetic Interleaved Data
by: Zeng, Aohan, et al.
Published: (2024)
Closing the Modality Reasoning Gap for Speech Large Language Models
by: Wang, Chaoren, et al.
Published: (2026)
Boosting Large Language Model for Speech Synthesis: An Empirical Study
by: Hao, Hongkun, et al.
Published: (2023)
Towards Emotionally Consistent Text-Based Speech Editing: Introducing EmoCorrector and The ECD-TSE Dataset
by: Liu, Rui, et al.
Published: (2025)
Swedish Whispers; Leveraging a Massive Speech Corpus for Swedish Speech Recognition
by: Vesterbacka, Leonora, et al.
Published: (2025)
Modeling Sarcastic Speech: Semantic and Prosodic Cues in a Speech Synthesis Framework
by: Li, Zhu, et al.
Published: (2025)
Self-Powered LLM Modality Expansion for Large Speech-Text Models
by: Yu, Tengfei, et al.
Published: (2024)