:: Library Catalog

Bibliographic Details
Main Authors:	Iliescu, Dan Andrei; Mohan, Devang Savita Ram; Teh, Tian Huey; Hodari, Zack
Format:	Preprint
Published:	2023
Subjects:	Audio and Speech Processing; Artificial Intelligence; Computation and Language; Machine Learning
Online Access:	https://arxiv.org/abs/2303.09446

Similar Items

No Verifiable Reward for Prosody: Toward Preference-Guided Prosody Learning in TTS
by: Shin, Seungyoun, et al.
Published: (2025)

Maestro-EVC: Controllable Emotional Voice Conversion Guided by References and Explicit Prosody
by: Yoon, Jinsung, et al.
Published: (2025)

DiffStyleTTS: Diffusion-based Hierarchical Prosody Modeling for Text-to-Speech with Diverse and Controllable Styles
by: Liu, Jiaxuan, et al.
Published: (2024)

Multi-Modal Automatic Prosody Annotation with Contrastive Pretraining of SSWP
by: Zhong, Jinzuomu, et al.
Published: (2023)

RepCNN: Micro-sized, Mighty Models for Wakeword Detection
by: Kundu, Arnav, et al.
Published: (2024)

A Human-in-the-Loop Approach to Improving Cross-Text Prosody Transfer
by: Maurya, Himanshu, et al.
Published: (2024)

ProKWS: Personalized Keyword Spotting via Collaborative Learning of Phonemes and Prosody
by: Pan, Jianan, et al.
Published: (2026)

ProMode: A Speech Prosody Model Conditioned on Acoustic and Textual Inputs
by: Eren, Eray, et al.
Published: (2025)

Investigating Stochastic Methods for Prosody Modeling in Speech Synthesis
by: Mayer, Paul, et al.
Published: (2025)

Fast Context-Biasing for CTC and Transducer ASR models with CTC-based Word Spotter
by: Andrusenko, Andrei, et al.
Published: (2024)

DiffProsody: Diffusion-based Latent Prosody Generation for Expressive Speech Synthesis with Prosody Conditional Adversarial Training
by: Oh, Hyung-Seok, et al.
Published: (2023)

NGPU-LM: GPU-Accelerated N-Gram Language Model for Context-Biasing in Greedy ASR Decoding
by: Bataev, Vladimir, et al.
Published: (2025)

FlexCTC: GPU-powered CTC Beam Decoding With Advanced Contextual Abilities
by: Grigoryan, Lilit, et al.
Published: (2025)

Pushing the Limits of Beam Search Decoding for Transducer-based ASR models
by: Grigoryan, Lilit, et al.
Published: (2025)

OWLS: Scaling Laws for Multilingual Speech Recognition and Translation Models
by: Chen, William, et al.
Published: (2025)

Instruction Data Generation and Unsupervised Adaptation for Speech Language Models
by: Noroozi, Vahid, et al.
Published: (2024)

An Automated End-to-End Open-Source Software for High-Quality Text-to-Speech Dataset Generation
by: Gunduz, Ahmet, et al.
Published: (2024)

XMAD-Bench: Cross-Domain Multilingual Audio Deepfake Benchmark
by: Ciobanu, Ioan-Paul, et al.
Published: (2025)

ProsodyLM: Uncovering the Emerging Prosody Processing Capabilities in Speech Language Models
by: Qian, Kaizhi, et al.
Published: (2025)

Usefulness of Emotional Prosody in Neural Machine Translation
by: Brazier, Charles, et al.
Published: (2024)

Counterfactual Activation Editing for Post-hoc Prosody and Mispronunciation Correction in TTS Models
by: Lee, Kyowoon, et al.
Published: (2025)

Stable-TTS: Stable Speaker-Adaptive Text-to-Speech Synthesis via Prosody Prompting
by: Han, Wooseok, et al.
Published: (2024)

Prosody-Adaptable Audio Codecs for Zero-Shot Voice Conversion via In-Context Learning
by: Zhao, Junchuan, et al.
Published: (2025)

GTR-Voice: Articulatory Phonetics Informed Controllable Expressive Speech Synthesis
by: Li, Zehua Kcriss, et al.
Published: (2024)

WavCube: Unifying Speech Representation for Understanding and Generation via Semantic-Acoustic Joint Modeling
by: Yang, Guanrou, et al.
Published: (2026)

Do Music Generation Models Encode Music Theory?
by: Wei, Megan, et al.
Published: (2024)

PRESENT: Zero-Shot Text-to-Prosody Control
by: Lam, Perry, et al.
Published: (2024)

DiTAR: Diffusion Transformer Autoregressive Modeling for Speech Generation
by: Jia, Dongya, et al.
Published: (2025)

MoonCast: High-Quality Zero-Shot Podcast Generation
by: Ju, Zeqian, et al.
Published: (2025)

Controlling Surprisal in Music Generation via Information Content Curve Matching
by: Bjare, Mathias Rose, et al.
Published: (2024)

A Variational Framework for Improving Naturalness in Generative Spoken Language Models
by: Chen, Li-Wei, et al.
Published: (2025)

TiCo: Time-Controllable Spoken Dialogue Model
by: Chang, Kai-Wei, et al.
Published: (2026)

Imagine to Hear: Auditory Knowledge Generation can be an Effective Assistant for Language Models
by: Yoo, Suho, et al.
Published: (2025)

GenTranslate: Large Language Models are Generative Multilingual Speech and Machine Translators
by: Hu, Yuchen, et al.
Published: (2024)

Unsupervised Speech Segmentation: A General Approach Using Speech Language Models
by: Elmakies, Avishai, et al.
Published: (2025)

Generative Speech Recognition Error Correction with Large Language Models and Task-Activating Prompting
by: Yang, Chao-Han Huck, et al.
Published: (2023)

AQAScore: Evaluating Semantic Alignment in Text-to-Audio Generation via Audio Question Answering
by: Kuan, Chun-Yi, et al.
Published: (2026)

PSLM: Parallel Generation of Text and Speech with LLMs for Low-Latency Spoken Dialogue Systems
by: Mitsui, Kentaro, et al.
Published: (2024)

C3LLM: Conditional Multimodal Content Generation Using Large Language Models
by: Wang, Zixuan, et al.
Published: (2024)

Missing Melodies: AI Music Generation and its "Nearly" Complete Omission of the Global South
by: Mehta, Atharva, et al.
Published: (2024)