Library Catalog

Bibliographic Details
Main Authors:	Mai, Long; Carson-Berndsen, Julie
Format:	Preprint
Published:	2025
Subjects:	Computation and Language; Artificial Intelligence; Sound; Audio and Speech Processing
Online Access:	https://arxiv.org/abs/2501.04877

Similar Items

Leveraging Unit Language Guidance to Advance Speech Modeling in Textless Speech-to-Speech Translation
by: Zhang, Yuhao, et al.
Published: (2025)

Textless NLP -- Zero Resource Challenge with Low Resource Compute
by: Ramadass, Krithiga, et al.
Published: (2024)

CTC-based Non-autoregressive Textless Speech-to-Speech Translation
by: Fang, Qingkai, et al.
Published: (2024)

Freeze-Omni: A Smart and Low Latency Speech-to-speech Dialogue Model with Frozen LLM
by: Wang, Xiong, et al.
Published: (2024)

How "Real" is Your Real-Time Simultaneous Speech-to-Text Translation System?
by: Papi, Sara, et al.
Published: (2024)

NTPP: Generative Speech Language Modeling for Dual-Channel Spoken Dialogue via Next-Token-Pair Prediction
by: Wang, Qichao, et al.
Published: (2025)

An LLM Benchmark for Addressee Recognition in Multi-modal Multi-party Dialogue
by: Inoue, Koji, et al.
Published: (2025)

Enhancing Dialogue Annotation with Speaker Characteristics Leveraging a Frozen LLM
by: Thebaud, Thomas, et al.
Published: (2025)

Benchmarking Open-ended Audio Dialogue Understanding for Large Audio-Language Models
by: Gao, Kuofeng, et al.
Published: (2024)

Toward Conversational Hungarian Speech Recognition: Introducing the BEA-Large and BEA-Dialogue Datasets
by: Gedeon, Máté, et al.
Published: (2025)

Talk With Human-like Agents: Empathetic Dialogue Through Perceptible Acoustic Reception and Reaction
by: Yan, Haoqiu, et al.
Published: (2024)

Dub-S2ST: Textless Speech-to-Speech Translation for Seamless Dubbing
by: Choi, Jeongsoo, et al.
Published: (2025)

Speech ReaLLM -- Real-time Streaming Speech Recognition with Multimodal LLMs by Teaching the Flow of Time
by: Seide, Frank, et al.
Published: (2024)

MULTI-Bench: A Multi-Turn Interactive Benchmark for Assessing Emotional Intelligence ability of Spoken Dialogue Models
by: Deng, Yayue, et al.
Published: (2025)

Using LLM for Real-Time Transcription and Summarization of Doctor-Patient Interactions into ePuskesmas in Indonesia: A Proof-of-Concept Study
by: Khatim, Nur Ahmad, et al.
Published: (2024)

Textless Acoustic Model with Self-Supervised Distillation for Noise-Robust Expressive Speech-to-Speech Translation
by: Hwang, Min-Jae, et al.
Published: (2024)

ControlAudio: Tackling Text-Guided, Timing-Indicated and Intelligible Audio Generation via Progressive Diffusion Modeling
by: Jiang, Yuxuan, et al.
Published: (2025)

PSLM: Parallel Generation of Text and Speech with LLMs for Low-Latency Spoken Dialogue Systems
by: Mitsui, Kentaro, et al.
Published: (2024)

Dialogue in Resonance: An Interactive Music Piece for Piano and Real-Time Automatic Transcription System
by: Bang, Hayeon, et al.
Published: (2025)

LLaMA-Omni2: LLM-based Real-time Spoken Chatbot with Autoregressive Streaming Speech Synthesis
by: Fang, Qingkai, et al.
Published: (2025)

MSLM-S2ST: A Multitask Speech Language Model for Textless Speech-to-Speech Translation with Speaker Style Preservation
by: Peng, Yifan, et al.
Published: (2024)

SUTA-LM: Bridging Test-Time Adaptation and Language Model Rescoring for Robust ASR
by: Huang, Wei-Ping, et al.
Published: (2025)

Analyzing Speech Unit Selection for Textless Speech-to-Speech Translation
by: Duret, Jarod, et al.
Published: (2024)

Real-time Speech Summarization for Medical Conversations
by: Le-Duc, Khai, et al.
Published: (2024)

Bob's Confetti: Phonetic Memorization Attacks in Music and Video Generation
by: Roh, Jaechul, et al.
Published: (2025)

Text2midi: Generating Symbolic Music from Captions
by: Bhandari, Keshav, et al.
Published: (2024)

Frame-Stacked Local Transformers For Efficient Multi-Codebook Speech Generation
by: Fejgin, Roy, et al.
Published: (2025)

Controlling Surprisal in Music Generation via Information Content Curve Matching
by: Bjare, Mathias Rose, et al.
Published: (2024)

FLEURS-R: A Restored Multilingual Speech Corpus for Generation Tasks
by: Ma, Min, et al.
Published: (2024)

Kling-Foley: Multimodal Diffusion Transformer for High-Quality Video-to-Audio Generation
by: Wang, Jun, et al.
Published: (2025)

Auffusion: Leveraging the Power of Diffusion and Large Language Models for Text-to-Audio Generation
by: Xue, Jinlong, et al.
Published: (2024)

Channel-Aware Domain-Adaptive Generative Adversarial Network for Robust Speech Recognition
by: Wang, Chien-Chun, et al.
Published: (2024)

VocalNet: Speech LLM with Multi-Token Prediction for Faster and High-Quality Generation
by: Wang, Yuhao, et al.
Published: (2025)

Ming-UniAudio: Speech LLM for Joint Understanding, Generation and Editing with Unified Representation
by: Yan, Canxiang, et al.
Published: (2025)

KidSpeak: A General Multi-purpose LLM for Kids' Speech Recognition and Screening
by: Sharma, Rohan, et al.
Published: (2025)

SongComposer: A Large Language Model for Lyric and Melody Generation in Song Composition
by: Ding, Shuangrui, et al.
Published: (2024)

A Non-autoregressive Generation Framework for End-to-End Simultaneous Speech-to-Speech Translation
by: Ma, Zhengrui, et al.
Published: (2024)

Optimizing the Songwriting Process: Genre-Based Lyric Generation Using Deep Learning Models
by: Cai, Tracy, et al.
Published: (2024)

SALMONN-omni: A Codec-free LLM for Full-duplex Speech Understanding and Generation
by: Yu, Wenyi, et al.
Published: (2024)

LLaSE-G1: Incentivizing Generalization Capability for LLaMA-based Speech Enhancement
by: Kang, Boyi, et al.
Published: (2025)