| Main Authors: | Mai, Long, Carson-Berndsen, Julie |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2501.04877 |
Similar Items
Leveraging Unit Language Guidance to Advance Speech Modeling in Textless Speech-to-Speech Translation
by: Zhang, Yuhao, et al.
Published: (2025)
Textless NLP -- Zero Resource Challenge with Low Resource Compute
by: Ramadass, Krithiga, et al.
Published: (2024)
CTC-based Non-autoregressive Textless Speech-to-Speech Translation
by: Fang, Qingkai, et al.
Published: (2024)
Freeze-Omni: A Smart and Low Latency Speech-to-speech Dialogue Model with Frozen LLM
by: Wang, Xiong, et al.
Published: (2024)
How "Real" is Your Real-Time Simultaneous Speech-to-Text Translation System?
by: Papi, Sara, et al.
Published: (2024)
NTPP: Generative Speech Language Modeling for Dual-Channel Spoken Dialogue via Next-Token-Pair Prediction
by: Wang, Qichao, et al.
Published: (2025)
An LLM Benchmark for Addressee Recognition in Multi-modal Multi-party Dialogue
by: Inoue, Koji, et al.
Published: (2025)
Enhancing Dialogue Annotation with Speaker Characteristics Leveraging a Frozen LLM
by: Thebaud, Thomas, et al.
Published: (2025)
Benchmarking Open-ended Audio Dialogue Understanding for Large Audio-Language Models
by: Gao, Kuofeng, et al.
Published: (2024)
Toward Conversational Hungarian Speech Recognition: Introducing the BEA-Large and BEA-Dialogue Datasets
by: Gedeon, Máté, et al.
Published: (2025)
Talk With Human-like Agents: Empathetic Dialogue Through Perceptible Acoustic Reception and Reaction
by: Yan, Haoqiu, et al.
Published: (2024)
Dub-S2ST: Textless Speech-to-Speech Translation for Seamless Dubbing
by: Choi, Jeongsoo, et al.
Published: (2025)
Speech ReaLLM -- Real-time Streaming Speech Recognition with Multimodal LLMs by Teaching the Flow of Time
by: Seide, Frank, et al.
Published: (2024)
MULTI-Bench: A Multi-Turn Interactive Benchmark for Assessing Emotional Intelligence ability of Spoken Dialogue Models
by: Deng, Yayue, et al.
Published: (2025)
Using LLM for Real-Time Transcription and Summarization of Doctor-Patient Interactions into ePuskesmas in Indonesia: A Proof-of-Concept Study
by: Khatim, Nur Ahmad, et al.
Published: (2024)
Textless Acoustic Model with Self-Supervised Distillation for Noise-Robust Expressive Speech-to-Speech Translation
by: Hwang, Min-Jae, et al.
Published: (2024)
ControlAudio: Tackling Text-Guided, Timing-Indicated and Intelligible Audio Generation via Progressive Diffusion Modeling
by: Jiang, Yuxuan, et al.
Published: (2025)
PSLM: Parallel Generation of Text and Speech with LLMs for Low-Latency Spoken Dialogue Systems
by: Mitsui, Kentaro, et al.
Published: (2024)
Dialogue in Resonance: An Interactive Music Piece for Piano and Real-Time Automatic Transcription System
by: Bang, Hayeon, et al.
Published: (2025)
LLaMA-Omni2: LLM-based Real-time Spoken Chatbot with Autoregressive Streaming Speech Synthesis
by: Fang, Qingkai, et al.
Published: (2025)
MSLM-S2ST: A Multitask Speech Language Model for Textless Speech-to-Speech Translation with Speaker Style Preservation
by: Peng, Yifan, et al.
Published: (2024)
SUTA-LM: Bridging Test-Time Adaptation and Language Model Rescoring for Robust ASR
by: Huang, Wei-Ping, et al.
Published: (2025)
Analyzing Speech Unit Selection for Textless Speech-to-Speech Translation
by: Duret, Jarod, et al.
Published: (2024)
Real-time Speech Summarization for Medical Conversations
by: Le-Duc, Khai, et al.
Published: (2024)
Bob's Confetti: Phonetic Memorization Attacks in Music and Video Generation
by: Roh, Jaechul, et al.
Published: (2025)
Text2midi: Generating Symbolic Music from Captions
by: Bhandari, Keshav, et al.
Published: (2024)
Frame-Stacked Local Transformers For Efficient Multi-Codebook Speech Generation
by: Fejgin, Roy, et al.
Published: (2025)
Controlling Surprisal in Music Generation via Information Content Curve Matching
by: Bjare, Mathias Rose, et al.
Published: (2024)
FLEURS-R: A Restored Multilingual Speech Corpus for Generation Tasks
by: Ma, Min, et al.
Published: (2024)
Kling-Foley: Multimodal Diffusion Transformer for High-Quality Video-to-Audio Generation
by: Wang, Jun, et al.
Published: (2025)
Auffusion: Leveraging the Power of Diffusion and Large Language Models for Text-to-Audio Generation
by: Xue, Jinlong, et al.
Published: (2024)
Channel-Aware Domain-Adaptive Generative Adversarial Network for Robust Speech Recognition
by: Wang, Chien-Chun, et al.
Published: (2024)
VocalNet: Speech LLM with Multi-Token Prediction for Faster and High-Quality Generation
by: Wang, Yuhao, et al.
Published: (2025)
Ming-UniAudio: Speech LLM for Joint Understanding, Generation and Editing with Unified Representation
by: Yan, Canxiang, et al.
Published: (2025)
KidSpeak: A General Multi-purpose LLM for Kids' Speech Recognition and Screening
by: Sharma, Rohan, et al.
Published: (2025)
SongComposer: A Large Language Model for Lyric and Melody Generation in Song Composition
by: Ding, Shuangrui, et al.
Published: (2024)
A Non-autoregressive Generation Framework for End-to-End Simultaneous Speech-to-Speech Translation
by: Ma, Zhengrui, et al.
Published: (2024)
Optimizing the Songwriting Process: Genre-Based Lyric Generation Using Deep Learning Models
by: Cai, Tracy, et al.
Published: (2024)
SALMONN-omni: A Codec-free LLM for Full-duplex Speech Understanding and Generation
by: Yu, Wenyi, et al.
Published: (2024)
LLaSE-G1: Incentivizing Generalization Capability for LLaMA-based Speech Enhancement
by: Kang, Boyi, et al.
Published: (2025)