| Main Authors: | Kim, Minsu, Jung, Jee-weon, Rha, Hyeongseop, Maiti, Soumi, Arora, Siddhant, Chang, Xuankai, Watanabe, Shinji, Ro, Yong Man |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2402.16021 |
Similar Items
VoxtLM: unified decoder-only models for consolidating speech recognition/synthesis and speech/text continuation tasks
by: Maiti, Soumi, et al.
Published: (2023)
Efficient Training for Multilingual Visual Speech Recognition: Pre-training with Discretized Visual Speech Representation
by: Kim, Minsu, et al.
Published: (2024)
SynesLM: A Unified Approach for Audio-visual Speech Recognition and Translation via Language Model and Synthetic Data
by: Lu, Yichen, et al.
Published: (2024)
OWSM v3.1: Better and Faster Open Whisper-Style Speech Models based on E-Branchformer
by: Peng, Yifan, et al.
Published: (2024)
Decoding Strategies for Diffusion-Based ASR: A Systematic Evaluation of Confidence-Based Thresholding
by: Yeo, Jeong Hun, et al.
Published: (2026)
MMS-LLaMA: Efficient LLM-based Audio-Visual Speech Recognition with Minimal Multimodal Speech Tokens
by: Yeo, Jeong Hun, et al.
Published: (2025)
Scheduled Interleaved Speech-Text Training for Speech-to-Speech Translation with LLMs
by: Futami, Hayato, et al.
Published: (2025)
Chain-of-Thought Training for Open E2E Spoken Dialogue Systems
by: Arora, Siddhant, et al.
Published: (2025)
UniverSLU: Universal Spoken Language Understanding for Diverse Tasks with Natural Language Instructions
by: Arora, Siddhant, et al.
Published: (2023)
Improving Design of Input Condition Invariant Speech Enhancement
by: Zhang, Wangyou, et al.
Published: (2024)
Visual Speech Recognition for Languages with Limited Labeled Data using Automatic Labels from Whisper
by: Yeo, Jeong Hun, et al.
Published: (2023)
On the Evaluation of Speech Foundation Models for Spoken Language Understanding
by: Arora, Siddhant, et al.
Published: (2024)
Textless Unit-to-Unit training for Many-to-Many Multilingual Speech-to-Speech Translation
by: Kim, Minsu, et al.
Published: (2023)
Decoder-only Architecture for Streaming End-to-end Speech Recognition
by: Tsunoo, Emiru, et al.
Published: (2024)
Improving Audio Captioning Models with Fine-grained Audio Features, Text Embedding Supervision, and LLM Mix-up Augmentation
by: Wu, Shih-Lun, et al.
Published: (2023)
Towards Robust Speech Representation Learning for Thousands of Languages
by: Chen, William, et al.
Published: (2024)
Hypothesis Clustering and Merging: Novel MultiTalker Speech Recognition with Speaker Tokens
by: Kashiwagi, Yosuke, et al.
Published: (2024)
Spiralformer: Low Latency Encoder for Streaming Speech Recognition with Circular Layer Skipping and Early Exiting
by: Tsunoo, Emiru, et al.
Published: (2025)
Rapid Language Adaptation for Multilingual E2E Speech Recognition Using Encoder Prompting
by: Kashiwagi, Yosuke, et al.
Published: (2024)
Text-To-Speech Synthesis In The Wild
by: Jung, Jee-weon, et al.
Published: (2024)
Towards Inclusive Communication: A Unified Framework for Generating Spoken Language from Sign, Lip, and Audio
by: Yeo, Jeong Hun, et al.
Published: (2025)
Beyond Performance Plateaus: A Comprehensive Study on Scalability in Speech Enhancement
by: Zhang, Wangyou, et al.
Published: (2024)
Semi-Autoregressive Streaming ASR With Label Context
by: Arora, Siddhant, et al.
Published: (2023)
SpeechBERTScore: Reference-Aware Automatic Evaluation of Speech Generation Leveraging NLP Evaluation Metrics
by: Saeki, Takaaki, et al.
Published: (2024)
AV2AV: Direct Audio-Visual Speech to Audio-Visual Speech Translation with Unified Audio-Visual Speech Representation
by: Choi, Jeongsoo, et al.
Published: (2023)
Finding Task-specific Subnetworks in Multi-task Spoken Language Understanding Model
by: Futami, Hayato, et al.
Published: (2024)
Decoder-only Architecture for Speech Recognition with CTC Prompts and Text Data Augmentation
by: Tsunoo, Emiru, et al.
Published: (2023)
SpeechComposer: Unifying Multiple Speech Tasks with Prompt Composition
by: Wu, Yihan, et al.
Published: (2024)
Talking Turns: Benchmarking Audio Foundation Models on Turn-Taking Dynamics
by: Arora, Siddhant, et al.
Published: (2025)
Robust Audiovisual Speech Recognition Models with Mixture-of-Experts
by: Wu, Yihan, et al.
Published: (2024)
Personalized Lip Reading: Adapting to Your Unique Lip Movements with Vision and Language
by: Yeo, Jeong Hun, et al.
Published: (2024)
Where Visual Speech Meets Language: VSP-LLM Framework for Efficient and Context-Aware Visual Speech Processing
by: Yeo, Jeong Hun, et al.
Published: (2024)
OWSM-CTC: An Open Encoder-Only Speech Foundation Model for Speech Recognition, Translation, and Language Identification
by: Peng, Yifan, et al.
Published: (2024)
Generating Data with Text-to-Speech and Large-Language Models for Conversational Speech Recognition
by: Cornell, Samuele, et al.
Published: (2024)
Beyond Silence: Bias Analysis through Loss and Asymmetric Approach in Audio Anti-Spoofing
by: Shim, Hye-jin, et al.
Published: (2024)
SpoofCeleb: Speech Deepfake Detection and SASV In The Wild
by: Jung, Jee-weon, et al.
Published: (2024)
Text-driven Talking Face Synthesis by Reprogramming Audio-driven Models
by: Choi, Jeongsoo, et al.
Published: (2023)
Chain-of-Thought Reasoning in Streaming Full-Duplex End-to-End Spoken Dialogue Systems
by: Arora, Siddhant, et al.
Published: (2025)
Prompt Tuning of Deep Neural Networks for Speaker-adaptive Visual Speech Recognition
by: Kim, Minsu, et al.
Published: (2023)
Zero-AVSR: Zero-Shot Audio-Visual Speech Recognition with LLMs by Learning Language-Agnostic Speech Representations
by: Yeo, Jeong Hun, et al.
Published: (2025)