Library Catalog

Bibliographic Details
Main Authors:	Kim, Minsu; Jung, Jee-weon; Rha, Hyeongseop; Maiti, Soumi; Arora, Siddhant; Chang, Xuankai; Watanabe, Shinji; Ro, Yong Man
Format:	Preprint
Published:	2024
Subjects:	Computation and Language; Artificial Intelligence; Computer Vision and Pattern Recognition; Audio and Speech Processing
Online Access:	https://arxiv.org/abs/2402.16021

Similar Items

VoxtLM: unified decoder-only models for consolidating speech recognition/synthesis and speech/text continuation tasks
by: Maiti, Soumi, et al.
Published: (2023)

Efficient Training for Multilingual Visual Speech Recognition: Pre-training with Discretized Visual Speech Representation
by: Kim, Minsu, et al.
Published: (2024)

SynesLM: A Unified Approach for Audio-visual Speech Recognition and Translation via Language Model and Synthetic Data
by: Lu, Yichen, et al.
Published: (2024)

OWSM v3.1: Better and Faster Open Whisper-Style Speech Models based on E-Branchformer
by: Peng, Yifan, et al.
Published: (2024)

Decoding Strategies for Diffusion-Based ASR: A Systematic Evaluation of Confidence-Based Thresholding
by: Yeo, Jeong Hun, et al.
Published: (2026)

MMS-LLaMA: Efficient LLM-based Audio-Visual Speech Recognition with Minimal Multimodal Speech Tokens
by: Yeo, Jeong Hun, et al.
Published: (2025)

Scheduled Interleaved Speech-Text Training for Speech-to-Speech Translation with LLMs
by: Futami, Hayato, et al.
Published: (2025)

Chain-of-Thought Training for Open E2E Spoken Dialogue Systems
by: Arora, Siddhant, et al.
Published: (2025)

UniverSLU: Universal Spoken Language Understanding for Diverse Tasks with Natural Language Instructions
by: Arora, Siddhant, et al.
Published: (2023)

Improving Design of Input Condition Invariant Speech Enhancement
by: Zhang, Wangyou, et al.
Published: (2024)

Visual Speech Recognition for Languages with Limited Labeled Data using Automatic Labels from Whisper
by: Yeo, Jeong Hun, et al.
Published: (2023)

On the Evaluation of Speech Foundation Models for Spoken Language Understanding
by: Arora, Siddhant, et al.
Published: (2024)

Textless Unit-to-Unit training for Many-to-Many Multilingual Speech-to-Speech Translation
by: Kim, Minsu, et al.
Published: (2023)

Decoder-only Architecture for Streaming End-to-end Speech Recognition
by: Tsunoo, Emiru, et al.
Published: (2024)

Improving Audio Captioning Models with Fine-grained Audio Features, Text Embedding Supervision, and LLM Mix-up Augmentation
by: Wu, Shih-Lun, et al.
Published: (2023)

Towards Robust Speech Representation Learning for Thousands of Languages
by: Chen, William, et al.
Published: (2024)

Hypothesis Clustering and Merging: Novel MultiTalker Speech Recognition with Speaker Tokens
by: Kashiwagi, Yosuke, et al.
Published: (2024)

Spiralformer: Low Latency Encoder for Streaming Speech Recognition with Circular Layer Skipping and Early Exiting
by: Tsunoo, Emiru, et al.
Published: (2025)

Rapid Language Adaptation for Multilingual E2E Speech Recognition Using Encoder Prompting
by: Kashiwagi, Yosuke, et al.
Published: (2024)

Text-To-Speech Synthesis In The Wild
by: Jung, Jee-weon, et al.
Published: (2024)

Towards Inclusive Communication: A Unified Framework for Generating Spoken Language from Sign, Lip, and Audio
by: Yeo, Jeong Hun, et al.
Published: (2025)

Beyond Performance Plateaus: A Comprehensive Study on Scalability in Speech Enhancement
by: Zhang, Wangyou, et al.
Published: (2024)

Semi-Autoregressive Streaming ASR With Label Context
by: Arora, Siddhant, et al.
Published: (2023)

SpeechBERTScore: Reference-Aware Automatic Evaluation of Speech Generation Leveraging NLP Evaluation Metrics
by: Saeki, Takaaki, et al.
Published: (2024)

AV2AV: Direct Audio-Visual Speech to Audio-Visual Speech Translation with Unified Audio-Visual Speech Representation
by: Choi, Jeongsoo, et al.
Published: (2023)

Finding Task-specific Subnetworks in Multi-task Spoken Language Understanding Model
by: Futami, Hayato, et al.
Published: (2024)

Decoder-only Architecture for Speech Recognition with CTC Prompts and Text Data Augmentation
by: Tsunoo, Emiru, et al.
Published: (2023)

SpeechComposer: Unifying Multiple Speech Tasks with Prompt Composition
by: Wu, Yihan, et al.
Published: (2024)

Talking Turns: Benchmarking Audio Foundation Models on Turn-Taking Dynamics
by: Arora, Siddhant, et al.
Published: (2025)

Robust Audiovisual Speech Recognition Models with Mixture-of-Experts
by: Wu, Yihan, et al.
Published: (2024)

Personalized Lip Reading: Adapting to Your Unique Lip Movements with Vision and Language
by: Yeo, Jeong Hun, et al.
Published: (2024)

Where Visual Speech Meets Language: VSP-LLM Framework for Efficient and Context-Aware Visual Speech Processing
by: Yeo, Jeong Hun, et al.
Published: (2024)

OWSM-CTC: An Open Encoder-Only Speech Foundation Model for Speech Recognition, Translation, and Language Identification
by: Peng, Yifan, et al.
Published: (2024)

Generating Data with Text-to-Speech and Large-Language Models for Conversational Speech Recognition
by: Cornell, Samuele, et al.
Published: (2024)

Beyond Silence: Bias Analysis through Loss and Asymmetric Approach in Audio Anti-Spoofing
by: Shim, Hye-jin, et al.
Published: (2024)

SpoofCeleb: Speech Deepfake Detection and SASV In The Wild
by: Jung, Jee-weon, et al.
Published: (2024)

Text-driven Talking Face Synthesis by Reprogramming Audio-driven Models
by: Choi, Jeongsoo, et al.
Published: (2023)

Chain-of-Thought Reasoning in Streaming Full-Duplex End-to-End Spoken Dialogue Systems
by: Arora, Siddhant, et al.
Published: (2025)

Prompt Tuning of Deep Neural Networks for Speaker-adaptive Visual Speech Recognition
by: Kim, Minsu, et al.
Published: (2023)

Zero-AVSR: Zero-Shot Audio-Visual Speech Recognition with LLMs by Learning Language-Agnostic Speech Representations
by: Yeo, Jeong Hun, et al.
Published: (2025)