:: Library Catalog

Bibliographic Details
Main Authors:	Wang, Yuyue; Cheng, Xin; Wu, Yihan; Wang, Xihua; Tian, Jinchuan; Song, Ruihua
Format:	Preprint
Published:	2025
Subjects:	Multimedia
Online Access:	https://arxiv.org/abs/2511.22229

Similar Items

LoVA: Long-form Video-to-Audio Generation
by: Cheng, Xin, et al.
Published: (2024)

ChronusOmni: Improving Time Awareness of Omni Large Language Models
by: Chen, Yijing, et al.
Published: (2025)

SpeechComposer: Unifying Multiple Speech Tasks with Prompt Composition
by: Wu, Yihan, et al.
Published: (2024)

Human-Inspired Computing for Robust and Efficient Audio-Visual Speech Recognition
by: Liu, Qianhui, et al.
Published: (2024)

RAVSS: Robust Audio-Visual Speech Separation in Multi-Speaker Scenarios with Missing Visual Cues
by: Pan, Tianrui, et al.
Published: (2024)

SpeechEE: A Novel Benchmark for Speech Event Extraction
by: Wang, Bin, et al.
Published: (2024)

ELEGANCE: Efficient LLM Guidance for Audio-Visual Target Speech Extraction
by: Wu, Wenxuan, et al.
Published: (2025)

SpeechCraft: A Fine-grained Expressive Speech Dataset with Natural Language Description
by: Jin, Zeyu, et al.
Published: (2024)

DeepMoLM: Leveraging Visual and Geometric Structural Information for Molecule-Text Modeling
by: Lan, Jing, et al.
Published: (2026)

ARECHO: Autoregressive Evaluation via Chain-Based Hypothesis Optimization for Speech Multi-Metric Estimation
by: Shi, Jiatong, et al.
Published: (2025)

MMoFusion: Multi-modal Co-Speech Motion Generation with Diffusion Model
by: Wang, Sen, et al.
Published: (2024)

Controllable Text-to-Speech Synthesis with Masked-Autoencoded Style-Rich Representation
by: Wang, Yongqi, et al.
Published: (2025)

LCB-net: Long-Context Biasing for Audio-Visual Speech Recognition
by: Yu, Fan, et al.
Published: (2024)

AV2AV: Direct Audio-Visual Speech to Audio-Visual Speech Translation with Unified Audio-Visual Speech Representation
by: Choi, Jeongsoo, et al.
Published: (2023)

EyEar: Learning Audio Synchronized Human Gaze Trajectory Based on Physics-Informed Dynamics
by: Liu, Xiaochuan, et al.
Published: (2025)

Multi-Task Corrupted Prediction for Learning Robust Audio-Visual Speech Representation
by: Kim, Sungnyun, et al.
Published: (2025)

JointAVBench: A Benchmark for Joint Audio-Visual Reasoning Evaluation
by: Chao, Jianghan, et al.
Published: (2025)

SVLA: A Unified Speech-Vision-Language Assistant with Multimodal Reasoning and Speech Generation
by: Huynh, Ngoc Dung, et al.
Published: (2025)

Large Language Models are Strong Audio-Visual Speech Recognition Learners
by: Cappellazzo, Umberto, et al.
Published: (2024)

Incorporating Linguistic Constraints from External Knowledge Source for Audio-Visual Target Speech Extraction
by: Wu, Wenxuan, et al.
Published: (2025)

Bridging The Multi-Modality Gaps of Audio, Visual and Linguistic for Speech Enhancement
by: Lin, Meng-Ping, et al.
Published: (2025)

Multi-modal and Multi-scale Spatial Environment Understanding for Immersive Visual Text-to-Speech
by: Liu, Rui, et al.
Published: (2024)

AlignVSR: Audio-Visual Cross-Modal Alignment for Visual Speech Recognition
by: Liu, Zehua, et al.
Published: (2024)

VSSFlow: Unifying Video-conditioned Sound and Speech Generation via Joint Learning
by: Cheng, Xin, et al.
Published: (2025)

AVE Speech: A Comprehensive Multi-Modal Dataset for Speech Recognition Integrating Audio, Visual, and Electromyographic Signals
by: Zhou, Dongliang, et al.
Published: (2025)

VERSA: A Versatile Evaluation Toolkit for Speech, Audio, and Music
by: Shi, Jiatong, et al.
Published: (2024)

Audio-Visual Speech Separation via Bottleneck Iterative Network
by: Zhang, Sidong, et al.
Published: (2025)

Audio-Visual Speech Enhancement In Complex Scenarios With Separation And Dereverberation Joint Modeling
by: Du, Jiarong, et al.
Published: (2025)

UMETTS: A Unified Framework for Emotional Text-to-Speech Synthesis with Multimodal Prompts
by: Cheng, Zhi-Qi, et al.
Published: (2024)

VisAug: Facilitating Speech-Rich Web Video Navigation and Engagement with Auto-Generated Visual Augmentations
by: Zhao, Baoquan, et al.
Published: (2025)

LipGen: Viseme-Guided Lip Video Generation for Enhancing Visual Speech Recognition
by: Hao, Bowen, et al.
Published: (2025)

Incorporating Visual Experts to Resolve the Information Loss in Multimodal Large Language Models
by: He, Xin, et al.
Published: (2024)

Zero-AVSR: Zero-Shot Audio-Visual Speech Recognition with LLMs by Learning Language-Agnostic Speech Representations
by: Yeo, Jeong Hun, et al.
Published: (2025)

Purification Before Fusion: Toward Mask-Free Speech Enhancement for Robust Audio-Visual Speech Recognition
by: Wu, Linzhi, et al.
Published: (2026)

Robust LLM-based Audio-Visual Speech Recognition with Sparse Modality Alignment and Visual Unit-Guided Refinement
by: Su, Fei, et al.
Published: (2026)

STCTS: Generative Semantic Compression for Ultra-Low Bitrate Speech via Explicit Text-Prosody-Timbre Decomposition
by: Wang, Siyu, et al.
Published: (2025)

Improving Lip-synchrony in Direct Audio-Visual Speech-to-Speech Translation
by: Goncalves, Lucas, et al.
Published: (2024)

TeMTG: Text-Enhanced Multi-Hop Temporal Graph Modeling for Audio-Visual Video Parsing
by: Chen, Yaru, et al.
Published: (2025)

SlideAVSR: A Dataset of Paper Explanation Videos for Audio-Visual Speech Recognition
by: Wang, Hao, et al.
Published: (2024)

Towards Multimodal Empathetic Response Generation: A Rich Text-Speech-Vision Avatar-based Benchmark
by: Zhang, Han, et al.
Published: (2025)