| Main Authors: | Wang, Yuyue, Cheng, Xin, Wu, Yihan, Wang, Xihua, Tian, Jinchuan, Song, Ruihua |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2511.22229 |
Similar Items
LoVA: Long-form Video-to-Audio Generation
by: Cheng, Xin, et al.
Published: (2024)
ChronusOmni: Improving Time Awareness of Omni Large Language Models
by: Chen, Yijing, et al.
Published: (2025)
SpeechComposer: Unifying Multiple Speech Tasks with Prompt Composition
by: Wu, Yihan, et al.
Published: (2024)
Human-Inspired Computing for Robust and Efficient Audio-Visual Speech Recognition
by: Liu, Qianhui, et al.
Published: (2024)
RAVSS: Robust Audio-Visual Speech Separation in Multi-Speaker Scenarios with Missing Visual Cues
by: Pan, Tianrui, et al.
Published: (2024)
SpeechEE: A Novel Benchmark for Speech Event Extraction
by: Wang, Bin, et al.
Published: (2024)
ELEGANCE: Efficient LLM Guidance for Audio-Visual Target Speech Extraction
by: Wu, Wenxuan, et al.
Published: (2025)
SpeechCraft: A Fine-grained Expressive Speech Dataset with Natural Language Description
by: Jin, Zeyu, et al.
Published: (2024)
DeepMoLM: Leveraging Visual and Geometric Structural Information for Molecule-Text Modeling
by: Lan, Jing, et al.
Published: (2026)
ARECHO: Autoregressive Evaluation via Chain-Based Hypothesis Optimization for Speech Multi-Metric Estimation
by: Shi, Jiatong, et al.
Published: (2025)
MMoFusion: Multi-modal Co-Speech Motion Generation with Diffusion Model
by: Wang, Sen, et al.
Published: (2024)
Controllable Text-to-Speech Synthesis with Masked-Autoencoded Style-Rich Representation
by: Wang, Yongqi, et al.
Published: (2025)
LCB-net: Long-Context Biasing for Audio-Visual Speech Recognition
by: Yu, Fan, et al.
Published: (2024)
AV2AV: Direct Audio-Visual Speech to Audio-Visual Speech Translation with Unified Audio-Visual Speech Representation
by: Choi, Jeongsoo, et al.
Published: (2023)
EyEar: Learning Audio Synchronized Human Gaze Trajectory Based on Physics-Informed Dynamics
by: Liu, Xiaochuan, et al.
Published: (2025)
Multi-Task Corrupted Prediction for Learning Robust Audio-Visual Speech Representation
by: Kim, Sungnyun, et al.
Published: (2025)
JointAVBench: A Benchmark for Joint Audio-Visual Reasoning Evaluation
by: Chao, Jianghan, et al.
Published: (2025)
SVLA: A Unified Speech-Vision-Language Assistant with Multimodal Reasoning and Speech Generation
by: Huynh, Ngoc Dung, et al.
Published: (2025)
Large Language Models are Strong Audio-Visual Speech Recognition Learners
by: Cappellazzo, Umberto, et al.
Published: (2024)
Incorporating Linguistic Constraints from External Knowledge Source for Audio-Visual Target Speech Extraction
by: Wu, Wenxuan, et al.
Published: (2025)
Bridging The Multi-Modality Gaps of Audio, Visual and Linguistic for Speech Enhancement
by: Lin, Meng-Ping, et al.
Published: (2025)
Multi-modal and Multi-scale Spatial Environment Understanding for Immersive Visual Text-to-Speech
by: Liu, Rui, et al.
Published: (2024)
AlignVSR: Audio-Visual Cross-Modal Alignment for Visual Speech Recognition
by: Liu, Zehua, et al.
Published: (2024)
VSSFlow: Unifying Video-conditioned Sound and Speech Generation via Joint Learning
by: Cheng, Xin, et al.
Published: (2025)
AVE Speech: A Comprehensive Multi-Modal Dataset for Speech Recognition Integrating Audio, Visual, and Electromyographic Signals
by: Zhou, Dongliang, et al.
Published: (2025)
VERSA: A Versatile Evaluation Toolkit for Speech, Audio, and Music
by: Shi, Jiatong, et al.
Published: (2024)
Audio-Visual Speech Separation via Bottleneck Iterative Network
by: Zhang, Sidong, et al.
Published: (2025)
Audio-Visual Speech Enhancement In Complex Scenarios With Separation And Dereverberation Joint Modeling
by: Du, Jiarong, et al.
Published: (2025)
UMETTS: A Unified Framework for Emotional Text-to-Speech Synthesis with Multimodal Prompts
by: Cheng, Zhi-Qi, et al.
Published: (2024)
VisAug: Facilitating Speech-Rich Web Video Navigation and Engagement with Auto-Generated Visual Augmentations
by: Zhao, Baoquan, et al.
Published: (2025)
LipGen: Viseme-Guided Lip Video Generation for Enhancing Visual Speech Recognition
by: Hao, Bowen, et al.
Published: (2025)
Incorporating Visual Experts to Resolve the Information Loss in Multimodal Large Language Models
by: He, Xin, et al.
Published: (2024)
Zero-AVSR: Zero-Shot Audio-Visual Speech Recognition with LLMs by Learning Language-Agnostic Speech Representations
by: Yeo, Jeong Hun, et al.
Published: (2025)
Purification Before Fusion: Toward Mask-Free Speech Enhancement for Robust Audio-Visual Speech Recognition
by: Wu, Linzhi, et al.
Published: (2026)
Robust LLM-based Audio-Visual Speech Recognition with Sparse Modality Alignment and Visual Unit-Guided Refinement
by: Su, Fei, et al.
Published: (2026)
STCTS: Generative Semantic Compression for Ultra-Low Bitrate Speech via Explicit Text-Prosody-Timbre Decomposition
by: Wang, Siyu, et al.
Published: (2025)
Improving Lip-synchrony in Direct Audio-Visual Speech-to-Speech Translation
by: Goncalves, Lucas, et al.
Published: (2024)
TeMTG: Text-Enhanced Multi-Hop Temporal Graph Modeling for Audio-Visual Video Parsing
by: Chen, Yaru, et al.
Published: (2025)
SlideAVSR: A Dataset of Paper Explanation Videos for Audio-Visual Speech Recognition
by: Wang, Hao, et al.
Published: (2024)
Towards Multimodal Empathetic Response Generation: A Rich Text-Speech-Vision Avatar-based Benchmark
by: Zhang, Han, et al.
Published: (2025)