Saved in:
| Main Authors: | Qian, Xinyuan, Gao, Jiaran, Zhang, Yaodan, Zhang, Qiquan, Liu, Hexin, Garcia, Leibny Paola, Li, Haizhou |
|---|---|
| Format: | Preprint |
| Published: |
2024
|
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2411.07751 |
| Tags: |
Add Tag
No Tags, Be the first to tag this record!
|
Similar Items
Human-Inspired Computing for Robust and Efficient Audio-Visual Speech Recognition
by: Liu, Qianhui, et al.
Published: (2024)
by: Liu, Qianhui, et al.
Published: (2024)
Listening and Seeing Again: Generative Error Correction for Audio-Visual Speech Recognition
by: Liu, Rui, et al.
Published: (2025)
by: Liu, Rui, et al.
Published: (2025)
ELEGANCE: Efficient LLM Guidance for Audio-Visual Target Speech Extraction
by: Wu, Wenxuan, et al.
Published: (2025)
by: Wu, Wenxuan, et al.
Published: (2025)
Audio-Visual Speech Separation via Bottleneck Iterative Network
by: Zhang, Sidong, et al.
Published: (2025)
by: Zhang, Sidong, et al.
Published: (2025)
LCB-net: Long-Context Biasing for Audio-Visual Speech Recognition
by: Yu, Fan, et al.
Published: (2024)
by: Yu, Fan, et al.
Published: (2024)
Audio-Visual Speaker Tracking: Progress, Challenges, and Future Directions
by: Zhao, Jinzheng, et al.
Published: (2023)
by: Zhao, Jinzheng, et al.
Published: (2023)
Incorporating Linguistic Constraints from External Knowledge Source for Audio-Visual Target Speech Extraction
by: Wu, Wenxuan, et al.
Published: (2025)
by: Wu, Wenxuan, et al.
Published: (2025)
Efficient Video to Audio Mapper with Visual Scene Detection
by: Yi, Mingjing, et al.
Published: (2024)
by: Yi, Mingjing, et al.
Published: (2024)
IML-Spikeformer: Input-aware Multi-Level Spiking Transformer for Speech Processing
by: Song, Zeyang, et al.
Published: (2025)
by: Song, Zeyang, et al.
Published: (2025)
$C^2$AV-TSE: Context and Confidence-aware Audio Visual Target Speaker Extraction
by: Wu, Wenxuan, et al.
Published: (2025)
by: Wu, Wenxuan, et al.
Published: (2025)
Bridging The Multi-Modality Gaps of Audio, Visual and Linguistic for Speech Enhancement
by: Lin, Meng-Ping, et al.
Published: (2025)
by: Lin, Meng-Ping, et al.
Published: (2025)
RAVSS: Robust Audio-Visual Speech Separation in Multi-Speaker Scenarios with Missing Visual Cues
by: Pan, Tianrui, et al.
Published: (2024)
by: Pan, Tianrui, et al.
Published: (2024)
Beyond Video-to-SFX: Video to Audio Synthesis with Environmentally Aware Speech
by: Niu, Xinlei, et al.
Published: (2025)
by: Niu, Xinlei, et al.
Published: (2025)
Robust LLM-based Audio-Visual Speech Recognition with Sparse Modality Alignment and Visual Unit-Guided Refinement
by: Su, Fei, et al.
Published: (2026)
by: Su, Fei, et al.
Published: (2026)
LSTMSE-Net: Long Short Term Speech Enhancement Network for Audio-visual Speech Enhancement
by: Jain, Arnav, et al.
Published: (2024)
by: Jain, Arnav, et al.
Published: (2024)
Mamba in Speech: Towards an Alternative to Self-Attention
by: Zhang, Xiangyu, et al.
Published: (2024)
by: Zhang, Xiangyu, et al.
Published: (2024)
VERSA: A Versatile Evaluation Toolkit for Speech, Audio, and Music
by: Shi, Jiatong, et al.
Published: (2024)
by: Shi, Jiatong, et al.
Published: (2024)
Plug-and-Steer: Decoupling Separation and Selection in Audio-Visual Target Speaker Extraction
by: Kwak, Doyeop, et al.
Published: (2026)
by: Kwak, Doyeop, et al.
Published: (2026)
Cinematic Audio Source Separation Using Visual Cues
by: Zhang, Kang, et al.
Published: (2026)
by: Zhang, Kang, et al.
Published: (2026)
Low-latency Speech Enhancement via Speech Token Generation
by: Xue, Huaying, et al.
Published: (2023)
by: Xue, Huaying, et al.
Published: (2023)
Purification Before Fusion: Toward Mask-Free Speech Enhancement for Robust Audio-Visual Speech Recognition
by: Wu, Linzhi, et al.
Published: (2026)
by: Wu, Linzhi, et al.
Published: (2026)
SONIQUE: Video Background Music Generation Using Unpaired Audio-Visual Data
by: Zhang, Liqian, et al.
Published: (2024)
by: Zhang, Liqian, et al.
Published: (2024)
SHMamba: Structured Hyperbolic State Space Model for Audio-Visual Question Answering
by: Yang, Zhe, et al.
Published: (2024)
by: Yang, Zhe, et al.
Published: (2024)
Rhythmic Foley: A Framework For Seamless Audio-Visual Alignment In Video-to-Audio Synthesis
by: Huang, Zhiqi, et al.
Published: (2024)
by: Huang, Zhiqi, et al.
Published: (2024)
Building Audio-Visual Digital Twins with Smartphones
by: Lan, Zitong, et al.
Published: (2025)
by: Lan, Zitong, et al.
Published: (2025)
CoAVT: A Cognition-Inspired Unified Audio-Visual-Text Pre-Training Model for Multimodal Processing
by: Yue, Xianghu, et al.
Published: (2024)
by: Yue, Xianghu, et al.
Published: (2024)
Sonic4D: Spatial Audio Generation for Immersive 4D Scene Exploration
by: Xie, Siyi, et al.
Published: (2025)
by: Xie, Siyi, et al.
Published: (2025)
pTSE-T: Presentation Target Speaker Extraction using Unaligned Text Cues
by: Jiang, Ziyang, et al.
Published: (2024)
by: Jiang, Ziyang, et al.
Published: (2024)
Audio-Visual Speech Enhancement In Complex Scenarios With Separation And Dereverberation Joint Modeling
by: Du, Jiarong, et al.
Published: (2025)
by: Du, Jiarong, et al.
Published: (2025)
DualDub: Video-to-Soundtrack Generation via Joint Speech and Background Audio Synthesis
by: Tian, Wenjie, et al.
Published: (2025)
by: Tian, Wenjie, et al.
Published: (2025)
Zero-Shot Fake Video Detection by Audio-Visual Consistency
by: Li, Xiaolou, et al.
Published: (2024)
by: Li, Xiaolou, et al.
Published: (2024)
Sound-VECaps: Improving Audio Generation with Visual Enhanced Captions
by: Yuan, Yi, et al.
Published: (2024)
by: Yuan, Yi, et al.
Published: (2024)
Representation Learning for Semantic Alignment of Language, Audio, and Visual Modalities
by: Sudarsanam, Parthasaarathy, et al.
Published: (2025)
by: Sudarsanam, Parthasaarathy, et al.
Published: (2025)
AVE Speech: A Comprehensive Multi-Modal Dataset for Speech Recognition Integrating Audio, Visual, and Electromyographic Signals
by: Zhou, Dongliang, et al.
Published: (2025)
by: Zhou, Dongliang, et al.
Published: (2025)
Improving Speech Enhancement by Integrating Inter-Channel and Band Features with Dual-branch Conformer
by: Li, Jizhen, et al.
Published: (2024)
by: Li, Jizhen, et al.
Published: (2024)
Quality-Aware End-to-End Audio-Visual Neural Speaker Diarization
by: He, Mao-Kui, et al.
Published: (2024)
by: He, Mao-Kui, et al.
Published: (2024)
AD-AVSR: Asymmetric Dual-stream Enhancement for Robust Audio-Visual Speech Recognition
by: Xue, Junxiao, et al.
Published: (2025)
by: Xue, Junxiao, et al.
Published: (2025)
Speech Separation with Pretrained Frontend to Minimize Domain Mismatch
by: Wang, Wupeng, et al.
Published: (2024)
by: Wang, Wupeng, et al.
Published: (2024)
Multi-Task Corrupted Prediction for Learning Robust Audio-Visual Speech Representation
by: Kim, Sungnyun, et al.
Published: (2025)
by: Kim, Sungnyun, et al.
Published: (2025)
Multimodal Emotion Recognition from Raw Audio with Sinc-convolution
by: Zhang, Xiaohui, et al.
Published: (2024)
by: Zhang, Xiaohui, et al.
Published: (2024)
Similar Items
-
Human-Inspired Computing for Robust and Efficient Audio-Visual Speech Recognition
by: Liu, Qianhui, et al.
Published: (2024) -
Listening and Seeing Again: Generative Error Correction for Audio-Visual Speech Recognition
by: Liu, Rui, et al.
Published: (2025) -
ELEGANCE: Efficient LLM Guidance for Audio-Visual Target Speech Extraction
by: Wu, Wenxuan, et al.
Published: (2025) -
Audio-Visual Speech Separation via Bottleneck Iterative Network
by: Zhang, Sidong, et al.
Published: (2025) -
LCB-net: Long-Context Biasing for Audio-Visual Speech Recognition
by: Yu, Fan, et al.
Published: (2024)