| Main Authors: | Li, Kai; Xie, Fenghua; Chen, Hang; Yuan, Kexin; Hu, Xiaolin |
|---|---|
| Format: | Preprint |
| Published: | 2022 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2212.10744 |
Similar Items
Efficient Audio-Visual Speech Separation with Discrete Lip Semantics and Multi-Scale Global-Local Attention
by: Li, Kai, et al.
Published: (2025)
RTFS-Net: Recurrent Time-Frequency Modelling for Efficient Audio-Visual Speech Separation
by: Pegg, Samuel, et al.
Published: (2023)
IIANet: An Intra- and Inter-Modality Attention Network for Audio-Visual Speech Separation
by: Li, Kai, et al.
Published: (2023)
Audio-Visual Speech Enhancement In Complex Scenarios With Separation And Dereverberation Joint Modeling
by: Du, Jiarong, et al.
Published: (2025)
Robust Audio-Visual Segmentation via Audio-Guided Visual Convergent Alignment
by: Liu, Chen, et al.
Published: (2025)
AlignVSR: Audio-Visual Cross-Modal Alignment for Visual Speech Recognition
by: Liu, Zehua, et al.
Published: (2024)
Separate in the Speech Chain: Cross-Modal Conditional Audio-Visual Target Speech Extraction
by: Mu, Zhaoxi, et al.
Published: (2024)
Whisper-Flamingo: Integrating Visual Features into Whisper for Audio-Visual Speech Recognition and Translation
by: Rouditchenko, Andrew, et al.
Published: (2024)
Semantic Audio-Visual Navigation in Continuous Environments
by: Zeng, Yichen, et al.
Published: (2026)
Large Language Models are Strong Audio-Visual Speech Recognition Learners
by: Cappellazzo, Umberto, et al.
Published: (2024)
VAEmo: Efficient Representation Learning for Visual-Audio Emotion with Knowledge Injection
by: Cheng, Hao, et al.
Published: (2025)
Seeing Soundscapes: Audio-Visual Generation and Separation from Soundscapes Using Audio-Visual Separator
by: Kang, Minjae, et al.
Published: (2025)
MoME: Mixture of Matryoshka Experts for Audio-Visual Speech Recognition
by: Cappellazzo, Umberto, et al.
Published: (2025)
Collaborative Hybrid Propagator for Temporal Misalignment in Audio-Visual Segmentation
by: Li, Kexin, et al.
Published: (2024)
Seeing Speech and Sound: Distinguishing and Locating Audios in Visual Scenes
by: Ryu, Hyeonggon, et al.
Published: (2025)
Continual Audio-Visual Sound Separation
by: Pian, Weiguo, et al.
Published: (2024)
A Systematic Study of Cross-Modal Typographic Attacks on Audio-Visual Reasoning
by: Chen, Tianle, et al.
Published: (2026)
Dynamic Derivation and Elimination: Audio Visual Segmentation with Enhanced Audio Semantics
by: Liu, Chen, et al.
Published: (2025)
Speech Audio Generation from dynamic MRI via a Knowledge Enhanced Conditional Variational Autoencoder
by: Li, Yaxuan, et al.
Published: (2025)
mWhisper-Flamingo for Multilingual Audio-Visual Noise-Robust Speech Recognition
by: Rouditchenko, Andrew, et al.
Published: (2025)
Mitigating Attention Sinks and Massive Activations in Audio-Visual Speech Recognition with LLMs
by: Anand, et al.
Published: (2025)
A Study of Dropout-Induced Modality Bias on Robustness to Missing Video Frames for Audio-Visual Speech Recognition
by: Dai, Yusheng, et al.
Published: (2024)
DuoGesture: Neuro-Inspired and Biomechanically Informed Dual-Stream Co-Speech Gesture Generation
by: Paar, Ferdinand, et al.
Published: (2026)
AV-RIR: Audio-Visual Room Impulse Response Estimation
by: Ratnarajah, Anton, et al.
Published: (2023)
Audio-Visual World Models: Towards Multisensory Imagination in Sight and Sound
by: Wang, Jiahua, et al.
Published: (2025)
High-Quality Sound Separation Across Diverse Categories via Visually-Guided Generative Modeling
by: Huang, Chao, et al.
Published: (2025)
Audio-Guided Visual Perception for Audio-Visual Navigation
by: Wang, Yi, et al.
Published: (2025)
Semantic Noise Reduction via Teacher-Guided Dual-Path Audio-Visual Representation Learning
by: Wang, Linge, et al.
Published: (2026)
TDFNet: An Efficient Audio-Visual Speech Separation Model with Top-down Fusion
by: Pegg, Samuel, et al.
Published: (2024)
OmniForcing: Unleashing Real-time Joint Audio-Visual Generation
by: Su, Yaofeng, et al.
Published: (2026)
RT-LA-VocE: Real-Time Low-SNR Audio-Visual Speech Enhancement
by: Chen, Honglie, et al.
Published: (2024)
Bootstrapping Audio-Visual Segmentation by Strengthening Audio Cues
by: Chen, Tianxiang, et al.
Published: (2024)
video-SALMONN 2: Caption-Enhanced Audio-Visual Large Language Models
by: Tang, Changli, et al.
Published: (2025)
Rethinking Audio-Visual Adversarial Vulnerability from Temporal and Modality Perspectives
by: Zhang, Zeliang, et al.
Published: (2025)
WAVE: Learning Unified & Versatile Audio-Visual Embeddings with Multimodal LLM
by: Tang, Changli, et al.
Published: (2025)
Improving Lip-synchrony in Direct Audio-Visual Speech-to-Speech Translation
by: Goncalves, Lucas, et al.
Published: (2024)
JavisDiT++: Unified Modeling and Optimization for Joint Audio-Video Generation
by: Liu, Kai, et al.
Published: (2026)
Schrodinger Audio-Visual Editor: Object-Level Audiovisual Removal
by: Xu, Weihan, et al.
Published: (2025)
Adaptive Audio-Visual Speech Recognition via Matryoshka-Based Multimodal LLMs
by: Cappellazzo, Umberto, et al.
Published: (2025)
Aligned Better, Listen Better for Audio-Visual Large Language Models
by: Guo, Yuxin, et al.
Published: (2025)