| Main Authors: | Cheng, Hao, Zhao, Zhiwei, He, Yichao, Hu, Zhenzhen, Li, Jia, Wang, Meng, Hong, Richang |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2505.02331 |
Similar Items
Semantic Noise Reduction via Teacher-Guided Dual-Path Audio-Visual Representation Learning
by: Wang, Linge, et al.
Published: (2026)
Hierarchical Semantic Correlation-Aware Masked Autoencoder for Unsupervised Audio-Visual Representation Learning
by: Zeng, Donghuo, et al.
Published: (2026)
Efficient Audio-Visual Speech Separation with Discrete Lip Semantics and Multi-Scale Global-Local Attention
by: Li, Kai, et al.
Published: (2025)
Semantic Audio-Visual Navigation in Continuous Environments
by: Zeng, Yichen, et al.
Published: (2026)
MoLT: Mixture of Layer-Wise Tokens for Efficient Audio-Visual Learning
by: Rho, Kyeongha, et al.
Published: (2025)
Grid Jigsaw Representation with CLIP: A New Perspective on Image Clustering
by: Song, Zijie, et al.
Published: (2023)
Robust Audio-Visual Segmentation via Audio-Guided Visual Convergent Alignment
by: Liu, Chen, et al.
Published: (2025)
Leveraging Audio Representations for Vibration-Based Crowd Monitoring in Stadiums
by: Chang, Yen Cheng, et al.
Published: (2025)
READ-Net: Clarifying Emotional Ambiguity via Adaptive Feature Recalibration for Audio-Visual Depression Detection
by: Chen, Chenglizhao, et al.
Published: (2026)
Learning Self-Supervised Audio-Visual Representations for Sound Recommendations
by: Krishnamurthy, Sudha
Published: (2024)
DETECLAP: Enhancing Audio-Visual Representation Learning with Object Information
by: Nakada, Shota, et al.
Published: (2024)
WAVE: Learning Unified & Versatile Audio-Visual Embeddings with Multimodal LLM
by: Tang, Changli, et al.
Published: (2025)
An Audio-Visual Speech Separation Model Inspired by Cortico-Thalamo-Cortical Circuits
by: Li, Kai, et al.
Published: (2022)
RTFS-Net: Recurrent Time-Frequency Modelling for Efficient Audio-Visual Speech Separation
by: Pegg, Samuel, et al.
Published: (2023)
Enriching Multimodal Sentiment Analysis through Textual Emotional Descriptions of Visual-Audio Content
by: Wu, Sheng, et al.
Published: (2024)
ControlFoley: Unified and Controllable Video-to-Audio Generation with Cross-Modal Conflict Handling
by: Yang, Jianxuan, et al.
Published: (2026)
CoGenAV: Versatile Audio-Visual Representation Learning via Contrastive-Generative Synchronization
by: Bai, Detao, et al.
Published: (2025)
CLAIP-Emo: Parameter-Efficient Adaptation of Language-supervised models for In-the-Wild Audiovisual Emotion Recognition
by: Chen, Yin, et al.
Published: (2025)
Seeing is Believing? Enhancing Vision-Language Navigation using Visual Perturbations
by: Zhang, Xuesong, et al.
Published: (2024)
AV-RIR: Audio-Visual Room Impulse Response Estimation
by: Ratnarajah, Anton, et al.
Published: (2023)
Schrodinger Audio-Visual Editor: Object-Level Audiovisual Removal
by: Xu, Weihan, et al.
Published: (2025)
Audio-Guided Visual Perception for Audio-Visual Navigation
by: Wang, Yi, et al.
Published: (2025)
CCStereo: Audio-Visual Contextual and Contrastive Learning for Binaural Audio Generation
by: Chen, Yuanhong, et al.
Published: (2025)
Cross-Modal Binary Attention: An Energy-Efficient Fusion Framework for Audio-Visual Learning
by: Saleh, Mohamed, et al.
Published: (2026)
Rethinking Audio-Visual Adversarial Vulnerability from Temporal and Modality Perspectives
by: Zhang, Zeliang, et al.
Published: (2025)
TARO: Timestep-Adaptive Representation Alignment with Onset-Aware Conditioning for Synchronized Video-to-Audio Synthesis
by: Ton, Tri, et al.
Published: (2025)
Global-Local Distillation Network-Based Audio-Visual Speaker Tracking with Incomplete Modalities
by: Li, Yidi, et al.
Published: (2024)
MultiSoundGen: Video-to-Audio Generation for Multi-Event Scenarios via SlowFast Contrastive Audio-Visual Pretraining and Direct Preference Optimization
by: Yang, Jianxuan, et al.
Published: (2025)
A Systematic Study of Cross-Modal Typographic Attacks on Audio-Visual Reasoning
by: Chen, Tianle, et al.
Published: (2026)
Cross Pseudo-Labeling for Semi-Supervised Audio-Visual Source Localization
by: Guo, Yuxin, et al.
Published: (2024)
Cross Attentional Audio-Visual Fusion for Dimensional Emotion Recognition
by: Praveen, R. Gnana, et al.
Published: (2021)
AV-SUPERB: A Multi-Task Evaluation Benchmark for Audio-Visual Representation Models
by: Tseng, Yuan, et al.
Published: (2023)
AV-Edit: Multimodal Generative Sound Effect Editing via Audio-Visual Semantic Joint Control
by: Guo, Xinyue, et al.
Published: (2025)
Bootstrapping Audio-Visual Segmentation by Strengthening Audio Cues
by: Chen, Tianxiang, et al.
Published: (2024)
Speech Audio Generation from dynamic MRI via a Knowledge Enhanced Conditional Variational Autoencoder
by: Li, Yaxuan, et al.
Published: (2025)
SeeingSounds: Learning Audio-to-Visual Alignment via Text
by: Carnemolla, Simone, et al.
Published: (2025)
Zero-AVSR: Zero-Shot Audio-Visual Speech Recognition with LLMs by Learning Language-Agnostic Speech Representations
by: Yeo, Jeong Hun, et al.
Published: (2025)
AVRT: Audio-Visual Reasoning Transfer through Single-Modality Teachers
by: Araujo, Edson, et al.
Published: (2026)
Audio-Visual World Models: Towards Multisensory Imagination in Sight and Sound
by: Wang, Jiahua, et al.
Published: (2025)
OmniForcing: Unleashing Real-time Joint Audio-Visual Generation
by: Su, Yaofeng, et al.
Published: (2026)