| Main Authors: | Xie, Liuyue, Kuthiala, Avik, Wei, George Z., Zheng, Ce, Bal, Ananya, Dabhi, Mosam, Wen, Liting, Rustagi, Taru, Lai, Ethan, Khyalia, Sushil, Choudhury, Rohan, Ziyadi, Morteza, Zhang, Xu, Yang, Hao, Jeni, László A. |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2503.21699 |
Similar Items
Unified Spherical Frontend: Learning Rotation-Equivariant Representations of Spherical Images from Any Camera
by: Yu, Mukai, et al.
Published: (2025)
3D-LFM: Lifting Foundation Model
by: Dabhi, Mosam, et al.
Published: (2023)
MusiCRS: Benchmarking Audio-Centric Conversational Recommendation
by: Surana, Rohan, et al.
Published: (2025)
Teacher-Guided Pseudo Supervision and Cross-Modal Alignment for Audio-Visual Video Parsing
by: Chen, Yaru, et al.
Published: (2025)
MaskAnyone Toolkit: Offering Strategies for Minimizing Privacy Risks and Maximizing Utility in Audio-Visual Data Archiving
by: Owoyele, Babajide Alamu, et al.
Published: (2024)
Audio Matters Too! Enhancing Markerless Motion Capture with Audio Signals for String Performance Capture
by: Jin, Yitong, et al.
Published: (2024)
Exploring the Role of Audio in Multimodal Misinformation Detection
by: Liu, Moyang, et al.
Published: (2024)
Generative Audio Extension and Morphing
by: Seetharaman, Prem, et al.
Published: (2026)
EMID: An Emotional Aligned Dataset in Audio-Visual Modality
by: Zou, Jialing, et al.
Published: (2023)
MixFake: Benchmarking and Enhancing Audio Deepfake Detection in Diverse Real-world Mixed Audio
by: Li, Qingcao, et al.
Published: (2026)
Multimodal Semantic Communication for Generative Audio-Driven Video Conferencing
by: Tong, Haonan, et al.
Published: (2024)
Enhancing Video Music Recommendation with Transformer-Driven Audio-Visual Embeddings
by: Liu, Shimiao, et al.
Published: (2025)
Look, Listen and Segment: Towards Weakly Supervised Audio-visual Semantic Segmentation
by: Li, Chengzhi, et al.
Published: (2026)
Cross-Modality and Within-Modality Regularization for Audio-Visual DeepFake Detection
by: Zou, Heqing, et al.
Published: (2024)
MusicSem: A Semantically Rich Language–Audio Dataset of Natural Music Descriptions
by: Salganik, Rebecca, et al.
Published: (2026)
Generalizing Video DeepFake Detection by Self-generated Audio-Visual Pseudo-Fakes
by: Wei, Zihe, et al.
Published: (2026)
MAR3: Multi-Agent Recognition, Reasoning, and Reflection for Reference Audio-Visual Segmentation
by: Zhao, Yuan, et al.
Published: (2026)
Audio-Visual Separation with Hierarchical Fusion and Representation Alignment
by: Hu, Han, et al.
Published: (2025)
EyEar: Learning Audio Synchronized Human Gaze Trajectory Based on Physics-Informed Dynamics
by: Liu, Xiaochuan, et al.
Published: (2025)
Ges-QA: A Multidimensional Quality Assessment Dataset for Audio-to-3D Gesture Generation
by: Gao, Zhilin, et al.
Published: (2025)
Manipulated Regions Localization For Partially Deepfake Audio: A Survey
by: He, Jiayi, et al.
Published: (2025)
TeMTG: Text-Enhanced Multi-Hop Temporal Graph Modeling for Audio-Visual Video Parsing
by: Chen, Yaru, et al.
Published: (2025)
AVID: A Benchmark for Omni-Modal Audio-Visual Inconsistency Understanding via Agent-Driven Construction
by: Chen, Zixuan, et al.
Published: (2026)
CueNet: Robust Audio-Visual Speaker Extraction through Cross-Modal Cue Mining and Interaction
by: Wang, Jiadong, et al.
Published: (2026)
Transformer-based Video Saliency Prediction with High Temporal Dimension Decoding
by: Moradi, Morteza, et al.
Published: (2024)
Delayed Commitment for Representation Readiness in Stage-wise Audio-Visual Learning
by: Xu, Xinmeng, et al.
Published: (2026)
DreamFoley: Scalable VLMs for High-Fidelity Video-to-Audio Generation
by: Li, Fu, et al.
Published: (2025)
StyleSpeaker: Audio-Enhanced Fine-Grained Style Modeling for Speech-Driven 3D Facial Animation
by: Yang, An, et al.
Published: (2025)
Open-Vocabulary Audio-Visual Semantic Segmentation
by: Guo, Ruohao, et al.
Published: (2024)
Towards Open-Vocabulary Video Semantic Segmentation
by: Li, Xinhao, et al.
Published: (2024)
Coherent Audio-Visual Editing via Conditional Audio Generation Following Video Edits
by: Ishii, Masato, et al.
Published: (2025)
Interpreting Multimodal Communication at Scale in Short-Form Video: Visual, Audio, and Textual Mental Health Discourse on TikTok
by: Zha, Mingyue, et al.
Published: (2026)
MMED: A Multimodal Micro-Expression Dataset based on Audio-Visual Fusion
by: Wang, Junbo, et al.
Published: (2025)
Listen, Pause, and Reason: Toward Perception-Grounded Hybrid Reasoning for Audio Understanding
by: Wang, Jieyi, et al.
Published: (2026)
Variable-Length Audio Fingerprinting
by: Chen, Hongjie, et al.
Published: (2026)
AFL-Net: Integrating Audio, Facial, and Lip Modalities with a Two-step Cross-attention for Robust Speaker Diarization in the Wild
by: Yin, Yongkang, et al.
Published: (2023)
LSTMSE-Net: Long Short Term Speech Enhancement Network for Audio-visual Speech Enhancement
by: Jain, Arnav, et al.
Published: (2024)
Fast Audio Codec Identification Using Overlapping LCS
by: Jafari, Farzane
Published: (2025)
Quantifying and Enhancing Multi-modal Robustness with Modality Preference
by: Yang, Zequn, et al.
Published: (2024)
SpotSound: Enhancing Large Audio-Language Models with Fine-Grained Temporal Grounding
by: Sun, Luoyi, et al.
Published: (2026)