| Main Authors: | Chen, Yu, Zhu, Hongxu, Wang, Jiadong, Chen, Kainan, Qian, Xinyuan |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2507.07384 |
Similar Items
Audio-Visual Target Speaker Extraction with Reverse Selective Auditory Attention
by: Tao, Ruijie, et al.
Published: (2024)
Exploiting Audio-Visual Features with Pretrained AV-HuBERT for Multi-Modal Dysarthric Speech Reconstruction
by: Chen, Xueyuan, et al.
Published: (2024)
$C^2$AV-TSE: Context and Confidence-aware Audio Visual Target Speaker Extraction
by: Wu, Wenxuan, et al.
Published: (2025)
Region-Specific Audio Tagging for Spatial Sound
by: Zhao, Jinzheng, et al.
Published: (2025)
Analytic Class Incremental Learning for Sound Source Localization with Privacy Protection
by: Qian, Xinyuan, et al.
Published: (2024)
Representation Learning for Semantic Alignment of Language, Audio, and Visual Modalities
by: Sudarsanam, Parthasaarathy, et al.
Published: (2025)
Human-Inspired Computing for Robust and Efficient Audio-Visual Speech Recognition
by: Liu, Qianhui, et al.
Published: (2024)
STA-V2A: Video-to-Audio Generation with Semantic and Temporal Alignment
by: Ren, Yong, et al.
Published: (2024)
Audio-Visual Speaker Tracking: Progress, Challenges, and Future Directions
by: Zhao, Jinzheng, et al.
Published: (2023)
DDAVS: Disentangled Audio Semantics and Delayed Bidirectional Alignment for Audio-Visual Segmentation
by: Tian, Jingqi, et al.
Published: (2025)
SemanticAudio: Audio Generation and Editing in Semantic Space
by: Dai, Zheqi, et al.
Published: (2026)
VidMusician: Video-to-Music Generation with Semantic-Rhythmic Alignment via Hierarchical Visual Features
by: Li, Sifei, et al.
Published: (2024)
AudioSpa: Spatializing Sound Events with Text
by: Feng, Linfeng, et al.
Published: (2025)
Video-to-Audio Generation with Fine-grained Temporal Semantics
by: Hu, Yuchen, et al.
Published: (2024)
Rhythmic Foley: A Framework For Seamless Audio-Visual Alignment In Video-to-Audio Synthesis
by: Huang, Zhiqi, et al.
Published: (2024)
Audio Spatially-Guided Fusion for Audio-Visual Navigation
by: Zhou, Xinyu, et al.
Published: (2026)
ASAudio: A Survey of Advanced Spatial Audio Research
by: Zhu, Zhiyuan, et al.
Published: (2025)
Sonic4D: Spatial Audio Generation for Immersive 4D Scene Exploration
by: Xie, Siyi, et al.
Published: (2025)
Unified Audio Event Detection
by: Jiang, Yidi, et al.
Published: (2024)
Semantic Proximity Alignment: Towards Human Perception-consistent Audio Tagging by Aligning with Label Text Description
by: Liu, Wuyang, et al.
Published: (2023)
Can Large Language Models Understand Spatial Audio?
by: Tang, Changli, et al.
Published: (2024)
Universal Spatial Audio Transcoder
by: Sagasti, Amaia, et al.
Published: (2024)
pyAMPACT: A Score-Audio Alignment Toolkit for Performance Data Estimation and Multi-modal Processing
by: Devaney, Johanna, et al.
Published: (2024)
SmoothCLAP: Soft-Target Enhanced Contrastive Language-Audio Pretraining for Affective Computing
by: Jing, Xin, et al.
Published: (2026)
AV-SUPERB: A Multi-Task Evaluation Benchmark for Audio-Visual Representation Models
by: Tseng, Yuan, et al.
Published: (2023)
I2TTS: Image-indicated Immersive Text-to-speech Synthesis with Spatial Perception
by: Zhang, Jiawei, et al.
Published: (2024)
AV-DTEC: Self-Supervised Audio-Visual Fusion for Drone Trajectory Estimation and Classification
by: Xiao, Zhenyuan, et al.
Published: (2024)
Online Single-Channel Audio-Based Sound Speed Estimation for Robust Multi-Channel Audio Control
by: Fuglsig, Andreas Jonas, et al.
Published: (2026)
Quantifying Spatial Audio Quality Impairment
by: Watcharasupat, Karn N., et al.
Published: (2023)
Multichannel AV-wav2vec2: A Framework for Learning Multichannel Multi-Modal Speech Representation
by: Zhu, Qiushi, et al.
Published: (2024)
Audio-text Retrieval with Transformer-based Hierarchical Alignment and Disentangled Cross-modal Representation
by: Xin, Yifei, et al.
Published: (2024)
AudSemThinker: Enhancing Audio-Language Models through Reasoning over Semantics of Sound
by: Wijngaard, Gijs, et al.
Published: (2025)
Do Captioning Metrics Reflect Music Semantic Alignment?
by: Lee, Jinwoo, et al.
Published: (2024)
Leveraging Diverse Semantic-based Audio Pretrained Models for Singing Voice Conversion
by: Zhang, Xueyao, et al.
Published: (2023)
FusionAudio-1.2M: Towards Fine-grained Audio Captioning with Multimodal Contextual Fusion
by: Chen, Shunian, et al.
Published: (2025)
MOSS-Audio-Tokenizer: Scaling Audio Tokenizers for Future Audio Foundation Models
by: Gong, Yitian, et al.
Published: (2026)
Assessing the Alignment of Audio Representations with Timbre Similarity Ratings
by: Tian, Haokun, et al.
Published: (2025)
Spatial-Aware Conditioned Fusion for Audio-Visual Navigation
by: Wu, Shaohang, et al.
Published: (2026)
UniAV: Unified Audio-Visual Perception for Multi-Task Video Event Localization
by: Geng, Tiantian, et al.
Published: (2024)
Uncovering the Visual Contribution in Audio-Visual Speech Recognition
by: Lin, Zhaofeng, et al.
Published: (2024)