| Main Authors: | Fan, Pingyi, Jiang, Anbai, Zhang, Shuwei, Lv, Zhiqiang, Han, Bing, Zheng, Xinhu, Liang, Wenrui, Li, Junjie, Zhang, Wei-Qiang, Qian, Yanmin, Chen, Xie, Lu, Cheng, Liu, Jia |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2507.16696 |
Similar Items
Exploring Self-Supervised Audio Models for Generalized Anomalous Sound Detection
by: Han, Bing, et al.
Published: (2025)
Improving Anomalous Sound Detection via Low-Rank Adaptation Fine-Tuning of Pre-Trained Audio Models
by: Zheng, Xinhu, et al.
Published: (2024)
AnoPatch: Towards Better Consistency in Machine Anomalous Sound Detection
by: Jiang, Anbai, et al.
Published: (2024)
Data-Efficient Low-Complexity Acoustic Scene Classification via Distilling and Progressive Pruning
by: Han, Bing, et al.
Published: (2024)
AVE Speech: A Comprehensive Multi-Modal Dataset for Speech Recognition Integrating Audio, Visual, and Electromyographic Signals
by: Zhou, Dongliang, et al.
Published: (2025)
HAFM: Hierarchical Autoregressive Foundation Model for Music Accompaniment Generation
by: Zhu, Jian, et al.
Published: (2026)
ZO-ASR: Zeroth-Order Fine-Tuning of Speech Foundation Models without Back-Propagation
by: Peng, Yuezhang, et al.
Published: (2025)
Delayed Commitment for Representation Readiness in Stage-wise Audio-Visual Learning
by: Xu, Xinmeng, et al.
Published: (2026)
A Survey on Music Generation from Single-Modal, Cross-Modal, and Multi-Modal Perspectives
by: Li, Shuyu, et al.
Published: (2025)
Representation Learning for Semantic Alignment of Language, Audio, and Visual Modalities
by: Sudarsanam, Parthasaarathy, et al.
Published: (2025)
SpotSound: Enhancing Large Audio-Language Models with Fine-Grained Temporal Grounding
by: Sun, Luoyi, et al.
Published: (2026)
3MDiT: Unified Tri-Modal Diffusion Transformer for Text-Driven Synchronized Audio-Video Generation
by: Li, Yaoru, et al.
Published: (2025)
Audio-Visual Separation with Hierarchical Fusion and Representation Alignment
by: Hu, Han, et al.
Published: (2025)
Robust Wake Word Spotting With Frame-Level Cross-Modal Attention Based Audio-Visual Conformer
by: Wang, Haoxu, et al.
Published: (2024)
MeMo: Attentional Momentum for Real-time Audio-visual Speaker Extraction under Impaired Visual Conditions
by: Li, Junjie, et al.
Published: (2025)
EigeNet: Geometry-Informed Multi-Modal Learning for Few-shot Novel View RIR Prediction
by: Jing, Chong, et al.
Published: (2026)
HDA-SELD: Hierarchical Cross-Modal Distillation with Multi-Level Data Augmentation for Low-Resource Audio-Visual Sound Event Localization and Detection
by: Wang, Qing, et al.
Published: (2025)
SonicGauss: Position-Aware Physical Sound Synthesis for 3D Gaussian Representations
by: Wang, Chunshi, et al.
Published: (2025)
A Unified Framework for Modality-Agnostic Deepfakes Detection
by: Yu, Cai, et al.
Published: (2023)
SHMamba: Structured Hyperbolic State Space Model for Audio-Visual Question Answering
by: Yang, Zhe, et al.
Published: (2024)
A Survey on Cross-Modal Interaction Between Music and Multimodal Data
by: Li, Sifei, et al.
Published: (2025)
Spiking Tucker Fusion Transformer for Audio-Visual Zero-Shot Learning
by: Li, Wenrui, et al.
Published: (2024)
CLAIP-Emo: Parameter-Efficient Adaptation of Language-supervised models for In-the-Wild Audiovisual Emotion Recognition
by: Chen, Yin, et al.
Published: (2025)
CoopASD: Cooperative Machine Anomalous Sound Detection with Privacy Concerns
by: Jiang, Anbai, et al.
Published: (2024)
ControlFoley: Unified and Controllable Video-to-Audio Generation with Cross-Modal Conflict Handling
by: Yang, Jianxuan, et al.
Published: (2026)
MG-Former: A Transformer-Based Framework for Music-Driven 3D Conducting Gesture Generation
by: Qiu, Ke, et al.
Published: (2026)
EgoAdapt: Enhancing Robustness in Egocentric Interactive Speaker Detection Under Missing Modalities
by: Qian, Xinyuan, et al.
Published: (2026)
Research on Piano Timbre Transformation System Based on Diffusion Model
by: Hsu, Chun-Chieh, et al.
Published: (2026)
SyncGuard: Robust Audio Watermarking Capable of Countering Desynchronization Attacks
by: Gan, Zhenliang, et al.
Published: (2025)
Voice Evaluation of Reasoning Ability: Diagnosing the Modality-Induced Performance Gap
by: Lin, Yueqian, et al.
Published: (2025)
MART: Learning Hierarchical Music Audio Representations with Part-Whole Transformer
by: Yao, Dong, et al.
Published: (2023)
Physics-Aware Novel-View Acoustic Synthesis with Vision-Language Priors and 3D Acoustic Environment Modeling
by: Fan, Congyi, et al.
Published: (2026)
XGC-AVis: Towards Audio-Visual Content Understanding with a Multi-Agent Collaborative System
by: Cao, Yuqin, et al.
Published: (2025)
RenCon 2025: Revival of the Expressive Performance Rendering Competition
by: Zhang, Huan, et al.
Published: (2026)
MusicAOG: an Energy-Based Model for Learning and Sampling a Hierarchical Representation of Symbolic Music
by: Qian, Yikai, et al.
Published: (2024)
AMB-DSGDN: Adaptive Modality-Balanced Dynamic Semantic Graph Differential Network for Multimodal Emotion Recognition
by: Wang, Yunsheng, et al.
Published: (2026)
DualDub: Video-to-Soundtrack Generation via Joint Speech and Background Audio Synthesis
by: Tian, Wenjie, et al.
Published: (2025)
Video Echoed in Music: Semantic, Temporal, and Rhythmic Alignment for Video-to-Music Generation
by: Tong, Xinyi, et al.
Published: (2025)
LCB-net: Long-Context Biasing for Audio-Visual Speech Recognition
by: Yu, Fan, et al.
Published: (2024)
Omni-Customizer: End-to-End MultiModal Customization for Joint Audio-Video Generation
by: Chen, Yuheng, et al.
Published: (2026)