| Main Authors: | Jiang, Yi-Lu, Chang, Wen-Chang, Wang, Ching-Lin, Hsu, Kung-Liang, Chiu, Chih-Yi |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2505.11020 |
Similar Items
Efficient Video to Audio Mapper with Visual Scene Detection
by: Yi, Mingjing, et al.
Published: (2024)
Quality-Aware End-to-End Audio-Visual Neural Speaker Diarization
by: He, Mao-Kui, et al.
Published: (2024)
AudioRole: An Audio Dataset for Character Role-Playing in Large Language Models
by: Li, Wenyu, et al.
Published: (2025)
Sound-VECaps: Improving Audio Generation with Visual Enhanced Captions
by: Yuan, Yi, et al.
Published: (2024)
Audio-Language Models for Audio-Centric Tasks: A Systematic Survey
by: Su, Yi, et al.
Published: (2025)
Bridging The Multi-Modality Gaps of Audio, Visual and Linguistic for Speech Enhancement
by: Lin, Meng-Ping, et al.
Published: (2025)
BERT-like Pre-training for Symbolic Piano Music Classification Tasks
by: Chou, Yi-Hui, et al.
Published: (2021)
Rhythmic Foley: A Framework For Seamless Audio-Visual Alignment In Video-to-Audio Synthesis
by: Huang, Zhiqi, et al.
Published: (2024)
Building Audio-Visual Digital Twins with Smartphones
by: Lan, Zitong, et al.
Published: (2025)
Cinematic Audio Source Separation Using Visual Cues
by: Zhang, Kang, et al.
Published: (2026)
Attentive-based Multi-level Feature Fusion for Voice Disorder Diagnosis
by: Shen, Lipeng, et al.
Published: (2024)
Self-Attention and Hybrid Features for Replay and Deep-Fake Audio Detection
by: Huang, Lian, et al.
Published: (2024)
FreeAudio: Training-Free Timing Planning for Controllable Long-Form Text-to-Audio Generation
by: Jiang, Yuxuan, et al.
Published: (2025)
Audio-Visual Speaker Tracking: Progress, Challenges, and Future Directions
by: Zhao, Jinzheng, et al.
Published: (2023)
Zero-Shot Fake Video Detection by Audio-Visual Consistency
by: Li, Xiaolou, et al.
Published: (2024)
Audio-Visual Speech Separation via Bottleneck Iterative Network
by: Zhang, Sidong, et al.
Published: (2025)
Representation Learning for Semantic Alignment of Language, Audio, and Visual Modalities
by: Sudarsanam, Parthasaarathy, et al.
Published: (2025)
RAVSS: Robust Audio-Visual Speech Separation in Multi-Speaker Scenarios with Missing Visual Cues
by: Pan, Tianrui, et al.
Published: (2024)
UNQA: Unified No-Reference Quality Assessment for Audio, Image, Video, and Audio-Visual Content
by: Cao, Yuqin, et al.
Published: (2024)
LCB-net: Long-Context Biasing for Audio-Visual Speech Recognition
by: Yu, Fan, et al.
Published: (2024)
Human-Inspired Computing for Robust and Efficient Audio-Visual Speech Recognition
by: Liu, Qianhui, et al.
Published: (2024)
MuseAgent-1: Interactive Grounded Multimodal Understanding of Music Scores and Performance Audio
by: Zhao, Qihao, et al.
Published: (2026)
Robust LLM-based Audio-Visual Speech Recognition with Sparse Modality Alignment and Visual Unit-Guided Refinement
by: Su, Fei, et al.
Published: (2026)
Multimodal Emotion Recognition from Raw Audio with Sinc-convolution
by: Zhang, Xiaohui, et al.
Published: (2024)
SONIQUE: Video Background Music Generation Using Unpaired Audio-Visual Data
by: Zhang, Liqian, et al.
Published: (2024)
Plug-and-Steer: Decoupling Separation and Selection in Audio-Visual Target Speaker Extraction
by: Kwak, Doyeop, et al.
Published: (2026)
ecVoice: Audio Text Extraction and Optimization of Video Based on Idioms Similarity Replacement
by: Lin, Jinwei
Published: (2024)
Robust Wake Word Spotting With Frame-Level Cross-Modal Attention Based Audio-Visual Conformer
by: Wang, Haoxu, et al.
Published: (2024)
X-CrossNet: A complex spectral mapping approach to target speaker extraction with cross attention speaker embedding fusion
by: Sun, Chang, et al.
Published: (2024)
V2A-DPO: Omni-Preference Optimization for Video-to-Audio Generation
by: Chan, Nolan, et al.
Published: (2026)
STA-V2A: Video-to-Audio Generation with Semantic and Temporal Alignment
by: Ren, Yong, et al.
Published: (2024)
FGAS: Fixed Decoder Network-Based Audio Steganography with Adversarial Perturbation Generation
by: Yan, Jialin, et al.
Published: (2025)
StereoFoley: Object-Aware Stereo Audio Generation from Video
by: Karchkhadze, Tornike, et al.
Published: (2025)
MART: Learning Hierarchical Music Audio Representations with Part-Whole Transformer
by: Yao, Dong, et al.
Published: (2023)
Leveraging Pre-trained AudioLDM for Sound Generation: A Benchmark Study
by: Yuan, Yi, et al.
Published: (2023)
Mel-Refine: A Plug-and-Play Approach to Refine Mel-Spectrogram in Audio Generation
by: Guo, Hongming, et al.
Published: (2024)
AV-SUPERB: A Multi-Task Evaluation Benchmark for Audio-Visual Representation Models
by: Tseng, Yuan, et al.
Published: (2023)
MusicSem: A Semantically Rich Language–Audio Dataset of Natural Music Descriptions
by: Salganik, Rebecca, et al.
Published: (2026)
Unveiling Visual Biases in Audio-Visual Localization Benchmarks
by: Chen, Liangyu, et al.
Published: (2024)
Retrieval-Augmented Text-to-Audio Generation
by: Yuan, Yi, et al.
Published: (2023)