| Main Authors: | Zhu, Yongxin, Li, Bocheng, Xin, Yifei, Xia, Zhihua, Xu, Linli |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2411.02038 |
Similar Items
Synchronized Video-to-Audio Generation via Mel Quantization-Continuum Decomposition
by: Wang, Juncheng, et al.
Published: (2025)
Input Conditioned Layer Dropping in Speech Foundation Models
by: Hannan, Abdul, et al.
Published: (2025)
Towards Video to Piano Music Generation with Chain-of-Perform Support Benchmarks
by: Liu, Chang, et al.
Published: (2025)
As Good as It KAN Get: High-Fidelity Audio Representation
by: Marszałek, Patryk, et al.
Published: (2025)
Spiking Structured State Space Model for Monaural Speech Enhancement
by: Du, Yu, et al.
Published: (2023)
HunyuanVideo-Foley: Multimodal Diffusion with Representation Alignment for High-Fidelity Foley Audio Generation
by: Shan, Sizhe, et al.
Published: (2025)
Efficient Training for Multilingual Visual Speech Recognition: Pre-training with Discretized Visual Speech Representation
by: Kim, Minsu, et al.
Published: (2024)
CoGenAV: Versatile Audio-Visual Representation Learning via Contrastive-Generative Synchronization
by: Bai, Detao, et al.
Published: (2025)
Modeling and Driving Human Body Soundfields through Acoustic Primitives
by: Huang, Chao, et al.
Published: (2024)
Auto-ACD: A Large-scale Dataset for Audio-Language Representation Learning
by: Sun, Luoyi, et al.
Published: (2023)
Separate to Collaborate: Dual-Stream Diffusion Model for Coordinated Piano Hand Motion Synthesis
by: Liu, Zihao, et al.
Published: (2025)
Controllable Dance Generation with Style-Guided Motion Diffusion
by: Wang, Hongsong, et al.
Published: (2024)
AV-SUPERB: A Multi-Task Evaluation Benchmark for Audio-Visual Representation Models
by: Tseng, Yuan, et al.
Published: (2023)
Generative Pre-trained Speech Language Model with Efficient Hierarchical Transformer
by: Zhu, Yongxin, et al.
Published: (2024)
Wav2Sem: Plug-and-Play Audio Semantic Decoupling for 3D Speech-Driven Facial Animation
by: Li, Hao, et al.
Published: (2025)
Dynamic Derivation and Elimination: Audio Visual Segmentation with Enhanced Audio Semantics
by: Liu, Chen, et al.
Published: (2025)
Multimodal Fusion Method with Spatiotemporal Sequences and Relationship Learning for Valence-Arousal Estimation
by: Yu, Jun, et al.
Published: (2024)
Multitask frame-level learning for few-shot sound event detection
by: Zou, Liang, et al.
Published: (2024)
Learning Musical Representations for Music Performance Question Answering
by: Diao, Xingjian, et al.
Published: (2025)
Learning Self-Supervised Audio-Visual Representations for Sound Recommendations
by: Krishnamurthy, Sudha
Published: (2024)
DETECLAP: Enhancing Audio-Visual Representation Learning with Object Information
by: Nakada, Shota, et al.
Published: (2024)
Leveraging Large Language Models in Visual Speech Recognition: Model Scaling, Context-Aware Decoding, and Iterative Polishing
by: Liu, Zehua, et al.
Published: (2025)
Bidirectional Autoregressive Diffusion Model for Dance Generation
by: Zhang, Canyu, et al.
Published: (2024)
RapVerse: Coherent Vocals and Whole-Body Motions Generations from Text
by: Chen, Jiaben, et al.
Published: (2024)
Music Genre Classification using Large Language Models
by: Meguenani, Mohamed El Amine, et al.
Published: (2024)
High-Quality Visually-Guided Sound Separation from Diverse Categories
by: Huang, Chao, et al.
Published: (2023)
Sonicmesh: Enhancing 3D Human Mesh Reconstruction in Vision-Impaired Environments With Acoustic Signals
by: Liang, Xiaoxuan, et al.
Published: (2024)
Enhancing Dance-to-Music Generation via Negative Conditioning Latent Diffusion Model
by: Sun, Changchang, et al.
Published: (2025)
EmoTalker: Emotionally Editable Talking Face Generation via Diffusion Model
by: Zhang, Bingyuan, et al.
Published: (2024)
Omni-AVSR: Towards Unified Multimodal Speech Recognition with Large Language Models
by: Cappellazzo, Umberto, et al.
Published: (2025)
SSAVSV: Towards Unified Model for Self-Supervised Audio-Visual Speaker Verification
by: Rajasekhar, Gnana Praveen, et al.
Published: (2025)
Bridging Audio and Vision: Zero-Shot Audiovisual Segmentation by Connecting Pretrained Models
by: Lee, Seung-jae, et al.
Published: (2025)
DDAVS: Disentangled Audio Semantics and Delayed Bidirectional Alignment for Audio-Visual Segmentation
by: Tian, Jingqi, et al.
Published: (2025)
RTFS-Net: Recurrent Time-Frequency Modelling for Efficient Audio-Visual Speech Separation
by: Pegg, Samuel, et al.
Published: (2023)
A Critical Assessment of Visual Sound Source Localization Models Including Negative Audio
by: Juanola, Xavier, et al.
Published: (2024)
Training-Free Deepfake Voice Recognition by Leveraging Large-Scale Pre-Trained Models
by: Pianese, Alessandro, et al.
Published: (2024)
RESOUND: Speech Reconstruction from Silent Videos via Acoustic-Semantic Decomposed Modeling
by: Pham, Long-Khanh, et al.
Published: (2025)
Generative Audio Language Modeling with Continuous-valued Tokens and Masked Next-Token Prediction
by: Yang, Shu-wen, et al.
Published: (2025)
Audio-Visual Speech Enhancement In Complex Scenarios With Separation And Dereverberation Joint Modeling
by: Du, Jiarong, et al.
Published: (2025)
SoundWeaver: Semantic Warm-Starting for Text-to-Audio Diffusion Serving
by: Barik, Ayush, et al.
Published: (2026)