:: Library Catalog

Bibliographic Details
Main Authors:	Zhu, Yongxin; Li, Bocheng; Xin, Yifei; Xia, Zhihua; Xu, Linli
Format:	Preprint
Published:	2024
Subjects:	Machine Learning; Computer Vision and Pattern Recognition; Sound; Audio and Speech Processing
Online Access:	https://arxiv.org/abs/2411.02038

Similar Items

Synchronized Video-to-Audio Generation via Mel Quantization-Continuum Decomposition
by: Wang, Juncheng, et al.
Published: (2025)

Input Conditioned Layer Dropping in Speech Foundation Models
by: Hannan, Abdul, et al.
Published: (2025)

Towards Video to Piano Music Generation with Chain-of-Perform Support Benchmarks
by: Liu, Chang, et al.
Published: (2025)

As Good as It KAN Get: High-Fidelity Audio Representation
by: Marszałek, Patryk, et al.
Published: (2025)

Spiking Structured State Space Model for Monaural Speech Enhancement
by: Du, Yu, et al.
Published: (2023)

HunyuanVideo-Foley: Multimodal Diffusion with Representation Alignment for High-Fidelity Foley Audio Generation
by: Shan, Sizhe, et al.
Published: (2025)

Efficient Training for Multilingual Visual Speech Recognition: Pre-training with Discretized Visual Speech Representation
by: Kim, Minsu, et al.
Published: (2024)

CoGenAV: Versatile Audio-Visual Representation Learning via Contrastive-Generative Synchronization
by: Bai, Detao, et al.
Published: (2025)

Modeling and Driving Human Body Soundfields through Acoustic Primitives
by: Huang, Chao, et al.
Published: (2024)

Auto-ACD: A Large-scale Dataset for Audio-Language Representation Learning
by: Sun, Luoyi, et al.
Published: (2023)

Separate to Collaborate: Dual-Stream Diffusion Model for Coordinated Piano Hand Motion Synthesis
by: Liu, Zihao, et al.
Published: (2025)

Controllable Dance Generation with Style-Guided Motion Diffusion
by: Wang, Hongsong, et al.
Published: (2024)

AV-SUPERB: A Multi-Task Evaluation Benchmark for Audio-Visual Representation Models
by: Tseng, Yuan, et al.
Published: (2023)

Generative Pre-trained Speech Language Model with Efficient Hierarchical Transformer
by: Zhu, Yongxin, et al.
Published: (2024)

Wav2Sem: Plug-and-Play Audio Semantic Decoupling for 3D Speech-Driven Facial Animation
by: Li, Hao, et al.
Published: (2025)

Dynamic Derivation and Elimination: Audio Visual Segmentation with Enhanced Audio Semantics
by: Liu, Chen, et al.
Published: (2025)

Multimodal Fusion Method with Spatiotemporal Sequences and Relationship Learning for Valence-Arousal Estimation
by: Yu, Jun, et al.
Published: (2024)

Multitask frame-level learning for few-shot sound event detection
by: Zou, Liang, et al.
Published: (2024)

Learning Musical Representations for Music Performance Question Answering
by: Diao, Xingjian, et al.
Published: (2025)

Learning Self-Supervised Audio-Visual Representations for Sound Recommendations
by: Krishnamurthy, Sudha
Published: (2024)

DETECLAP: Enhancing Audio-Visual Representation Learning with Object Information
by: Nakada, Shota, et al.
Published: (2024)

Leveraging Large Language Models in Visual Speech Recognition: Model Scaling, Context-Aware Decoding, and Iterative Polishing
by: Liu, Zehua, et al.
Published: (2025)

Bidirectional Autoregressive Diffusion Model for Dance Generation
by: Zhang, Canyu, et al.
Published: (2024)

RapVerse: Coherent Vocals and Whole-Body Motions Generations from Text
by: Chen, Jiaben, et al.
Published: (2024)

Music Genre Classification using Large Language Models
by: Meguenani, Mohamed El Amine, et al.
Published: (2024)

High-Quality Visually-Guided Sound Separation from Diverse Categories
by: Huang, Chao, et al.
Published: (2023)

Sonicmesh: Enhancing 3D Human Mesh Reconstruction in Vision-Impaired Environments With Acoustic Signals
by: Liang, Xiaoxuan, et al.
Published: (2024)

Enhancing Dance-to-Music Generation via Negative Conditioning Latent Diffusion Model
by: Sun, Changchang, et al.
Published: (2025)

EmoTalker: Emotionally Editable Talking Face Generation via Diffusion Model
by: Zhang, Bingyuan, et al.
Published: (2024)

Omni-AVSR: Towards Unified Multimodal Speech Recognition with Large Language Models
by: Cappellazzo, Umberto, et al.
Published: (2025)

SSAVSV: Towards Unified Model for Self-Supervised Audio-Visual Speaker Verification
by: Rajasekhar, Gnana Praveen, et al.
Published: (2025)

Bridging Audio and Vision: Zero-Shot Audiovisual Segmentation by Connecting Pretrained Models
by: Lee, Seung-jae, et al.
Published: (2025)

DDAVS: Disentangled Audio Semantics and Delayed Bidirectional Alignment for Audio-Visual Segmentation
by: Tian, Jingqi, et al.
Published: (2025)

RTFS-Net: Recurrent Time-Frequency Modelling for Efficient Audio-Visual Speech Separation
by: Pegg, Samuel, et al.
Published: (2023)

A Critical Assessment of Visual Sound Source Localization Models Including Negative Audio
by: Juanola, Xavier, et al.
Published: (2024)

Training-Free Deepfake Voice Recognition by Leveraging Large-Scale Pre-Trained Models
by: Pianese, Alessandro, et al.
Published: (2024)

RESOUND: Speech Reconstruction from Silent Videos via Acoustic-Semantic Decomposed Modeling
by: Pham, Long-Khanh, et al.
Published: (2025)

Generative Audio Language Modeling with Continuous-valued Tokens and Masked Next-Token Prediction
by: Yang, Shu-wen, et al.
Published: (2025)

Audio-Visual Speech Enhancement In Complex Scenarios With Separation And Dereverberation Joint Modeling
by: Du, Jiarong, et al.
Published: (2025)

SoundWeaver: Semantic Warm-Starting for Text-to-Audio Diffusion Serving
by: Barik, Ayush, et al.
Published: (2026)