| Main authors: | Ahmad, Zeeshan; Bao, Shudi; Chen, Meng |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Online access: | https://arxiv.org/abs/2505.09091 |
Similar items
As Good as It KAN Get: High-Fidelity Audio Representation
by: Marszałek, Patryk, et al.
Published: (2025)
HunyuanVideo-Foley: Multimodal Diffusion with Representation Alignment for High-Fidelity Foley Audio Generation
by: Shan, Sizhe, et al.
Published: (2025)
CCStereo: Audio-Visual Contextual and Contrastive Learning for Binaural Audio Generation
by: Chen, Yuanhong, et al.
Published: (2025)
OmniAudio: Generating Spatial Audio from 360-Degree Video
by: Liu, Huadai, et al.
Published: (2025)
Mitigating Attention Sinks and Massive Activations in Audio-Visual Speech Recognition with LLMs
by: Anand, et al.
Published: (2025)
JAM-Flow: Joint Audio-Motion Synthesis with Flow Matching
by: Kwon, Mingi, et al.
Published: (2025)
DeepAudio-V1: Towards Multi-Modal Multi-Stage End-to-End Video to Speech and Audio Generation
by: Zhang, Haomin, et al.
Published: (2025)
Draw an Audio: Leveraging Multi-Instruction for Video-to-Audio Synthesis
by: Yang, Qi, et al.
Published: (2024)
Bootstrapping Audio-Visual Segmentation by Strengthening Audio Cues
by: Chen, Tianxiang, et al.
Published: (2024)
Multiple Consistency-guided Test-Time Adaptation for Contrastive Audio-Language Models with Unlabeled Audio
by: Chen, Gongyu, et al.
Published: (2024)
Dynamic Derivation and Elimination: Audio Visual Segmentation with Enhanced Audio Semantics
by: Liu, Chen, et al.
Published: (2025)
DeepSound-V1: Start to Think Step-by-Step in the Audio Generation from Videos
by: Liang, Yunming, et al.
Published: (2025)
Dual Audio-Centric Modality Coupling for Talking Head Generation
by: Fu, Ao, et al.
Published: (2025)
Frieren: Efficient Video-to-Audio Generation Network with Rectified Flow Matching
by: Wang, Yongqi, et al.
Published: (2024)
ThinkSound: Chain-of-Thought Reasoning in Multimodal Large Language Models for Audio Generation and Editing
by: Liu, Huadai, et al.
Published: (2025)
Video-to-Audio Generation with Hidden Alignment
by: Xu, Manjie, et al.
Published: (2024)
Audio-Plane: Audio Factorization Plane Gaussian Splatting for Real-Time Talking Head Synthesis
by: Shen, Shuai, et al.
Published: (2025)
Synchronized Video-to-Audio Generation via Mel Quantization-Continuum Decomposition
by: Wang, Juncheng, et al.
Published: (2025)
Conditional GAN for Enhancing Diffusion Models in Efficient and Authentic Global Gesture Generation from Audios
by: Cheng, Yongkang, et al.
Published: (2024)
AudioGAN: A Compact and Efficient Framework for Real-Time High-Fidelity Text-to-Audio Generation
by: Chung, HaeChun
Published: (2025)
Global-Local Distillation Network-Based Audio-Visual Speaker Tracking with Incomplete Modalities
by: Li, Yidi, et al.
Published: (2024)
Both Ears Wide Open: Towards Language-Driven Spatial Audio Generation
by: Sun, Peiwen, et al.
Published: (2024)
DDAVS: Disentangled Audio Semantics and Delayed Bidirectional Alignment for Audio-Visual Segmentation
by: Tian, Jingqi, et al.
Published: (2025)
AudioGen-Omni: A Unified Multimodal Diffusion Transformer for Video-Synchronized Audio, Speech, and Song Generation
by: Wang, Le, et al.
Published: (2025)
Generative Audio Language Modeling with Continuous-valued Tokens and Masked Next-Token Prediction
by: Yang, Shu-wen, et al.
Published: (2025)
TapToTab: Video-Based Guitar Tabs Generation using AI and Audio Analysis
by: Ghaleb, Ali, et al.
Published: (2024)
ExpGest: Expressive Speaker Generation Using Diffusion Model and Hybrid Audio-Text Guidance
by: Cheng, Yongkang, et al.
Published: (2024)
CoGenAV: Versatile Audio-Visual Representation Learning via Contrastive-Generative Synchronization
by: Bai, Detao, et al.
Published: (2025)
Efficient Video-to-Audio Generation via Multiple Foundation Models Mapper
by: Chen, Gehui, et al.
Published: (2025)
Text-driven Talking Face Synthesis by Reprogramming Audio-driven Models
by: Choi, Jeongsoo, et al.
Published: (2023)
MMAudio: Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis
by: Cheng, Ho Kei, et al.
Published: (2024)
Weakly-supervised Audio Temporal Forgery Localization via Progressive Audio-language Co-learning Network
by: Wu, Junyan, et al.
Published: (2025)
MoME: Mixture of Matryoshka Experts for Audio-Visual Speech Recognition
by: Cappellazzo, Umberto, et al.
Published: (2025)
Learning to Highlight Audio by Watching Movies
by: Huang, Chao, et al.
Published: (2025)
Aligned Better, Listen Better for Audio-Visual Large Language Models
by: Guo, Yuxin, et al.
Published: (2025)
Step-by-Step Video-to-Audio Synthesis via Negative Audio Guidance
by: Hayakawa, Akio, et al.
Published: (2025)
Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners
by: Xing, Yazhou, et al.
Published: (2024)
Segment Beyond View: Handling Partially Missing Modality for Audio-Visual Semantic Segmentation
by: Wu, Renjie, et al.
Published: (2023)
Pindrop it! Audio and Visual Deepfake Countermeasures for Robust Detection and Fine Grained-Localization
by: Klein, Nicholas, et al.
Published: (2025)
ZeroSep: Separate Anything in Audio with Zero Training
by: Huang, Chao, et al.
Published: (2025)