| Main Authors: | Li, Yaxuan, Jiang, Han, Ma, Yifei, Qin, Shihua, Woo, Jonghye, Xing, Fangxu |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Online Access: | https://arxiv.org/abs/2503.06588 |
Similar Items
Speech motion anomaly detection via cross-modal translation of 4D motion fields from tagged MRI
by: Liu, Xiaofeng, et al.
Published: (2024)
Semi-Supervised Bone Marrow Lesion Detection from Knee MRI Segmentation Using Mask Inpainting Models
by: Qin, Shihua, et al.
Published: (2024)
SIREM: Speech-Informed MRI Reconstruction with Learned Sampling
by: Hasan, Md, et al.
Published: (2026)
TARO: Timestep-Adaptive Representation Alignment with Onset-Aware Conditioning for Synchronized Video-to-Audio Synthesis
by: Ton, Tri, et al.
Published: (2025)
MMAudio-LABEL: Audio Event Labeling via Audio Generation for Silent Video
by: Tateishi, Kazuya, et al.
Published: (2026)
FoleyDirector: Fine-Grained Temporal Steering for Video-to-Audio Generation via Structured Scripts
by: Li, You, et al.
Published: (2026)
AudioStory: Generating Long-Form Narrative Audio with Large Language Models
by: Guo, Yuxin, et al.
Published: (2025)
UniLS: End-to-End Audio-Driven Avatars for Unified Listening and Speaking
by: Chu, Xuangeng, et al.
Published: (2025)
Unison: Harmonizing Motion, Speech, and Sound for Human-Centric Audio-Video Generation
by: Cheng, Shihao, et al.
Published: (2026)
Conditional GAN for Enhancing Diffusion Models in Efficient and Authentic Global Gesture Generation from Audios
by: Cheng, Yongkang, et al.
Published: (2024)
MOVA: Towards Scalable and Synchronized Video-Audio Generation
by: OpenMOSS Team, et al.
Published: (2026)
An Audio-Visual Speech Separation Model Inspired by Cortico-Thalamo-Cortical Circuits
by: Li, Kai, et al.
Published: (2022)
Taming Text-to-Sounding Video Generation via Advanced Modality Condition and Interaction
by: Guan, Kaisi, et al.
Published: (2025)
Separate in the Speech Chain: Cross-Modal Conditional Audio-Visual Target Speech Extraction
by: Mu, Zhaoxi, et al.
Published: (2024)
VAEmo: Efficient Representation Learning for Visual-Audio Emotion with Knowledge Injection
by: Cheng, Hao, et al.
Published: (2025)
MoME: Mixture of Matryoshka Experts for Audio-Visual Speech Recognition
by: Cappellazzo, Umberto, et al.
Published: (2025)
Improving Lip-synchrony in Direct Audio-Visual Speech-to-Speech Translation
by: Goncalves, Lucas, et al.
Published: (2024)
Robust Audio-Visual Segmentation via Audio-Guided Visual Convergent Alignment
by: Liu, Chen, et al.
Published: (2025)
Tora3: Trajectory-Guided Audio-Video Generation with Physical Coherence
by: Liao, Junchao, et al.
Published: (2026)
DeepAudio-V1:Towards Multi-Modal Multi-Stage End-to-End Video to Speech and Audio Generation
by: Zhang, Haomin, et al.
Published: (2025)
Efficient Audio-Visual Speech Separation with Discrete Lip Semantics and Multi-Scale Global-Local Attention
by: Li, Kai, et al.
Published: (2025)
WavFlow: Audio Generation in Waveform Space
by: Zhou, Feiyan, et al.
Published: (2026)
Audiovisual Masked Autoencoders
by: Georgescu, Mariana-Iuliana, et al.
Published: (2022)
AudioGen-Omni: A Unified Multimodal Diffusion Transformer for Video-Synchronized Audio, Speech, and Song Generation
by: Wang, Le, et al.
Published: (2025)
Omni2Sound: Towards Unified Video-Text-to-Audio Generation
by: Dai, Yusheng, et al.
Published: (2026)
InterActHuman: Multi-Concept Human Animation with Layout-Aligned Audio Conditions
by: Wang, Zhenzhi, et al.
Published: (2025)
CAV-MAE Sync: Improving Contrastive Audio-Visual Mask Autoencoders via Fine-Grained Alignment
by: Araujo, Edson, et al.
Published: (2025)
VABench: A Comprehensive Benchmark for Audio-Video Generation
by: Hua, Daili, et al.
Published: (2025)
Mutual Forcing: Dual-Mode Self-Evolution for Fast Autoregressive Audio-Video Character Generation
by: Zhou, Yupeng, et al.
Published: (2026)
Hierarchical Semantic Correlation-Aware Masked Autoencoder for Unsupervised Audio-Visual Representation Learning
by: Zeng, Donghuo, et al.
Published: (2026)
Large Language Models are Strong Audio-Visual Speech Recognition Learners
by: Cappellazzo, Umberto, et al.
Published: (2024)
TRACE: Training-Free Partial Audio Deepfake Detection via Embedding Trajectory Analysis of Speech Foundation Models
by: Khan, Awais, et al.
Published: (2026)
CoGenAV: Versatile Audio-Visual Representation Learning via Contrastive-Generative Synchronization
by: Bai, Detao, et al.
Published: (2025)
Seeing is Believing: Emotion-Aware Audio-Visual Language Modeling for Expressive Speech Generation
by: Tan, Weiting, et al.
Published: (2025)
VinTAGe: Joint Video and Text Conditioning for Holistic Audio Generation
by: Kushwaha, Saksham Singh, et al.
Published: (2024)
Adaptive Audio-Visual Speech Recognition via Matryoshka-Based Multimodal LLMs
by: Cappellazzo, Umberto, et al.
Published: (2025)
Semantics-Aware Human Motion Generation from Audio Instructions
by: Wang, Zi-An, et al.
Published: (2025)
Omni-Customizer: End-to-End MultiModal Customization for Joint Audio-Video Generation
by: Chen, Yuheng, et al.
Published: (2026)
OmniAudio: Generating Spatial Audio from 360-Degree Video
by: Liu, Huadai, et al.
Published: (2025)
Enhancing Dance-to-Music Generation via Negative Conditioning Latent Diffusion Model
by: Sun, Changchang, et al.
Published: (2025)