| Main Authors: | Hyun-Bin, Oh; Takida, Yuhta; Uesaka, Toshimitsu; Oh, Tae-Hyun; Mitsufuji, Yuki |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2512.08282 |
Similar Items
Sound2Vision: Generating Diverse Visuals from Audio through Cross-Modal Latent Alignment
by: Sung-Bin, Kim, et al.
Published: (2024)
Aligning Sight and Sound: Advanced Sound Source Localization Through Audio-Visual Alignment
by: Senocak, Arda, et al.
Published: (2024)
StereoSync: Spatially-Aware Stereo Audio Generation from Video
by: Marinoni, Christian, et al.
Published: (2025)
MMDisCo: Multi-Modal Discriminator-Guided Cooperative Diffusion for Joint Audio and Video Generation
by: Hayakawa, Akio, et al.
Published: (2024)
Sounding Highlights: Dual-Pathway Audio Encoders for Audio-Visual Video Highlight Detection
by: Joo, Seohyun, et al.
Published: (2026)
Schrodinger Audio-Visual Editor: Object-Level Audiovisual Removal
by: Xu, Weihan, et al.
Published: (2025)
Tora3: Trajectory-Guided Audio-Video Generation with Physical Coherence
by: Liao, Junchao, et al.
Published: (2026)
Draw an Audio: Leveraging Multi-Instruction for Video-to-Audio Synthesis
by: Yang, Qi, et al.
Published: (2024)
JavisDiT++: Unified Modeling and Optimization for Joint Audio-Video Generation
by: Liu, Kai, et al.
Published: (2026)
Omni2Sound: Towards Unified Video-Text-to-Audio Generation
by: Dai, Yusheng, et al.
Published: (2026)
Visual Echoes: A Simple Unified Transformer for Audio-Visual Generation
by: Yang, Shiqi, et al.
Published: (2024)
OmniSonic: Towards Universal and Holistic Audio Generation from Video and Text
by: Pian, Weiguo, et al.
Published: (2026)
UniTalking: A Unified Audio-Video Framework for Talking Portrait Generation
by: Li, Hebeizi, et al.
Published: (2026)
Omni-Customizer: End-to-End MultiModal Customization for Joint Audio-Video Generation
by: Chen, Yuheng, et al.
Published: (2026)
ControlFoley: Unified and Controllable Video-to-Audio Generation with Cross-Modal Conflict Handling
by: Yang, Jianxuan, et al.
Published: (2026)
MultiSoundGen: Video-to-Audio Generation for Multi-Event Scenarios via SlowFast Contrastive Audio-Visual Pretraining and Direct Preference Optimization
by: Yang, Jianxuan, et al.
Published: (2025)
Do Joint Audio-Video Generation Models Understand Physics?
by: Cui, Zijun, et al.
Published: (2026)
IsoSignVid2Aud: Sign Language Video to Audio Conversion without Text Intermediaries
by: Kavediya, Harsh, et al.
Published: (2025)
Sound Sparks Motion: Audio and Text Tuning for Video Editing
by: Razlighi, AmirHossein Naghi, et al.
Published: (2026)
AudioStory: Generating Long-Form Narrative Audio with Large Language Models
by: Guo, Yuxin, et al.
Published: (2025)
AVBench: Human-Aligned and Automated Evaluation Benchmark for Audio-Video Generative Models
by: Yang, Jialiang, et al.
Published: (2026)
AccKV: Towards Efficient Audio-Video LLMs Inference via Adaptive-Focusing and Cross-Calibration KV Cache Optimization
by: Jiang, Zhonghua, et al.
Published: (2025)
Hallo-Live: Real-Time Streaming Joint Audio-Video Avatar Generation with Asynchronous Dual-Stream and Human-Centric Preference Distillation
by: Li, Chunyu, et al.
Published: (2026)
Unison: Harmonizing Motion, Speech, and Sound for Human-Centric Audio-Video Generation
by: Cheng, Shihao, et al.
Published: (2026)
Diffusion Models for Joint Audio-Video Generation
by: La Torre, Alejandro Paredes
Published: (2026)
Apollo: Unified Multi-Task Audio-Video Joint Generation
by: Wang, Jun, et al.
Published: (2026)
Audio-Visual World Models: Towards Multisensory Imagination in Sight and Sound
by: Wang, Jiahua, et al.
Published: (2025)
AVRT: Audio-Visual Reasoning Transfer through Single-Modality Teachers
by: Araujo, Edson, et al.
Published: (2026)
OmniForcing: Unleashing Real-time Joint Audio-Visual Generation
by: Su, Yaofeng, et al.
Published: (2026)
MoLT: Mixture of Layer-Wise Tokens for Efficient Audio-Visual Learning
by: Rho, Kyeongha, et al.
Published: (2025)
TalkVerse: Democratizing Minute-Long Audio-Driven Video Generation
by: Wang, Zhenzhi, et al.
Published: (2025)
FlowPortrait: Reinforcement Learning for Audio-Driven Portrait Video Generation
by: Tan, Weiting, et al.
Published: (2026)
Video-to-Audio Generation with Hidden Alignment
by: Xu, Manjie, et al.
Published: (2024)
Temporally Aligned Audio for Video with Autoregression
by: Viertola, Ilpo, et al.
Published: (2024)
AC-Foley: Reference-Audio-Guided Video-to-Audio Synthesis with Acoustic Transfer
by: Fang, Pengjun, et al.
Published: (2026)
TurboTalk: Progressive Distillation for One-Step Audio-Driven Talking Avatar Generation
by: Liu, Xiangyu, et al.
Published: (2026)
SonoWorld: From One Image to a 3D Audio-Visual Scene
by: Jin, Derong, et al.
Published: (2026)
AV-Edit: Multimodal Generative Sound Effect Editing via Audio-Visual Semantic Joint Control
by: Guo, Xinyue, et al.
Published: (2025)
READ-Net: Clarifying Emotional Ambiguity via Adaptive Feature Recalibration for Audio-Visual Depression Detection
by: Chen, Chenglizhao, et al.
Published: (2026)
Diffusion Models as Masked Audio-Video Learners
by: Nunez, Elvis, et al.
Published: (2023)