| Main Authors: | Kim, Geewook, Seo, Minjoon |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Online Access: | https://arxiv.org/abs/2509.17901 |
Similar Items
AccKV: Towards Efficient Audio-Video LLMs Inference via Adaptive-Focusing and Cross-Calibration KV Cache Optimization
by: Jiang, Zhonghua, et al.
Published: (2025)
Do Joint Audio-Video Generation Models Understand Physics?
by: Cui, Zijun, et al.
Published: (2026)
UniTalking: A Unified Audio-Video Framework for Talking Portrait Generation
by: Li, Hebeizi, et al.
Published: (2026)
PAVAS: Physics-Aware Video-to-Audio Synthesis
by: Hyun-Bin, Oh, et al.
Published: (2025)
State-Space Hierarchical Compression with Gated Attention and Learnable Sampling for Hour-Long Video Understanding in Large Multimodal Models
by: Kim, Geewook, et al.
Published: (2025)
Tora3: Trajectory-Guided Audio-Video Generation with Physical Coherence
by: Liao, Junchao, et al.
Published: (2026)
JavisDiT++: Unified Modeling and Optimization for Joint Audio-Video Generation
by: Liu, Kai, et al.
Published: (2026)
Omni2Sound: Towards Unified Video-Text-to-Audio Generation
by: Dai, Yusheng, et al.
Published: (2026)
OmniSonic: Towards Universal and Holistic Audio Generation from Video and Text
by: Pian, Weiguo, et al.
Published: (2026)
AVBench: Human-Aligned and Automated Evaluation Benchmark for Audio-Video Generative Models
by: Yang, Jialiang, et al.
Published: (2026)
Omni-Customizer: End-to-End MultiModal Customization for Joint Audio-Video Generation
by: Chen, Yuheng, et al.
Published: (2026)
ControlFoley: Unified and Controllable Video-to-Audio Generation with Cross-Modal Conflict Handling
by: Yang, Jianxuan, et al.
Published: (2026)
Aligned Better, Listen Better for Audio-Visual Large Language Models
by: Guo, Yuxin, et al.
Published: (2025)
IsoSignVid2Aud: Sign Language Video to Audio Conversion without Text Intermediaries
by: Kavediya, Harsh, et al.
Published: (2025)
Hallo-Live: Real-Time Streaming Joint Audio-Video Avatar Generation with Asynchronous Dual-Stream and Human-Centric Preference Distillation
by: Li, Chunyu, et al.
Published: (2026)
MultiSoundGen: Video-to-Audio Generation for Multi-Event Scenarios via SlowFast Contrastive Audio-Visual Pretraining and Direct Preference Optimization
by: Yang, Jianxuan, et al.
Published: (2025)
AV-Surf: Surface-Enhanced Geometry-Aware Novel-View Acoustic Synthesis
by: Baek, Hadam, et al.
Published: (2025)
Sound Sparks Motion: Audio and Text Tuning for Video Editing
by: Razlighi, AmirHossein Naghi, et al.
Published: (2026)
VAInpaint: Zero-Shot Video-Audio inpainting framework with LLMs-driven Module
by: Wu, Kam Man, et al.
Published: (2025)
Unison: Harmonizing Motion, Speech, and Sound for Human-Centric Audio-Video Generation
by: Cheng, Shihao, et al.
Published: (2026)
Omni-Captioner: Data Pipeline, Models, and Benchmark for Omni Detailed Perception
by: Ma, Ziyang, et al.
Published: (2025)
ReactMotion: Generating Reactive Listener Motions from Speaker Utterance
by: Luo, Cheng, et al.
Published: (2026)
Adaptive Audio-Visual Speech Recognition via Matryoshka-Based Multimodal LLMs
by: Cappellazzo, Umberto, et al.
Published: (2025)
Zero-AVSR: Zero-Shot Audio-Visual Speech Recognition with LLMs by Learning Language-Agnostic Speech Representations
by: Yeo, Jeong Hun, et al.
Published: (2025)
Read, Watch and Scream! Sound Generation from Text and Video
by: Jeong, Yujin, et al.
Published: (2024)
On Efficient Language and Vision Assistants for Visually-Situated Natural Language Understanding: What Matters in Reading and Reasoning
by: Kim, Geewook, et al.
Published: (2024)
Diffusion Models for Joint Audio-Video Generation
by: La Torre, Alejandro Paredes
Published: (2026)
VidMuse: A Simple Video-to-Music Generation Framework with Long-Short-Term Modeling
by: Tian, Zeyue, et al.
Published: (2024)
Anomaly Detection and Localization for Speech Deepfakes via Feature Pyramid Matching
by: Coletta, Emma, et al.
Published: (2025)
AV-Edit: Multimodal Generative Sound Effect Editing via Audio-Visual Semantic Joint Control
by: Guo, Xinyue, et al.
Published: (2025)
MoLT: Mixture of Layer-Wise Tokens for Efficient Audio-Visual Learning
by: Rho, Kyeongha, et al.
Published: (2025)
Tempo as the Stable Cue: Hierarchical Mixture of Tempo and Beat Experts for Music to 3D Dance Generation
by: Lyu, Guangtao, et al.
Published: (2025)
Schrodinger Audio-Visual Editor: Object-Level Audiovisual Removal
by: Xu, Weihan, et al.
Published: (2025)
Audio-Visual World Models: Towards Multisensory Imagination in Sight and Sound
by: Wang, Jiahua, et al.
Published: (2025)
AudioStory: Generating Long-Form Narrative Audio with Large Language Models
by: Guo, Yuxin, et al.
Published: (2025)
TurboTalk: Progressive Distillation for One-Step Audio-Driven Talking Avatar Generation
by: Liu, Xiangyu, et al.
Published: (2026)
SonoWorld: From One Image to a 3D Audio-Visual Scene
by: Jin, Derong, et al.
Published: (2026)
Art2Mus: Artwork-to-Music Generation via Visual Conditioning and Large-Scale Cross-Modal Alignment
by: Rinaldi, Ivan, et al.
Published: (2026)
READ-Net: Clarifying Emotional Ambiguity via Adaptive Feature Recalibration for Audio-Visual Depression Detection
by: Chen, Chenglizhao, et al.
Published: (2026)
LongAV-Compass: Towards Unified Evaluation of Minute-Scale Audio-Visual Generation Across T2AV, I2AV, and V2AV
by: Liu, Tengfei, et al.
Published: (2026)