| Main Authors: | Wang, Yanan, Ren, Linjie, Li, Zihao, Wang, Junyi, Gan, Tian |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2601.15017 |
Similar Items
Query-Guided Spatial-Temporal-Frequency Interaction for Music Audio-Visual Question Answering
by: Li, Kun, et al.
Published: (2026)
Unfolding Spatial Cognition: Evaluating Multimodal Models on Visual Simulations
by: Li, Linjie, et al.
Published: (2025)
Visual Spatial Tuning
by: Yang, Rui, et al.
Published: (2025)
From Sequential to Spatial: Reordering Autoregression for Efficient Visual Generation
by: Wang, Siyang, et al.
Published: (2025)
MOSPA: Human Motion Generation Driven by Spatial Audio
by: Xu, Shuyang, et al.
Published: (2025)
Robust Audio-Visual Segmentation via Audio-Guided Visual Convergent Alignment
by: Liu, Chen, et al.
Published: (2025)
PixelWizard: Towards Efficient High-Fidelity Video Generation at Ultra-Large Spatial Resolution
by: Li, Wenxue, et al.
Published: (2026)
SpatialViz-Bench: A Cognitively-Grounded Benchmark for Diagnosing Spatial Visualization in MLLMs
by: Wang, Siting, et al.
Published: (2025)
Rethinking Token Pruning for Historical Screenshots in GUI Visual Agents: Semantic, Spatial, and Temporal Perspectives
by: Li, Daiqiang, et al.
Published: (2026)
pySpatial: Generating 3D Visual Programs for Zero-Shot Spatial Reasoning
by: Luo, Zhanpeng, et al.
Published: (2026)
OmniAudio: Generating Spatial Audio from 360-Degree Video
by: Liu, Huadai, et al.
Published: (2025)
Controllable Audio-Visual Viewpoint Generation from 360° Spatial Information
by: Marinoni, Christian, et al.
Published: (2025)
Diagnostic Benchmark and Iterative Inpainting for Layout-Guided Image Generation
by: Cho, Jaemin, et al.
Published: (2023)
EMO2: End-Effector Guided Audio-Driven Avatar Video Generation
by: Tian, Linrui, et al.
Published: (2025)
AV-DiT: Efficient Audio-Visual Diffusion Transformer for Joint Audio and Video Generation
by: Wang, Kai, et al.
Published: (2024)
Lightweight Spatial Embedding for Vision-based 3D Occupancy Prediction
by: Zhang, Jinqing, et al.
Published: (2024)
Leveraging the Spatial Hierarchy: Coarse-to-fine Trajectory Generation via Cascaded Hybrid Diffusion
by: Guo, Baoshen, et al.
Published: (2025)
Audio-Guided Visual Perception for Audio-Visual Navigation
by: Wang, Yi, et al.
Published: (2025)
SSG-Dit: A Spatial Signal Guided Framework for Controllable Video Generation
by: Hu, Peng, et al.
Published: (2025)
UniVL: Unified Vision-Language Embedding for Spatially Grounded Contextual Image Generation
by: Wang, Jiayun, et al.
Published: (2026)
SpatialImaginer: Towards Adaptive Visual Imagination for Spatial Reasoning
by: Li, Yian, et al.
Published: (2026)
VSFormer: Visual-Spatial Fusion Transformer for Correspondence Pruning
by: Liao, Tangfei, et al.
Published: (2023)
Wan-S2V: Audio-Driven Cinematic Video Generation
by: Gao, Xin, et al.
Published: (2025)
LLaVA-VSD: Large Language-and-Vision Assistant for Visual Spatial Description
by: Jin, Yizhang, et al.
Published: (2024)
OSInsert: Towards High-authenticity and High-fidelity Image Composition
by: Wang, Jingyuan, et al.
Published: (2026)
DyDiT++: Diffusion Transformers with Timestep and Spatial Dynamics for Efficient Visual Generation
by: Zhao, Wangbo, et al.
Published: (2025)
Identity-Preserving Text-to-Video Generation Guided by Simple yet Effective Spatial-Temporal Decoupled Representations
by: Wang, Yuji, et al.
Published: (2025)
SpatialBoost: Enhancing Visual Representation through Language-Guided Reasoning
by: Jeon, Byungwoo, et al.
Published: (2026)
Native Audio-Visual Alignment for Generation
by: Ji, Longbin, et al.
Published: (2026)
Real-time High-fidelity Gaussian Human Avatars with Position-based Interpolation of Spatially Distributed MLPs
by: Zhan, Youyi, et al.
Published: (2025)
Struct2D: A Perception-Guided Framework for Spatial Reasoning in MLLMs
by: Zhu, Fangrui, et al.
Published: (2025)
From Waveforms to Pixels: A Survey on Audio-Visual Segmentation
by: Li, Jia, et al.
Published: (2025)
PreciseCache: Precise Feature Caching for Efficient and High-fidelity Video Generation
by: Wang, Jiangshan, et al.
Published: (2026)
UniTalking: A Unified Audio-Video Framework for Talking Portrait Generation
by: Li, Hebeizi, et al.
Published: (2026)
Idea2Img: Iterative Self-Refinement with GPT-4V(ision) for Automatic Image Design and Generation
by: Yang, Zhengyuan, et al.
Published: (2023)
SpA2V: Harnessing Spatial Auditory Cues for Audio-driven Spatially-aware Video Generation
by: Pham, Kien T., et al.
Published: (2025)
VideoAnchor: Reinforcing Subspace-Structured Visual Cues for Coherent Visual-Spatial Reasoning
by: Wang, Zhaozhi, et al.
Published: (2025)
Free4D: Tuning-free 4D Scene Generation with Spatial-Temporal Consistency
by: Liu, Tianqi, et al.
Published: (2025)
Language-Guided Joint Audio-Visual Editing via One-Shot Adaptation
by: Liang, Susan, et al.
Published: (2024)
Spatial Orthogonal Refinement for Robust RGB-Event Visual Object Tracking
by: Huang, Dexing, et al.
Published: (2026)