| Main Authors: | Li, Guangyao, Wang, Xin, Zhu, Wenwu |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2603.06530 |
Similar Items
Crab: A Unified Audio-Visual Scene Understanding Model with Explicit Cooperation
by: Du, Henghui, et al.
Published: (2025)
A Unified Framework for 3D Scene Understanding
by: Xu, Wei, et al.
Published: (2024)
T2AV-Compass: Towards Unified Evaluation for Text-to-Audio-Video Generation
by: Cao, Zhe, et al.
Published: (2025)
LongAV-Compass: Towards Unified Evaluation of Minute-Scale Audio-Visual Generation Across T2AV, I2AV, and V2AV
by: Liu, Tengfei, et al.
Published: (2026)
Think with Grounding: Curriculum Reinforced Reasoning with Video Grounding for Long Video Understanding
by: Chen, Houlun, et al.
Published: (2026)
Crab$^{+}$: A Scalable and Unified Audio-Visual Scene Understanding Model with Explicit Cooperation
by: Cai, Dongnuan, et al.
Published: (2026)
Multi-weather Cross-view Geo-localization Using Denoising Diffusion Models
by: Feng, Tongtong, et al.
Published: (2024)
UniScene: Unified Occupancy-centric Driving Scene Generation
by: Li, Bohan, et al.
Published: (2024)
MGNiceNet: Unified Monocular Geometric Scene Understanding
by: Schön, Markus, et al.
Published: (2024)
Unified Semantic Transformer for 3D Scene Understanding
by: Koch, Sebastian, et al.
Published: (2025)
AV2AV: Direct Audio-Visual Speech to Audio-Visual Speech Translation with Unified Audio-Visual Speech Representation
by: Choi, Jeongsoo, et al.
Published: (2023)
Ref-AVS: Refer and Segment Objects in Audio-Visual Scenes
by: Wang, Yaoting, et al.
Published: (2024)
The Shape of Sight: A Homological Framework for Unifying Visual Perception
by: Li, Xin
Published: (2018)
Hyper-Bagel: A Unified Acceleration Framework for Multimodal Understanding and Generation
by: Lu, Yanzuo, et al.
Published: (2025)
UniAV: Unified Audio-Visual Perception for Multi-Task Video Event Localization
by: Geng, Tiantian, et al.
Published: (2024)
HERMES++: Toward a Unified Driving World Model for 3D Scene Understanding and Generation
by: Zhou, Xin, et al.
Published: (2026)
A Unified Framework for Human-centric Point Cloud Video Understanding
by: Xu, Yiteng, et al.
Published: (2024)
UniModel: A Visual-Only Framework for Unified Multimodal Understanding and Generation
by: Zhang, Chi, et al.
Published: (2025)
HERMES: A Unified Self-Driving World Model for Simultaneous 3D Scene Understanding and Generation
by: Zhou, Xin, et al.
Published: (2025)
UMCFuse: A Unified Multiple Complex Scenes Infrared and Visible Image Fusion Framework
by: Li, Xilai, et al.
Published: (2024)
Unified 3D Scene Understanding Through Physical World Modeling
by: Lee, Wanhee, et al.
Published: (2026)
Emotion-Qwen: A Unified Framework for Emotion and Vision Understanding
by: Huang, Dawei, et al.
Published: (2025)
ImagiDrive: A Unified Imagination-and-Planning Framework for Autonomous Driving
by: Li, Jingyu, et al.
Published: (2025)
SceneFactory: A Workflow-centric and Unified Framework for Incremental Scene Modeling
by: Yuan, Yijun, et al.
Published: (2024)
PRISM: A Unified Framework for Photorealistic Reconstruction and Intrinsic Scene Modeling
by: Dirik, Alara, et al.
Published: (2025)
Towards Unified Surgical Scene Understanding: Bridging Reasoning and Grounding via MLLMs
by: Huang, Jincai, et al.
Published: (2026)
LightFusion: A Light-weighted, Double Fusion Framework for Unified Multimodal Understanding and Generation
by: Wang, Zeyu, et al.
Published: (2025)
MergeMix: A Unified Augmentation Paradigm for Visual and Multi-Modal Understanding
by: Jin, Xin, et al.
Published: (2025)
UniTalking: A Unified Audio-Video Framework for Talking Portrait Generation
by: Li, Hebeizi, et al.
Published: (2026)
UniDGF: A Unified Detection-to-Generation Framework for Hierarchical Object Visual Recognition
by: Nan, Xinyu, et al.
Published: (2025)
Unified Personalized Understanding, Generating and Editing
by: Zhong, Yu, et al.
Published: (2026)
Unified Reward Model for Multimodal Understanding and Generation
by: Wang, Yibin, et al.
Published: (2025)
Temporal2Seq: A Unified Framework for Temporal Video Understanding Tasks
by: Yang, Min, et al.
Published: (2024)
Human Motion Synthesis in 3D Scenes via Unified Scene Semantic Occupancy
by: Jingyu, Gong, et al.
Published: (2025)
TUMTraffic-VideoQA: A Benchmark for Unified Spatio-Temporal Video Understanding in Traffic Scenes
by: Zhou, Xingcheng, et al.
Published: (2025)
InstructAV2AV: Instruction-Guided Audio-Video Joint Editing
by: Zheng, Haojie, et al.
Published: (2026)
MME-Unify: A Comprehensive Benchmark for Unified Multimodal Understanding and Generation Models
by: Xie, Wulin, et al.
Published: (2025)
A Unified Diffusion Framework for Scene-aware Human Motion Estimation from Sparse Signals
by: Tang, Jiangnan, et al.
Published: (2024)
UniEval: Unified Holistic Evaluation for Unified Multimodal Understanding and Generation
by: Li, Yi, et al.
Published: (2025)
MammothModa2: A Unified AR-Diffusion Framework for Multimodal Understanding and Generation
by: Shen, Tao, et al.
Published: (2025)