Library Catalog

Bibliographic Details
Main Authors:	Li, Guangyao, Wang, Xin, Zhu, Wenwu
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2603.06530

Similar Items

Crab: A Unified Audio-Visual Scene Understanding Model with Explicit Cooperation
by: Du, Henghui, et al.
Published: (2025)

A Unified Framework for 3D Scene Understanding
by: Xu, Wei, et al.
Published: (2024)

T2AV-Compass: Towards Unified Evaluation for Text-to-Audio-Video Generation
by: Cao, Zhe, et al.
Published: (2025)

LongAV-Compass: Towards Unified Evaluation of Minute-Scale Audio-Visual Generation Across T2AV, I2AV, and V2AV
by: Liu, Tengfei, et al.
Published: (2026)

Think with Grounding: Curriculum Reinforced Reasoning with Video Grounding for Long Video Understanding
by: Chen, Houlun, et al.
Published: (2026)

Crab$^{+}$: A Scalable and Unified Audio-Visual Scene Understanding Model with Explicit Cooperation
by: Cai, Dongnuan, et al.
Published: (2026)

Multi-weather Cross-view Geo-localization Using Denoising Diffusion Models
by: Feng, Tongtong, et al.
Published: (2024)

UniScene: Unified Occupancy-centric Driving Scene Generation
by: Li, Bohan, et al.
Published: (2024)

MGNiceNet: Unified Monocular Geometric Scene Understanding
by: Schön, Markus, et al.
Published: (2024)

Unified Semantic Transformer for 3D Scene Understanding
by: Koch, Sebastian, et al.
Published: (2025)

AV2AV: Direct Audio-Visual Speech to Audio-Visual Speech Translation with Unified Audio-Visual Speech Representation
by: Choi, Jeongsoo, et al.
Published: (2023)

Ref-AVS: Refer and Segment Objects in Audio-Visual Scenes
by: Wang, Yaoting, et al.
Published: (2024)

The Shape of Sight: A Homological Framework for Unifying Visual Perception
by: Li, Xin
Published: (2018)

Hyper-Bagel: A Unified Acceleration Framework for Multimodal Understanding and Generation
by: Lu, Yanzuo, et al.
Published: (2025)

UniAV: Unified Audio-Visual Perception for Multi-Task Video Event Localization
by: Geng, Tiantian, et al.
Published: (2024)

HERMES++: Toward a Unified Driving World Model for 3D Scene Understanding and Generation
by: Zhou, Xin, et al.
Published: (2026)

A Unified Framework for Human-centric Point Cloud Video Understanding
by: Xu, Yiteng, et al.
Published: (2024)

UniModel: A Visual-Only Framework for Unified Multimodal Understanding and Generation
by: Zhang, Chi, et al.
Published: (2025)

HERMES: A Unified Self-Driving World Model for Simultaneous 3D Scene Understanding and Generation
by: Zhou, Xin, et al.
Published: (2025)

UMCFuse: A Unified Multiple Complex Scenes Infrared and Visible Image Fusion Framework
by: Li, Xilai, et al.
Published: (2024)

Unified 3D Scene Understanding Through Physical World Modeling
by: Lee, Wanhee, et al.
Published: (2026)

Emotion-Qwen: A Unified Framework for Emotion and Vision Understanding
by: Huang, Dawei, et al.
Published: (2025)

ImagiDrive: A Unified Imagination-and-Planning Framework for Autonomous Driving
by: Li, Jingyu, et al.
Published: (2025)

SceneFactory: A Workflow-centric and Unified Framework for Incremental Scene Modeling
by: Yuan, Yijun, et al.
Published: (2024)

PRISM: A Unified Framework for Photorealistic Reconstruction and Intrinsic Scene Modeling
by: Dirik, Alara, et al.
Published: (2025)

Towards Unified Surgical Scene Understanding: Bridging Reasoning and Grounding via MLLMs
by: Huang, Jincai, et al.
Published: (2026)

LightFusion: A Light-weighted, Double Fusion Framework for Unified Multimodal Understanding and Generation
by: Wang, Zeyu, et al.
Published: (2025)

MergeMix: A Unified Augmentation Paradigm for Visual and Multi-Modal Understanding
by: Jin, Xin, et al.
Published: (2025)

UniTalking: A Unified Audio-Video Framework for Talking Portrait Generation
by: Li, Hebeizi, et al.
Published: (2026)

UniDGF: A Unified Detection-to-Generation Framework for Hierarchical Object Visual Recognition
by: Nan, Xinyu, et al.
Published: (2025)

Unified Personalized Understanding, Generating and Editing
by: Zhong, Yu, et al.
Published: (2026)

Unified Reward Model for Multimodal Understanding and Generation
by: Wang, Yibin, et al.
Published: (2025)

Temporal2Seq: A Unified Framework for Temporal Video Understanding Tasks
by: Yang, Min, et al.
Published: (2024)

Human Motion Synthesis in 3D Scenes via Unified Scene Semantic Occupancy
by: Jingyu, Gong, et al.
Published: (2025)

TUMTraffic-VideoQA: A Benchmark for Unified Spatio-Temporal Video Understanding in Traffic Scenes
by: Zhou, Xingcheng, et al.
Published: (2025)

InstructAV2AV: Instruction-Guided Audio-Video Joint Editing
by: Zheng, Haojie, et al.
Published: (2026)

MME-Unify: A Comprehensive Benchmark for Unified Multimodal Understanding and Generation Models
by: Xie, Wulin, et al.
Published: (2025)

A Unified Diffusion Framework for Scene-aware Human Motion Estimation from Sparse Signals
by: Tang, Jiangnan, et al.
Published: (2024)

UniEval: Unified Holistic Evaluation for Unified Multimodal Understanding and Generation
by: Li, Yi, et al.
Published: (2025)

MammothModa2: A Unified AR-Diffusion Framework for Multimodal Understanding and Generation
by: Shen, Tao, et al.
Published: (2025)